rename

2026-05-10 18:39:03 -07:00 · 2025-03-13 16:56:36 -05:00
parent b2439eee3e
commit 046e80cdd1
27 changed files with 0 additions and 0 deletions
@@ -0,0 +1,48 @@
+# Model Training
+
+PyLingual's accuracy is dependent on having accurate segmentation and statement models [^1]. The segmentation model divides a list of bytecode instructions into groups for each source instruction. The statement model transforms each group of instructions into source code. The instructions for training these models is as follows:
+
+## Dataset generation
+
+First install [pyenv](https://github.com/pyenv/pyenv) and the required Python versions for the dataset. Create a dataset JSON file based off the sample (`sample_jsons/py36-sample-data.json`).
+
+The dataset directory should be structured like so, with only one `.py` file per directory:
+
+```
+dataset
+├── 0
+│   └── file.py
+├── 1
+│   └── file.py
+...
+├── 999
+│   └── file.py
+└── 1000
+    └── file.py
+```
+
+The names of the inner directories and files do not matter. Then create the dataset:
+
+```
+python prepare_dataset.py <path to JSON>
+```
+
+## Segmentation model
+
+Create a segmentation model JSON file based off the sample (`sample_jsons/py36-sample-segmentation.json`). Then train the model:
+
+```
+python train_models.py --segmentation <path to JSON>
+```
+
+## Statement model
+
+Create a statement model JSON file based off the sample (`sample_jsons/py36-sample-statement.json`). Then train the model:
+
+```
+python train_models.py --statement <path to JSON>
+```
+
+Once models are trained, update `../pylingual/decompiler_config.yaml` or create a separate config file by replacing the old models with the newly trained ones.
+
+[^1]: [pylingual models](https://huggingface.co/syssec-utd).