# Model Training PyLingual's accuracy is dependent on having accurate segmentation and statement models [^1]. The segmentation model divides a list of bytecode instructions into groups for each source instruction. The statement model transforms each group of instructions into source code. The instructions for training these models is as follows: ## Dataset generation First install [pyenv](https://github.com/pyenv/pyenv) and the required Python versions for the dataset. Create a dataset JSON file based off the sample (`sample_jsons/py36-sample-data.json`). The dataset directory should be structured like so, with only one `.py` file per directory: ``` dataset ├── 0 │   └── file.py ├── 1 │   └── file.py ... ├── 999 │   └── file.py └── 1000 └── file.py ``` The names of the inner directories and files do not matter. Then create the dataset: ``` python prepare_dataset.py ``` ## Segmentation model Create a segmentation model JSON file based off the sample (`sample_jsons/py36-sample-segmentation.json`). Then train the model: ``` python train_models.py --segmentation ``` ## Statement model Create a statement model JSON file based off the sample (`sample_jsons/py36-sample-statement.json`). Then train the model: ``` python train_models.py --statement ``` Once models are trained, update `../pylingual/decompiler_config.yaml` or create a separate config file by replacing the old models with the newly trained ones. [^1]: [pylingual models](https://huggingface.co/syssec-utd).