version_upgrade_process
The following guide details the steps followed in upgrading the LLVM version supported by IR2Vec.

- Instructions for the same are present here.
- The git repo `IR2Vec-Version-Upgrade-Checks` has the required scripts to be run for this process. The repo is available here.
- Our relevant scripts and files are present in the folder `collect_ir`: `collect_ir/spec/get_ll_files_list.py`, `collect_ir/boost/get_ll_files_list.py`, and `collect_ir/spec/get_ll_spec.sh`.
- We use the C++ library Boost and the CPU SPEC source codes to generate training data. Download these source codes.
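
  A minimal download sketch, assuming a Boost source tarball whose URL is taken from the official Boost download page (the URL, directory, and archive name below are placeholders); CPU SPEC is licensed software and has to come from your own installation media.

  ```bash
  # Placeholder: copy the source-archive URL of the Boost release you want
  # from https://www.boost.org/users/download/
  BOOST_URL="<boost-source-tarball-url>"

  mkdir -p ~/ir2vec-training-sources && cd ~/ir2vec-training-sources
  wget "$BOOST_URL" -O boost.tar.gz
  tar -xzf boost.tar.gz

  # CPU SPEC cannot be downloaded freely; extract its source tree here from
  # your licensed SPEC installation instead.
  ```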
- Compile the relevant Boost `.c*` files with the relevant LLVM version (see the sketch below).
  - The folder has the script `get_boost_ll.py` for this purpose.
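
  As an illustration of what this step produces (a hedged sketch; all paths are placeholders), a single Boost translation unit can be lowered to textual LLVM IR with the clang of the target LLVM version:

  ```bash
  # Placeholders: point these at the clang from the LLVM version being
  # upgraded to and at the Boost source tree downloaded earlier.
  LLVM_BIN=/path/to/new-llvm/bin
  BOOST_ROOT=/path/to/boost

  # Emit a .ll file for one translation unit; get_boost_ll.py presumably
  # automates this over all the relevant .c* files.
  "$LLVM_BIN/clang++" -S -emit-llvm -I "$BOOST_ROOT" \
      "$BOOST_ROOT/libs/<some-library>/src/<some-file>.cpp" \
      -o some-file.ll
  ```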
- Compile the CPU SPEC `.c++` files with the relevant LLVM version.
  - For detailed instructions on this step, refer to here.
- Collect the paths to all these compiled `.ll` / `.o` files in a single place using the scripts `collect_ir/spec/get_ll_files_list.py` and `collect_ir/boost/get_ll_files_list.py`.
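
  For reference, the kind of list these scripts assemble can be approximated with `find`; the directory names and the output file name below are assumptions:

  ```bash
  # Assumed layout: the compiled IR for Boost and SPEC lives under these trees.
  find /path/to/boost_ll /path/to/spec_ll -name '*.ll' > ll_files_list.txt

  # Quick sanity check on how many IR files were picked up.
  wc -l ll_files_list.txt
  ```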
- Once we have compiled the list of all the `.ll` file paths, we go to the `seed_embeddings` folder in the main IR2Vec repository. Here, our process involves the following tasks:
  - Generating training triplets
  - Preprocessing the data
  - Training on the data and generating a final embedding file
  - Using the embedding file to generate the test oracle
  - Running the tests to verify the validity of the entire upgrade process
- Run the `triplets.sh` bash file with relevant changes to update the LLVM version. Instructions to run the same are available at `seed_embeddings/README.md`.
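
  A hypothetical invocation is shown below; the actual arguments and any edits to the script (LLVM paths, threads, output locations) are the ones documented in `seed_embeddings/README.md`, not fixed here.

  ```bash
  cd seed_embeddings

  # Edit the script / pass arguments as described in the README before running:
  # it consumes the collected .ll files and emits the training triplets.
  bash triplets.sh
  ```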
- In this same README.md file, we also have the instructions to run this next step.
- The relevant file is present in the `OpenKE` folder, at `IR2Vec/seed_embeddings/OpenKE/preprocess.py`.
- Once the file has been run, we should have a `preprocessed` folder. Inside this folder, we should have the relevant preprocessed data generated.
- Go ahead and create an empty `embeddings` folder here. This will be relevant for the next step.
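
  A hedged sketch of this step (whether `preprocess.py` takes command-line arguments is left to the README; none are assumed here):

  ```bash
  cd seed_embeddings/OpenKE

  # Run the preprocessing as per seed_embeddings/README.md; this should leave
  # the training-ready files inside the preprocessed/ folder.
  python3 preprocess.py

  # Empty folder that the training step will write its embeddings into.
  mkdir -p embeddings
  ```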
We have recently retrained the IR2Vec embeddings with a larger dataset, the ComPile dataset. This dataset is a collection of LLVM IR files from open-source projects, and is considerably larger than the dataset used in the original IR2Vec paper. Further details about the retraining process can be found here.

Once the train IDs, relations, and entities files are generated, we can use them as-is in the training process described before.
- The next file to run is the `generate_embeddings_ray.py` file in the `OpenKE` folder (see the setup sketch below).
  - Use the `openKE.yaml` file to create the conda environment.
  - This environment should have `Ray` and `tensorflow` installed.
  - Modify the `_ray.py` file with the relevant training hyperparameters.
  - Run the file using the command `python3 generate_embeddings_ray.py`. This will run the training, generate the best embedding file, and record the results.
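
  A hedged setup-and-run sketch; the environment name is whatever `openKE.yaml` declares, shown here as a placeholder:

  ```bash
  cd seed_embeddings/OpenKE

  # Create and activate the conda environment (Ray, tensorflow, etc.) from the
  # provided YAML file; replace <env-name> with the name given in openKE.yaml.
  conda env create -f openKE.yaml
  conda activate <env-name>

  # Edit the hyperparameters in the *_ray.py file first, then launch training;
  # this searches configurations with Ray, keeps the best embedding file
  # (presumably under the embeddings/ folder created earlier), and records results.
  python3 generate_embeddings_ray.py
  ```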
- Once we have generated the embeddings files, we have to use the best embedding file to update the `oracle` and get the `test_suite` working.
  - Copy the best embedding file into `seedEmbedding.txt` and move it to the `vocabulary` folder. Remove any prior files present there.
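
  A hedged sketch of promoting the chosen embedding file; the file name under `embeddings/` and the location of the `vocabulary` folder are assumptions:

  ```bash
  # From seed_embeddings/OpenKE; <best-embedding-file> is whichever file the
  # training run selected as best.
  cp embeddings/<best-embedding-file> seedEmbedding.txt

  # Assumed location of the vocabulary folder relative to this directory.
  VOCAB_DIR=../../vocabulary

  # Remove any prior vocabulary files and drop in the new one.
  rm "$VOCAB_DIR"/*
  mv seedEmbedding.txt "$VOCAB_DIR"/
  ```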
- For this, we now move to the `src` folder. This folder contains the `test-suite` folder as well.
- Here, two scripts are of importance: `generate_llfiles.sh` and `generateOracle.sh`. Run both of these files with the appropriate version of LLVM.
  - Modify `CMakeLists.txt` in the `test-suite` folder with the appropriate changes.
  - Similarly, modify the `sanity_check.sh.cmake` file with the appropriate paths for the vocabulary, LLVM version, etc.
- Go to the `build` folder. Regenerate its contents using the CMake call from the build process.
- Run `make check`. This should compile successfully.
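
  A hedged sketch; the CMake flags are the ones from the main IR2Vec build instructions (with the LLVM-related variables now pointing at the new LLVM install) and are not spelled out here:

  ```bash
  cd build

  # Re-run the configuration used for the original build, plus the usual
  # IR2Vec/LLVM cache variables pointing at the new LLVM installation.
  cmake ..

  # Rebuild and run the checks against the regenerated oracle and vocabulary.
  make check
  ```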
- At this point, most of the code works as expected. However, for complete online testing and evaluation, we need to ensure that the appropriate LLVM version is used throughout the code.
- For this, running the command `git grep` helps.
  - For example, if we are changing from `llvm16` to `llvm17`, we can run `git grep 16` and `git grep llvm16` to spot any required version changes in the code.
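
  The same checks with file names and line numbers shown, which makes leftover references easier to review:

  ```bash
  # List every place that still mentions the old version number or name.
  git grep -n "16"
  git grep -n "llvm16"
  ```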
- For the complete evaluation, we need to update the Docker image as well.
- The Docker images for running the GitHub tests are available here.
- The instructions to generate a new Docker image for the updated version are available here.
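
  A hedged rebuild-and-publish sketch; the registry, image name, tag, and Dockerfile location are placeholders, and the linked instructions are authoritative:

  ```bash
  # Build the updated CI image from the Dockerfile referenced above.
  docker build -t <registry>/<image-name>:llvm17 -f <path/to/Dockerfile> .

  # Push it so the GitHub workflow can pull the new tag.
  docker push <registry>/<image-name>:llvm17
  ```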
- Update `test.yml` in the GitHub workflows.
- Install `pre-commit` in a fresh conda environment.
- To test locally, run `pre-commit install`, followed by `pre-commit run --all-files`.
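
  For instance, assuming a fresh environment named `ir2vec-precommit` (the name and Python version are arbitrary choices):

  ```bash
  # Fresh environment just for the linting hooks.
  conda create -n ir2vec-precommit python=3.10 -y
  conda activate ir2vec-precommit
  pip install pre-commit

  # Install the git hooks and run every configured hook over the whole tree.
  pre-commit install
  pre-commit run --all-files
  ```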
Once all this is done, you should be able to push the commits without any test failures.