# SPIN experiments on the DIBT 10k ranked prompts
This repository contains the instructions to run SPIN on a subset of the DIBT/10k_prompts_ranked dataset: the records with `avg_rating>=4` and `num_responses>1`, a total of 1832 records (which are then split into 1648 for training and 184 for testing).
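For illustration, the following is a minimal sketch of how such a subset could be built with the `datasets` library; the filter follows the description above, while the 90/10 split and the seed are assumptions rather than the exact procedure used:

```python
# Minimal sketch: reproduce the filtered subset described above.
# The 90/10 split and the seed are assumptions, not the exact procedure used.
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")

# Keep only highly rated prompts that received more than one response.
ds = ds.filter(lambda row: row["avg_rating"] >= 4 and row["num_responses"] > 1)

# Roughly 90/10 train/test split.
splits = ds.train_test_split(test_size=0.1, seed=42)
print(splits)  # ~1648 train / ~184 test records
```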
It contains references to all the scripts used to generate the datasets, the configuration files used for training, and the setup used to run the model. The dataset generation was done with `distilabel==0.6.0`.
SPIN needs a specific format for the training data, where the "real" responses act as the reference the model should improve towards. As the dataset only contains prompts, we generated these responses using mistral-large. The different iterations of the "generated" datasets were created with distilabel and vllm, using 2 A100 GPUs (just for speed; it should work with less compute, adjusting the `--cuda-devices` and `--batch-size` arguments accordingly).
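For reference, this is roughly what a single training record looks like in that format: the same prompt paired with the reference ("real") completion and the current model's ("generated") completion. The field names below follow the SPIN repository's convention and are an assumption; check the published datasets for the exact schema.

```python
# Sketch of one SPIN training record (field names are assumptions based on
# the SPIN repository; check the published datasets for the exact schema).
example = {
    "real": [
        {"role": "user", "content": "How does the SPIN training loop work?"},
        {"role": "assistant", "content": "Reference answer generated by mistral-large..."},
    ],
    "generated": [
        {"role": "user", "content": "How does the SPIN training loop work?"},
        {"role": "assistant", "content": "Answer generated by the model being fine-tuned..."},
    ],
}
```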
This work shows the huge benefit of collecting high-quality prompts for LLM fine-tuning. If you want to support the OSS community with larger datasets, contribute to the Prompt Collective initiative.
Initially, we create the reference dataset, with the "real" responses generated by mistral-large, using the following script:
- `generate_reference_spin.py`: Script to generate the reference responses using mistral-large; see the hypothetical invocation sketched below.
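This README does not show a call for that script; a hypothetical invocation, mirroring the argument style of the scripts below (the exact flags are assumptions, check the script itself):

```bash
# Hypothetical invocation; the flags are assumptions based on the other scripts.
python generate_reference_spin.py \
    --hf-apikey $HF_API_TOKEN \
    --source-dataset "DIBT/10k_prompts_ranked" \
    --new-dataset "argilla/10k_prompts_ranked_with_responses"
```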
The following are the steps to prepare the training data for SPIN, and the resulting datasets:
## SPIN iter 0
- `generate_iter_spin.py`: Script to generate the initial "generated" responses from the SFT model that will then be fine-tuned.
  Dataset: argilla/10k_prompts_ranked_sft_zephyr
Run the following:
```bash
python generate_iter_spin.py \
    --hf-apikey $HF_API_TOKEN \
    --source-dataset "DIBT/10k_prompts_ranked" \
    --new-dataset "argilla/10k_prompts_ranked_sft_zephyr" \
    --model-name "alignment-handbook/zephyr-7b-sft-full" \
    --batch-size 128 \
    --cuda-devices "0,1"
```
- `prepare_for_training.py`: Generates the dataset that will be directly ingested by the `SPINTrainer`.
  Dataset: argilla/10k_prompts_top_SPIN_iter0
Run the following Python script:
```bash
python prepare_for_training.py \
    --portion top \
    --target-dataset argilla/10k_prompts_SPIN_iter0_zephyr_top
```
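Before training, the resulting dataset can be loaded and inspected with `datasets` as a quick sanity check (a small sketch; the "real"/"generated" columns are assumptions following the format described above):

```python
# Sketch: load the iter0 training dataset and inspect one record.
# The "real"/"generated" columns are assumptions based on the SPIN format.
from datasets import load_dataset

spin_iter0 = load_dataset("argilla/10k_prompts_SPIN_iter0_zephyr_top")
print(spin_iter0)               # splits and sizes
print(spin_iter0["train"][0])   # one record pairing "real" and "generated" responses
```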
 
## SPIN iter 1
- `generate_iter_spin.py`: Regenerates the "generated" responses using the model from the previous iteration:
```bash
python generate_iter_spin.py \
    --hf-apikey $HF_API_TOKEN \
    --source-dataset "argilla/10k_prompts_SPIN_iter0_zephyr_top" \
    --new-dataset "argilla/10k_prompts_SPIN_iter1_zephyr_top_generated" \
    --model-name "plaguss/zephyr-7b-spin-iter0-v0" \
    --batch-size 128 \
    --cuda-devices "0,1"
```
- `transform_iter_generated.py`: Transforms the generated responses into the format expected by the SPIN trainer:
```bash
python transform_iter_generated.py \
    --real-dataset "argilla/10k_prompts_ranked_with_responses" \
    --generated-dataset "argilla/10k_prompts_SPIN_iter1_zephyr_top_generated" \
    --new-dataset "argilla/10k_prompts_SPIN_iter1_zephyr_top"
```
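Conceptually, the transformation pairs the reference ("real") response with the newly generated one for each prompt. A rough sketch of that pairing (a hypothetical helper, not the actual script; field names are assumptions):

```python
# Hypothetical helper sketching the pairing done by transform_iter_generated.py;
# it is not the actual implementation and the field names are assumptions.
def build_spin_record(prompt: str, real_response: str, generated_response: str) -> dict:
    """Pair the reference response with the model-generated one for a single prompt."""
    return {
        "real": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": real_response},
        ],
        "generated": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": generated_response},
        ],
    }
```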
 
## SPIN iter 2
- `generate_iter_spin.py`: Regenerates the "generated" responses using the model from the previous iteration:
```bash
python generate_iter_spin.py \
    --hf-apikey $HF_API_TOKEN \
    --source-dataset "argilla/10k_prompts_SPIN_iter0_zephyr_top" \
    --new-dataset "argilla/10k_prompts_SPIN_iter2_zephyr_top_generated" \
    --model-name "plaguss/zephyr-7b-spin-iter1-v0" \
    --batch-size 128 \
    --cuda-devices "0,1"
```
- `transform_iter_generated.py`: Transforms the generated responses into the format expected by the SPIN trainer:
```bash
python transform_iter_generated.py \
    --real-dataset "argilla/10k_prompts_ranked_with_responses" \
    --generated-dataset "argilla/10k_prompts_SPIN_iter2_zephyr_top_generated" \
    --new-dataset "argilla/10k_prompts_SPIN_iter2_zephyr_top"
```
 
## SPIN iter 3
- `generate_iter_spin.py`: Regenerates the "generated" responses using the model from the previous iteration:
```bash
python generate_iter_spin.py \
    --hf-apikey $HF_API_TOKEN \
    --source-dataset "argilla/10k_prompts_SPIN_iter0_zephyr_top" \
    --new-dataset "argilla/10k_prompts_SPIN_iter3_zephyr_top_generated" \
    --model-name "plaguss/zephyr-7b-spin-iter2-v0" \
    --batch-size 128 \
    --cuda-devices "0,1"
```
- `transform_iter_generated.py`: Transforms the generated responses into the format expected by the SPIN trainer:
```bash
python transform_iter_generated.py \
    --real-dataset "argilla/10k_prompts_ranked_with_responses" \
    --generated-dataset "argilla/10k_prompts_SPIN_iter3_zephyr_top_generated" \
    --new-dataset "argilla/10k_prompts_SPIN_iter3_zephyr_top"
```
 
The following steps are almost a direct copy of those in the SPIN repository; take a look there for more information.
We used Runpod with the following setup:
- 4x A100 80GB.
- 500GB container and volume.
- Base image with CUDA 12.1.
 
These are the steps outlined in the SPIN repo; you can run them all at once with the script in `scripts/setup.sh`:
```bash
pip install torch==2.1.1 --index-url https://download.pytorch.org/whl/cu121
```

Clone and install the repo from source:

```bash
git clone https://github.com/uclaml/SPIN.git && cd SPIN
```

Install the package and flash-attn:

```bash
python -m pip install .
python -m pip install flash-attn==2.5.3 --no-build-isolation
```

Log in to Hugging Face:

```bash
huggingface-cli login --token $HF_API_TOKEN
```

Log in to wandb:

```bash
pip install wandb
wandb login $WANDB_TOKEN
```

And update the WANDB variables to keep track of the experiments:

```bash
export WANDB_ENTITY="argilla-io"
export WANDB_PROJECT="dibt-spin-zephyr"
export WANDB_NAME="zephyr-7b-spin-iter0-v0"
```

After the previous steps, replace the config file for the model you want to run and the `finetune.sh` script, and start the training process:

```bash
bash scripts/finetune.sh
```
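For orientation, `finetune.sh` in the SPIN repository essentially wraps an `accelerate launch` of the training entry point with a DeepSpeed config and the model/data YAML. A simplified sketch of what such a launch looks like; the paths, flags, and process count are assumptions, so check the SPIN repository for the exact script:

```bash
# Simplified sketch of the launch performed by finetune.sh
# (paths, flags and process count are assumptions; see the SPIN repo).
accelerate launch \
    --config_file configs/deepspeed_zero3.yaml \
    --num_processes 4 \
    spin/run_spin.py configs/config.yaml \
    --output_dir="outputs/zephyr-7b-spin-iter0-v0"
```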