3 changes: 1 addition & 2 deletions comps/finetuning/src/Dockerfile.xtune
@@ -34,8 +34,7 @@ ENV PATH=$PATH:/home/user/.local/bin
RUN cd /home/user/comps/finetuning/src/integrations/xtune && git config --global user.name "test" && git config --global user.email "test" && bash prepare_xtune.sh

RUN python -m pip install --upgrade pip setuptools peft && \
python -m pip install -r /home/user/comps/finetuning/src/requirements.txt && \
python -m pip install --no-deps transformers==4.45.0 datasets==2.21.0 accelerate==0.34.2 peft==0.12.0
python -m pip install -r /home/user/comps/finetuning/src/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

143 changes: 133 additions & 10 deletions comps/finetuning/src/integrations/xtune/README.md
@@ -36,26 +36,21 @@ Run install_xtune.sh to prepare component.
conda create -n xtune python=3.10 -y
conda activate xtune
apt install -y rsync
# opens the web UI by default
bash prepare_xtune.sh
# run this instead to skip opening the web UI
# bash prepare_xtune.sh false
```

The commands below are already included in prepare_xtune.sh. You can skip them unless you want to update the libraries manually.

```bash
pip install -r requirements.txt
# If you want to run on an NVIDIA GPU:
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
# Otherwise, to run on Intel Arc A770:
# Refer to https://github.com/intel/intel-extension-for-pytorch for the latest command to update these libraries
python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

cd src/llamafactory/clip_finetune/dassl
python setup.py develop
cd ../../../..
pip install matplotlib
pip install -e ".[metrics]"
pip install --no-deps transformers==4.45.0 datasets==2.21.0 accelerate==0.34.2 peft==0.12.0
python -m pip install intel-extension-for-pytorch==2.5.10+xpu oneccl_bind_pt==2.5.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

### 2. Install xtune on docker
@@ -107,6 +102,13 @@ then make `dataset_info.json` in your dataset directory

## Fine-Tuning with LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))

When run via `prepare_xtune.sh`, it automatically executes `ZE_AFFINITY_MASK=0 llamafactory-cli webui`.

Once you see "server start successfully" in the terminal, you can access the web UI at http://localhost:7860/.

After running `prepare_xtune.sh`, the UI components are documented in doc/ui_component.md.

```bash
# Run with A100:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webui
# Run with A770:
ZE_AFFINITY_MASK=0 llamafactory-cli webui
```

Then access the web UI at http://localhost:7860/.

## Fine-Tuning with Shell instead of GUI

Running `prepare_xtune.sh` downloads all related files and opens the web UI by default.

Run `bash prepare_xtune.sh false` to skip launching the web UI; you can then run fine-tuning from the shell.

Below are examples.

### CLIP

Please see the [doc](./doc/key_features_for_clip_finetune_tool.md) for how to configure its features.

```bash
cd src/llamafactory/clip_finetune
# Please see README.md in src/llamafactory/clip_finetune for details
```

### AdaCLIP

```bash
cd src/llamafactory/adaclip_finetune
# Please see README.md in src/llamafactory/adaclip_finetune for details
```

### DeepSeek-R1 Distillation (not a core feature)

Please see [doc](./doc/DeepSeek-R1_distillation_best_practice-v1.1.pdf) for details

#### Step 1: Download an existing CoT synthetic dataset from Hugging Face

Dataset link: https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B

#### Step 2: Convert to sharegpt format

```bash
cd data
```

Then run the following Python snippet:

```python
import json

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B")
dataset = dataset["train"]

# Filter the dataset; change the filter conditions according to your needs
dataset = dataset.filter(lambda example: len(example["response"]) <= 1024)

# Save in sharegpt format
with open("Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B-response1024.json", "w") as f:
    json.dump(list(dataset), f, ensure_ascii=False, indent=4)
```
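As a sanity check, the filter-and-serialize logic above can be mirrored on plain Python dicts; the miniature records below are hypothetical stand-ins for the real dataset rows, but the predicate and serialization shape are the same:

```python
import json

# Hypothetical miniature of the dataset: only the fields used above matter
records = [
    {"conversations": [{"from": "human", "value": "hi"}], "response": "short"},
    {"conversations": [{"from": "human", "value": "hi"}], "response": "x" * 2000},
]

# Same predicate as dataset.filter(...) above: keep responses of at most 1024 chars
kept = [r for r in records if len(r["response"]) <= 1024]

# Same serialization shape as json.dump(list(dataset), ...) above
serialized = json.dumps(kept, ensure_ascii=False, indent=4)
```

Only the first record survives the filter, so the output file would contain a JSON array with a single conversation.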

#### Step 3: Register the CoT dataset in LLaMA-Factory's `dataset_info.json`

```bash
cd data
vim dataset_info.json
```

Add the following entry (make sure the dataset file is placed under `xtune/data`):

```json
"deepseek-r1-distill-sample": {
  "file_name": "Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B-response1024.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations"
  }
}
```

#### Step 4: Run `accelerate config` to enable training with the XPU plugin

```text
accelerate config

For Single GPU:
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:NO
Do you want to use XPU plugin to speed up training on XPU? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]:
Do you wish to use mixed precision?
bf16
For Multi-GPU with FSDP:
Which type of machine are you using?
multi-XPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: NO
Do you want to use XPU plugin to speed up training on XPU? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: yes
What should be your sharding strategy?
FULL_SHARD
Do you want to offload parameters and gradients to CPU? [yes/NO]: NO
What should be your auto wrap policy?
TRANSFORMER_BASED_WRAP
Do you want to use the model's `_no_split_modules` to wrap. Only applicable for Transformers [yes/NO]: yes
What should be your FSDP's backward prefetch policy?
BACKWARD_PRE
What should be your FSDP's state dict type?
SHARDED_STATE_DICT
Do you want to enable FSDP's forward prefetch policy? [yes/NO]: yes
Do you want to enable FSDP's `use_orig_params` feature? [YES/no]: yes
Do you want to enable CPU RAM efficient model loading? Only applicable for Transformers models. [YES/no]: yes
Do you want to enable FSDP activation checkpointing? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:2
Do you wish to use mixed precision?
bf16
```
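For reference, the single-GPU answers above end up in `~/.cache/huggingface/accelerate/default_config.yaml`. A rough sketch of the resulting file is shown below; the exact keys vary by accelerate version, so treat this as an assumption rather than the authoritative output:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
gpu_ids: all
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 1
use_cpu: false
```

If the generated file differs, prefer what `accelerate config` actually wrote; this sketch is only for orientation.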

#### Step 5: Run the training script as follows

```bash
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
MODEL_ID="microsoft/Phi-3-mini-4k-instruct"
EXP_NAME="Phi-3-mini-4k-instruct-r1-distill-finetuned"
DATASET_NAME="deepseek-r1-distill-sample"
export OUTPUT_DIR="where to put output"
accelerate launch src/train.py \
  --stage sft --do_train --use_fast_tokenizer \
  --new_special_tokens "<think>,</think>" --resize_vocab \
  --flash_attn auto \
  --model_name_or_path ${MODEL_ID} \
  --dataset ${DATASET_NAME} \
  --template phi \
  --finetuning_type lora --lora_rank 8 --lora_alpha 16 \
  --lora_target q_proj,v_proj,k_proj,o_proj \
  --additional_target lm_head,embed_tokens \
  --output_dir $OUTPUT_DIR --overwrite_cache --overwrite_output_dir \
  --warmup_steps 100 --weight_decay 0.1 \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 4 \
  --ddp_timeout 9000 \
  --learning_rate 5e-6 --lr_scheduler_type cosine \
  --logging_steps 1 --save_steps 1000 --plot_loss \
  --num_train_epochs 3 --torch_empty_cache_steps 10 --bf16
```
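The batch-size flags above combine into an effective global batch size of `per_device_train_batch_size × gradient_accumulation_steps × num_processes`; a quick check of the arithmetic:

```python
# Values taken from the accelerate launch command above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4

def effective_batch_size(num_processes: int) -> int:
    # Global batch size seen by the optimizer per update step
    return per_device_train_batch_size * gradient_accumulation_steps * num_processes

single_gpu = effective_batch_size(1)  # ONEAPI_DEVICE_SELECTOR pins one device
dual_gpu = effective_batch_size(2)    # the multi-XPU FSDP config above uses 2 GPUs
```

So the single-device run updates on 4 samples per step, and the two-GPU FSDP run on 8; scale `gradient_accumulation_steps` down if you raise the device count and want the same global batch size.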

## `Xtune` Examples

See screenshot of running CLIP and AdaCLIP finetune on Intel Arc A770 in README_XTUNE.md.