
Commit 2e8cf9a

[perf] support padding-free training for VLMs (#61)
1 parent 2a0f95d commit 2e8cf9a


60 files changed (+1743, -1348 lines)

Dockerfile

Lines changed: 6 additions & 2 deletions
@@ -37,11 +37,15 @@ RUN pip config set global.index-url "${PIP_INDEX}" && \
     python -m pip install --upgrade pip

 # Install torch-2.5.1 + vllm-0.7.3
-RUN pip install --no-cache-dir vllm==0.7.3 torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 tensordict \
+RUN pip install --no-cache-dir vllm==0.7.3 torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 tensordict torchdata \
     transformers>=4.49.0 accelerate datasets peft \
-    ray codetiming hydra-core pandas pyarrow>=15.0.0 pylatexenc qwen-vl-utils
+    ray codetiming hydra-core pandas pyarrow>=15.0.0 pylatexenc qwen-vl-utils wandb liger-kernel \

 # Install flash_attn-2.7.4.post1
 RUN pip uninstall -y transformer-engine flash-attn && \
     wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && \
     pip install --no-cache-dir flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+
+# Fix cv2
+RUN pip uninstall -y pynvml nvidia-ml-py && \
+    pip install nvidia-ml-py>=12.560.30 opencv-python-headless==4.11.0.86 fastapi==0.115.6
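A quick way to check the dependencies this commit adds (torchdata, wandb, liger-kernel, and the pynvml/opencv fix) is to import them inside the built image. A minimal sketch, not part of the commit; the import names below are the conventional ones for these packages, not taken from the diff:

```bash
# Sanity-check the packages added in this commit inside the built image.
python3 - <<'EOF'
import cv2           # from opencv-python-headless==4.11.0.86
import pynvml        # module provided by nvidia-ml-py after the uninstall/reinstall
import torchdata     # added to the torch/vllm install line
import wandb         # added for experiment logging
import liger_kernel  # conventional import name for the liger-kernel package
print("cv2:", cv2.__version__)
EOF
```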

README.md

Lines changed: 9 additions & 7 deletions
@@ -29,6 +29,12 @@ EasyR1 is efficient and scalable due to the design of **[HybirdEngine](https://a

 We provide a [Dockerfile](./Dockerfile) to easily build environments.

+Use [pre-built docker image](https://hub.docker.com/r/hiyouga/verl):
+
+```bash
+docker pull hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.3-rc1
+```
+
 ### Hardware Requirements

 \* *estimated*
@@ -71,30 +77,26 @@ python3 scripts/model_merger.py --local_dir path_to_your_last_actor_checkpoint

 ## Custom Dataset

-The dataset should strictly follow the example data format.
+Please refer to the example datasets to prepare your own dataset.

 - Text dataset: https://huggingface.co/datasets/hiyouga/math12k
-  - Required columns: problem, answer
-
 - Vision-text dataset: https://huggingface.co/datasets/hiyouga/geometry3k
-  - Required columns: images, problem, answer

 ## Other Baselines

-- [CLEVR-70k-Counting](examples/run_qwen2_5_vl_2b_clevr.sh): Train the Qwen2.5-VL-3B-Instruct model on counting problem.
+- [CLEVR-70k-Counting](examples/run_qwen2_5_vl_3b_clevr.sh): Train the Qwen2.5-VL-3B-Instruct model on counting problem.

 ## TODO

 - Support PPO, Reinforce++ and RLOO for VLMs.
-- Support padding-free training for VLMs.
 - Support ulysses parallelism for VLMs.
 - Support more VLM architectures.

 ### Known bugs

 These features are temporarily disabled for now, we plan to fix them one-by-one in the future updates.

-- Vision language models are not compatible with padding-free training and ulysses parallelism yet.
+- Vision language models are not compatible with ulysses parallelism yet.
 - Vision language models are not compatible with `enable_chunked_prefill` unless [vLLM v1](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) is supported.

 ## Discussion Group
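Padding-free training is the headline of this commit, and the example configs enable it through the new `worker.actor.padding_free` key (see the YAML diff below). A hedged sketch of toggling it from the command line, reusing the dotted-override style the example scripts already use:

```bash
# Sketch: disable the new padding-free path via a CLI override
# (key name taken from examples/grpo_example.yaml in this commit).
python3 -m verl.trainer.main \
    config=examples/grpo_example.yaml \
    worker.actor.padding_free=false
```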

examples/grpo_example.yaml

Lines changed: 15 additions & 10 deletions
@@ -2,6 +2,8 @@ data:
   train_files: hiyouga/math12k@train
   val_files: hiyouga/math12k@test
   prompt_key: problem
+  answer_key: answer
+  image_key: images
   max_prompt_length: 1024
   max_response_length: 1024
   rollout_batch_size: 512
@@ -17,36 +19,38 @@ algorithm:
 worker:
   actor:
     global_batch_size: 128
-    micro_batch_size_per_device_for_update: 1
-    micro_batch_size_per_device_for_experience: 2
+    micro_batch_size_per_device_for_update: 4
+    micro_batch_size_per_device_for_experience: 16
     max_grad_norm: 1.0
     use_kl_loss: true
     kl_loss_coef: 1.0e-3
     kl_loss_type: low_var_kl
+    padding_free: true
+    ulysses_sequence_parallel_size: 1
     model:
       model_path: Qwen/Qwen2.5-7B-Instruct
       enable_gradient_checkpointing: true
     optim:
       lr: 1.0e-6
       weight_decay: 1.0e-2
     fsdp:
-      param_offload: false
-      optimizer_offload: false
-      torch_dtype: null
+      enable_full_shard: true
+      enable_cpu_offload: false
+      enable_rank0_init: false
     offload:
-      param_offload: true
-      optimizer_offload: true
+      offload_params: false
+      offload_optimizer: false

   rollout:
     temperature: 1.0
     tensor_parallel_size: 2
-    gpu_memory_utilization: 0.6
+    gpu_memory_utilization: 0.5
     n: 5
     enable_chunked_prefill: true

   ref:
     offload:
-      param_offload: true
+      offload_params: true

   reward:
     reward_type: function
@@ -60,7 +64,8 @@ trainer:
   n_gpus_per_node: 8
   nnodes: 1
   save_freq: 5
-  test_freq: 5
+  val_freq: 5
   val_before_train: true
   val_only: false
+  val_generations_to_log_to_wandb: 1
   save_checkpoint_path: null
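Note the key renames in this diff: the `fsdp` section drops `param_offload`/`optimizer_offload`/`torch_dtype` in favor of `enable_full_shard`/`enable_cpu_offload`/`enable_rank0_init`, the `offload` section now uses `offload_params`/`offload_optimizer`, and `trainer.test_freq` becomes `trainer.val_freq`. A hedged sketch of CLI overrides spelled with the new names:

```bash
# Sketch: overrides using the renamed keys; the old names (param_offload,
# optimizer_offload, test_freq) no longer exist after this commit.
python3 -m verl.trainer.main \
    config=examples/grpo_example.yaml \
    worker.actor.offload.offload_params=true \
    worker.actor.offload.offload_optimizer=true \
    trainer.val_freq=10
```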

examples/remax_example.yaml

Lines changed: 16 additions & 11 deletions
@@ -2,6 +2,8 @@ data:
   train_files: hiyouga/math12k@train
   val_files: hiyouga/math12k@test
   prompt_key: problem
+  answer_key: answer
+  image_key: images
   max_prompt_length: 1024
   max_response_length: 1024
   rollout_batch_size: 512
@@ -17,36 +19,38 @@ algorithm:
 worker:
   actor:
     global_batch_size: 128
-    micro_batch_size_per_device_for_update: 1
-    micro_batch_size_per_device_for_experience: 2
+    micro_batch_size_per_device_for_update: 4
+    micro_batch_size_per_device_for_experience: 16
     max_grad_norm: 1.0
     use_kl_loss: true
     kl_loss_coef: 1.0e-3
     kl_loss_type: low_var_kl
+    padding_free: true
+    ulysses_sequence_parallel_size: 1
     model:
       model_path: Qwen/Qwen2.5-7B-Instruct
       enable_gradient_checkpointing: true
     optim:
       lr: 1.0e-6
       weight_decay: 1.0e-2
     fsdp:
-      param_offload: false
-      optimizer_offload: false
-      torch_dtype: null
+      enable_full_shard: true
+      enable_cpu_offload: false
+      enable_rank0_init: false
     offload:
-      param_offload: true
-      optimizer_offload: true
+      offload_params: false
+      offload_optimizer: false

   rollout:
     temperature: 1.0
     tensor_parallel_size: 2
-    gpu_memory_utilization: 0.6
+    gpu_memory_utilization: 0.5
     n: 5
     enable_chunked_prefill: true

   ref:
     offload:
-      param_offload: true
+      offload_params: true

   reward:
     reward_type: function
@@ -56,11 +60,12 @@ trainer:
   total_episodes: 15
   logger: ["console", "wandb"]
   project_name: easy_r1
-  experiment_name: qwen2_5_7b_remax_math
+  experiment_name: qwen2_5_7b_math
   n_gpus_per_node: 8
   nnodes: 1
   save_freq: 5
-  test_freq: 5
+  val_freq: 5
   val_before_train: true
   val_only: false
+  val_generations_to_log_to_wandb: 1
   save_checkpoint_path: null

examples/run_qwen2_5_7b_math.sh

Lines changed: 4 additions & 0 deletions
@@ -4,7 +4,11 @@ export VLLM_ATTENTION_BACKEND=XFORMERS

 MODEL_PATH=Qwen/Qwen2.5-7B-Instruct  # replace it with your local file path

+SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
+The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
+
 python3 -m verl.trainer.main \
     config=examples/grpo_example.yaml \
+    data.system_prompt="${SYSTEM_PROMPT}" \
     worker.actor.model.model_path=${MODEL_PATH} \
     trainer.n_gpus_per_node=4
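One quoting detail worth checking before launching: in bash, the `"""…"""` form is just two empty strings concatenated around a double-quoted body, and `\b` inside double quotes stays a literal backslash, so `\boxed{}` reaches the trainer intact. A quick check, not part of the commit:

```bash
# Verify the system prompt survives shell quoting before launching training.
SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
echo "${SYSTEM_PROMPT}"  # should print \boxed{} with the backslash intact
```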

examples/run_qwen2_5_7b_math_swanlab.sh

Lines changed: 4 additions & 0 deletions
@@ -4,8 +4,12 @@ export VLLM_ATTENTION_BACKEND=XFORMERS

 MODEL_PATH=Qwen/Qwen2.5-7B-Instruct  # replace it with your local file path

+SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
+The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
+
 python3 -m verl.trainer.main \
     config=examples/grpo_example.yaml \
+    data.system_prompt="${SYSTEM_PROMPT}" \
     worker.actor.model.model_path=${MODEL_PATH} \
     trainer.logger=['console','swanlab'] \
     trainer.n_gpus_per_node=4
examples/run_qwen2_5_vl_3b_clevr.sh

File renamed without changes.

examples/run_qwen2_5_vl_3b_geo.sh

Lines changed: 4 additions & 0 deletions
@@ -4,10 +4,14 @@ export VLLM_ATTENTION_BACKEND=XFORMERS

 MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct  # replace it with your local file path

+SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
+The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
+
 python3 -m verl.trainer.main \
     config=examples/grpo_example.yaml \
     data.train_files=hiyouga/geometry3k@train \
     data.val_files=hiyouga/geometry3k@test \
+    data.system_prompt="${SYSTEM_PROMPT}" \
     worker.actor.model.model_path=${MODEL_PATH} \
     worker.rollout.tensor_parallel_size=1 \
     worker.rollout.enable_chunked_prefill=false \

examples/run_qwen2_5_vl_7b_geo.sh

Lines changed: 5 additions & 1 deletion
@@ -4,11 +4,15 @@ export VLLM_ATTENTION_BACKEND=XFORMERS

 MODEL_PATH=Qwen/Qwen2.5-VL-7B-Instruct  # replace it with your local file path

+SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
+The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
+
 python3 -m verl.trainer.main \
     config=examples/grpo_example.yaml \
     data.train_files=hiyouga/geometry3k@train \
     data.val_files=hiyouga/geometry3k@test \
+    data.system_prompt="${SYSTEM_PROMPT}" \
     worker.actor.model.model_path=${MODEL_PATH} \
     worker.rollout.enable_chunked_prefill=false \
     trainer.experiment_name=qwen2_5_vl_7b_geo \
-    trainer.n_gpus_per_node=4
+    trainer.n_gpus_per_node=8

examples/run_qwen2_5_vl_7b_geo_swanlab.sh

Lines changed: 5 additions & 1 deletion
@@ -4,12 +4,16 @@ export VLLM_ATTENTION_BACKEND=XFORMERS

 MODEL_PATH=Qwen/Qwen2.5-VL-7B-Instruct  # replace it with your local file path

+SYSTEM_PROMPT="""You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
+The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."""
+
 python3 -m verl.trainer.main \
     config=examples/grpo_example.yaml \
     data.train_files=hiyouga/geometry3k@train \
     data.val_files=hiyouga/geometry3k@test \
+    data.system_prompt="${SYSTEM_PROMPT}" \
     worker.actor.model.model_path=${MODEL_PATH} \
     worker.rollout.enable_chunked_prefill=false \
     trainer.experiment_name=qwen2_5_vl_7b_geo \
     trainer.logger=['console','swanlab'] \
-    trainer.n_gpus_per_node=4
+    trainer.n_gpus_per_node=8
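Both 7B VL scripts now assume a full 8-GPU node. A hedged fallback for a 4-GPU machine is to override the trainer setting and shrink the micro-batch sizes this commit raised to 4/16; the flag names come from the scripts and YAML above, but this configuration is not validated by the commit:

```bash
# Sketch: run the 7B geometry example on 4 GPUs instead of 8 (untested here).
python3 -m verl.trainer.main \
    config=examples/grpo_example.yaml \
    data.train_files=hiyouga/geometry3k@train \
    data.val_files=hiyouga/geometry3k@test \
    worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=2 \
    worker.rollout.enable_chunked_prefill=false \
    trainer.n_gpus_per_node=4
```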
