
Commit 3abcc09

KAMiPan and wuxibin89 authored
[sglang, recipe] feat: add SGLang as rollout engine for one-step-off-policy (#3531)
### What does this PR do?

This PR extends the one-step-off-policy recipe by adding SGLang as an alternative rollout engine to vLLM, allowing flexible backend selection and improving training efficiency.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: #3460
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

To validate this solution, we adopted the existing experimental configuration from the recipe one-step-off-policy. The evaluation demonstrates that the proposed SGLang rollout engine integration achieves effective acceleration in one-step-off-policy asynchronous training, providing users with enhanced rollout engine options for diverse deployment scenarios.

**Experimental Results**

- **Machine Configuration**: 2 nodes with 16 H20 GPUs each
  - Generation: 4 GPUs
  - Training: 12 GPUs
- **Model**: Qwen2.5-Math-7B
- **Max Response Length**: 8,192 tokens
- **Algorithm**: DAPO
- **Rollout Engine**: vLLM, SGLang

| training mode          | engine       | step | gen | wait_prev_gen | generate_sequences | old_log_prob | update_actor | total time    | acc/best@32/mean | acc/maj@32/mean |
|------------------------|--------------|------|-----|---------------|--------------------|--------------|--------------|---------------|------------------|-----------------|
| colocate sync          | SGLang+FSDP2 | 452  | 131 | -             | 125                | 54           | 199          | 12h25m        | 0.6560           | 0.4471          |
| one-step-overlap async | SGLang+FSDP2 | 406  | -   | 12            | 305                | 58           | 245          | 11h12m (+11%) | 0.6303           | 0.4443          |

* colocate sync: step ≈ gen + old_log_prob + update_actor
* one-step-overlap async: step ≈ max(wait_prev_gen + generate_sequences, old_log_prob + update_actor)

<img width="1218" height="777" alt="image" src="https://github.com/user-attachments/assets/58734164-2534-492f-bf00-1e80faae0fe7" />

### API and Usage Example

**Configuration Example**

```bash
# Using SGLang engine
python3 -m recipe.one_step_off_policy.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    # ... other configuration parameters

# Using vLLM engine
python3 -m recipe.one_step_off_policy.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    # ... other configuration parameters
```

**Script Usage**

```bash
# Using SGLang engine
bash dapo_7b_math_fsdp2_sglang_4_12.sh
bash dapo_7b_math_fsdp2_sglang_colocate.sh

# Using vLLM engine
bash dapo_7b_math_fsdp2_4_12.sh
bash dapo_7b_math_fsdp2_colocate.sh
```

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: wuxibin <[email protected]>
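The two step-time approximations listed under the results table can be checked against the reported numbers. The following is a minimal sketch, not part of the PR: the variable names are made up for illustration, the values are copied from the SGLang+FSDP2 rows, and the unit (presumably seconds per step) is an assumption because the description does not state it.

```shell
# Illustrative check of the step-time approximations from the PR description.
# All numbers are copied from the SGLang+FSDP2 rows of the results table.

# colocate sync: step ≈ gen + old_log_prob + update_actor
sync_estimate=$((131 + 54 + 199))

# one-step-overlap async:
# step ≈ max(wait_prev_gen + generate_sequences, old_log_prob + update_actor)
async_rollout=$((12 + 305))
async_train=$((58 + 245))
async_estimate=$(( async_rollout > async_train ? async_rollout : async_train ))

echo "sync estimate:  ${sync_estimate} (reported step: 452)"
echo "async estimate: ${async_estimate} (reported step: 406)"
```

In both rows the estimate falls below the reported step, so the formulas leave out some per-step overhead. Under this model the rollout side (317) slightly exceeds the training side (303) in the async run.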
1 parent 5d378b5 commit 3abcc09

7 files changed, +480 -14 lines changed

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+hydra:
+  searchpath:
+    - file://verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
+data:
+  max_prompt_length: 1024
+  max_response_length: 1024
+  train_batch_size: 256
+  return_raw_chat: True
+  shuffle: False
+
+actor_rollout_ref:
+  hybrid_engine: True
+  rollout:
+    name: sglang
+    multi_turn:
+      enable: True
+      max_assistant_turns: 2
+      format: qwen

recipe/one_step_off_policy/README.md

Lines changed: 1 addition & 1 deletion
@@ -293,6 +293,6 @@ python3 -m recipe.one_step_off_policy.async_main_ppo \
 | Category | Support Situation |
 |--------------------|-----------------------------------------------------------------------------------------------------------------|
 | train engine | FSDP2 <br/> Megatron |
-| rollout engine | vLLM |
+| rollout engine | vLLM <br/> SGLang |
 | AdvantageEstimator | GRPO <br/> GRPO_PASSK <br/> REINFORCE_PLUS_PLUS <br/> RLOO <br/> OPO <br/> REINFORCE_PLUS_PLUS_BASELINE<br/>GPG |
 | Reward | all |
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+#!/usr/bin/env bash
+set -xeuo pipefail
+
+project_name='DAPO'
+exp_name='DAPO-Qwen2.5-7b-MATH-0527a1-fsdp2-sglang-one-step-off-4-12'
+
+adv_estimator=grpo
+
+use_kl_in_reward=False
+kl_coef=0.0
+use_kl_loss=False
+kl_loss_coef=0.0
+
+clip_ratio_low=0.2
+clip_ratio_high=0.28
+
+max_prompt_length=$((1024 * 2))
+max_response_length=$((1024 * 8))
+enable_overlong_buffer=True
+overlong_buffer_len=$((1024 * 4))
+overlong_penalty_factor=1.0
+
+loss_agg_mode="token-mean"
+
+train_prompt_bsz=512
+n_resp_per_prompt=12
+train_prompt_mini_bsz=32
+
+# Ray
+# RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+# WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+# RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+NNODES=${NNODES:-2}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+
+n_gpus_rollout=2
+n_gpus_training=$((NGPUS_PER_NODE - n_gpus_rollout))
+
+# Paths
+RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+# very important! please modify the max_position_embeddings in config.json to 32768 after downloading from huggingface
+MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
+CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+
+# Algorithm
+temperature=1.0
+top_p=1.0
+top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+val_top_p=0.7
+
+# Performance Related Parameter
+use_dynamic_bsz=True
+actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+ref_offload=True
+actor_offload=False
+gen_tp=2
+sp_size=4
+fsdp_size=2
+
+python3 -m recipe.one_step_off_policy.main_ppo \
+    data.train_files="${TRAIN_FILE}" \
+    data.val_files="${TEST_FILE}" \
+    data.prompt_key=prompt \
+    data.truncation='left' \
+    data.max_prompt_length=${max_prompt_length} \
+    data.max_response_length=${max_response_length} \
+    data.train_batch_size=${train_prompt_bsz} \
+    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+    algorithm.adv_estimator=${adv_estimator} \
+    algorithm.use_kl_in_reward=${use_kl_in_reward} \
+    algorithm.kl_ctrl.kl_coef=${kl_coef} \
+    actor_rollout_ref.actor.strategy=fsdp2 \
+    critic.strategy=fsdp2 \
+    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+    actor_rollout_ref.actor.clip_ratio_c=10.0 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    actor_rollout_ref.hybrid_engine=False \
+    +actor_rollout_ref.model.override_config.max_position_embeddings=32768 \
+    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+    actor_rollout_ref.model.path="${MODEL_PATH}" \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+    actor_rollout_ref.actor.optim.weight_decay=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+    actor_rollout_ref.actor.fsdp_config.param_offload=${actor_offload} \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${actor_offload} \
+    actor_rollout_ref.actor.entropy_coeff=0 \
+    actor_rollout_ref.actor.grad_clip=1.0 \
+    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+    actor_rollout_ref.rollout.temperature=${temperature} \
+    actor_rollout_ref.rollout.top_p=${top_p} \
+    actor_rollout_ref.rollout.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+    actor_rollout_ref.rollout.val_kwargs.n=1 \
+    actor_rollout_ref.rollout.name=sglang \
+    actor_rollout_ref.ref.fsdp_config.param_offload=${ref_offload} \
+    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+    actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+    reward_model.reward_manager=dapo \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+    trainer.logger=['console','tensorboard'] \
+    trainer.project_name="${project_name}" \
+    trainer.experiment_name="${exp_name}" \
+    trainer.val_before_train=True \
+    trainer.test_freq=10 \
+    trainer.save_freq=-1 \
+    trainer.total_epochs=10 \
+    trainer.total_training_steps=100 \
+    trainer.default_local_dir="${CKPTS_DIR}" \
+    trainer.resume_mode=auto \
+    trainer.log_val_generations=10 \
+    trainer.nnodes="${NNODES}" \
+    trainer.n_gpus_per_node="${n_gpus_training}" \
+    rollout.nnodes="${NNODES}" \
+    rollout.n_gpus_per_node="${n_gpus_rollout}"
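The script above splits each node's GPUs between rollout and training. The sketch below reproduces that arithmetic with the script's default values to show how the "4-12" split in the experiment name comes about. It is illustrative only: `total_rollout`, `total_training`, and `rollout_replicas` are names introduced here, and the divisibility check against `gen_tp` is an assumption about how tensor-parallel replicas are formed, not something the script enforces.

```shell
# Illustrative only: GPU split implied by the script defaults.
NNODES=${NNODES:-2}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
n_gpus_rollout=2
n_gpus_training=$((NGPUS_PER_NODE - n_gpus_rollout))
gen_tp=2

# Guard against a split that leaves no GPUs for training.
if [ "${n_gpus_training}" -le 0 ]; then
    echo "n_gpus_rollout must be smaller than NGPUS_PER_NODE" >&2
    exit 1
fi

total_rollout=$((NNODES * n_gpus_rollout))
total_training=$((NNODES * n_gpus_training))

# Assumption: rollout GPUs are grouped into replicas of gen_tp GPUs each.
if [ $((total_rollout % gen_tp)) -ne 0 ]; then
    echo "rollout GPUs (${total_rollout}) not divisible by gen_tp (${gen_tp})" >&2
    exit 1
fi
rollout_replicas=$((total_rollout / gen_tp))

echo "rollout GPUs: ${total_rollout} (${rollout_replicas} TP replicas), training GPUs: ${total_training}"
```

With the defaults (2 nodes, 8 GPUs per node) this gives 4 rollout GPUs and 12 training GPUs, matching the Generation and Training figures in the PR description.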
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
+#!/usr/bin/env bash
+set -xeuo pipefail
+
+project_name='DAPO'
+exp_name='DAPO-Qwen2.5-7b-MATH-0527a1-fsdp2-sglang-colocate'
+
+adv_estimator=grpo
+
+use_kl_in_reward=False
+kl_coef=0.0
+use_kl_loss=False
+kl_loss_coef=0.0
+
+clip_ratio_low=0.2
+clip_ratio_high=0.28
+
+max_prompt_length=$((1024 * 2))
+max_response_length=$((1024 * 8))
+enable_overlong_buffer=True
+overlong_buffer_len=$((1024 * 4))
+overlong_penalty_factor=1.0
+
+loss_agg_mode="token-mean"
+
+train_prompt_bsz=512
+n_resp_per_prompt=12
+train_prompt_mini_bsz=32
+
+# Ray
+# RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+# WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+# RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+NNODES=${NNODES:-2}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+# Paths
+RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+# very important! please modify the max_position_embeddings in config.json to 32768 after downloading from huggingface
+MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
+CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+# Algorithm
+temperature=1.0
+top_p=1.0
+top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+val_top_p=0.7
+
+# Performance Related Parameter
+use_dynamic_bsz=True
+actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+offload=True
+gen_tp=2
+sp_size=4
+fsdp_size=2
+
+# reference run wandb: https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/runs/ow47vvon?nw=nwusertongyuxuan361
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files="${TRAIN_FILE}" \
+    data.val_files="${TEST_FILE}" \
+    data.prompt_key=prompt \
+    data.truncation='left' \
+    data.max_prompt_length=${max_prompt_length} \
+    data.max_response_length=${max_response_length} \
+    data.train_batch_size=${train_prompt_bsz} \
+    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+    algorithm.adv_estimator=${adv_estimator} \
+    algorithm.use_kl_in_reward=${use_kl_in_reward} \
+    algorithm.kl_ctrl.kl_coef=${kl_coef} \
+    actor_rollout_ref.actor.strategy=fsdp2 \
+    critic.strategy=fsdp2 \
+    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+    actor_rollout_ref.actor.clip_ratio_c=10.0 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    +actor_rollout_ref.model.override_config.max_position_embeddings=32768 \
+    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+    actor_rollout_ref.model.path="${MODEL_PATH}" \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+    actor_rollout_ref.actor.optim.weight_decay=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+    actor_rollout_ref.actor.entropy_coeff=0 \
+    actor_rollout_ref.actor.grad_clip=1.0 \
+    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+    actor_rollout_ref.rollout.temperature=${temperature} \
+    actor_rollout_ref.rollout.top_p=${top_p} \
+    actor_rollout_ref.rollout.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+    actor_rollout_ref.rollout.val_kwargs.n=1 \
+    actor_rollout_ref.rollout.name=sglang \
+    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+    actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+    reward_model.reward_manager=dapo \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+    trainer.logger=['console','tensorboard'] \
+    trainer.project_name="${project_name}" \
+    trainer.experiment_name="${exp_name}" \
+    trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+    trainer.nnodes="${NNODES}" \
+    trainer.val_before_train=True \
+    trainer.test_freq=10 \
+    trainer.save_freq=-1 \
+    trainer.total_epochs=10 \
+    trainer.total_training_steps=100 \
+    trainer.default_local_dir="${CKPTS_DIR}" \
+    trainer.resume_mode=auto \
+    trainer.log_val_generations=10
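Several of the length and batch settings in the script above are derived from one another. The sketch below evaluates them with the script's values so the resulting numbers are visible at a glance. It is a rough illustration: `seq_len`, `responses_per_step`, and `penalty_starts_at` are names introduced here, and the penalty threshold assumes the overlong buffer occupies the last `overlong_buffer_len` tokens of the response budget, which the script itself does not spell out.

```shell
# Illustrative only: derived token and batch budgets from the script values.
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 8))
overlong_buffer_len=$((1024 * 4))
train_prompt_bsz=512
n_resp_per_prompt=12

# Longest possible sequence (prompt + response); also used as max_num_batched_tokens.
seq_len=$((max_prompt_length + max_response_length))

# Per-GPU token caps used with dynamic batch sizing.
actor_ppo_max_token_len=$((seq_len * 2))
infer_ppo_max_token_len=$((seq_len * 3))

# Responses generated per training step (prompts x samples per prompt).
responses_per_step=$((train_prompt_bsz * n_resp_per_prompt))

# Assumption: the length penalty applies to the final overlong_buffer_len tokens.
penalty_starts_at=$((max_response_length - overlong_buffer_len))

echo "seq_len=${seq_len} actor_cap=${actor_ppo_max_token_len} infer_cap=${infer_ppo_max_token_len}"
echo "responses_per_step=${responses_per_step} penalty_starts_at=${penalty_starts_at}"
```

The longest sequence (10240 tokens) stays well under the `max_position_embeddings=32768` override passed on the command line.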
