
Commit d7e61d8

[perf] refactor: part 2 - Profiler ci test and fixes (volcengine#3001)
### What does this PR do?

[perf] refactor part 2: Profiler ci test and fixes

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent 00472a8 commit d7e61d8

22 files changed: +111 −74 lines

.github/workflows/e2e_ppo_trainer_megatron_sglang.yml

Lines changed: 8 additions & 2 deletions
```diff
@@ -140,10 +140,16 @@ jobs:
           exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
           python -m verl.model_merger test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
           python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
-      - name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Deepseek)
+      - name: Profiling GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Deepseek)
         run: |
           ray stop --force
-          ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
+          PROFILE_ENABLE=True ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
+          if [ -z "$( ls -A '/tmp/ray/session_latest/logs/nsight/' )" ]; then
+            echo "[ERROR] not found any profiling files"
+            exit 1
+          else
+            echo "[SUCCESS] profile success"
+          fi
       - name: clean up
         run: |
           rm -rf checkpoints
```
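As a reference for reading this check: the step fails when Ray's nsight log directory is empty. A minimal sketch for inspecting the collected reports locally (assumes the Nsight Systems CLI `nsys` is installed; the directory path and the `worker_process_<PID>.<RID>.nsys-rep` naming come from the workflow step above and the docs below):

```bash
# List the reports Ray collected for the current session; the CI step above
# exits 1 when this directory is empty.
ls -A /tmp/ray/session_latest/logs/nsight/

# Summarize one worker report with the Nsight Systems CLI (assumed installed).
nsys stats /tmp/ray/session_latest/logs/nsight/worker_process_*.nsys-rep
```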

docs/ascend_tutorial/ascend_profiling.rst

Lines changed: 5 additions & 5 deletions
```diff
@@ -18,13 +18,13 @@ Last updated: 07/24/2025.

 Control the collection steps and mode via parameters in ppo_trainer.yaml:

-- profiler: controls which ranks are profiled and the profiling mode
+- global_profiler: controls which ranks are profiled and the profiling mode

   - tool: the profiling tool to use; options are nsys, npu, torch, torch_memory.
   - steps: a list of the steps to profile, e.g. [2, 4] profiles steps 2 and 4. If set to null, nothing is collected.
   - save_path: the path where collected data is saved. Default is "outputs/profile".

-Control the concrete collection behavior via parameters in ``profiler.tool_config.npu``:
+Control the concrete collection behavior via parameters in ``global_profiler.global_tool_config.npu``:

 - level: collection level; options are level_none, level0, level1, and level2

@@ -63,15 +63,15 @@ Last updated: 07/24/2025.

 .. code:: yaml

-   profiler:
+   global_profiler:
       steps: null # disable profile

 End-to-end collection
 ~~~~~~~~~~~~~~~~~~~~~

 .. code:: yaml

-   profiler:
+   global_profiler:
       steps: [1, 2, 5]
       discrete: False
    actor_rollout_ref:
@@ -87,7 +87,7 @@ Last updated: 07/24/2025.

 .. code:: yaml

-   profiler:
+   global_profiler:
       discrete: True
```

docs/ascend_tutorial/ascend_profiling_en.rst

Lines changed: 5 additions & 5 deletions
```diff
@@ -20,7 +20,7 @@ Global collection control
 Use parameters in ppo_trainer.yaml to control the collection mode
 and steps.

-- profiler: Control the ranks and mode of profiling
+- global_profiler: Control the ranks and mode of profiling

   - tool: The profiling tool to use, options are nsys, npu, torch,
     torch_memory.
@@ -30,7 +30,7 @@ and steps.
 - save_path: The path to save the collected data. Default is
   "outputs/profile".

-Use parameters in ``profiler.tool_config.npu`` to control npu profiler behavior:
+Use parameters in ``global_profiler.global_tool_config.npu`` to control npu profiler behavior:

 - level: Collection level—options are level_none, level0, level1, and
   level2
@@ -77,15 +77,15 @@ Disabling collection

 .. code:: yaml

-   profiler:
+   global_profiler:
       steps: null # disable profile

 End-to-End collection
 ~~~~~~~~~~~~~~~~~~~~~

 .. code:: yaml

-   profiler:
+   global_profiler:
       steps: [1, 2, 5]
       discrete: False
    actor_rollout_ref:
@@ -100,7 +100,7 @@ Discrete Mode Collection

 .. code:: yaml

-   profiler:
+   global_profiler:
       discrete: True
```

docs/perf/nsight_profiling.md

Lines changed: 7 additions & 8 deletions
````diff
@@ -16,24 +16,23 @@ Nsight Systems version is important, please reference `docker/Dockerfile.vllm.sg

 verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed in any nodes in the cluster, there is a message printed in the logging to indicate the controller process node hostname and process id.

-In `profiler`, three new config entries control the profiler behaviors:
+In `global_profiler`, three new config entries control the profiler behaviors:

-* **`profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5. And ``null`` means no profiling.
+* **`global_profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5. And ``null`` means no profiling.

-* **`profiler.profile_continuous_steps`**. If true, and the following `profiler.discrete==False`, then the continuous steps in `profiler.steps` will be combined into one database. For example the above step 1 and 2 are in one database, and 5 in another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.
+* **`global_profiler.profile_continuous_steps`**. If true, and the following `global_profiler.discrete==False`, then the continuous steps in `global_profiler.steps` will be combined into one database. For example the above step 1 and 2 are in one database, and 5 in another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.

-Nsys options in controller nodes and worker nodes are configured in `trainer`:
+Nsys options in controller nodes and worker nodes are configured in `global_profiler.global_tool_config.nsys`:

-* **`trainer.controller_nsight_options`**. This config group is for the single controller. All fields in this config group will be just sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
-* **`trainer.worker_nsight_options`**. This config group is for the worker processes. Similarly all fields in this config group will be just sent to Nsight Systems when Ray starts the controller process. Capture range is used to control the profiler when to start and stop. So `capture-range: "cudaProfilerApi"` is fixed and does not change it. Users can change `capture-range-end` with some accurate calculation or just leave it `null`.
+* **`global_profiler.global_tool_config.nsys.controller_nsight_options`**. This config group is for the single controller. All fields in this config group will be just sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
+* **`global_profiler.global_tool_config.nsys.worker_nsight_options`**. This config group is for the worker processes. Similarly all fields in this config group will be just sent to Nsight Systems when Ray starts the controller process. Capture range is used to control the profiler when to start and stop. So `capture-range: "cudaProfilerApi"` is fixed and does not change it. Users can change `capture-range-end` with some accurate calculation or just leave it `null`.

 ### Worker process profiling

 Verl manages mulitiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_, _Reward_, which are implemented in different Worker classes. And these workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:

 * **`all_ranks` and `ranks`**. When `all_ranks` is set `True` then all ranks will be profiled; when set `False`, `ranks` will be profiled. By default, verl profiles the whole training process in a series ` worker_process_<PID>.<RID>.nsys-rep` files for each process rank. PID is the process ID; RID is the capture range ID.
 * **`discrete`**. When set `False`, all the roles actions in one training step will be dumped in one database. When set `True`, the actions annotated by `DistProfiler.annotate` will be dumped into a discrete database. In this case, each role's action occupies one `<RID>`.
-* **`actor_rollout_ref`**. This Worker can be configured to contain at most 3 roles and executes together. So `actor_rollout_ref` has a `profiler` config and all the inside roles inherit it.
 * **Verl collocate mode**. Verl can combine two Worker sub classes to one Worker Actor. In this case, the user should take care that the combined Workers have consistent `discrete`. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database anyway.

 ### where to find the profiling data
@@ -56,7 +55,7 @@ To enable profiling for specific components and steps, modify your ppo_trainer.yaml

 ### Enable profiler and one database for one training step

 ```yaml
-profiler:
+global_profiler:
   steps: [1, 2, 5]
   discrete: False
 actor_rollout_ref:
````
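Assembled from the renamed keys above, a sketch of what the new namespace looks like as Hydra overrides on the command line; the values are illustrative, not defaults:

```bash
# Hypothetical invocation showing the post-rename override paths
# (formerly `profiler.*` and `trainer.*_nsight_options`).
python3 -m verl.trainer.main_ppo \
    global_profiler.tool=nsys \
    global_profiler.steps='[1,2,5]' \
    global_profiler.profile_continuous_steps=True \
    global_profiler.global_tool_config.nsys.discrete=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.all_ranks=True
```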

examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh

Lines changed: 7 additions & 7 deletions
```diff
@@ -55,11 +55,11 @@ python3 -m verl.trainer.main_ppo \
     trainer.test_freq=5 \
     trainer.total_epochs=5 \
     trainer.device=npu \
-    profiler.tool=npu \
-    profiler.steps=$PROFILE_STEPS \
-    profiler.save_path=$SAVE_PATH \
-    profiler.tool_config.npu.discrete=$DISCRETE \
-    profiler.tool_config.npu.contents=$CONTENTS \
-    profiler.tool_config.npu.level=$LEVEL \
-    profiler.tool_config.npu.analysis=$ANALYSIS
+    global_profiler.tool=npu \
+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.save_path=$SAVE_PATH \
+    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
+    global_profiler.global_tool_config.npu.contents=$CONTENTS \
+    global_profiler.global_tool_config.npu.level=$LEVEL \
+    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
     $@
```

examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh

Lines changed: 7 additions & 7 deletions
```diff
@@ -54,11 +54,11 @@ python3 -m verl.trainer.main_ppo \
     trainer.test_freq=5 \
     trainer.total_epochs=5 \
     trainer.device=npu \
-    profiler.tool=npu \
-    profiler.steps=$PROFILE_STEPS \
-    profiler.save_path=$SAVE_PATH \
-    profiler.tool_config.npu.discrete=$DISCRETE \
-    profiler.tool_config.npu.contents=$CONTENTS \
-    profiler.tool_config.npu.level=$LEVEL \
-    profiler.tool_config.npu.analysis=$ANALYSIS \
+    global_profiler.tool=npu \
+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.save_path=$SAVE_PATH \
+    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
+    global_profiler.global_tool_config.npu.contents=$CONTENTS \
+    global_profiler.global_tool_config.npu.level=$LEVEL \
+    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
     $@
```

examples/ppo_trainer/run_deepseek_math_gsm8k_megatron_nsys.sh

Lines changed: 3 additions & 3 deletions
```diff
@@ -60,6 +60,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer.yaml' \
     trainer.test_freq=-1 \
     trainer.total_epochs=100 \
     trainer.total_training_steps=1 \
-    profiler.tool=nsys \
-    profiler.steps=$PROFILE_STEPS \
-    profiler.tool_config.nsys.discrete=$DISCRETE $@
+    global_profiler.tool=nsys \
+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.global_tool_config.nsys.discrete=$DISCRETE $@
```

examples/ppo_trainer/run_qwen2-7b_rm_seq_balance_nsys.sh

Lines changed: 4 additions & 4 deletions
```diff
@@ -75,7 +75,7 @@ python3 -m verl.trainer.main_ppo \
     trainer.test_freq=-1 \
     trainer.total_epochs=15 \
     trainer.total_training_steps=6 \
-    profiler.profile_continuous_steps=True \
-    profiler.tool=nsys \
-    profiler.steps=$PROFILE_STEPS \
-    profiler.tool_config.nsys.discrete=$DISCRETE $@
+    global_profiler.profile_continuous_steps=True \
+    global_profiler.tool=nsys \
+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.global_tool_config.nsys.discrete=$DISCRETE $@
```

recipe/one_step_off_policy/ray_trainer.py

Lines changed: 5 additions & 4 deletions
```diff
@@ -234,11 +234,12 @@ def init_workers(self):
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
         if OmegaConf.select(self.config.global_profiler, "steps") is not None:
             wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "steps")
-            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
-                "worker_nsight_options must be set when profile_steps is set"
-            )
+            assert (
+                OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options")
+                is not None
+            ), "worker_nsight_options must be set when profile_steps is set"
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options")
             )

         for resource_pool, class_dict in self.resource_pool_to_cls.items():
```

tests/special_e2e/run_ppo_trainer_megatron.sh

Lines changed: 20 additions & 1 deletion
```diff
@@ -140,6 +140,12 @@ fi

 OPTIM_MEMORY_EFFICIENT=${OPTIM_MEMORY_EFFICIENT:-False}

+PROFILE_ENABLE=${PROFILE_ENABLE:-False}
+PROFILE_STEPS=${PROFILE_STEPS:-[1]}
+PROFILE_RANKS_ALL=${PROFILE_RANKS_ALL:-True}
+PROFILE_RANKS=${PROFILE_RANKS:-[0,1,2,3]}
+DISCRETE=${DISCRETE:-True} # or True
+
 python3 -m verl.trainer.main_ppo --config-path=config \
     --config-name='ppo_megatron_trainer.yaml'\
     algorithm.adv_estimator="${ADV_ESTIMATOR}" \
@@ -176,6 +182,9 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
+    actor_rollout_ref.actor.profiler.enable=$PROFILE_ENABLE \
+    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
+    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
     actor_rollout_ref.rollout.name="${ENGINE}" ${ROLLOUT_MODE_ARG}\
     actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
@@ -214,6 +223,9 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     critic.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
     critic.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
     critic.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
+    critic.profiler.enable=$PROFILE_ENABLE \
+    critic.profiler.ranks=$PROFILE_RANKS \
+    critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
     reward_model.enable=True \
     reward_model.model.path="${MODEL_PATH}" \
     reward_model.micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
@@ -227,6 +239,9 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     reward_model.megatron.param_offload=${RM_PARAM_OFFLOAD} \
     reward_model.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
     reward_model.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
+    reward_model.profiler.enable=$PROFILE_ENABLE \
+    reward_model.profiler.ranks=$PROFILE_RANKS \
+    reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
     algorithm.use_kl_in_reward=False \
     algorithm.kl_penalty=kl \
     algorithm.kl_ctrl.kl_coef=0.001 \
@@ -241,4 +256,8 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     trainer.save_freq="${SAVE_FREQ}" \
     trainer.resume_mode="${RESUME_MODE}" \
     trainer.total_epochs=2 \
-    trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" $@
+    trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" \
+    global_profiler.profile_continuous_steps=True \
+    global_profiler.tool=nsys \
+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.global_tool_config.nsys.discrete=$DISCRETE $@
```
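For context, the CI workflow above drives these knobs through the environment; a sketch of an equivalent local invocation (variable names and defaults are the ones added in this diff):

```bash
# Enable profiling in the megatron e2e test; unset variables fall back to the
# defaults above (PROFILE_STEPS=[1], PROFILE_RANKS_ALL=True, DISCRETE=True).
PROFILE_ENABLE=True ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False \
MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct \
bash tests/special_e2e/run_ppo_trainer_megatron.sh
```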
