Commit ec3a2a0

tongtong0613 authored and DDVD233 committed
[perf] fix: fix npu profiler and add mstx UT (volcengine#3052)
### What does this PR do?

- fix the parameter passing error for profile_level
- fix the error when creating npu profiler in discrete mode
- modify the execution script
- modify ascend profiling doc
- add the discrete parameter in tool_config
- add mstx_profile UT

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

1 parent fb6c93e commit ec3a2a0

10 files changed, +385 -59 lines changed

.github/workflows/e2e_ascend.yml

Lines changed: 4 additions & 0 deletions
@@ -149,4 +149,8 @@ jobs:
           USE_DIST_CKPT=True bash tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh
           rm -rf $HOME/dist_ckpt/qwen2_5_05b_grpo_mindspeed
           rm -rf $HOME/ckpts
+      - name: Running NPU profiling unit tests
+        run: |
+          ray stop --force
+          pytest -s -x tests/utils/test_special_mstx_profile.py
 

docs/ascend_tutorial/ascend_profiling_en.rst

Lines changed: 28 additions & 15 deletions
@@ -1,7 +1,7 @@
 Data collection based on FSDP backend on Ascend devices(en)
 ==========================================================================================
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 This is a tutorial for data collection using the GRPO or DAPO algorithm
 based on FSDP on Ascend devices.
@@ -30,7 +30,18 @@ and steps.
 - save_path: The path to save the collected data. Default is
   "outputs/profile".
 
-Use parameters in ``global_profiler.global_tool_config.npu`` to control npu profiler behavior:
+
+Role collection control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In each role's ``profiler`` field, you can control the collection mode for that role.
+
+- enable: Whether to enable profiling for this role.
+- all_ranks: Whether to collect data from all ranks.
+- ranks: A list of ranks to collect data from. If empty, no data is collected.
+- tool_config: Configuration for the profiling tool used by this role.
+
+Use parameters in each role's ``profiler.tool_config.npu`` to control npu profiler behavior:
 
 - level: Collection level—options are level_none, level0, level1, and
   level2
@@ -56,17 +67,7 @@ Use parameters in ``global_profiler.global_tool_config.npu`` to control npu prof
 - stack: Whether to record operator call stack information.
 
 - analysis: Enables automatic data parsing.
-
-
-Role collection control
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In each role's ``profile`` field, you can control the collection mode for that role.
-
-- enable: Whether to enable profiling for this role.
-- all_ranks: Whether to collect data from all ranks.
-- ranks: A list of ranks to collect data from. If empty, no data is collected.
-- tool_config: Configuration for the profiling tool used by this role.
+- discrete: Whether to enable discrete mode.
 
 
 Examples
@@ -87,12 +88,15 @@ End-to-End collection
 
     global_profiler:
       steps: [1, 2, 5]
-      discrete: False
     actor_rollout_ref:
       actor:
         profiler:
           enable: True
           all_ranks: True
+          tool_config:
+            npu:
+              discrete: False
+      # rollout & ref follow actor settings
 
 
 Discrete Mode Collection
@@ -101,7 +105,16 @@ Discrete Mode Collection
 .. code:: yaml
 
     global_profiler:
-      discrete: True
+      steps: [1, 2, 5]
+    actor_rollout_ref:
+      actor:
+        profiler:
+          enable: True
+          all_ranks: True
+          tool_config:
+            npu:
+              discrete: True
+      # rollout & ref follow actor settings
 
 
 Visualization
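
The role-level fields documented in this diff (`enable`, `all_ranks`, `ranks`) can be read as a small decision rule for whether a given rank collects data. The following is a minimal sketch of that rule, not verl's actual implementation: `RoleProfilerConfig` and `should_profile` are hypothetical names introduced for illustration, and the precedence of `all_ranks` over `ranks` is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoleProfilerConfig:
    """Hypothetical mirror of a role's ``profiler`` fields from the docs."""

    enable: bool = False
    all_ranks: bool = False
    ranks: List[int] = field(default_factory=list)


def should_profile(cfg: RoleProfilerConfig, rank: int) -> bool:
    """Return True if ``rank`` should collect data under ``cfg``."""
    if not cfg.enable:
        return False
    if cfg.all_ranks:  # assumption: all_ranks takes precedence over ranks
        return True
    # Per the docs, an empty ``ranks`` list means no data is collected.
    return rank in cfg.ranks
```

For instance, `should_profile(RoleProfilerConfig(enable=True, ranks=[0, 2]), 2)` evaluates to `True`, while the same call with an empty `ranks` list evaluates to `False`.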

docs/ascend_tutorial/ascend_profiling_zh.rst

Lines changed: 27 additions & 15 deletions
@@ -3,7 +3,7 @@ Data collection based on FSDP backend on Ascend devices(zh)
 
 在昇腾设备上基于FSDP后端进行数据采集
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 这是一份在昇腾设备上基于FSDP后端使用GRPO或DAPO算法进行数据采集的教程。
 
@@ -26,7 +26,17 @@ Last updated: 07/24/2025.
 - steps: 此参数可以设置为包含采集步数的列表,例如 [2, 4],表示将采集第2步和第4步。如果设置为 null,则不进行采集。
 - save_path: 保存采集数据的路径。默认值为 "outputs/profile"。
 
-通过 ``global_profiler.global_tool_config.npu`` 中的参数控制具体采集行为:
+角色profiler控制
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+在每个角色的 ``profiler`` 字段中,您可以控制该角色的采集模式。
+
+- enable: 是否为此角色启用性能分析。
+- all_ranks: 是否从所有rank收集数据。
+- ranks: 要收集数据的rank列表。如果为空,则不收集数据。
+- tool_config: 此角色使用的性能分析工具的配置。
+
+通过每个角色的 ``profiler.tool_config.npu`` 中的参数控制具体采集行为:
 
 - level: 采集级别—选项有 level_none、level0、level1 和 level2
 
@@ -46,16 +56,7 @@ Last updated: 07/24/2025.
 - stack: 是否记录算子调用栈信息。
 
 - analysis: 启用自动数据解析。
-
-角色profile控制
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-在每个角色的 ``profile`` 字段中,您可以控制该角色的采集模式。
-
-- enable: 是否为此角色启用性能分析。
-- all_ranks: 是否从所有rank收集数据。
-- ranks: 要收集数据的rank列表。如果为空,则不收集数据。
-- tool_config: 此角色使用的性能分析工具的配置。
+- discrete: 使用离散模式。
 
 示例
 ----
@@ -75,12 +76,14 @@ Last updated: 07/24/2025.
 
     global_profiler:
       steps: [1, 2, 5]
-      discrete: False
     actor_rollout_ref:
       actor:
-        profile:
+        profiler:
           enable: True
           all_ranks: True
+          tool_config:
+            npu:
+              discrete: False
       # rollout & ref follow actor settings
 
 
@@ -90,7 +93,16 @@ Last updated: 07/24/2025.
 .. code:: yaml
 
     global_profiler:
-      discrete: True
+      steps: [1, 2, 5]
+    actor_rollout_ref:
+      actor:
+        profiler:
+          enable: True
+          all_ranks: True
+          tool_config:
+            npu:
+              discrete: True
+      # rollout & ref follow actor settings
 
 
 可视化

examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ python3 -m verl.trainer.main_ppo \
1616
algorithm.adv_estimator=grpo \
1717
data.train_files=$HOME/data/gsm8k/train.parquet \
1818
data.val_files=$HOME/data/gsm8k/test.parquet \
19-
data.train_batch_size=1024 \
19+
data.train_batch_size=32 \
2020
data.max_prompt_length=1024 \
2121
data.max_response_length=1024 \
2222
data.filter_overlong_prompts=True \
@@ -25,8 +25,8 @@ python3 -m verl.trainer.main_ppo \
2525
actor_rollout_ref.model.enable_gradient_checkpointing=True \
2626
actor_rollout_ref.model.use_remove_padding=False \
2727
actor_rollout_ref.actor.optim.lr=5e-8 \
28-
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
29-
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
28+
actor_rollout_ref.actor.ppo_mini_batch_size=2 \
29+
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
3030
actor_rollout_ref.actor.use_kl_loss=True \
3131
actor_rollout_ref.actor.entropy_coeff=0 \
3232
actor_rollout_ref.actor.kl_loss_coef=0.001 \
@@ -36,30 +36,30 @@ python3 -m verl.trainer.main_ppo \
3636
actor_rollout_ref.actor.profiler.enable=True \
3737
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
3838
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
39-
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
39+
actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
40+
actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
41+
actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
42+
actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
43+
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
4044
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
4145
actor_rollout_ref.rollout.name=vllm \
4246
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
43-
actor_rollout_ref.rollout.n=5 \
47+
actor_rollout_ref.rollout.n=4 \
4448
actor_rollout_ref.rollout.enable_chunked_prefill=False \
45-
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
49+
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
4650
actor_rollout_ref.ref.fsdp_config.param_offload=True \
4751
algorithm.use_kl_in_reward=False \
4852
trainer.critic_warmup=0 \
4953
trainer.logger=console \
5054
trainer.project_name='verl_grpo_example_gsm8k' \
5155
trainer.experiment_name='qwen2_5_7b_function_rm' \
52-
trainer.n_gpus_per_node=16 \
56+
trainer.n_gpus_per_node=8 \
5357
trainer.nnodes=1 \
5458
trainer.save_freq=-1 \
5559
trainer.test_freq=5 \
5660
trainer.total_epochs=5 \
5761
trainer.device=npu \
5862
global_profiler.tool=npu \
5963
global_profiler.steps=$PROFILE_STEPS \
60-
global_profiler.save_path=$SAVE_PATH \
61-
global_profiler.global_tool_config.npu.discrete=$DISCRETE \
62-
global_profiler.global_tool_config.npu.contents=$CONTENTS \
63-
global_profiler.global_tool_config.npu.level=$LEVEL \
64-
global_profiler.global_tool_config.npu.analysis=$ANALYSIS
64+
global_profiler.save_path=$SAVE_PATH
6565
$@
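
This script moves the NPU tool options from `global_profiler.global_tool_config.npu.*` to `actor_rollout_ref.actor.profiler.tool_config.npu.*`. For anyone carrying older command lines, the rename can be sketched as a prefix rewrite over `key=value` overrides. `migrate_overrides` is a hypothetical helper, not part of verl, and it targets only the actor role because that is the only role this script configures; the sample values are illustrative.

```python
from typing import List

OLD_PREFIX = "global_profiler.global_tool_config.npu."
NEW_PREFIX = "actor_rollout_ref.actor.profiler.tool_config.npu."


def migrate_overrides(overrides: List[str]) -> List[str]:
    """Rewrite old global NPU tool overrides to the actor's per-role path."""
    migrated = []
    for item in overrides:
        if item.startswith(OLD_PREFIX):
            # Swap only the prefix; the option name and value are unchanged.
            item = NEW_PREFIX + item[len(OLD_PREFIX):]
        migrated.append(item)
    return migrated


old = [
    "global_profiler.steps=[1,2,5]",
    "global_profiler.global_tool_config.npu.discrete=True",
    "global_profiler.global_tool_config.npu.level=level1",
]
for line in migrate_overrides(old):
    print(line)
```

Overrides that stay under `global_profiler` (such as `steps` and `save_path`) pass through untouched, which matches what the diff keeps in place.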

examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh

Lines changed: 12 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ python3 -m verl.trainer.main_ppo \
1515
algorithm.adv_estimator=grpo \
1616
data.train_files=$HOME/data/gsm8k/train.parquet \
1717
data.val_files=$HOME/data/gsm8k/test.parquet \
18-
data.train_batch_size=1024 \
18+
data.train_batch_size=32 \
1919
data.max_prompt_length=1024 \
2020
data.max_response_length=1024 \
2121
data.filter_overlong_prompts=True \
@@ -24,41 +24,40 @@ python3 -m verl.trainer.main_ppo \
2424
actor_rollout_ref.actor.optim.lr=5e-8 \
2525
actor_rollout_ref.model.use_remove_padding=False \
2626
actor_rollout_ref.model.enable_gradient_checkpointing=True \
27-
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
28-
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
27+
actor_rollout_ref.actor.ppo_mini_batch_size=2 \
28+
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
2929
actor_rollout_ref.actor.use_kl_loss=True \
3030
actor_rollout_ref.actor.entropy_coeff=0 \
3131
actor_rollout_ref.actor.kl_loss_coef=0.001 \
3232
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
3333
actor_rollout_ref.actor.profiler.enable=True \
34-
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
3534
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
35+
actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
36+
actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
37+
actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
38+
actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
3639
actor_rollout_ref.actor.fsdp_config.param_offload=False \
3740
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
38-
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
41+
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
3942
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
4043
actor_rollout_ref.rollout.name=vllm \
4144
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
42-
actor_rollout_ref.rollout.n=5 \
45+
actor_rollout_ref.rollout.n=4 \
4346
actor_rollout_ref.rollout.enable_chunked_prefill=False \
44-
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
47+
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
4548
actor_rollout_ref.ref.fsdp_config.param_offload=True \
4649
algorithm.use_kl_in_reward=False \
4750
trainer.critic_warmup=0 \
4851
trainer.logger=console \
4952
trainer.project_name='verl_grpo_example_gsm8k' \
5053
trainer.experiment_name='qwen2_5_7b_function_rm' \
51-
trainer.n_gpus_per_node=16 \
54+
trainer.n_gpus_per_node=8 \
5255
trainer.nnodes=1 \
5356
trainer.save_freq=-1 \
5457
trainer.test_freq=5 \
5558
trainer.total_epochs=5 \
5659
trainer.device=npu \
5760
global_profiler.tool=npu \
5861
global_profiler.steps=$PROFILE_STEPS \
59-
global_profiler.save_path=$SAVE_PATH \
60-
global_profiler.global_tool_config.npu.discrete=$DISCRETE \
61-
global_profiler.global_tool_config.npu.contents=$CONTENTS \
62-
global_profiler.global_tool_config.npu.level=$LEVEL \
63-
global_profiler.global_tool_config.npu.analysis=$ANALYSIS
62+
global_profiler.save_path=$SAVE_PATH
6463
$@
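
Both example scripts also shrink the batch settings considerably, which presumably keeps profiling runs short. As a rough illustration, assuming each prompt in a training batch yields `rollout.n` sampled responses, the number of responses generated per step changes as computed below. `responses_per_step` is a hypothetical helper for the arithmetic only and does not model how verl splits mini-batches across devices.

```python
def responses_per_step(train_batch_size: int, rollout_n: int) -> int:
    """Responses generated per step: one batch of prompts, n samples each."""
    return train_batch_size * rollout_n


# Values taken from the removed and added lines of the diff.
before = responses_per_step(train_batch_size=1024, rollout_n=5)
after = responses_per_step(train_batch_size=32, rollout_n=4)
print(before, after)  # 5120 128
```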
