Commit 9681934

fix npu profiler and add mstx UT
1 parent 55e3c5b

File tree

10 files changed: +383 -57 lines

.github/workflows/e2e_ascend.yml

Lines changed: 4 additions & 0 deletions
@@ -149,4 +149,8 @@ jobs:
           USE_DIST_CKPT=True bash tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh
           rm -rf $HOME/dist_ckpt/qwen2_5_05b_grpo_mindspeed
           rm -rf $HOME/ckpts
+      - name: Running NPU profiling unit tests
+        run: |
+          ray stop --force
+          pytest -s -x tests/utils/test_special_mstx_profile.py
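
The new CI step can be reproduced locally; a minimal sketch, assuming an Ascend host with verl, torch_npu, ray, and pytest installed, run from the repository root:

# Stop any leftover Ray cluster from a previous run, then run the new mstx profiler unit tests.
ray stop --force
pytest -s -x tests/utils/test_special_mstx_profile.py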

docs/ascend_tutorial/ascend_profiling_en.rst

Lines changed: 28 additions & 15 deletions
@@ -1,7 +1,7 @@
 Data collection based on FSDP backend on Ascend devices(en)
 ==========================================================================================
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 This is a tutorial for data collection using the GRPO or DAPO algorithm
 based on FSDP on Ascend devices.
@@ -30,7 +30,18 @@ and steps.
 - save_path: The path to save the collected data. Default is
   "outputs/profile".
 
-Use parameters in ``global_profiler.global_tool_config.npu`` to control npu profiler behavior:
+
+Role collection control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In each role's ``profiler`` field, you can control the collection mode for that role.
+
+- enable: Whether to enable profiling for this role.
+- all_ranks: Whether to collect data from all ranks.
+- ranks: A list of ranks to collect data from. If empty, no data is collected.
+- tool_config: Configuration for the profiling tool used by this role.
+
+Use parameters in each role's ``profiler.tool_config.npu`` to control npu profiler behavior:
 
 - level: Collection level—options are level_none, level0, level1, and
   level2
@@ -56,17 +67,7 @@ Use parameters in ``global_profiler.global_tool_config.npu`` to control npu prof
 - stack: Whether to record operator call stack information.
 
 - analysis: Enables automatic data parsing.
-
-
-Role collection control
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In each role's ``profile`` field, you can control the collection mode for that role.
-
-- enable: Whether to enable profiling for this role.
-- all_ranks: Whether to collect data from all ranks.
-- ranks: A list of ranks to collect data from. If empty, no data is collected.
-- tool_config: Configuration for the profiling tool used by this role.
+- discrete: Whether to enable discrete mode.
 
 
 Examples
@@ -87,12 +88,15 @@ End-to-End collection
 
     global_profiler:
       steps: [1, 2, 5]
-      discrete: False
     actor_rollout_ref:
       actor:
         profiler:
          enable: True
          all_ranks: True
+         tool_config:
+           npu:
+             discrete: False
+    # rollout & ref follow actor settings
 
 
 Discrete Mode Collection
@@ -101,7 +105,16 @@ Discrete Mode Collection
 .. code:: yaml
 
     global_profiler:
-      discrete: True
+      steps: [1, 2, 5]
+    actor_rollout_ref:
+      actor:
+        profiler:
+          enable: True
+          all_ranks: True
+          tool_config:
+            npu:
+              discrete: True
+    # rollout & ref follow actor settings
 
 
 Visualization
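
The per-role npu settings documented above can also be supplied as Hydra-style command-line overrides instead of YAML; a minimal sketch, assuming a standard verl.trainer.main_ppo launch, with the rank list, level, and save path values chosen only for illustration:

# Profiler-related overrides to append to the usual `python3 -m verl.trainer.main_ppo ...` command.
# Quotes keep the bracketed lists intact when the shell parses them.
PROFILER_OVERRIDES=(
    global_profiler.tool=npu
    "global_profiler.steps=[1,2,5]"
    global_profiler.save_path=outputs/profile
    actor_rollout_ref.actor.profiler.enable=True
    actor_rollout_ref.actor.profiler.all_ranks=False
    "actor_rollout_ref.actor.profiler.ranks=[0,4]"
    actor_rollout_ref.actor.profiler.tool_config.npu.discrete=True
    actor_rollout_ref.actor.profiler.tool_config.npu.level=level1
    actor_rollout_ref.actor.profiler.tool_config.npu.analysis=True
)
# Example: python3 -m verl.trainer.main_ppo <other flags> "${PROFILER_OVERRIDES[@]}"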

docs/ascend_tutorial/ascend_profiling_zh.rst

Lines changed: 27 additions & 15 deletions
@@ -3,7 +3,7 @@ Data collection based on FSDP backend on Ascend devices(zh)
 
 在昇腾设备上基于FSDP后端进行数据采集
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 这是一份在昇腾设备上基于FSDP后端使用GRPO或DAPO算法进行数据采集的教程。
 
@@ -26,7 +26,17 @@ Last updated: 07/24/2025.
 - steps: 此参数可以设置为包含采集步数的列表,例如 [2, 4],表示将采集第2步和第4步。如果设置为 null,则不进行采集。
 - save_path: 保存采集数据的路径。默认值为 "outputs/profile"。
 
-通过 ``global_profiler.global_tool_config.npu`` 中的参数控制具体采集行为:
+角色profiler控制
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+在每个角色的 ``profiler`` 字段中,您可以控制该角色的采集模式。
+
+- enable: 是否为此角色启用性能分析。
+- all_ranks: 是否从所有rank收集数据。
+- ranks: 要收集数据的rank列表。如果为空,则不收集数据。
+- tool_config: 此角色使用的性能分析工具的配置。
+
+通过每个角色的 ``profiler.tool_config.npu`` 中的参数控制具体采集行为:
 
 - level: 采集级别—选项有 level_none、level0、level1 和 level2
 
@@ -46,16 +56,7 @@ Last updated: 07/24/2025.
 - stack: 是否记录算子调用栈信息。
 
 - analysis: 启用自动数据解析。
-
-角色profile控制
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-在每个角色的 ``profile`` 字段中,您可以控制该角色的采集模式。
-
-- enable: 是否为此角色启用性能分析。
-- all_ranks: 是否从所有rank收集数据。
-- ranks: 要收集数据的rank列表。如果为空,则不收集数据。
-- tool_config: 此角色使用的性能分析工具的配置。
+- discrete: 使用离散模式。
 
 示例
 ----
@@ -75,12 +76,14 @@ Last updated: 07/24/2025.
 
     global_profiler:
       steps: [1, 2, 5]
-      discrete: False
     actor_rollout_ref:
       actor:
-        profile:
+        profiler:
          enable: True
          all_ranks: True
+         tool_config:
+           npu:
+             discrete: False
    # rollout & ref follow actor settings
 
 
@@ -90,7 +93,16 @@ Last updated: 07/24/2025.
 .. code:: yaml
 
     global_profiler:
-      discrete: True
+      steps: [1, 2, 5]
+    actor_rollout_ref:
+      actor:
+        profiler:
+          enable: True
+          all_ranks: True
+          tool_config:
+            npu:
+              discrete: True
+    # rollout & ref follow actor settings
 
 
 可视化

examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh

Lines changed: 11 additions & 11 deletions
@@ -16,7 +16,7 @@ python3 -m verl.trainer.main_ppo \
     algorithm.adv_estimator=grpo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
-    data.train_batch_size=1024 \
+    data.train_batch_size=32 \
     data.max_prompt_length=1024 \
     data.max_response_length=1024 \
     data.filter_overlong_prompts=True \
@@ -25,8 +25,8 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.model.use_remove_padding=False \
     actor_rollout_ref.actor.optim.lr=5e-8 \
-    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
-    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.actor.use_kl_loss=True \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
@@ -36,13 +36,17 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.profiler.enable=True \
     actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
     actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
-    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
+    actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
+    actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
-    actor_rollout_ref.rollout.n=5 \
+    actor_rollout_ref.rollout.n=4 \
     actor_rollout_ref.rollout.enable_chunked_prefill=False \
-    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.ref.fsdp_config.param_offload=True \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -57,9 +61,5 @@ python3 -m verl.trainer.main_ppo \
     trainer.device=npu \
     global_profiler.tool=npu \
     global_profiler.steps=$PROFILE_STEPS \
-    global_profiler.save_path=$SAVE_PATH \
-    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
-    global_profiler.global_tool_config.npu.contents=$CONTENTS \
-    global_profiler.global_tool_config.npu.level=$LEVEL \
-    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
+    global_profiler.save_path=$SAVE_PATH
     $@
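
To exercise the updated example end to end, the script can be launched directly; a minimal sketch, assuming an Ascend node with the gsm8k parquet files under $HOME/data/gsm8k and the PROFILE_*/DISCRETE/CONTENTS/LEVEL/ANALYSIS/SAVE_PATH variables defined in the part of the script not shown in this hunk. It also assumes the trailing $@ forwards extra Hydra overrides to the trainer command:

# Run the discrete-mode profiling example; the appended override is illustrative
# and only reaches the trainer if the script forwards "$@" as assumed above.
bash examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh \
    trainer.total_epochs=1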

examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh

Lines changed: 11 additions & 12 deletions
@@ -15,7 +15,7 @@ python3 -m verl.trainer.main_ppo \
     algorithm.adv_estimator=grpo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
-    data.train_batch_size=1024 \
+    data.train_batch_size=32 \
     data.max_prompt_length=1024 \
     data.max_response_length=1024 \
     data.filter_overlong_prompts=True \
@@ -24,24 +24,27 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.optim.lr=5e-8 \
     actor_rollout_ref.model.use_remove_padding=False \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
-    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
-    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.actor.use_kl_loss=True \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.profiler.enable=True \
-    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
     actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
+    actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
+    actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
     actor_rollout_ref.actor.fsdp_config.param_offload=False \
     actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
-    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
-    actor_rollout_ref.rollout.n=5 \
+    actor_rollout_ref.rollout.n=4 \
     actor_rollout_ref.rollout.enable_chunked_prefill=False \
-    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.ref.fsdp_config.param_offload=True \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -56,9 +59,5 @@ python3 -m verl.trainer.main_ppo \
     trainer.device=npu \
     global_profiler.tool=npu \
     global_profiler.steps=$PROFILE_STEPS \
-    global_profiler.save_path=$SAVE_PATH \
-    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
-    global_profiler.global_tool_config.npu.contents=$CONTENTS \
-    global_profiler.global_tool_config.npu.level=$LEVEL \
-    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
+    global_profiler.save_path=$SAVE_PATH
     $@
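
Note that the e2e variant now drops the explicit profiler.ranks=$PROFILE_RANKS override and keeps only all_ranks=$PROFILE_RANKS_ALL. If per-rank selection is still wanted, the dropped override can be re-added at launch time; a minimal sketch, again assuming the trailing $@ forwards extra arguments and using an illustrative rank list:

# Restrict collection to ranks 0 and 4 instead of all ranks (illustrative values).
bash examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh \
    actor_rollout_ref.actor.profiler.all_ranks=False \
    "actor_rollout_ref.actor.profiler.ranks=[0,4]"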
