DDVD233
diff --git a/.github/workflows/e2e_ascend.yml b/.github/workflows/e2e_ascend.yml
Lines changed: 4 additions & 0 deletions
diff --git a/docs/ascend_tutorial/ascend_profiling_en.rst b/docs/ascend_tutorial/ascend_profiling_en.rst
Lines changed: 28 additions & 15 deletions
diff --git a/docs/ascend_tutorial/ascend_profiling_zh.rst b/docs/ascend_tutorial/ascend_profiling_zh.rst
Lines changed: 27 additions & 15 deletions
diff --git a/examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh b/examples/grpo_trainer/run_qwen2_5_7b_grpo_discrete_prof_npu.sh
Lines changed: 12 additions & 12 deletions
diff --git a/examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh b/examples/grpo_trainer/run_qwen2_5_7b_grpo_e2e_prof_npu.sh
Lines changed: 12 additions & 13 deletions
@@ -149,4 +149,8 @@ jobs:
           USE_DIST_CKPT=True bash tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh
           rm -rf $HOME/dist_ckpt/qwen2_5_05b_grpo_mindspeed
           rm -rf $HOME/ckpts
+      - name: Running NPU profiling unit tests
+        run: |
+          ray stop --force
+          pytest -s -x tests/utils/test_special_mstx_profile.py
           
@@ -1,7 +1,7 @@
 Data collection based on FSDP backend on Ascend devices(en)
 ==========================================================================================
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 This is a tutorial for data collection using the GRPO or DAPO algorithm
 based on FSDP on Ascend devices.
@@ -30,7 +30,18 @@ and steps.
    -  save_path: The path to save the collected data. Default is
       "outputs/profile".
 
-Use parameters in ``global_profiler.global_tool_config.npu`` to control npu profiler behavior:
+
+Role collection control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In each role's ``profiler`` field, you can control the collection mode for that role.
+
+-  enable: Whether to enable profiling for this role.
+-  all_ranks: Whether to collect data from all ranks.
+-  ranks: A list of ranks to collect data from. If empty, no data is collected.
+-  tool_config: Configuration for the profiling tool used by this role.
+
+Use parameters in each role's ``profiler.tool_config.npu`` to control npu profiler behavior:
 
 -  level: Collection level—options are level_none, level0, level1, and
    level2
@@ -56,17 +67,7 @@ Use parameters in ``global_profiler.global_tool_config.npu`` to control npu prof
    -  stack: Whether to record operator call stack information.
 
 -  analysis: Enables automatic data parsing.
-
-
-Role collection control
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In each role's ``profile`` field, you can control the collection mode for that role.
-
--  enable: Whether to enable profiling for this role.
--  all_ranks: Whether to collect data from all ranks.
--  ranks: A list of ranks to collect data from. If empty, no data is collected.
--  tool_config: Configuration for the profiling tool used by this role.
+-  discrete: Whether to enable discrete mode.
 
 
 Examples
@@ -87,12 +88,15 @@ End-to-End collection
 
       global_profiler:
          steps: [1, 2, 5]
-         discrete: False
       actor_rollout_ref:
          actor:
             profiler:
                enable: True
                all_ranks: True
+               tool_config:
+                  npu:
+                     discrete: False
+        # rollout & ref follow actor settings
 
 
 Discrete Mode Collection
@@ -101,7 +105,16 @@ Discrete Mode Collection
 .. code:: yaml
 
       global_profiler:
-         discrete: True
+         steps: [1, 2, 5]
+      actor_rollout_ref:
+         actor:
+            profiler:
+               enable: True
+               all_ranks: True
+               tool_config:
+                  npu:
+                     discrete: True
+        # rollout & ref follow actor settings
 
 
 Visualization
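
Note on the hunks above: they move the NPU tool options out of ``global_profiler`` and into each role's ``profiler.tool_config.npu``. As a minimal sketch of that key move, the helper below rewrites an old-style config into the new layout. ``move_npu_options_to_role`` is a hypothetical name and is not part of verl; treating the config as a plain nested dict, and defaulting to the ``actor`` role, are assumptions made for illustration.

```python
from copy import deepcopy


def move_npu_options_to_role(config, role_path=("actor_rollout_ref", "actor")):
    """Hypothetical helper (not a verl API): return a copy of `config` with NPU
    profiler options moved under <role>.profiler.tool_config.npu."""
    new = deepcopy(config)  # leave the caller's config untouched
    global_profiler = new.get("global_profiler", {})

    # Old location used by the scripts: global_profiler.global_tool_config.npu.*
    old_npu = global_profiler.get("global_tool_config", {}).pop("npu", {})

    # Old location used by the docs example: global_profiler.discrete
    if "discrete" in global_profiler:
        old_npu.setdefault("discrete", global_profiler.pop("discrete"))

    # Walk (and create if needed) the role's profiler.tool_config.npu node.
    node = new
    for key in (*role_path, "profiler", "tool_config", "npu"):
        node = node.setdefault(key, {})

    # Values already set on the role take precedence over migrated ones.
    for key, value in old_npu.items():
        node.setdefault(key, value)
    return new


old_style = {
    "global_profiler": {"steps": [1, 2, 5], "discrete": False},
    "actor_rollout_ref": {
        "actor": {"profiler": {"enable": True, "all_ranks": True}}
    },
}
new_style = move_npu_options_to_role(old_style)
```

With the input shown, ``new_style`` has the same shape as the updated End-to-End YAML example: ``steps`` stays under ``global_profiler`` and ``discrete`` ends up under the actor's ``profiler.tool_config.npu``.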
 
@@ -3,7 +3,7 @@ Data collection based on FSDP backend on Ascend devices(zh)
 
 Data collection based on the FSDP backend on Ascend devices
 
-Last updated: 07/24/2025.
+Last updated: 08/14/2025.
 
 This is a tutorial for data collection using the GRPO or DAPO algorithm based on the FSDP backend on Ascend devices.
 
@@ -26,7 +26,17 @@ Last updated: 07/24/2025.
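   -  steps: This parameter can be set to a list of steps to collect, for example [2, 4], which collects steps 2 and 4. If set to null, no collection is performed.
    -  save_path: The path to save the collected data. Default is "outputs/profile".
 
-Use parameters in ``global_profiler.global_tool_config.npu`` to control the specific collection behavior:
+Role profiler control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In each role's ``profiler`` field, you can control the collection mode for that role.
+
+-  enable: Whether to enable profiling for this role.
+-  all_ranks: Whether to collect data from all ranks.
+-  ranks: A list of ranks to collect data from. If empty, no data is collected.
+-  tool_config: Configuration for the profiling tool used by this role.
+
+Use parameters in each role's ``profiler.tool_config.npu`` to control the specific collection behavior:
 
 -  level: Collection level. Options are level_none, level0, level1, and level2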
 
@@ -46,16 +56,7 @@ Last updated: 07/24/2025.
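    -  stack: Whether to record operator call stack information.
 
 -  analysis: Enables automatic data parsing.
-
-Role profile control
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In each role's ``profile`` field, you can control the collection mode for that role.
-
--  enable: Whether to enable profiling for this role.
--  all_ranks: Whether to collect data from all ranks.
--  ranks: A list of ranks to collect data from. If empty, no data is collected.
--  tool_config: Configuration for the profiling tool used by this role.
+-  discrete: Whether to use discrete mode.
 
 Examples
 --------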
@@ -75,12 +76,14 @@ Last updated: 07/24/2025.
 
       global_profiler:
          steps: [1, 2, 5]
-         discrete: False
       actor_rollout_ref:
          actor:
-            profile:
+            profiler:
                enable: True
                all_ranks: True
+               tool_config:
+                  npu:
+                     discrete: False
         # rollout & ref follow actor settings
 
 
@@ -90,7 +93,16 @@ Last updated: 07/24/2025.
 .. code:: yaml
 
       global_profiler:
-         discrete: True
+         steps: [1, 2, 5]
+      actor_rollout_ref:
+         actor:
+            profiler:
+               enable: True
+               all_ranks: True
+               tool_config:
+                  npu:
+                     discrete: True
+        # rollout & ref follow actor settings
 
 
 Visualization
 
@@ -16,7 +16,7 @@ python3 -m verl.trainer.main_ppo \
     algorithm.adv_estimator=grpo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
-    data.train_batch_size=1024 \
+    data.train_batch_size=32 \
     data.max_prompt_length=1024 \
     data.max_response_length=1024 \
     data.filter_overlong_prompts=True \
@@ -25,8 +25,8 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.model.use_remove_padding=False \
     actor_rollout_ref.actor.optim.lr=5e-8 \
-    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
-    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.actor.use_kl_loss=True \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
@@ -36,30 +36,30 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.profiler.enable=True \
     actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
     actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
-    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
+    actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
+    actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
-    actor_rollout_ref.rollout.n=5 \
+    actor_rollout_ref.rollout.n=4 \
     actor_rollout_ref.rollout.enable_chunked_prefill=False \
-    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.ref.fsdp_config.param_offload=True \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger=console \
     trainer.project_name='verl_grpo_example_gsm8k' \
     trainer.experiment_name='qwen2_5_7b_function_rm' \
-    trainer.n_gpus_per_node=16 \
+    trainer.n_gpus_per_node=8 \
     trainer.nnodes=1 \
     trainer.save_freq=-1 \
     trainer.test_freq=5 \
     trainer.total_epochs=5 \
     trainer.device=npu \
     global_profiler.tool=npu \
     global_profiler.steps=$PROFILE_STEPS \
-    global_profiler.save_path=$SAVE_PATH \
-    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
-    global_profiler.global_tool_config.npu.contents=$CONTENTS \
-    global_profiler.global_tool_config.npu.level=$LEVEL \
-    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
+    global_profiler.save_path=$SAVE_PATH \
     $@
@@ -15,7 +15,7 @@ python3 -m verl.trainer.main_ppo \
     algorithm.adv_estimator=grpo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
-    data.train_batch_size=1024 \
+    data.train_batch_size=32 \
     data.max_prompt_length=1024 \
     data.max_response_length=1024 \
     data.filter_overlong_prompts=True \
@@ -24,41 +24,40 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.optim.lr=5e-8 \
     actor_rollout_ref.model.use_remove_padding=False \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
-    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
-    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.actor.use_kl_loss=True \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.profiler.enable=True \
-    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
     actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE \
+    actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS \
+    actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL \
+    actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS \
     actor_rollout_ref.actor.fsdp_config.param_offload=False \
     actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
-    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
-    actor_rollout_ref.rollout.n=5 \
+    actor_rollout_ref.rollout.n=4 \
     actor_rollout_ref.rollout.enable_chunked_prefill=False \
-    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.ref.fsdp_config.param_offload=True \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger=console \
     trainer.project_name='verl_grpo_example_gsm8k' \
     trainer.experiment_name='qwen2_5_7b_function_rm' \
-    trainer.n_gpus_per_node=16 \
+    trainer.n_gpus_per_node=8 \
     trainer.nnodes=1 \
     trainer.save_freq=-1 \
     trainer.test_freq=5 \
     trainer.total_epochs=5 \
     trainer.device=npu \
     global_profiler.tool=npu \
     global_profiler.steps=$PROFILE_STEPS \
-    global_profiler.save_path=$SAVE_PATH \
-    global_profiler.global_tool_config.npu.discrete=$DISCRETE \
-    global_profiler.global_tool_config.npu.contents=$CONTENTS \
-    global_profiler.global_tool_config.npu.level=$LEVEL \
-    global_profiler.global_tool_config.npu.analysis=$ANALYSIS
+    global_profiler.save_path=$SAVE_PATH \
     $@
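
Note on the script hunks above: both scripts now pass the NPU tool options as per-role overrides of the form ``actor_rollout_ref.actor.profiler.tool_config.npu.<option>=<value>``. The sketch below builds such override strings for a given role. ``npu_profiler_overrides`` is a hypothetical helper and is not part of verl. That the ``rollout`` and ``ref`` roles accept the same keys is an assumption based on the "rollout & ref follow actor settings" comment in the docs, and the option values used here are placeholders.

```python
def npu_profiler_overrides(role, **npu_options):
    """Hypothetical helper (not a verl API): format per-role NPU profiler
    options as command-line overrides in the style used by the scripts."""
    prefix = f"actor_rollout_ref.{role}.profiler.tool_config.npu"
    # Keyword arguments keep their insertion order, so the output order
    # matches the order in which the options were passed.
    return [f"{prefix}.{key}={value}" for key, value in npu_options.items()]


# Placeholder values; the scripts read them from $DISCRETE and $LEVEL.
overrides = npu_profiler_overrides("actor", discrete=True, level="level1")
for item in overrides:
    print(item)
```

Each printed line corresponds to one override argument on the ``python3 -m verl.trainer.main_ppo`` command line in the scripts.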