
Conversation

PopSoda2002 (Contributor) commented Jul 27, 2025

What does this PR do?

From issue #2677.

Pad the `prompt`, `response` & `mask` per sample before batch post-processing, to save time.
Main idea (diagram of the padding layout omitted):

```python
# prompt_ids: left padded with zeros (e.g., [0,0,0,0,1,2,3,4])
# response_ids: right padded with zeros (e.g., [5,...,15,0,0,0,0])
# input_ids: concatenation of prompt + response
# Masks, for a prompt [1,2,3,4] and an 11-token response
# [5,6,7,8,9,10,11,(tool start)12,13(tool end),14,15]:
# - prompt_attention_mask: 0s for padding, 1s for real tokens
#   e.g., [0,0,0,0,1,1,1,1]
# - response_attention_mask: 0s for padding, 1s for real tokens
#   e.g., [1,1,1,1,1,1,1,1,1,1,1,0,0,0,0]
# - attention_mask: concatenation of prompt_attention_mask and response_attention_mask
#   e.g., [0,0,0,0,1,1,1,1(prompt),1,1,1,1,1,1,1,1,1,1,1,0,0,0,0(response)]
# - response_mask: 1s for LLM-generated tokens, 0s for tool-response/padding tokens
#   e.g., [1,1,1,1,1,1,1,(tool start)0,0(tool end),1,1,0,0,0,0]
# - position_ids: sequential positions for real tokens, starting at 0; 0 on padding
#   e.g., [0,0,0,0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0,0,0,0]
```
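
To make the layout concrete, here is a minimal runnable sketch of the scheme above (not the verl implementation; `pad_id = 0` and the fixed max lengths are assumptions for illustration):

```python
import torch

# Build the padded tensors for the single worked example above.
pad_id, max_prompt, max_response = 0, 8, 15

prompt = [1, 2, 3, 4]
response = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]  # 7 LLM + 2 tool + 2 LLM tokens
llm_mask = [1] * 7 + [0] * 2 + [1] * 2              # 0 over the tool-response tokens

n_pad_p = max_prompt - len(prompt)
n_pad_r = max_response - len(response)

prompt_ids = torch.tensor([pad_id] * n_pad_p + prompt)      # left padded
response_ids = torch.tensor(response + [pad_id] * n_pad_r)  # right padded
input_ids = torch.cat([prompt_ids, response_ids])

prompt_attention_mask = torch.tensor([0] * n_pad_p + [1] * len(prompt))
response_attention_mask = torch.tensor([1] * len(response) + [0] * n_pad_r)
attention_mask = torch.cat([prompt_attention_mask, response_attention_mask])

response_mask = torch.tensor(llm_mask + [0] * n_pad_r)

# position_ids: running position over real tokens, 0 on padding positions
position_ids = (attention_mask.cumsum(0) - 1).clamp(min=0) * attention_mask
```

Because every sample then has the same fixed length, batching them later is a cheap `torch.stack`, which is the time saving this PR targets.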

Test

Environment setup: follow this [tutorial](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/tool_examples/agent_loop.md).
Test config on 4×H100:

```bash
#!/bin/bash
# run on 4xH100 with optimizations for stability
# make sure your current working directory is the root of the project

set -x

ulimit -n 65535

# add environment variables for network stability
export CUDA_HOME=/usr/local/cuda
export CUDA_VISIBLE_DEVICES=4,5,6,7

PROJECT_DIR="$(pwd)"
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"

python3 -m verl.trainer.main_ppo \
    --config-path="$CONFIG_PATH" \
    --config-name='gsm8k_multiturn_grpo' \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=256 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=True \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='gsm8k_async_rl' \
    trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-sgl-multi-w-tool-verify-n16-agent-loop-v1' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=20 \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.update_weights_bucket_megabytes=128 \
    actor_rollout_ref.rollout.trace.backend=weave \
    actor_rollout_ref.rollout.trace.token2text=True \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.multi_turn.enable=true
```

Before (v1) & After (v2)

(screenshots comparing the v1 and v2 runs omitted)

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`

CLAassistant commented Jul 27, 2025

CLA assistant check
All committers have signed the CLA.

gemini-code-assist bot left a comment:

Code Review

This pull request introduces padding before batch post-processing in the agent loop to improve performance. It also addresses potential serialization issues by converting numpy int64 values to standard Python integers.

PopSoda2002 changed the title from "[Feat.][perf] Padding before batch post-process in agent-loop" to "[perf] feat: Padding before batch post-process in agent-loop to save time" on Jul 27, 2025
"""Number of chat turns, including user, assistant, tool."""
metrics: AgentLoopMetrics
"""Auxiliary performance metrics"""
processed_tensors: dict = None
Collaborator: We can save padded ids directly in `prompt_ids`, `response_ids`, `response_mask`.

PopSoda2002 (author): But what about tensors like `attention_mask`?

PopSoda2002 (author): My fault, they can all be computed from those three. Should we move the calculation into the post-process?

Collaborator: Yes, we should move the pad logic for `prompt_ids`, `response_ids`, `response_mask` from `_postprocess` to `_run_agent_loop`.

PopSoda2002 (author): Thx! I will do this.

PopSoda2002 (author): Done.
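
A hypothetical sketch of what that suggestion implies (the helper name and length parameters are illustrative, not verl's actual code): each `_run_agent_loop` call pads its own sample, so `_postprocess` only has to stack already-padded tensors.

```python
import torch
import torch.nn.functional as F

def pad_single_sample(prompt_ids, response_ids, response_mask,
                      max_prompt_len, max_response_len, pad_id=0):
    """Pad one sample: prompt left padded, response and mask right padded."""
    p, r, m = map(torch.tensor, (prompt_ids, response_ids, response_mask))
    p = F.pad(p, (max_prompt_len - p.numel(), 0), value=pad_id)    # left pad
    r = F.pad(r, (0, max_response_len - r.numel()), value=pad_id)  # right pad
    m = F.pad(m, (0, max_response_len - m.numel()), value=0)
    return p, r, m

# _postprocess can then batch with a plain stack, e.g.:
# prompt_batch = torch.stack([o.prompt_ids for o in outputs], dim=0)
```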

The next thread is on this excerpt:

```python
attention_mask = torch.cat([prompt_output["attention_mask"], response_output["attention_mask"]], dim=1)

# Overwrite with padded data, converted to lists for safe serialization.
output.prompt_ids = prompt_output["input_ids"].squeeze(0).tolist()
```
Collaborator: We can keep the tensor as it is.

Collaborator: `_run_agent_loop` can return another class instead of `AgentLoopOutput`.

PopSoda2002 (author): Fixed! Thx!
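
One plausible shape for such an internal return type, sketched with illustrative names (the actual class in verl may differ):

```python
from dataclasses import dataclass
import torch

@dataclass
class PaddedAgentLoopOutput:
    """Internal result of _run_agent_loop carrying already-padded tensors.

    Illustrative only; the public AgentLoopOutput keeps its existing fields.
    """
    prompt_ids: torch.Tensor     # (max_prompt_len,), left padded
    response_ids: torch.Tensor   # (max_response_len,), right padded
    response_mask: torch.Tensor  # (max_response_len,), 0 on tool/pad tokens
    num_turns: int
    metrics: "AgentLoopMetrics"  # reuse the existing metrics type
```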

wuxibin89 (Collaborator): LGTM, let's merge it after CI passes.

zhaochenyang20 (Collaborator): Great job!

wuxibin89 merged commit c3df0b5 into volcengine:main on Jul 30, 2025
50 of 53 checks passed
GuanxingLu: Great job!

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request on Jul 31, 2025.
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request on Aug 7, 2025.
whatadayG pushed a commit to whatadayG/verl that referenced this pull request on Sep 5, 2025.