
Conversation

@chenhaiq (Collaborator) commented on Aug 8, 2025

What does this PR do?

Move the parameter offloading step to before the inference engine is woken up, reducing peak GPU memory usage.

Changed the vLLM and SGLang sharding managers for FSDP. Megatron is left unchanged because a similar change there may result in an illegal memory access.

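At a high level, the enter path of the FSDP sharding manager now gathers the state dict, offloads the FSDP parameters to CPU, and only then wakes the inference engine. A minimal sketch of the reordered flow, assuming the helper names that appear in the review diff below (offload_fsdp_model_to_cpu, log_gpu_memory_usage, self.module, self.offload_param); the wake-up and weight-sync calls are illustrative placeholders, not the exact verl API:

```python
# Sketch of the reordered enter path (illustrative, not the exact verl code).
def __enter__(self):
    log_gpu_memory_usage("Before state_dict() in sharding manager memory", logger=logger)
    params = self.module.state_dict()  # gather FSDP weights while the engine is still asleep
    log_gpu_memory_usage("After state_dict() in sharding manager memory", logger=logger)

    # NEW: offload FSDP parameters to CPU *before* waking the inference engine,
    # so training parameters and inference weights do not peak on the GPU together.
    if self.offload_param:
        offload_fsdp_model_to_cpu(self.module)
    log_gpu_memory_usage("After offload_param in sharding manager memory", logger=logger)

    self.wake_up_inference_engine()  # placeholder for resuming vLLM/SGLang memory
    self.sync_model_weights(params)  # placeholder for copying weights into the engine
```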

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

The logs below show that this change reduces peak GPU memory.

sglang

Now:
(WorkerDict pid=12436) [2025-08-08 07:16:19] Before state_dict() in sharding manager memory, memory allocated (GB): 0.02, memory reserved (GB): 0.87, device memory used/total (GB): 5.50/79.15
(WorkerDict pid=12436) [2025-08-08 07:16:19] After state_dict() in sharding manager memory, memory allocated (GB): 1.57, memory reserved (GB): 3.44, device memory used/total (GB): 8.07/79.15
(WorkerDict pid=12436) [2025-08-08 07:16:19] After offload_param in sharding manager memory, memory allocated (GB): 0.85, memory reserved (GB): 2.52, device memory used/total (GB): 7.14/79.15 <---
(WorkerDict pid=12436) [2025-08-08 07:16:19] Before resume SGLang weights + kv_cache in sharding manager, memory allocated (GB): 0.85, memory reserved (GB): 2.52, device memory used/total (GB): 53.72/79.15. <--

Before:
(WorkerDict pid=31787) [2025-08-08 07:31:42] Before state_dict() in sharding manager memory, memory allocated (GB): 0.02, memory reserved (GB): 0.87, device memory used/total (GB): 52.06/79.15
(WorkerDict pid=31787) [2025-08-08 07:31:42] After state_dict() in sharding manager memory, memory allocated (GB): 1.57, memory reserved (GB): 3.44, device memory used/total (GB): 54.63/79.15. <---
(WorkerDict pid=31787) [2025-08-08 07:31:43] After sync model weights in sharding manager, memory allocated (GB): 2.44, memory reserved (GB): 3.44, device memory used/total (GB): 54.61/79.15
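
Reading the two traces together: offload_param frees 8.07 − 7.14 ≈ 0.9 GB of device memory before SGLang resumes its weights and KV cache, whereas in the old flow those parameters were still resident (54.63 GB used) when the engine re-acquired memory.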

vllm

Now:

(WorkerDict pid=87197) DEBUG:2025-08-07 11:37:15,191:Before state_dict() in sharding manager memory, memory allocated (GB): 45.21, memory reserved (GB): 45.33, device memory used/total (GB): 2.74/79.15
(WorkerDict pid=87197) DEBUG:2025-08-07 11:37:15,472:After state_dict() in sharding manager memory, memory allocated (GB): 46.06, memory reserved (GB): 47.72, device memory used/total (GB): 5.14/79.15
(WorkerDict pid=87197) DEBUG:2025-08-07 11:37:15,637:After sync model weights in sharding manager, memory allocated (GB): 46.06, memory reserved (GB): 47.72, device memory used/total (GB): 5.92/79.15 <--
(WorkerDict pid=87197) DEBUG:2025-08-07 11:37:15,791:After del state_dict and empty_cache in sharding manager, memory allocated (GB): 45.21, memory reserved (GB): 45.33, device memory used/total (GB): 47.94/79.15

Before:
(WorkerDict pid=104544) DEBUG:2025-08-07 11:41:46,431:Before state_dict() in sharding manager memory, memory allocated (GB): 45.21, memory reserved (GB): 45.33, device memory used/total (GB): 2.74/79.15
(WorkerDict pid=104544) DEBUG:2025-08-07 11:41:46,628:After state_dict() in sharding manager memory, memory allocated (GB): 46.78, memory reserved (GB): 48.76, device memory used/total (GB): 6.17/79.15 <--
(WorkerDict pid=104544) DEBUG:2025-08-07 11:41:46,790:After sync model weights in sharding manager, memory allocated (GB): 46.78, memory reserved (GB): 48.76, device memory used/total (GB): 6.96/79.15
(WorkerDict pid=104544) DEBUG:2025-08-07 11:41:47,073:After del state_dict and empty_cache in sharding manager, memory allocated (GB): 45.21, memory reserved (GB): 45.33, device memory used/total (GB): 47.94/79.15
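
For vLLM, the "After state_dict()" log already includes the offload in the new flow (hence the review comment below), and device memory at that point drops from 6.17 GB to 5.14 GB, again roughly 1 GB.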

API and Usage Example

No public API changes: the reordering is internal to the FSDP sharding managers for vLLM and SGLang.
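
For context, a sketch of how such a sharding manager typically wraps rollout; this is a hedged illustration, not the exact verl call sites (rollout_sharding_manager, rollout, and generate_sequences are assumed names):

```python
# Hypothetical usage: the sharding manager is a context manager around rollout.
with rollout_sharding_manager:  # enter: state_dict -> offload -> wake engine -> sync weights
    outputs = rollout.generate_sequences(prompts)
# exit: inference engine memory is released and FSDP parameters are restored for training
```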

Design & Code Changes

The enter path of the FSDP sharding managers in fsdp_sglang.py and fsdp_vllm.py now offloads FSDP parameters to CPU immediately after state_dict() and before the inference engine is woken up and the weights are synced. The Megatron path is unchanged.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors the parameter handling logic in fsdp_sglang.py and fsdp_vllm.py to reduce peak GPU memory usage. The change involves offloading the FSDP model's parameters to the CPU before waking up the inference engine. This is a sensible optimization that should prevent having both the training model parameters and inference model parameters on the GPU simultaneously. The implementation looks correct and consistent across both modified files. I have one suggestion for fsdp_vllm.py to improve the clarity of a log message, which will aid in future debugging of memory-related issues.

Comment on lines +209 to 211
```python
if self.offload_param:
    offload_fsdp_model_to_cpu(self.module)
log_gpu_memory_usage("After state_dict() in sharding manager memory", logger=logger)
```

Severity: high

The log message "After state_dict()" on line 211 is misleading: it is now emitted after the parameters may already have been offloaded to the CPU, which can cause confusion when debugging memory usage. Renaming it to "After offload_param..." makes it accurate and consistent with the new log message added in fsdp_sglang.py.

For even better diagnostics, you could consider having separate log points for after state_dict and after offload_param, similar to the implementation in fsdp_sglang.py.

Suggested change
```diff
 if self.offload_param:
     offload_fsdp_model_to_cpu(self.module)
-log_gpu_memory_usage("After state_dict() in sharding manager memory", logger=logger)
+log_gpu_memory_usage("After offload_param in sharding manager memory", logger=logger)
```

@vermouth1992 merged commit da7fc8e into volcengine:main on Aug 8, 2025
34 of 38 checks passed
@hebiao064 (Collaborator) commented

Sorry, just noticed it; the PR LGTM!

Do you have an idea of why it failed for Megatron?

@hebiao064 (Collaborator) commented

And what's your setup? It shows that it only saved <1 GB; I wonder how many GPUs and what model you are using.

@chenhaiq (Collaborator, Author) commented on Aug 11, 2025

> And what's your setup? It shows that it only saved <1 GB; I wonder how many GPUs and what model you are using.

Qwen2.5-7B-Instruct on 8 A800 GPUs.

@chenhaiq (Collaborator, Author) commented

> Sorry, just noticed it; the PR LGTM!
>
> Do you have an idea of why it failed for Megatron?

The failing test is an FSDP-only test case. It has no relation to rollout (including vLLM and SGLang), so my change does not affect it.
