[rollout] feat: Support Multi-stage Awake for SGLang (verl-project#1911)

hebiao064 · zhaochenyang20 · web-flow · commit fa02416f7f64 · 2025-06-23T14:03:35.000-07:00
Co-authored with: MrAta (immrata@gmail.com)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

### Motivation

In RL ecosystems that use a colocated design, such as [verl](https://github.com/volcengine/verl/tree/main), we need to offload the training model and load the serving model and KV cache frequently.

#### Background

- Currently SGLang uses [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to pause and resume.
- [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) is an open-source repo that provides an easy-to-use API to hook **cudaMalloc** and **cudaFree** so that virtual addresses stay consistent after pause and resume, which is critical for CUDA Graph to work.
- CUDA Graph is critical for making SGLang run faster in the decoding phase.

#### Here is the current behavior of verl + SGLang

![Image](https://github.com/user-attachments/assets/e87e7dd6-f223-4de6-8f07-915eb2030ea8)

1. During training, the training model and optimizer state live in GPU memory. Once training is done, we offload the optimizer state to CPU and keep the model weights on GPU, because they are needed for the weight update.
2. During the weight update, we awake the SGLang engine, so the paused memory for model weights and KV cache comes back. Then we update the serving model from the training model on the fly using the API `update_weights_from_tensor`.
3. After the model has been updated, we delete the training model from GPU memory.

The above design works well so far. However, it wastes a big chunk of GPU memory during rollout, which has caused a few issues:

- **Small KV Cache**: We need to use a relatively low memory fraction (e.g. 0.6), so the KV cache holds fewer tokens. As a result, we hit `RuntimeError: Prefill out of memory. Try to lower your batch size.` when we try to prefill a large number of requests.
- **Out of Memory**: If we use a memory fraction of 0.8 and run RL for a 32B model on 8 H100s, it OOMs during the weight update.

#### Challenge

- `torch_memory_saver` currently only supports a singleton, so SGLang pauses and resumes the KV cache and weights together. They are treated as the same group of memory controlled by the singleton `torch_memory_saver` instance.

#### Proposal

![Image](https://github.com/user-attachments/assets/7fda9638-0dc2-4c14-bc64-cd20616f350f)

1. During training, we do the same as before.
2. During weight update stage 1, we awake the model weights in SGLang and then update the weights.
3. During weight update stage 2, we delete the training model weights from GPU memory.
4. We awake SGLang's KV cache.

![Image](https://github.com/user-attachments/assets/f3dab327-dc2e-4ed8-88d7-15e383f77d25)

### Benefit

With this feature, we can train larger models on the same GPUs. We can also make training and rollout more efficient, because we can allocate a larger KV cache.

### Solution: Keep using the singleton and provide tag-based pause/resume

- [x] Support tag-based resume/pause: fzyzcjy/torch_memory_saver#20
- [x] Support Multiple Stage Awake in SGLang: sgl-project/sglang#7099
- [ ] Support Multiple Stage Awake in verl: verl-project#1911

### Test

![Screenshot 2025-06-19 at 12 16 19 PM](https://github.com/user-attachments/assets/a95dd57e-43e1-4f28-8a84-003ec5c043fc)

![Screenshot 2025-06-19 at 12 13 14 PM](https://github.com/user-attachments/assets/f1f4a8a8-1845-4fad-9424-5526d4154dd0)

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Co-authored-by: Chayenne <zhaochen20@outlook.com>
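
The memory argument in the Motivation and Proposal sections can be checked with a small accounting sketch. It is a toy model only: `GpuMemory`, `peak_during_wake_up`, and the sizes in GB are hypothetical and are not measurements from this PR. It compares the peak footprint when weights and KV cache are resumed together against the proposed order (resume weights, sync, free the training copy, then resume the KV cache).

```python
from dataclasses import dataclass, field


@dataclass
class GpuMemory:
    """Toy tracker of named allocations (GB) and the peak total seen so far."""

    allocations: dict = field(default_factory=dict)
    peak: float = 0.0

    def alloc(self, name, size_gb):
        self.allocations[name] = size_gb
        self.peak = max(self.peak, sum(self.allocations.values()))

    def free(self, name):
        self.allocations.pop(name, None)


def peak_during_wake_up(train_weights, serve_weights, kv_cache, multi_stage):
    """Return the peak memory (GB) over one training-to-rollout transition."""
    gpu = GpuMemory()
    # Training weights stay on the GPU because the weight update needs them.
    gpu.alloc("train_weights", train_weights)
    if multi_stage:
        gpu.alloc("serve_weights", serve_weights)  # stage 1: resume weights only
        gpu.free("train_weights")                  # after the weight sync
        gpu.alloc("kv_cache", kv_cache)            # stage 2: resume the KV cache
    else:
        gpu.alloc("serve_weights", serve_weights)  # weights and KV cache
        gpu.alloc("kv_cache", kv_cache)            # come back together
        gpu.free("train_weights")
    return gpu.peak


# Hypothetical sizes in GB, chosen only to make the difference visible.
single = peak_during_wake_up(16, 16, 40, multi_stage=False)
staged = peak_during_wake_up(16, 16, 40, multi_stage=True)
```

In this toy model the single-stage peak includes the training weights, serving weights, and KV cache at once, while the staged peak never holds the training weights and the KV cache at the same time. That gap is the headroom that can go to a larger KV cache.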
diff --git a/verl/trainer/config/ppo_trainer.yaml b/verl/trainer/config/ppo_trainer.yaml
@@ -465,6 +465,9 @@ actor_rollout_ref:
     # number of responses (i.e. num sample times). > 1 for grpo
     n: 1
 
+    # Whether to wake up inference engine in multi-stage. (Wake up model weights first, then resume kv cache)
+    multi_stage_wake_up: false
+
     # Extra inference engine arguments (vllm, sglang).
     engine_kwargs:
 
diff --git a/verl/workers/fsdp_workers.py b/verl/workers/fsdp_workers.py
@@ -484,6 +484,7 @@ def _build_rollout(self, trust_remote_code=False):
                 full_params="hf" in self.config.rollout.load_format,
                 device_mesh=rollout_device_mesh,
                 offload_param=self._is_offload_param,
+                multi_stage_wake_up=self.config.rollout.multi_stage_wake_up,
             )
             log_gpu_memory_usage("After building sharding manager", logger=logger)
 
diff --git a/verl/workers/rollout/sglang_rollout/sglang_rollout.py b/verl/workers/rollout/sglang_rollout/sglang_rollout.py
@@ -132,21 +132,27 @@ def __init__(self, **kwargs):
         # default to use dummy load format, which need to reload weights in first time
         self._need_reload = True
 
-    async def release_memory_occupation(self):
+    async def release_memory_occupation(self, tags: Optional[list[str]] = None):
         """Release GPU occupation temporarily."""
-        obj = ReleaseMemoryOccupationReqInput()
+        if tags is None:
+            obj = ReleaseMemoryOccupationReqInput()
+        else:
+            obj = ReleaseMemoryOccupationReqInput(tags=tags)
         return await self.tokenizer_manager.release_memory_occupation(obj, None)
 
-    async def resume_memory_occupation(self):
+    async def resume_memory_occupation(self, tags: Optional[list[str]] = None):
         """Resume GPU occupation."""
-
         # because __init__ is a sync method, it can not call the async release_memory_occupation
         # have to move release_memory_occupation from __init__ to here
+        # For multi-stage awake, we run release weight and kv_cache when we resume weights for the first time.
         if self._need_reload:
             await self.release_memory_occupation()
             self._need_reload = False
 
-        obj = ResumeMemoryOccupationReqInput()
+        if tags is None:
+            obj = ResumeMemoryOccupationReqInput()
+        else:
+            obj = ResumeMemoryOccupationReqInput(tags=tags)
         return await self.tokenizer_manager.resume_memory_occupation(obj, None)
 
     async def update_weights_from_tensor(
diff --git a/verl/workers/sharding_manager/fsdp_sglang.py b/verl/workers/sharding_manager/fsdp_sglang.py
@@ -59,12 +59,14 @@ def __init__(
         full_params: bool = False,
         device_mesh: DeviceMesh = None,
         offload_param: bool = False,
+        multi_stage_wake_up: bool = False,
     ):
         self.module = module
         self.inference_engine = inference_engine
         self.model_config = model_config
         self.device_mesh = device_mesh
         self.offload_param = offload_param
+        self.multi_stage_wake_up = multi_stage_wake_up
 
         # Full params
         self.full_params = full_params
@@ -95,7 +97,17 @@ def __init__(
     def __enter__(self):
         self.timing = {}
         with simple_timer("reshard", self.timing):
+            loop = asyncio.get_event_loop()
+
+            if self.device_mesh["infer_tp"].get_local_rank() == 0:
+                if self.multi_stage_wake_up:
+                    loop.run_until_complete(self.inference_engine.resume_memory_occupation(tags=["weights"]))
+                    log_gpu_memory_usage("After resume SGLang weights in sharding manager", logger=logger)
+                else:
+                    loop.run_until_complete(self.inference_engine.resume_memory_occupation())
+                    log_gpu_memory_usage("After resume SGLang weights + kv_cache in sharding manager", logger=logger)
             get_torch_device().empty_cache()
+
             log_gpu_memory_usage("Before state_dict() in sharding manager memory", logger=logger)
             if self.offload_param:
                 load_fsdp_model_to_gpu(self.module)
@@ -105,7 +117,6 @@ def __enter__(self):
             params = {k: v.to(device, non_blocking=True) if fsdp_version(self.module) == 2 else v for k, v in params.items()}
             params = convert_weight_keys(params, getattr(self.module, "_fsdp_wrapped_module", self.module))
             # Copy, not share memory
-            loop = asyncio.get_event_loop()
             loop.run_until_complete(self.update_weights(params))
             log_gpu_memory_usage("After sync model weights in sharding manager", logger=logger)
 
@@ -115,6 +126,10 @@ def __enter__(self):
             get_torch_device().empty_cache()
             log_gpu_memory_usage("After del state_dict and empty_cache in sharding manager", logger=logger)
 
+            if self.multi_stage_wake_up:
+                loop.run_until_complete(self.inference_engine.resume_memory_occupation(tags=["kv_cache"]))
+                log_gpu_memory_usage("After resume SGLang kv_cache in sharding manager", logger=logger)
+
             # important: need to manually set the random states of each tp to be identical.
             if self.device_mesh is not None:
                 self.torch_random_states = get_torch_device().get_rng_state()
@@ -138,9 +153,6 @@ def __exit__(self, exc_type, exc_value, traceback):
             get_torch_device().set_rng_state(self.torch_random_states)
 
     async def update_weights(self, params):
-        if self.device_mesh["infer_tp"].get_local_rank() == 0:
-            await self.inference_engine.resume_memory_occupation()
-
         # Most naive implementation, can optimize a lot if it is bottleneck from sglang Engine weight update
         named_tensors = [(k, v) for k, v in params.items()]
         load_format = None
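
The call order introduced by this diff can be illustrated without SGLang or a GPU. The sketch below is an assumption-laden stand-in, not the real engine: `FakeEngine` and `enter_sharding_manager` are hypothetical names, and treating `tags=None` as "all regions" is an assumption based on the `weights + kv_cache` log message in the non-staged branch. It mirrors the `tags` handling, the first-call release guarded by `_need_reload`, and the two `run_until_complete` calls in `__enter__`.

```python
import asyncio
from typing import Optional


class FakeEngine:
    """Hypothetical stand-in that only records which memory regions are resident."""

    ALL_TAGS = ("weights", "kv_cache")

    def __init__(self):
        self.resident = set(self.ALL_TAGS)
        self._need_reload = True
        self.calls = []

    async def release_memory_occupation(self, tags: Optional[list] = None):
        # Assumption: no tags means every region is released.
        targets = self.ALL_TAGS if tags is None else tuple(tags)
        self.resident -= set(targets)
        self.calls.append(("release", targets))

    async def resume_memory_occupation(self, tags: Optional[list] = None):
        # As in the diff: the first resume releases everything once.
        if self._need_reload:
            await self.release_memory_occupation()
            self._need_reload = False
        targets = self.ALL_TAGS if tags is None else tuple(tags)
        self.resident |= set(targets)
        self.calls.append(("resume", targets))


def enter_sharding_manager(engine, multi_stage_wake_up):
    """Drive the async engine from sync code, as __enter__ does in the diff."""
    loop = asyncio.new_event_loop()
    try:
        first_tags = ["weights"] if multi_stage_wake_up else None
        loop.run_until_complete(engine.resume_memory_occupation(tags=first_tags))
        # ... sync weights into the engine and free the training copy here ...
        if multi_stage_wake_up:
            loop.run_until_complete(engine.resume_memory_occupation(tags=["kv_cache"]))
    finally:
        loop.close()
    return engine.calls


staged_calls = enter_sharding_manager(FakeEngine(), multi_stage_wake_up=True)
single_calls = enter_sharding_manager(FakeEngine(), multi_stage_wake_up=False)
```

With `multi_stage_wake_up=True` the recorded sequence is one full release followed by two tagged resumes; with `False` it is one full release followed by a single untagged resume.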
