Commit fa02416
[rollout] feat: Support Multi-stage Awake for SGLang (verl-project#1911)
Co-authored with: MrAta (immrata@gmail.com)
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
### Motivation
In RL Ecosystem which use colocate design like
[verl](https://github.com/volcengine/verl/tree/main), we need to offload
training model and load serving model & KV Cache frequently.
#### Background
- Currently SGLang is using
[torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to
pause and resume.
- [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) is
a open source repo that provided easy to use api to hack **cudaMalloc**
and **cudaFree** to make sure the virtual address could be consistent
after pause and resume, which is critical to ensure CUDA Graph work.
- CUDA Graph is critical to make sure SGLang runs faster in decoding
phases.
#### Here is the current behavior of VERL + SGLang

1. During Training, we have training model and optimizer state in the
GPU Memory, and once training is done, we will offload optimizer state
to cpu and keep the model weights in GPU, which is needed in Update
Weight.
2. During Update Weight, we awake the SGLang engine, so those paused
memory of Model Weights and KV Cache will come back. Then we update
model from training model to serving model on the fly using the api:
`update_weights_in_tensor`
3. After Model being updated, we delete the training model from GPU
Memory.
Above design works pretty well so far, however, this would waste a big
chunk of GPU Memory during rollout, which could cause a few issues we've
seen so far:
- **Small KV Cache**: We need to use relative lower number of mem
fraction ratio (e.g: 0.6), hence our KV Cache has less tokens. Given KV
Cache has less tokens, we will hit `RuntimeError: Prefill out of memory.
Try to lower your batch size.` when we try prefill large number of
requests.
- **Out of Memory**: If we use mem fraction ratio 0.8 and run RL for 32B
model on 8 H100, it will OOM during update weight
#### Challenge
- `torch_memory_saver` currently only supports Singleton, hence SGLang
will pause and resume KV Cache + Weights together, they are treated as
the same group of memory controlled by the singleton
`torch_memory_saver` instance
#### Proposal

1. During Training, we do the same
2. During Update Weight Stage 1, we awake the model weights from SGLang
and then update weights
3. During Update Weight Stage 2, we delete the training model weights
from GPU Memory
4. Awake the SGLang's KV Cache

### Benefit
With above feature, we can train larger model with same GPU, we can also
make training/rollout more efficient given we can allocate larger KV
Cache
### Solution: Keep using Singleton and provide tag based pause/resume
- [x] Support tag based resume/pause:
fzyzcjy/torch_memory_saver#20
- [x] Support Multiple Stage Awake in SGLang:
sgl-project/sglang#7099
- [ ] Support Multiple Stage Awake in verl:
verl-project#1911
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```python
# Add code snippet or script demonstrating how to use this
```
### Test


### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]
### Checklist Before Submitting
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.
---------
Co-authored-by: Chayenne <zhaochen20@outlook.com>1 parent d084f2d commit fa02416
4 files changed
Lines changed: 31 additions & 9 deletions
File tree
- verl
- trainer/config
- workers
- rollout/sglang_rollout
- sharding_manager
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
465 | 465 | | |
466 | 466 | | |
467 | 467 | | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
468 | 471 | | |
469 | 472 | | |
470 | 473 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
484 | 484 | | |
485 | 485 | | |
486 | 486 | | |
| 487 | + | |
487 | 488 | | |
488 | 489 | | |
489 | 490 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
132 | 132 | | |
133 | 133 | | |
134 | 134 | | |
135 | | - | |
| 135 | + | |
136 | 136 | | |
137 | | - | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
138 | 141 | | |
139 | 142 | | |
140 | | - | |
| 143 | + | |
141 | 144 | | |
142 | | - | |
143 | 145 | | |
144 | 146 | | |
| 147 | + | |
145 | 148 | | |
146 | 149 | | |
147 | 150 | | |
148 | 151 | | |
149 | | - | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
150 | 156 | | |
151 | 157 | | |
152 | 158 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
| 62 | + | |
62 | 63 | | |
63 | 64 | | |
64 | 65 | | |
65 | 66 | | |
66 | 67 | | |
67 | 68 | | |
| 69 | + | |
68 | 70 | | |
69 | 71 | | |
70 | 72 | | |
| |||
95 | 97 | | |
96 | 98 | | |
97 | 99 | | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
98 | 109 | | |
| 110 | + | |
99 | 111 | | |
100 | 112 | | |
101 | 113 | | |
| |||
105 | 117 | | |
106 | 118 | | |
107 | 119 | | |
108 | | - | |
109 | 120 | | |
110 | 121 | | |
111 | 122 | | |
| |||
115 | 126 | | |
116 | 127 | | |
117 | 128 | | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
118 | 133 | | |
119 | 134 | | |
120 | 135 | | |
| |||
138 | 153 | | |
139 | 154 | | |
140 | 155 | | |
141 | | - | |
142 | | - | |
143 | | - | |
144 | 156 | | |
145 | 157 | | |
146 | 158 | | |
| |||
0 commit comments