Commit cd734a4
committed
[rollout] fix: sglang async fail with Multi-stage Awake feature (#2365)
### What does this PR do?
Fix a regression from verl-project/verl#1911,
because the PR did not change the sglang async branch.
CI did not catch this error because it only run 1 step, but this error
happen in the second test. So I update the testcases to run 2 steps.
To reproduce the bug, run test:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash
tests/special_e2e/ppo_trainer/run_function_reward.sh
It fail with:
```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on [192.168.111.48:40451](http://192.168.111.48:40451/)
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286) what(): CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/))
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/))
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```
### Checklist Before Starting
- [X] Search for similar PRs. Paste at least one query link here:
https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).1 parent be9c8cb commit cd734a4
File tree
4 files changed
+34
-96
lines changed- .github/workflows
- verl/workers
- sharding_manager
4 files changed
+34
-96
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
292 | 292 | | |
293 | 293 | | |
294 | 294 | | |
295 | | - | |
| 295 | + | |
296 | 296 | | |
297 | 297 | | |
298 | 298 | | |
299 | 299 | | |
300 | 300 | | |
301 | | - | |
| 301 | + | |
302 | 302 | | |
303 | 303 | | |
304 | 304 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | | - | |
| 22 | + | |
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| |||
692 | 692 | | |
693 | 693 | | |
694 | 694 | | |
| 695 | + | |
| 696 | + | |
| 697 | + | |
| 698 | + | |
| 699 | + | |
695 | 700 | | |
696 | 701 | | |
697 | 702 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
32 | 32 | | |
33 | 33 | | |
34 | 34 | | |
35 | | - | |
36 | 35 | | |
37 | 36 | | |
38 | 37 | | |
| |||
101 | 100 | | |
102 | 101 | | |
103 | 102 | | |
104 | | - | |
105 | | - | |
106 | 103 | | |
107 | | - | |
108 | | - | |
109 | | - | |
110 | | - | |
111 | | - | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | | - | |
130 | | - | |
131 | | - | |
132 | | - | |
133 | | - | |
134 | | - | |
135 | | - | |
136 | | - | |
137 | | - | |
138 | | - | |
139 | | - | |
140 | | - | |
141 | | - | |
142 | | - | |
143 | | - | |
144 | | - | |
| 104 | + | |
145 | 105 | | |
146 | 106 | | |
147 | 107 | | |
148 | | - | |
149 | | - | |
150 | | - | |
151 | | - | |
152 | | - | |
153 | | - | |
154 | | - | |
155 | | - | |
156 | | - | |
157 | | - | |
158 | | - | |
159 | | - | |
160 | | - | |
161 | | - | |
162 | | - | |
| 108 | + | |
| 109 | + | |
163 | 110 | | |
164 | 111 | | |
165 | 112 | | |
| |||
207 | 154 | | |
208 | 155 | | |
209 | 156 | | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
210 | 166 | | |
211 | 167 | | |
212 | 168 | | |
| |||
217 | 173 | | |
218 | 174 | | |
219 | 175 | | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
220 | 180 | | |
221 | 181 | | |
222 | 182 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
114 | 114 | | |
115 | 115 | | |
116 | 116 | | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | 117 | | |
130 | | - | |
131 | | - | |
132 | | - | |
133 | | - | |
134 | | - | |
135 | | - | |
136 | | - | |
137 | | - | |
| 118 | + | |
138 | 119 | | |
139 | 120 | | |
140 | 121 | | |
141 | | - | |
142 | | - | |
143 | | - | |
144 | | - | |
145 | | - | |
146 | | - | |
147 | | - | |
148 | | - | |
149 | | - | |
150 | | - | |
151 | | - | |
152 | | - | |
153 | | - | |
154 | | - | |
155 | | - | |
| 122 | + | |
| 123 | + | |
156 | 124 | | |
157 | 125 | | |
158 | 126 | | |
| |||
182 | 150 | | |
183 | 151 | | |
184 | 152 | | |
185 | | - | |
| 153 | + | |
186 | 154 | | |
| 155 | + | |
| 156 | + | |
187 | 157 | | |
188 | 158 | | |
189 | 159 | | |
| |||
195 | 165 | | |
196 | 166 | | |
197 | 167 | | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
198 | 171 | | |
199 | 172 | | |
200 | 173 | | |
201 | 174 | | |
202 | 175 | | |
203 | | - | |
| 176 | + | |
204 | 177 | | |
205 | 178 | | |
206 | 179 | | |
| |||
0 commit comments