[sglang] feat: add native sgl server #3090
Conversation
Code Review
This pull request introduces a native HTTP server for SGLang, which is a significant architectural improvement for flexibility and scalability. The implementation is comprehensive, including synchronous and asynchronous adapters, extensive test coverage, and example configurations. However, I've identified a few critical issues that need to be addressed. There's a configuration mismatch that will cause runtime errors, and the server launch logic for multi-node setups appears to be flawed, which would prevent it from working correctly in distributed environments. Additionally, there's a performance-related regression due to a hardcoded logging level.
```diff
 args = {
     "model_path": actor_module,
     "dtype": self.config.dtype,
     "mem_fraction_static": self.config.gpu_memory_utilization,
     "enable_memory_saver": True,
     "base_gpu_id": 0,
     "gpu_id_step": 1,
     "tp_size": self._tp_size,
     "node_rank": node_rank,
     "load_format": load_format,
     "dist_init_addr": dist_init_addr,
     "nnodes": nnodes,
     "trust_remote_code": trust_remote_code,
     # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
     # when random.seed is being set during training
-    port=30000 + rank,
-    # NOTE(Chenyang): turn on log_level to see the decoding speed of SGLang Engine
-    # log_level="INFO"
-    # NOTE(Chenyang): turn the following lines to see the input and output of each request
+    "port": 30000 + rank,
+    # NOTE(Chenyang): if you want to debug the SGLang engine output
+    # please set the following parameters
+    # Otherwise, it will make the engine run too slow
+    "log_level": "info",
+    # "log_level": "error",
     # log_requests=True,
     # log_requests_level=2,
     # NOTE(Chenyang): turn on max_running_requests to set the max concurrent running requests
     # max_running_requests=1,
-    mm_attention_backend="fa3",
-    attention_backend=attention_backend if attention_backend is not None else "fa3",
-    # In async mode for AgentLoop, SGLang support token in token out to avoid the tokenizer
-    # inconsistency issue.
-    skip_tokenizer_init=self.config.mode == "async",
-    **engine_kwargs,
-)
+    "mm_attention_backend": "fa3",
+    "attention_backend": attention_backend if attention_backend is not None else "fa3",
+    # In async mode, we want token in token out.
+    "skip_tokenizer_init": self.config.mode == "async",
+}
```
This refactoring has introduced a regression where engine_kwargs from the configuration are no longer passed to the SGLang engine constructor (AsyncEngine or AsyncHttpServerAdapter). This prevents overriding engine parameters like log_level. Additionally, log_level is now hardcoded to "info", which, as noted in the comments, can cause significant performance degradation and is not suitable for production. The engine_kwargs should be merged into the args dictionary, and the log_level should be configurable rather than hardcoded.
```python
args = {
    "model_path": actor_module,
    "dtype": self.config.dtype,
    "mem_fraction_static": self.config.gpu_memory_utilization,
    "enable_memory_saver": True,
    "base_gpu_id": 0,
    "gpu_id_step": 1,
    "tp_size": self._tp_size,
    "node_rank": node_rank,
    "load_format": load_format,
    "dist_init_addr": dist_init_addr,
    "nnodes": nnodes,
    "trust_remote_code": trust_remote_code,
    # NOTE(linjunrong): add rank to prevent SGLang generating the same port inside PortArgs.init_new
    # when random.seed is being set during training
    "port": 30000 + rank,
    # NOTE(Chenyang): if you want to debug the SGLang engine output,
    # set the following parameters; otherwise they will make the engine run too slow
    # "log_level": "error",
    # "log_requests": True,
    # "log_requests_level": 2,
    # NOTE(Chenyang): turn on max_running_requests to cap the concurrent running requests
    # "max_running_requests": 1,
    "mm_attention_backend": "fa3",
    "attention_backend": attention_backend if attention_backend is not None else "fa3",
    # In async mode, we want token in, token out.
    "skip_tokenizer_init": self.config.mode == "async",
    **engine_kwargs,
}
```
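For reference, the trailing `**engine_kwargs` works because later entries in a Python dict literal override earlier ones, so configuration values win over the hardcoded defaults. A minimal, self-contained illustration (the keys and values here are made up):

```python
# Made-up values, just to show the override semantics.
defaults = {"log_level": "error", "attention_backend": "fa3"}
engine_kwargs = {"log_level": "info"}  # e.g. supplied via the rollout config

# Later entries in a dict literal win, so engine_kwargs overrides defaults.
args = {**defaults, **engine_kwargs}
assert args == {"log_level": "info", "attention_backend": "fa3"}
```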
Review threads on the following files were marked resolved (outdated):
- tests/workers/rollout/rollout_sglang/test_http_server_engine.py
- examples/sglang_multiturn/config/gsm8k_multiturn_grpo_server.yaml
```
FAILED test_sglang_async_rollout_w_interaction.py::test_async_sglang_rollout_w_interaction - omegaconf.errors.ConfigAttributeError: Missing key sglang_engine_mode
    full_key: sglang_engine_mode
    object_type=dict
```

@ChangyiYang fix the CI plz.
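For context, this `ConfigAttributeError` is what omegaconf raises when a struct-mode config lacks a key. A minimal, hypothetical reproduction (the config contents here are assumed stand-ins; the real config lives in `verl/workers/config/rollout.py`):

```python
from omegaconf import OmegaConf

# Assumed, minimal stand-in for the rollout config.
cfg = OmegaConf.create({"rollout": {"name": "sglang"}})
OmegaConf.set_struct(cfg, True)  # hydra structured configs behave this way

try:
    _ = cfg.rollout.sglang_engine_mode  # key absent -> ConfigAttributeError
except Exception as err:
    print(type(err).__name__, err)
```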
@ChangyiYang rebase with main to solve this.

@zhaochenyang20 some CI failures should be fixed now. Can you rerun them?

Maybe we need to register it.
```bash
python3 -m verl.trainer.main_ppo \
    --config-path="$CONFIG_PATH" \
    --config-name='gsm8k_multiturn_grpo_server' \
```
Add a comment to say this would use the server.
Correct me if I am wrong.

Thanks @zhaochenyang20 for the detailed review and guidance! Thanks @SwordFaith for providing the base code and guidance along the way!
### What does this PR do?
**Summary**
This PR introduces a native HTTP server implementation for SGLang,
aiming to fundamentally improve flexibility, scalability, and
integration capabilities. By transitioning to a more robust
client-server architecture, this change addresses several core
bottlenecks in the current design.
**Key Changes**
* **Engine Replacement** – Replaced the original `sgl.Engine` instance
with a native HTTP server. ✅ **Completed**
* **Distributed Optimization** – Use a server-based architecture
to remove the requirement of gathering all data to TP rank 0. This
change resolves the previous `dist.barrier` timeout issue by replacing
the collective wait with per-sample synchronization. 🚧 **In Progress**
* **Router Integration** – Plan to integrate with the native SGLang
router for streamlined request handling. 💡 **Nice to have**
**Motivation**
The current `sgl.Engine` driver model presents several architectural
challenges, particularly in complex distributed environments. Moving to
an HTTP server architecture is motivated by the need to solve the
following critical issues:
1. **Eliminate Data Flow Bottlenecks and Improve Performance:**
* **Problem:** The data flow logic of the existing driver process is
misaligned with the training data flow. It requires all data for a
single SGLang instance to be gathered to TP rank 0. This data is then
processed by the tokenizer manager and sent via ZMQ to the various
schedulers. As a result, the `preprocess` and `postprocess` steps are
slower than expected.
* **Solution:** The HTTP server architecture decentralizes this process,
allowing each rank to handle requests independently. This removes the
"gather to rank 0" bottleneck, dramatically improving data throughput
and overall performance.
2. **Resolve CPU Resource Contention:**
* **Problem:** At the request level, the SGLang driver object cannot be
pickled for use in subprocesses. This limitation means that the
request-level asynchronous rollout logic and the engine itself are
forced to compete for the same CPU time slices, leading to performance
degradation.
* **Solution:** By decoupling the request handling (client) from the
inference engine (server), we isolate the processes, eliminating the CPU
contention and allowing for more efficient resource utilization (a
client sketch follows this list).
3. **Fix Distributed Synchronization Timeouts:**
* **Problem:** The `dist.barrier` timeout is a frequent issue where
worker ranks remain idle while waiting for TP rank 0 to complete its
intensive processing. This collective wait time creates inefficiency and
can lead to failures.
* **Solution:** The HTTP server model shifts this from a collective
barrier to a per-sample synchronization. Workers communicate with the
server as needed, removing the long wait times and making the
distributed setup more stable and efficient.
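To make the decoupling in point 2 concrete, here is a minimal, hypothetical rollout-client sketch: a worker posts to a local SGLang HTTP server instead of driving `sgl.Engine` in-process. The payload shape follows SGLang's native `/generate` API, but treat the field names and the port convention as assumptions rather than this PR's actual adapter code.

```python
import asyncio

import aiohttp

# Assumed convention from the engine args above: one server per rank.
SERVER_URL = "http://127.0.0.1:30000"


async def generate(prompt: str) -> dict:
    # The rollout client talks to the engine over HTTP instead of driving
    # sgl.Engine in-process, so request handling and inference no longer
    # compete for the same CPU time slices.
    async with aiohttp.ClientSession() as session:
        payload = {"text": prompt, "sampling_params": {"max_new_tokens": 64}}
        async with session.post(f"{SERVER_URL}/generate", json=payload) as resp:
            resp.raise_for_status()
            return await resp.json()


if __name__ == "__main__":
    print(asyncio.run(generate("What is 1 + 1?")))
```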
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
Hello, thank you for your work.
Hi @sangnekim. Yes, the naming is better changed your way. The dist barrier issue won't be pursued further, since verl will adopt the agent loop and all the old rollout functions will be deprecated.
@ChangyiYang
@lizipao Yes, the config has a typo. I will change that. For now you can change this line: `verl/verl/workers/config/rollout.py`, line 169 (commit 33edd95).

@lizipao I never encountered this error before.
### What does this PR do?
This is the first part to support vllm/sglang native http server in server mode rollout. In native http server mode, the inference services are launched separately from the training engine, and the model runner shares the GPU with the training engine but in different processes. We're going to support three deployment modes:
- **hybrid mode**: Training engine and model runner share GPU but in different processes. To sync weights, there's a server adapter in the training process, which is an HTTP client that sends wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources, a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter since there is no need to sync weights. This is mainly used for GRM service (LLM as a judge).

<img width="2644" height="1276" alt="image" src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7" />

Following PRs will be:
- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
- [5/N] switch to the `/generate` http api with token-in-token-out: currently sglang has a `/generate` api but may need some effort to support multi-modal, while vllm still lacks a `/generate` api
- [6/N] switch to the sglang/vllm router with better kv-cache-aware load balancing

The native http server is inspired by the design of [slime](https://github.com/THUDM/slime), thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 #3090 and @SuperCB #3102 for their prior contributions.
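For illustration, here is a minimal sketch of what the hybrid-mode server adapter described above could look like; the class name, routes, and payload fields are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Optional

import requests


class HttpServerAdapter:
    """Hypothetical hybrid-mode server adapter: an HTTP client in the
    training process that drives a colocated inference server."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def _post(self, route: str, payload: Optional[dict] = None) -> None:
        resp = requests.post(f"{self.base_url}/{route}", json=payload or {})
        resp.raise_for_status()

    def wake_up(self) -> None:
        # Bring model weights and KV cache back onto the GPU before rollout.
        self._post("wake_up")

    def sleep(self) -> None:
        # Release GPU memory back to the training engine after rollout.
        self._post("sleep")

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the server at freshly trained weights for on-policy rollout.
        self._post("update_weights", {"model_path": checkpoint_path})


# Usage (assumed port convention): one adapter per colocated server.
# adapter = HttpServerAdapter("http://127.0.0.1:30000")
# adapter.update_weights("/tmp/ckpt/step_100"); adapter.wake_up()
```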