
Conversation

@ChangyiYang (Contributor) commented Aug 17, 2025

What does this PR do?

Summary

This PR introduces a native HTTP server implementation for SGLang, aiming to fundamentally improve flexibility, scalability, and integration capabilities. By transitioning to a more robust client-server architecture, this change addresses several core bottlenecks in the current design.

Key Changes

  • Engine Replacement – Replaced the original sgl.Engine instance with a native HTTP server. ✅ Completed
  • Distributed Optimization – Utilizing a server-based architecture to remove the requirement of gathering all data to TP rank 0. This change resolves the previous dist.barrier timeout issue by replacing the collective wait with per-sample synchronization. 🚧 In Progress
  • Router Integration – Plan to integrate with the native SGLang router for streamlined request handling. 💡 Nice to have

Motivation

The current sgl.Engine driver model presents several architectural challenges, particularly in complex distributed environments. Moving to an HTTP server architecture is motivated by the need to solve the following critical issues:

  1. Eliminate Data Flow Bottlenecks and Improve Performance:

    • Problem: The data flow logic of the existing driver process is misaligned with the training data flow. It requires all data for a single SGLang instance to be gathered to TP rank 0. This data is then processed by the tokenizer manager and sent via ZMQ to the various schedulers. As a result, the preprocess and postprocess steps are slower than expected.
    • Solution: The HTTP server architecture decentralizes this process, allowing each rank to handle requests independently. This removes the "gather to rank 0" bottleneck, dramatically improving data throughput and overall performance.
  2. Resolve CPU Resource Contention:

    • Problem: At the request level, the SGLang driver object cannot be pickled for use in subprocesses. This limitation means that the request-level asynchronous rollout logic and the engine itself are forced to compete for the same CPU time slices, leading to performance degradation.
    • Solution: By decoupling the request handling (client) from the inference engine (server), we isolate the processes, eliminating the CPU contention and allowing for more efficient resource utilization.
  3. Fix Distributed Synchronization Timeouts:

    • Problem: The dist.barrier timeout is a frequent issue where worker ranks remain idle while waiting for TP rank 0 to complete its intensive processing. This collective wait time creates inefficiency and can lead to failures.
    • Solution: The HTTP server model shifts this from a collective barrier to a per-sample synchronization. Workers communicate with the server as needed, removing the long wait times and making the distributed setup more stable and efficient. (A sketch of this per-sample request flow follows below.)
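A minimal sketch (not this PR's actual code) of what per-sample synchronization looks like from a rollout worker's point of view: each sample becomes an independent HTTP request to the local SGLang server, so no rank has to gather the whole batch or sit in a collective barrier. The `/generate` endpoint and payload shape follow SGLang's native server API; the host, port, and sampling parameters here are illustrative assumptions.

```python
# Sketch: per-sample async requests to a local SGLang HTTP server.
# Host/port and sampling params are assumptions for illustration.
import asyncio
import aiohttp

SERVER_URL = "http://127.0.0.1:30000/generate"  # assumed local server address

async def generate_one(session: aiohttp.ClientSession, prompt: str) -> str:
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 256, "temperature": 1.0},
    }
    async with session.post(SERVER_URL, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["text"]

async def generate_batch(prompts: list[str]) -> list[str]:
    # Each sample is an independent request, so no collective barrier is needed.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate_one(session, p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(generate_batch(["1 + 1 = ?", "Name a prime number."])))
```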

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a native HTTP server for SGLang, which is a significant architectural improvement for flexibility and scalability. The implementation is comprehensive, including synchronous and asynchronous adapters, extensive test coverage, and example configurations. However, I've identified a few critical issues that need to be addressed. There's a configuration mismatch that will cause runtime errors, and the server launch logic for multi-node setups appears to be flawed, which would prevent it from working correctly in distributed environments. Additionally, there's a performance-related regression due to a hardcoded logging level.

Comment on lines 457 to 486
args = {
    "model_path": actor_module,
    "dtype": self.config.dtype,
    "mem_fraction_static": self.config.gpu_memory_utilization,
    "enable_memory_saver": True,
    "base_gpu_id": 0,
    "gpu_id_step": 1,
    "tp_size": self._tp_size,
    "node_rank": node_rank,
    "load_format": load_format,
    "dist_init_addr": dist_init_addr,
    "nnodes": nnodes,
    "trust_remote_code": trust_remote_code,
    # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
    # when random.seed is being set during training
    "port": 30000 + rank,
    # NOTE(Chenyang): if you want to debug the SGLang engine output
    # please set the following parameters
    # Otherwise, it will make the engine run too slow
    "log_level": "info",
    # "log_level": "error",
    # log_requests=True,
    # log_requests_level=2,
    # NOTE(Chenyang): turn on max_running_requests to set the max concurrent running requests
    # max_running_requests=1,
    "mm_attention_backend": "fa3",
    "attention_backend": attention_backend if attention_backend is not None else "fa3",
    # In async mode, we want token in token out.
    "skip_tokenizer_init": self.config.mode == "async",
}
@gemini-code-assist bot (severity: high):

This refactoring has introduced a regression where engine_kwargs from the configuration are no longer passed to the SGLang engine constructor (AsyncEngine or AsyncHttpServerAdapter). This prevents overriding engine parameters like log_level. Additionally, log_level is now hardcoded to "info", which, as noted in the comments, can cause significant performance degradation and is not suitable for production. The engine_kwargs should be merged into the args dictionary, and the log_level should be configurable rather than hardcoded.

            args = {
                "model_path": actor_module,
                "dtype": self.config.dtype,
                "mem_fraction_static": self.config.gpu_memory_utilization,
                "enable_memory_saver": True,
                "base_gpu_id": 0,
                "gpu_id_step": 1,
                "tp_size": self._tp_size,
                "node_rank": node_rank,
                "load_format": load_format,
                "dist_init_addr": dist_init_addr,
                "nnodes": nnodes,
                "trust_remote_code": trust_remote_code,
                # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
                # when random.seed is being set during training
                "port": 30000 + rank,
                # NOTE(Chenyang): if you want to debug the SGLang engine output
                # please set the following parameters
                # Otherwise, it will make the engine run too slow
                # "log_level": "error",
                # log_requests=True,
                # log_requests_level=2,
                # NOTE(Chenyang): turn on max_running_requests to set the max concurrent running requests
                # max_running_requests=1,
                "mm_attention_backend": "fa3",
                "attention_backend": attention_backend if attention_backend is not None else "fa3",
                # In async mode, we want token in token out.
                "skip_tokenizer_init": self.config.mode == "async",
                **engine_kwargs,
            }

@ChangyiYang ChangyiYang marked this pull request as ready for review August 17, 2025 01:02
@ChangyiYang ChangyiYang changed the title [WIP][sglang] feat: add native sgl server [sglang] feat: add native sgl server Aug 17, 2025
@zhaochenyang20 (Collaborator) commented Aug 19, 2025

FAILED test_sglang_async_rollout_w_interaction.py::test_async_sglang_rollout_w_interaction - omegaconf.errors.ConfigAttributeError: Missing key sglang_engine_mode
    full_key: sglang_engine_mode
    object_type=dict

@ChangyiYang fix the CI plz.

4c3310d


@ChangyiYang (Contributor, Author)

@zhaochenyang20 Some of the CI failures should be fixed now. Can you rerun it?

@zhaochenyang20 (Collaborator) commented on this snippet:

python3 -m verl.trainer.main_ppo \
--config-path="$CONFIG_PATH" \
--config-name='gsm8k_multiturn_grpo_server' \

Add a comment to say this would use the server.

@wuxibin89 (Collaborator)

  1. This PR mainly addresses issues in batch mode with generate_sequences, while batch mode is going to be deprecated and switched to server mode by default ([rollout] feat: change rollout default mode from spmd to server mode #3161). So we tend not to make any major new changes to batch mode except bug fixes.
  2. For the server mode, I found it's pretty easy to add a native SGLang server with a few lines:
    wuxibin89/verl@wuxibin/refactor_rollout2...wuxibin/sglang_native_server

@ChangyiYang (Contributor, Author)

Correct me if I am wrong.
In my opinion, the code @wuxibin89 provided does NOT provide the native SGLang server path. We want the "generate" API to send an HTTP request to an SGLang server. In the provided code, an SGLang server is launched, but the generate API still goes to the local engine instance, which would not allow adding the SGLang router in the future. So I think the provided code cannot do what this PR does. At the least, the SGLang client that the rollout holds (which sends requests to an SGLang server) needs to provide the same API that the local engine provides.

@wuxibin89 wuxibin89 merged commit e95bd9e into volcengine:main Aug 28, 2025
55 of 58 checks passed
@ChangyiYang (Contributor, Author) commented Aug 28, 2025

Thanks @zhaochenyang20 for the detailed review and guidance! Thanks @SwordFaith for providing the base code and guidance along the way!

susumuota pushed a commit to susumuota/verl that referenced this pull request Aug 28, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 29, 2025
cczitong123 pushed a commit to cczitong123/verl that referenced this pull request Sep 5, 2025
DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request Sep 5, 2025
@sangnekim

Hello, thank you for your work.
I think the variable sglang_engine_mode should be renamed to sglang_rollout_mode.

  • Is the dist.barrier() timeout issue still in progress?

@ChangyiYang (Contributor, Author)

Hi @sangnekim. Yes, it would be better to change the naming as you suggest. The dist.barrier work won't proceed, since verl will adopt the agent loop and all the old rollout functions will be deprecated.

@lizipao commented Sep 11, 2025

sglang_engine_mode

@ChangyiYang
TypeError: RolloutConfig.__init__() got an unexpected keyword argument 'sglang_rollout_mode'
Is this a bug?

@ChangyiYang (Contributor, Author) commented Sep 11, 2025

@lizipao Yes, the config has a typo. I will change that. For now, you can change this line:

sglang_engine_mode: str = "local"
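
For context, a minimal sketch of where this workaround lands; only the field name and default come from the thread, and the surrounding RolloutConfig shape is an assumption:

```python
from dataclasses import dataclass

@dataclass
class RolloutConfig:
    # ... other rollout fields ...
    # The config key and this field name disagree (`sglang_rollout_mode` vs
    # `sglang_engine_mode`), hence the TypeError above. Until the typo is fixed
    # upstream, align the two names by editing this line (or the config key).
    sglang_engine_mode: str = "local"
```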

@lizipao commented Sep 12, 2025

@lizipao Yes, the config has a typo. I will change that. For now you can change this line

sglang_engine_mode: str = "local"

Have you encountered this error before?
[torch_memory_saver.cpp] CUresult error: 1 (invalid argument) file=csrc/core.cpp func=pause line=77
[http_server_engine.py:691] : Async request to release_memory_occupation timed out (attempt 1)

@ChangyiYang (Contributor, Author)

@lizipao I never encountered this error before.

vermouth1992 pushed a commit that referenced this pull request Sep 16, 2025
### What does this PR do?

This is the first part of supporting a vllm/sglang native HTTP server in server-mode rollout. In native HTTP server mode, the inference services are launched separately from the training engine, and the model runner shares the GPU with the training engine but runs in different processes.

We're going to support three deployment modes:
- **hybrid mode**: Training engine and model runner share the GPU but run in different processes. To sync weights, there is a server adapter in the training process, which is an HTTP client that sends wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources in a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter since there is no need to sync weights. This is mainly used for the GRM service (LLM as a judge).
<img width="2644" height="1276" alt="image"
src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7"
/>

Following PRs will be:
- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep(without weight
synchronization)
- [5/N] switch to the `/generate` HTTP API with token-in-token-out: currently sglang has a `/generate` API but may need some effort to support multi-modal, while vllm still lacks a `/generate` API
- [6/N] switch to the sglang/vllm router with better kv-cache-aware load balancing

The native HTTP server is inspired by the design of [slime](https://github.com/THUDM/slime); thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 (#3090) and @SuperCB (#3102) for their prior contributions.
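
The hybrid-mode description above mentions a server adapter in the training process that drives the inference server over HTTP around each training step. Below is a minimal sketch of such an adapter; the endpoint names, payload shapes, and base URL are assumptions for illustration, not the actual API of this PR.

```python
# Hypothetical hybrid-mode server adapter: an HTTP client in the training process
# that asks the colocated inference server to sleep (release GPU memory), wake up,
# and reload weights. All endpoint names and payloads are assumptions.
import requests

class ServerAdapter:
    def __init__(self, base_url: str = "http://127.0.0.1:30000", timeout: float = 300.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def _post(self, endpoint: str, payload: dict | None = None) -> dict:
        resp = requests.post(f"{self.base_url}/{endpoint}", json=payload or {}, timeout=self.timeout)
        resp.raise_for_status()
        return resp.json() if resp.content else {}

    def sleep(self) -> None:
        # Ask the server to release GPU memory so the training engine can use it.
        self._post("sleep")

    def wake_up(self) -> None:
        # Ask the server to re-acquire GPU memory before the next rollout.
        self._post("wake_up")

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the server at the latest trained weights (transfer mechanism varies).
        self._post("update_weights", {"checkpoint_path": checkpoint_path})

# Illustrative usage around a training step:
# adapter = ServerAdapter()
# adapter.sleep()                              # free GPU memory for training
# ...run optimizer step...
# adapter.update_weights("/tmp/ckpt/step_100")
# adapter.wake_up()                            # resume serving with fresh weights
```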
VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request Sep 24, 2025
WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025