[Do not Merge] Make VeRL SGLang Native #3102
Conversation
Code Review
This pull request introduces a significant refactoring to integrate SGLang natively into Verl, aiming to improve performance and simplify the architecture. The changes include a new data buffer-centric design, HTTP-based communication with the SGLang engine, and a new rollout manager. While the overall direction is sound and the implementation is comprehensive, I've identified a few areas that could be improved for robustness and maintainability. Specifically, there are cases of unsafe dictionary access, a potentially problematic bare except clause, a function with unexpected side effects, and code duplication that should be addressed.
```python
self.sglang_router_ip = kwargs["sglang_router_ip"]
self.sglang_router_port = kwargs["sglang_router_port"]
```
Accessing kwargs with [] is unsafe as it will raise a KeyError if the keys sglang_router_ip or sglang_router_port are not provided by the caller. This can lead to runtime crashes that are hard to debug. For better robustness, you should use kwargs.get() and handle the case where the keys might be missing, or explicitly check for their existence and raise a more informative ValueError.
Suggested change:

```diff
-self.sglang_router_ip = kwargs["sglang_router_ip"]
-self.sglang_router_port = kwargs["sglang_router_port"]
+self.sglang_router_ip = kwargs.get("sglang_router_ip")
+self.sglang_router_port = kwargs.get("sglang_router_port")
+if self.sglang_router_ip is None or self.sglang_router_port is None:
+    raise ValueError("sglang_router_ip and sglang_router_port must be provided for ActorRolloutRefWorker")
```
```python
    output = response.json()
except:
    output = response.text
```
Using a bare except: is risky because it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can make the program difficult to terminate and can mask underlying bugs. It's better to catch a more specific exception, such as json.JSONDecodeError, to handle cases where the response body is not valid JSON.
Suggested change:

```diff
-    output = response.json()
-except:
-    output = response.text
+try:
+    output = response.json()
+except json.JSONDecodeError:
+    output = response.text
```
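For context, a minimal self-contained version of this pattern is sketched below; the URL is a placeholder, not an endpoint from this PR. One detail worth noting: the module must `import json` for `json.JSONDecodeError` to be in scope, and since `json.JSONDecodeError` subclasses `ValueError`, catching `ValueError` is a slightly broader alternative that also covers the decode error type raised by newer versions of `requests`.

```python
# Minimal sketch of the suggested pattern; the URL below is a hypothetical placeholder.
import json
import requests

response = requests.get("http://localhost:30000/health")
try:
    output = response.json()
except json.JSONDecodeError:  # body was not valid JSON; `except ValueError` also works
    output = response.text
```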
```python
batch_size = idx.size(0)

# Extract non-tensor data
non_tensor_batch = prompts.non_tensor_batch
```
Using pop() on non_tensor_batch modifies the dictionary in-place. Since non_tensor_batch is a reference to prompts.non_tensor_batch, this function has the side effect of altering the prompts object that was passed in. This can lead to unexpected behavior in other parts of the code that might use the prompts object later, as it would be missing the popped keys. To avoid this, you should operate on a copy of the dictionary.
Suggested change:

```diff
-non_tensor_batch = prompts.non_tensor_batch
+non_tensor_batch = prompts.non_tensor_batch.copy()
```
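To make the side effect concrete, here is a minimal, self-contained illustration; the dict contents and the `rollout` function name are hypothetical, not the PR's actual data:

```python
# Minimal illustration of the side effect; keys and values here are hypothetical.
prompts_non_tensor = {"raw_prompt_ids": [[1, 2, 3]], "multi_modal_data": [None]}

def rollout(non_tensor_batch):
    # pop() mutates whatever dict object was passed in
    return non_tensor_batch.pop("raw_prompt_ids")

rollout(prompts_non_tensor.copy())                 # operate on a shallow copy
assert "raw_prompt_ids" in prompts_non_tensor      # caller's dict is intact

rollout(prompts_non_tensor)                        # operate in place
assert "raw_prompt_ids" not in prompts_non_tensor  # key silently disappeared
```

Note that `.copy()` is shallow, which is sufficient here since only top-level keys are popped.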
```python
import random
import socket


def find_available_port(base_port: int):
    port = base_port + random.randint(100, 1000)
    while True:
        if is_port_available(port):
            return port
        if port < 60000:
            port += 42
        else:
            port -= 43


def is_port_available(port):
    """Return whether a port is available."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", port))
            s.listen(1)
            return True
        except socket.error:
            return False
        except OverflowError:
            return False


def get_host_info():
    hostname = socket.gethostname()
    local_ip = socket.gethostbyname(hostname)
    return hostname, local_ip


def run_router(args):
    try:
        from sglang_router.launch_router import launch_router

        router = launch_router(args)
        if router is None:
            return 1
        return 0
    except Exception as e:
        print(e)
        return 1
```
The functions find_available_port, is_port_available, get_host_info, and run_router are duplicated from verl/workers/rollout/http_utils.py. This code duplication increases maintenance overhead and risks inconsistencies if one version is updated and the other is not. To improve maintainability, you should remove these duplicated functions and instead import them from verl.workers.rollout.http_utils.
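Concretely, the deduplication could be as simple as the following imports, assuming the helpers keep these exact names in `verl/workers/rollout/http_utils.py` as the comment above states:

```python
# Replace the duplicated definitions with imports from the existing module.
from verl.workers.rollout.http_utils import (
    find_available_port,
    get_host_info,
    is_port_available,
    run_router,
)
```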
The offline batch inference mode has some drawbacks (#2854 (comment)), so we're going to deprecate it and switch to server mode. Server-mode rollout is a more natural way to use an inference framework, and we can reuse all the features and optimizations built for online serving. We have an OpenAI-compatible HTTP server implementation:
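The linked implementation itself isn't reproduced here. Purely as an illustration of what server-mode rollout enables, driving generation from the client side of any OpenAI-compatible server is a plain HTTP request; the host, port, and model name below are placeholders, not values from this PR:

```python
import requests

# Placeholders: point host/port and model name at the launched inference server.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```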
BTW, this PR seems to duplicate #3090?

This PR not only supports online server mode; I've also ported the data buffer-centric architecture from slime and changed the way data is distributed during the rollout phase. If we only port the content from #3090, the SGLang abnormal-slowdown issue will still occur.
Hi! I am the author of PR #3090. My PR is intended to provide server-based rollout for SGLang without breaking Verl's current structure. But @SuperCB is right: my design can only mitigate the problem. Truly solving it requires changing the data flow, and #3090 is just a start toward that solution. I am willing to discuss the next steps.
I'd like to ask: does the current change actually run end to end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

I found that part of the code hadn't been committed.

Haha, when can you push it? In my experiments, when using the SGLang server the timing_s/update actor time keeps increasing step by step. Have you run into this problem?
I'd like to ask whether there is a test script for the current SGLang optimizations. I'm using your verl branch on the rollout side, but I keep hitting errors when running tests. Could you share your training script? 🙏 The test parameters in most of the existing scripts feel too messy.

I've pushed the missing code. We haven't encountered the problem you ran into; for details, feel free to reach out on WeChat: 15930067963.

Feel free to reach out on WeChat.
May I ask whether there is still a plan to merge this PR?

No, there isn't; ByteDance has a different solution.
I'd also like to add this issue I've had open for 5 months: #1208. You can see @squall1988's recent comment there too; also see #3173.
### What does this PR do?

This is the first part to support vllm/sglang native http server in server mode rollout. In native http server mode, the inference services are launched separately from the training engine, and the model runner shares GPU with the training engine but in different processes. We're going to support three deployment modes:

- **hybrid mode**: Training engine and model runner share GPU but in different processes. To sync weights, there's a server adapter in the training process, which is an http client that sends wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources, a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without a server adapter since there is no need to sync weights. This is mainly used for GRM service (LLM as a judge).

<img width="2644" height="1276" alt="image" src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7" />

Following PRs will be:

- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
- [5/N] switch to `/generate` http api with token-in-token-out: currently sglang has a `/generate` api but may need some effort to support multi-modal, while vllm still lacks a `/generate` api
- [6/N] switch to sglang/vllm router with better kv-cache-aware load balancing

The native http server is inspired by the design of [slime](https://github.com/THUDM/slime), thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 #3090 @SuperCB #3102 for their prior contribution.
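To illustrate the hybrid-mode description above, here is a rough sketch (not this PR's actual API) of what such a server adapter could look like: an HTTP client living in the training process that asks the inference server to sleep, wake up, and reload weights between phases. The endpoint paths and payload shape are assumptions for illustration only.

```python
import requests


class ServerAdapter:
    """Hypothetical sketch of a training-side adapter for an inference HTTP server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _post(self, path, payload=None):
        resp = requests.post(f"{self.base_url}{path}", json=payload or {}, timeout=300)
        resp.raise_for_status()
        return resp

    def sleep(self):
        # release GPU memory on the inference side so training can use the device
        return self._post("/sleep")

    def wake_up(self):
        return self._post("/wake_up")

    def update_weights(self, checkpoint_path):
        # point the inference server at the freshly trained weights
        return self._post("/update_weights", {"path": checkpoint_path})
```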
### What does this PR do?
We developed this PR based on the following three considerations:

1. Multiple teams and individuals, including the XHS team, have observed that SGLang often becomes exceptionally slow when used with Verl (Issue1). Sometimes the first training step is extremely slow, and sometimes the slowdown starts from the second step. Through profiling, we discovered that the allreduce operation in SGLang slowed down by dozens of times (as shown in the figure above, a single token of data took 17 ms to run TP16's allreduce, which is far too long considering we are using an H100 cluster). We traced the issue to this line of code (LineLink): it causes time-slice contention between two processes on the same GPU, and the contention becomes more severe as the TP size increases. We cannot launch other processes to run communication operations on the same GPU while SGLang is performing inference.
2. Verl's original behavior of pre-splitting the training data before rollout makes the development of features like partial rollout and streaming rollout very difficult. If each SGLang engine receives a fixed portion of the data, we would need to maintain the current generation state of each SGLang engine and track how many tokens each engine has generated. Such an implementation would be overly complex; we want to make it simpler and more intuitive.
3. Algorithm researchers need to rapidly iterate through various experimental schemes during reinforcement learning framework development, including exploring different off-policy and on-policy data ratio strategies and verifying various advanced agent RL mechanisms. However, in the current Verl architecture, the rollout processing code and the inference engine code are tightly coupled (as shown in the code link Link). This means that any modification to any stage of the rollout process requires launching the complete Verl framework for debugging, which significantly increases development costs and iteration cycles. We want to decouple the rollout functionality into an independent "Rollout Manager" component that can be developed and debugged independently from the Verl framework, reducing the time cost for algorithm engineers to run experiments.
Based on the above considerations, we decided to refactor the parts of Verl related to SGLang rollout to overcome the last obstacle in using Verl + SGLang. We are grateful to zhuzilin for open-sourcing the Slime framework. Slime's data buffer-centric design has greatly inspired us, and we referred to Slime's implementation during our design process.
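To make the data buffer-centric idea concrete, here is a rough sketch under assumed names and endpoints (DataBuffer, the router URL, and the /generate payload are illustrative, not this PR's code): a central buffer hands out prompts on demand, and rollout workers stream requests to an SGLang router over HTTP instead of pre-splitting the batch per engine.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

import requests


class DataBuffer:
    """Central store: prompts go in, finished rollouts come back."""

    def __init__(self, prompts):
        self.pending = queue.Queue()
        for p in prompts:
            self.pending.put(p)
        self.finished = []

    def get(self):
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def put_result(self, sample):
        self.finished.append(sample)


def rollout_worker(buffer, router_url):
    while (prompt := buffer.get()) is not None:
        # The router load-balances across SGLang engines; no per-engine pre-split needed.
        resp = requests.post(
            f"{router_url}/generate",
            json={"text": prompt, "sampling_params": {"max_new_tokens": 128}},
            timeout=600,
        )
        buffer.put_result({"prompt": prompt, "output": resp.json()})


def run_rollout(prompts, router_url="http://127.0.0.1:30000", concurrency=64):
    buffer = DataBuffer(prompts)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(rollout_worker, buffer, router_url)
    return buffer.finished
```

Because the buffer is the single source of truth for pending and finished samples, partial rollout or streaming rollout only needs to change when results are drained from the buffer, not how data is sharded across engines.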