
Conversation

@SuperCB
Contributor

@SuperCB SuperCB commented Aug 18, 2025

What does this PR do?

We developed this PR based on the following three considerations:
image

  1. Multiple teams and individuals, including the XHS team, have observed that SGLang often becomes exceptionally slow when used with Verl (Issue1). Sometimes the first training step is extremely slow, and sometimes the slowdown starts from the second step. Through profiling, we discovered that the allreduce operation in SGLang slowed down by dozens of times. (As shown in the figure above, it took 17 ms to run the TP16 allreduce for a single token of data, which is far too long considering we are using an H100 cluster.) We traced the issue to this line of code (LineLink), which causes time-slice contention between two processes on the same GPU; the contention becomes more severe as the TP size increases. We cannot launch other processes that run communication operations on the same GPU while SGLang is performing inference.

  2. The original Verl behavior of pre-splitting the training data across engines before rollout makes features like partial rollout and streaming rollout very difficult to develop. If each SGLang engine receives its own fixed slice of the data, we have to track the generation state of every engine and how many tokens each one has produced. Such an implementation would be overly complex; we want something simpler and more intuitive.

  3. Algorithm researchers need to iterate rapidly over experimental schemes during reinforcement-learning framework development, including different off-policy/on-policy data-ratio strategies and various advanced agent RL mechanisms. However, in the current Verl architecture, the rollout-processing code and the inference-engine code are tightly coupled (as shown in the code link Link). This means that modifying any stage of the rollout process requires launching the complete Verl framework for debugging, which significantly increases development cost and iteration time. We want to decouple the rollout functionality into an independent "Rollout Manager" component that can be developed and debugged independently of the Verl framework, thereby reducing the time cost for algorithm engineers to run experiments.

Based on the above three considerations, we decided to refactor the parts of Verl related to SGLang rollout to remove the last obstacle to using Verl + SGLang. We are grateful to zhuzilin for open-sourcing the Slime framework; Slime's data-buffer-centric design greatly inspired us, and we referred to Slime's implementation during our design process.
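
To make the intent of point 2 concrete, here is a minimal, hypothetical sketch of a data-buffer-centric rollout loop (all class and function names are illustrative, not taken from this PR's code): engines pull prompts from one shared buffer instead of receiving pre-split shards, so partial or streaming rollout only needs to inspect the buffer rather than track per-engine state.

```python
import queue
import threading


class DataBuffer:
    """Central store of pending prompts and finished rollouts."""

    def __init__(self, prompts):
        self.pending = queue.Queue()
        for p in prompts:
            self.pending.put(p)
        self.finished = []
        self._lock = threading.Lock()

    def get_prompt(self):
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def put_result(self, result):
        with self._lock:
            self.finished.append(result)


def rollout_worker(buffer, generate_fn):
    # Each engine keeps pulling until the shared buffer is drained, so there is
    # no per-engine bookkeeping of how much data each engine still has to generate.
    while (prompt := buffer.get_prompt()) is not None:
        buffer.put_result(generate_fn(prompt))


if __name__ == "__main__":
    buf = DataBuffer([f"prompt-{i}" for i in range(8)])
    fake_engine = lambda p: p + " -> completion"
    workers = [threading.Thread(target=rollout_worker, args=(buf, fake_engine)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(f"{len(buf.finished)} rollouts collected")
```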

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ SuperCB
❌ 崔博


崔博 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant refactoring to integrate SGLang natively into Verl, aiming to improve performance and simplify the architecture. The changes include a new data buffer-centric design, HTTP-based communication with the SGLang engine, and a new rollout manager. While the overall direction is sound and the implementation is comprehensive, I've identified a few areas that could be improved for robustness and maintainability. Specifically, there are cases of unsafe dictionary access, a potentially problematic bare except clause, a function with unexpected side effects, and code duplication that should be addressed.

Comment on lines +93 to +94
self.sglang_router_ip = kwargs["sglang_router_ip"]
self.sglang_router_port = kwargs["sglang_router_port"]
Contributor

high

Accessing kwargs with [] is unsafe as it will raise a KeyError if the keys sglang_router_ip or sglang_router_port are not provided by the caller. This can lead to runtime crashes that are hard to debug. For better robustness, you should use kwargs.get() and handle the case where the keys might be missing, or explicitly check for their existence and raise a more informative ValueError.

Suggested change
self.sglang_router_ip = kwargs["sglang_router_ip"]
self.sglang_router_port = kwargs["sglang_router_port"]
self.sglang_router_ip = kwargs.get("sglang_router_ip")
self.sglang_router_port = kwargs.get("sglang_router_port")
if self.sglang_router_ip is None or self.sglang_router_port is None:
raise ValueError("sglang_router_ip and sglang_router_port must be provided for ActorRolloutRefWorker")

Comment on lines +83 to +85
output = response.json()
except:
output = response.text
Contributor

high

Using a bare except: is risky because it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can make the program difficult to terminate and can mask underlying bugs. It's better to catch a more specific exception, such as json.JSONDecodeError, to handle cases where the response body is not valid JSON.

Suggested change
output = response.json()
except:
output = response.text
try:
output = response.json()
except json.JSONDecodeError:
output = response.text
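
For reference, a self-contained variant of that fix; the surrounding request handling is assumed for illustration and is not copied from the PR. It catches ValueError, which also covers the JSONDecodeError that requests raises on an invalid body, so no extra `import json` is needed:

```python
import requests


def parse_response(response: requests.Response):
    """Return the JSON body if possible, otherwise fall back to raw text."""
    try:
        return response.json()
    except ValueError:
        # requests raises a JSONDecodeError (a ValueError subclass) on invalid
        # JSON, so this still avoids swallowing unrelated exceptions the way a
        # bare `except:` would.
        return response.text
```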

batch_size = idx.size(0)

# Extract non-tensor data
non_tensor_batch = prompts.non_tensor_batch
Contributor

high

Using pop() on non_tensor_batch modifies the dictionary in-place. Since non_tensor_batch is a reference to prompts.non_tensor_batch, this function has the side effect of altering the prompts object that was passed in. This can lead to unexpected behavior in other parts of the code that might use the prompts object later, as it would be missing the popped keys. To avoid this, you should operate on a copy of the dictionary.

Suggested change
non_tensor_batch = prompts.non_tensor_batch
non_tensor_batch = prompts.non_tensor_batch.copy()
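
As a tiny illustration of that side effect (the dict keys here are generic placeholders, not the PR's actual DataProto fields):

```python
prompts_non_tensor = {"raw_prompt_ids": [[1, 2, 3]], "tools": ["search"]}


def consume_without_copy(non_tensor_batch):
    # pop() mutates the caller's dict: the key disappears from prompts_non_tensor.
    return non_tensor_batch.pop("raw_prompt_ids")


def consume_with_copy(non_tensor_batch):
    # A shallow copy decouples the key set, so the caller's dict keeps its keys.
    local = non_tensor_batch.copy()
    return local.pop("raw_prompt_ids")


consume_with_copy(prompts_non_tensor)
assert "raw_prompt_ids" in prompts_non_tensor      # still intact
consume_without_copy(prompts_non_tensor)
assert "raw_prompt_ids" not in prompts_non_tensor  # key silently removed
```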

Comment on lines +1509 to +1552
def find_available_port(base_port: int):
port = base_port + random.randint(100, 1000)
while True:
if is_port_available(port):
return port
if port < 60000:
port += 42
else:
port -= 43


def is_port_available(port):
"""Return whether a port is available."""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
try:
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("", port))
s.listen(1)
return True
except socket.error:
return False
except OverflowError:
return False


def get_host_info():
hostname = socket.gethostname()

local_ip = socket.gethostbyname(hostname)

return hostname, local_ip


def run_router(args):
try:
from sglang_router.launch_router import launch_router

router = launch_router(args)
if router is None:
return 1
return 0
except Exception as e:
print(e)
return 1
Contributor

high

The functions find_available_port, is_port_available, get_host_info, and run_router are duplicated from verl/workers/rollout/http_utils.py. This code duplication increases maintenance overhead and risks inconsistencies if one version is updated and the other is not. To improve maintainability, you should remove these duplicated functions and instead import them from verl.workers.rollout.http_utils.
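
For example, the duplicated definitions could be replaced with a single import, assuming http_utils exports these helpers under the same names:

```python
# Reuse the existing helpers instead of redefining them here.
from verl.workers.rollout.http_utils import (
    find_available_port,
    get_host_info,
    is_port_available,
    run_router,
)
```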

@wuxibin89
Collaborator

wuxibin89 commented Aug 18, 2025

Through profiling, we discovered that the allreduce operation in Sglang slowed down by dozens of times. We traced the issue to this line of code LineLink (which sounds unbelievable, as it appears that there is interference between nccl operations run by multiple processes on the same GPU).

The offline batch inference mode has some drawbacks (#2854 (comment)), so we're going to deprecate it and switch to server mode. Server-mode rollout is a more natural way to use an inference framework, and we can reuse all the features and optimizations built for online serving.

We have an OpenAI compatible http server implementation:
https://github.com/volcengine/verl/blob/main/verl/workers/rollout/sglang_rollout/async_sglang_server.py#L70-L76

@wuxibin89
Collaborator

BTW, this PR seems to duplicate #3090?

@SuperCB
Contributor Author

SuperCB commented Aug 18, 2025

BTW, this PR seems to duplicate #3090?

This PR not only supports online server mode; I've also ported the data-buffer-centric architecture from slime and changed the way data is distributed during the rollout phase. If we only port the content from #3090, the SGLang slowdown issue will still occur.

@ChangyiYang
Contributor

Hi! I am the author of PR #3090. My PR is intended to provide server-based rollout for sglang without breaking Verl's current structure. But @SuperCB is right: my design can only mitigate the problem. Truly solving it requires changing the data flow. #3090 is just a starting point for the solution. I am willing to discuss the next steps.

@Cesilina

May I ask whether the current changes run end-to-end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

@SuperCB
Contributor Author

SuperCB commented Aug 27, 2025

May I ask whether the current changes run end-to-end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

I found that part of the code had not been pushed.

@Cesilina

May I ask whether the current changes run end-to-end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

I found that part of the code had not been pushed.

Haha, when will you be able to push it? In my experiments I found that when using the sglang server, the timing_s/update actor time keeps increasing. Have you run into this problem?

@ChenyuWang1022

I'd like to ask whether there is a test script for the current SGLang optimizations. I'm using your verl branch on the rollout side, but I keep getting errors when running tests. Could you please share your training script 🙏? The test parameters in most of the scripts feel too messy.

@SuperCB
Contributor Author

SuperCB commented Aug 28, 2025

May I ask whether the current changes run end-to-end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

I found that part of the code had not been pushed.

Haha, when will you be able to push it? In my experiments I found that when using the sglang server, the timing_s/update actor time keeps increasing. Have you run into this problem?

I have pushed the missing code. We have not run into the problem you describe; for details, feel free to reach me on WeChat at 15930067963.

@SuperCB
Contributor Author

SuperCB commented Aug 28, 2025

I'd like to ask whether there is a test script for the current SGLang optimizations. I'm using your verl branch on the rollout side, but I keep getting errors when running tests. Could you please share your training script 🙏? The test parameters in most of the scripts feel too messy.

We can discuss over WeChat.

@lizipao

lizipao commented Sep 7, 2025

May I ask whether there is still a plan to merge this PR?

@SuperCB
Contributor Author

SuperCB commented Sep 10, 2025

May I ask whether there is still a plan to merge this PR?

No, there is no such plan; ByteDance has a different solution.

@EduardDurech
Collaborator

I'd also like to add the issue I've had open for 5 months, #1208; see @squall1988's recent comment there, and also #3173.

vermouth1992 pushed a commit that referenced this pull request Sep 16, 2025
### What does this PR do?

This is the first part of supporting a vllm/sglang native HTTP server in
server-mode rollout. In native HTTP server mode,
the inference services are launched separately from the training engine,
and the model runner shares the GPU with the training engine but runs in a
different process.

We're going to support three deployment modes:
- **hybrid mode**: Training engine and model runner share the GPU but run in
different processes. To sync weights, there's a server adapter in the training
process, which is an HTTP client that sends wake_up/sleep/update_weights
requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have
separate GPU resources (disaggregated architecture). This is used for
off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter since
there is no need to sync weights. This is mainly used for the GRM service
(LLM as a judge).
<img width="2644" height="1276" alt="image"
src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7"
/>

Following PRs will be:
- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight
synchronization)
- [5/N] switch to the `/generate` HTTP api with token-in-token-out:
sglang currently has a `/generate` api but may need some effort to support
multi-modal inputs, while vllm still lacks a `/generate` api
- [6/N] switch to the sglang/vllm router with better kv-cache-aware load
balancing

The native HTTP server is inspired by the design of
[slime](https://github.com/THUDM/slime); thanks for their prior work.
Also credit to @ChangyiYang @zhaochenyang20 (#3090) and @SuperCB (#3102)
for their prior contributions.
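
To make the hybrid-mode flow concrete, here is a hypothetical sketch of such a server adapter; the endpoint paths and payloads are placeholders for illustration, not this PR's actual API:

```python
from typing import Optional

import requests


class ServerAdapter:
    """HTTP client in the training process that drives the inference server."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, route: str, payload: Optional[dict] = None) -> requests.Response:
        resp = requests.post(f"{self.base_url}/{route}", json=payload or {})
        resp.raise_for_status()
        return resp

    def sleep(self) -> None:
        # Release GPU memory held by the model runner before the training step.
        self._post("sleep")

    def wake_up(self) -> None:
        # Re-occupy GPU memory before the next rollout phase.
        self._post("wake_up")

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the inference server at the freshly trained weights.
        self._post("update_weights", {"path": checkpoint_path})
```

A training loop would then call sleep() before the update step, and wake_up() followed by update_weights(...) before the next rollout phase.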
@SuperCB SuperCB changed the title from "[Do not Merge] Make Verl Sglang Native" to "[Do not Merge] Make VeRL SGLang Native" Sep 19, 2025
VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request Sep 24, 2025
@wuxibin89 wuxibin89 closed this Sep 26, 2025
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025