[Do not Merge] Make VeRL SGLang Native #3102
Conversation
Code Review
This pull request introduces a significant refactoring to integrate SGLang natively into Verl, aiming to improve performance and simplify the architecture. The changes include a new data buffer-centric design, HTTP-based communication with the SGLang engine, and a new rollout manager. While the overall direction is sound and the implementation is comprehensive, I've identified a few areas that could be improved for robustness and maintainability. Specifically, there are cases of unsafe dictionary access, a potentially problematic bare except clause, a function with unexpected side effects, and code duplication that should be addressed.
```python
self.sglang_router_ip = kwargs["sglang_router_ip"]
self.sglang_router_port = kwargs["sglang_router_port"]
```
Accessing kwargs with [] is unsafe as it will raise a KeyError if the keys sglang_router_ip or sglang_router_port are not provided by the caller. This can lead to runtime crashes that are hard to debug. For better robustness, you should use kwargs.get() and handle the case where the keys might be missing, or explicitly check for their existence and raise a more informative ValueError.
Suggested change:

```diff
-self.sglang_router_ip = kwargs["sglang_router_ip"]
-self.sglang_router_port = kwargs["sglang_router_port"]
+self.sglang_router_ip = kwargs.get("sglang_router_ip")
+self.sglang_router_port = kwargs.get("sglang_router_port")
+if self.sglang_router_ip is None or self.sglang_router_port is None:
+    raise ValueError("sglang_router_ip and sglang_router_port must be provided for ActorRolloutRefWorker")
```
```python
    output = response.json()
except:
    output = response.text
```
Using a bare except: is risky because it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can make the program difficult to terminate and can mask underlying bugs. It's better to catch a more specific exception, such as json.JSONDecodeError, to handle cases where the response body is not valid JSON.
Suggested change:

```diff
-    output = response.json()
-except:
-    output = response.text
+try:
+    output = response.json()
+except json.JSONDecodeError:
+    output = response.text
```
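For context, a minimal self-contained version of this pattern is sketched below; the URL is a placeholder, not an endpoint from this PR. One detail worth noting: the module must `import json` for `json.JSONDecodeError` to be in scope, and since `json.JSONDecodeError` subclasses `ValueError`, catching `ValueError` is a slightly broader alternative that also covers the decode error type raised by newer versions of `requests`.

```python
# Minimal sketch of the suggested pattern; the URL below is a hypothetical placeholder.
import json
import requests

response = requests.get("http://localhost:30000/health")
try:
    output = response.json()
except json.JSONDecodeError:  # body was not valid JSON; `except ValueError` also works
    output = response.text
```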
```python
batch_size = idx.size(0)

# Extract non-tensor data
non_tensor_batch = prompts.non_tensor_batch
```
Using pop() on non_tensor_batch modifies the dictionary in-place. Since non_tensor_batch is a reference to prompts.non_tensor_batch, this function has the side effect of altering the prompts object that was passed in. This can lead to unexpected behavior in other parts of the code that might use the prompts object later, as it would be missing the popped keys. To avoid this, you should operate on a copy of the dictionary.
Suggested change:

```diff
-non_tensor_batch = prompts.non_tensor_batch
+non_tensor_batch = prompts.non_tensor_batch.copy()
```
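To make the side effect concrete, here is a minimal, self-contained illustration; the dict contents and the `rollout` function name are hypothetical, not the PR's actual data:

```python
# Minimal illustration of the side effect; keys and values here are hypothetical.
prompts_non_tensor = {"raw_prompt_ids": [[1, 2, 3]], "multi_modal_data": [None]}

def rollout(non_tensor_batch):
    # pop() mutates whatever dict object was passed in
    return non_tensor_batch.pop("raw_prompt_ids")

rollout(prompts_non_tensor.copy())                 # operate on a shallow copy
assert "raw_prompt_ids" in prompts_non_tensor      # caller's dict is intact

rollout(prompts_non_tensor)                        # operate in place
assert "raw_prompt_ids" not in prompts_non_tensor  # key silently disappeared
```

Note that `.copy()` is shallow, which is sufficient here since only top-level keys are popped.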
```python
import random
import socket


def find_available_port(base_port: int):
    port = base_port + random.randint(100, 1000)
    while True:
        if is_port_available(port):
            return port
        if port < 60000:
            port += 42
        else:
            port -= 43


def is_port_available(port):
    """Return whether a port is available."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", port))
            s.listen(1)
            return True
        except socket.error:
            return False
        except OverflowError:
            return False


def get_host_info():
    hostname = socket.gethostname()
    local_ip = socket.gethostbyname(hostname)
    return hostname, local_ip


def run_router(args):
    try:
        from sglang_router.launch_router import launch_router

        router = launch_router(args)
        if router is None:
            return 1
        return 0
    except Exception as e:
        print(e)
        return 1
```
The functions find_available_port, is_port_available, get_host_info, and run_router are duplicated from verl/workers/rollout/http_utils.py. This code duplication increases maintenance overhead and risks inconsistencies if one version is updated and the other is not. To improve maintainability, you should remove these duplicated functions and instead import them from verl.workers.rollout.http_utils.
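Concretely, the deduplication could be as simple as the following imports, assuming the helpers keep these exact names in `verl/workers/rollout/http_utils.py` as the comment above states:

```python
# Replace the duplicated definitions with imports from the existing module.
from verl.workers.rollout.http_utils import (
    find_available_port,
    get_host_info,
    is_port_available,
    run_router,
)
```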
The offline batch inference mode has some drawbacks (#2854 (comment)), so we're going to deprecate it and switch to server mode. Server-mode rollout is a more natural way to use an inference framework, and we can reuse all the features and optimizations built for online serving. We have an OpenAI-compatible HTTP server implementation:
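The linked implementation itself isn't reproduced here. Purely as an illustration of what server-mode rollout enables, driving generation from the client side of any OpenAI-compatible server is a plain HTTP request; the host, port, and model name below are placeholders, not values from this PR:

```python
import requests

# Placeholders: point host/port and model name at the launched inference server.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```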
BTW, this PR seems to duplicate #3090?

This PR not only supports online server mode; I've also ported the data buffer-centric architecture from slime and changed the way data is distributed during the rollout phase. If we only port the content from #3090, the SGLang abnormal-slowdown issue will still occur.
Hi! I am the author of PR #3090. My PR is intended to provide server-based rollout for SGLang without breaking Verl's current structure. But @SuperCB is right: my design can only mitigate the problem. Truly solving it requires changing the data flow, and #3090 is just a start toward that solution. I am willing to discuss the next steps.
I'd like to ask: does the current change actually run end to end? I don't see where the prepare_for_generate and finish_generate functions in megatron_workers are called.

I found that part of the code hadn't been committed.

Haha, when can you push it? In my experiments, when using the SGLang server the timing_s/update actor time keeps increasing step by step. Have you run into this problem?
I'd like to ask whether there is a test script for the current SGLang optimizations. I'm using your verl branch on the rollout side, but I keep hitting errors when running tests. Could you share your training script? 🙏 The test parameters in most of the existing scripts feel too messy.

I've pushed the missing code. We haven't encountered the problem you ran into; for details, feel free to reach out on WeChat: 15930067963.

Feel free to reach out on WeChat.
May I ask whether there is still a plan to merge this PR?

No, there isn't; ByteDance has a different solution.
I'd also like to add this issue I've had open for 5 months: #1208. You can see @squall1988's recent comment there too; also see #3173.
### What does this PR do?

This is the first part to support vllm/sglang native http server in server mode rollout. In native http server mode, the inference services are launched separately from the training engine, and the model runner shares GPU with the training engine but in different processes. We're going to support three deployment modes:

- **hybrid mode**: Training engine and model runner share GPU but in different processes. To sync weights, there's a server adapter in the training process, which is an http client that sends wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources, a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without a server adapter since there is no need to sync weights. This is mainly used for GRM service (LLM as a judge).

<img width="2644" height="1276" alt="image" src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7" />

Following PRs will be:

- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
- [5/N] switch to `/generate` http api with token-in-token-out: currently sglang has a `/generate` api but may need some effort to support multi-modal, while vllm still lacks a `/generate` api
- [6/N] switch to sglang/vllm router with better kv-cache-aware load balancing

The native http server is inspired by the design of [slime](https://github.com/THUDM/slime), thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 #3090 @SuperCB #3102 for their prior contribution.
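To illustrate the hybrid-mode description above, here is a rough sketch (not this PR's actual API) of what such a server adapter could look like: an HTTP client living in the training process that asks the inference server to sleep, wake up, and reload weights between phases. The endpoint paths and payload shape are assumptions for illustration only.

```python
import requests


class ServerAdapter:
    """Hypothetical sketch of a training-side adapter for an inference HTTP server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _post(self, path, payload=None):
        resp = requests.post(f"{self.base_url}{path}", json=payload or {}, timeout=300)
        resp.raise_for_status()
        return resp

    def sleep(self):
        # release GPU memory on the inference side so training can use the device
        return self._post("/sleep")

    def wake_up(self):
        return self._post("/wake_up")

    def update_weights(self, checkpoint_path):
        # point the inference server at the freshly trained weights
        return self._post("/update_weights", {"path": checkpoint_path})
```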
### What does this PR do?
We developed this PR based on the following three considerations:

1. Multiple teams and individuals, including the XHS team, have observed that SGLang often becomes exceptionally slow when used with Verl (Issue1). Sometimes the first training step is extremely slow, and sometimes the slowdown starts from the second step. Through profiling, we discovered that the allreduce operation in SGLang slowed down by dozens of times (as shown in the figure above, a single token of data took 17 ms to run TP16's allreduce, which is far too long considering we are using an H100 cluster). We traced the issue to this line of code (LineLink): it causes time-slice contention between two processes on the same GPU, and the contention becomes more severe as the TP size increases. We cannot launch other processes to run communication operations on the same GPU while SGLang is performing inference.
2. Verl's original behavior of pre-splitting the training data before rollout makes the development of features like partial rollout and streaming rollout very difficult. If each SGLang engine receives a fixed portion of the data, we would need to maintain the current generation state of each SGLang engine and track how many tokens each engine has generated. Such an implementation would be overly complex; we want to make it simpler and more intuitive.
3. Algorithm researchers need to rapidly iterate through various experimental schemes during reinforcement learning framework development, including exploring different off-policy and on-policy data ratio strategies and verifying various advanced agent RL mechanisms. However, in the current Verl architecture, the rollout processing code and the inference engine code are tightly coupled (as shown in the code link Link). This means that any modification to any stage of the rollout process requires launching the complete Verl framework for debugging, which significantly increases development costs and iteration cycles. We want to decouple the rollout functionality into an independent "Rollout Manager" component that can be developed and debugged independently from the Verl framework, reducing the time cost for algorithm engineers to run experiments.
Based on the above considerations, we decided to refactor the parts of Verl related to SGLang rollout to overcome the last obstacle in using Verl + SGLang. We are grateful to zhuzilin for open-sourcing the Slime framework. Slime's data buffer-centric design has greatly inspired us, and we referred to Slime's implementation during our design process.
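To make the data buffer-centric idea concrete, here is a rough sketch under assumed names and endpoints (DataBuffer, the router URL, and the /generate payload are illustrative, not this PR's code): a central buffer hands out prompts on demand, and rollout workers stream requests to an SGLang router over HTTP instead of pre-splitting the batch per engine.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

import requests


class DataBuffer:
    """Central store: prompts go in, finished rollouts come back."""

    def __init__(self, prompts):
        self.pending = queue.Queue()
        for p in prompts:
            self.pending.put(p)
        self.finished = []

    def get(self):
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def put_result(self, sample):
        self.finished.append(sample)


def rollout_worker(buffer, router_url):
    while (prompt := buffer.get()) is not None:
        # The router load-balances across SGLang engines; no per-engine pre-split needed.
        resp = requests.post(
            f"{router_url}/generate",
            json={"text": prompt, "sampling_params": {"max_new_tokens": 128}},
            timeout=600,
        )
        buffer.put_result({"prompt": prompt, "output": resp.json()})


def run_rollout(prompts, router_url="http://127.0.0.1:30000", concurrency=64):
    buffer = DataBuffer(prompts)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(rollout_worker, buffer, router_url)
    return buffer.finished
```

Because the buffer is the single source of truth for pending and finished samples, partial rollout or streaming rollout only needs to change when results are drained from the buffer, not how data is sharded across engines.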