
Commit fd8ae66

[1/N][rollout] feat: support vllm/sglang native http server (#3456)
### What does this PR do?

This is the first part of supporting the vllm/sglang native http server in server mode rollout. In native http server mode, the inference services are launched separately from the training engine; the model runner shares GPUs with the training engine but runs in a different process. We're going to support three deployment modes:

- **hybrid mode**: The training engine and model runner share GPUs but run in different processes. To sync weights, a server adapter in the training process acts as an http client, sending wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: The training engine and inference services have separate GPU resources in a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter, since there is no need to sync weights. This is mainly used for the GRM service (LLM as a judge).

<img width="2644" height="1276" alt="image" src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7" />

Following PRs will be:

- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
- [5/N] switch to the `/generate` http api with token-in-token-out: sglang already has a `/generate` api but may need some effort to support multi-modal inputs, while vllm still lacks a `/generate` api
- [6/N] switch to the sglang/vllm router for better kv-cache-aware load balancing

The native http server is inspired by the design of [slime](https://github.com/THUDM/slime); thanks for their prior work. Also credit to @ChangyiYang @zhaochenyang20 (#3090) and @SuperCB (#3102) for their prior contributions.
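To make the hybrid-mode control flow concrete, here is a minimal sketch of what the server adapter amounts to: an HTTP client in the training process that drives the inference server's lifecycle. The class name, routes, and payloads below are hypothetical; only the wake_up/sleep/update_weights operations come from the description above, and verl's actual endpoints may differ.

```python
# Hypothetical sketch of the hybrid-mode server adapter. Route names mirror
# the wake_up/sleep/update_weights operations described above; the real verl
# endpoints and payloads may differ.
import requests


class ServerAdapter:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, route: str, timeout: float, **payload):
        resp = requests.post(f"{self.base_url}/{route}", json=payload or None, timeout=timeout)
        resp.raise_for_status()
        return resp

    def wake_up(self) -> None:
        # Bring model weights/KV cache back onto the shared GPU before rollout.
        self._post("wake_up", timeout=60)

    def sleep(self) -> None:
        # Offload weights/KV cache so the training engine can use the GPU.
        self._post("sleep", timeout=60)

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the inference server at the latest policy weights (on-policy sync).
        self._post("update_weights", timeout=600, checkpoint_path=checkpoint_path)
```

In colocated mode the same inference services run without this adapter, since the judge model's weights never need to be synchronized.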
1 parent ac2f790 · commit fd8ae66

36 files changed: +1089 −1014 lines

.github/workflows/sgl.yml

Lines changed: 13 additions & 13 deletions
```diff
@@ -98,7 +98,7 @@ jobs:
 
   sgl:
     needs: setup
-    runs-on: [ "${{ needs.setup.outputs.runner-label || 'L20x8' }}" ]
+    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
     timeout-minutes: 35 # Increase this timeout value as needed
     env:
       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
@@ -119,10 +119,18 @@ jobs:
           pip3 install -e .[test]
       - name: Download Model to Use
         run: |
-          huggingface-cli download 'Qwen/Qwen2-7B-Instruct' --local-dir ${HOME}/models/Qwen/Qwen2-7B-Instruct
-          huggingface-cli download 'Qwen/Qwen2.5-0.5B' --local-dir ${HOME}/models/Qwen/Qwen2.5-0.5B
+          huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ${HOME}/models/Qwen/Qwen2.5-0.5B
           huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-1.5B-Instruct
+          huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct
           export HF_HUB_OFFLINE=1
+      - name: Prepare gsm8k dataset
+        run: |
+          ray stop --force
+          python3 examples/data_preprocess/gsm8k.py
+      - name: Test the latest SGLang Rollout async with agent loop
+        run: |
+          huggingface-cli download verl-team/gsm8k-v0.4.1 --repo-type dataset --local-dir ~/verl-data/gsm8k
+          ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop
       - name: Test the latest SGLang
         run: |
           cd tests/workers/rollout
@@ -151,10 +159,6 @@ jobs:
         run: |
           cd tests/workers/rollout
           pytest -s test_sglang_async_rollout_mcp_tools.py
-      - name: Test the latest SGLang Rollout async with agent loop
-        run: |
-          huggingface-cli download verl-team/gsm8k-v0.4.1 --repo-type dataset --local-dir ~/verl-data/gsm8k
-          ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop
       # Note(haibin.lin): for any new test, please update gpu_unit_tests.yaml to avoid repeated tests
       - name: Test the latest SGLang Rollout async with multimodal delta
         run: |
@@ -163,16 +167,12 @@
 
   cleanup:
     runs-on: ubuntu-latest
-    needs:
-      [
-        setup,
-        sgl
-      ]
+    needs: [setup, sgl]
     if: always()
     steps:
       - id: destroy-runner
         uses: volcengine/vemlp-github-runner@v1
         with:
           mode: "destroy"
           faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
-          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
+          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
```

.github/workflows/vllm.yml

Lines changed: 8 additions & 16 deletions
```diff
@@ -95,7 +95,7 @@ jobs:
 
   vllm:
     needs: setup
-    runs-on: [ "${{ needs.setup.outputs.runner-label || 'L20x8' }}" ]
+    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
     timeout-minutes: 35 # Increase this timeout value as needed
     env:
       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
@@ -110,20 +110,20 @@ jobs:
       - name: Install the current repository
         run: |
           pip3 install -e .[test]
-          pip install tensordict==0.6.2
       - name: Download Model to Use
         run: |
           huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct
           huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-1.5B-Instruct
-          huggingface-cli download 'Qwen/Qwen2-7B-Instruct' --local-dir ${HOME}/models/Qwen/Qwen2-7B-Instruct
-          huggingface-cli download 'deepseek-ai/deepseek-llm-7b-chat' --local-dir ${HOME}/models/deepseek-ai/deepseek-llm-7b-chat
-          huggingface-cli download 'OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN' --local-dir $HOME/models/OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN
+          huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct
+          huggingface-cli download OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN --local-dir ${HOME}/models/OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN
           export HF_HUB_OFFLINE=1
-          # Disable requests to avoid network errors
       - name: Prepare gsm8k dataset
         run: |
           ray stop --force
           python3 examples/data_preprocess/gsm8k.py
+      - name: Test the latest vLLM Rollout async with agent loop
+        run: |
+          ROLLOUT_NAME=vllm pytest -svvv tests/experimental/agent_loop
       - name: Test the latest vLLM
         run: |
           torchrun --standalone --nnodes=1 --nproc_per_node=4 $(which pytest) -s tests/workers/rollout/rollout_vllm/test_vllm_spmd.py
@@ -142,24 +142,16 @@ jobs:
           export OUTPUT_PATH="${HOME}/data/gen/qwen_05_gen_test.parquet"
           MODEL_ID=${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct NGPUS_PER_NODE=1 GEN_TP=1 bash ./run_gen_qwen05.sh
           rm -rf "${OUTPUT_PATH}"
-      - name: Test the latest vLLM Rollout async with agent loop
-        run: |
-          huggingface-cli download verl-team/gsm8k-v0.4.1 --repo-type dataset --local-dir ~/verl-data/gsm8k
-          ROLLOUT_NAME=vllm pytest -svvv tests/experimental/agent_loop
       # Note(haibin.lin): for any new test, please update gpu_unit_tests.yaml to avoid repeated tests
 
   cleanup:
     runs-on: ubuntu-latest
-    needs:
-      [
-        setup,
-        vllm
-      ]
+    needs: [setup, vllm]
     if: always()
     steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
-          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
+          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
```

recipe/one_step_off_policy/megatron_workers.py

Lines changed: 2 additions & 10 deletions
```diff
@@ -13,13 +13,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import copy
 import logging
 import os
 
 import torch
 import torch.distributed
-from omegaconf import DictConfig, OmegaConf, open_dict
+from omegaconf import DictConfig, OmegaConf
 
 from verl.single_controller.base.decorator import Dispatch, make_nd_compute_dataproto_dispatch_fn, register
 from verl.utils.config import omega_conf_to_dataclass
@@ -180,14 +179,7 @@ def init_model(self):
         log_gpu_memory_usage("Before building vllm rollout", logger=None)
 
         rollout_config: RolloutConfig = omega_conf_to_dataclass(self.config.rollout)
-        # (vermouth1992). self.config.model in megatron differs from that of fsdp in the override_config.
-        # To workaround this we deepcopy self.config.model and make them compatible
-        omega_model_config = copy.deepcopy(self.config.model)
-        with open_dict(omega_model_config):
-            override_config = omega_model_config.override_config.pop("model_config")
-            omega_model_config.override_config = override_config
-
-        model_config: HFModelConfig = omega_conf_to_dataclass(omega_model_config, dataclass_type=HFModelConfig)
+        model_config: HFModelConfig = omega_conf_to_dataclass(self.config.model, dataclass_type=HFModelConfig)
         rollout = get_rollout_class(rollout_config.name, rollout_config.mode)(
             config=rollout_config, model_config=model_config, device_mesh=rollout_device_mesh
         )
```

tests/experimental/agent_loop/test_agent_loop_reward.py

Lines changed: 4 additions & 3 deletions
```diff
@@ -18,7 +18,7 @@
 from torchdata.stateful_dataloader import StatefulDataLoader
 from transformers import AutoTokenizer
 
-from tests.experimental.agent_loop.agent_utils import init_agent_loop_manager
+from verl.experimental.agent_loop import AgentLoopManager
 from verl.protocol import DataProto
 from verl.trainer.main_ppo import create_rl_sampler
 from verl.utils.dataset.rl_dataset import RLHFDataset, collate_fn
@@ -45,15 +45,16 @@ def test_agent_loop_compute_score():
     config.actor_rollout_ref.actor.use_dynamic_bsz = True
     config.actor_rollout_ref.rollout.name = os.environ["ROLLOUT_NAME"]
     config.actor_rollout_ref.rollout.mode = "async"
+    config.actor_rollout_ref.rollout.enforce_eager = True
     config.actor_rollout_ref.rollout.prompt_length = 1024
     config.actor_rollout_ref.rollout.response_length = 4096
     config.actor_rollout_ref.rollout.skip_tokenizer_init = True
 
     # 1. init agent loop manager
-    agent_loop_manager = init_agent_loop_manager(config)
+    agent_loop_manager = AgentLoopManager(config)
 
     # 2. init dataset and dataloader
-    local_folder = os.path.expanduser("~/verl-data/gsm8k/")
+    local_folder = os.path.expanduser("~/data/gsm8k/")
     data_files = [os.path.join(local_folder, "train.parquet")]
     tokenizer = AutoTokenizer.from_pretrained(model_path)
 
```

tests/experimental/agent_loop/test_agent_loop_reward_model.py

Lines changed: 6 additions & 3 deletions
```diff
@@ -13,17 +13,19 @@
 # limitations under the License.
 import os
 
+import pytest
 import ray
 from hydra import compose, initialize_config_dir
 from torchdata.stateful_dataloader import StatefulDataLoader
 from transformers import AutoTokenizer
 
-from tests.experimental.agent_loop.agent_utils import init_agent_loop_manager
+from tests.experimental.agent_loop.agent_utils import AgentLoopManager
 from verl.protocol import DataProto
 from verl.trainer.main_ppo import create_rl_sampler
 from verl.utils.dataset.rl_dataset import RLHFDataset, collate_fn
 
 
+@pytest.mark.skip(reason="reward model is deprecated and replaced by GRM")
 def test_agent_loop_compute_score_with_model():
     ray.init(
         runtime_env={
@@ -45,6 +47,7 @@ def test_agent_loop_compute_score_with_model():
     config.actor_rollout_ref.actor.use_dynamic_bsz = True
     config.actor_rollout_ref.rollout.name = os.environ["ROLLOUT_NAME"]
     config.actor_rollout_ref.rollout.mode = "async"
+    config.actor_rollout_ref.rollout.enforce_eager = True
     config.actor_rollout_ref.rollout.prompt_length = 1024
     config.actor_rollout_ref.rollout.response_length = 4096
     config.actor_rollout_ref.rollout.skip_tokenizer_init = True
@@ -61,10 +64,10 @@ def test_agent_loop_compute_score_with_model():
     config.trainer.n_gpus_per_node = 4
     config.trainer.nnodes = 1
     # 1. init agent loop manager
-    agent_loop_manager = init_agent_loop_manager(config)
+    agent_loop_manager = AgentLoopManager(config)
 
     # 2. init dataset and dataloader
-    local_folder = os.path.expanduser("~/verl-data/gsm8k/")
+    local_folder = os.path.expanduser("~/data/gsm8k/")
     data_files = [os.path.join(local_folder, "train.parquet")]
     tokenizer = AutoTokenizer.from_pretrained(model_path)
 
```

tests/experimental/agent_loop/test_basic_agent_loop.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -22,6 +22,7 @@
 from transformers.utils import get_json_schema
 
 from tests.experimental.agent_loop.agent_utils import init_agent_loop_manager
+from verl.experimental.agent_loop import AgentLoopManager
 from verl.experimental.agent_loop.agent_loop import get_trajectory_info
 from verl.protocol import DataProto
 from verl.tools.base_tool import BaseTool, OpenAIFunctionToolSchema
@@ -53,6 +54,7 @@ def init_config() -> DictConfig:
     config.actor_rollout_ref.model.path = model_path
     config.actor_rollout_ref.rollout.name = os.environ["ROLLOUT_NAME"]
     config.actor_rollout_ref.rollout.mode = "async"
+    config.actor_rollout_ref.rollout.enforce_eager = True
     config.actor_rollout_ref.rollout.prompt_length = 4096
     config.actor_rollout_ref.rollout.response_length = 4096
     config.actor_rollout_ref.rollout.n = 4
@@ -74,7 +76,7 @@ def test_single_turn(init_config):
         }
     )
 
-    agent_loop_manager = init_agent_loop_manager(init_config)
+    agent_loop_manager = AgentLoopManager(init_config)
     tokenizer = hf_tokenizer(init_config.actor_rollout_ref.model.path)
     reward_fn = load_reward_manager(
         init_config, tokenizer, num_examine=0, **init_config.reward_model.get("reward_kwargs", {})
@@ -223,7 +225,7 @@ def test_tool_agent(init_config):
     init_config.actor_rollout_ref.rollout.multi_turn.tool_config_path = tool_config_path
     init_config.actor_rollout_ref.rollout.multi_turn.max_parallel_calls = 2
     init_config.actor_rollout_ref.rollout.calculate_log_probs = True
-    agent_loop_manager = init_agent_loop_manager(init_config)
+    agent_loop_manager = AgentLoopManager(init_config)
 
     # =========================== 2. Generate sequences ===========================
     raw_prompts = [
```

tests/experimental/agent_loop/test_multi_modal.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -22,7 +22,7 @@
 from PIL import Image
 from transformers.utils import get_json_schema
 
-from tests.experimental.agent_loop.agent_utils import init_agent_loop_manager
+from verl.experimental.agent_loop import AgentLoopManager
 from verl.protocol import DataProto
 from verl.tools.base_tool import BaseTool, OpenAIFunctionToolSchema
 from verl.tools.schemas import ToolResponse
@@ -48,6 +48,7 @@ def init_config() -> DictConfig:
     config.actor_rollout_ref.model.path = model_path
     config.actor_rollout_ref.rollout.name = os.environ["ROLLOUT_NAME"]
     config.actor_rollout_ref.rollout.mode = "async"
+    config.actor_rollout_ref.rollout.enforce_eager = True
     config.actor_rollout_ref.rollout.prompt_length = 4096
     config.actor_rollout_ref.rollout.response_length = 4096
     config.actor_rollout_ref.rollout.n = 4
@@ -147,7 +148,7 @@ def test_multimodal_tool_agent(init_config):
     init_config.actor_rollout_ref.rollout.multi_turn.tool_config_path = tool_config_path
     init_config.actor_rollout_ref.rollout.multi_turn.max_parallel_calls = 1
     init_config.actor_rollout_ref.rollout.multi_turn.max_user_turns = 1
-    agent_loop_manager = init_agent_loop_manager(init_config)
+    agent_loop_manager = AgentLoopManager(init_config)
 
     # =========================== 2. Generate sequences with multimodal prompts ===========================
     raw_prompts = [
```
Lines changed: 89 additions & 0 deletions
```diff
@@ -0,0 +1,89 @@
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import asyncio
+import os
+
+import pytest
+import ray
+from omegaconf import DictConfig
+from openai import AsyncOpenAI
+
+from verl.workers.rollout.replica import get_rollout_replica_class
+
+
+@pytest.fixture
+def init_config() -> DictConfig:
+    from hydra import compose, initialize_config_dir
+
+    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
+        config = compose(config_name="ppo_trainer")
+
+    config.trainer.n_gpus_per_node = 4
+    config.trainer.nnodes = 2
+    config.actor_rollout_ref.model.path = os.path.expanduser("~/models/Qwen/Qwen2.5-1.5B-Instruct")
+    config.actor_rollout_ref.rollout.name = os.environ["ROLLOUT_NAME"]
+    config.actor_rollout_ref.rollout.load_format = "auto"
+    config.actor_rollout_ref.rollout.enforce_eager = True
+
+    return config
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize("tp_size", [2, 4])
+async def test_standalone_(init_config, tp_size):
+    """Test standalone rollout single node and multi nodes."""
+    ray.init(
+        runtime_env={
+            "env_vars": {
+                "TOKENIZERS_PARALLELISM": "true",
+                "NCCL_DEBUG": "WARN",
+                "VLLM_LOGGING_LEVEL": "INFO",
+                "VLLM_USE_V1": "1",
+            }
+        }
+    )
+
+    init_config.actor_rollout_ref.rollout.skip_tokenizer_init = False
+    init_config.actor_rollout_ref.rollout.tensor_model_parallel_size = tp_size
+    num_replicas = (init_config.trainer.n_gpus_per_node * init_config.trainer.nnodes) // tp_size
+
+    # create standalone rollout server
+    rollout_server_class = get_rollout_replica_class(init_config.actor_rollout_ref.rollout.name)
+    rollout_servers = [
+        rollout_server_class(replica_rank=replica_rank, config=init_config, gpus_per_node=2)
+        for replica_rank in range(num_replicas)
+    ]
+    await asyncio.gather(*[server.init_standalone() for server in rollout_servers])
+
+    server_handles = [server._server_handle for server in rollout_servers]
+    server_addresses = [server._server_address for server in rollout_servers]
+    assert len(server_handles) == num_replicas
+    assert len(server_addresses) == num_replicas
+
+    os.environ.pop("HTTPS_PROXY", None)
+    os.environ.pop("HTTP_PROXY", None)
+    os.environ.pop("NO_PROXY", None)
+
+    client = AsyncOpenAI(
+        api_key="123-abc",
+        base_url=f"http://{server_addresses[0]}/v1",
+    )
+
+    completion = await client.chat.completions.create(
+        model=init_config.actor_rollout_ref.model.path,
+        messages=[{"role": "user", "content": "What can you do?"}],
+    )
+    print(completion.choices[0].message.content)
+
+    ray.shutdown()
```
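Each replica in the test above exposes an independent OpenAI-compatible endpoint, so until the router planned in [6/N] lands, a caller can spread requests across replicas client-side. Below is a minimal round-robin sketch; the `fanout` helper is hypothetical and reuses the dummy API key and address list from the test.

```python
# Hypothetical client-side load balancing across standalone rollout replicas.
# Each address in server_addresses is an independent OpenAI-compatible server,
# as gathered in the test above; round-robin is the simplest possible policy.
import asyncio

from openai import AsyncOpenAI


async def fanout(server_addresses: list[str], prompts: list[str], model: str):
    clients = [
        AsyncOpenAI(api_key="123-abc", base_url=f"http://{addr}/v1")
        for addr in server_addresses
    ]
    tasks = [
        # Assign prompt i to replica i % num_replicas.
        clients[i % len(clients)].chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        for i, prompt in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)
```

A kv-cache-aware router (the [6/N] plan) improves on this by steering requests with shared prefixes to the replica that already holds the matching cache entries.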

tests/special_e2e/ppo_trainer/run_function_reward.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -41,7 +41,7 @@ LORA_ALPHA=${LORA_ALPHA:-${LORA_RANK}}
 LORA_TARGET=${LORA_TARGET:-"all-linear"}
 LORA_EXCLUDE=${LORA_EXCLUDE:-"DONT_EXCLUDE"}
 USE_SHM=${USE_SHM:-False}
-LOAD_FORMAT=${LOAD_FORMAT:-dummy_dtensor}
+LOAD_FORMAT=${LOAD_FORMAT:-dummy}
 LAYERED_SUMMON=${LAYERED_SUMMON:-False}
 # Validation
 VAL_BEFORE_TRAIN=${VAL_BEFORE_TRAIN:-False}
```

tests/special_sanity/check_device_api_usage.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -38,6 +38,8 @@
     "verl/workers/reward_model/megatron/reward_model.py",  # appear in default device_name
     "verl/third_party/torch/distributed/_state_dict_utils.py",  # torch monkey patch fixes
     "verl/third_party/torch/distributed/checkpoint/state_dict.py",  # torch monkey patch fixes
+    "verl/workers/rollout/vllm_rollout/vllm_async_server.py",  # appear in config.cudagraph_capture_sizes
+    "verl/workers/rollout/sglang_rollout/async_sglang_server.py",  # manually set CUDA_VISIBLE_DEVICES
 ]
 
 # directory or file path must contain keyword "nccl"
```
