
Conversation

@ChangyiYang (Contributor) commented Aug 17, 2025

What does this PR do?

Summary

This PR introduces a native HTTP server implementation for SGLang, aiming to fundamentally improve flexibility, scalability, and integration capabilities. By transitioning to a more robust client-server architecture, this change addresses several core bottlenecks in the current design.

Key Changes

  • Engine Replacement – Replaced the original sgl.Engine instance with a native HTTP server. ✅ Completed
  • Distributed Optimization – Utilizing a server-based architecture to remove the requirement of gathering all data to TP rank 0. This change resolves the previous dist.barrier timeout issue by replacing the collective wait with per-sample synchronization. 🚧 In Progress
  • Router Integration – Plan to integrate with the native SGLang router for streamlined request handling. 💡 Nice to have

Motivation

The current sgl.Engine driver model presents several architectural challenges, particularly in complex distributed environments. Moving to an HTTP server architecture is motivated by the need to solve the following critical issues:

  1. Eliminate Data Flow Bottlenecks and Improve Performance:

    • Problem: The data flow logic of the existing driver process is misaligned with the training data flow. It requires all data for a single SGLang instance to be gathered to TP rank 0. This data is then processed by the tokenizer manager and sent via ZMQ to the various schedulers. As a result, the preprocess and postprocess steps are slower than expected.
    • Solution: The HTTP server architecture decentralizes this process, allowing each rank to handle requests independently. This removes the "gather to rank 0" bottleneck, dramatically improving data throughput and overall performance.
  2. Resolve CPU Resource Contention:

    • Problem: At the request level, the SGLang driver object cannot be pickled for use in subprocesses. This limitation means that the request-level asynchronous rollout logic and the engine itself are forced to compete for the same CPU time slices, leading to performance degradation.
    • Solution: By decoupling the request handling (client) from the inference engine (server), we isolate the processes, eliminating the CPU contention and allowing for more efficient resource utilization.
  3. Fix Distributed Synchronization Timeouts:

    • Problem: The dist.barrier timeout is a frequent issue where worker ranks remain idle while waiting for TP rank 0 to complete its intensive processing. This collective wait time creates inefficiency and can lead to failures.
    • Solution: The HTTP server model shifts this from a collective barrier to a per-sample synchronization. Workers communicate with the server as needed, removing the long wait times and making the distributed setup more stable and efficient. (A sketch of this per-sample request flow follows below.)
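A minimal sketch (not this PR's actual code) of what per-sample synchronization looks like from a rollout worker's point of view: each sample becomes an independent HTTP request to the local SGLang server, so no rank has to gather the whole batch or sit in a collective barrier. The `/generate` endpoint and payload shape follow SGLang's native server API; the host, port, and sampling parameters here are illustrative assumptions.

```python
# Sketch: per-sample async requests to a local SGLang HTTP server.
# Host/port and sampling params are assumptions for illustration.
import asyncio
import aiohttp

SERVER_URL = "http://127.0.0.1:30000/generate"  # assumed local server address

async def generate_one(session: aiohttp.ClientSession, prompt: str) -> str:
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 256, "temperature": 1.0},
    }
    async with session.post(SERVER_URL, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["text"]

async def generate_batch(prompts: list[str]) -> list[str]:
    # Each sample is an independent request, so no collective barrier is needed.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate_one(session, p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(generate_batch(["1 + 1 = ?", "Name a prime number."])))
```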

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a native HTTP server for SGLang, which is a significant architectural improvement for flexibility and scalability. The implementation is comprehensive, including synchronous and asynchronous adapters, extensive test coverage, and example configurations. However, I've identified a few critical issues that need to be addressed. There's a configuration mismatch that will cause runtime errors, and the server launch logic for multi-node setups appears to be flawed, which would prevent it from working correctly in distributed environments. Additionally, there's a performance-related regression due to a hardcoded logging level.

Comment on lines 457 to 486
args = {
    "model_path": actor_module,
    "dtype": self.config.dtype,
    "mem_fraction_static": self.config.gpu_memory_utilization,
    "enable_memory_saver": True,
    "base_gpu_id": 0,
    "gpu_id_step": 1,
    "tp_size": self._tp_size,
    "node_rank": node_rank,
    "load_format": load_format,
    "dist_init_addr": dist_init_addr,
    "nnodes": nnodes,
    "trust_remote_code": trust_remote_code,
    # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
    # when random.seed is being set during training
    "port": 30000 + rank,
    # NOTE(Chenyang): if you want to debug the SGLang engine output
    # please set the following parameters
    # Otherwise, it will make the engine run too slow
    "log_level": "info",
    # "log_level": "error",
    # log_requests=True,
    # log_requests_level=2,
    # NOTE(Chenyang): turn on max_running_requests to set the max concurrent running requests
    # max_running_requests=1,
    "mm_attention_backend": "fa3",
    "attention_backend": attention_backend if attention_backend is not None else "fa3",
    # In async mode, we want token in token out.
    "skip_tokenizer_init": self.config.mode == "async",
}
@gemini-code-assist bot (severity: high):

This refactoring has introduced a regression where engine_kwargs from the configuration are no longer passed to the SGLang engine constructor (AsyncEngine or AsyncHttpServerAdapter). This prevents overriding engine parameters like log_level. Additionally, log_level is now hardcoded to "info", which, as noted in the comments, can cause significant performance degradation and is not suitable for production. The engine_kwargs should be merged into the args dictionary, and the log_level should be configurable rather than hardcoded.

            args = {
                "model_path": actor_module,
                "dtype": self.config.dtype,
                "mem_fraction_static": self.config.gpu_memory_utilization,
                "enable_memory_saver": True,
                "base_gpu_id": 0,
                "gpu_id_step": 1,
                "tp_size": self._tp_size,
                "node_rank": node_rank,
                "load_format": load_format,
                "dist_init_addr": dist_init_addr,
                "nnodes": nnodes,
                "trust_remote_code": trust_remote_code,
                # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
                # when random.seed is being set during training
                "port": 30000 + rank,
                # NOTE(Chenyang): if you want to debug the SGLang engine output
                # please set the following parameters
                # Otherwise, it will make the engine run too slow
                # "log_level": "error",
                # log_requests=True,
                # log_requests_level=2,
                # NOTE(Chenyang): turn on max_running_requests to set the max concurrent running requests
                # max_running_requests=1,
                "mm_attention_backend": "fa3",
                "attention_backend": attention_backend if attention_backend is not None else "fa3",
                # In async mode, we want token in token out.
                "skip_tokenizer_init": self.config.mode == "async",
                **engine_kwargs,
            }

@ChangyiYang ChangyiYang marked this pull request as ready for review August 17, 2025 01:02
@ChangyiYang ChangyiYang changed the title [WIP][sglang] feat: add native sgl server [sglang] feat: add native sgl server Aug 17, 2025
@zhaochenyang20 (Collaborator) commented Aug 19, 2025

FAILED test_sglang_async_rollout_w_interaction.py::test_async_sglang_rollout_w_interaction - omegaconf.errors.ConfigAttributeError: Missing key sglang_engine_mode
    full_key: sglang_engine_mode
    object_type=dict

@ChangyiYang fix the CI plz.

4c3310d


@ChangyiYang (Contributor, Author)

@zhaochenyang20 Some of the CI failures should be fixed now. Can you rerun it?

@zhaochenyang20 (Collaborator) commented on this snippet:

python3 -m verl.trainer.main_ppo \
--config-path="$CONFIG_PATH" \
--config-name='gsm8k_multiturn_grpo_server' \

Add a comment to say this would use the server.

@wuxibin89 (Collaborator)

  1. This PR mainly addresses issues in batch mode with generate_sequences, while batch mode is going to be deprecated and switched to server mode by default ([rollout] feat: change rollout default mode from spmd to server mode #3161). So we tend not to make any major new changes to batch mode except bug fixes.
  2. For the server mode, I found it's pretty easy to add a native SGLang server with a few lines:
    wuxibin89/verl@wuxibin/refactor_rollout2...wuxibin/sglang_native_server

@ChangyiYang (Contributor, Author)

Correct me if I am wrong.
In my opinion, the code @wuxibin89 provided does NOT provide the native SGLang server path. We want the "generate" API to send an HTTP request to an SGLang server. In the provided code, an SGLang server is launched, but the generate API still goes to the local engine instance, which would not allow adding the SGLang router in the future. So I think the provided code cannot do what this PR does. At the least, the SGLang client that the rollout holds (which sends requests to an SGLang server) needs to provide the same API that the local engine provides.

@wuxibin89 wuxibin89 merged commit e95bd9e into volcengine:main Aug 28, 2025
55 of 58 checks passed
@ChangyiYang (Contributor, Author) commented Aug 28, 2025

Thanks @zhaochenyang20 for the detailed review and guidance! Thanks @SwordFaith for providing the base code and guidance along the way!

susumuota pushed a commit to susumuota/verl that referenced this pull request Aug 28, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 29, 2025
cczitong123 pushed a commit to cczitong123/verl that referenced this pull request Sep 5, 2025
DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request Sep 5, 2025
@sangnekim

Hello, thank you for your work.
I think the variable sglang_engine_mode should be renamed to sglang_rollout_mode.

  • Is the dist.barrier() timeout issue still in progress?

@ChangyiYang (Contributor, Author)

Hi @sangnekim. Yes, it would be better to change the naming as you suggest. The dist.barrier work won't proceed, since verl will adopt the agent loop and all the old rollout functions will be deprecated.

@lizipao commented Sep 11, 2025

sglang_engine_mode

@ChangyiYang
TypeError: RolloutConfig.__init__() got an unexpected keyword argument 'sglang_rollout_mode'
Is this a bug?

@ChangyiYang (Contributor, Author) commented Sep 11, 2025

@lizipao Yes, the config has a typo. I will change that. For now, you can change this line:

sglang_engine_mode: str = "local"
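
For context, a minimal sketch of where this workaround lands; only the field name and default come from the thread, and the surrounding RolloutConfig shape is an assumption:

```python
from dataclasses import dataclass

@dataclass
class RolloutConfig:
    # ... other rollout fields ...
    # The config key and this field name disagree (`sglang_rollout_mode` vs
    # `sglang_engine_mode`), hence the TypeError above. Until the typo is fixed
    # upstream, align the two names by editing this line (or the config key).
    sglang_engine_mode: str = "local"
```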

@lizipao commented Sep 12, 2025

@lizipao Yes, the config has a typo. I will change that. For now you can change this line

sglang_engine_mode: str = "local"

Have you encountered this error before?
[torch_memory_saver.cpp] CUresult error: 1 (invalid argument) file=csrc/core.cpp func=pause line=77
[http_server_engine.py:691] : Async request to release_memory_occupation timed out (attempt 1)

@ChangyiYang (Contributor, Author)

@lizipao I never encountered this error before.

vermouth1992 pushed a commit that referenced this pull request Sep 16, 2025
### What does this PR do?

This is the first part of supporting a vllm/sglang native HTTP server in server-mode rollout. In native HTTP server mode, the inference services are launched separately from the training engine, and the model runner shares the GPU with the training engine but runs in different processes.

We're going to support three deployment modes:
- **hybrid mode**: Training engine and model runner share the GPU but run in different processes. To sync weights, there is a server adapter in the training process, which is an HTTP client that sends wake_up/sleep/update_weights requests to the inference server. This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources in a disaggregated architecture. This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter since there is no need to sync weights. This is mainly used for the GRM service (LLM as a judge).
<img width="2644" height="1276" alt="image"
src="https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7"
/>

Following PRs will be:
- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep(without weight
synchronization)
- [5/N] switch to the `/generate` HTTP API with token-in-token-out: currently sglang has a `/generate` API but may need some effort to support multi-modal, while vllm still lacks a `/generate` API
- [6/N] switch to the sglang/vllm router with better kv-cache-aware load balancing

The native HTTP server is inspired by the design of [slime](https://github.com/THUDM/slime); thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 (#3090) and @SuperCB (#3102) for their prior contributions.
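
The hybrid-mode description above mentions a server adapter in the training process that drives the inference server over HTTP around each training step. Below is a minimal sketch of such an adapter; the endpoint names, payload shapes, and base URL are assumptions for illustration, not the actual API of this PR.

```python
# Hypothetical hybrid-mode server adapter: an HTTP client in the training process
# that asks the colocated inference server to sleep (release GPU memory), wake up,
# and reload weights. All endpoint names and payloads are assumptions.
import requests

class ServerAdapter:
    def __init__(self, base_url: str = "http://127.0.0.1:30000", timeout: float = 300.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def _post(self, endpoint: str, payload: dict | None = None) -> dict:
        resp = requests.post(f"{self.base_url}/{endpoint}", json=payload or {}, timeout=self.timeout)
        resp.raise_for_status()
        return resp.json() if resp.content else {}

    def sleep(self) -> None:
        # Ask the server to release GPU memory so the training engine can use it.
        self._post("sleep")

    def wake_up(self) -> None:
        # Ask the server to re-acquire GPU memory before the next rollout.
        self._post("wake_up")

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the server at the latest trained weights (transfer mechanism varies).
        self._post("update_weights", {"checkpoint_path": checkpoint_path})

# Illustrative usage around a training step:
# adapter = ServerAdapter()
# adapter.sleep()                              # free GPU memory for training
# ...run optimizer step...
# adapter.update_weights("/tmp/ckpt/step_100")
# adapter.wake_up()                            # resume serving with fresh weights
```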
VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request Sep 24, 2025
WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025