[roadmap] verl development Q2 #710

Past roadmap for reference: https://github.com/volcengine/verl/issues/22

## Agentic RL: Environment interaction & tool support [P0]
- [x] Integrate [SandBox](https://github.com/bytedance/SandboxFusion/tree/main) for code generation tasks (more mature than verl's current code sandbox)
- [ ] search / environment interaction via HTTP / gRPC
- [x] multi-turn optimizations, <del>better KV cache management and streaming generation (potential inference engine dependency)</del> https://github.com/volcengine/verl/pull/1037 https://github.com/volcengine/verl/pull/1138 (KV cache optimization will be left to inference engines)
- [ ] further multi-turn rollout improvements, see https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/131

## Scaling up RL & system performance [P0]
- [ ] Ring Attention
- [x] Ulysses sequence parallel for VLMs, e.g. Qwen2VL
- [x] reference system tuning script for best RL throughput on different types of accelerators
- [x] multi-node rollout (potential inference engine dependency)
- [x] alignment loss fused kernels https://github.com/volcengine/verl/pull/1212

## Usability improvement 
- [x] make the current ray trainer easier to extend (without modifying verl source code or maintaining forks). Users can currently define a custom reward via the command line without modifying verl source code. Ideally, the main training loop should allow custom datasets as well: https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L41
- [x] collect benchmark results on TorchTitan vs. Megatron
- [ ] support TorchTitan N-D parallelism for better usability
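
As an illustration of the custom-reward item above, here is a minimal sketch of a standalone reward module that lives outside the verl source tree. The function name `compute_score` and its parameters (`data_source`, `solution_str`, `ground_truth`, `extra_info`) are assumptions about the expected interface, and the scoring rule (exact match on the last number in the response) is purely illustrative; check the current verl docs before relying on either.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Toy reward: 1.0 if the last number in the response matches the label.

    The signature is an assumed interface for a user-supplied reward
    function; `data_source` and `extra_info` are accepted but unused here.
    """
    # Collect every integer or decimal number that appears in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str)
    if not numbers:
        return 0.0
    # Compare the final number with the label as plain strings.
    return 1.0 if numbers[-1] == str(ground_truth).strip() else 0.0
```

Assuming the file is saved as `my_reward.py`, it would be selected at launch time with command-line overrides along the lines of `custom_reward_function.path=my_reward.py custom_reward_function.name=compute_score`. These option names are also assumptions and should be verified against the trainer config in the installed verl version.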

## Latest Model & Algorithm Support
See https://verl.readthedocs.io/en/latest/advance/fsdp_extension.html for adding models with FSDP backend
See https://verl.readthedocs.io/en/latest/advance/megatron_extension.html for adding models with Megatron backend. 
- [ ] gemma3 https://github.com/volcengine/verl/pull/1613/files
- [x] deepseek v3 - since it's large, we should start with SFT for correctness verification, then optimize it with fused kernels/recomputation before moving to RL https://github.com/volcengine/verl/issues/708 https://github.com/volcengine/verl/pull/1771
- [x] qwen3 & qwen3-moe https://github.com/huggingface/transformers/pull/36878 
and any other popular models. 
- [ ] OLMo2
- [x] Dr. GRPO https://github.com/volcengine/verl/issues/742

## Component Continuous Updates
- [x] verify that Ulysses sequence parallelism works with the latest versions of transformers (>= v4.50)
- [x] replace FSDP1 with FSDP2 https://github.com/volcengine/verl/pull/1026 
- [x] add activation offloading optimization https://github.com/volcengine/verl/pull/1220/files 

## Dataset & benchmark
- [ ] GPQA Diamond (English)
- [ ] LiveCodeBench (code)
- [ ] SWE-bench Verified (code)
- [ ] CNMO 2024 (math)
- [ ] CodeContests (code generation)
- [ ] TACO (code generation)
- [ ] competition_math (math)

Please also help provide scripts to reproduce the evaluation performance of publicly released models.

## Efficient RL / codesign [P1]
- [x] LoRA support for RL, with a convergence report https://github.com/volcengine/verl/pull/1127

## Wide Hardware Coverage
Make the experience on non-NVIDIA hardware smoother
- [x] stable Ascend NPU support, with reproducible examples and logs
- [x] stable AMD GPU support, with SGLang
- [x] AMD GPU with mcore support

## Make verl easier to extend with custom train/infer engine and roles
- [x] https://github.com/volcengine/verl/pull/1424
- [x] https://github.com/volcengine/verl/issues/1371 

## Other community requests

- ReTool https://github.com/volcengine/verl/issues/1169
- load balancing #658
