
Megatron + SGLang training hits NCCL timeout during rollout generation #2325

@Xialinxuan

Description


My environment: 64× H20 GPUs (8 nodes), NCCL 2.21.5, CUDA 12.4
Basic hyperparameter settings:

offload=True
tensor_model_parallel_size=8
pipeline_model_parallel_size=1
sgl_tensor_model_parallel_size=8

Some package version information:

torch                             2.6.0
torch_memory_saver                0.0.8
sgl-kernel                        0.1.4
sglang                            0.4.6.post5
verl                              0.4.1
vllm                              0.8.2
megatron-core                     0.12.1

The log indicates this error occurs when executing the code below:

[screenshots: code and stack trace at the point of failure]

I can train with the same configuration using vLLM (tp=8) without issues, but when I switch to SGLang and generate rollouts, this NCCL communication timeout occurs (even with nccl_timeout=7200). I suspect some process is getting stuck for some reason. Are there any methods to debug and identify the specific cause? Let me know if additional details would help resolve this problem.
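Not from the issue itself, but one way to narrow this down (a sketch, assuming a torchrun-style launch on PyTorch 2.6, where the `TORCH_NCCL_*` flight-recorder variables are available) is to turn on NCCL/c10d debug output before launching, then inspect the Python stack of a suspect rank with py-spy once it hangs:

```shell
# Hedged debugging setup -- adjust to your own launcher/wrapper script.
export NCCL_DEBUG=INFO                     # per-rank NCCL logs (init, collectives)
export NCCL_DEBUG_SUBSYS=INIT,COLL         # limit log volume to the relevant subsystems
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1   # surface NCCL errors instead of hanging silently
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # enable the c10d flight recorder
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # dump collective traces when the watchdog fires

# After the hang, on the node hosting a suspect rank, dump its Python stacks
# to see which collective/generation call it is blocked in (PID is illustrative):
# py-spy dump --pid <rank_pid>
```

Comparing the flight-recorder dumps across ranks shows which collective each rank last entered; the rank whose trace stops earliest is usually the one that got stuck (e.g. during weight resharding or memory offload/restore around rollout).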
