
Megatron + SGLang training hits NCCL timeout during rollout generation #2325

@Xialinxuan

Description


My environment: 64× H20 GPUs (8 nodes), NCCL 2.21.5, CUDA 12.4
Basic hyperparameter settings:

offload=True
tensor_model_parallel_size=8
pipeline_model_parallel_size=1
sgl_tensor_model_parallel_size=8

Some package version information:

torch                             2.6.0
torch_memory_saver                0.0.8
sgl-kernel                        0.1.4
sglang                            0.4.6.post5
verl                              0.4.1
vllm                              0.8.2
megatron-core                     0.12.1

The log indicates this error occurs when executing the code below:

[screenshots: code and stack trace at the point of failure]

I can train with the same configuration using vLLM (tp=8) without issues, but when I switch to SGLang and generate rollouts, this NCCL communication timeout occurs (even with nccl_timeout=7200). I suspect some process is getting stuck for some reason. Are there any methods to debug and identify the specific cause? Let me know if additional details would help resolve this problem.
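Not from the issue itself, but one way to narrow this down (a sketch, assuming a torchrun-style launch on PyTorch 2.6, where the `TORCH_NCCL_*` flight-recorder variables are available) is to turn on NCCL/c10d debug output before launching, then inspect the Python stack of a suspect rank with py-spy once it hangs:

```shell
# Hedged debugging setup -- adjust to your own launcher/wrapper script.
export NCCL_DEBUG=INFO                     # per-rank NCCL logs (init, collectives)
export NCCL_DEBUG_SUBSYS=INIT,COLL         # limit log volume to the relevant subsystems
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1   # surface NCCL errors instead of hanging silently
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # enable the c10d flight recorder
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # dump collective traces when the watchdog fires

# After the hang, on the node hosting a suspect rank, dump its Python stacks
# to see which collective/generation call it is blocked in (PID is illustrative):
# py-spy dump --pid <rank_pid>
```

Comparing the flight-recorder dumps across ranks shows which collective each rank last entered; the rank whose trace stops earliest is usually the one that got stuck (e.g. during weight resharding or memory offload/restore around rollout).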
