My environment: 64× H20 GPUs (8 nodes), NCCL 2.21.5, CUDA 12.4
Basic hyperparameter settings:
offload=True
tensor_model_parallel_size=8
pipeline_model_parallel_size=1
sgl_tensor_model_parallel_size=8
Some package version information:
torch 2.6.0
torch_memory_saver 0.0.8
sgl-kernel 0.1.4
sglang 0.4.6.post5
verl 0.4.1
vllm 0.8.2
megatron-core 0.12.1
The log indicates this error occurs when executing the code below:
I can train with the same configuration using vLLM (tp=8) without issues, but when I switch to SGLang and generate rollouts, this NCCL communication timeout occurs (even with nccl_timeout=7200). I suspect some process is getting stuck for some reason. Are there any methods to debug and identify the specific cause? Let me know if additional details would help resolve this problem.
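Not an official fix, but a minimal sketch of one way to narrow down which rank is stuck: enable per-rank NCCL logging before process-group init and use the stdlib `faulthandler` to dump Python stacks if a process stays alive past a deadline. The log filename and the 1800 s timeout here are illustrative choices, not values from this setup.

```python
import faulthandler
import os

# Assumption: these must be set before torch.distributed / NCCL is
# initialized, e.g. at the top of the training entrypoint on every rank.
os.environ.setdefault("NCCL_DEBUG", "INFO")        # per-rank NCCL logs
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL") # focus on collectives

# Illustrative per-rank stack log (name is hypothetical).
log = open("rank_stacks.log", "w")

# If the process is still running after `timeout` seconds, dump every
# thread's Python stack to the log, repeatedly, so the stuck call site
# (e.g. a collective one rank never entered) becomes visible.
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=log)

# A one-shot dump can also be triggered manually at any point:
faulthandler.dump_traceback(file=log, all_threads=True)
```

Comparing the dumped stacks across ranks usually shows which rank never reached the collective that the others are blocked in; the `NCCL_DEBUG=INFO` output then identifies the communicator involved. Tools like `py-spy dump --pid <pid>` give the same stack view without modifying the code.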