Hi team,
I’m trying to reproduce this example training of qwen-omni, but I consistently hit CUDA OOM.
Hardware
- 1 node, 8× NVIDIA H100 80GB
- CPU RAM: 256 GB
Symptoms
- With ZeRO-0: CUDA OOM on the first forward/backward steps.
- Switching to ZeRO-3 avoids OOM but sometimes triggers NCCL collective timeouts (_ALLGATHER_BASE watchdog).
Questions
- In your reproduction, did you train LoRA with ZeRO-0 on 8× H100 80GB?
- If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? (The config I'm using consistently OOMs.)
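For reference, the ZeRO-0 DeepSpeed config I'm currently running is roughly the following (trimmed to the relevant fields; the batch and precision values shown are my local settings, not necessarily the example's defaults):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": { "stage": 0 }
}
```

Happy to post the full JSON if that helps with diagnosis.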
- Could you please provide a requirements.txt (or conda env) for training Qwen 2.5 Omni with this example?
- Version pins for torch / deepspeed / nccl would be very helpful.
What I tried
- Lowering per_device_train_batch_size (16 → 2).
- Shorter seq lengths (e.g., query_max_len=256, passage_max_len=256).
- Enabling gradient checkpointing.
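For completeness, the launch command I'm using with those reduced settings looks like this (the script and flag names are from my local setup and may not match the example exactly):

```shell
deepspeed --num_gpus 8 run_train.py \
  --deepspeed ds_config_zero0.json \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --query_max_len 256 \
  --passage_max_len 256 \
  --gradient_checkpointing True
```

Even with this configuration, ZeRO-0 still OOMs on the first steps.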
If you have a minimal working config (train args + DS json) and a requirements.txt, that would greatly help us reproduce your results. Thanks!