Hi team,
I’m trying to reproduce this example training of qwen-omni, but I consistently hit CUDA OOM.
Hardware
- 1 node, 8× NVIDIA H100 80GB
- CPU RAM: 256 GB
Symptoms
- With ZeRO-0: CUDA OOM on the first forward/backward steps.
- Switching to ZeRO-3 avoids OOM but sometimes triggers NCCL collective timeouts (_ALLGATHER_BASE watchdog).
Questions
- In your reproduction, did you train LoRA with ZeRO-0 on 8× H100 80GB?
- If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? (The config I'm using consistently OOMs.)
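For reference, the ZeRO-0 DeepSpeed config I'm currently running is roughly the following (trimmed to the relevant fields; the batch and precision values shown are my local settings, not necessarily the example's defaults):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": { "stage": 0 }
}
```

Happy to post the full JSON if that helps with diagnosis.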
- Could you please provide a requirements.txt (or conda env) for training Qwen 2.5 Omni with this example?
- Version pins for torch / deepspeed / nccl would be very helpful.
What I tried
- Lowering per_device_train_batch_size (16 → 2).
- Shorter seq lengths (e.g., query_max_len=256, passage_max_len=256).
- Enabling gradient checkpointing.
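For completeness, the launch command I'm using with those reduced settings looks like this (the script and flag names are from my local setup and may not match the example exactly):

```shell
deepspeed --num_gpus 8 run_train.py \
  --deepspeed ds_config_zero0.json \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --query_max_len 256 \
  --passage_max_len 256 \
  --gradient_checkpointing True
```

Even with this configuration, ZeRO-0 still OOMs on the first steps.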
If you have a minimal working config (train args + DS json) and a requirements.txt, that would greatly help us reproduce your results. Thanks!