[Bug] Qwen3_5 SFT on ROCm (gfx1201) fails at step 0: FailOnRecompileLimitHit (fullgraph=True)

> **Note:** This report was drafted with the help of AI (assisted structuring, log analysis, and wording). All technical details, versions, reproduction steps, and workarounds described below were verified on my local machine — please treat anything that looks off as something I'm happy to clarify.

Hi Unsloth team 👋

I'm running **Unsloth Studio** on an **AMD ROCm** workstation (dual RX 9070 XT, RDNA4 / `gfx1201`) and ran into a consistent training failure when fine-tuning **Qwen3_5** models. Studio itself starts and loads models fine; the crash happens during the first SFT training step.

Below is a structured bug report with environment details, reproduction steps, logs, workarounds we tried, and a suggested upstream fix — hopefully useful for other ROCm users and for the maintainers.

---

## Summary

SFT training of a Qwen3_5 model fails at **step 0** during the first backward pass (~2 minutes after "Starting training..."):

```
torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
```

The failure appears to originate in the compiled `Qwen3_5RMSNorm_forward` path inside `unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py`, triggered through gradient checkpointing / attention forwards.

This looks related to **#3205** (same Dynamo `fullgraph=True` failure on GPT-OSS) and **#3385** (ROCm Gemma3 workaround via partial compile disable) — but I couldn't find an open issue specifically for **Qwen3_5 on ROCm/RDNA4**.

Both **16-bit LoRA** and **4-bit QLoRA** reproduce the failure identically.

---

## Environment

| Component | Version |
|-----------|---------|
| GPU | 2× AMD Radeon **RX 9070 XT** (`gfx1201`), ~16 GB VRAM each |
| ROCm HIP | **7.2.53211** |
| PyTorch | **2.12.1+rocm7.2** |
| Triton | **3.7.1** |
| Unsloth | **2026.6.9** |
| transformers | **5.12.1** |
| Python | **3.13** |
| Platform | Linux (openSUSE-style kernel `7.2.0-rc1`) |
| Interface | **Unsloth Studio** (local, port 8888) |

`rocminfo` reports both GPUs as `gfx1201` (RDNA4).
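For anyone comparing their setup against the table above, the Python-side versions can be collected with a short script. This is a minimal sketch using only the standard library; the distribution names in `PACKAGES` are assumptions (ROCm wheels may publish Triton under a different name), and `collect_versions` is a hypothetical helper, not part of Unsloth.

```python
import platform
from importlib import metadata

# Assumed distribution names; adjust if your wheels use different ones.
PACKAGES = ("torch", "triton", "unsloth", "transformers")


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its installed version string."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep the row so that missing packages are visible in the report.
            report[name] = "not installed"
    return report


versions = collect_versions()
for name, version in versions.items():
    print(f"{name}: {version}")
```

The GPU architecture and HIP version still need to come from `rocminfo` or the ROCm tooling; this script only covers the Python packages.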

---

## Reproduction steps

1. Start Unsloth Studio locally
2. **Model:** `deepreinforce-ai/Ornith-1.0-9B` (Qwen3_5 architecture)
3. **Dataset:** `NousResearch/hermes-function-calling-v1` (ShareGPT format, auto-detected)
4. **Training:** SFT / LoRA with default-ish Studio settings
   - `max_seq_length`: 2048
   - Multi-GPU: 2 visible GPUs → `device_map=balanced`
   - Tried both **16-bit LoRA** and **4-bit QLoRA**
5. Click **Start Training**
6. Training reaches `0/N` steps, runs ~2 minutes, then crashes

Reproduced **multiple times** across separate Studio restarts on 2026-07-02.

---

## Error output

Studio log shows:

```
Training error: Hard failure due to fullgraph=True
```

Full Dynamo exception (from earlier runs with traceback enabled):

```
torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True

  ...
  Qwen3_5Attention_forward
    → Qwen3_5RMSNorm_forward
      → torch._dynamo.exc.FailOnRecompileLimitHit
```

The compiled cache path involved:

```
unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py
```

---

## What we tried (workarounds)

We investigated in this order:

### 1. Raise Dynamo recompile/cache limits ❌ (not sufficient alone)

```python
torch._dynamo.config.recompile_limit = 2048
torch._dynamo.config.cache_size_limit = 2048
torch._dynamo.config.accumulated_cache_size_limit = 2048
```

Applied early in the training subprocess, before PyTorch compiles anything. **Still fails** with the same `fullgraph=True` hard failure at step 0.
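Option names in `torch._dynamo.config` may differ between PyTorch releases, so anyone repeating this step on another version may want to set only the options that exist. A minimal sketch, assuming a hypothetical helper `raise_dynamo_limits`; in real use the `config` argument would be `torch._dynamo.config`, while the demo uses a stand-in object so it runs without PyTorch:

```python
from types import SimpleNamespace

# Option names taken from the snippet above; which ones exist depends on the PyTorch version.
LIMIT_OPTIONS = (
    "recompile_limit",
    "cache_size_limit",
    "accumulated_cache_size_limit",
)


def raise_dynamo_limits(config, limit=2048, options=LIMIT_OPTIONS):
    """Set each known limit option on `config` and return the names that were set.

    Hypothetical helper: `config` is expected to be `torch._dynamo.config`,
    but any object with matching attributes works.
    """
    applied = []
    for name in options:
        if hasattr(config, name):  # skip names this version does not define
            setattr(config, name, limit)
            applied.append(name)
    return applied


# Stand-in for a config object that lacks `recompile_limit`.
fake_config = SimpleNamespace(cache_size_limit=8, accumulated_cache_size_limit=256)
applied = raise_dynamo_limits(fake_config)
print(applied)
```

As noted above, raising the limits did not resolve the failure here; the sketch only makes the attempt portable across versions.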

### 2. Disable fullgraph / partial compile ✅ (expected fix)

The workaround sets these environment variables on ROCm:

```bash
export UNSLOTH_COMPILE_DISABLE=partial
export UNSLOTH_ROCM_NO_FULLGRAPH=1
```

It also forces `fullgraph=False` in `unsloth_compile_transformers()`. This mirrors the existing **Gemma3 RDNA workaround** in `loader.py`, which sets `UNSLOTH_COMPILE_DISABLE=partial` when `is_rdna()` is true.

We implemented an **auto-retry** in Studio: on `FailOnRecompileLimitHit`, reload the model with partial compile and retry training once. (Happy to share a patch / PR if useful.)
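The retry-once logic can be sketched as follows. This is an illustration under stated assumptions, not the actual Studio patch: `train` and `reload_with_partial_compile` are hypothetical zero-argument callables, and the exception is matched by class name so the sketch runs without importing PyTorch.

```python
def train_with_partial_compile_retry(train, reload_with_partial_compile):
    """Run `train()`; on FailOnRecompileLimitHit, reload and retry exactly once.

    Both arguments are hypothetical stand-ins for Studio's training entry
    point and its model reload with partial compile.
    """
    try:
        return train()
    except Exception as exc:
        # Match by class name so this sketch does not need to import torch.
        if type(exc).__name__ != "FailOnRecompileLimitHit":
            raise
        reload_with_partial_compile()
        return train()  # a second failure propagates to the caller


# Demo with stand-ins: the first attempt fails, the retry succeeds.
class FailOnRecompileLimitHit(RuntimeError):
    pass


state = {"partial": False, "attempts": 0}


def fake_train():
    state["attempts"] += 1
    if not state["partial"]:
        raise FailOnRecompileLimitHit("Hard failure due to fullgraph=True")
    return "trained"


def fake_reload():
    state["partial"] = True


result = train_with_partial_compile_retry(fake_train, fake_reload)
print(result, state["attempts"])
```

Retrying only once keeps a persistent failure visible instead of looping, and unrelated exceptions are re-raised untouched.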

### 3. PyTorch downgrade (not yet needed)

Fallback plan: downgrade to `torch==2.11.0+rocm7.2` from `https://download.pytorch.org/whl/rocm7.2`. Not tested yet, since workaround 2 above should suffice.

---

## Harmless noise (not the cause)

These messages appear in the logs but are unrelated to the crash; training fails regardless:

- TorchCodec / FFmpeg version mismatch → falls back to llama-server GGUF embedder (Studio startup)
- transformers Qwen3 MoE docstring `[ERROR]` lines for undocumented `loss`/`logits` fields
- torchao Cutlass/MXFP8 `.so` load failures

---

## Suggested upstream fix

Similar to the existing Gemma3 RDNA guard in `loader.py`:

```python
# Gemma 3 on RDNA GPUs
if "gemma3" in model_types_all:
    if is_rdna():
        os.environ["UNSLOTH_COMPILE_DISABLE"] = "partial"
```

**Proposal for Qwen3_5 on ROCm/HIP:**

- Proactively set `UNSLOTH_COMPILE_DISABLE=partial` for `qwen3_5` on ROCm (especially RDNA / gfx1201), **or**
- Compile Qwen3_5 forwards with `fullgraph=False` on HIP, **or**
- Auto-detect `FailOnRecompileLimitHit` and retry without fullgraph (Studio-side)

This would save ROCm users from hitting a confusing step-0 failure on otherwise supported hardware.
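The first option could look roughly like the sketch below. Everything here is illustrative: `needs_partial_compile` and `apply_guard` are hypothetical names, the `gfx12` prefix check for RDNA4 is an assumption, and the real `loader.py` would presumably use its own `is_rdna()` helper instead of an architecture string.

```python
import os


def needs_partial_compile(model_types, is_hip, gpu_arch=""):
    """Return True when the proposed Qwen3_5 guard would apply.

    Hypothetical predicate: `model_types` mirrors `model_types_all` above,
    `is_hip` says whether this is a ROCm/HIP build, and `gpu_arch` is an
    architecture string such as "gfx1201". Treating every "gfx12" prefix
    as RDNA4 is an assumption of this sketch.
    """
    if not is_hip:
        return False
    if "qwen3_5" not in model_types:
        return False
    return gpu_arch.startswith("gfx12")


def apply_guard(model_types, is_hip, gpu_arch="", environ=os.environ):
    """Set UNSLOTH_COMPILE_DISABLE=partial when the guard applies."""
    if needs_partial_compile(model_types, is_hip, gpu_arch):
        environ["UNSLOTH_COMPILE_DISABLE"] = "partial"
        return True
    return False


# Demo against a plain dict instead of the real process environment.
env = {}
applied = apply_guard(["qwen3_5"], is_hip=True, gpu_arch="gfx1201", environ=env)
print(applied, env)
```

Passing the environment mapping as a parameter keeps the guard testable without touching the real process environment.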

---

## Related issues

- #3205 — `[Bug] Error Following GPT Finetuning Tutorial` — same `FailOnRecompileLimitHit` / `fullgraph=True` (closed)
- #3385 — ROCm Gemma3 NaN losses — partial compile workaround on RDNA (closed via #4109)

---

## Happy to help

I can provide full Studio server logs, a minimal repro script outside Studio, or test a PR on my RX 9070 XT setup. Thanks for all the ROCm work — Unsloth Studio is otherwise running great on this machine! 🦥

---

Kind regards from the **Mecklenburg Baltic Sea coast**, yours, **Momix** 🌊
