Note: This report was drafted with the help of AI (assisted structuring, log analysis, and wording). All technical details, versions, reproduction steps, and workarounds described below were verified on my local machine — please treat anything that looks off as something I'm happy to clarify.
Hi Unsloth team 👋
I'm running Unsloth Studio on an AMD ROCm workstation (dual RX 9070 XT, RDNA4 / gfx1201) and ran into a consistent training failure when fine-tuning Qwen3_5 models. Studio itself starts and loads models fine; the crash happens during the first SFT training step.
Below is a structured bug report with environment details, reproduction steps, logs, workarounds we tried, and a suggested upstream fix — hopefully useful for other ROCm users and for the maintainers.
Summary
SFT training of a Qwen3_5 model fails at step 0 during the first backward pass (~2 minutes after "Starting training..."):
torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
The failure appears to originate in the compiled Qwen3_5RMSNorm_forward path inside unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py, triggered through gradient checkpointing / attention forwards.
This looks related to #3205 (same Dynamo fullgraph=True failure on GPT-OSS) and #3385 (ROCm Gemma3 workaround via partial compile disable) — but I couldn't find an open issue specifically for Qwen3_5 on ROCm/RDNA4.
Both 16-bit LoRA and 4-bit QLoRA reproduce the failure identically.
Environment
| Component | Version |
| --- | --- |
| GPU | 2× AMD Radeon RX 9070 XT (gfx1201), ~16 GB VRAM each |
| ROCm HIP | 7.2.53211 |
| PyTorch | 2.12.1+rocm7.2 |
| Triton | 3.7.1 |
| Unsloth | 2026.6.9 |
| transformers | 5.12.1 |
| Python | 3.13 |
| Platform | Linux (openSUSE-style kernel 7.2.0-rc1) |
| Interface | Unsloth Studio (local, port 8888) |
rocminfo reports both GPUs as gfx1201 (RDNA4).
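For anyone comparing their setup against the table above, here is a minimal sketch that prints the installed versions of the Python packages involved. It uses only the standard library; `collect_versions` and `PACKAGES` are names made up for this example, and it does not cover the HIP version or GPU architecture, which came from rocminfo.

```python
from importlib import metadata

# Python packages listed in the Environment table; adjust as needed.
PACKAGES = ("torch", "triton", "unsloth", "transformers")


def collect_versions(packages=PACKAGES):
    """Return {package name: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent from this environment; record that instead of failing.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```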
Reproduction steps
- Start Unsloth Studio locally
- Model: deepreinforce-ai/Ornith-1.0-9B (Qwen3_5 architecture)
- Dataset: NousResearch/hermes-function-calling-v1 (ShareGPT format, auto-detected)
- Training: SFT / LoRA with default-ish Studio settings, max_seq_length: 2048
- Multi-GPU: 2 visible GPUs → device_map=balanced
- Tried both 16-bit LoRA and 4-bit QLoRA
- Click Start Training
- Training reaches 0/N steps, runs ~2 minutes, then crashes
Reproduced multiple times across separate Studio restarts on 2026-07-02.
Error output
Studio log shows:
Training error: Hard failure due to fullgraph=True
Full Dynamo exception (from earlier runs with traceback enabled):
torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
...
Qwen3_5Attention_forward
→ Qwen3_5RMSNorm_forward
→ torch._dynamo.exc.FailOnRecompileLimitHit
The compiled cache path involved:
unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py
What we tried (workarounds)
We investigated in this order:
1. Raise Dynamo recompile/cache limits ❌ (not sufficient alone)
torch._dynamo.config.recompile_limit = 2048
torch._dynamo.config.cache_size_limit = 2048
torch._dynamo.config.accumulated_cache_size_limit = 2048
Applied early in the training subprocess (before first torch compile). Still fails with the same fullgraph=True hard failure at step 0.
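A small sketch of how these limits can be applied defensively, setting only the options that the given config object actually exposes. `raise_dynamo_limits` is a hypothetical helper for illustration, not part of PyTorch or Unsloth, and as noted above raising the limits did not resolve the failure on its own.

```python
def raise_dynamo_limits(config, limit=2048):
    """Set the Dynamo limit options that exist on `config`; return the names applied.

    `config` is expected to be torch._dynamo.config, but any object with
    matching attributes works, which keeps this sketch testable without torch.
    """
    names = (
        "recompile_limit",
        "cache_size_limit",
        "accumulated_cache_size_limit",
    )
    applied = []
    for name in names:
        # Skip options the installed build does not define instead of raising.
        if hasattr(config, name):
            setattr(config, name, limit)
            applied.append(name)
    return applied


# Intended usage, early in the training subprocess:
#   import torch._dynamo
#   raise_dynamo_limits(torch._dynamo.config)
```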
2. Disable fullgraph / partial compile ✅ (expected fix)
Setting on ROCm:
export UNSLOTH_COMPILE_DISABLE=partial
export UNSLOTH_ROCM_NO_FULLGRAPH=1
…and forcing fullgraph=False in unsloth_compile_transformers() — this mirrors the existing Gemma3 RDNA workaround in loader.py (UNSLOTH_COMPILE_DISABLE=partial for is_rdna()).
We implemented an auto-retry in Studio: on FailOnRecompileLimitHit, reload model with partial compile and retry training once. (Happy to share a patch / PR if useful.)
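The retry-once logic can be sketched roughly as follows. This is a simplified illustration, not the actual Studio patch: `load_model` and `train` are hypothetical callables standing in for Studio's model loading and training entry points, and the error is matched by class name or message so the sketch runs without importing torch. It assumes the loader reads UNSLOTH_COMPILE_DISABLE at model load time, as the Gemma3 guard in loader.py suggests.

```python
import os


def is_recompile_limit_error(exc):
    """Heuristic match for the Dynamo failure described in this report."""
    return (
        type(exc).__name__ == "FailOnRecompileLimitHit"
        or "fullgraph=True" in str(exc)
    )


def train_with_partial_compile_retry(load_model, train, env=os.environ):
    """Run training; on the fullgraph failure, reload with partial compile and retry once."""
    model = load_model()
    try:
        return train(model)
    except Exception as exc:
        if not is_recompile_limit_error(exc):
            raise  # unrelated errors propagate unchanged
        # Switch to partial compile before reloading the model.
        env["UNSLOTH_COMPILE_DISABLE"] = "partial"
        model = load_model()
        # Second attempt is not guarded: a repeat failure surfaces to the caller.
        return train(model)
```

Retrying only once keeps a genuinely broken configuration from looping; a second failure is reported as usual.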
3. PyTorch downgrade (not yet needed)
Fallback plan: downgrade to torch==2.11.0+rocm7.2 from https://download.pytorch.org/whl/rocm7.2 — not tested yet since workaround #2 should suffice.
Harmless noise (not the cause)
These appear in logs but training fails regardless:
- TorchCodec / FFmpeg version mismatch → falls back to llama-server GGUF embedder (Studio startup)
- transformers Qwen3 MoE docstring [ERROR] lines for undocumented loss/logits fields
- torchao Cutlass/MXFP8 .so load failures
Suggested upstream fix
Similar to the existing Gemma3 RDNA guard in loader.py:
# Gemma 3 on RDNA GPUs
if "gemma3" in model_types_all:
    if is_rdna():
        os.environ["UNSLOTH_COMPILE_DISABLE"] = "partial"
Proposal for Qwen3_5 on ROCm/HIP:
- Proactively set UNSLOTH_COMPILE_DISABLE=partial for qwen3_5 on ROCm (especially RDNA / gfx1201), or
- Compile Qwen3_5 forwards with fullgraph=False on HIP, or
- Auto-detect FailOnRecompileLimitHit and retry without fullgraph (Studio-side)
This would save ROCm users from hitting a confusing step-0 failure on otherwise supported hardware.
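As a rough illustration of the first option, a guard modeled on the Gemma3 one might look like the sketch below. The function name and signature are made up for this example; in loader.py the check would presumably use the existing `model_types_all` and `is_rdna()` rather than these parameters.

```python
import os


def apply_qwen3_5_rdna_guard(model_types, on_rdna, env=os.environ):
    """Request partial compile for Qwen3_5 models on RDNA GPUs.

    model_types: iterable of model type strings (stand-in for model_types_all)
    on_rdna:     result of an RDNA check (stand-in for is_rdna())
    Returns True if the environment variable was set.
    """
    if on_rdna and any("qwen3_5" in model_type for model_type in model_types):
        env["UNSLOTH_COMPILE_DISABLE"] = "partial"
        return True
    return False
```

Passing the environment mapping as a parameter is only there to make the sketch easy to test; the real guard would write to os.environ directly, as the Gemma3 one does.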
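Related issues
- [Bug] Error Following GPT Finetuning Tutorial: same FailOnRecompileLimitHit / fullgraph=True (closed)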
Happy to help
I can provide full Studio server logs, a minimal repro script outside Studio, or test a PR on my RX 9070 XT setup. Thanks for all the ROCm work — Unsloth Studio is otherwise running great on this machine! 🦥
Schöne Grüße von der mecklenburgischen Ostseeküste (kind regards from the Mecklenburg Baltic Sea coast) — Euer Momix 🌊