[Bug] Qwen3_5 SFT on ROCm (gfx1201) fails at step 0: FailOnRecompileLimitHit (fullgraph=True) #6825

@Momix-77

Note: This report was drafted with the help of AI (assisted structuring, log analysis, and wording). All technical details, versions, reproduction steps, and workarounds described below were verified on my local machine — please treat anything that looks off as something I'm happy to clarify.

Hi Unsloth team 👋

I'm running Unsloth Studio on an AMD ROCm workstation (dual RX 9070 XT, RDNA4 / gfx1201) and ran into a consistent training failure when fine-tuning Qwen3_5 models. Studio itself starts and loads models fine; the crash happens during the first SFT training step.

Below is a structured bug report with environment details, reproduction steps, logs, workarounds we tried, and a suggested upstream fix — hopefully useful for other ROCm users and for the maintainers.


Summary

SFT training of a Qwen3_5 model fails at step 0 during the first backward pass (~2 minutes after "Starting training..."):

torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True

The failure appears to originate in the compiled Qwen3_5RMSNorm_forward path inside unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py, triggered through gradient checkpointing / attention forwards.

This looks related to #3205 (same Dynamo fullgraph=True failure on GPT-OSS) and #3385 (ROCm Gemma3 workaround via partial compile disable) — but I couldn't find an open issue specifically for Qwen3_5 on ROCm/RDNA4.

Both 16-bit LoRA and 4-bit QLoRA reproduce the failure identically.


Environment

Component      Version
GPU            2× AMD Radeon RX 9070 XT (gfx1201), ~16 GB VRAM each
ROCm HIP       7.2.53211
PyTorch        2.12.1+rocm7.2
Triton         3.7.1
Unsloth        2026.6.9
transformers   5.12.1
Python         3.13
Platform       Linux (openSUSE-style kernel 7.2.0-rc1)
Interface      Unsloth Studio (local, port 8888)

rocminfo reports both GPUs as gfx1201 (RDNA4).
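
For anyone who wants to confirm the architecture from a script rather than by eye, the gfx targets can be pulled out of the rocminfo output. This is a minimal sketch, assuming rocminfo is on PATH; the helper names parse_gfx_targets and detect_gfx_targets are hypothetical and not part of Unsloth or ROCm:

```python
import re
import subprocess


def parse_gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the distinct gfx target names found in rocminfo output, in order."""
    seen: list[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def detect_gfx_targets() -> list[str]:
    """Run rocminfo and parse its output; return [] if rocminfo is unavailable."""
    try:
        result = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gfx_targets(result.stdout)
```

On the machine described here, detect_gfx_targets() would be expected to include gfx1201. Because duplicates are collapsed, two identical cards show up as a single entry.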


Reproduction steps

  1. Start Unsloth Studio locally
  2. Model: deepreinforce-ai/Ornith-1.0-9B (Qwen3_5 architecture)
  3. Dataset: NousResearch/hermes-function-calling-v1 (ShareGPT format, auto-detected)
  4. Training: SFT / LoRA with default-ish Studio settings
    • max_seq_length: 2048
    • Multi-GPU: 2 visible GPUs → device_map=balanced
    • Tried both 16-bit LoRA and 4-bit QLoRA
  5. Click Start Training
  6. Training reaches 0/N steps, runs ~2 minutes, then crashes

Reproduced multiple times across separate Studio restarts on 2026-07-02.


Error output

Studio log shows:

Training error: Hard failure due to fullgraph=True

Full Dynamo exception (from earlier runs with traceback enabled):

torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True

  ...
  Qwen3_5Attention_forward
    → Qwen3_5RMSNorm_forward
      → torch._dynamo.exc.FailOnRecompileLimitHit

The compiled cache path involved:

unsloth_compiled_cache/unsloth_compiled_module_qwen3_5.py
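
To see why Dynamo keeps recompiling before it hits the limit, PyTorch's TORCH_LOGS environment variable can be set to include the recompiles and graph_breaks artifacts. This is a sketch only: the helper name enable_recompile_logging is hypothetical, and it assumes the variable is set in the training process before torch is first imported (whether Studio's training subprocess inherits it is not something I have verified):

```python
import os


def enable_recompile_logging(env=os.environ) -> str:
    """Add 'recompiles' and 'graph_breaks' to TORCH_LOGS, keeping existing entries."""
    wanted = ["recompiles", "graph_breaks"]
    current = [p.strip() for p in env.get("TORCH_LOGS", "").split(",") if p.strip()]
    for item in wanted:
        if item not in current:
            current.append(item)
    env["TORCH_LOGS"] = ",".join(current)
    return env["TORCH_LOGS"]


# Call this before the first `import torch` in the training process,
# since torch reads TORCH_LOGS when it initialises its logging.
```

The resulting log lines name the guard that failed for each recompile, which should show what differs between invocations of Qwen3_5RMSNorm_forward.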


What we tried (workarounds)

We investigated in this order:

1. Raise Dynamo recompile/cache limits ❌ (not sufficient alone)

torch._dynamo.config.recompile_limit = 2048
torch._dynamo.config.cache_size_limit = 2048
torch._dynamo.config.accumulated_cache_size_limit = 2048

Applied early in the training subprocess (before first torch compile). Still fails with the same fullgraph=True hard failure at step 0.
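
Since the names of these limits have changed between PyTorch releases, the three assignments can be applied defensively so that a missing attribute does not itself crash the subprocess. A minimal sketch, with the helper name raise_dynamo_limits being hypothetical; it takes the config object as a parameter so it can be exercised without importing torch:

```python
from types import SimpleNamespace

# The three limits from the report above.
LIMIT_NAMES = ("recompile_limit", "cache_size_limit", "accumulated_cache_size_limit")


def raise_dynamo_limits(config, value: int = 2048) -> list[str]:
    """Set every known limit that exists on `config`; return the names that were set."""
    applied = []
    for name in LIMIT_NAMES:
        if hasattr(config, name):
            setattr(config, name, value)
            applied.append(name)
    return applied


# Stand-in object for demonstration; in the training subprocess this would be
#   import torch._dynamo
#   raise_dynamo_limits(torch._dynamo.config)
demo_config = SimpleNamespace(recompile_limit=8)
demo_applied = raise_dynamo_limits(demo_config)
```

Logging the returned list makes it easy to confirm which limits were actually raised in a given PyTorch build.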

2. Disable fullgraph / partial compile ✅ (expected fix)

Setting on ROCm:

export UNSLOTH_COMPILE_DISABLE=partial
export UNSLOTH_ROCM_NO_FULLGRAPH=1

We also forced fullgraph=False in unsloth_compile_transformers(). This mirrors the existing Gemma3 RDNA workaround in loader.py (UNSLOTH_COMPILE_DISABLE=partial when is_rdna() is true).

We implemented an auto-retry in Studio: on FailOnRecompileLimitHit, reload model with partial compile and retry training once. (Happy to share a patch / PR if useful.)
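
The retry logic is roughly the following. This is a simplified sketch and not the actual Studio patch: load_model and train are hypothetical callables standing in for Studio's model loading and training entry points, and it assumes that reloading the model after setting the variable is enough for partial compile to take effect:

```python
import os


def is_recompile_limit_failure(exc: BaseException) -> bool:
    """Match by class name or message so this module does not need to import torch."""
    return (
        type(exc).__name__ == "FailOnRecompileLimitHit"
        or "Hard failure due to fullgraph=True" in str(exc)
    )


def train_with_partial_compile_retry(load_model, train, env=os.environ):
    """Run train(load_model()); on the Dynamo hard failure, retry exactly once."""
    try:
        return train(load_model())
    except Exception as exc:
        if not is_recompile_limit_failure(exc):
            raise  # unrelated errors are not masked
        # Switch to partial compile, then reload the model and try again.
        env["UNSLOTH_COMPILE_DISABLE"] = "partial"
        return train(load_model())
```

A second failure propagates to the caller unchanged, so the retry cannot loop.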

3. PyTorch downgrade (not yet needed)

Fallback plan: downgrade to torch==2.11.0+rocm7.2 from https://download.pytorch.org/whl/rocm7.2. Not tested yet, since workaround 2 should suffice.


Harmless noise (not the cause)

These messages appear in the logs but do not seem related to the failure:

  • TorchCodec / FFmpeg version mismatch → falls back to llama-server GGUF embedder (Studio startup)
  • transformers Qwen3 MoE docstring [ERROR] lines for undocumented loss/logits fields
  • torchao Cutlass/MXFP8 .so load failures


Suggested upstream fix

Similar to the existing Gemma3 RDNA guard in loader.py:

# Gemma 3 on RDNA GPUs
if "gemma3" in model_types_all:
    if is_rdna():
        os.environ["UNSLOTH_COMPILE_DISABLE"] = "partial"

Proposal for Qwen3_5 on ROCm/HIP:

  • Proactively set UNSLOTH_COMPILE_DISABLE=partial for qwen3_5 on ROCm (especially RDNA / gfx1201), or
  • Compile Qwen3_5 forwards with fullgraph=False on HIP, or
  • Auto-detect FailOnRecompileLimitHit and retry without fullgraph (Studio-side)

This would save ROCm users from hitting a confusing step-0 failure on otherwise supported hardware.
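
The first option could look roughly like this. It is a sketch under assumptions: the function name apply_qwen3_5_rocm_guard is hypothetical, the model type string "qwen3_5" is inferred from the compiled cache file name, and the real loader.py may structure this differently. The is_hip flag could be derived from torch.version.hip, which is None on non-ROCm builds:

```python
import os


def apply_qwen3_5_rocm_guard(model_types_all, is_hip: bool, env=os.environ) -> bool:
    """Default to partial compile for qwen3_5 on HIP; return True if the env was changed."""
    if not is_hip or "qwen3_5" not in model_types_all:
        return False
    if "UNSLOTH_COMPILE_DISABLE" in env:
        return False  # respect an explicit user setting
    env["UNSLOTH_COMPILE_DISABLE"] = "partial"
    return True


# Possible call site (assumption, not existing Unsloth code):
#   import torch
#   apply_qwen3_5_rocm_guard(model_types_all, torch.version.hip is not None)
```

Leaving an already-set UNSLOTH_COMPILE_DISABLE untouched keeps the guard from overriding users who want to test full compile on newer ROCm or PyTorch builds.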


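Related issues

  • #3205: same Dynamo fullgraph=True failure on GPT-OSS
  • #3385: ROCm Gemma3 workaround via partial compile disable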


Happy to help

I can provide full Studio server logs, a minimal repro script outside Studio, or test a PR on my RX 9070 XT setup. Thanks for all the ROCm work — Unsloth Studio is otherwise running great on this machine! 🦥


Kind regards from the Mecklenburg Baltic Sea coast, your Momix 🌊
