Replies: 3 comments 3 replies
-
This is very likely the cause of your issue: sequence_len: 128000. Is this how long your sequences are? If so, could you look into Sequence Parallelism? https://docs.axolotl.ai/docs/sequence_parallelism.html
Side note: were you also the one who asked in Discord?
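For reference, enabling it is mostly a config change. A minimal sketch, assuming a recent Axolotl version configured as in the docs linked above; the degree and surrounding values are illustrative, not taken from the original config:

```yaml
# Minimal sketch per the Axolotl sequence parallelism docs; values are illustrative.
sequence_len: 128000
flash_attention: true        # sequence parallelism in Axolotl builds on flash attention
sequence_parallel_degree: 2  # split each sequence across the 2 GPUs instead of replicating it
micro_batch_size: 1          # assumption: one long sequence per step
```

With a degree of 2, each GPU only holds the activations for roughly half of each 128k-token sequence, which is usually the dominant memory cost at that length.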
-
Hey there, thanks for replying! I do indeed need that sequence length to avoid dropping sequences that are too long; however, I also tried reducing it to 32k, and I am still facing crashes at around 40% of the training run. The spikes in VRAM usage appear to happen quite randomly in the middle of training. Do you have any idea why that might occur mid-run? Regarding your question, no, that wasn't me...
Edit: I am also now loading the model in 8-bit only, but it still happens.
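For concreteness, the two mitigations described here amount to a small config delta (a sketch of only the changes mentioned in this comment, everything else unchanged):

```yaml
# Sketch of only the changes described above; all other keys as in the original config.
sequence_len: 32768   # reduced from 128000; crashes still occur ~40% into the run
load_in_8bit: true    # base model weights in 8-bit, LoRA adapter still trained on top
```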
-
I’m trying to fine-tune the model unsloth/Mistral-Small-3.2-24B-Instruct-2506 for tool calling, but I haven’t been able to find any documentation or guides explaining how to do this. This will be my first time fine-tuning a model, so I’d really appreciate any guidance, resources, or examples to help me get started.
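One hedged starting point (not an official recipe): tool-calling fine-tunes are usually plain supervised fine-tuning on conversations whose assistant turns contain the tool calls, with the tool results included as additional messages, all rendered through the model's chat template. In Axolotl that could look roughly like the sketch below; the dataset path and field names are assumptions, and the exact chat-template handling depends on the Axolotl version:

```yaml
# Rough sketch of a LoRA fine-tune on chat-style tool-calling data in Axolotl.
# The dataset file and its field names are hypothetical; check the Axolotl docs
# for the chat_template dataset options supported by your version.
base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506

adapter: lora
lora_r: 32                 # illustrative values, not a recommendation
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: tool_calling_conversations.jsonl   # hypothetical file: one JSON object per line
    type: chat_template                      # renders rows via the tokenizer's chat template
    field_messages: messages                 # each row: {"messages": [{"role": ..., "content": ...}, ...]}

sequence_len: 8192         # illustrative
val_set_size: 0.05
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 2
learning_rate: 2e-4
bf16: true
gradient_checkpointing: true
flash_attention: true
output_dir: ./outputs/tool-calling-lora
```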
-
Hello to y'all,
I am currently trying to fine-tune Mistral Small 3.2 Instruct with LoRA (r = 32, alpha = 32) on a small 3k-row dataset in BF16. However, I frequently get OOM errors when I increase the context length too much, even with multi-GPU training (2x H200 SXM, 280 GB VRAM total). The errors come right after the first evaluation run finishes and training starts. I have roughly 180M trainable parameters...
Am I missing something obvious?
Here is my config:
I'd greatly appreciate any help! Thanks in advance. :)
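The config itself is not reproduced above, so as a stand-in, here is a sketch of what a LoRA setup matching the details in the post looks like in Axolotl (r = 32, alpha = 32, BF16, long sequence_len, an evaluation run before training); every key not mentioned in the post is an assumption rather than the poster's actual value:

```yaml
# Illustrative reconstruction from the details in the post above; keys not mentioned
# there (model repo id, dropout, batch sizes, eval settings, paths, ...) are assumptions.
base_model: mistralai/Mistral-Small-3.2-24B-Instruct-2506   # assumed repo id

adapter: lora
lora_r: 32
lora_alpha: 32
lora_dropout: 0.05            # assumption
lora_target_linear: true      # assumption; consistent with ~180M trainable parameters

sequence_len: 128000          # the long-context setting that triggers the OOMs
bf16: true
gradient_checkpointing: true  # assumption, but usually necessary at this context length
flash_attention: true         # assumption

micro_batch_size: 1             # assumption
gradient_accumulation_steps: 8  # assumption

val_set_size: 0.05              # an evaluation run happens before training, as described
eval_steps: 100                 # assumption
output_dir: ./outputs/mistral-small-3.2-lora
```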