
FastLanguageModel load_in_4bit not working, when used with fast_inference=True #2008

@shao-shuai

I was trying the script for training a reasoning model with Qwen2.5-1.5B-Instruct. I can run the script, but the VRAM usage is pretty weird: based on the blog I should only need about 7 GB of VRAM for training, yet the actual usage is 16 GB.

Loading the model with AutoModelForCausalLM looks normal; VRAM usage is only 3.1 GB:

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer
import os

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    load_in_4bit=True,
    cache_dir=r"D:\Data Scientist\.transformers_cache",  # raw string so the backslashes are not treated as escapes
    device_map="cuda",
)

(screenshot: VRAM usage around 3.1 GB)
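For reference, this is one way to confirm the number from inside the process (a minimal sketch; it assumes the model above has already been loaded, and note that PyTorch's allocator can report slightly less than nvidia-smi because of CUDA context overhead):

import torch

# PyTorch's view of GPU memory; nvidia-smi may show more because of
# the CUDA context and cached (reserved but unused) blocks.
allocated_gb = torch.cuda.memory_allocated() / 1024**3
reserved_gb  = torch.cuda.memory_reserved() / 1024**3
peak_gb      = torch.cuda.max_memory_allocated() / 1024**3

print(f"allocated: {allocated_gb:.2f} GB")
print(f"reserved:  {reserved_gb:.2f} GB")
print(f"peak:      {peak_gb:.2f} GB")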

But when I load the model with FastLanguageModel, VRAM usage spikes to 16 GB:

from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

(screenshot: VRAM usage around 16 GB)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-13 09:45:36 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 56.82%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 24.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 12.37 GB. Also swap space = 4 GB.
INFO 03-13 09:45:44 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.0.self_attn', 'model.layers.1.mlp', 'model.layers.2.mlp', 'model.layers.3.mlp', 'model.layers.7.mlp', 'model.layers.24.mlp', 'model.layers.26.mlp', 'model.layers.15.self_attn'], 'llm_int8_threshold': 6.0}
INFO 03-13 09:45:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
WARNING 03-13 09:45:45 interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-13 09:45:45 cuda.py:229] Using Flash Attention backend.
INFO 03-13 09:45:47 model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
[W313 09:45:46.584979306 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-13 09:45:47 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-13 09:45:47 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.20s/it]
INFO 03-13 09:45:53 model_runner.py:1115] Loading model weights took 1.4331 GB
INFO 03-13 09:45:53 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-13 09:45:56 worker.py:267] Memory profiling takes 2.27 seconds
INFO 03-13 09:45:56 worker.py:267] the current vLLM instance can use total_gpu_memory (24.00GiB) x gpu_memory_utilization (0.57) = 13.64GiB
INFO 03-13 09:45:56 worker.py:267] model weights take 1.43GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 10.74GiB.
INFO 03-13 09:45:56 executor_base.py:111] # cuda blocks: 25142, # CPU blocks: 9362
INFO 03-13 09:45:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 392.84x
INFO 03-13 09:45:56 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:23<00:00,  1.47it/s]
INFO 03-13 09:46:20 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.60 GiB
INFO 03-13 09:46:20 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.23 seconds

I am running on Windows 10 WSL, by the way.
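Judging from the log, most of the 16 GB seems to be vLLM's pre-allocated KV cache (10.74 GiB) plus the captured CUDA graphs (0.60 GiB), rather than the 4-bit weights themselves (1.43 GiB). A minimal sketch of lowering that reservation, reusing only the parameters from the call above (the 0.3 value is just an illustration, not a recommended setting):

from unsloth import FastLanguageModel

max_seq_length = 1024
lora_rank = 32

# Same call as above, but asking vLLM to reserve less GPU memory.
# gpu_memory_utilization caps weights + KV cache + activations for vLLM,
# so a smaller value mainly shrinks the pre-allocated KV cache.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length, # shorter sequences also need less KV cache per request
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.3, # illustrative: lower => smaller KV cache reservation
)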
