I was trying the script for training a reasoning model with Qwen2.5-1.5B-Instruct. The script runs, but the VRAM usage is pretty weird: according to the blog, training should only need about 7 GB of VRAM, but the actual usage is 16 GB.
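For reference, this is roughly how I compare the numbers on my side. A minimal sketch, assuming a single CUDA GPU and only torch; note that nvidia-smi reports a higher figure than PyTorch's own counters because it also includes the CUDA context and anything vLLM reserves outside of PyTorch:

import torch

def report_vram(tag: str) -> None:
    # Memory currently held by PyTorch tensors on GPU 0
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    # Memory reserved by PyTorch's caching allocator (always >= allocated)
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    # Driver-level view (free, total); closest to what nvidia-smi shows
    free, total = torch.cuda.mem_get_info(0)
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, "
          f"device used={(total - free) / 1024**3:.2f} GiB")

# e.g. call this right after each from_pretrained(...) below to compare the two loaders
report_vram("after load")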
Loading the model with AutoModelForCausalLM looks normal; VRAM usage is only 3.1 GB:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer
import os

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    load_in_4bit=True,
    cache_dir=r"D:\Data Scientist\.transformers_cache",  # raw string so the backslashes stay literal
    device_map="cuda",
)

But when I load the model with FastLanguageModel, the VRAM spikes to 16 GB:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,  # Reduce if out of memory
)

Console output from loading:

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-13 09:45:36 __init__.py:207] Automatically detected platform cuda.
==((====))== Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
\\ /| NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.999 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 56.82%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 24.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 12.37 GB. Also swap space = 4 GB.
INFO 03-13 09:45:44 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.0.self_attn', 'model.layers.1.mlp', 'model.layers.2.mlp', 'model.layers.3.mlp', 'model.layers.7.mlp', 'model.layers.24.mlp', 'model.layers.26.mlp', 'model.layers.15.self_attn'], 'llm_int8_threshold': 6.0}
INFO 03-13 09:45:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 03-13 09:45:45 interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-13 09:45:45 cuda.py:229] Using Flash Attention backend.
INFO 03-13 09:45:47 model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
[W313 09:45:46.584979306 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-13 09:45:47 loader.py:1089] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-13 09:45:47 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.20s/it]
INFO 03-13 09:45:53 model_runner.py:1115] Loading model weights took 1.4331 GB
INFO 03-13 09:45:53 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-13 09:45:56 worker.py:267] Memory profiling takes 2.27 seconds
INFO 03-13 09:45:56 worker.py:267] the current vLLM instance can use total_gpu_memory (24.00GiB) x gpu_memory_utilization (0.57) = 13.64GiB
INFO 03-13 09:45:56 worker.py:267] model weights take 1.43GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 10.74GiB.
INFO 03-13 09:45:56 executor_base.py:111] # cuda blocks: 25142, # CPU blocks: 9362
INFO 03-13 09:45:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 392.84x
INFO 03-13 09:45:56 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:23<00:00, 1.47it/s]
INFO 03-13 09:46:20 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.60 GiB
INFO 03-13 09:46:20 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.23 seconds

I am running on Windows 10 WSL, by the way.
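If I read the log above correctly, most of the 16 GB is the pool vLLM reserves up front (24 GB x gpu_memory_utilization, i.e. about 13.6 GiB, of which ~10.7 GiB is KV cache) rather than the 4-bit weights themselves (1.43 GiB). Below is a minimal sketch of what I would try in order to shrink that pool; the 0.25 value is just an illustrative number, not something from the docs or the blog:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = 1024,          # a shorter context also means a smaller KV cache
    load_in_4bit = True,
    fast_inference = True,          # keep vLLM enabled
    max_lora_rank = 32,
    gpu_memory_utilization = 0.25,  # illustrative: caps vLLM's pre-allocated pool at roughly a quarter of the 24 GB card
)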