Bug in GPU VRAM approximation when loading a 16-bit model with load_in_4bit=True
How to reproduce:
from unsloth import FastLanguageModel

max_seq_length = 2048  # example value
lora_rank = 32         # example value

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "XGenerationLab/XiYanSQL-QwenCoder-7B-2504",
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.35,  # Reduce if out of memory
)
Inside approximate_vllm_memory_usage, load_in_4bit ends up False even though the model was loaded with load_in_4bit=True, which leads to wrong 'max_num_batched_tokens' and 'approx_max_num_seqs' estimates.
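
For context, a rough sketch of why the flag matters (this is not Unsloth's actual approximate_vllm_memory_usage; the function, parameters, and numbers below are illustrative assumptions): the bytes-per-parameter assumption determines how much of the gpu_memory_utilization budget is thought to remain after weights, and the derived 'max_num_batched_tokens' / 'approx_max_num_seqs' are sized from that remainder.

def approx_vllm_budget(n_params_billion, gpu_gb, gpu_memory_utilization, load_in_4bit):
    # 4-bit weights take ~0.5 bytes/param, fp16/bf16 weights take 2 bytes/param
    bytes_per_param = 0.5 if load_in_4bit else 2.0
    weight_gb = n_params_billion * bytes_per_param
    budget_gb = gpu_gb * gpu_memory_utilization
    # whatever is left of the budget after weights is what the token/sequence
    # limits get sized from
    leftover_gb = max(budget_gb - weight_gb, 0.0)
    return weight_gb, leftover_gb

# 7B model on a 24 GB GPU with gpu_memory_utilization = 0.35:
print(approx_vllm_budget(7, 24, 0.35, load_in_4bit=True))   # ~3.5 GB weights, ~4.9 GB left
print(approx_vllm_budget(7, 24, 0.35, load_in_4bit=False))  # ~14 GB weights, ~0 GB left

With load_in_4bit wrongly treated as False, the weight estimate is roughly 4x too large, so the leftover budget, and with it the token and sequence limits, is computed from the wrong numbers.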