Bug in GPU VRAM approximation when loading a 16-bit model with load_in_4bit=True
How to reproduce:
from unsloth import FastLanguageModel

max_seq_length = 2048  # example value
lora_rank = 32         # example value

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "XGenerationLab/XiYanSQL-QwenCoder-7B-2504",
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.35,  # Reduce if out of memory
)
Inside approximate_vllm_memory_usage, load_in_4bit ends up False even though the model was loaded with load_in_4bit=True, which leads to wrong 'max_num_batched_tokens' and 'approx_max_num_seqs' estimates.
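
For context, a rough sketch of why the flag matters (this is not Unsloth's actual approximate_vllm_memory_usage; the function, parameters, and numbers below are illustrative assumptions): the bytes-per-parameter assumption determines how much of the gpu_memory_utilization budget is thought to remain after weights, and the derived 'max_num_batched_tokens' / 'approx_max_num_seqs' are sized from that remainder.

def approx_vllm_budget(n_params_billion, gpu_gb, gpu_memory_utilization, load_in_4bit):
    # 4-bit weights take ~0.5 bytes/param, fp16/bf16 weights take 2 bytes/param
    bytes_per_param = 0.5 if load_in_4bit else 2.0
    weight_gb = n_params_billion * bytes_per_param
    budget_gb = gpu_gb * gpu_memory_utilization
    # whatever is left of the budget after weights is what the token/sequence
    # limits get sized from
    leftover_gb = max(budget_gb - weight_gb, 0.0)
    return weight_gb, leftover_gb

# 7B model on a 24 GB GPU with gpu_memory_utilization = 0.35:
print(approx_vllm_budget(7, 24, 0.35, load_in_4bit=True))   # ~3.5 GB weights, ~4.9 GB left
print(approx_vllm_budget(7, 24, 0.35, load_in_4bit=False))  # ~14 GB weights, ~0 GB left

With load_in_4bit wrongly treated as False, the weight estimate is roughly 4x too large, so the leftover budget, and with it the token and sequence limits, is computed from the wrong numbers.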