
Conversation


@rolandtannous (Collaborator) commented Jul 12, 2025

Problem

The current implementation of the fast_dequantize and fast_gemv kernels assumes that quantization statistics (absmax values) always need to be dequantized at inference time. However, recent versions of vLLM have introduced a _dequantize_dq optimization that pre-processes double quantization during model loading rather than at inference time, trading memory for compute performance by dequantizing the scaling statistics ahead of time. After _dequantize_dq runs on a layer's quant_state:

  • quant_state.nested becomes False
  • quant_state.state2 becomes None
  • quant_state.offset becomes None
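
For illustration, here is a minimal sketch of what this load-time pre-processing amounts to, assuming bitsandbytes' QuantState layout and its dequantize_blockwise API (the actual vLLM implementation may differ in detail):

import bitsandbytes.functional as F

def dequantize_dq_sketch(quant_state):
    # Sketch only: resolve double quantization once, at model-load time.
    # Assumes `quant_state.absmax` is itself blockwise-quantized via `state2`
    # and shifted by `offset`, as in bitsandbytes nested (double) quantization.
    if getattr(quant_state, "nested", False):
        # Recover the real fp32 absmax from the quantized statistics.
        absmax = F.dequantize_blockwise(quant_state.absmax, quant_state.state2)
        absmax += quant_state.offset
        # Store the pre-dequantized statistics and drop the nested state,
        # matching the field values listed above.
        quant_state.absmax = absmax
        quant_state.nested = False
        quant_state.state2 = None
        quant_state.offset = None
    return quant_state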

As a consequence, when a model is loaded with Unsloth using fast_inference=True, _dequantize_dq is applied. During training, once the GRPOTrainer is called on the model, the existing Unsloth fast_dequantize and fast_gemv kernels, which were not originally written to handle this edge case, still attempt to access state2.absmax, state2.code, etc., leading to:

AttributeError: 'NoneType' object has no attribute 'absmax'

Solution

Modified both fast_dequantize and fast_gemv kernels across all device types (XPU, CUDA, fallback) to:

  1. Check for a pre-dequantized state: Added logic to detect when double quantization has already been resolved
  • For object-based quant_state: check hasattr(quant_state, 'nested') and quant_state.nested and state2 is not None
  • For list-based quant_state: check state2 is not None
  2. Conditional statistics dequantization: Only perform cdequantize_blockwise_fp32 when needed
  • When has_nested_quant=True: dequantize the statistics using the state2 parameters
  • When has_nested_quant=False: use the pre-dequantized absmax directly
  3. Consistent buffer handling: Ensure the out_absmax buffer is properly populated in both cases for fast_dequantize
  4. Safe pointer management: Define ptr_out_absmax before the conditional blocks to avoid scope issues

A sketch of the detection and conditional-dequantization logic follows below.
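
For illustration only, a Python-level sketch of this check and the conditional path, with bitsandbytes' dequantize_blockwise standing in for the ctypes call to cdequantize_blockwise_fp32 made by the real kernels; the function name resolve_absmax and its argument layout are assumptions made for readability:

import bitsandbytes.functional as F

def resolve_absmax(quant_state, absmax, state2, offset, out_absmax):
    # Sketch only: `absmax`, `state2`, and `offset` are assumed to have been
    # pulled from either the object-based or list-based quant_state beforehand,
    # and `out_absmax` is the fp32 buffer whose pointer (ptr_out_absmax) was
    # defined before this branch.
    if isinstance(quant_state, list):
        # List-based quant_state: nested statistics are present only if state2 exists.
        has_nested_quant = state2 is not None
    else:
        # Object-based quant_state: require the nested flag and a populated state2.
        has_nested_quant = (
            hasattr(quant_state, "nested") and quant_state.nested and state2 is not None
        )

    if has_nested_quant:
        # Statistics are still double-quantized: dequantize them using state2
        # (the real kernels call cdequantize_blockwise_fp32 here) and add the offset.
        out_absmax.copy_(F.dequantize_blockwise(absmax, state2))
        out_absmax += offset
    else:
        # vLLM's _dequantize_dq already resolved the statistics at load time:
        # use the pre-dequantized absmax directly, but still populate out_absmax
        # so downstream buffer and pointer handling stays consistent.
        out_absmax.copy_(absmax)
    return out_absmax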

This maintains backward compatibility while supporting the performance optimization provided by _dequantize_dq.

Solves

Reproducible code

from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    #model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    #load_in_4bit = True, # False for LoRA 16bit
    load_in_4bit=True,
    #use_gradient_checkpointing="unsloth",
    #load_in_8bit=True,
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

quant_state = getattr(model.model.layers[0].self_attn.q_proj.weight, "quant_state", None)
print(type(quant_state))
print(quant_state.nested)
print(type(quant_state.state2))
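
With vLLM's pre-dequantization in effect, the three print statements above are expected to show a quant_state whose nested flag is False and whose state2 is None, roughly as follows (the exact QuantState class path is an assumption and may vary with the bitsandbytes version):

<class 'bitsandbytes.functional.QuantState'>
False
<class 'NoneType'>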

Tests

We tested the following GRPO notebooks end-to-end to ensure that both training and inference work correctly.
After applying the fixes in both #2944 and this PR, all notebooks complete successfully without errors:

Notebook | Training | Inference
Advanced Llama-3.1-(3B)-GRPO-Lora | ✅ | ✅
Advanced_Llama3_2_(3B)_GRPO_LoRA | ✅ | ✅
Phi-14B-GRPO | ✅ | ✅
Mistral_v0.3_(7B)-GRPO | ✅ | ✅
qwen3_4b-GRPO | ✅ | ✅

Additional notes:

After resolving this issue, we faced another ValueError related to dataloader_num_workers. We issued a fix for that in PR #2944.

@rolandtannous rolandtannous changed the title Support pre-dequantized quantization states in fast_dequantize kernel GRPO Fix - Support pre-dequantized quantization states in fast_dequantize kernel Jul 12, 2025
@rolandtannous rolandtannous changed the title GRPO Fix - Support pre-dequantized quantization states in fast_dequantize kernel GRPO Fix - Support vllm pre-dequantized quantization states in fast_dequantize kernel Jul 12, 2025
@danielhanchen danielhanchen merged commit 0eb61fb into unslothai:main Jul 14, 2025
danielhanchen added a commit that referenced this pull request Jul 17, 2025
