GRPO Fix - Support vllm pre-dequantized quantization states in fast_dequantize kernel #2943
Problem
The current implementation of the `fast_dequantize` and `fast_gemv` kernels assumes that quantization statistics (absmax values) always need to be dequantized at inference time. However, recent versions of vLLM have introduced a `_dequantize_dq` optimization method that pre-processes double quantization during model loading rather than at inference time. This optimization trades memory for compute performance by dequantizing the scaling statistics ahead of time, after which:

- `quant_state.nested` becomes `False`
- `quant_state.state2` becomes `None`
- `quant_state.offset` becomes `None`

As a consequence, when a model is loaded with unsloth using `fast_inference=True`, `dequantize_dq` is applied. During training, when the GRPOTrainer is called on the model, the existing unsloth `fast_dequantize` and `fast_gemv` kernels, which were not written to handle this case, still attempt to access `state2.absmax`, `state2.code`, etc., raising an `AttributeError` because `state2` is `None`.
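For context, here is a minimal sketch of what a `_dequantize_dq`-style pass does to a bitsandbytes quant state. This is illustrative only, not vLLM's actual code, and `predequantize_state` is a hypothetical helper:

```python
from bitsandbytes.functional import dequantize_blockwise

def predequantize_state(quant_state):
    """Resolve double quantization once, at load time (illustrative only)."""
    if quant_state.nested:
        # Recover the fp32 absmax from the blockwise-quantized absmax + state2,
        # which is the work the kernels otherwise repeat at every forward pass.
        absmax = dequantize_blockwise(quant_state.absmax, quant_state.state2)
        quant_state.absmax = absmax + quant_state.offset
        # The fields the old unsloth kernels relied on are now gone:
        quant_state.nested = False
        quant_state.state2 = None
        quant_state.offset = None
    return quant_state

# The previous fast_dequantize / fast_gemv code unconditionally read
# quant_state.state2.absmax, which fails with
# AttributeError: 'NoneType' object has no attribute 'absmax'
# once such a pre-dequantized state reaches it.
```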
Solution

Modified both `fast_dequantize` and `fast_gemv` kernels across all device types (XPU, CUDA, fallback) to:
- Nested `quant_state`: check `hasattr(quant_state, 'nested') and quant_state.nested and state2 is not None`
- Pre-dequantized `quant_state`: check `state2 is not None`
- Ensure the `out_absmax` buffer is properly populated in both cases for `fast_dequantize`
- Initialize `ptr_out_absmax` before the conditional blocks to avoid scope issues

This maintains backward compatibility while supporting the performance optimization provided by `dequantize_dq`; a sketch of the guard logic follows.
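In rough terms, the guard looks like the sketch below. This is illustrative only: it uses bitsandbytes' Python API instead of the ctypes kernel calls used by the real code paths, and `resolve_absmax` is a hypothetical helper, not a function from this PR.

```python
import torch
from bitsandbytes.functional import dequantize_blockwise

def resolve_absmax(quant_state):
    """Return fp32 absmax whether or not double quantization is still present."""
    state2 = getattr(quant_state, "state2", None)
    is_nested = getattr(quant_state, "nested", False) and state2 is not None

    if is_nested:
        # Legacy path: absmax is itself blockwise-quantized, so dequantize it
        # and add the stored offset before using it as the scaling statistic.
        out_absmax = dequantize_blockwise(quant_state.absmax, state2)
        out_absmax += quant_state.offset
    else:
        # Pre-dequantized path (vLLM's dequantize_dq already ran): absmax is
        # already a plain fp32 tensor, so it can be used directly.
        out_absmax = quant_state.absmax.to(torch.float32)

    return out_absmax
```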
Solves
Reproducible code
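The original reproduction snippet is not included above; the following is a minimal sketch of the kind of setup that exercises the failing path, assuming the standard unsloth + TRL GRPO workflow. The model name, dataset, reward function, and config values are placeholders, not taken from this PR.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# Placeholder 4-bit checkpoint; fast_inference=True routes loading through
# vLLM, which pre-dequantizes the bnb quant states (state2/offset become None).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)
model = FastLanguageModel.get_peft_model(model, r=16)

def dummy_reward(completions, **kwargs):
    # Placeholder reward: longer completions score higher.
    return [float(len(c)) for c in completions]

dataset = Dataset.from_dict({"prompt": ["Write a haiku about GPUs."] * 8})

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=dummy_reward,
    args=GRPOConfig(
        output_dir="grpo-repro",
        use_vllm=True,
        max_steps=1,
        per_device_train_batch_size=8,  # divisible by the default num_generations
    ),
    train_dataset=dataset,
)

# Before this fix, the first training step crashed inside fast_dequantize /
# fast_gemv because quant_state.state2 is None after vLLM's pre-dequantization.
trainer.train()
```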
Tests
We tested the following GRPO notebooks end-to-end to ensure both training and inference work correctly.
After applying the fixes in both #2944 and this PR, all notebooks now complete successfully without errors.
Additional notes:
After we resolved this issue, we faced another `ValueError` related to `dataloader_num_workers`. We issued a fix for that in PR #2944.