forked from vllm-project/vllm
Add INT4 compressed-tensors + LoRA support #1
Merged
Conversation
This commit enables vLLM to support INT4 quantized models using compressed-tensors with LoRA adapters.

## Problem
LoRA injection previously assumed tensors existed directly, but compressed-tensors quantized models only expose packed buffers. Direct access to `weight.shape` would fail or return incorrect dimensions due to bit-packing.

## Solution
Implemented a multi-tiered fallback strategy for obtaining correct tensor dimensions:
1. Layer-specific attributes (org_vocab_size, embedding_dim)
2. Generic layer attributes (input_size, output_size)
3. weight_shape parameter (stores unpacked dims for compressed-tensors)
4. Fallback to tensor shape

## Changes
- vllm/lora/models.py: Fixed dummy LoRA creation to use layer attributes and weight_shape instead of direct shape access
- tests/lora/test_quant_model.py: Added INT4 compressed-tensors test case with neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4
- examples/offline_inference/lora_with_quantization_inference.py: Added compressed-tensors example

## Testing
- Added integration test with compressed-tensors INT4 model
- Follows existing patterns from AWQ/GPTQ/BitsAndBytes + LoRA support
- All modified files pass Python syntax validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
Fixes INT4 compressed-tensors + LoRA for MoE models (e.g., Kimi K2 Thinking).

## Problem
CompressedTensorsWNA16MoEMethod and CompressedTensorsWNA16MarlinMoEMethod did not set required layer attributes (hidden_size, intermediate_size_per_partition, local_num_experts) that the FusedMoEWithLoRA wrapper expects to access. This caused LoRA to fail with MoE models using compressed-tensors quantization, even though the weights were accessible.

## Solution
Added layer attribute initialization in create_weights() methods for both:
- CompressedTensorsWNA16MoEMethod
- CompressedTensorsWNA16MarlinMoEMethod

These attributes are set before weight creation, matching the pattern used by other MoE methods (e.g., CompressedTensorsW8A8Fp8MoEMethod).

## Impact
- Enables LoRA with Kimi K2 Thinking (INT4 MoE + compressed-tensors)
- Follows existing patterns from FP8 MoE + LoRA support
- No changes to weight layout or kernel behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
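A minimal sketch of that pattern (the class and attribute names come from the commit message above; the method signature, body, and `Sketch` suffix are illustrative, not the actual vLLM implementation):

```python
from types import SimpleNamespace

class CompressedTensorsWNA16MoEMethodSketch:
    """Illustrative stand-in for CompressedTensorsWNA16MoEMethod (not the real class)."""

    def create_weights(self, layer, num_experts, hidden_size,
                       intermediate_size_per_partition, **kwargs):
        # The fix: expose the attributes the FusedMoEWithLoRA wrapper reads,
        # and do so *before* the packed INT4 weight buffers are created.
        layer.hidden_size = hidden_size
        layer.intermediate_size_per_partition = intermediate_size_per_partition
        layer.local_num_experts = num_experts
        # ... packed weight creation (weight_packed, scales, ...) continues unchanged.

# Minimal usage with a dummy layer object:
layer = SimpleNamespace()
CompressedTensorsWNA16MoEMethodSketch().create_weights(
    layer, num_experts=8, hidden_size=4096, intermediate_size_per_partition=1408)
assert layer.local_num_experts == 8
```

The same three assignments are added to CompressedTensorsWNA16MarlinMoEMethod.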
Fixed incorrect fallback logic for embedding layers where dimensions were reversed.

## Problem
For embedding layers with shape [vocab_size, embedding_dim]:
- input_dim should be vocab_size (shape[0])
- output_dim should be embedding_dim (shape[1])
- embeddings_tensor_dim should be embedding_dim (shape[1])

Previous code had:
- input_dim fallback: shape[1] ❌ (was getting embedding_dim instead of vocab_size)
- output_dim fallback: shape[0] ❌ (was getting vocab_size instead of embedding_dim)
- embeddings_tensor_dim: used input_size instead of output_size ❌

## Fix
Corrected all fallback paths to use proper dimensions for embedding layers:
- input_dim: shape[0] (vocab_size)
- output_dim: shape[1] (embedding_dim)
- embeddings_tensor_dim: shape[1] (embedding_dim)

Also fixed the elif chain to check output_size instead of input_size for embeddings_tensor_dim.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
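The corrected fallback amounts to the following for an embedding weight of shape [vocab_size, embedding_dim] (a sketch only; the helper name and standalone form are illustrative, not the actual code in vllm/lora/models.py):

```python
import torch

def embedding_lora_dims(weight: torch.Tensor) -> tuple[int, int, int]:
    """Fallback dims for an embedding layer whose weight is [vocab_size, embedding_dim]."""
    vocab_size, embedding_dim = weight.shape
    input_dim = vocab_size                  # previously fell back to shape[1] (wrong)
    output_dim = embedding_dim              # previously fell back to shape[0] (wrong)
    embeddings_tensor_dim = embedding_dim   # previously derived from input_size (wrong)
    return input_dim, output_dim, embeddings_tensor_dim

# Example: a 32000-token vocabulary with 2048-dim embeddings.
assert embedding_lora_dims(torch.empty(32000, 2048)) == (32000, 2048, 2048)
```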
Extends LoRA support to NVFP4 (W4A4) and W4A8 MoE quantization methods.

## Problem
CompressedTensorsW4A4MoeMethod and CompressedTensorsW4A8Int8MoEMethod did not set required layer attributes for LoRA compatibility.

## Solution
Added layer attribute initialization in create_weights() for both:
- CompressedTensorsW4A4MoeMethod (NVFP4)
- CompressedTensorsW4A8Int8MoEMethod

## Impact
- Enables LoRA with NVFP4-quantized MoE models
- Enables LoRA with W4A8 INT8 MoE models (CPU/ARM)
- Completes LoRA support for all compressed-tensors MoE variants

Signed-off-by: sheikheddy <[email protected]>
Summary
This PR enables vLLM to support INT4 quantized models using compressed-tensors together with LoRA adapters. Previously, LoRA injection assumed that weight tensors were directly accessible, but compressed-tensors quantized models expose only packed buffers.
Problem
The LoRA dummy creation code in `vllm/lora/models.py` directly accessed `module.base_layer.weight.shape` to determine tensor dimensions. For compressed-tensors quantized models:
- weights are stored as `weight_packed` (int32 packed buffers) instead of regular tensors
- `weight_packed` has shape `[output_size, input_size // pack_factor]` due to bit-packing
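For a concrete sense of the mismatch (the layer dimensions below are made up; the pack factor follows from fitting eight 4-bit values into one int32):

```python
# Eight INT4 values fit in one int32 element.
pack_factor = 32 // 4  # 8

# Hypothetical linear layer: the logical weight is [output_size, input_size] ...
output_size, input_size = 4096, 11008

# ... but the checkpoint exposes only the packed buffer, so naive shape access
# under-reports the input dimension by the pack factor.
logical_shape = (output_size, input_size)                 # (4096, 11008)
packed_shape = (output_size, input_size // pack_factor)   # (4096, 1376)
print(logical_shape, "->", packed_shape)
```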
Solution
Implemented a multi-tiered fallback strategy to get correct dimensions:
1. Layer-specific attributes (`org_vocab_size`, `embedding_dim`)
2. Generic layer attributes (`input_size`, `output_size`)
3. `weight_shape` parameter (stores unpacked dimensions for compressed-tensors)
4. Fallback to the tensor shape itself

This approach works for all quantization methods (AWQ, GPTQ, BitsAndBytes, compressed-tensors) and all layer types.
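A minimal sketch of that chain for a linear-style layer (illustrative only: the helper name is hypothetical, the real logic in vllm/lora/models.py also handles embedding layers via `org_vocab_size`/`embedding_dim`, and the `[output_size, input_size]` ordering of `weight_shape` is assumed here):

```python
from types import SimpleNamespace

def resolve_lora_dims(base_layer):
    """Resolve (input_dim, output_dim) without reading packed INT4 buffers."""
    # 1-2. Prefer explicit layer attributes when the layer provides them.
    if hasattr(base_layer, "input_size") and hasattr(base_layer, "output_size"):
        return base_layer.input_size, base_layer.output_size
    # 3. compressed-tensors layers carry the unpacked dims in weight_shape.
    weight_shape = getattr(base_layer, "weight_shape", None)
    if weight_shape is not None:
        output_size, input_size = (int(x) for x in weight_shape)
        return input_size, output_size
    # 4. Last resort: the tensor's own shape (fine for unquantized weights).
    output_size, input_size = base_layer.weight.shape
    return input_size, output_size

# A packed layer exposes weight_shape instead of a usable weight.shape:
packed_layer = SimpleNamespace(weight_shape=(4096, 11008))
print(resolve_lora_dims(packed_layer))  # (11008, 4096)
```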
Changes Made
1. Fixed Dummy LoRA Creation (`vllm/lora/models.py`)
- Replaced direct `weight.shape` access with a robust fallback chain

2. Added Integration Tests (`tests/lora/test_quant_model.py`)
- Added `neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4` to the test model list

3. Added Example Code (`examples/offline_inference/lora_with_quantization_inference.py`)
- Added a compressed-tensors example to the existing quantized-LoRA example script

Technical Details
How LoRA Works with Quantization
LoRA operates on activations, not weights:
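Schematically (not vLLM's actual execution path; `quantized_base_forward` and the shapes below are placeholders), the quantized kernel handles the packed weights while LoRA adds a low-rank correction computed purely from the activations:

```python
import torch

def lora_forward(x, quantized_base_forward, lora_a, lora_b, scaling=1.0):
    """y = base(x) + scaling * (x @ A^T) @ B^T; LoRA never touches the packed weights."""
    base_out = quantized_base_forward(x)       # dequant + matmul inside the quant kernel
    lora_out = (x @ lora_a.t()) @ lora_b.t()   # small dense matmuls on activations only
    return base_out + scaling * lora_out

# Toy check with a dense stand-in for the quantized base layer:
in_dim, out_dim, rank = 16, 32, 4
w = torch.randn(out_dim, in_dim)       # stand-in for the dequantized base weight
lora_a = torch.randn(rank, in_dim)
lora_b = torch.randn(out_dim, rank)
x = torch.randn(2, in_dim)
print(lora_forward(x, lambda t: t @ w.t(), lora_a, lora_b).shape)  # torch.Size([2, 32])
```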
This is why the integration works seamlessly: LoRA doesn't need to touch the packed weights directly.
Compatibility
The fix maintains backward compatibility with:
- existing AWQ, GPTQ, and BitsAndBytes + LoRA support
- unquantized models, since direct tensor shape access remains as the final fallback
Testing
Run the Integration Test
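The new case lives in `tests/lora/test_quant_model.py`; any pytest run of that file that selects the `neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4` entry exercises the INT4 compressed-tensors + LoRA path.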
Run the Example
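The example script is `examples/offline_inference/lora_with_quantization_inference.py`; the compressed-tensors case was added to the existing quantized-LoRA examples in that script.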
Performance Characteristics
References
🤖 Generated with Claude Code