-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
Add INT4 compressed-tensors + LoRA support (including MoE) #28791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add INT4 compressed-tensors + LoRA support (including MoE) #28791
Conversation
|
Documentation preview: https://vllm--28791.org.readthedocs.build/en/28791/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for INT4 compressed-tensors with LoRA, including for MoE models. The changes involve updating the LoRA dummy weight creation logic to robustly handle dimensions for compressed tensors, and initializing necessary layer attributes for MoE models. The changes look good and the added tests and examples are helpful.
I've found a couple of potential issues in vllm/lora/models.py related to how layer dimensions are determined, which could lead to incorrect behavior for certain model architectures. My review comments provide more details and suggestions for fixes.
This commit enables vLLM to support INT4 quantized models using compressed-tensors with LoRA adapters. ## Problem LoRA injection previously assumed tensors existed directly, but compressed-tensors quantized models only expose packed buffers. Direct access to `weight.shape` would fail or return incorrect dimensions due to bit-packing. ## Solution Implemented a multi-tiered fallback strategy for obtaining correct tensor dimensions: 1. Layer-specific attributes (org_vocab_size, embedding_dim) 2. Generic layer attributes (input_size, output_size) 3. weight_shape parameter (stores unpacked dims for compressed-tensors) 4. Fallback to tensor shape ## Changes - vllm/lora/models.py: Fixed dummy LoRA creation to use layer attributes and weight_shape instead of direct shape access - tests/lora/test_quant_model.py: Added INT4 compressed-tensors test case with neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4 - examples/offline_inference/lora_with_quantization_inference.py: Added compressed-tensors example ## Testing - Added integration test with compressed-tensors INT4 model - Follows existing patterns from AWQ/GPTQ/BitsAndBytes + LoRA support - All modified files pass Python syntax validation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>
Fixes INT4 compressed-tensors + LoRA for MoE models (e.g., Kimi K2 Thinking). ## Problem CompressedTensorsWNA16MoEMethod and CompressedTensorsWNA16MarlinMoEMethod did not set required layer attributes (hidden_size, intermediate_size_per_partition, local_num_experts) that the FusedMoEWithLoRA wrapper expects to access. This caused LoRA to fail with MoE models using compressed-tensors quantization, even though the weights were accessible. ## Solution Added layer attribute initialization in create_weights() methods for both: - CompressedTensorsWNA16MoEMethod - CompressedTensorsWNA16MarlinMoEMethod These attributes are set before weight creation, matching the pattern used by other MoE methods (e.g., CompressedTensorsW8A8Fp8MoEMethod). ## Impact - Enables LoRA with Kimi K2 Thinking (INT4 MoE + compressed-tensors) - Follows existing patterns from FP8 MoE + LoRA support - No changes to weight layout or kernel behavior 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>
Fixed incorrect fallback logic for embedding layers where dimensions were reversed. ## Problem For embedding layers with shape [vocab_size, embedding_dim]: - input_dim should be vocab_size (shape[0]) - output_dim should be embedding_dim (shape[1]) - embeddings_tensor_dim should be embedding_dim (shape[1]) Previous code had: - input_dim fallback: shape[1] ❌ (was getting embedding_dim instead of vocab_size) - output_dim fallback: shape[0] ❌ (was getting vocab_size instead of embedding_dim) - embeddings_tensor_dim: Used input_size instead of output_size ❌ ## Fix Corrected all fallback paths to use proper dimensions for embedding layers: - input_dim: shape[0] (vocab_size) - output_dim: shape[1] (embedding_dim) - embeddings_tensor_dim: shape[1] (embedding_dim) Also fixed elif chain to check output_size instead of input_size for embeddings_tensor_dim. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>
4a746ad to
8fd7c16
Compare
Extends LoRA support to NVFP4 (W4A4) and W4A8 MoE quantization methods. ## Problem CompressedTensorsW4A4MoeMethod and CompressedTensorsW4A8Int8MoEMethod did not set required layer attributes for LoRA compatibility. ## Solution Added layer attribute initialization in create_weights() for both: - CompressedTensorsW4A4MoeMethod (NVFP4) - CompressedTensorsW4A8Int8MoEMethod ## Impact - Enables LoRA with NVFP4-quantized MoE models - Enables LoRA with W4A8 INT8 MoE models (CPU/ARM) - Completes LoRA support for all compressed-tensors MoE variants Signed-off-by: sheikheddy <[email protected]>
Signed-off-by: Sheikh Abdur Raheem Ali <[email protected]>
|
@jeejeelee any thoughts? (if it's slop and I should start over, that would be good to know) |
|
Thank you for contribution, will look at this PR ASAP |
|
Note that this is my first attempt to contribute to vllm and the content of this PR is mostly AI generated. I'm happy to answer questions about what I'm trying to achieve or elaborate on the goals here if that helps you provide pointers on how to proceed. |
| if hasattr(module.base_layer, "embedding_dim") | ||
| else module.base_layer.weight.shape[1] | ||
| ) | ||
| # Try to get dimensions from layer attributes first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to detect if we're doing Lora and write that in one if branch and normal logic in the other.
easier to read
if A
input_dim = X1
output_dim = Y1
embedding_dim = Z1
elif B
input_dim = X2
output_dim = Y2
embedding_dim = Z2
else C
input_dim = X3
output_dim = Y3
embedding_dim = Z3
than
if A
input_dim = X1
elif B
input_dim = X2
else C
input_dim = X3
if A
output_dim = Y1
elif B
output_dim = Y2
else C
output_dim = Y3
...etc
| input_dim = module.base_layer.weight_shape[0].item() | ||
| else: | ||
| # For embeddings: weight.shape = [vocab_size, embedding_dim] | ||
| input_dim = module.weight.shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be
module.base_layer.weight.shape[1]
?
worrying that tests passed with an issue like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Not sure the tests cover this branch.
| output_dim = module.base_layer.weight_shape[1].item() | ||
| else: | ||
| # For embeddings: weight.shape = [vocab_size, embedding_dim] | ||
| output_dim = module.weight.shape[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also backward, should be shape[0]
| **extra_weight_attrs, | ||
| ): | ||
| # Shapes per local rank (TP/EP): | ||
| # Set layer attributes needed for LoRA compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this only needed for W4A16 as of now?
| params_dtype: torch.dtype, | ||
| **extra_weight_attrs, | ||
| ): | ||
| # Set layer attributes needed for LoRA compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this only needed for W4A16?
HDCharles
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- link the llm-compressor PR
- fix bugs
- i'd like clarity on why the tests you ran didn't catch those bugs, that is incredibly worrying, and run new tests that could catch such an issue.
- this is an unnecessarily verbose PR description for what boils down to storing 3 additional attributes and a small addition to how lora shapes are calculated. If you're going to use an llm you need to be the one to go through the slop first and clean it up.
Look at the recent landed PRs with a similar scope from a similarly experienced contributor and try to match that if you're unsure how much detail to add/not add.
|
I'll work on these changes today, and make sure to keep your guidelines in mind for future contributions. Thanks :) |
|
sounds good! appreciate your work |
|
This pull request has merge conflicts that must be resolved before it can be |
|
We have merged #28971, CT MoE model + LoRA should now be properly supported. If there are any issues, please provide feedback. Thank you |
|
Thanks, will check it out! |
Summary
This PR enables vLLM to support INT4 quantized models using compressed-tensors with LoRA adapters, for both standard models and MoE models (e.g., Kimi K2 Thinking).
Problems Solved
1. Standard Models: Packed Weight Dimension Access
LoRA dummy creation code directly accessed
module.base_layer.weight.shape, which fails for compressed-tensors because:weight_packed(int32 packed buffers)weight_packedhas shape[output_size, input_size // pack_factor]due to bit-packing2. MoE Models: Missing Layer Attributes
CompressedTensorsWNA16MoEMethodandCompressedTensorsWNA16MarlinMoEMethoddidn't set required layer attributes thatFusedMoEWithLoRAexpects:hidden_sizeintermediate_size_per_partitionlocal_num_expertsThis prevented LoRA from working with INT4 MoE models like Kimi K2 Thinking.
Solution
Fix 1: Robust Dimension Detection (
vllm/lora/models.py)Implemented multi-tiered fallback strategy:
org_vocab_size,embedding_dim)input_size,output_size)weight_shapeparameter (stores unpacked dims for compressed-tensors)Fix 2: MoE Layer Attribute Initialization (
compressed_tensors_moe.py)Added layer attribute initialization in
create_weights()for:CompressedTensorsWNA16MoEMethod(line 1741-1744)CompressedTensorsWNA16MarlinMoEMethod(line 1370-1373)This matches the pattern used by other MoE methods (e.g.,
CompressedTensorsW8A8Fp8MoEMethod).Technical Details
How LoRA Works with Quantization
LoRA operates on activations, not weights:
This is why the integration works seamlessly - LoRA doesn't need to touch packed weights directly.
Compressed-Tensors Weight Structure
For INT4 quantization:
weight_packed: Packed int32 tensor, shape[output_size, input_size // pack_factor]weight_scale: FP16/BF16 scales for dequantizationweight_zero_point: Optional zero points (if asymmetric)weight_shape: 2D int64 tensor storing original[output_size, input_size]For MoE:
[num_experts, ...]Changes Made
1. Fixed Dummy LoRA Creation (
vllm/lora/models.py)weight.shapeaccess with robust fallback chain2. Added MoE Layer Attributes (
compressed_tensors_moe.py)CompressedTensorsWNA16MoEMethod.create_weights(): Added attributes (line 1741-1744)CompressedTensorsWNA16MarlinMoEMethod.create_weights(): Added attributes (line 1370-1373)3. Added Integration Tests (
tests/lora/test_quant_model.py)neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4to test model list4. Added Example Code (
examples/offline_inference/lora_with_quantization_inference.py)Compatibility
This fix maintains backward compatibility with:
Testing
Run Integration Tests
Run Example
Test with Kimi K2 Thinking
Performance Characteristics
References
🤖 Generated with Claude Code