Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel #41567
Conversation
Signed-off-by: ZX-ModelCloud <[email protected]>
cc @MekkCyber for quantization
Force-pushed from 5492303 to d839d2b
We have begun AutoAWQ deprecation as well.
Hi @Qubitium! Thanks a lot for working on this! Quick question: what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
@SunMarc @MekkCyber The PR is now synced to the pending Peft/Optimum PRs. Ready for code review for this portion. All tests pass with the pending GPT-QModel 5.4.4 release (later today). Notable changes:
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
SunMarc left a comment
Thanks, left some minor comments!
do_fuse (`bool`, *optional*, defaults to `False`):
-    Whether to fuse attention and mlp layers together for faster inference
+    Deprecated, Whether to fuse attention and mlp layers together for faster inference
fuse_max_seq_len (`int`, *optional*):
-    The Maximum sequence length to generate when using fusing.
+    Deprecated, The Maximum sequence length to generate when using fusing.
modules_to_fuse (`dict`, *optional*, default to `None`):
-    Overwrite the natively supported fusing scheme with the one specified by the users.
+    Deprecated, Overwrite the natively supported fusing scheme with the one specified by the users.
Remove these directly since they are not used.
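For context, a minimal sketch of what an `AwqConfig` looks like once the fuse options are gone; the commented-out call only illustrates the removed arguments, and its values are placeholders rather than anything from this PR.

```python
from transformers import AwqConfig

# Before (deprecated): fusing of attention/MLP layers was driven through the config
# config = AwqConfig(do_fuse=True, fuse_max_seq_len=2048, modules_to_fuse={...})

# After: the fuse-related arguments are dropped entirely
config = AwqConfig()
```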
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
@bot /style

Style bot fixed some files and pushed the changes.
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
quantization_config = AwqConfig(backend=AwqBackend.GEMM)
cls.quantized_model = AutoModelForCausalLM.from_pretrained(
    cls.model_name, device_map=cls.device_map, quantization_config=quantization_config
)
@SunMarc This part of the awq test was modified because the previous test assumed kernel loading and weight saving are 1-to-1 compatible. They are not: the exllama/marlin kernels mutate the weights on load, so the loaded weights are, in effect, not packable/re-savable. To make the test pass (it expects the quantized model to load, save, and then reload and run inference again), we need to specify the GEMM kernel, which does not mutate the weights.
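A rough sketch of the load → save → reload round trip this test exercises, assuming the `AwqBackend` enum and `backend` kwarg shown in the diff above; the import path, checkpoint name, and device map are placeholders, not the actual test code.

```python
import tempfile

from transformers import AutoModelForCausalLM, AwqConfig
from transformers.utils.quantization_config import AwqBackend  # import location is an assumption

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # hypothetical AWQ checkpoint

# GEMM kernels keep the packed weights intact, so the model remains re-savable.
quantization_config = AwqConfig(backend=AwqBackend.GEMM)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=quantization_config
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # With exllama/marlin this round trip breaks: weights are mutated on load.
    model.save_pretrained(tmp_dir)
    reloaded = AutoModelForCausalLM.from_pretrained(tmp_dir, device_map="auto")
```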
@SunMarc Since the last review:

The PR currently depends on GPT-QModel 5.4.4, which is not yet released; we are working to resolve an internal regression in the GPTQ packing code as soon as possible: ModelCloud/GPTQModel#2234
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, autoawq, gptq
Remove autogptq clutter and autogptq-related configs that are not worth adding backward compat for.
GPTQModel has a slight project name change to GPT-QModel with a `-` (the PyPI package and import name stay the same), as we have now added awq/AutoAWQ into our repo and will be making a PR soon to address awq loading using GPT-QModel.

`GPTQConfig` has the most important changes in this PR: the 3 removed properties are all related to kernel selection. These 3 are a hot-potato mess and legacy from autogptq. GPT-QModel uses the existing, unified `backend` property to select kernels. There was compat code written in 2024 to convert these 3 properties to `backend` behind the scenes, but it is no longer relevant for 2025.
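As an illustration of the unified `backend` selection that replaces the removed legacy flags, a minimal sketch; the checkpoint name is hypothetical and the backend string is only an example.

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Kernel selection goes through the single `backend` property; "auto" lets
# GPT-QModel pick the best-performing kernel per module.
quantization_config = GPTQConfig(bits=4, backend="auto")
model = AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical GPTQ checkpoint
    device_map="auto",
    quantization_config=quantization_config,
)
```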
Note: kernels expose `kernel.QUANT_TYPE` (str). GPT-QModel will return the best-performing kernel for the relevant module, and it may be different per module due to in/out features and other gptq/module properties in relation to device type + dtype + many other factors. Tests should only assert an exact `kernel.QUANT_TYPE` if the test specifies a specific kernel via `backend` selection.
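A hedged sketch of how a test could follow that rule: only pin an expected kernel type when the backend was explicitly requested. The `QUANT_TYPE` attribute lookup and the "marlin" backend string are assumptions based on the description above, not the actual test code.

```python
from transformers import AutoModelForCausalLM, GPTQConfig

config = GPTQConfig(bits=4, backend="marlin")  # explicit kernel request (illustrative)
model = AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical checkpoint
    device_map="auto",
    quantization_config=config,
)

for name, module in model.named_modules():
    quant_type = getattr(module, "QUANT_TYPE", None)  # assumed kernel attribute
    if quant_type is not None:
        # With backend="auto" the kernel may differ per module, so an exact
        # assertion is only meaningful because the backend was pinned above.
        assert quant_type == "marlin", f"{name} loaded unexpected kernel {quant_type}"
```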