SINQ quantization strategy integration (adapted for Transformers V5) #43112
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: sinq
SunMarc
left a comment
Thanks for the work, I really appreciate that you adapted your work to fit v5! I left a bunch of comments. I see that the SINQ API is quite similar to hqq, and it would be great if we could improve the API a bit so that the integration is simpler and easier to maintain in general! We had a lot of issues in our hqq integration and I'd prefer that we don't go through that again ;) The best would be to take inspiration from other quant methods like bnb, fp8, mxfp4 or torchao. Also, it would be great to reduce the complexity of this PR by simplifying as much as possible. You will see that the other integrations are much easier to go through, which makes maintenance and usability in the future much better. Feel free to ask any questions, I'm really eager to merge this PR!
docs/source/en/quantization/sinq.md
Outdated
| `--tiling_mode` | Weight matrix tiling strategy | str | 1D, 2D | 1D |
| `--group_size` | Weights per quantization group | int | 64, 128 | 64 |
| `--method` | Quantization method | str | sinq, asinq | sinq |
| `--dtype` | Data type of the original model | str | auto, float16, float32 | auto (bfloat16) |
It should be fine to remove this as we should already have this information when calling from_pretrained
| `--method` | Quantization method | str | sinq, asinq | sinq |
| `--dtype` | Data type of the original model | str | auto, float16, float32 | auto (bfloat16) |
| `--modules_to_not_convert` | List of the layers that are NOT quantized | List of str | [lm_head, ...] | [lm_head] |
| `--device` | Device on which the model is loaded | str | cpu, cuda:0, cuda:1, etc. | cuda:0 |
same for this one, the user can just pass device_map in from_pretrained.
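For illustration, a minimal sketch of that suggestion, assuming the `SinqConfig` class this PR adds and running on this PR's branch; device placement goes through `device_map` in `from_pretrained` rather than a `device` field on the quantization config:

```python
import torch
from transformers import AutoModelForCausalLM, SinqConfig  # SinqConfig is introduced by this PR

# Sketch only: parameters mirror the usage example in the PR description.
cfg = SinqConfig(nbits=4, group_size=64, tiling_mode="1D", method="sinq")

# The device is passed via device_map instead of the quantization config.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B",
    quantization_config=cfg,
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
```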
docs/source/en/quantization/sinq.md
Outdated
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=cfg,
    torch_dtype=torch.bfloat16
torch_dtype is deprecated in favor of dtype
Suggested change:
-    torch_dtype=torch.bfloat16
+    dtype=torch.bfloat16
docs/source/en/quantization/sinq.md
Outdated
patch_hf_pretrained_io()
# Save sinq quantized model
model.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True, private=True)
Suggested change:
-    model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True, private=True)
+    model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
docs/source/en/quantization/sinq.md
Outdated
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True)
Suggested change:
-    # Push to the Hub
-    qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True)
+    # Push to the Hub
+    qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
self.modules_to_not_convert = list(modules_to_not_convert) if modules_to_not_convert is not None else []
let's leave it as None
def to_dict(self) -> dict[str, Any]:
    """
    Convert configuration to a plain dict that will be stored in `config.json`
    under `quantization_config`.
    """
    base = super().to_dict()

    base.update(
        {
            "nbits": self.nbits,
            "group_size": self.group_size,
            "tiling_mode": self.tiling_mode,
            "method": self.method,
            "per_channel": self.per_channel,
            "symmetric": self.symmetric,
            "use_nf4": self.use_nf4,
            "modules_to_not_convert": list(self.modules_to_not_convert or []),
            "dtype": self.dtype,
            "device": self.device,
        }
    )

    base.update(self._extra_kwargs)

    return base
The base `to_dict` should work fine, no?
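To illustrate the point, a minimal sketch assuming the config inherits from `QuantizationConfigMixin` like the other quantization configs; `TinySinqConfig` is a hypothetical, stripped-down class used only for this example:

```python
from transformers.utils.quantization_config import QuantizationConfigMixin

class TinySinqConfig(QuantizationConfigMixin):
    # Hypothetical config: plain instance attributes are already serialized by
    # the inherited to_dict(), so an explicit override should not be needed.
    def __init__(self, nbits=4, group_size=64, tiling_mode="1D"):
        self.nbits = nbits
        self.group_size = group_size
        self.tiling_mode = tiling_mode

cfg = TinySinqConfig()
print(cfg.to_dict())  # {'nbits': 4, 'group_size': 64, 'tiling_mode': '1D'}
```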
@property
def is_trainable(self) -> bool:
    return True

@property
def is_serializable(self) -> bool:
    return True
not needed
@classmethod
def from_dict(cls, config_dict: Dict[str, Any], return_unused_kwargs: bool = False, **kwargs):
    """
    Recreate SinqConfig from a dictionary (e.g. loaded from config.json).
    """
    cfg = dict(config_dict)

    cfg.pop("quant_method", None)

    cfg.update(kwargs)

    if return_unused_kwargs:
        return cls(**cfg), {}

    return cls(**cfg)
not sure what we are doing here
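For reference, `QuantizationConfigMixin` already provides a `from_dict` classmethod with this signature, so the override above is likely redundant. A sketch reusing the hypothetical `TinySinqConfig` from the earlier snippet:

```python
# The inherited from_dict reconstructs the config from a plain dict...
restored = TinySinqConfig.from_dict({"nbits": 4, "group_size": 64, "tiling_mode": "1D"})
print(restored.nbits)  # 4

# ...and return_unused_kwargs is handled by the base implementation as well.
restored, unused = TinySinqConfig.from_dict({"nbits": 2}, return_unused_kwargs=True, foo="bar")
print(unused)  # {'foo': 'bar'}
```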
@@ -0,0 +1,375 @@
tests/models/aimv2/test_modeling_aimv2.py
remove
Hi @SunMarc, thank you very much for taking the time to review my code and for all your helpful comments. I’ve just pushed an updated version that incorporates several of the changes you suggested. Thanks again for your support, and I really appreciate your help!
Thanks for taking the time to go through the comments, I answered a couple of them!
Hi @SunMarc, I’ve just pushed an updated version of the code with the following changes:
Please let me know if these changes are moving the PR in the right direction for approval, and if there’s anything else I should address. Thanks!
Thanks a lot for this!

> Architecture-specific handling. I answered your comment regarding handling particular architectures with an all-in-one approach. For now, I’ve moved away from the all-in-one strategy. As mentioned in my answer, I believe the issue is model-specific rather than a general multimodal architecture problem. Based on this, I opted to explicitly specify the vision components in `modules_to_not_convert` for models that raise errors due to their `init_weights` implementation.

Answered in the thread! But this is a separate issue from this PR and it should be quite simple to fix.

I had a quick look and it looks much better. The flow also makes more sense now. Let me know if you want a new review, but the simpler the code looks, the better it is!
if not self.pre_quantized and self.quantization_config.method == "asinq":
    raise ValueError("A-SINQ is not supported in HuggingFace integration")

sinq_quant_dict = None if self.pre_quantized else self._build_sinq_quant_dict(self.quantization_config)
device_str = _normalize_cuda_device(getattr(self.quantization_config, "device", "auto"))

for full_name, module in list(model.named_modules()):
    if not isinstance(module, nn.Linear):
        continue
    if not should_convert_module(full_name, self.modules_to_not_convert):
        continue

    parent_path, _, child_name = full_name.rpartition(".")
    parent = model.get_submodule(parent_path) if parent_path else model

    # Create empty SINQLinear (no weights yet)
    sinq_layer = SINQLinear(
        in_features=module.in_features if not self.pre_quantized else None,
        out_features=module.out_features if not self.pre_quantized else None,
        bias=(module.bias is not None) if not self.pre_quantized else False,
put that in a replace function like the other methods
Did it! I added a replace_with_sinq_linear() function in sinq.py
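As a rough sketch of what such a helper might look like, modeled on the loop shown in the diff above; `SINQLinear` comes from this PR, while the function name, arguments, and the membership test for `modules_to_not_convert` are assumptions, not the actual implementation:

```python
import torch.nn as nn

def replace_with_sinq_linear(model, quantization_config, modules_to_not_convert=None, pre_quantized=False):
    """Hypothetical sketch: swap every eligible nn.Linear for an empty SINQLinear placeholder."""
    modules_to_not_convert = modules_to_not_convert or []
    for full_name, module in list(model.named_modules()):
        if not isinstance(module, nn.Linear):
            continue
        if any(skip in full_name for skip in modules_to_not_convert):
            continue
        parent_path, _, child_name = full_name.rpartition(".")
        parent = model.get_submodule(parent_path) if parent_path else model
        # SINQLinear is the layer introduced by this PR; arguments mirror the (truncated) diff above.
        sinq_layer = SINQLinear(
            in_features=module.in_features if not pre_quantized else None,
            out_features=module.out_features if not pre_quantized else None,
            bias=(module.bias is not None) if not pre_quantized else False,
        )
        setattr(parent, child_name, sinq_layer)
    return model
```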
if self.quantization_config.method == "asinq" and not self.pre_quantized:
    raise ValueError(
        "You are using `method='asinq'` in the quantization config. Right now the calibrated version of SINQ"
        " is not supported in Hugging Face, please refer to and use the official SINQ repository"
        " to quantize a model with this method."
    )
put that in validate_environment
moved to validate_environment()
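For context, a sketch of how that guard could look once relocated; `validate_environment(*args, **kwargs)` is the existing `HfQuantizer` hook, and the body below is simply the check from the diff above, not the final code:

```python
def validate_environment(self, *args, **kwargs):
    # Reject the calibrated SINQ variant only when quantizing from scratch,
    # not when loading an already-quantized checkpoint.
    if self.quantization_config.method == "asinq" and not self.pre_quantized:
        raise ValueError(
            "method='asinq' (calibrated SINQ) is not supported in the Transformers integration; "
            "please use the official SINQ repository to quantize a model with this method."
        )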
def _resolve_tokenizer_and_model_id(self, model, kwargs):
    tok = kwargs.get("tokenizer", None)
    model_id = None
    cache_dir = kwargs.get("cache_dir", None)

    try:
        if hasattr(model, "config") and hasattr(model.config, "_name_or_path"):
            model_id = model.config._name_or_path
        if model_id is None:
            model_id = kwargs.get("pretrained_model_name_or_path", None)
        if model_id is None and "config" in kwargs and hasattr(kwargs["config"], "_name_or_path"):
            model_id = getattr(kwargs["config"], "_name_or_path", None)

        logger.info(f"[SinqHfQuantizer] Detected model_id = {model_id}")
remove
Thanks! I removed this function since it is no longer used in the code.
tests/models/albert/test_modeling_albert.py
tests/models/align/test_modeling_align.py
tests/models/altclip/test_modeling_altclip.py
tests/models/apertus/test_modeling_apertus.py
tests/models/arcee/test_modeling_arcee.py
tests/models/aria/test_modeling_aria.py
tests/models/audio_spectrogram_transformer/test_modeling_audio_spectrogram_transformer.py
Lots of txt files to delete
I deleted all these txt files
Thanks, @SunMarc, for your comments! I’ve submitted a new version of the code; could you please take a look and let me know if there’s anything else I should change? I think your suggestions helped me make it simpler and cleaner. Thanks again!
What does this PR do?
Summary
This PR introduces an integration of the SINQ quantization method into the Hugging Face Transformers library. It follows the pattern of existing quantization integrations such as HQQ, AWQ, and others.
The goal is to enable users to configure and apply SINQ quantization directly through model configuration parameters, which can then be passed via the `from_pretrained()` function within the Transformers framework.
Example Usage
The pipeline will then be the following:

cfg = SinqConfig(
    nbits=4,
    group_size=64,
    tiling_mode="1D",
    method="sinq",
    modules_to_not_convert=["lm_head"],
    device="cuda:1",
)

model_name = "Qwen/Qwen3-1.7B"
tok = AutoTokenizer.from_pretrained(model_name)
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    torch_dtype=torch.bfloat16,
    quantization_config=cfg,
)

Model saving and loading
Once the model has been quantized, it can be saved locally or pushed to the Hugging Face Hub.
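For example, something along these lines, a sketch assembled from the snippets discussed above; it assumes the quantized `qmodel` and tokenizer `tok` from the usage example, and the Hub repository id is a placeholder:

```python
from transformers import AutoModelForCausalLM

# Save the quantized model and tokenizer locally
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
tok.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")

# Push to the Hub (placeholder repo id)
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")

# Reload the quantized checkpoint later
qmodel = AutoModelForCausalLM.from_pretrained("HF_Hub_username/qwen3-1.7B-sinq-4bit")
```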
Installation
All the required dependencies are installed by installing SINQ from the official GitHub repository.
This pull request is a follow-up to #42151. I opened a new PR to improve clarity and to keep this new integration adaptation separate. Please refer to this pull request and do not consider the previous one.
Fixes #42116
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @MekkCyber