
Conversation

@ChiaraBoretti

What does this PR do?

Summary
This PR introduces an integration of the SINQ quantization method into the Hugging Face Transformers library. It follows the pattern of existing quantization integrations such as HQQ, AWQ, and others.
The goal is to enable users to configure and apply SINQ quantization directly through model configuration parameters, which can then be passed via the from_pretrained() function within the Transformers framework.

Example Usage
The resulting pipeline is the following:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SinqConfig

cfg = SinqConfig(
    nbits=4,
    group_size=64,
    tiling_mode="1D",
    method="sinq",
    modules_to_not_convert=["lm_head"],
    device="cuda:1"
)

model_name = "Qwen/Qwen3-1.7B"
tok = AutoTokenizer.from_pretrained(model_name)

qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=cfg,
)

Model saving and loading
Once the model has been quantized, it can be saved locally or pushed to the Hugging Face Hub.

from sinq.hf_io import patch_hf_pretrained_io
patch_hf_pretrained_io()

hf_path = "HF_hub/path"
qmodel.save_pretrained(local_path)
qmodel.push_to_hub(hf_path, safe_serialization=True)

tok  = AutoTokenizer.from_pretrained(hf_path)
qmodel = AutoModelForCausalLM.from_pretrained(hf_path, device_map="cuda:0")

Installation
All required dependencies are installed when installing SINQ from the official GitHub repository.

This pull request is a follow-up to #42151. I opened a new PR to improve clarity and to keep this new integration adaptation separate. Please refer to this pull request and disregard the previous one.
Fixes #42116

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @MekkCyber

@github-actions
Contributor

github-actions bot commented Jan 5, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: sinq

@SunMarc (Member) left a comment
Thanks for the work, I really appreciate seeing that you adapted your work to fit v5! I left a bunch of comments. I see that the SINQ API is quite similar to HQQ, and it would be great if we could improve the API a bit so that the integration is simpler and easier to maintain in general! We had a lot of issues with our HQQ integration and I'd prefer that we don't go through that again ;) The best would be to take inspiration from other quant methods like bnb, fp8, mxfp4 or torchao. Also, it would be great to reduce the complexity of this PR by simplifying as much as possible. You will see that the other integrations are much easier to go through, which makes future maintenance and usability much better. Feel free to ask any questions, I'm really eager to merge this PR!

| `--tiling_mode` | Weight matrix tiling strategy | str | 1D, 2D | 1D |
| `--group_size` | Weights per quantization group | int | 64, 128 | 64 |
| `--method` | Quantization method | str | sinq, asinq | sinq |
| `--dtype` | Data type of the original model | str | auto, float16, float32 | auto (bfloat16) |
Member:
It should be fine to remove this as we should already have this information when calling from_pretrained

| `--method` | Quantization method | str | sinq, asinq | sinq |
| `--dtype` | Data type of the original model | str | auto, float16, float32 | auto (bfloat16) |
| `--modules_to_not_convert` | List of the layers that are NOT quantize | List of str | [lm_head, ...] | [lm_head] |
| `--device` | Device on which the model is loaded | str | cpu, cuda:0, cuda:1, etc | cuda:0 |
Member:
same for this one, the user can just pass device_map in from_pretrained.
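
For illustration, a minimal sketch of this (the model name and device below are just placeholder examples, not the exact docs snippet):

from transformers import AutoModelForCausalLM, SinqConfig

cfg = SinqConfig(nbits=4, group_size=64)  # no device field in the config
qmodel = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B",
    quantization_config=cfg,
    device_map="cuda:0",  # device placement handled by from_pretrained
)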

qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=cfg,
    torch_dtype=torch.bfloat16
Member:
torch_dtype is deprecated in favor of dtype

Suggested change
torch_dtype=torch.bfloat16
dtype=torch.bfloat16

patch_hf_pretrained_io()
# Save sinq quantized model
model.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True, private=True)
Member:
Suggested change
model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True, private=True)
model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")

Comment on lines 118 to 119
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True)
Member:
Suggested change
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit", safe_serialization=True)
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")

Comment on lines 2160 to 2161
self.modules_to_not_convert = list(modules_to_not_convert) if modules_to_not_convert is not None else []

Member:
let's leave it as None
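
For clarity, a minimal sketch of the suggested default (assuming this sits in SinqConfig.__init__):

# keep None instead of coercing it into an empty list
self.modules_to_not_convert = modules_to_not_convert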

Comment on lines 2194 to 2218
def to_dict(self) -> dict[str, Any]:
    """
    Convert configuration to a plain dict that will be stored in `config.json`
    under `quantization_config`.
    """
    base = super().to_dict()

    base.update(
        {
            "nbits": self.nbits,
            "group_size": self.group_size,
            "tiling_mode": self.tiling_mode,
            "method": self.method,
            "per_channel": self.per_channel,
            "symmetric": self.symmetric,
            "use_nf4": self.use_nf4,
            "modules_to_not_convert": list(self.modules_to_not_convert or []),
            "dtype": self.dtype,
            "device": self.device,
        }
    )

    base.update(self._extra_kwargs)

    return base
Member:
to_dict should work fine no ?
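
In other words, a minimal sketch of relying on the inherited serialization (assuming every SINQ option is stored as an instance attribute, so QuantizationConfigMixin.to_dict() already picks them up):

# no override needed; the base to_dict() serializes the instance attributes
config_dict = cfg.to_dict()  # `cfg` is a SinqConfig instance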

Comment on lines 2186 to 2192
@property
def is_trainable(self) -> bool:
    return True

@property
def is_serializable(self) -> bool:
    return True
Member:
not needed

Comment on lines 2220 to 2234
@classmethod
def from_dict(cls, config_dict: Dict[str, Any], return_unused_kwargs: bool = False, **kwargs):
    """
    Recreate SinqConfig from a dictionary (e.g. loaded from config.json).
    """
    cfg = dict(config_dict)

    cfg.pop("quant_method", None)

    cfg.update(kwargs)

    if return_unused_kwargs:
        return cls(**cfg), {}

    return cls(**cfg)
Member:
not sure what we are doing here

@@ -0,0 +1,375 @@
tests/models/aimv2/test_modeling_aimv2.py
Member:
remove

@ChiaraBoretti
Author

Hi @SunMarc,

Thank you very much for taking the time to review my code and for all your helpful comments. I’ve just pushed an updated version that incorporates several of the changes you suggested.
That said, I still have a few questions, and I ran into some difficulties while trying to implement a couple of your recommendations. I was hoping you might be able to help me with those points.
I’ll reply directly to your specific comments with my questions. I’m keen to further improve the code and, with your guidance, get this pull request ready to be merged.

Thanks again for your support, and I really appreciate your help!

@SunMarc
Member

SunMarc commented Jan 13, 2026

Thanks for taking the time to go through the comments, I answered a couple of them!

@ChiaraBoretti
Author

Hi @SunMarc,

I’ve just pushed an updated version of the code with the following changes:

  • Removal of asinq. I removed the asinq implementation and added an explicit error message when a user tries to set method='asinq'. The message directs users to the official SINQ repository for running the calibrated version of SINQ.

  • Architecture-specific handling. I answered your comment regarding handling particular architectures with an all-in-one approach. For now, I’ve moved away from the all-in-one strategy. As mentioned in my answer, I believe the issue is model-specific rather than a general multimodal architecture problem. Based on this, I opted to explicitly specify the vision components in modules_to_not_convert for models that raise errors due to their init_weights implementation.

  • Updated quantization flow. I reworked the quantization flow by modifying how SINQLinear is implemented, following your suggestion (I hope I understood it correctly). The new flow is: create an empty SINQLinear -> load the weights -> explicitly quantize the weights via a dedicated quantize() function, instead of performing the quantization when the SINQLinear is created (in the __init__ of the class). A rough sketch of this flow is shown right after this list.
    As a result, the SinqQuantize and SinqDeserialize classes are also much simpler now.
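
A rough sketch of the reworked flow, just to make the steps concrete (only SINQLinear and quantize() come from the description above; the other names and signatures are illustrative assumptions):

# 1) create an empty SINQLinear with the same shape as the original nn.Linear
sinq_layer = SINQLinear(
    in_features=module.in_features,
    out_features=module.out_features,
    bias=module.bias is not None,
)
# 2) load the original floating-point weights into it
sinq_layer.load_state_dict(module.state_dict())
# 3) quantize explicitly, instead of quantizing inside __init__
sinq_layer.quantize()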

Please let me know if these changes are moving the PR in the right direction for approval, and if there’s anything else I should address.

Thanks!

@SunMarc
Member

SunMarc commented Jan 15, 2026

Thanks a lot for this !

> Architecture-specific handling. I answered your comment regarding handling particular architectures with an all-in-one approach. For now, I’ve moved away from the all-in-one strategy. As mentioned in my answer, I believe the issue is model-specific rather than a general multimodal architecture problem. Based on this, I opted to explicitly specify the vision components in modules_to_not_convert for models that raise errors due to their init_weights implementation.

Answered in the thread ! But this is a separate issue from this PR and it should be quite simple to fix.

> Updated quantization flow. I reworked the quantization flow by modifying how SINQLinear is implemented, following your suggestion (I hope to understand them correctly). The new flow is: Create an empty SINQLinear -> Load the weights -> Explicitly call the weights quantization via a dedicated quantize() function instead of performing quantization when the SINQLinear is created (in the init of the class).
> As a result, also the SinqQuantize and SinqDeserialize classes are now much simpler.

I had a quick look and it looks much better. The flow also makes more sense now.

Let me know if you want a new review, but the simpler the code looks, the better!

Comment on lines 264 to 283
if not self.pre_quantized and self.quantization_config.method == "asinq":
    raise ValueError("A-SINQ is not supported in HuggingFace integration")

sinq_quant_dict = None if self.pre_quantized else self._build_sinq_quant_dict(self.quantization_config)
device_str = _normalize_cuda_device(getattr(self.quantization_config, "device", "auto"))

for full_name, module in list(model.named_modules()):
    if not isinstance(module, nn.Linear):
        continue
    if not should_convert_module(full_name, self.modules_to_not_convert):
        continue

    parent_path, _, child_name = full_name.rpartition(".")
    parent = model.get_submodule(parent_path) if parent_path else model

    # Create empty SINQLinear (no weights yet)
    sinq_layer = SINQLinear(
        in_features=module.in_features if not self.pre_quantized else None,
        out_features=module.out_features if not self.pre_quantized else None,
        bias=(module.bias is not None) if not self.pre_quantized else False,
Member:
put that in a replace function like the other methods

Author:
Did it! I added a replace_with_sinq_linear() function in sinq.py
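
For reference, a hypothetical skeleton of such a helper (the signature mirrors helpers like replace_with_bnb_linear and is an assumption, not the exact code in the PR):

def replace_with_sinq_linear(model, quantization_config=None, modules_to_not_convert=None):
    """Recursively swap nn.Linear modules for empty SINQLinear modules."""
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and should_convert_module(name, modules_to_not_convert):
            parent_path, _, child_name = name.rpartition(".")
            parent = model.get_submodule(parent_path) if parent_path else model
            setattr(
                parent,
                child_name,
                SINQLinear(module.in_features, module.out_features, bias=module.bias is not None),
            )
    return model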

Comment on lines 306 to 312
if self.quantization_config.method == "asinq" and not self.pre_quantized:
    raise ValueError(
        "You are using `method='asinq'` in the quantization config. Right now the calibrated version of SINQ"
        " is not supported in Hugging Face, please refer to and use the official SINQ repository"
        " to quantize a model with this method."
    )

Member:
put that in validate_environment

Author:
moved to validate_environment()
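
For completeness, a sketch of what the relocated check could look like (validate_environment is the standard HfQuantizer hook; the body below is an assumption based on the snippet quoted above):

def validate_environment(self, *args, **kwargs):
    if self.quantization_config.method == "asinq" and not self.pre_quantized:
        raise ValueError(
            "`method='asinq'` (the calibrated version of SINQ) is not supported in the "
            "Hugging Face integration; please use the official SINQ repository to quantize "
            "a model with this method."
        )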

Comment on lines 315 to 329
def _resolve_tokenizer_and_model_id(self, model, kwargs):
    tok = kwargs.get("tokenizer", None)
    model_id = None
    cache_dir = kwargs.get("cache_dir", None)

    try:
        if hasattr(model, "config") and hasattr(model.config, "_name_or_path"):
            model_id = model.config._name_or_path
        if model_id is None:
            model_id = kwargs.get("pretrained_model_name_or_path", None)
        if model_id is None and "config" in kwargs and hasattr(kwargs["config"], "_name_or_path"):
            model_id = getattr(kwargs["config"], "_name_or_path", None)

        logger.info(f"[SinqHfQuantizer] Detected model_id = {model_id}")

Member:
remove

Author:
Thanks! I removed this function since it is no longer used in the code.

Comment on lines 1 to 7
tests/models/albert/test_modeling_albert.py
tests/models/align/test_modeling_align.py
tests/models/altclip/test_modeling_altclip.py
tests/models/apertus/test_modeling_apertus.py
tests/models/arcee/test_modeling_arcee.py
tests/models/aria/test_modeling_aria.py
tests/models/audio_spectrogram_transformer/test_modeling_audio_spectrogram_transformer.py
Member:
lots of txt files to delete

Author:
I deleted all these txt files

@ChiaraBoretti
Author

Thanks, @SunMarc, for your comments!

I’ve submitted a new version of the code; could you please take a look and let me know if there’s anything else I should change? I think your suggestions helped me make it simpler and cleaner.

Thanks again!
