Changes from 43 commits
3c49a27
sinq integration files
ChiaraBoretti Oct 31, 2025
5cab0cb
sinq integration update
ChiaraBoretti Nov 3, 2025
bcb1d6f
sinq integration no lazy import
ChiaraBoretti Nov 4, 2025
2f054e9
Tests for sinq integration
ChiaraBoretti Nov 4, 2025
f12b58d
minor changes to sinq integration
ChiaraBoretti Nov 4, 2025
82bcaa9
sinq integration documentation added
ChiaraBoretti Nov 6, 2025
296aec7
small correction to sinq documentation
ChiaraBoretti Nov 6, 2025
d34764e
small correction to sinq documentation
ChiaraBoretti Nov 6, 2025
366b1df
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
00249ad
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
638e83f
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
ff40cc3
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
462b685
Code style fix sinq integration
ChiaraBoretti Nov 10, 2025
5309c4f
minor changes in comments for sinq integration
ChiaraBoretti Nov 10, 2025
31b7699
add for documentation sinq integration
ChiaraBoretti Nov 10, 2025
6375a9d
add documentation for sinq integration
ChiaraBoretti Nov 10, 2025
7cc0c19
minor adjustment in sinq quantizer
ChiaraBoretti Nov 10, 2025
559e1d9
minor changes to sinq integration
ChiaraBoretti Nov 11, 2025
5d6b840
delete debugging print in sinq integration
ChiaraBoretti Nov 11, 2025
b3e7685
sinq integration files
ChiaraBoretti Oct 31, 2025
d914882
sinq integration update
ChiaraBoretti Nov 3, 2025
4624e0e
sinq integration no lazy import
ChiaraBoretti Nov 4, 2025
3fc92ff
Tests for sinq integration
ChiaraBoretti Nov 4, 2025
f182564
minor changes to sinq integration
ChiaraBoretti Nov 4, 2025
0ad2a84
sinq integration documentation added
ChiaraBoretti Nov 6, 2025
6b7f0b7
small correction to sinq documentation
ChiaraBoretti Nov 6, 2025
02c2dc4
small correction to sinq documentation
ChiaraBoretti Nov 6, 2025
3b60f32
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
233859a
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
9525baf
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
50a1fb0
remove auto_patch_io flag and fix the selection of the device for sin…
ChiaraBoretti Nov 7, 2025
3964e5c
Code style fix sinq integration
ChiaraBoretti Nov 10, 2025
a27526b
minor changes in comments for sinq integration
ChiaraBoretti Nov 10, 2025
8d79c14
add for documentation sinq integration
ChiaraBoretti Nov 10, 2025
46383e2
add documentation for sinq integration
ChiaraBoretti Nov 10, 2025
6f7a09e
minor adjustment in sinq quantizer
ChiaraBoretti Nov 10, 2025
aaee212
minor changes to sinq integration
ChiaraBoretti Nov 11, 2025
efc96bc
delete debugging print in sinq integration
ChiaraBoretti Nov 11, 2025
b4c11a2
Adapt sinq integration to transformers v5
ChiaraBoretti Jan 5, 2026
1063d0d
Merge remote-tracking branch 'origin/sinq_integration' into sinq_inte…
ChiaraBoretti Jan 5, 2026
d7dc7ff
sinq integration for transformers v5
ChiaraBoretti Jan 5, 2026
31ca4f7
Added part of the suggested modifications to make the code simpler
ChiaraBoretti Jan 8, 2026
1b90f6e
Modification of the quantization flow and remove of asinq option
ChiaraBoretti Jan 15, 2026
c471b35
Minor adjustments and creation of fuction to substitute quantized layers
ChiaraBoretti Jan 19, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -217,6 +217,8 @@
title: SpQR
- local: quantization/vptq
title: VPTQ
- local: quantization/sinq
title: SINQ
- local: quantization/contribute
title: Contribute
title: Quantization
1 change: 1 addition & 0 deletions docs/source/en/quantization/overview.md
@@ -37,6 +37,7 @@ Use the Space below to help you pick a quantization method depending on your har
| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
| [HQQ](./hqq) | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ |
| [optimum-quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
| [SINQ](./sinq) | 🟢 | 🟢 | 🟢 | 🟡 | 🟡 | 🟡 | 🟡 | 2/3/4/6/8 | 🔴 | 🟢 | 🟢 | https://github.com/huawei-csl/SINQ |
| [FBGEMM_FP8](./fbgemm_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
| [torchao](./torchao) | 🟢 | 🟢 | 🟢 | 🔴 | 🟡 | 🟢 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
| [VPTQ](./vptq) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
11 changes: 11 additions & 0 deletions docs/source/en/quantization/selecting.md
@@ -30,6 +30,7 @@ Consider the quantization methods below for inference.
| compressed-tensors | loading specific quantized formats (FP8, Sparse) |
| GPTQModel or AWQ | good 4-bit accuracy with upfront calibration |
| HQQ | fast on the fly quantization without calibration |
| SINQ | super-fast yet high-quality on-the-fly quantization without calibration |
| torchao | flexibility and fast inference with torch.compile |

### No Calibration Required (On-the-fly Quantization)
@@ -56,6 +57,16 @@ See the [bitsandbytes documentation](./bitsandbytes) for more details.

See the [HQQ documentation](./hqq) for more details.

#### SINQ

| Pros | Cons |
|----------------------------------------------------------------------|----------------------------------------------------------------------------|
| Super-fast yet high-quality quantization process, no calibration data needed. | Accuracy can degrade significantly at bit widths of 2 or below. |
| GemLite backend for faster inference. | Slower inference for 3-bit models (no GemLite kernel). |
| Supports a wide range of bit widths (8, 4, 3, 2 bit). | |

See the [SINQ documentation](./sinq) for more details.

#### torchao

| Pros | Cons |
182 changes: 182 additions & 0 deletions docs/source/en/quantization/sinq.md
@@ -0,0 +1,182 @@
[![arXiv](https://img.shields.io/badge/arXiv-2509.22944-b31b1b.svg)](https://arxiv.org/abs/2509.22944)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub stars](https://img.shields.io/github/stars/huawei-csl/SINQ?label=Stars&logo=github&logoColor=white&style=flat-square)](https://github.com/huawei-csl/SINQ/stargazers)
[![hf-space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Huawei%20CSL-ffc107?color=ffc107&logoColor=white)](https://huggingface.co/huawei-csl)

# SINQ

[Sinkhorn-Normalized Quantization (SINQ)](https://github.com/huawei-csl/SINQ/tree/main) is a fast, plug-and-play, model-agnostic quantization technique delivering state-of-the-art performance for Large Language Models without sacrificing accuracy.

### 🔍 What You’ll Find Here

- [1. Quantize (and save) any LLM with SINQ](#1-quantize-any-llm-with-sinq)
- [2. How to Cite This Work](#2-how-to-cite-this-work)
- [3. Current Limitations](#3-current-limitations)

#### 📊 Feature Comparison: SINQ vs HQQ _(calibration-free)_ and A-SINQ vs AWQ _(calibrated)_

| Feature | **SINQ** | **HQQ** | **A-SINQ** | **AWQ** |
|------------|:--------:|:--------:|:----------:|:-------:|
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
| 📦 NF4 Support | **Yes** | No | **Yes** | No |
| ⚡ Quantization Speed | ~2× **Faster** than HQQ | Slower | ~4× **Faster** than AWQ | Slower |
| 📈 Model Quality | **Higher** | Lower | **Higher** | Lower |


📄 **Want to know more?**
- Read our paper on [**arXiv**](http://arxiv.org/abs/2509.22944)
- Check the official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) github repository

---

## 1. Quantize any LLM with SINQ

### Setup & Quick Start

First, install the package. This can be done in one of two ways:
- From source, using the official GitHub repository [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) **[Recommended]**
- Using the pip package:
```bash
pip install sinq
```

---

### Quantize in a few lines

Quantizing any 🤗 Hugging Face model with SINQ is simple and takes only a few lines of code.
First, create a [`SinqConfig`] and specify the following parameters:

| Parameter | Description | Type | Options | Default |
|------|-------------|---------|---------|----------|
| `nbits` | Bit-width for weight quantization | int | 2, 3, 4, 5, 6, 8 | 4 |
| `tiling_mode` | Weight matrix tiling strategy | str | 1D, 2D | 1D |
| `group_size` | Weights per quantization group | int | 64, 128 | 64 |
| `method` | Quantization method | str | sinq, asinq | sinq |
| `modules_to_not_convert` | List of the layers that are NOT quantized | List of str | [lm_head, ...] | [lm_head] |
| `device` | Device on which the model is loaded | str | cpu, cuda:0, cuda:1, etc. | cuda:0 |
> **Reviewer comment (Member):** same for this one, the user can just pass `device_map` in `from_pretrained`.

Then specify the model you want to quantize and pass the `SinqConfig` as the quantization configuration option:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig

model_name = "Qwen/Qwen3-1.7B"
device = "cuda:0"

cfg = SinqConfig(
    nbits=4,
    group_size=64,
    tiling_mode="1D",
    method="sinq",
    modules_to_not_convert=["lm_head"],
    device=device
)

tok = AutoTokenizer.from_pretrained(model_name)
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=cfg,
    dtype=torch.bfloat16
)

```

✅ That’s it. Your model is now quantized with **SINQ** and ready for inference or saving.
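
As a quick sanity check, you can run a short generation with the freshly quantized model. The snippet below is a minimal sketch that reuses `qmodel` and `tok` from the example above; the prompt and generation settings are illustrative assumptions rather than part of the SINQ API.

```python
# Quick inference sanity check with the SINQ-quantized model from the snippet above.
# The prompt and max_new_tokens are illustrative choices.
inputs = tok("The capital of France is", return_tensors="pt").to(qmodel.device)
with torch.no_grad():
    out = qmodel.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```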

> Check our official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) github repository to stay updated!

---

### Save & reload

If you want to reuse a quantized model later, save it to disk or push it to the Hugging Face Hub, then reload it without needing the base FP weights.
If you installed SINQ from source, you should call the *patch_hf_pretrained_io* function:
```python
from sinq.hf_io import patch_hf_pretrained_io
patch_hf_pretrained_io()
```

> **Reviewer comment (Member):** instead of asking users to use this, we can move this code into hf_quantizer

> **Author reply:** I agree that your approach would be better overall. I also tried to implement your suggestion, but I ran into a few issues along the way. In particular, I moved the relevant lines in quantizer_sinq.py to the beginning of the file so that my custom io_patch is applied as soon as the quantizer module is imported. While this works correctly for saving models, I'm encountering problems when loading already SINQ-quantized models. At the moment, I'm working on a workaround where, during installation of the sinq package via pip, an autopatch is applied that automatically calls patch_hf_pretrained_io(). However, before going too far in that direction, I wanted to ask whether you might have any hints or suggestions on how to better implement your original approach.

> **Reviewer comment (Member):** Can you add it in the is_serializable method, so that it is only called when we want to actually save the model?

```python
# Save the SINQ-quantized model
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
tok.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
# Reload a SINQ-quantized model
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"
tok = AutoTokenizer.from_pretrained(hf_hub_model)
qmodel = AutoModelForCausalLM.from_pretrained(hf_hub_model)
```
Otherwise, if you installed SINQ through pip, you can simply use the built-in HF functions:

```python
# --- Save to a folder (sharded safetensors) ---

# 'qmodel' must already be SINQ-quantized
# Locally save
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
tok.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")

# --- Reload later ---

save_dir = "/path/to/save/qwen3-1.7B-sinq-4bit"
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"

# From local directory
tok = AutoTokenizer.from_pretrained(save_dir)
qmodel = AutoModelForCausalLM.from_pretrained(save_dir)

# From HF Hub
tok = AutoTokenizer.from_pretrained(hf_hub_model)
qmodel = AutoModelForCausalLM.from_pretrained(hf_hub_model)

```

✅ Your model is now loaded and ready for inference!

> Note: If the model has been quantized to 4-bit and the `gemlite` library is installed, the faster GemLite kernel is used to run inference.
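
A quick way to confirm whether this faster path can be taken is to check that the package is importable. This is a minimal sketch using only the Python standard library; it is not part of the SINQ API.

```python
import importlib.util

# True if the gemlite package is installed, i.e. the faster 4-bit kernel can be used.
gemlite_available = importlib.util.find_spec("gemlite") is not None
print(f"gemlite available: {gemlite_available}")
```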

---

### Compatible with the [`lm-eval`](https://github.com/EleutherAI/lm-evaluation-harness) evaluation framework

Below is a minimal example showing how to evaluate a SINQ-quantized model on a benchmark dataset:

```python
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# Wrap the already quantized model and tokenizer with HFLM
lm = HFLM(pretrained=qmodel, tokenizer=tok, device=device)

# Evaluate (many tasks available on lm-eval such as MMLU and HellaSwag)
results = evaluator.simple_evaluate(
    model=lm,
    tasks=["wikitext"],  # small and fast benchmark
    device=device
)
```
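
The returned `results` object is a plain dictionary and the per-task scores live under `results["results"]`. Below is a minimal sketch for inspecting them for the `wikitext` task used above; the exact metric names (e.g. `word_perplexity,none`) depend on your `lm-eval` version, so this is an assumption rather than a guaranteed schema.

```python
# Print every metric lm-eval reported for the wikitext task.
# Metric names vary across lm-eval versions, so we simply iterate over whatever is present.
for metric, value in results["results"]["wikitext"].items():
    print(f"{metric}: {value}")
```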

## 2. How to Cite This Work

If you find **SINQ** useful in your research or applications:
- Support our project by starring ⭐️ the [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) GitHub repository
- Please cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```

## 3. Current Limitations

Currently, the A-SINQ method is not supported in Transformers. Please refer to the official [SINQ repository](https://github.com/huawei-csl/SINQ/tree/main) to quantize a model with this strategy.
At the moment, SINQ quantization and SINQ-quantized models do not support multi-GPU execution, so if your system has multiple GPUs, please specify which one should be used.
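
For example, on a machine with several GPUs you can pin SINQ to a single device through the `device` field of `SinqConfig` described above. The snippet below is a minimal sketch; the model name and the chosen device index are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, SinqConfig

# Pin quantization and the resulting model to a single GPU (index chosen for illustration).
cfg = SinqConfig(nbits=4, device="cuda:1")
qmodel = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", quantization_config=cfg)
```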
Empty file added doctest_list.txt
Empty file.
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -258,6 +258,7 @@
"SpQRConfig",
"TorchAoConfig",
"VptqConfig",
"SinqConfig",
],
"video_utils": [],
"utils.kernel_config": ["KernelConfig"],
@@ -763,6 +764,7 @@
from .utils.quantization_config import HqqConfig as HqqConfig
from .utils.quantization_config import QuantoConfig as QuantoConfig
from .utils.quantization_config import QuarkConfig as QuarkConfig
from .utils.quantization_config import SinqConfig as SinqConfig
from .utils.quantization_config import SpQRConfig as SpQRConfig
from .utils.quantization_config import TorchAoConfig as TorchAoConfig
from .utils.quantization_config import VptqConfig as VptqConfig
2 changes: 2 additions & 0 deletions src/transformers/integrations/__init__.py
@@ -128,6 +128,7 @@
"quanto": ["replace_with_quanto_layers"],
"spqr": ["replace_with_spqr_linear"],
"vptq": ["replace_with_vptq_linear"],
"sinq": ["SinqQuantize", "SinqDeserialize"],
}

try:
Expand Down Expand Up @@ -268,6 +269,7 @@
from .quanto import replace_with_quanto_layers
from .spqr import replace_with_spqr_linear
from .vptq import replace_with_vptq_linear
from .sinq import SinqQuantize, SinqDeserialize

try:
if not is_torch_available():
120 changes: 120 additions & 0 deletions src/transformers/integrations/sinq.py
@@ -0,0 +1,120 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import annotations

from typing import Optional, Dict, Any

from transformers.utils import is_torch_available, logging

from ..core_model_loading import ConversionOps
from ..quantizers.quantizers_utils import get_module_from_name

logger = logging.get_logger(__name__)

if is_torch_available():
    import torch
    import torch.nn as nn

class SinqQuantize(ConversionOps):
    """
    Param-level ConversionOp for SINQ (from FP weights).

    At load time, for each `Linear.weight` that should be quantized:
    - The SINQLinear module already exists (created in _process_model_before_weight_loading)
    - We just call quantize() on it with the loaded weight tensor
    """

    def __init__(self, hf_quantizer: "SinqHfQuantizer"):
        self.hf_quantizer = hf_quantizer

    def convert(
        self,
        input_dict: Dict[str, Any],
        model: Optional["torch.nn.Module"] = None,
        full_layer_name: str | None = None,
        missing_keys=None,
        **kwargs,
    ) -> Dict[str, "torch.Tensor"]:
        _, values = next(iter(input_dict.items()))
        weight_tensor = values[0] if isinstance(values, list) else values

        module, tensor_name = get_module_from_name(model, full_layer_name)

        module.quantize(weight_tensor)

        if missing_keys is not None:
            missing_keys.discard(full_layer_name)

        module._is_hf_initialized = True

        return {}

class SinqDeserialize(ConversionOps):
    """
    ConversionOp for loading *pre-quantized* SINQ checkpoints.

    Checkpoint layout (what `SINQLinear.state_dict` produces) is, per module:
        <prefix>.W_q
        <prefix>.bias
        <prefix>.meta

    WeightConverter in the quantizer is configured so that:
    - we group ".W_q", ".meta", ".bias" as input_dict
    - conceptually treat them as belonging to "<prefix>.weight"
    - and call this SinqDeserialize.convert to load the state into the existing SINQLinear.

    The returned dict is {} because we load directly into the module.
    """

    def __init__(self, hf_quantizer: "SinqHfQuantizer"):
        self.hf_quantizer = hf_quantizer

    def convert(
        self,
        input_dict: Dict[str, Any],
        model: Optional["torch.nn.Module"] = None,
        full_layer_name: str | None = None,
        **kwargs,
    ) -> Dict[str, "torch.Tensor"]:
        for k, v in list(input_dict.items()):
            if isinstance(v, list):
                input_dict[k] = v[0]

        W_q = input_dict.get(".W_q", None)
        meta = input_dict.get(".meta", None)
        bias = input_dict.get(".bias", None)

        if W_q is None or meta is None:
            v = next(iter(input_dict.values()))
            if isinstance(v, list):
                v = v[0]
            return {full_layer_name: v}

        module, _ = get_module_from_name(model, full_layer_name)

        state = {
            "W_q": W_q,
            "meta": meta,
        }
        if bias is not None:
            state["bias"] = bias

        module.load_state_dict(state)
        module._is_hf_initialized = True

        return {}