[Feature] Support for out-of-source quantizers #3521

@Giuseppe5

Description

Hello,

I am Giuseppe, one of the main maintainers of Brevitas.

I had the pleasure of chatting with @shimmyshimmer at the PyTorch conference last week, where the topic of QAT and Torch AO integration came up.

I was curious to see whether and how that integration could be extended to support other quantizers (such as Brevitas). I believe it is fairly straightforward, although a few changes are required to make it a bit easier.

Issues and possible solutions

The biggest issue is that the function that currently applies quantization is not easily overridden or modified, since it is a function call buried inside a much bigger staticmethod of the FastLlama class.

I forked the repo to propose a tentative solution to this problem. I am happy to consider other ideas and/or contribute a PR, if that works for you.
These are the changes required:

Giuseppe5@58c22f0
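
To make the idea concrete, here is a rough sketch of the kind of extension point that commit introduces. All names here (`prepare_for_qat`, `default_torchao_quantize`) are hypothetical and only meant to convey the shape of the change:

```python
from typing import Callable, Optional

import torch.nn as nn

# Hypothetical stand-in for the current Torch AO fake-quantization path.
def default_torchao_quantize(model: nn.Module) -> nn.Module:
    ...  # existing Torch AO logic would live here
    return model

def prepare_for_qat(
    model: nn.Module,
    quantize_fn: Optional[Callable[[nn.Module], nn.Module]] = None,
) -> nn.Module:
    """Prepare a model for QAT, delegating to a user-supplied quantizer.

    Passing `quantize_fn` lets an out-of-source library (e.g. Brevitas)
    take over quantization without touching the rest of the pipeline.
    """
    quantize_fn = quantize_fn or default_torchao_quantize
    return quantize_fn(model)
```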

Other, smaller issues are related to the specialization around the Torch AO naming scheme for quantizers (i.e., weight_fake_quantizer and activation_fake_quantizer).
There are easier out-of-source workarounds for this, but perhaps it could be abstracted into something more general? One possible shape for that abstraction is sketched below.
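
A minimal sketch of what such an interface could look like (the protocol and method names below are ours, not anything unsloth or Torch AO define today):

```python
from typing import Protocol

import torch

class FakeQuantBackend(Protocol):
    """Backend-agnostic interface the training loop could target, instead of
    reaching for Torch AO's `weight_fake_quantizer` / `activation_fake_quantizer`
    attributes directly. Torch AO, Brevitas, etc. would each provide an adapter."""

    def fake_quant_weight(self, weight: torch.Tensor) -> torch.Tensor: ...

    def fake_quant_activation(self, x: torch.Tensor) -> torch.Tensor: ...
```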

Example

Starting from the original QAT notebook, I created a slightly modified one that works with Brevitas and my fork of unsloth.
You can find it here:

https://colab.research.google.com/drive/1HhetpDq3oKTN9VIeS3GCSWEWKi7PXG0r?usp=sharing

The main modifications are contained in a block called Brevitas quantization, the core of which is sketched below.
It is a very minimal example, but it could easily be extended to other quantization formats.
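
For readers who do not want to open the notebook, the block boils down to something like the following sketch: swap every `nn.Linear` for a Brevitas `QuantLinear` while preserving the pretrained weights. The exact quantizer configuration in the notebook may differ:

```python
import torch.nn as nn
from brevitas.nn import QuantLinear

def quantize_linear_layers(model: nn.Module, bit_width: int = 4) -> nn.Module:
    """Recursively replace nn.Linear modules with Brevitas QuantLinear,
    copying the original parameters so QAT starts from the pretrained point."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            qlinear = QuantLinear(
                module.in_features,
                module.out_features,
                bias=module.bias is not None,
                weight_bit_width=bit_width,
            )
            qlinear.weight.data.copy_(module.weight.data)
            if module.bias is not None:
                qlinear.bias.data.copy_(module.bias.data)
            setattr(model, name, qlinear)
        else:
            quantize_linear_layers(module, bit_width)
    return model
```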

What comes next

There are a few (minor) missing features compared to the current integration, such as fusing LoRA adapters back into the weights (see the sketch below). We believe this is easy to implement if everything else works as planned.
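
For reference, fusing a LoRA adapter back is just folding the low-rank update into the base weight. A sketch, assuming the usual W' = W + scaling * B @ A convention (with scaling = alpha / r):

```python
import torch

@torch.no_grad()
def fuse_lora(weight: torch.Tensor,
              lora_A: torch.Tensor,
              lora_B: torch.Tensor,
              scaling: float) -> torch.Tensor:
    """Fold a LoRA update into the base weight: W' = W + scaling * (B @ A).

    Shapes: weight (out, in), lora_A (r, in), lora_B (out, r).
    """
    return weight + scaling * (lora_B @ lora_A)
```

In Brevitas, writing the fused value back into the layer's weight should be enough, since (by default) weights are fake-quantized on the fly at forward time.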

The main gap in the example above is the export pathway.

Brevitas decouples quantization application from quantization representation, which means we can readily adapt and implement new export formats (for example, mimicking what Torch AO does, if that is what users want).

We currently provide several export formats (e.g., ONNX through optimum), and we are planning to expand to more (e.g., export to vLLM), but we would love to hear which export/serialization formats you think would be most useful to target.
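
As one concrete reference point, the existing ONNX path is already a one-liner; a minimal sketch using Brevitas's QCDQ-style export (Quantize / Clip / DeQuantize nodes that standard ONNX runtimes understand), on a toy module:

```python
import torch
from brevitas.export import export_onnx_qcdq
from brevitas.nn import QuantLinear

# Toy Brevitas-quantized layer, just to demonstrate the export call.
model = QuantLinear(128, 64, bias=False, weight_bit_width=4)
export_onnx_qcdq(model, args=torch.randn(1, 128), export_path="model_qcdq.onnx")
```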

cc @nickfraser
