[Feature] Support for out-of-source quantizers #3521

@Giuseppe5

Description

Hello,

I am Giuseppe, one of the main maintainers of Brevitas.

I had the pleasure of chatting with @shimmyshimmer at the PyTorch conference last week, where the topic of QAT and Torch AO integration came up.

I was curious to see whether and how that integration could be extended to support other quantizers (such as Brevitas). I believe it is fairly straightforward, although a few changes are required to make it a bit easier.

Issues and possible solutions

The biggest issue is that the function that currently applies quantization is not easily overridden or modified, since it is a function call buried inside a much bigger staticmethod of the FastLlama class.

I forked the repo to propose a tentative solution to this problem. I am happy to consider other ideas and/or contribute a PR, if that works for you.
These are the changes required:

Giuseppe5@58c22f0
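
To make the idea concrete, here is a rough sketch of the kind of extension point that commit introduces. All names here (`prepare_for_qat`, `default_torchao_quantize`) are hypothetical and only meant to convey the shape of the change:

```python
from typing import Callable, Optional

import torch.nn as nn

# Hypothetical stand-in for the current Torch AO fake-quantization path.
def default_torchao_quantize(model: nn.Module) -> nn.Module:
    ...  # existing Torch AO logic would live here
    return model

def prepare_for_qat(
    model: nn.Module,
    quantize_fn: Optional[Callable[[nn.Module], nn.Module]] = None,
) -> nn.Module:
    """Prepare a model for QAT, delegating to a user-supplied quantizer.

    Passing `quantize_fn` lets an out-of-source library (e.g. Brevitas)
    take over quantization without touching the rest of the pipeline.
    """
    quantize_fn = quantize_fn or default_torchao_quantize
    return quantize_fn(model)
```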

Other, smaller issues are related to the specialization around the Torch AO naming scheme for quantizers (i.e., weight_fake_quantizer and activation_fake_quantizer).
There are easier out-of-source workarounds for this, but perhaps it could be abstracted into something more general? One possible shape for that abstraction is sketched below.
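
A minimal sketch of what such an interface could look like (the protocol and method names below are ours, not anything unsloth or Torch AO define today):

```python
from typing import Protocol

import torch

class FakeQuantBackend(Protocol):
    """Backend-agnostic interface the training loop could target, instead of
    reaching for Torch AO's `weight_fake_quantizer` / `activation_fake_quantizer`
    attributes directly. Torch AO, Brevitas, etc. would each provide an adapter."""

    def fake_quant_weight(self, weight: torch.Tensor) -> torch.Tensor: ...

    def fake_quant_activation(self, x: torch.Tensor) -> torch.Tensor: ...
```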

Example

Starting from the original QAT notebook, I created a slightly modified one that works with Brevitas and my fork of unsloth.
You can find it here:

https://colab.research.google.com/drive/1HhetpDq3oKTN9VIeS3GCSWEWKi7PXG0r?usp=sharing

The main modifications are contained in a block called Brevitas quantization, the core of which is sketched below.
It is a very minimal example, but it could easily be extended to other quantization formats.
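
For readers who do not want to open the notebook, the block boils down to something like the following sketch: swap every `nn.Linear` for a Brevitas `QuantLinear` while preserving the pretrained weights. The exact quantizer configuration in the notebook may differ:

```python
import torch.nn as nn
from brevitas.nn import QuantLinear

def quantize_linear_layers(model: nn.Module, bit_width: int = 4) -> nn.Module:
    """Recursively replace nn.Linear modules with Brevitas QuantLinear,
    copying the original parameters so QAT starts from the pretrained point."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            qlinear = QuantLinear(
                module.in_features,
                module.out_features,
                bias=module.bias is not None,
                weight_bit_width=bit_width,
            )
            qlinear.weight.data.copy_(module.weight.data)
            if module.bias is not None:
                qlinear.bias.data.copy_(module.bias.data)
            setattr(model, name, qlinear)
        else:
            quantize_linear_layers(module, bit_width)
    return model
```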

What comes next

There are a few (minor) missing features compared to the current integration, such as fusing LoRA adapters back into the weights (see the sketch below). We believe this is easy to implement if everything else works as planned.
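
For reference, fusing a LoRA adapter back is just folding the low-rank update into the base weight. A sketch, assuming the usual W' = W + scaling * B @ A convention (with scaling = alpha / r):

```python
import torch

@torch.no_grad()
def fuse_lora(weight: torch.Tensor,
              lora_A: torch.Tensor,
              lora_B: torch.Tensor,
              scaling: float) -> torch.Tensor:
    """Fold a LoRA update into the base weight: W' = W + scaling * (B @ A).

    Shapes: weight (out, in), lora_A (r, in), lora_B (out, r).
    """
    return weight + scaling * (lora_B @ lora_A)
```

In Brevitas, writing the fused value back into the layer's weight should be enough, since (by default) weights are fake-quantized on the fly at forward time.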

The main gap in the example above is the export pathway.

Brevitas decouples quantization application from quantization representation, which means we can readily adapt and implement new export formats (for example, mimicking what Torch AO does, if that is what users want).

We currently provide several export formats (e.g., ONNX through optimum), and we are planning to expand to more (e.g., export to vLLM), but we would love to hear which export/serialization formats you think would be most useful to target.
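
As one concrete reference point, the existing ONNX path is already a one-liner; a minimal sketch using Brevitas's QCDQ-style export (Quantize / Clip / DeQuantize nodes that standard ONNX runtimes understand), on a toy module:

```python
import torch
from brevitas.export import export_onnx_qcdq
from brevitas.nn import QuantLinear

# Toy Brevitas-quantized layer, just to demonstrate the export call.
model = QuantLinear(128, 64, bias=False, weight_bit_width=4)
export_onnx_qcdq(model, args=torch.randn(1, 128), export_path="model_qcdq.onnx")
```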

cc @nickfraser
