
[Doc]: Clarify QLoRA (Quantized Model + LoRA) Support in Documentation #13179

@AlexanderZhk

📚 The doc issue

Two parts of the documentation appear to contradict each other, at least at first glance.

Here, it is explicitly stated that LoRA inference with a quantized model is not supported:

##### LORA and quantization
Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!

However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter:

This example shows how to use LoRA with different quantization techniques
for offline inference.

To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):

  1. QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after loading the quantized base model (a minimal sketch of the offline case follows this list).
  2. QLoRA is not supported with the OpenAI-compatible server, even for a single LoRA-base model pair.
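
For reference, here is a minimal sketch of the offline quantized-model + LoRA case, roughly following the pattern in the linked example. The AWQ checkpoint name, adapter path, and prompt are placeholders, and whether a given quantization method works together with LoRA may depend on the vLLM version.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Quantized base model (AWQ used here as an example) with LoRA enabled at load time.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder quantized checkpoint
    quantization="awq",
    enable_lora=True,
    max_lora_rank=16,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is supplied per request; the path below is a placeholder.
outputs = llm.generate(
    ["Explain what a LoRA adapter is in one sentence."],
    sampling_params,
    lora_request=LoRARequest("example_adapter", 1, "/path/to/lora_adapter"),
)

for output in outputs:
    print(output.outputs[0].text)
```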

Edit:

It's easy to miss on the docs site that `##### LORA and quantization` is a subsection of `### Transformers fallback`; that is why I was confused.

### Transformers fallback

#### Supported features
##### LORA and quantization

Metadata

Labels: documentation (Improvements or additions to documentation)
