📚 The doc issue
Two parts of the documentation appear to contradict each other, at least at first glance.
Here, it is explicitly stated that LoRA inference with a quantized model is not supported:
vllm/docs/source/models/supported_models.md, lines 59 to 61 in 4c0d93f:

> ##### LORA and quantization
> Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter:
> This example shows how to use LoRA with different quantization techniques
> for offline inference.
To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):
- QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after the quantized base model has been loaded (see the sketch after this list).
- QLoRA is not supported with the OpenAI-compatible server, even for a single base model + LoRA adapter pair.
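
For reference, this is the kind of usage I mean by the first point: a minimal sketch of offline inference with a quantized base model plus a LoRA adapter, using vLLM's `LLM` and `LoRARequest`. The checkpoint name, adapter path, and the choice of AWQ quantization are placeholders, not taken from the docs:

```python
# Minimal sketch (not from the vLLM docs): offline inference with a quantized
# base model plus a LoRA adapter. The model name, adapter path, and quantization
# method below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load an AWQ-quantized base model with LoRA support enabled.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder quantized checkpoint
    quantization="awq",               # must match the checkpoint's quantization
    enable_lora=True,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is passed per request at generation time; it is not
# dynamically registered on a running server.
outputs = llm.generate(
    ["Explain what a LoRA adapter is in one sentence."],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)

print(outputs[0].outputs[0].text)
```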
Edit:
It's easy to miss on the docs site that `##### LORA and quantization` is a subsection of `### Transformers fallback`; that's why I was confused.
vllm/docs/source/models/supported_models.md, lines 57 to 59 in 4c0d93f:

> ### Transformers fallback
> #### Supported features
> ##### LORA and quantization