Add HQQ quant loader for ooba#4888
Conversation
This looks promising. Have you run any tests on perplexity, speed, and maximum context size?
modules/models.py
```python
model_dir = f'{shared.args.model_dir}/{model_name}'
logger.warning(f"loading HQQ model from {model_dir}")
model = HQQModelForCausalLM.from_quantized(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
```
You could probably just return model and let ooba handle the tokenizer options like AWQ does.
Good call, changed it.
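For reference, a minimal sketch of what the revised loader could look like, returning only the model and leaving the tokenizer to the caller, as the AWQ path does. The function name is illustrative, and the `hqq` import path is taken from the hqq examples, so it may differ between versions:

```python
import logging

logger = logging.getLogger(__name__)


def load_hqq_model(model_dir: str):
    """Load an HQQ-quantized model from a local directory.

    Returns only the model; the caller is expected to build the
    tokenizer itself (e.g. with AutoTokenizer.from_pretrained),
    mirroring how the AWQ loader is handled.
    """
    # Lazy import so hqq stays an optional dependency.
    # Import path taken from the hqq examples; may vary by version.
    from hqq.engine.hf import HQQModelForCausalLM

    logger.warning("loading HQQ model from %s", model_dir)
    return HQQModelForCausalLM.from_quantized(model_dir)
```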
The performance is not great at the moment. The PPL results should refer to @mobicham's:
You should not use the Instruct model for wikitext ppl; use the base model instead. Here are the numbers with a comparison vs. bitsandbytes:
More numbers for Llama2 and OpenCLIP: https://mobiusml.github.io/hqq_blog/
Try with https://github.com/mobiusml/hqq/blob/master/examples/llama2_benchmark/eval_model.py; you should get the same numbers as the ones I posted.
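For context, a wikitext perplexity evaluation like the one in that script boils down to averaging per-token negative log-likelihoods over sliding windows and exponentiating. A hedged sketch of the core computation (helper names and the window size are illustrative, not taken from the linked script, and the window-level mean is an approximation of the token-weighted mean the real script would use):

```python
import math


def perplexity_from_nlls(nlls):
    """Perplexity = exp(mean negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))


def eval_wikitext_ppl(model, tokenizer, text, window=1024):
    """Score `text` in non-overlapping windows and aggregate.

    Sketch only: assumes a causal LM whose forward pass returns
    `.loss` (mean NLL over the window) when given `labels`, as
    transformers models do. Unweighted mean over windows; a real
    eval weights each window by its token count.
    """
    import torch

    ids = tokenizer(text, return_tensors="pt").input_ids
    nlls = []
    for start in range(0, ids.shape[1] - 1, window):
        chunk = ids[:, start:start + window]
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        nlls.append(out.loss.item())
    return perplexity_from_nlls(nlls)
```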
cal066 left a comment:
Thanks, loader looks good.
@oobabooga, any more comments before merge?
Here is a test:
The methodology is the same as in this blog post of mine: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

The key takeaway is that no other quantization has both lower VRAM and lower perplexity than this, so it sits on the Pareto frontier for VRAM vs. perplexity. In terms of model size, however, llama-2-13b-EXL2-4.400b is 142MB smaller and has 0.001 lower perplexity. These are my thoughts:
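The Pareto-frontier claim can be made precise: a scheme is on the frontier if no other scheme is at least as good on both axes (VRAM and perplexity, lower is better) and strictly better on one. A small sketch with made-up numbers, not the measured results from this thread:

```python
def pareto_frontier(points):
    """Return names of points not dominated on (vram, ppl); lower is better."""
    frontier = []
    for name, vram, ppl in points:
        dominated = any(
            v <= vram and p <= ppl and (v < vram or p < ppl)
            for n, v, p in points
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Illustrative numbers only, not measurements from this thread.
schemes = [
    ("A", 10.0, 5.00),
    ("B", 8.0, 5.20),
    ("C", 9.0, 5.30),  # dominated by B: more VRAM and higher ppl
    ("D", 12.0, 4.90),
]
print(pareto_frontier(schemes))  # → ['A', 'B', 'D']
```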
@mobicham, two questions:
@oobabooga thank you very much for your comments!
Here are some solutions:
@oobabooga I guess you tested llama.cpp on a Mac with unified memory? That might explain the fast CPU→GPU data transfer. On a Linux machine, that is not really an option; at least with PyTorch, calling `.cuda()` is too slow.
No, I use Linux with a 3090. llama.cpp doesn't do CPU offloading: when you don't offload all layers to the GPU, the remaining layers are computed on the CPU itself. For Mixtral, since only 2 experts with 7b parameters are used at a time, the speed ends up decent even though the CPU layers are bottlenecked by the ~20 GB/s RAM bandwidth.
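The back-of-envelope behind that: a memory-bound decoder can't generate faster than memory bandwidth divided by the bytes of weights touched per token. A sketch with assumed numbers (~13B active parameters per token for Mixtral's two routed experts, 4-bit weights; only the ~20 GB/s figure is from this thread):

```python
def max_tokens_per_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed for a memory-bandwidth-bound LLM.

    active_params_b: parameters read per token, in billions.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Mixtral routes ~2 of 8 experts per token: roughly 13B active params.
# At 4-bit weights over ~20 GB/s system RAM:
print(round(max_tokens_per_s(13, 4, 20), 1))  # → 3.1
```

That order of magnitude (a few tokens/s) is consistent with "decent but RAM-bottlenecked" CPU decoding.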
About VLLM, another PR is open about integrating it (#4860); I don't know how practical that would be considering that the inference code in this project relies a lot on the transformers library. It's something that I have to investigate. |
Cool! vLLM is indeed quite different. I have some ideas to make the HF models faster and will give it a try in the upcoming days.
Q4 is 4-bit; going from 4-bit to 2-bit leads to a big drop in quality indeed. How does it compare to Q2_K?
I haven't tested. In the 2-bit domain, I got impressive results with QuIP# for llama-2-70b-chat. Q2_K is huge in comparison (like 50% larger), as it's closer to 3-bit than 2-bit, but the author of the k-quants method used in llama.cpp claims to have a new version of the method that is closer to QuIP#. He pushed some examples to https://huggingface.co/ikawrakow/llama-v2-2bit-gguf but didn't release the code yet. See the discussion here.
Exciting news! I look forward to trying the updated version. Mixtral 3-bit would be interesting as well, as it should fit in 24GB VRAM.
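Whether a quantized Mixtral fits in 24GB can be estimated from bits per weight: the weights alone take roughly total parameters × bits / 8 bytes, plus overhead for scales/zeros, activations, and the KV cache. A rough sketch (the ~47B total parameter count is an assumption, not a figure from this thread):

```python
def weight_gb(params_b, bits_per_weight):
    """Approximate size of the quantized weights alone, in GB."""
    return params_b * bits_per_weight / 8


# Mixtral-8x7B has roughly 47B total parameters.
for bits in (2, 3, 4):
    print(f"{bits}-bit: ~{weight_gb(47, bits):.1f} GB of weights")
```

At 3-bit the weights come out to roughly 17.6 GB, which lines up with the ~18 GB on-disk sizes mentioned below and leaves some room in 24GB for the KV cache, at the cost of context length.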
I did a quick comparison with QuIP# 4-bit. I forgot to compress the scaling, which should reduce the memory, but here's a rough comparison (PPL/Memory). Will play with QuIP# 2-bit later.

Regarding the new Mixtral models, here are the links:

Numbers:

It's true that the 3-bit models should work on 24 GB, but with a smaller context window. On disk, the new model is only 0.20 GB larger (18.2GB vs. 18GB), but for some reason it takes an extra 1 GB in VRAM.

By the way, you can now install hqq via pip:
0.2GB for a 1.14 drop in ppl is a massive improvement. Very impressive. With the PyPI package, the PR looks good to merge now. @mobicham, if I may make two additional suggestions for future HQQ versions:
Thanks @oobabooga!
Checklist:
HQQ quant code: https://github.com/mobiusml/hqq
HQQ quant blog: https://mobiusml.github.io/hqq_blog/
HQQ quant for Mixtral 2b: https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-2bit_g16_s128-HQQ
It can load the whole Mixtral model in 2-bit with 24GB VRAM.