llama.cpp: add IQ3_XXS quantization models #8

ymcui · 2024-01-31T03:31:32Z

Description

This PR adds the new GGUF quantization type IQ3_XXS, which was recently introduced in llama.cpp (ref. ggml-org/llama.cpp#5196). If you are using 3-bit quantization, consider trying IQ3_XXS: in the measurements under Performance, it reaches lower perplexity than Q2_K at only a slightly larger model size.

IQ3_XXS GGUF models have been added to the following repositories:

Chinese-Mixtral-GGUF: https://huggingface.co/hfl/chinese-mixtral-gguf
Chinese-Mixtral-Instruct-GGUF: https://huggingface.co/hfl/chinese-mixtral-instruct-gguf
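
To locate the new files in one of the repositories above, a small helper can filter the repository's file list by quantization type. This is only a sketch: it assumes the GGUF filenames contain the quant name (e.g. `iq3_xxs`), which this PR does not state. `pick_quant_files` and `list_iq3_xxs_files` are hypothetical helper names, and the second one needs the `huggingface_hub` package and network access.

```python
def pick_quant_files(filenames, quant="IQ3_XXS"):
    """Return the .gguf filenames that mention the given quant type.

    Matching is case-insensitive. Assumption: the quant name appears in
    the filename; check the repository listing if nothing matches.
    """
    tag = quant.lower()
    return [
        name for name in filenames
        if name.lower().endswith(".gguf") and tag in name.lower()
    ]


def list_iq3_xxs_files(repo_id="hfl/chinese-mixtral-gguf"):
    """List IQ3_XXS GGUF files in a Hugging Face repository (needs network)."""
    # Imported lazily so pick_quant_files works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return pick_quant_files(list_repo_files(repo_id))
```

The filtering is kept separate from the network call so that it can be checked offline against any list of filenames.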

Performance

Quant	Q2_K	⭐️IQ3_XXS	Q3_K	Q4_K
Model Size	16.12 GB	17.05 GB	18.96 GB	24.62 GB
BPW	2.96	3.14	3.86	4.87
Speed (PP)	10.27	26.78	12.17	10.02
Speed (TG)	20.29	20.58	21.74	21.67
PPL@Chinese-Mixtral	5.1846 +/- 0.05533	4.5990 +/- 0.04969	4.5545 +/- 0.04893	4.4488 +/- 0.04813
PPL@Chinese-Mixtral-Instruct	4.5758 +/- 0.03959	4.0389 +/- 0.03489	4.5563 +/- 0.04126	3.9265 +/- 0.03407

Note: Speed is in ms/token (lower is better), measured on an A100-40G. PP: prompt processing; TG: text generation. PPL: perplexity (lower is better).
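
To make the size/perplexity trade-off in the table easier to read, the numbers can be expressed relative to a baseline quant. The sketch below copies the Model Size and PPL@Chinese-Mixtral values from the table. `ROWS` and `relative_to` are illustrative names, not part of llama.cpp or of this PR.

```python
# Values copied from the table above (PPL is the Chinese-Mixtral row).
ROWS = {
    "Q2_K":    {"size_gb": 16.12, "ppl": 5.1846},
    "IQ3_XXS": {"size_gb": 17.05, "ppl": 4.5990},
    "Q3_K":    {"size_gb": 18.96, "ppl": 4.5545},
    "Q4_K":    {"size_gb": 24.62, "ppl": 4.4488},
}


def relative_to(baseline, rows=ROWS):
    """Return the percentage change in size and PPL of each quant vs. baseline."""
    base = rows[baseline]
    return {
        name: {
            "size_pct": 100.0 * (row["size_gb"] / base["size_gb"] - 1.0),
            "ppl_pct": 100.0 * (row["ppl"] / base["ppl"] - 1.0),
        }
        for name, row in rows.items()
    }


if __name__ == "__main__":
    for name, delta in relative_to("Q4_K").items():
        print(f"{name:8s} size {delta['size_pct']:+6.1f}%  PPL {delta['ppl_pct']:+6.1f}%")
```

Relative to Q4_K, IQ3_XXS comes out about 31% smaller with about 3.4% higher perplexity on Chinese-Mixtral, while Q2_K is about 35% smaller with about 16.5% higher perplexity.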

Related Issue

None.

ymcui added 2 commits January 31, 2024 11:30

doc: add iq3_xxs perf.

795e74b

doc: add iq3_xxs perf.

7b42536

ymcui requested a review from iMountTai January 31, 2024 03:35

iMountTai approved these changes Jan 31, 2024

ymcui merged commit 448665a into main Jan 31, 2024
