Change cache_4bit to cache_q4, improve descriptions #5649
oobabooga merged 12 commits into oobabooga:dev from
Conversation
README.md (Outdated)
| `--no_flash_attn` | Force flash-attention to not be used. |
| `--cache_8bit` | Use 8-bit cache to save VRAM. |
- | `--cache_4bit` | Use 4-bit cache to save VRAM. |
+ | `--cache_q4` | Use Q4 cache to save a lot of VRAM. Recommended over 8-bit cache, uses grouped quantization for better performance while saving even more VRAM. |
Is it "recommended"? I haven't seen any perplexity tests, and presumably a 4-bit cache introduces a bigger loss.
According to turboderp, yes: it's a much better way of compressing the KV cache, since it uses grouped quantization instead of just truncating the numbers like FP8.
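To illustrate why grouping helps, here is a toy sketch (not the actual exllamav2 kernel; the min/max scaling and group size below are illustrative assumptions): at the same bit width, giving each small block of values its own scale tracks varying magnitudes far better than one scale for the whole tensor.

```python
import numpy as np

def quantize(x, bits=4, group_size=None):
    """Uniform min/max quantize-then-dequantize (toy model, not exllamav2's scheme).
    With group_size=None a single scale covers the whole tensor; otherwise each
    group of `group_size` values gets its own min/max scale."""
    flat = x.reshape(1, -1) if group_size is None else x.reshape(-1, group_size)
    levels = 2 ** bits - 1
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((flat - lo) / scale)          # integer codes in [0, levels]
    return (q * scale + lo).reshape(x.shape)   # dequantized approximation

rng = np.random.default_rng(0)
# Toy "cache": rows with very different magnitudes, like different heads/channels.
x = (rng.normal(size=(64, 32)) * rng.uniform(0.05, 5.0, size=(64, 1))).astype(np.float32)

err_global = np.abs(quantize(x, bits=4, group_size=None) - x).mean()
err_grouped = np.abs(quantize(x, bits=4, group_size=32) - x).mean()
```

With one global scale, the large-magnitude rows dictate the step size and the small-magnitude rows lose almost all their resolution; per-group scales avoid that, so `err_grouped` comes out well below `err_global` at the same 4 bits.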
I perplexity-tested some models with and without it when comparing 120B 3-bit vs 70B. The perplexity difference between the 8-bit and 4-bit cache is basically rounding error, deep in the decimal points. Don't forget that the 4-bit cache may be slower than the 8-bit one.
I also found in previous tests that the 8-bit cache didn't change perplexity at all. That suggests the perplexity test implemented in the repository is not capable of capturing the loss introduced by quantizing the cache, for reasons beyond my understanding.
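For reference, the perplexity being measured here is just the exponential of the mean per-token negative log-likelihood, which is why a tiny cache-quantization loss can vanish into the decimal points. A minimal sketch (the function is illustrative, not the repository's implementation):

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token. Small per-token
    shifts (e.g. from a slightly lossy KV cache) barely move this number."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If every token has probability 1/4 (NLL = ln 4), perplexity is 4.
base = [math.log(4.0)] * 1000
# Perturb every NLL by 1e-4: perplexity shifts only around the 4th decimal place.
perturbed = [v + 1e-4 for v in base]
```

A per-token NLL shift of 1e-4 moves the perplexity from 4.0 to roughly 4.0004, which is exactly the kind of "deep in the decimal points" difference described above.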
…artowski1182-main
I'll keep the name as-is, as "Q4" is an ill-defined term, and
--------- Co-authored-by: oobabooga <[email protected]>
The new Q4 option is actually a better-quantized version: instead of just chopping off bits like FP8, it uses grouped quantization, which makes it more accurate than FP8 while saving even more VRAM, and I think the descriptions should properly reflect that.
Checklist: