
Change cache_4bit to cache_q4, improve descriptions #5649

Merged
oobabooga merged 12 commits into oobabooga:dev from bartowski1182:main on Mar 7, 2024

Conversation

@bartowski1182
Contributor

The new Q4 option is an actual quantized format rather than just chopping off bits like FP8, which makes it more accurate than FP8 while saving even more VRAM, and I think the descriptions should properly reflect that.

README.md Outdated
| `--no_flash_attn` | Force flash-attention to not be used. |
| `--cache_8bit` | Use 8-bit cache to save VRAM. |
| `--cache_4bit` | Use 4-bit cache to save VRAM. |
| `--cache_q4` | Use Q4 cache to save a lot of VRAM. Recommended over 8-bit cache; uses grouped quantization for better performance while saving even more VRAM. |
Owner

Is it "recommended"? I haven't seen any perplexity tests, and presumably the 4-bit cache introduces a bigger loss.

Contributor Author

According to turboderp, yes: it's a much better way of compressing the KV cache, since it uses grouped quantization instead of just truncating the numbers like FP8 does.
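To illustrate what "grouped quantization" means here, below is a toy sketch (not ExLlamaV2's actual kernel, and the group size and rounding scheme are illustrative assumptions): the cache values are split into small groups, each group stores one float scale, and every value within the group is stored as a low-bit signed integer. Because the scale adapts to each group's local range, the format can track the data rather than using one fixed float layout for everything.

```python
import random

def grouped_quantize(values, bits=4, group_size=32):
    # Toy grouped quantization: each group of values shares one scale and
    # each value is encoded as a low-bit signed integer. Returns the
    # dequantized reconstruction so the error can be inspected.
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits
    out = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid div-by-zero
        for v in group:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out

random.seed(0)
kv = [random.gauss(0.0, 1.0) for _ in range(4096)]   # stand-in for KV values

recon = grouped_quantize(kv)
err = sum(abs(a - b) for a, b in zip(recon, kv)) / len(kv)
print(f"mean abs reconstruction error: {err:.4f}")
```

Storage-wise, each value costs roughly 4 bits plus an amortized share of one scale per group, versus a flat 8 bits per value for an FP8-style cache.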

Contributor

I perplexity-tested some models with and without cache quantization when comparing 120B 3-bit vs 70B. The perplexity difference between the 8-bit and 4-bit cache is basically rounding error, deep in the decimal points. Don't forget that 4-bit may be slower than 8-bit.

Owner

I also found in previous tests that the 8-bit cache didn't change perplexity at all. That suggests the perplexity test implemented in the repository isn't capable of capturing the loss introduced by quantizing the cache, for some reason beyond my understanding.

@oobabooga
Owner

I'll keep the name as-is as "Q4" is an ill-defined term, and cache_4bit may be reused in the future for other backends.

@oobabooga oobabooga merged commit 104573f into oobabooga:dev Mar 7, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024


3 participants