Change cache_4bit to cache_q4, improve descriptions #5649
oobabooga merged 12 commits into oobabooga:dev from
Conversation
README.md (Outdated)
| `--no_flash_attn` | Force flash-attention to not be used. |
| `--cache_8bit` | Use 8-bit cache to save VRAM. |
- | `--cache_4bit` | Use 4-bit cache to save VRAM. |
+ | `--cache_q4` | Use Q4 cache to save a lot of VRAM. Recommended over 8-bit cache, uses grouped quantization for better performance while saving even more VRAM. |
Is it "recommended"? I haven't seen any perplexity tests, and presumably a 4-bit cache introduces a bigger loss.
According to turboderp, yes: it's a much better way of compressing the KV cache, since it uses grouped quantization instead of just truncating the numbers like FP8.
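To illustrate why grouping helps, here is a toy sketch (not the actual exllamav2 kernel; the min/max scaling and group size below are illustrative assumptions): at the same bit width, giving each small block of values its own scale tracks varying magnitudes far better than one scale for the whole tensor.

```python
import numpy as np

def quantize(x, bits=4, group_size=None):
    """Uniform min/max quantize-then-dequantize (toy model, not exllamav2's scheme).
    With group_size=None a single scale covers the whole tensor; otherwise each
    group of `group_size` values gets its own min/max scale."""
    flat = x.reshape(1, -1) if group_size is None else x.reshape(-1, group_size)
    levels = 2 ** bits - 1
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((flat - lo) / scale)          # integer codes in [0, levels]
    return (q * scale + lo).reshape(x.shape)   # dequantized approximation

rng = np.random.default_rng(0)
# Toy "cache": rows with very different magnitudes, like different heads/channels.
x = (rng.normal(size=(64, 32)) * rng.uniform(0.05, 5.0, size=(64, 1))).astype(np.float32)

err_global = np.abs(quantize(x, bits=4, group_size=None) - x).mean()
err_grouped = np.abs(quantize(x, bits=4, group_size=32) - x).mean()
```

With one global scale, the large-magnitude rows dictate the step size and the small-magnitude rows lose almost all their resolution; per-group scales avoid that, so `err_grouped` comes out well below `err_global` at the same 4 bits.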
I perplexity-tested some models with and without it when comparing 120B 3-bit vs 70B. The perplexity difference between the 8-bit and 4-bit cache is basically rounding error, deep in the decimal points. Don't forget that the 4-bit cache may be slower than the 8-bit one.
I also found in previous tests that the 8-bit cache didn't change perplexity at all. That suggests the perplexity test implemented in the repository is not capable of capturing the loss introduced by quantizing the cache, for reasons beyond my understanding.
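For reference, the perplexity being measured here is just the exponential of the mean per-token negative log-likelihood, which is why a tiny cache-quantization loss can vanish into the decimal points. A minimal sketch (the function is illustrative, not the repository's implementation):

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token. Small per-token
    shifts (e.g. from a slightly lossy KV cache) barely move this number."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If every token has probability 1/4 (NLL = ln 4), perplexity is 4.
base = [math.log(4.0)] * 1000
# Perturb every NLL by 1e-4: perplexity shifts only around the 4th decimal place.
perturbed = [v + 1e-4 for v in base]
```

A per-token NLL shift of 1e-4 moves the perplexity from 4.0 to roughly 4.0004, which is exactly the kind of "deep in the decimal points" difference described above.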
…artowski1182-main
I'll keep the name as-is, as "Q4" is an ill-defined term, and
--------- Co-authored-by: oobabooga <[email protected]>
The new Q4 option is actually a better-quantized version: instead of just chopping off bits like FP8, it uses grouped quantization, which makes it more accurate than FP8 while saving even more VRAM, and I think the descriptions should properly reflect that.
Checklist: