Merged
Conversation
added 12 commits
October 24, 2024 12:48
Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales).
Dequantize works, but there is still something wrong with the dot products.
Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn.
My main motivation was to enable FA. But FA does not work anyway because the head size is 100 for the Bitnet ternary models (and I had forgotten this little detail).
This reverts commit f2d315b. As far as I can tell, the commit breaks Metal TG.
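For context on what the per-row scales in the commits above amount to, here is a minimal sketch of dequantizing a ternary row whose single float scale is stored inline with the packed weights. The struct layout, names, and 2-bit packing are assumptions for illustration, not the actual IQ1_BN/IQ2_BN formats.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row layout: 256 ternary weights packed 2 bits each, plus one
 * float scale stored with the row. This is NOT the real IQ1_BN/IQ2_BN bit
 * layout, only an illustration of the "per row scale" idea. */
typedef struct {
    float   d;            /* per-row scale */
    uint8_t qs[256 / 4];  /* 2-bit codes, 4 per byte */
} ternary_row_t;

static void dequantize_ternary_row(const ternary_row_t * x, float * y, size_t n) {
    /* Map each 2-bit code to {-1, 0, +1}; the row scale restores magnitude. */
    static const float lut[4] = { -1.0f, 0.0f, 1.0f, 0.0f };
    for (size_t i = 0; i < n && i < 256; ++i) {
        const uint8_t byte = x->qs[i / 4];
        const uint8_t code = (byte >> (2 * (i % 4))) & 0x3;
        y[i] = x->d * lut[code];
    }
}
```

With the scale carried inside the row data, the mainline convert_hf_to_gguf.py no longer needs to emit separate scale tensors, which is the motivation stated in the first commit above.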
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Nov 11, 2024
* Adapting iq2_bn to work without separate scale tensors. Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales).
* Adapting iq1_bn to work without separate scale tensors
* Adapting iq2_bn: CUDA dequantize
* Adapting iq2_bn: CUDA works
* Adapting iq1_bn: CUDA works
* Adapting iq1_bn, iq2_bn: NEON
* Adapting iq1_bn, iq2_bn: Metal. Dequantize works, but there is still something wrong with the dot products.
* WIP. Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels.
* Remove iq1_tn and iq2_tn - Part 1. Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn.
* Remove iq1_tn and iq2_tn - Part 2
* Bitnet: use the standard llm_build_kv to build self attention. My main motivation was to enable FA. But FA does not work anyway because the head size is 100 for the Bitnet ternary models (and I had forgotten this little detail).
* Revert "Avoid rebuild of GGML graph for each token (ikawrakow#98)". This reverts commit f2d315b. As far as I can tell, the commit breaks Metal TG.
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
This PR:

- Changes `IQ1_BN` and `IQ2_BN` to have per row scales. In that way we can handle Bitnet models with and without separate tensor scales.
- Removes `IQ1_TN` and `IQ2_TN`. With the above change these are now redundant. `IQ1_BN` and `IQ2_BN` are also faster, so no reason to keep these around.
- Changes `build_bitnet()` to use the standard `llm_build_kv()` function for the self attention portion. I was hoping this would also allow to use FA, but nope, the Bitnet models have a strange head size of 100 that is not supported by the FA implementations.

Everything works except - can you guess? - Metal. There is something wrong with the dot product kernels and I simply don't see what. I have to fix Metal before merging.
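On the head-size remark in the last bullet: flash-attention kernels are typically compiled for a fixed set of head sizes, so a model with head size 100 falls outside the specialized variants. Below is a hedged sketch of the kind of capability check involved; the list of sizes is an assumed example, not the actual set supported by any ik_llama.cpp backend.

```c
#include <stdbool.h>

/* Illustrative only: FA kernels are usually specialized for a fixed set of
 * head sizes, so an unusual value such as 100 is rejected and the backend
 * falls back to the regular attention path. */
static bool fa_supports_head_size(int head_dim) {
    switch (head_dim) {
        case 64: case 80: case 96: case 112: case 128: case 256:
            return true;
        default:
            return false;  /* e.g. the Bitnet models' head size of 100 */
    }
}
```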
On CUDA (RTX-4080) we now get 368 t/s for TG-128 with the 3.3B Bitnet model (`IQ2_BN`). When I first added Bitnet support we were at ~320 t/s, so quite an improvement since then.

Update
I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. In the end it turned out that it is PR #98 that breaks the Metal back-end. So, this PR reverts #98.
@agray3 Do you have the ability to investigate why #98 breaks the Metal back-end?