llm : add Falcon support #2717
Conversation
Conversion of the 7b model does not work. The qkv transform needs n_kv_head = 1 for this to work.
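For illustration, here is a minimal numpy sketch of splitting Falcon's fused query_key_value weight per kv group; the function name and exact layout details are a paraphrase of the conversion step, not the PR's code. For the 7b model this only makes sense with n_kv_head = 1 (multi-query attention: all query heads share a single k/v head).

```python
import numpy as np

def split_falcon_qkv(qkv: np.ndarray, n_head: int, n_kv_head: int, head_dim: int):
    """Split Falcon's fused query_key_value weight into q, k, v.

    The fused weight is laid out as n_kv_head groups, each holding
    (n_head // n_kv_head) query heads followed by one k and one v head.
    """
    group = n_head // n_kv_head
    # (group + 2) heads per kv group, each of head_dim rows
    qkv = qkv.reshape(n_kv_head, group + 2, head_dim, -1)
    q = qkv[:, :group].reshape(n_head * head_dim, -1)
    k = qkv[:, [group]].reshape(n_kv_head * head_dim, -1)
    v = qkv[:, [group + 1]].reshape(n_kv_head * head_dim, -1)
    return q, k, v
```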
main using falcon-7b: there is no norm_2 in the 7b model.
@klosax Reconvert and it works. Strangely, enabling Metal we crash here: If I disable the graph concurrency optimization, it does not crash: Any guesses what could be wrong?
Not a clue. I can try enabling cuBLAS to see if that works. Edit: enabling cuBLAS works on the 40b-q4_0 model.
Perplexity on 7b and 40b won't work. It looks like it is something with the tokenizer, since using
Yes, something is wrong with the tokenization. Here is the stack trace:
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
Process 26597 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x000000018b3d8764 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
-> 0x18b3d8764 <+8>: b.lo 0x18b3d8784 ; <+40>
0x18b3d8768 <+12>: pacibsp
0x18b3d876c <+16>: stp x29, x30, [sp, #-0x10]!
0x18b3d8770 <+20>: mov x29, sp
Target 0: (perplexity) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x000000018b3d8764 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x000000018b40fc28 libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x000000018b31dae8 libsystem_c.dylib`abort + 180
frame #3: 0x000000018b3c8b84 libc++abi.dylib`abort_message + 132
frame #4: 0x000000018b3b83b4 libc++abi.dylib`demangling_terminate_handler() + 320
frame #5: 0x000000018b08f03c libobjc.A.dylib`_objc_terminate() + 160
frame #6: 0x000000018b3c7f48 libc++abi.dylib`std::__terminate(void (*)()) + 16
frame #7: 0x000000018b3cad34 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
frame #8: 0x000000018b3cace0 libc++abi.dylib`__cxa_throw + 140
frame #9: 0x0000000100026ba4 perplexity`std::__1::__throw_out_of_range[abi:v15006](__msg="unordered_map::at: key not found") at stdexcept:268:5
frame #10: 0x00000001000619bc perplexity`std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, int, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, int> > >::at(this=0x00000001006040e0 size=65023, __k="<0xE7>") const at unordered_map:1863:9
frame #11: 0x00000001000615a8 perplexity`llama_byte_to_token(vocab=0x00000001006040d8, ch='\xe7') at llama.cpp:2852:30
frame #12: 0x000000010005cb04 perplexity`llama_tokenizer::resegment(this=0x000000016fdfdf08, symbol=0x0000000130fe5888, output=size=1581) at llama.cpp:2991:44
frame #13: 0x000000010005be4c perplexity`llama_tokenizer::tokenize(this=0x000000016fdfdf08, text=" \n = Robert Boulter = \n \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as \" Craig \" in the episode \" Teddy 's Story \" of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the televi"..., output=size=1581) at llama.cpp:2971:13
frame #14: 0x0000000100036bd8 perplexity`llama_tokenize_internal(vocab=0x00000001006040d8, raw_text=" \n = Robert Boulter = \n \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as \" Craig \" in the episode \" Teddy 's Story \" of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the televi"..., bos=true, escape=false) at llama.cpp:3055:15
frame #15: 0x00000001000367b0 perplexity`::llama_tokenize_with_model(model=0x0000000100604080, text=" \n = Robert Boulter = \n \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as \" Craig \" in the episode \" Teddy 's Story \" of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the televi"..., tokens=0x00000001187a0000, n_max_tokens=1290590, add_bos=true) at llama.cpp:5480:16
frame #16: 0x000000010003671c perplexity`::llama_tokenize(ctx=0x0000000102008200, text=" \n = Robert Boulter = \n \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as \" Craig \" in the episode \" Teddy 's Story \" of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the televi"..., tokens=0x00000001187a0000, n_max_tokens=1290590, add_bos=true) at llama.cpp:5450:12
frame #17: 0x0000000100014f50 perplexity`llama_tokenize(ctx=0x0000000102008200, text=" \n = Robert Boulter = \n \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as \" Craig \" in the episode \" Teddy 's Story \" of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the televi"..., add_bos=true) at common.cpp:711:16
frame #18: 0x0000000100003284 perplexity`perplexity(ctx=0x0000000102008200, params=0x000000016fdfedd8) at perplexity.cpp:35:19
frame #19: 0x0000000100006034 perplexity`main(argc=8, argv=0x000000016fdff258) at perplexity.cpp:412:9
frame #20: 0x000000018b0b7f28 dyld`start + 2236
(lldb) frame select 11
frame #11: 0x00000001000615a8 perplexity`llama_byte_to_token(vocab=0x00000001006040d8, ch='\xe7') at llama.cpp:2852:30
2849 char buf[7];
2850 int result = snprintf(buf, sizeof(buf), "<0x%02X>", ch);
2851 GGML_ASSERT(0 <= result && result < 7);
-> 2852 return vocab.token_to_id.at(buf);
2853 }
2854
2855 static std::string llama_escape_whitespace(const std::string& text) {
(lldb) print ch
(uint8_t) $0 = '\xe7'
(lldb)
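The trace bottoms out in llama_byte_to_token, which formats the raw byte as "<0xNN>" and calls token_to_id.at() unconditionally. Such byte-fallback tokens exist in sentencepiece vocabs but not in a GPT-2 style BPE vocab, hence the std::out_of_range. A hedged Python sketch of the same lookup with a non-throwing guard (the names here are illustrative, not llama.cpp's API):

```python
def byte_to_token(token_to_id: dict, ch: int):
    """Map a raw byte to its vocab id, or None if the vocab has no byte tokens.

    SentencePiece vocabs carry byte-fallback tokens spelled "<0xNN>";
    BPE vocabs generally do not, so an unconditional lookup raises
    (the std::out_of_range seen in the trace above).
    """
    key = "<0x%02X>" % ch  # same format as llama.cpp's snprintf "<0x%02X>"
    return token_to_id.get(key)  # None instead of an exception
```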
Is it sentencepiece?
No, it is bpe.
So I'd recommend switching the vocabulary type then. That will be a first for me ;) Edit: is there an easy way for me to test this?
Build the branch and test with Falcon-7b from here:
Seeing that falcon is entirely in bfloat16, should it be converted as f32? |
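Converting to f32 makes sense because bfloat16 is just the upper 16 bits of an IEEE float32, so the widening is exactly lossless, whereas a cast to float16 can overflow bf16's larger exponent range. A small numpy sketch of the idea (not the convert script's actual code):

```python
import numpy as np

def bf16_to_f32(raw: np.ndarray) -> np.ndarray:
    """Widen bfloat16 values (stored as uint16) to float32.

    Shifting the 16 stored bits into the top half of a uint32 and
    reinterpreting as float32 reconstructs the exact value.
    """
    assert raw.dtype == np.uint16
    return (raw.astype(np.uint32) << 16).view(np.float32)
```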
These are used in tests to assert low-level behavior of the tokenizer.
I believe the unicode implementation did not use regex for that reason?
@ggerganov Getting NaNs using cuBLAS without offloading. That was the reason the HellaSwag score was so low when quantizing the output tensor to Q4_0. It seems to work fine without BLAS. Don't know if this is the CUDA GELU or something else.
Since you have a repro, can you try if the NaNs disappear after changing
That seems to work. I will run some more tests.
I created that tokenizer when I realized that without the BPE "merges" support, barely any longer token is correctly tokenized, including the special tokens which are frequently used in fine-tunes; also, any European language requires it. @klosax I did not use regex only because I did not like adding it as a huge dependency and I thought it was going to be very slow; in the end, that decision cost me quite some time. Same reasoning for creating the custom unicode C++ library, just in that case I'm sure it was the right decision.
Yes, the merges are important: adding the merges lowered the perplexity by 33%. The value of having merges in BPE is comparable to having the scores in sentencepiece.
The current llama.cpp implementation of the tokenizer uses regex for simplicity, and it is slow. I guess it will be replaced later.
Yes, your unicode library is a much better choice than depending on the huge ICU for full unicode support. If it is implemented in llama.cpp, it could possibly also be used by the LLaMA sentencepiece tokenizer. The importance of a good tokenizer should not be underestimated when it comes to generation quality.
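To make the role of the merges concrete, here is a minimal Python sketch of greedy BPE merging; it shows the general algorithm (repeatedly apply the highest-priority merge from the merges file), not llama.cpp's exact implementation:

```python
def bpe_tokenize(word: str, merges: dict) -> list:
    """Greedy BPE: repeatedly apply the highest-ranked applicable merge.

    `merges` maps a symbol pair like ("t", "h") to its rank; lower rank
    means earlier in the merges file, i.e. higher priority. Without this
    step almost no multi-character token is ever produced, which is why
    adding merge support improved tokenization so much.
    """
    symbols = list(word)
    while len(symbols) > 1:
        # rank every adjacent pair; unknown pairs get infinite rank
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):  # no applicable merge left
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```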
Falcon 180B has been released now; it would be great to support that: https://huggingface.co/blog/falcon-180b
o.o 360GB safetensors ...... |
I've got 256GB of CPU ram I could try it on, once there's a GGUF. |
You can try to create the GGUF yourself using https://github.com/ggerganov/llama.cpp/blob/master/convert-falcon-hf-to-gguf.py
On further inspection, the falcon convert script won't work. It might be, though, the normal issues with the falcon converter .py:
The normal
Looks like convert.py expects all the
Seems there is code in there for processing tensors split across different files, but changing it to that still fails, due to the config.json missing
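For context, HF sharded checkpoints like Falcon-180B ship a model.safetensors.index.json whose "weight_map" maps each tensor name to the shard file that stores it, so a converter has to walk that map instead of opening a single pytorch_model.bin. A small sketch of reading it (paths and names are illustrative):

```python
import json
from collections import defaultdict

def tensors_by_shard(index_path: str) -> dict:
    """Group tensor names by the shard file that stores them.

    The index JSON's "weight_map" maps e.g.
    "transformer.word_embeddings.weight" -> "model-00001-of-00081.safetensors".
    Grouping by shard lets a converter open each shard file only once.
    """
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return dict(shards)
```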
If setting
the config.json specifies:
should be safe to ignore
Yea, I think it is the same as Falcon 40B.
I've been told that the model architecture should be identical to 40B; I'm making GPTQs right now and it seems to work fine with the same code that worked with Falcon 40B. I'd assumed convert-falcon-to-gguf.py would be the one to update, rather than convert.py? I had a quick look at it myself earlier. There are a few easy fixes at the top, changing field names to match their new names in config.json. But then it's written to load from pytorch_model.bin, and Falcon 180B is in safetensors, and when I looked at the convert.py code for safetensors it looked complicated, with memory mapping and such, so I gave up at that point :) If anyone can get a working script out I can have GGUFs up soon after that
I got a version of convert-falcon-to-gguf.py working: https://github.com/logicchains/llama.cpp/blob/falcon180B/convert-falcon180-hf-to-gguf.py (forked because I can't make a branch here). And it seems to just work! Not too slow either: 0.8 tokens/second for the 6-bit quantisation, while with llama2 70b 8-bit I get around 1.5 tokens/second. ./main -m ./models/falcon-180B-q6_K.gguf -c 2048 --temp 0.7 -t 32 -p "The secrets to a happy marriage are as follows:"
@logicchains Very nice, can you open a PR? Even if it's dirty, then mark it as a draft. :)
Done: #3049. It works as is if we're fine with having a lot of duplication between
@logicchains Thank you so much! That's awesome. I used your script successfully, and am currently uploading all the quant formats to: https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF Even the Q2_K is larger than 50GB, so unfortunately I have to split the files, and there's some manual work required by the user to rejoin them after download. (Oh how I wish GGUF had implemented support for multi-part/sharded files :( ) But they work! Thanks again
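Since the split is a plain byte-level split (as with `split -b`), rejoining is just concatenating the parts in order. A small sketch of the rejoin step (file names are illustrative):

```python
import shutil

def join_parts(parts: list, out_path: str) -> None:
    """Concatenate split file parts back into one file.

    Equivalent to `cat model.gguf-split-* > model.gguf`; the parts
    must be passed in the correct order.
    """
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```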
ref: #1602
ref: #1602
This PR adds support for Falcon models in llama.cpp.
Currently, I've put everything inside the llama.cpp source file. We can think about better refactoring so that we can scale the process and add more models without the source file becoming too big. But for now, the main goal was to see what it takes to add support for a new LLM and serve as an example.
The PR also implements a more accurate BPE tokenizer utilizing merges. Used a reference implementation from:
https://github.com/cmp-nct/ggllm.cpp
However, I've dropped the unicode lib for clarity, and therefore the implementation does not produce exactly correct tokenization in all cases. It should be good enough for latin languages. The advantage is that the code is more compact. Hopefully we will improve it in the future. In any case, 3rd-party tokenizers are always an option and work well with llama.cpp, so if accuracy is essential, this is the recommended way.
CUDA offloading still does not work due to the missing RoPE NeoX kernel
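For reference, the NeoX RoPE variant that Falcon uses rotates dimension pairs (i, i + d/2), unlike the original RoPE which pairs adjacent dimensions (2i, 2i+1); this layout difference is why a separate kernel is needed. A hedged numpy sketch of the NeoX variant for a single head at a single position (not the ggml kernel itself):

```python
import numpy as np

def rope_neox(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """NeoX-style rotary embedding: rotate pairs (i, i + d/2).

    Each pair is rotated by pos * theta^(-2i/d); since each step is a
    2D rotation, the vector norm is preserved.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = theta ** (-np.arange(half) * 2.0 / d)  # per-pair frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```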
TODO
- llama_model_load_internal() into reusable steps
- llm_load_falcon() similar to llm_load_llama()
- llm_build_falcon() similar to llm_build_llama()
- llm_build_falcon() function
- bpe_gpt2_preprocess() is quite slow (regex)
Usage
Performance
build: 38b16df (1052)
build: 176ea71 (1052)
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: 176ea71 (1052)