Gigachad! :D Note: the RoPE cache doubles the perplexity when used.
Also, the quantized K cache might need to be adjusted. If I use: llama-perplexity -m GigaChat3-10B-A1.8B-Q8_0.gguf -mg 2 --override-kv deepseek2.expert_used_count=int:4 -c 512 -mqkv -gr -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080 -f wiki.test.raw I get: perplexity: tokenizing the input ..
I'm only testing on CPU, where I'm not seeing that error (which maybe makes sense, given it looks like a CUDA-path issue). I don't even know what -mg 2 does. The flags I used were:

-mg 2 \
-mqkv \
-ger \
--override-kv deepseek2.expert_used_count=int:2 \
-ctk q8_0 \

and I didn't use -ctv.
@ubergarm: yeah, I tested with 3 experts and the PPL is not so bad. Time for inference now! -mg is "main GPU", a relic of the early llama.cpp versions, used to select the GPU for single-GPU inference, or the KV-cache destination with split-row mode. I never knew if this flag was meant for anything else, so I still set it to my fastest GPU. As for -ctv, I always forget to remove it because it's either used or irrelevant. :D
For the K2-Thinking too :) |
This PR adds support for GigaChat3 and closes #994
The model uses the same MLA attention mechanism as DeepSeek, but with a twist: the value head size is not 128 as in the DeepSeek models, but 192. I guess everybody feels the need to make a creative alteration to an existing architecture.
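To illustrate why a different value head size is only a mild twist rather than a new mechanism, here is a minimal single-head attention sketch (toy NumPy code, not ik_llama.cpp internals; the dimension names are illustrative): the attention scores depend only on the query/key dimension, while the output inherits the value dimension.

```python
import numpy as np

QK_DIM = 128  # per-head query/key size (as in DeepSeek)
V_DIM = 192   # per-head value size (GigaChat3's alteration)
SEQ = 8       # toy sequence length

rng = np.random.default_rng(0)
q = rng.standard_normal((SEQ, QK_DIM))
k = rng.standard_normal((SEQ, QK_DIM))
v = rng.standard_normal((SEQ, V_DIM))

# Scaled dot-product scores: only the q/k dimension enters here.
scores = q @ k.T / np.sqrt(QK_DIM)                        # (SEQ, SEQ)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax

# The output is a weighted sum of values, so it is V_DIM-wide.
out = weights @ v                                         # (SEQ, V_DIM)
print(out.shape)  # (8, 192)
```

The only place an implementation must care is wherever it assumes the value width equals the key width (KV-cache layout, output projection shapes), which is presumably what this PR adjusts.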
Here are some sweep-bench results for the 10B-A1.8B variant (https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B-bf16) quantized as Q8_0, charted for the following configurations:

- ik_llama.cpp, RTX-4080
- llama.cpp, RTX-4080
- ik_llama.cpp, CPU-only, Ryzen-7950X
- llama.cpp, CPU-only, Ryzen-7950X