Conversation
What do you get on hybrid? I wanted to grab Q3/Q4 and was hoping the lower active-parameter count would let it reason at reasonable speeds. The quant is gonna take the rest of the weekend to download :(
It still does not solve the MiMo-V2 quantized cache issue.
If I use -khad for the KV cache, then I get that error. Otherwise, it works.
OK, so, my system: AMD 7800X3D, 8 cores.
Command: `build/bin/llama-server -m models/MiMo-V2-Flash-IQ3_XS.gguf -ot "blk.(?:[0-9]|[1-3][0-9]|[4][0]).ffn.*=CPU" -c 32768 -b 8192 -ub 8192 -ctk q8_0 -ctv q8_0 --threads 7 -ngl 95 -sp -amb 512 --host 0.0.0.0 --port 8080 --webui none --repeat-last-n 2048 -mqkv --jinja`
Performance: this is slower than MiniMax M2.1, which with the same settings gives me about 15 t/s. Is MTP working?
Also, the model doesn't think, which is a problem because without thinking this model is kind of dumb. In SillyTavern I have the thinking settings with chat completion set to maximum, but it doesn't seem to work.
Edit: OK, the model clearly has coherence problems. It's overall quite nonsensical, no matter the context size.
Edit 2: Apparently the first layer is dense, so my -ot becomes `-ot "blk.(?:[1-9]|[1-3][0-9]|[4][0]).ffn.*_exps.*=CPU"`.
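If you want to sanity-check which layers an override pattern like the one in "Edit 2" actually catches before loading a big model, a quick regex test works. This is just a sketch: `blk.N.ffn_up_exps.weight` is one representative MoE tensor name, and the dots are escaped here for strictness (the original pattern leaves them unescaped, which still matches).

```python
import re

# The override pattern from "Edit 2": layer 0 is dense, so the expert
# tensors kept on the CPU should start at blk.1 and run through blk.40.
pattern = re.compile(r"blk\.(?:[1-9]|[1-3][0-9]|40)\.ffn.*_exps.*")

# Check the pattern against a representative expert tensor name per layer.
matched = [n for n in range(50)
           if pattern.search(f"blk.{n}.ffn_up_exps.weight")]
assert matched == list(range(1, 41))  # blk.0 excluded, blk.1 .. blk.40 matched
```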
@MrHills-2 Neither.
Haven't you learned yet that in the age of LLMs, everybody shamelessly and massively exaggerates the utility of the thing that they have done?
It looks like there is still an issue with SWA. I'm looking into it.
Something is not quite right, so converted to draft.
OK, PPL is the same as mainline (actually it is slightly lower). Checked for a few context lengths, and it is fine. If I had a bug in the SWA attention mask preparation, one would see it in PPL. The issue is that, when generating, after a while the model starts endlessly repeating the same thing again and again. I thought there was an issue with my implementation, but I have now observed the exact same behavior in mainline as well. The probability of endless repetition appears to be very sensitive to the temperature. So, my best guess at this point is that my implementation is fine, but the model itself is simply prone to endless repetition. So, I'll remove the draft status. Would appreciate test reports from more users.
I think there's something off again. The model is definitely better, but it's still a little incoherent, and it doesn't follow simple prompts consistently, like keeping the answer under 100 words (it's super yappy). The weirdest thing is that I'm getting 16 t/s at 29,000 tokens of context, but only 12 t/s at 12,000 tokens of context. Also, we need a way to turn thinking on and off.
Well, it is working. I have to check how the outputs are.
Edit: OK, the output is a little off, and the template seems like it's just ChatML. I dunno if it's my sampling or what; gonna have to experiment. Reasoning is supposed to be able to be turned off with jinja template switches, like the body params I can send in SillyTavern.
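For reference, one way llama.cpp-style servers expose such switches is via a `chat_template_kwargs` field in the chat-completion body. Whether MiMo-V2's template actually checks an `enable_thinking` flag is an assumption here; this is only a sketch of the shape of such a request.

```python
import json

# Sketch of a chat-completion body with a template switch to disable the
# thinking block. "enable_thinking" is an assumed kwarg name -- the real
# switch depends on what the model's jinja template looks for.
body = {
    "messages": [{"role": "user", "content": "Answer in under 100 words."}],
    "temperature": 0.3,
    "chat_template_kwargs": {"enable_thinking": False},
}
payload = json.dumps(body)  # what a frontend would POST to /v1/chat/completions
```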
On the Hugging Face page they tell you to use temperature 0.8 normally and 0.3 for agentic tasks. Already at 0.6 I often have massive looping and repetition problems, and the output often goes on forever. The model does quite badly at instruction following; it would be interesting to run an IF benchmark to see how it compares to the non-quantized model. I'd do it, but I'm not home, unfortunately. I'm doing everything over SSH on a phone and it's a bit hard.
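The temperature sensitivity of the looping is at least consistent with how temperature reshapes the sampling distribution: dividing logits by a smaller temperature concentrates probability on the top token, so once a repetitive continuation becomes the most likely one it keeps getting picked. A minimal illustration with made-up logits:

```python
import math

def softmax_with_temperature(logits, temp):
    """Softmax over logits / temp: lower temp sharpens the distribution."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up logits for illustration only
p_hot = softmax_with_temperature(logits, 0.8)[0]   # top-token prob at T=0.8
p_cold = softmax_with_temperature(logits, 0.3)[0]  # top-token prob at T=0.3
assert p_cold > p_hot  # lower temperature concentrates mass on the top token
```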
The IQ2_XXS model I downloaded is basically useless, so possibly the model does not quantize very well. It would be interesting to see how IQ5 behaves. I'll be taking a week break, so it will be next year.
Ran through it a bunch on OpenRouter, and locally it only sometimes returns the same quality. Also used a temperature of 1.0 on OpenRouter with the full model; doubtful they reduce it automatically, DeepSeek style. My instructions are being followed; it's just that the thinking devolves into repetition, and sometimes the outputs do too. I can try lowering the temperature, but it's probably not that simple. I downloaded the Bartowski quant.
This PR adds support for MiMo-V2-Flash (https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash), and closes #1076.
Unlike mainline PR 18328, which does not support flash attention (FA), this PR does support FA.
Split mode "graph" is not supported for now. It turns out my splitting logic for the attention tensors only works when the K and V attention head sizes are the same, which is not true for MiMo-V2. So this will have to be a follow-up PR. Also, I did not add support for HF->GGUF conversion, so mainline will need to be used for that.
Another limitation of this PR was that a quantized KV cache could not be used on CUDA (we got NaNs); it worked fine on the CPU, so the question was why the quantized KV cache failed on CUDA. Fixed with the latest commit.
The other caveat is that the large saving in KV cache size that would be possible due to the aggressive SWA used by MiMo-V2 is not realized, so here mainline has an advantage.
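For intuition on what the quantized cache stores, here is a simplified Python sketch of a Q8_0-style block (32 values sharing one scale). The real implementation stores the scale as fp16 and runs vectorized kernels; the point of the sketch is that the arithmetic itself is bounded and has no inherent NaN source, consistent with it working on the CPU.

```python
# Simplified Q8_0-style block quantization: 32 values, one shared scale.
# (The real format stores the scale as fp16; that detail is omitted here.)
def quantize_q8_0_block(xs):
    amax = max(abs(x) for x in xs)
    d = amax / 127.0 if amax > 0 else 0.0  # scale so quants fit in int8
    quants = [round(x / d) if d else 0 for x in xs]
    return d, quants

def dequantize_q8_0_block(d, quants):
    return [q * d for q in quants]

# Round-trip a made-up block of 32 alternating-sign values.
block = [(-1) ** i * (i / 31.0) for i in range(32)]
d, q = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(d, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
assert all(-127 <= v <= 127 for v in q)  # fits in int8
assert max_err <= d / 2 + 1e-12  # error bounded by half a quantization step
```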
On the other hand, because mainline does not support FA for MiMo-V2, I was still able to go to a much larger context than with mainline. I downloaded the IQ2_XXS quantization from Bartowski; I picked that one so that I can use full GPU offload on the 4x3090 system. With mainline, the best I could do before OOM was a context of 8192 with a u-batch size of 1024. With ik_llama.cpp I can go up to a context of 32k tokens using a u-batch size of 2048. Correspondingly, performance here is quite a bit better than over there (see sweep-bench results below). CPU-only performance is quite decent: I get 115 t/s for PP-2048 and 21.8 t/s for TG-128 on a Ryzen 3995WX CPU.
ik_llama.cpp, Mimo-V2-Flash, IQ2_XXS, 4x3090
llama.cpp, Mimo-V2-Flash, IQ2_XXS, 4x3090