Support for DeepseekV32ForCausalLM with DeepSeek Sparse Attention (DSA) by fairydreaming · Pull Request #21149 · ggml-org/llama.cpp

fairydreaming · 2026-03-29T12:56:48Z

Overview

This PR adds support for DeepseekV32ForCausalLM (DeepSeek V3.2 Exp, DeepSeek V3.2, DeepSeek V3.2 Speciale) models. It contains an implementation of the lightning indexer and DeepSeek Sparse Attention (DSA), both implemented in the simplest possible way as a proof of concept. So far only the CPU and CUDA backends are supported.

Because of the way it's currently implemented, it doesn't improve long-context performance yet; more work is needed for this.

Some GGUFs for testing are available here (-light models). I uploaded Q8_0/Q4_K_M quants, so you need over 700GB/400GB of RAM/VRAM to run them.

I also created a 16GB baby DeepSeek V3.2 GGUF for VRAM-deprived people. It outputs incoherent gibberish, but should be useful for testing and optimizing this implementation even with limited resources.

I could really use some help with verifying the implementation's correctness. If you have a large GPU cluster and can run some benchmarks to compare results with the officially reported benchmark results for DeepSeek V3.2 models, then go for it. More details in #21183.

Fixes #16331, #20363

Additional information

Decisions I made when implementing this:

* a new model arch DEEPSEEK32 was added (mostly a copy of the existing GLM_DSA arch),
* sparse attention was implemented by masking KQ mask entries corresponding to tokens that are not in the set of top-k tokens selected by the lightning indexer,
* for this purpose I added a new GGML op GGML_OP_SCATTER that works similarly to the torch scatter_ operation but is currently limited to setting tensor elements at specified indices to a given scalar value,
* the Hadamard transform was added as another new GGML op GGML_OP_HADAMARD with an implementation borrowed from ik_llama.cpp (thx @ikawrakow),
* the KV cache was implemented as a new llama_kv_cache_dsa class which aggregates the usual llama_kv_cache that caches MLA latent representations (same as before for DeepSeek V3) and a new llama_ik_cache class (basically a copy of llama_kv_cache stripped of code related to the V vector) that caches lightning indexer keys,
* since there are no official jinja templates for V3.2 and V3.2 Speciale, I simply decided to ignore this problem for now. You have to explicitly set the chat template for these models (using the jinja template from V3.2 Exp with these models will allow you to chat, but tool calls won't work correctly).
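
The masking approach described in the list above can be sketched in plain C++ for a single query row. This is only an illustration of the fill, scatter and add steps, not the PR's ggml code: `sparse_mask_row` is a hypothetical helper, and the additive mask convention (0 for visible, -inf for masked) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative only: sparse_mask_row is a hypothetical helper, not a ggml or llama.cpp function.
// kq_mask holds the additive mask for one query token (assumed: 0 = visible, -inf = masked).
// top_k holds the KV positions selected by the lightning indexer for that token.
std::vector<float> sparse_mask_row(const std::vector<float> & kq_mask,
                                   const std::vector<int32_t> & top_k) {
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // 1. start from a fully masked row
    std::vector<float> out(kq_mask.size(), neg_inf);

    // 2. scatter the scalar 0 at the selected positions
    for (int32_t idx : top_k) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < out.size());
        out[static_cast<std::size_t>(idx)] = 0.0f;
    }

    // 3. add the original mask back so that causal masking still applies
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] += kq_mask[i];
    }
    return out;
}
```

A position stays visible only if it is both in the top-k set and visible in the original mask, because adding -inf from either side keeps the entry at -inf.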

Requirements

Due to limitations of the current CUDA ggml_top_k() implementation, the NVIDIA CUDA CCCL library (version >3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation are needed; otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.
Update: the bug in ggml_top_k() is now fixed and the fix is merged, so it should work even on 2.[89] CUDA without CCCL.

Also, if you want to convert the model yourself, set add_bos_token to true in tokenizer_config.json before the model conversion; this is needed for DeepSeek V3.2 and DeepSeek V3.2 Speciale. The conversion script has an assert that checks this.

Next Steps

I'd like to confirm my architectural choices regarding the implementation,
If they are accepted I will clean up the code if needed, merge with the current master and it will be ready for code review,
If not then So Long, and Thanks for All the Fish. Just joking, we can talk about this.

I have read and agree with the contributing guidelines
AI usage disclosure: YES, AI was used as an assistant helping me find bugs in CUDA kernel implementations.


… of llama_kv_cache and new llama_ik_cache (lightning indexer key cache). model : used new llama_kv_cache_dsa instead of modified llama_kv_cache with indexer keys in DeepseekV32ForCausalLM model : removed non-MLA path in DeepseekV32ForCausalLM


CISC · 2026-03-29T13:16:59Z

> Due to limitations of the current CUDA ggml_top_k() implementation NVIDIA CUDA CCCL library (version >3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation is needed, otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.

Hmmm, it should not crash, but fall back...

fairydreaming · 2026-03-29T13:25:40Z

> > Due to limitations of the current CUDA ggml_top_k() implementation NVIDIA CUDA CCCL library (version >3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation is needed, otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.
>
> Hmmm, it should not crash, but fall back...

I will check it again and report a bug if needed.

CISC · 2026-03-29T13:32:44Z

> * Hadamard transform was added as another new GGML op `GGML_OP_HADAMARD` with implementation borrowed from [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) (thx @ikawrakow),

This was already added in #21038 just not as an op, I suggest removing this one for obvious reasons and instead moving that implementation to an op, leaving backend implementation to others.

fairydreaming · 2026-03-29T13:53:23Z

> > * Hadamard transform was added as another new GGML op `GGML_OP_HADAMARD` with implementation borrowed from [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) (thx @ikawrakow),
>
> This was already added in #21038 just not as an op, I suggest removing this one for obvious reasons and instead moving that implementation to an op, leaving backend implementation to others.

Lol, so now I'm supposed to choose between @ikawrakow and @ggerganov Hadamard transform implementation? Thanks @CISC, very helpful of you. 😅

src/llama-model.cpp

CISC · 2026-03-29T13:59:57Z

> Lol, so now I'm supposed to choose between @ikawrakow and @ggerganov Hadamard transform implementation? Thanks @CISC, very helpful of you. 😅

It's not a matter of choosing, I am genuinely being helpful, you know why this is contentious, and the implementation is already here and quite trivial, besides it's preferable (and in our policy) not to include backend changes in non-backend PRs.

fairydreaming · 2026-03-29T17:35:06Z

> > Lol, so now I'm supposed to choose between @ikawrakow and @ggerganov Hadamard transform implementation? Thanks @CISC, very helpful of you. 😅
>
> It's not a matter of choosing, I am genuinely being helpful, you know why this is contentious

So I shouldn't include code from you-know-who in my PR because you-know-why? 😂 (btw I have Iwan permission to use this code in llama.cpp)

If there is an official set of project-wide rules to follow regarding this (apparently highly radioactive) matter then it probably should be formalized in CONTRIBUTING.md file so that:

everyone will have a crystal clear picture of the current situation and any changes,
people will learn about it as early as possible,
it will affect their work as little as possible.

That's what I would call helpful in this matter.

AesSedai · 2026-03-29T21:15:30Z

@fairydreaming Hi, I'm /u/digger412 on reddit, figured I'd migrate the convo here. I've got the electrical outlet installed last week and waiting on a new rack case to arrive to house everything. I think I can have 4 of the 6000 Pros up and running later today (with some hodgepodging and jank setup).

If you can upload a quant that will fit into 384GiB of VRAM then I can try to run it, or I guess I could download the weights and convert it myself with your PR 🤔

Might take a few days but I will get to test this, I promise!

pwilkin · 2026-03-29T21:30:00Z

> So I shouldn't include code from you-know-who in my PR because you-know-why?

That's the current state of affairs, yes ;)

ngxson · 2026-03-29T21:54:36Z

src/llama-ik-cache.h

I could be wrong, but I assume the IK cache (I assume you mean index K) is a K-only cache. Can it be replaced by #19067 instead?

@ngxson Yes, but llama_kv_cache reads tensor dimensions, number of heads etc directly from hparams, so I can't simply instantiate another instance of the cache with different parameter values. I'm not satisfied with my current solution either, as it duplicates a lot of code.

An alternative solution would be to stuff the indexer key tensors into the existing KV cache along with the currently stored MLA latent representation + RoPE prefix tensors and make a view with an offset to read the cache. But that would make both the MLA KV cache and the indexer cache non-contiguous; not sure if that's a good idea.

Hmm yeah I see. Because the indexer uses different size than the main attention block, duplicating the class is probably the cleanest way we can do for now.

In near future, we can also refactor KV cache, such that K and V are 2 separated llama-vec-cache. The "vector cache" can be reused across different types of cache, including index K, KV, iswa. CC @ggerganov for visibility

We can decouple the KV cache implementation from struct llama_model and struct llama_hparams. We would need to introduce struct llama_kv_cache_params and use that within the implementation without referencing the model and its hparams.

This way you should be able to instantiate two different KV caches with different llama_kv_cache_params. Would that work?
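
As a rough illustration of that decoupling idea, the sketch below builds two cache objects from two independent parameter blocks. It is not llama.cpp code: `kv_cache_params_sketch`, `kv_cache_sketch` and every field name are hypothetical, and a real parameter struct would need far more fields.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, heavily reduced parameter block (illustrative field names).
struct kv_cache_params_sketch {
    uint32_t n_layer;  // number of layers that own cache cells
    uint32_t n_embd_k; // width of one cached K vector
    uint32_t n_embd_v; // width of one cached V vector, 0 for a K-only cache
    uint32_t kv_size;  // number of cells per layer
};

// Hypothetical cache that depends only on its own parameters,
// not on a model object or on global hyperparameters.
class kv_cache_sketch {
public:
    explicit kv_cache_sketch(const kv_cache_params_sketch & p) : params(p) {}

    bool has_v() const { return params.n_embd_v > 0; }

    // total number of cached elements across all layers
    std::size_t n_elements() const {
        return static_cast<std::size_t>(params.n_layer) * params.kv_size *
               (static_cast<std::size_t>(params.n_embd_k) + params.n_embd_v);
    }

private:
    kv_cache_params_sketch params;
};
```

With this shape, a main attention cache and a K-only indexer cache are simply two instances constructed with different values, for example `{2, 8, 4, 16}` and `{2, 3, 0, 16}` (toy numbers, not DeepSeek dimensions).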

@ggerganov Yes, that should work. But I did a quick check and llama_kv_cache.cpp currently uses:

hparams.has_kv(il)
hparams.is_mla()
hparams.is_n_embd_v_gqa_variable()
hparams.is_swa(il)
hparams.n_embd_head_k(il)
hparams.n_embd_head_v(il)
hparams.n_embd_k_gqa(il)
hparams.n_embd_v_gqa(il)
hparams.n_embd_v_gqa_max()
hparams.n_head_kv(il)
hparams.n_layer
hparams.n_layer_kv()
hparams.n_lora_kv
hparams.no_alloc
hparams.n_pos_per_embd()
hparams.n_rel_attn_bkts
hparams.n_rot(il)
hparams.rope_type
hparams.use_alibi
model.arch
model.dev_layer(il)
model.get_rope_factors()
model.get_rope_freq_base()
model.get_rope_freq_scale()

So I'm afraid it's not a trivial endeavor, but a major refactoring effort.

As a first step, you can try to pass hparams separately from model and see if this will help deduplicate the llama_kv_cache/llama_ik_cache implementations.

So add a constructor:

    llama_kv_cache(
            const llama_model & model,
            const llama_hparams hparams, // <--- custom hparams, can be overridden for indexing caches
            ggml_type type_k,
            ggml_type type_v,
            bool v_trans,
            bool offload,
            bool unified,
            uint32_t kv_size,
            uint32_t n_seq_max,
            uint32_t n_pad,
            uint32_t n_swa,
            llama_swa_type swa_type,
            const layer_filter_cb & filter,
            const layer_reuse_cb & reuse);

This should be a small change and if it works, we can prepare a small refactor to support that.

> As a first step, you can try to pass hparams separately from model and see if this will help deduplicate the llama_kv_cache/llama_ik_cache implementations.
>
> So add a constructor:
>
>     llama_kv_cache(
>             const llama_model & model,
>             const llama_hparams hparams, // <--- custom hparams, can be overridden for indexing caches
>             ggml_type type_k,
>             ggml_type type_v,
>             bool v_trans,
>             bool offload,
>             bool unified,
>             uint32_t kv_size,
>             uint32_t n_seq_max,
>             uint32_t n_pad,
>             uint32_t n_swa,
>             llama_swa_type swa_type,
>             const layer_filter_cb & filter,
>             const layer_reuse_cb & reuse);
>
> This should be a small change and if it works, we can prepare a small refactor to support that.

@ggerganov So basically: make a copy of this huge pile of parameters, tweak some of them so that the second llama_kv_cache instance works as intended for caching indexer tensors and hope it won't break in the future? Horrible solution looking from the software engineering point of view, but matches the llama.cpp spirit well. Will try.

ngxson · 2026-03-29T21:55:56Z

> So I shouldn't include code from you-know-who in my PR because you-know-why?

Probably you can get more context from PR number 19726

fairydreaming · 2026-03-29T22:35:25Z

> @fairydreaming Hi, I'm /u/digger412 on reddit, figured I'd migrate the convo here. I've got the electrical outlet installed last week and waiting on a new rack case to arrive to house everything. I think I can have 4 of the 6000 Pros up and running later today (with some hodgepodging and jank setup).
>
> If you can upload a quant that will fit into 384GiB of VRAM then I can try to run it, or I guess I could download the weights and convert it myself with your PR 🤔
>
> Might take a few days but I will get to test this, I promise!

Great to hear from you! No need to hurry, I think I'd prefer a larger quant tested with all 8 cards, so that quantization won't affect the model's cognitive performance. Also more VRAM = more concurrent requests. It's getting late today, so tomorrow I will create a discussion about testing the implementation and we can plan there in detail.

fairydreaming · 2026-03-30T08:54:35Z

> > So I shouldn't include code from you-know-who in my PR because you-know-why?
>
> Probably you can get more context from PR number 19726

@ngxson Initially I did some reading on this and the origins, but I had more questions than answers afterwards and overall it just made me sad.

am17an · 2026-03-30T14:26:02Z

I just looked at the CUDA code briefly. For the scatter, you should extend GGML_OP_FILL to take a tensor of positions to copy. For the Hadamard rotation, an op is not the correct way, see #21038 (comment). Most likely it will be added in some form before this PR is ready, so you can just use it when it happens. So no need to feel sad.

fairydreaming · 2026-03-30T16:24:18Z

> I just looked at the CUDA code briefly. For the scatter, you should extend GGML_OP_FILL to take a tensor of positions to copy. For the hadamard rotation, an OP is not the correct way #21038 (comment). Most likely it will added in some form before this PR is ready, so you can just use it when it happens. So no need to feel sad.

@am17an I've read #21038 in more detail today and this approach indeed may be applicable to my PR. I suppose I just have to wait until the dust in llama_kv_cache settles and then clone it to llama_ik_cache to use in Hadamard transforms of indexer query and key vectors.

ggerganov · 2026-03-31T10:30:38Z

src/llama-graph.cpp

+
+    const auto & kq_mask = inp->get_kq_mask();
+
+    // prepare new kq mask - starts filled with -INFINITY
+    ggml_tensor * kq_mask_all = ggml_fill(ctx0, kq_mask, -INFINITY);
+
+    // modify it by unmasking tokens that are in top_k indices
+    ggml_tensor * kq_mask_top_k = ggml_scatter(ctx0, kq_mask_all, top_k, 0);
+
+    // combine with the original kq mask
+    kq_mask_top_k = ggml_add(ctx0, kq_mask_top_k, kq_mask);
+


I wonder, instead of masking the KV cache, wouldn't it be more efficient to extract a new K and KQ mask using ggml_get_rows(..., top_k) and perform the attention on those smaller tensors?

> I wonder, instead of masking the KV cache, wouldn't it be more efficient to extract a new K and KQ mask using ggml_get_rows(..., top_k) and perform the attention on those smaller tensors?

@ggerganov I thought about this solution, but decided to go with the simplest possible one for now. By the way I think for KQ mask in this case we would need something like "for each row get elements that are in the corresponding top_k indices row". Do we have GGML OP like this?

I see, it's not so simple as I thought.

Having ggml_scatter() seems useful anyway.

GGML_OP_FILL can be extended to provide a list of indices to fill?

> I see, it's not so simple as I thought.
>
> Having ggml_scatter() seems useful to have anyway.

@ggerganov AFAIK torch gather works like I mentioned: it gathers values along an axis based on specified indices (the way it's needed for the KQ mask in this case), so it would be another new GGML op (kind of symmetric to scatter, which puts values on an axis based on specified indices). My scatter is somewhat crippled anyway, since it only accepts a single scalar value, not a tensor of values. So maybe it's a better idea to implement GGML_OP_GATHER to get the KQ mask elements indicated by top_k and then use ggml_get_rows() to perform attention only on cached vectors that are in the top_k indices.
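
The per-row gather described here can be sketched in plain C++ as follows. This only illustrates the indexing rule `out[i][j] = src[i][idx[i][j]]` (the torch gather behaviour along the last axis); `gather_rows` is a hypothetical helper and says nothing about how a GGML_OP_GATHER would actually be implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using matrix_f = std::vector<std::vector<float>>;
using matrix_i = std::vector<std::vector<int32_t>>;

// Hypothetical per-row gather: out[i][j] = src[i][idx[i][j]].
// src could be a KQ mask of shape [n_tokens][n_kv],
// idx the top-k indices of shape [n_tokens][k].
matrix_f gather_rows(const matrix_f & src, const matrix_i & idx) {
    assert(src.size() == idx.size());
    matrix_f out(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out[i].reserve(idx[i].size());
        for (int32_t j : idx[i]) {
            // every index must point inside row i of src
            assert(j >= 0 && static_cast<std::size_t>(j) < src[i].size());
            out[i].push_back(src[i][static_cast<std::size_t>(j)]);
        }
    }
    return out;
}
```

Unlike a row selection, each output row uses its own index list, which is why a plain get-rows operation is not enough for the KQ mask.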

I'm currently waiting for #21038 regarding the Hadamard transform implementation, so I can try to implement this solution in the meantime and see what comes of it.

How large is n_top_k typically?

> How large is the n_top_k typical?

@ggerganov DeepSeek V3.2 sets it to 2048. If n_kv is shorter than 2048 the result will be shorter too but for long sequences it would always take top 2048 cached k/v vectors.

Do we know that the indexer improves prefill performance? I remember reading (and it is also obvious) that decoding (i.e. batch size 1) will be much faster with the indexer, but I think that for large batch sizes, we won't benefit much compared to simply doing the regular masked attention without re-gathering the indexed KV data. The reason is that at batch size 512 for example, each token "activating" 2048 KV cells would usually activate the entire cache anyway.

Just want to know if we should focus on a solution that works for small batches (e.g. less than 32), which might be much simpler.

@ggerganov The DeepSeek V3.2 paper said:

for short-sequence prefilling, we specially implement a masked MHA mode to simulate DSA, which can achieve higher efficiency under short-context conditions

So I guess the optimal solution is a hybrid one and we need both (masked dense attention for short sequences and sparse attention for longer) - and that applies both for prefill and decode.

Regarding your remark about entire cache activation - I doubt the lightning indexer top k position selection would activate the entire cache, likely it's trained to attend only to most relevant positions and omit irrelevant ones, so if you had n_kv of 100k the activated n_top_k cache positions would be similar (largely overlapping) for all 512 ubatch tokens. But I can't support this with any data, this is just my intuition.

Another idea that I think does not require any changes in ggml_get_rows() or reshaping flash attn arguments:

1. Find all KV cache indices that at least one of the ubatch tokens attends to (this will be the union of top k indices for a whole ubatch).
2. Remove KQ mask columns that are not in this set (these columns will be all -INF anyway).
3. Perform ggml_get_rows() on K and V cache with indices from point 1 to get only cells contributing to the attention output for at least one ubatch token.
4. Do attention as usual.

This wastes more compute than my previous approach, but maybe would be good enough. Depends on the structure of top k indices for a whole ubatch.

I guess I will do some experiments first to see what the top k indices look like when processing a whole ubatch with a large KV cache.
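
The first three steps of this idea can be sketched in plain C++ on nested vectors. This is a minimal illustration under stated assumptions, not ggml code: all three helpers are hypothetical, the mask passed to `select_columns` is assumed to already be the per-token sparse mask, and `select_rows` only mimics what ggml_get_rows() would do on the K and V caches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using matrix_f = std::vector<std::vector<float>>;
using matrix_i = std::vector<std::vector<int32_t>>;

// Step 1: sorted union of the top-k indices of every token in the ubatch.
std::vector<int32_t> top_k_union(const matrix_i & top_k) {
    std::set<int32_t> seen;
    for (const auto & row : top_k) {
        seen.insert(row.begin(), row.end());
    }
    return std::vector<int32_t>(seen.begin(), seen.end());
}

// Step 2: keep only the mask columns listed in cols.
matrix_f select_columns(const matrix_f & mask, const std::vector<int32_t> & cols) {
    matrix_f out(mask.size());
    for (std::size_t i = 0; i < mask.size(); ++i) {
        out[i].reserve(cols.size());
        for (int32_t c : cols) {
            assert(c >= 0 && static_cast<std::size_t>(c) < mask[i].size());
            out[i].push_back(mask[i][static_cast<std::size_t>(c)]);
        }
    }
    return out;
}

// Step 3: keep only the cache rows (cells) listed in rows.
matrix_f select_rows(const matrix_f & cache, const std::vector<int32_t> & rows) {
    matrix_f out;
    out.reserve(rows.size());
    for (int32_t r : rows) {
        assert(r >= 0 && static_cast<std::size_t>(r) < cache.size());
        out.push_back(cache[static_cast<std::size_t>(r)]);
    }
    return out;
}
```

The size of the union is at most min(n_kv, n_tokens * n_top_k), so how much this saves depends on how strongly the per-token top k sets overlap.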

fairydreaming · 2026-04-01T12:00:55Z

I implemented @ggerganov's idea to get rid of llama_ik_cache by creating another llama_kv_cache instance with tweaked hparams and it works, but during testing I started encountering llama-server crashes, already twice (sorry, no detailed debug info, but it looks like calling a method on a deleted object or a corrupted pointer):

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff7528eb8 in llama_kv_cache::slot_info::size() const ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) up
#1  0x00007ffff7518638 in llama_kv_cache::set_input_k_idxs(ggml_tensor*, llama_ubatch const*, llama_kv_cache::slot_info const&) const ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) up
#2  0x00007ffff7506553 in llm_graph_input_attn_k::set_input(llama_ubatch const*) ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) up
#3  0x00007ffff7509190 in llm_graph_result::set_inputs(llama_ubatch const*) ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) 
#4  0x00007ffff74cff7a in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) 
#5  0x00007ffff74d72c0 in llama_context::decode(llama_batch const&) ()
   from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) 
#6  0x00007ffff74d8d8f in llama_decode () from /home/phm/projects/llama.cpp-deepseek-dsa/build-cuda/bin/libllama.so.0
(gdb) up
#7  0x00005555556b966f in server_context_impl::update_slots() ()
(gdb) 
#8  0x00005555557461be in server_queue::start_loop(long) ()
(gdb) 
#9  0x00005555556104aa in main ()

They are "works OK for 2 hours and then suddenly dies" crashes and I'm not sure if it's my fault (could be) or some code from recent rebase, so I'm leaving it here in (unlikely) case someone knows what's going on. Back to debugging.

Update: I have some leads, looks like llama_kv_cache_dsa_context * is being static_cast to llama_kv_cache_context * in some places and this wreaks havoc. My fault for being lazy and not implementing llm_graph_input_attn_kv_dsa.
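
The kind of hazard described in this update can be illustrated with a small, self-contained sketch. The types below are hypothetical stand-ins, not the real llama.cpp classes, and the checked lookup is only one possible defensive pattern; it is not what the PR did.

```cpp
// Hypothetical stand-ins for the memory context types; names and members are illustrative.
struct memory_context_sketch {
    virtual ~memory_context_sketch() = default;
};

struct kv_context_sketch : memory_context_sketch {
    int n_kv = 0;
};

// Aggregates two plain contexts but is NOT itself a kv_context_sketch.
struct dsa_context_sketch : memory_context_sketch {
    kv_context_sketch mla; // main attention cache context
    kv_context_sketch lid; // lightning indexer cache context
};

// Checked lookup: returns the main KV context for either concrete type, or nullptr.
// An unchecked static_cast<const kv_context_sketch *>(mctx) would compile for both
// types, but it is undefined behavior when mctx really points at a dsa_context_sketch.
const kv_context_sketch * main_kv_context(const memory_context_sketch * mctx) {
    if (const auto * kv = dynamic_cast<const kv_context_sketch *>(mctx)) {
        return kv;
    }
    if (const auto * dsa = dynamic_cast<const dsa_context_sketch *>(mctx)) {
        return &dsa->mla;
    }
    return nullptr;
}
```

Because both derived types share the same base, the compiler accepts the unchecked downcast silently, which fits the "works for hours, then dies" symptom of reading through a pointer to the wrong object layout.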

Update 2: I think it's fixed now, no more crashes observed so far. Also switched from ggml_hadamard() to rotation matrices multiplication, all looks good.

Successfully generated 20 of 20 quiz solutions.
|   Nr | model_name                    |   lineage |   lineage-128 |
|-----:|:------------------------------|----------:|--------------:|
|    1 | deepseek-ai/DeepSeek-V3.2-Exp |     0.950 |         0.950 |


sszymczy added 26 commits March 12, 2026 13:15

model : Initial support for DeepseekV32ForCausalLM (for now with dense attention). Needs manual change of add_bos_token to true in tokenizer_config.json before conversion.

a337ebd

model : added indexer q and k calculation in DeepseekV32ForCausalLM.

e467684

ggml : add Hadamard transform GGML OP and implementation

723f0ce

kv-cache : add cache for indexer keys (temporary solution)

72b7214

convert : DSA indexer weights are bf16 in the original fp8 model, so I think it's best not to quantize them.

961bc95

model : crude proof-of-concept implementation of the DSA indexer for DeepSeek V3.2.

9a63e7a

ggml : add CUDA Hadamard transformation implementation (borrowed from ik_llama.cpp)

3eb340e

ggml : add new GGML_OP_WHERE_ID (akin to torch where but using indices)

08dc7fd

model : used new GGML_OP_WHERE_ID op in DeepSeek V3.2 lightning indexer implementation

998f496

model : handle multiple streams in DeepSeek V3.2 lightning indexer

6c9d773

ggml : handle multiple streams in CUDA GGML_OP_WHERE_ID implementation

cb94b56

kv-cache : fix crashes for models without indexer

02c2159

model : replaced ggml_argsort_top_k with ggml_top_k in DeepSeek V3.2 indexer implementation since the former fails for large tensors even when using CCCL.

e7aa89a

model : added comments in DeepSeek V3.2 lightning indexer implementation.

1874ac9

ggml : replaced GGML_OP_WHERE_ID with GGML_OP_SCATTER that works similar to torch scatter_ operation.

9b0a4ee

ggml : added inplace version of GGML_OP_SCATTER and tests for this OP

0ee5d80

gguf-py : removed obsolete KV_B tensor from DEEPSEEK32 arch

7f5578f

convert : make pyright happy

54945c7

ggml : added f16 version of GGML_OP_SCATTER

5677f08

ggml : added f16 version of GGML_OP_FILL

1c830a1

model : GGML_OP_SCATTER AND GGML_OP_FILL now work with f16 data, so we can get rid of ggml_cast() calls in sparse attention implementation

83a0313

ggml : fix bug in CUDA Hadamard transform implementation

6011bdd

ggml : simplified testing for nh being power of 2 in Hadamard transform implementations

4aec6a8

ggml : added test for GGML_OP_HADAMARD

a74d83a

convert : check if add_bos_token is true when converting DeepseekV32ForCausalLM-based models.

5b9ce6c

fairydreaming requested review from a team, CISC and ggerganov as code owners March 29, 2026 12:56

fairydreaming marked this pull request as draft March 29, 2026 12:56

This was referenced Mar 29, 2026

Feature Request: DSA lightning indexer support #20363

Open

Feature Request: DeepSeek V3.2-Exp support #16331

Closed

CISC mentioned this pull request Mar 29, 2026

Deepseek v3.2 dense attention support from @fairydreaming #18849

Closed

CISC reviewed Mar 29, 2026


src/llama-model.cpp

fairydreaming mentioned this pull request Mar 29, 2026

Misc. bug: CUDA ggml_top_k() implementation crashes for large tensor shapes #21162

Closed

ngxson reviewed Mar 29, 2026


ddh0 mentioned this pull request Mar 29, 2026

contrib : clarify code origin guidelines #21165

Open

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

57a8def

ggerganov reviewed Mar 31, 2026


sszymczy added 6 commits April 1, 2026 17:13

graph : replaced llama_ik_cache with llama_kv_cache instance created based on tweaked hparams.

6959bcf

graph : implemented llm_graph_input_attn_k_dsa

f443d0c

graph : renamed DSA-related suffixes, since in DSA-related classes _base/no suffix was used for MLA part and _dsa/_ik were used for lightning indexer part, to make names more obvious I renamed _base/no suffix to _mla and _dsa/_ik to _lid.

d3236d8

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

346c2b4

llama : handle LLM_ARCH_DEEPSEEK32 in test-llama-archs

5086217

model : replace ggml_hadamard() in DEEPSEEK32 with Hadamard rotation matrix multiplication. ggml : remove unused GGML_OP_HADAMARD

a7820f6

LilySu mentioned this pull request Apr 5, 2026

ggml : add GGML_OP_GATHER for DeepSeek Sparse Attention (DSA) #21149 #21458

Open
