vulkan: fix noncontig check for mat_mul_id splitting #14683
0cc4m merged 2 commits into ggml-org:master
Conversation
Remove supports_op check for > 4096 (splitting fixes this)
```cpp
return
    tensor->nb[0] == ggml_type_size(tensor->type) &&
    tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/ggml_blck_size(tensor->type) &&
    tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
```
@0cc4m do you recall why there is a check for dim3 here at all? Based on the function name it seems like it should only care about dims 0 and 1.
Yeah, it should. I'm not 100% sure, but it was maybe related to multiple mul_mat calls or broadcasting. When this was written the mul_mat shader handled only the first two dimensions and was called multiple times to do the other dimensions.
If I remove the last part of the check, there are some failures in mul_mat tests. Maybe worth looking into, but I think this change is OK for now.
Probably because it falls back to dequant to fp16 + matmul in a few cases due to the third check.
I found that this was hitting the dequant path in mul_mat and was only dequantizing the first batch. The most recent commit fixes this. I can still see some failures in IQ quants if I force this path, but those happen even when the batch dimension is 1.
* vulkan: fix noncontig check for mat_mul_id splitting
  Remove supports_op check for > 4096 (splitting fixes this)
* vulkan: fix batched matmul dequant for Q*_K
Reported at ikawrakow/ik_llama.cpp#608 (comment); this PR uses a different fix.
I'm still seeing flash attention fail with this model, but I'll look into that separately.