@ikawrakow
Owner

For motivation, see the CUDA performance graphs in #417 and #418.

Implementation for AVX2, Zen4, ARM_NEON, CUDA, Metal.

The AVX2 implementation suffers from int16_t overflow, as do IQ4_K, IQ5_K, IQ6_K, and IQ4_KS, so I will have to fix all of these in a follow-up PR.

I also want to add the interleaved variant IQ5_KS_R4 before giving more performance and accuracy details.

@ubergarm
Contributor

ubergarm commented May 18, 2025

I just tested a mixed IQ5_KS / IQ4_KS quant of the dense Qwen3-14B model. Perplexity and speed comparisons for full CUDA offload are in this new quant cookers guide (scroll to the bottom; anchors can't be linked in GitHub discussions).

Thanks for adding this; the quality looks really good for the size!
