**Performance Analysis Summary: PR #525 - CUDA HardSwish Kernel Optimization**

**Condition Assessment:** Condition 2 applies: minor changes with limited scope, affecting a single non-core activation function.

**Summary:** This PR replaces the template-based HardSwish activation implementation with a specialized CUDA kernel.

**Inference Impact:** No effect on tokens per second. HardSwish is an activation function used in specific model architectures (MobileNetV3, EfficientNet) but not in the tokenization or primary inference pipeline. The `llama_decode`, `llama_encode`, and `llama_tokenize` functions remain unmodified.
Mirrored from ggml-org/llama.cpp#17943
What this PR does: Implements a highly optimized, custom CUDA kernel for the GGML_UNARY_OP_HARDSWISH activation function.
Why it is valuable: This enables faster inference for any model using the HardSwish activation when running on NVIDIA GPUs via the GGML CUDA backend.
Testing: Verified functionality on an NVIDIA GeForce MX250. The test reports OK rather than Mismatch when the memory-support logic is bypassed.