
UPSTREAM PR #17943: feat(cuda): Add highly optimized CUDA kernel for HardSwish activation#525

Open
loci-dev wants to merge 1 commit into main from upstream-PR17943-branch_Chandan-Sugreevu-feature/cuda-hardswish

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17943

What this PR does: Implements a highly optimized, custom CUDA kernel for the GGML_UNARY_OP_HARDSWISH activation function.

Why it is valuable: This enables faster inference for any model using the HardSwish activation when running on NVIDIA GPUs via the GGML CUDA backend.

Testing: Functionality verified on an NVIDIA GeForce MX250. The test reports OK (rather than Mismatch) when the memory-support logic is bypassed.

@loci-review

loci-review bot commented Dec 11, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #525 - CUDA HardSwish Kernel Optimization

Condition Assessment: Condition 2 applies - Minor changes with limited scope affecting a single non-core activation function.

Summary

This PR replaces the template-based HardSwish activation implementation with a specialized CUDA kernel in ggml/src/ggml-cuda/unary.cu. The changes affect only the HardSwish activation path within the CUDA backend's unary operations. No modifications were made to core inference functions (llama_decode, llama_encode, llama_tokenize) or tokenization paths. The implementation removes F16 type support, handling only F32 tensors. An unrelated modification in ggml-cuda.cu removes the explicit return statement for GGML_OP_LOG, causing it to fall through to GGML_OP_SSM_SCAN's conditional logic. Power consumption analysis shows zero measurable change across all 16 binaries, with differences below 1 nanojoule. The CMakeLists.txt change removes a deprecated build option check without performance impact.

Inference Impact: No effect on tokens per second. HardSwish is an activation function used in specific model architectures (MobileNetV3, EfficientNet) but not in the tokenization or primary inference pipeline. The llama_decode, llama_encode, and llama_tokenize functions remain unmodified.

loci-dev force-pushed the main branch 27 times, most recently from d1b57fc to d582acc on December 15, 2025 at 11:08.
loci-dev force-pushed the main branch 30 times, most recently from ec69147 to 883e4ba on December 19, 2025 at 15:09.