**Performance Analysis Summary: PR #525 - CUDA HardSwish Kernel Optimization**

**Condition Assessment:** Condition 2 applies: minor changes with limited scope, affecting a single non-core activation function.

**Summary:** This PR replaces the template-based HardSwish activation implementation with a specialized CUDA kernel.

**Inference Impact:** No effect on tokens per second. HardSwish is an activation function used in specific model architectures (MobileNetV3, EfficientNet) but not in the tokenization or primary inference pipeline. The `llama_decode`, `llama_encode`, and `llama_tokenize` functions remain unmodified.
Mirrored from ggml-org/llama.cpp#17943
What this PR does: Implements a highly optimized, custom CUDA kernel for the GGML_UNARY_OP_HARDSWISH activation function.
Why it is valuable: This enables faster inference for any model using the HardSwish activation when running on NVIDIA GPUs via the GGML CUDA backend.
Testing: Verified functionality on an NVIDIA GeForce MX250. The test reports OK rather than Mismatch when the memory-support logic is bypassed.