Skip to content

feat(cuda): Add highly optimized CUDA kernel for HardSwish activation#17943

Closed
Chandan-Sugreevu wants to merge 1 commit intoggml-org:masterfrom
Chandan-Sugreevu:feature/cuda-hardswish
Closed

feat(cuda): Add highly optimized CUDA kernel for HardSwish activation#17943
Chandan-Sugreevu wants to merge 1 commit intoggml-org:masterfrom
Chandan-Sugreevu:feature/cuda-hardswish

Conversation

@Chandan-Sugreevu
Copy link

What this PR does: Implements a highly optimized, custom CUDA kernel for the GGML_UNARY_OP_HARDSWISH activation function.

Why it is valuable: This enables faster inference for any model using the HardSwish activation when running on NVIDIA GPUs via the GGML CUDA backend.

Testing: Verified functionality on NVIDIA GeForce MX250. Test passes (OK/Mismatch) when memory-support logic is bypassed.

@github-actions github-actions bot added build Compilation issues Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Dec 11, 2025
Copy link
Contributor

@aviallon aviallon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few Code Quality comments, but I did not test the performance impact of this PR.

case GGML_OP_COS:
case GGML_OP_CLAMP:
case GGML_OP_LOG:
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to remove this return true; here?
Because it makes the switch fall-through the GGML_OP_SSM_SCAN implementation check, which is most probably not what you intended.

}


// --- Custom HardSwish Implementation by Chandan ---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the little comment here?

cudaStream_t stream = ctx.stream();
k_hardswish<<<grid_size, block_size, 0, stream>>>(src_d, dst_d, num_elements);
}
// --------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary decoration.

endif()
endfunction()

llama_option_depr(FATAL_ERROR LLAMA_CUBLAS GGML_CUDA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this? This is unrelated with your PR.
If you have a fatal error, just use the correct option.

case GGML_UNARY_OP_HARDSIGMOID:
ggml_cuda_op_hardsigmoid(ctx, dst);
break;
case GGML_UNARY_OP_HARDSWISH:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't move this block, this is unnecessary noise.




void ggml_cuda_op_hardswish(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at how things are done in this repo: we don't define prototypes here, but in an op-specific file.

@aviallon
Copy link
Contributor

@CISC did you close because it was a low effort PR?

@CISC
Copy link
Member

CISC commented Dec 12, 2025

@CISC did you close because it was a low effort PR?

I closed it because it was utter nonsense.

Sad to say you just wasted 5 minutes reviewing pure unadulterated slop. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Compilation issues ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants