cktile weight preshuffle test and auto tuning for a8w8 #1400
Pull Request Overview
This PR introduces CKTILE-based weight preshuffle functionality for FP8 (a8w8) GEMM operations, including comprehensive test coverage and auto-tuning infrastructure. The implementation provides an alternative kernel backend for quantized matrix multiplication operations on AMD ROCm GPUs.
Key changes:
- New CKTILE-based FP8 GEMM kernel implementation with weight pre-shuffling support
- Auto-tuning infrastructure with 139+ kernel configurations for gfx942 and 133+ for gfx950 architectures
- Python API integration and JIT compilation support for dynamic kernel selection
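The tuning workflow summarized above can be pictured as "benchmark every candidate kernel per GEMM shape, keep the fastest, and look it up at dispatch time". The following is a minimal sketch of that idea only, not the PR's actual tuning script: the function names (`pick_best_kernels`, `to_csv`, `lookup`), the column names, and the fallback kernel id are all hypothetical, and the timings would come from real benchmark runs rather than hard-coded values.

```python
import csv
import io


def pick_best_kernels(timings):
    """Keep the fastest kernel per (M, N, K) shape.

    `timings` is an iterable of (M, N, K, kernel_id, microseconds) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    best = {}
    for m, n, k, kernel_id, us in timings:
        shape = (m, n, k)
        # Replace the stored entry only when this candidate is strictly faster.
        if shape not in best or us < best[shape][1]:
            best[shape] = (kernel_id, us)
    return best


def to_csv(best):
    """Serialize the tuned table to CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["M", "N", "K", "kernelId", "us"])
    for (m, n, k), (kernel_id, us) in sorted(best.items()):
        writer.writerow([m, n, k, kernel_id, us])
    return buf.getvalue()


def lookup(best, m, n, k, default_kernel_id=0):
    """Return the tuned kernel id for a shape, or a fallback for untuned shapes."""
    entry = best.get((m, n, k))
    return entry[0] if entry is not None else default_kernel_id
```

A dispatcher built this way falls back to a default kernel for any shape missing from the tuned table, which is why a workflow like this typically keeps separate lists of untuned shapes (inputs to the tuner) and tuned results (outputs).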
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| op_tests/test_gemm_a8w8.py | Adds test function for new CKTILE bpreshuffle kernel |
| csrc/rocm_ops.cpp | Includes new CKTILE header and fixes include ordering |
| csrc/pybind/*.cu | PyBind11 bindings for main and tuning interfaces |
| csrc/include/rocm_ops.hpp | Macro definitions for Python binding registration |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.cu | Core kernel implementation and tuning dispatch logic |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.cuh | Common utilities and template configurations |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.py | Code generation and kernel tuning scripts |
| aiter/ops/gemm_op_a8w8.py | High-level Python API and dispatch functions |
| aiter/jit/*.py & *.json | JIT compilation configuration |
| aiter/configs/*.csv | Tuned and untuned kernel configuration databases |
| csrc/cktile_gemm_a8w8_bpreshuffle/README.md | Usage documentation for tuning workflow |
Let's have some data comparison for DeepSeek's shapes.
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist