@solinzby1
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings November 13, 2025 06:33
Copilot finished reviewing on behalf of solinzby1 November 13, 2025 06:36
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces CKTILE-based weight preshuffle functionality for FP8 (a8w8) GEMM operations, including comprehensive test coverage and auto-tuning infrastructure. The implementation provides an alternative kernel backend for quantized matrix multiplication operations on AMD ROCm GPUs.

Key changes:

  • New CKTILE-based FP8 GEMM kernel implementation with weight pre-shuffling support
  • Auto-tuning infrastructure with 139+ kernel configurations for gfx942 and 133+ for gfx950 architectures
  • Python API integration and JIT compilation support for dynamic kernel selection
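As a rough illustration of what weight pre-shuffling means for an a8w8 GEMM, the sketch below reorders an N x K weight matrix into contiguous tiles ahead of time and then computes the same scaled product by walking that buffer linearly. This is not the CK Tile layout from this PR: the tile sizes, the function names, the per-row and per-column scale convention, and the use of small integers as a stand-in for FP8 values are all assumptions made for the example.

```python
# Illustrative only: tile shape, names, and scale convention are assumptions,
# not the layout or API used by the CKTILE kernels in this PR.

def gemm_a8w8_ref(a_q, w_q, a_scale, w_scale):
    """Reference: out[m][n] = a_scale[m] * w_scale[n] * sum_k a_q[m][k] * w_q[n][k]."""
    return [
        [a_scale[m] * w_scale[n] * sum(x * y for x, y in zip(a_q[m], w_q[n]))
         for n in range(len(w_q))]
        for m in range(len(a_q))
    ]

def preshuffle(w_q, tile_n, tile_k):
    """Reorder an N x K weight into a flat list of contiguous tile_n x tile_k tiles."""
    n, k = len(w_q), len(w_q[0])
    assert n % tile_n == 0 and k % tile_k == 0
    flat = []
    for n0 in range(0, n, tile_n):
        for k0 in range(0, k, tile_k):
            for dn in range(tile_n):
                flat.extend(w_q[n0 + dn][k0:k0 + tile_k])
    return flat

def gemm_a8w8_preshuffled(a_q, w_flat, n, k, tile_n, tile_k, a_scale, w_scale):
    """Same GEMM, but weights are read sequentially from the preshuffled buffer."""
    m_dim = len(a_q)
    acc = [[0] * n for _ in range(m_dim)]
    pos = 0  # linear read position in the preshuffled buffer
    for n0 in range(0, n, tile_n):
        for k0 in range(0, k, tile_k):
            for dn in range(tile_n):
                row = w_flat[pos:pos + tile_k]
                pos += tile_k
                for m in range(m_dim):
                    acc[m][n0 + dn] += sum(a_q[m][k0 + j] * row[j] for j in range(tile_k))
    return [[a_scale[m] * w_scale[c] * acc[m][c] for c in range(n)] for m in range(m_dim)]

# Tiny example with made-up values.
a_q = [[1, -2, 3, 4], [0, 5, -6, 7]]
w_q = [[1, 0, 2, -1], [3, 1, 0, 2], [-2, 4, 1, 0], [0, -3, 2, 5]]
a_scale = [0.5, 0.25]
w_scale = [1.0, 2.0, 0.5, 4.0]

w_flat = preshuffle(w_q, tile_n=2, tile_k=2)
out_ref = gemm_a8w8_ref(a_q, w_q, a_scale, w_scale)
out_pre = gemm_a8w8_preshuffled(a_q, w_flat, 4, 4, 2, 2, a_scale, w_scale)
```

The point of the sketch is only that the shuffle is paid once, when the weights are prepared, and that the compute loop then reads the weight buffer in order; a real kernel would pick the tile layout to match its memory access pattern.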

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 21 comments.

| File | Description |
| --- | --- |
| op_tests/test_gemm_a8w8.py | Adds test function for new CKTILE bpreshuffle kernel |
| csrc/rocm_ops.cpp | Includes new CKTILE header and fixes include ordering |
| csrc/pybind/*.cu | PyBind11 bindings for main and tuning interfaces |
| csrc/include/rocm_ops.hpp | Macro definitions for Python binding registration |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.cu | Core kernel implementation and tuning dispatch logic |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.cuh | Common utilities and template configurations |
| csrc/cktile_gemm_a8w8_bpreshuffle/*.py | Code generation and kernel tuning scripts |
| aiter/ops/gemm_op_a8w8.py | High-level Python API and dispatch functions |
| aiter/jit/*.py & *.json | JIT compilation configuration |
| aiter/configs/*.csv | Tuned and untuned kernel configuration databases |
| csrc/cktile_gemm_a8w8_bpreshuffle/README.md | Usage documentation for tuning workflow |
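
To make the role of the tuned configuration databases concrete, here is a minimal sketch of shape-based kernel selection: a CSV maps a problem shape to a kernel name, and untuned shapes fall back to a default. The column names, shapes, and kernel names below are hypothetical and are not taken from the CSV files or dispatch code in this PR.

```python
import csv
import io

# Hypothetical tuned-config table: columns, shapes, and kernel names are
# invented for illustration and do not reflect aiter/configs/*.csv.
TUNED_CSV = """M,N,K,kernel_name
16,7168,2048,hypothetical_kernel_small_m
1024,7168,2048,hypothetical_kernel_large_m
"""

DEFAULT_KERNEL = "hypothetical_kernel_default"

def load_tuned(text):
    """Parse the CSV into a dict keyed by the (M, N, K) problem shape."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["M"]), int(row["N"]), int(row["K"]))
        table[key] = row["kernel_name"]
    return table

def select_kernel(table, m, n, k):
    """Return the tuned kernel for an exact shape match, else the default."""
    return table.get((m, n, k), DEFAULT_KERNEL)

table = load_tuned(TUNED_CSV)
tuned_choice = select_kernel(table, 16, 7168, 2048)
fallback_choice = select_kernel(table, 17, 7168, 2048)
```

An exact-match lookup is the simplest possible policy; whether the actual dispatch rounds or pads shapes before the lookup is not stated in this conversation.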

@valarLip
Collaborator

Let's have some data comparison for DeepSeek's shapes.

@valarLip valarLip self-assigned this Nov 19, 2025
@valarLip valarLip merged commit b08834e into main Nov 21, 2025
32 of 36 checks passed
@valarLip valarLip deleted the flatmm_merge branch November 21, 2025 10:07
6 participants