[Codegen] Emulate gather_to_lds when it has narrow element types #23758
Open
[Codegen] Emulate gather_to_lds when it has narrow element types #23758
gather_to_lds when it has narrow element types #23758Conversation
…ulation First step to support DMA for scaled GEMMs. * Add ConvertGatherToLDS pattern to AMDGPUEmulateNarrowType pass. * Adjust subbyte element type to i8. e.g. vector<32xf4E2M1FN> -> vector<16xi8>
25dcfcf to
5c5a70d
Compare
gather_to_lds when it has narrow sub-byte element types gather_to_lds when it has narrow element types
There was a problem hiding this comment.
Pull request overview
Adds support in the AMDGPU narrow-type emulation pipeline to rewrite amdgpu.gather_to_lds when its source/destination memrefs are converted from sub-byte element types to byte-sized (i8) types, enabling upcoming DMA support for scaled GEMMs.
Changes:
- Add a
ConvertGatherToLDSconversion pattern to rewriteamdgpu.gather_to_ldsfor sub-byte element types. - Linearize multidimensional indices into a 1D packed-byte index and adjust the transfer vector type accordingly.
- Extend MLIR FileCheck coverage for
gather_to_ldsconversions (including async forms and various sub-byte element types).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| compiler/src/iree/compiler/Codegen/LLVMGPU/test/amdgpu_emulate_narrow_type.mlir | Adds FileCheck tests for gather_to_lds sub-byte element type conversion to i8. |
| compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp | Introduces ConvertGatherToLDS pattern to linearize/pack indices and update transfer types during narrow type emulation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp
Outdated
Show resolved
Hide resolved
compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp
Outdated
Show resolved
Hide resolved
Contributor
Author
|
@copilot open a new pull request to apply changes based on the comments in this thread |
Contributor
…emulation (#23763) `ConvertGatherToLDS` had several correctness issues in `linearizeAndPack` that could silently produce wrong IR or crash in non-assert builds. **Fixes:** - **Offset ignored**: Memref layout offset was never incorporated into the linearized index. Now checks for dynamic offset (fails the pattern) and initializes `linearIdx = offset + sum(idx[i] * stride[i])`. - **Silent rank mismatch**: `llvm::zip(indices, strides)` silently truncated when sizes differed. Added explicit `indices.size() != strides.size()` guard. - **Assert in rewrite path**: `assert(newBits > origBits && newBits % origBits == 0)` would crash in debug and silently miscompile in release. Replaced with `return nullptr` (propagated as `notifyMatchFailure` by callers). - **Misleading error message**: `"not a multiple of byte width"` described the wrong invariant; corrected to `"not a multiple of the new element bit width"` to match the actual check (`totalBits % newSrcBits != 0`). - **Unsafe early return**: Removed the `origBits == newBits && 1D` fast-path that bypassed offset handling entirely. <!-- START COPILOT CODING AGENT TIPS --> --- ✨ Let Copilot coding agent [set things up for you](https://github.com/iree-org/iree/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot) — coding agent works faster and does higher quality work when set up for your repo. --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lialan <[email protected]>
krzysz00
reviewed
Mar 13, 2026
compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp
Outdated
Show resolved
Hide resolved
compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp
Outdated
Show resolved
Hide resolved
* Remove unnecessary cast<MemRefType> on op accessors (already typed) * Pass async attribute directly to GatherToLDSOp builder * Add comment explaining dynamic offset/stride rejection * Add assert for transfer size divisibility in convertTransferType Co-Authored-By: Claude Opus 4.6 <[email protected]>
Contributor
Author
|
@krzysz00 no offence, I was testing claude automation, and it replied your review comments all by itself. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
First step to support DMA for scaled GEMMs.
ConvertGatherToLDSpattern toAMDGPUEmulateNarrowType pass.gather_to_ldsop, adjust subbyte element type to i8. e.g.vector<32xf4E2M1FN>->vector<16xi8>.