[Codegen] Emulate gather_to_lds when it has narrow element types #23758

Open
lialan wants to merge 4 commits into main from
users/lialan/subbyte_gather_to_lds

@lialan lialan commented Mar 12, 2026

First step to support DMA for scaled GEMMs.

  • Add ConvertGatherToLDS pattern to AMDGPUEmulateNarrowType pass.
  • For gather_to_lds ops, the pattern adjusts sub-byte element types to i8, e.g. vector<32xf4E2M1FN> -> vector<16xi8>.
  • The op's semantics are the same before and after the rewrite.
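
The type adjustment above is plain bit arithmetic: 32 four-bit elements occupy 128 bits, which is 16 bytes. A minimal sketch of that calculation, assuming a hypothetical helper `packed_length` (it is not part of this PR) and assuming the total bit count must divide evenly by the new element width:

```python
def packed_length(num_elems, orig_bits, new_bits=8):
    """Return the element count after repacking, or None if it does not fit.

    Hypothetical helper for illustration only. It mirrors the arithmetic
    behind vector<32xf4E2M1FN> -> vector<16xi8>, not the pass's actual code.
    """
    if orig_bits <= 0 or new_bits <= 0:
        return None
    total_bits = num_elems * orig_bits
    # Assumption: a transfer that does not fill whole new elements is rejected.
    if total_bits % new_bits != 0:
        return None
    return total_bits // new_bits


if __name__ == "__main__":
    print(packed_length(32, 4))  # 32 x 4-bit elements repacked as i8
```

Returning `None` instead of raising is a deliberate choice in this sketch: it lets a caller treat an unsupported shape as "do not rewrite" rather than as a hard error.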

@lialan lialan force-pushed the users/lialan/subbyte_gather_to_lds branch from 25dcfcf to 5c5a70d Compare March 12, 2026 19:31
@lialan lialan changed the title [Codegen] Emulate gather_to_lds when it has narrow sub-byte element types [Codegen] Emulate gather_to_lds when it has narrow element types Mar 12, 2026
@lialan lialan requested a review from Copilot March 12, 2026 20:04
Copilot AI left a comment

Pull request overview

Adds support in the AMDGPU narrow-type emulation pipeline to rewrite amdgpu.gather_to_lds when its source/destination memrefs are converted from sub-byte element types to byte-sized (i8) types, enabling upcoming DMA support for scaled GEMMs.

Changes:

  • Add a ConvertGatherToLDS conversion pattern to rewrite amdgpu.gather_to_lds for sub-byte element types.
  • Linearize multidimensional indices into a 1D packed-byte index and adjust the transfer vector type accordingly.
  • Extend MLIR FileCheck coverage for gather_to_lds conversions (including async forms and various sub-byte element types).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| compiler/src/iree/compiler/Codegen/LLVMGPU/test/amdgpu_emulate_narrow_type.mlir | Adds FileCheck tests for gather_to_lds sub-byte element type conversion to i8. |
| compiler/src/iree/compiler/Codegen/LLVMGPU/AMDGPUEmulateNarrowType.cpp | Introduces ConvertGatherToLDS pattern to linearize/pack indices and update transfer types during narrow type emulation. |

lialan commented Mar 12, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copilot AI commented Mar 12, 2026

@lialan I've opened a new pull request, #23763, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 12, 2026 16:33
…emulation (#23763)

`ConvertGatherToLDS` had several correctness issues in
`linearizeAndPack` that could silently produce wrong IR or crash in
non-assert builds.

**Fixes:**
- **Offset ignored**: Memref layout offset was never incorporated into
the linearized index. The pattern now fails to match when the offset is
dynamic, and otherwise computes `linearIdx = offset + sum(idx[i] * stride[i])`.
- **Silent rank mismatch**: `llvm::zip(indices, strides)` silently
truncated when sizes differed. Added explicit `indices.size() !=
strides.size()` guard.
- **Assert in rewrite path**: `assert(newBits > origBits && newBits %
origBits == 0)` would crash in debug and silently miscompile in release.
Replaced with `return nullptr` (propagated as `notifyMatchFailure` by
callers).
- **Misleading error message**: `"not a multiple of byte width"`
described the wrong invariant; corrected to `"not a multiple of the new
element bit width"` to match the actual check (`totalBits % newSrcBits
!= 0`).
- **Unsafe early return**: Removed the `origBits == newBits && 1D`
fast-path that bypassed offset handling entirely.
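
The offset and rank-mismatch fixes above can be sketched outside MLIR. This is a hedged illustration, not the code in `linearizeAndPack`: the function name `linearize_strided` and the `DYNAMIC` sentinel are hypothetical, and `None` stands in for the `nullptr` / `notifyMatchFailure` path:

```python
DYNAMIC = None  # hypothetical stand-in for a dynamic offset or stride


def linearize_strided(indices, strides, offset):
    """Compute offset + sum(idx[i] * stride[i]), or None if unsupported."""
    # Dynamic offsets or strides cannot be folded statically: bail out.
    if offset is DYNAMIC or any(s is DYNAMIC for s in strides):
        return None
    # zip() stops at the shorter input, so check the ranks explicitly.
    if len(indices) != len(strides):
        return None
    linear = offset  # start from the layout offset instead of zero
    for idx, stride in zip(indices, strides):
        linear += idx * stride
    return linear


if __name__ == "__main__":
    print(linearize_strided([1, 8], [32, 1], 64))
```

Without the explicit length check, `zip([1, 8], [32])` would yield a single pair and the second index would be dropped without any error, which is the same failure mode the commit describes for `llvm::zip`.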

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lialan <[email protected]>
@lialan lialan marked this pull request as ready for review March 13, 2026 00:49
* Remove unnecessary cast<MemRefType> on op accessors (already typed)
* Pass async attribute directly to GatherToLDSOp builder
* Add comment explaining dynamic offset/stride rejection
* Add assert for transfer size divisibility in convertTransferType

Co-Authored-By: Claude Opus 4.6 <[email protected]>
lialan commented Mar 13, 2026

@krzysz00 no offence, I was testing Claude automation, and it replied to your review comments all by itself.
