Skip to content

Conversation

@yingluAMD
Copy link
Contributor

@yingluAMD yingluAMD commented Nov 3, 2025

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@yingluAMD yingluAMD marked this pull request as draft November 4, 2025 06:45
* @brief blockwise gemm xdlops with bf16x3 simulate tf32
* in/out/acc are all float;
* one input is separated to 2 bf16 registers.
* 3 xdlops gemm output regs are same, as accumulation of 3 xdlops gemm results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please document the algorithm or add a link to the description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This comment is changed to a simple algorithm description.

@yingluAMD yingluAMD marked this pull request as ready for review November 5, 2025 12:02
@aosewski
Copy link
Collaborator

aosewski commented Nov 7, 2025

For me the fundamental thing is: can we stop developing new features in CK and switch to CK Tile ??

@ThomasNing
Copy link
Contributor

@yingluAMD Thank you for the contribution!

  1. Could I know if there is any application demanded from the model level for this feature?
  2. The realization looks good to me, but please have a clang format and also delete the commented code.
  3. I agree with Adam. Please apply it on CK Tile as the primary goal.

@yingluAMD
Copy link
Contributor Author

For me the fundamental thing is: can we stop developing new features in CK and switch to CK Tile ??

Hi @aosewski , I start this work on CK first because previous work is also in CK. It will be implemented in CK Tile later. In the further development, I will consider develop new feature on CK Tile.
The whole story is: Per customer's requirement, I developed TF32 version convolution to keep pace with competitors on gfx942. PR includes CK PRs/ MIOpen PR. CK instances are widely used in MIOpen on gfx942, so all the changes are in CK and this new feature development demonstrate in CK first. I also received the requirement of deploying BF16x3 algo in CK Tile which is next step plan.

@yingluAMD
Copy link
Contributor Author

@yingluAMD Thank you for the contribution!

  1. Could I know if there is any application demanded from the model level for this feature?
  2. The realization looks good to me, but please have a clang format and also delete the commented code.
  3. I agree with Adam. Please apply it on CK Tile as the primary goal.

Hi @ThomasNing ,

  1. I will send you a mail about requirements.
  2. Sure. All the code are self-reviewed now.
  3. Same as above. Will consider develop new feature on CK Tile in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants