- Only DGEMM at this moment.
- Prefetch whole lines.
- Scatter prefetching insts.
Instead of clearing C rows, use FMUL for the first k iteration so that instructions are saved.
Instead of loading from stack, directly pass regs in. Arm64 has 30 regs for use. This may or may not speed up a tiny bit.
Forgot to commit header for ad73717.
- Init k-loop clears C.
- Scattered C preloading.
Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided
Happy holidays! 🎄 🎁 🍾
Hi there, I know this is a bit old, but I came across this change from this paper. I was just wondering what the status is for having this (and other changes) merged upstream, and whether there is a plan to do so?
Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.
fgvanzee added a commit that referenced this pull request Oct 6, 2023
Details:
- Integrated changes from PR #698 to enable testing in the context of the 'stable' branch. These changes add row-preferential sgemm and dgemm microkernels for the armv8a kernel set.
- Updated the 'altra' subconfig to easily switch between the previous (column-preferential) ukernel and the aforementioned row-pref ukernel.
Status
This is an 8x6 row-major kernel for ARMv8-A, so its internal structure is basically the same as that of the current 6x8 column-preferring one.
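For readers less familiar with the layout, here is a minimal scalar C sketch of what an 8x6 row-major microkernel computes. It is not the PR's assembly: the function name `dgemm_ukr_8x6_ref`, the packed-panel layout (A stored as k steps of 8 values, B as k steps of 6 values) and the `rs_c` row-stride parameter are assumptions made for illustration. The first k iteration writes plain products into the accumulators rather than zeroing them first, which is the scalar analogue of using fmul in place of fmla on that iteration.

```c
#include <stddef.h>

enum { MR = 8, NR = 6 };

/* Hypothetical scalar reference, not the PR's kernel:
 * C := alpha * A * B + beta * C, with C an MR x NR tile stored row-major
 * (row stride rs_c, unit column stride).
 * Assumed packing: a[p*MR + i] is A(i,p), b[p*NR + j] is B(p,j). */
void dgemm_ukr_8x6_ref(size_t k, double alpha,
                       const double *a, const double *b,
                       double beta, double *c, size_t rs_c)
{
    double ab[MR][NR];

    if (k == 0) {
        /* Nothing to multiply: the product tile is zero. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] = 0.0;
    } else {
        /* First k iteration: a plain multiply initializes the accumulators,
         * so no separate zeroing pass is needed. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] = a[i] * b[j];
    }

    /* Remaining k iterations: multiply-accumulate (rank-1 updates). */
    for (size_t p = 1; p < k; ++p) {
        const double *ap = a + p * MR;
        const double *bp = b + p * NR;
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += ap[i] * bp[j];
    }

    /* Write back row by row; each row of C is contiguous. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * rs_c + j] = alpha * ab[i][j] + beta * c[i * rs_c + j];
}
```

A column-preferring 6x8 kernel would have the same loop structure with the roles of the two tile dimensions swapped, which is why the two kernels can share most of their internal organization.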
Updates
k-loop using fmul instead of fmla. The code path within the assembly is handled so as to (basically) not introduce additional branching cost.
Restrictions
This kernel assumes hardware prefetching for packed A/B blocks (so as not to bother the pipeline with additional instructions or the DMA with additional loads).
Older chips like ThunderX2 may not perform well with it since they may have no hardware prefetching at all, while newer ones like Amazon's C6g tend to be happier with it.
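As a rough illustration of this trade-off, the C sketch below streams through a packed panel in two ways: one loop leaves prefetching entirely to the hardware, the other issues explicit software prefetches through the GCC/Clang builtin `__builtin_prefetch`. The function names, the `PREFETCH_DIST` value and the assumption of 8 doubles per cache line are hypothetical and chosen only for illustration; the kernel in this PR is written in assembly, and which variant is faster depends on the chip.

```c
#include <stddef.h>

/* Illustrative look-ahead distance in elements; not tuned for any chip. */
#define PREFETCH_DIST 64

/* Relies on the hardware prefetcher: no extra instructions in the loop. */
double sum_panel_hw(const double *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

/* Issues one software prefetch per assumed cache line (8 doubles).
 * Each prefetch is an additional instruction in the loop body. */
double sum_panel_sw(const double *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i % 8 == 0 && i + PREFETCH_DIST < n)
            __builtin_prefetch(p + i + PREFETCH_DIST, 0, 3); /* read, keep in cache */
        s += p[i];
    }
    return s;
}
```

Both functions return the same result; they differ only in whether the loop carries extra prefetch instructions, which is the cost this kernel avoids for the packed A/B blocks.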
This update also contains changes that are something of a prerequisite for my gemmsup+packm work here, which I'd also like to merge later as a BLIS sandbox.