Grab-bag of improvements to the IR we send to LLVM #1243
Conversation
Also, @giuseros, if you've got the command for benchmarking that post-large-tensors slowdown handy, I can run it on this change.
Force-pushed bdb0356 to 0c292d6.
Update: preliminary perf checks are showing that setting …
Force-pushed 6e2cfa7 to 97f9201.
giuseros left a comment:
The code looks fine, but I am a bit worried about introducing further analysis in the back-end that might slow down compilation. In particular, I am worried about the invariant.load attribute (see rust-lang/rust#103070). Could you try tuning some configurations and verify that the compile time does not get too bad?
```c++
    return;
  unsigned argNo = funcArg.getArgNumber();
  if (auto load = dyn_cast<LLVM::LoadOp>(aliasOp))
    load.setInvariantLoad(isReadonly[argNo]);
```
Not sure how relevant this is, but I stumbled upon rust-lang/rust#103070, which talks about invariant.start. Do you mind making sure that the compilation time is not heavily affected by this?
```c++
    op.setFailureOrdering(LLVM::AtomicOrdering::monotonic);
  });

  // 4. TODO: add some invariant.start calls once MLIR's got them.
```
See my comment above :)
This is the command I was using to investigate compilation slowdowns:
It'd be much easier to review if you could separate the things you've ported from upstream from what you've implemented new in this PR, because the former would be a no-brainer to get in.
@manupak You can set the review view to be filtered by commits, if that helps.
Force-pushed 97f9201 to 17a9624.
Force-pushed 17a9624 to 2d70b0a.
giuseros left a comment:
LGTM!
Force-pushed 38eeb42 to ee74198.
The `allocsize` attribute is weird because it packs two 32-bit values into a 64-bit value. It also turns out that the passthrough attribute exporter was using `int`, which incorrectly handles 64-bit integers.

Reviewed By: rriddle

Differential Revision: https://reviews.llvm.org/D156574
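For readers unfamiliar with the packing the commit message refers to, here is a minimal sketch of the idea in plain C++. The exact field order and the "argument not present" sentinel are illustrative assumptions, not a copy of LLVM's implementation; the point is only that the payload is a genuine 64-bit value, so routing it through a 32-bit `int` silently drops half of it.

```c++
#include <cstdint>
#include <optional>
#include <utility>

// Sentinel for "the second allocsize argument was not provided".
// (Illustrative value; the real sentinel is an internal LLVM detail.)
constexpr uint32_t kNotPresent = ~0u;

// Pack the two 32-bit argument indices into one 64-bit payload.
uint64_t packAllocSize(uint32_t elemSizeArg,
                       std::optional<uint32_t> numElemsArg) {
  return (uint64_t(elemSizeArg) << 32) | numElemsArg.value_or(kNotPresent);
}

// Unpack them again. Truncating `packed` through a 32-bit `int` anywhere
// along the way (the bug being fixed) would lose the high half.
std::pair<uint32_t, std::optional<uint32_t>> unpackAllocSize(uint64_t packed) {
  uint32_t elemSizeArg = uint32_t(packed >> 32);
  uint32_t numElems = uint32_t(packed & 0xffffffffu);
  return {elemSizeArg, numElems == kNotPresent
                           ? std::nullopt
                           : std::optional<uint32_t>(numElems)};
}
```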
Since vector loads and stores from scalar memrefs translate to llvm.load/store, add the ability to tag said loads and stores as nontemporal. This mirrors functionality available in memref.load/store.
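As a rough source-level analogue (not part of this PR, and Clang-specific): Clang's non-temporal builtins emit ordinary loads and stores tagged with the same `!nontemporal` metadata that the new unit attribute maps to once `vector.load`/`vector.store` lower to `llvm.load`/`llvm.store`.

```c++
// Clang-only illustration: these builtins produce loads/stores carrying
// !nontemporal metadata, a hint that the data should bypass the cache.
void copyStream(float *dst, float *src, int n) {
  for (int i = 0; i < n; ++i) {
    float v = __builtin_nontemporal_load(&src[i]);
    __builtin_nontemporal_store(v, &dst[i]);
  }
}
```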
Expand the copying of attributes on GPU kernel arguments during LLVM lowering. Support copying attributes from values that are already LLVM pointers. Support copying attributes, like `noundef`, that aren't specific to (the pointer parts of) arguments.
Add support for !invariant.load metadata (by way of a unit attribute) to the MLIR representation of llvm.load.
A lot of our existing annotation scheme for memory accesses was scattered all over the place. This commit unifies things into -rock-analyze-memory-use. This pass covers:

1. Walking through global_load and global_store operations in order to find all the "needs 64 bit index" attributes and moving that declaration to the function level. That means we won't have issues with the fact that that runs in blockwise-gemm-to-threadwise.
2. Detecting readonly and writeonly arguments. This is done by looking for arguments that are only (ignoring views) arguments to global_load or global_store(set) respectively, and then putting those annotations onto the function arguments for lowering to LLVM.
3. Setting all sorts of relevant LLVM annotations (noalias, nonnull, noundef, dereferenceable, and so on) in order to promise the compiler things we know are true anyway (if they're not, that's on the client).

Then, this commit also expands our LLVM preparation pass. The pass is made to run on llvm.func ops, and we use the nesting mechanism to ensure that we run that on functions inside GPU modules specifically. The preparation pass is updated to

1. Update the alignment of vector loads and stores to be the alignment of the vector, not its component, since MLIR's lowering is pessimistic in a way we don't want to be.
2. Set !invariant.load on loads from readonly values to give out even stronger hints about the ability to hoist or sink operations.
3. Work around a shortcoming in how the AMDGPU backend handles `noalias` (specifically, it gets dropped on kernel arguments during target-specific lowering) by adding our own alias scopes, which indicate that loads from one pointer or buffer can't alias the loads from the other ones. This will hopefully reduce the number of scheduling DAG edges and therefore give us some of our compile performance back ... or at least it'll give us better codegen thanks to more accurate alias information.

This commit also adds the "nontemporal" attribute to rock.global_store so that -rock-sugar-to-loops can set it on the underlying stores.
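To make the aliasing point concrete, here is a plain C++ analogue (purely illustrative, not code from this PR): `__restrict__` makes the same "these buffers never overlap" promise that the `noalias` annotations and per-buffer alias scopes described above are meant to carry through to the AMDGPU backend.

```c++
// Without a no-alias guarantee, every store to `out` could in principle
// modify `in[i]`, so loads and stores must stay ordered. With it, loads can
// be reordered or hoisted and the scheduler tracks far fewer dependence
// edges -- the effect the alias scopes are after.
void scale(float *__restrict__ out, const float *__restrict__ in,
           float factor, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * factor;
}
```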
Force-pushed ee74198 to fe17f44.
Notes to reviewers:
- The changes around `DistinctAttr` and the handling of `AliasScopeAttr` and the like are direct ports from upstream: I figured I'd port them in now because the new solution is nicer and it'll save us the pain during the next upstream merge.
- The changes adding `nontemporal` to `vector.load` and `vector.store`, improving the `-gpu-to-llvm` attribute handling, and adding `!invariant.load` to `llvm.load` are mine and are upstreaming candidates that should be reviewed as such.
- The rest is the rock-specific work: `-rock-analyze-memory-use` and `-rock-prepare-llvm`. I've copied out the commit message for that last commit below.

A lot of our existing annotation scheme for memory accesses was scattered all over the place. This commit unifies things into -rock-analyze-memory-use.

This pass covers:

1. Walking through global_load and global_store operations in order to find all the "needs 64 bit index" attributes and moving that declaration to the function level. That means we won't have issues with the fact that that runs in blockwise-gemm-to-threadwise.
2. Detecting readonly and writeonly arguments. This is done by looking for arguments that are only (ignoring views) arguments to global_load or global_store(set) respectively, and then putting those annotations onto the function arguments for lowering to LLVM.
2b. Setting nontemporal flags on stores to writeonly values. If we won't be reading it, we want the write to skip cache, and nontemporal is a hint to make that happen. (It doesn't apply to buffer stores yet since we don't have address space 7, but it's better than nothing).
3. Setting all sorts of relevant LLVM annotations (noalias, nonnull, noundef, dereferenceable, and so on) in order to promise the compiler things we know are true anyway (if they're not, that's on the client).

Then, this commit also expands our LLVM preparation pass. The pass is made to run on llvm.func ops, and we use the nesting mechanism to ensure that we run that on functions inside GPU modules specifically.

The preparation pass is updated to

1. Update the alignment of vector loads and stores to be the alignment of the vector, not its component, since MLIR's lowering is pessimistic in a way we don't want to be.
2. Set !invariant.load on loads from readonly values to give out even stronger hints about the ability to hoist or sink operations.
3. Work around a shortcoming in how the AMDGPU backend handles `noalias` (specifically, it gets dropped on kernel arguments during target-specific lowering) by adding our own alias scopes, which indicate that loads from one pointer or buffer can't alias the loads from the other ones. This will hopefully reduce the number of scheduling DAG edges and therefore give us some of our compile performance back ... or at least it'll give us better codegen thanks to more accurate alias information.

This commit also adds the "nontemporal" attribute to rock.global_store so that -rock-sugar-to-loops can set it on the underlying stores.
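And as a rough mental model for the `!invariant.load` piece (again, just a sketch, not the pass's output): once a load is known to read memory that cannot change for the duration of the kernel, it acts like a loop-invariant value.

```c++
// Illustrative only. `*scale` is promised not to change while the function
// runs (a readonly argument whose loads are invariant, in the kernel's
// terms), so the compiler is free to load it once outside the loop instead
// of re-reading it on every iteration.
void applyScale(float *__restrict__ out, const float *__restrict__ scale,
                int n) {
  for (int i = 0; i < n; ++i)
    out[i] *= *scale;  // this load can legally be hoisted out of the loop
}
```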