Conversation

@krzysz00 (Collaborator) commented Sep 14, 2023

Notes to reviewers:

  • The first four commits (related to DistinctAttr and the handling of AliasScopeAttr and the like) are direct ports from upstream: I figured I'd port them in now because the new solution is nicer and it'll save us the pain during the next upstream merge.
  • The three commits after that (related to adding nontemporal to vector.load and vector.store, improving the -gpu-to-llvm attribute handling, and adding !invariant.load to llvm.load) are mine and are upstreaming candidates that should be reviewed as such.
  • The last commit is all the internal plumbing, both in terms of -rock-analyze-memory-use and -rock-prepare-llvm. I've copied out the commit message for that last commit below.

A lot of our existing annotation scheme for memory accesses was
scattered all over the place. This commit unifies things into
-rock-analyze-memory-use.

This pass covers:

  1. Walking through global_load and global_store operations in order to
    find all the "needs 64 bit index" attributes and moving that
    declaration to the function level. That way, we avoid issues stemming
    from the fact that that detection currently runs in
    blockwise-gemm-to-threadwise.
  2. Detecting readonly and writeonly arguments. This is done by looking
    for arguments that (ignoring views) only ever feed global_load or
    global_store(set), respectively, and then putting those annotations
    onto the function arguments for lowering to LLVM.
    2b. Setting nontemporal flags on stores to writeonly values. If we
    won't be reading it, we want the write to skip cache, and nontemporal
    is a hint to make that happen. (It doesn't apply to buffer stores yet
    since we don't have address space 7, but it's better than nothing).
  3. Setting all sorts of relevant LLVM annotations (noalias, nonnull,
    noundef, dereferenceable, and so on) in order to promise the compiler
    things we know are true anyway (if they're not, that's on the
    client). A sketch of this step follows the list.
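
To illustrate step 3, here is a minimal sketch (a hypothetical helper, not the pass's actual code), assuming the MLIR convention of "llvm."-prefixed argument attributes that the GPU-to-LLVM lowering copies onto llvm.func arguments; the exact attribute set the lowering supports may differ:

    #include "mlir/Dialect/Func/IR/FuncOps.h"
    #include "mlir/IR/Builders.h"

    using namespace mlir;

    // Hypothetical helper: promise that argument `argNo` is a valid, defined,
    // non-aliasing pointer, optionally readonly and of known size.
    static void annotateArg(func::FuncOp func, unsigned argNo, bool readonly,
                            int64_t knownSizeInBytes) {
      Builder b(func.getContext());
      func.setArgAttr(argNo, "llvm.noalias", b.getUnitAttr());
      func.setArgAttr(argNo, "llvm.nonnull", b.getUnitAttr());
      func.setArgAttr(argNo, "llvm.noundef", b.getUnitAttr());
      if (readonly)
        func.setArgAttr(argNo, "llvm.readonly", b.getUnitAttr());
      if (knownSizeInBytes > 0)
        func.setArgAttr(argNo, "llvm.dereferenceable",
                        b.getI64IntegerAttr(knownSizeInBytes));
    }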

Then, this commit also expands our LLVM preparation pass.
The pass is made to run on llvm.func ops, and we use the nesting
mechanism to ensure that we run that on functions inside GPU modules
specifically.

The preparation pass is updated to:

  1. Update the alignment of vector loads and stores to be the alignment
    of the vector, not its component, since MLIR's lowering is
    pessimistic in a way we don't want to be (first sketch below).
  2. Set !invariant.load on loads from readonly values to give out even
    stronger hints about the ability to hoist or sink operations.
  3. Work around a shortcoming in how the AMDGPU backend handles
    noalias (specifically, it gets dropped on kernel arguments during
    target-specific lowering) by adding our own alias scopes, which
    indicate that loads from one pointer or buffer can't alias the loads
    from the other ones (second sketch below). This will hopefully reduce
    the number of scheduling DAG edges and therefore give us some of our
    compile-time performance back ... or at least it'll give us better
    codegen thanks to more accurate alias information.
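
To make steps 1 and 3 concrete, here are two hedged sketches. Neither is the pass's actual code; the accessor names are the usual tablegen-generated ones and are assumptions.

First sketch (step 1): widen a vector llvm.load's alignment from the element's to the full vector's, where `dataLayout` is assumed to be the mlir::DataLayout of the enclosing module.

    static void widenVectorLoadAlignment(LLVM::LoadOp load,
                                         const DataLayout &dataLayout) {
      auto vecTy = dyn_cast<VectorType>(load.getResult().getType());
      if (!vecTy)
        return;
      // MLIR's default lowering uses the element alignment; we know the
      // buffers are aligned to the full vector.
      uint64_t vecAlign = dataLayout.getTypeABIAlignment(vecTy);
      if (load.getAlignment().value_or(1) < vecAlign)
        load.setAlignment(vecAlign);
    }

Second sketch (step 3): one alias scope per pointer argument, built with the DistinctAttr-based attributes ported in this PR; a load from argument i then carries scope i plus a noalias list of every other argument's scope.

    static SmallVector<Attribute> buildArgAliasScopes(MLIRContext *ctx,
                                                      unsigned numPtrArgs) {
      Builder b(ctx);
      auto domain = LLVM::AliasScopeDomainAttr::get(
          DistinctAttr::create(b.getUnitAttr()),
          b.getStringAttr("kernel args"));
      SmallVector<Attribute> scopes;
      for (unsigned i = 0; i < numPtrArgs; ++i)
        scopes.push_back(LLVM::AliasScopeAttr::get(
            DistinctAttr::create(b.getUnitAttr()), domain));
      return scopes;
    }

    static void tagLoad(LLVM::LoadOp load, unsigned argNo,
                        ArrayRef<Attribute> scopes) {
      Builder b(load.getContext());
      // The load is in its own argument's scope...
      load.setAliasScopesAttr(b.getArrayAttr(scopes[argNo]));
      // ...and declared noalias with every other argument's scope.
      SmallVector<Attribute> others;
      for (size_t i = 0, e = scopes.size(); i != e; ++i)
        if (i != argNo)
          others.push_back(scopes[i]);
      load.setNoaliasScopesAttr(b.getArrayAttr(others));
    }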

This commit also adds the "nontemporal" attribute to rock.global_store
so that -rock-sugar-to-loops can set it on the underlying stores.
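
As a hedged illustration of that plumbing (the names here are assumptions: the attribute is read generically, and setNontemporal is the accessor the upstreamed vector.store change would generate):

    // Inside the rock.global_store lowering: forward the flag onto the
    // vector store the op expands into.
    bool nontemporal = globalStoreOp->hasAttr("nontemporal");
    auto store = rewriter.create<vector::StoreOp>(loc, valueToStore, buffer,
                                                  storeIndices);
    if (nontemporal)
      store.setNontemporal(true); // lowers to !nontemporal on llvm.store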

@krzysz00 (Collaborator, Author) commented:

Also, @giuseros, if you've got the command for benchmarking that post-large-tensors slowdown handy, I can run it on this change.

@krzysz00 (Collaborator, Author) commented:

Update: preliminary perf checks are showing that setting nontemporal is usually, but not uniformly, a bad idea in our contexts, so that part of the PR will probably be removed (though I'll keep the upstream changes)

@krzysz00 krzysz00 force-pushed the memory-analysis branch 2 times, most recently from 6e2cfa7 to 97f9201 on September 20, 2023

@giuseros (Contributor) left a comment:

The code looks fine; I am a bit worried about introducing further analyses in the back-end that might slow down compilation. In particular, I am worried about the invariant.load attribute (see rust-lang/rust#103070). Could you try tuning some configurations and verify that the compile time does not get too bad?

    return;
  unsigned argNo = funcArg.getArgNumber();
  if (auto load = dyn_cast<LLVM::LoadOp>(aliasOp))
    load.setInvariantLoad(isReadonly[argNo]);

@giuseros (Contributor) commented on the diff above:

Not sure how relevant this is, but I stumbled upon rust-lang/rust#103070, which talks about invariant.start. Do you mind making sure that the compilation time is not heavily affected by this?

  op.setFailureOrdering(LLVM::AtomicOrdering::monotonic);
});

// 4. TODO: add some invariant.start calls once MLIR's got them.

@giuseros (Contributor) commented on the diff above:

See my comment above :)

@giuseros (Contributor) commented:

> Also, @giuseros, if you've got the command for benchmarking that post-large-tensors slowdown handy, I can run it on this change.

This is the command I was using to investigate compilation slowdowns:

time python3 ./bin/tuningRunner.py --op gemm --config="-g 64 -m 1024 -k 1024 -n 384 -t f32 -out_datatype f32 -transA 0 -transB 0"

@manupak (Contributor) commented Sep 27, 2023

It'd be much easier to review if you could separate the things you port from upstream from what you've implemented new in this PR, because the former would be a no-brainer to get in.

@krzysz00 (Collaborator, Author) commented:

@manupak You can filter the review view by commit, if that helps

@giuseros (Contributor) left a comment:

LGTM!

@krzysz00 krzysz00 force-pushed the memory-analysis branch 2 times, most recently from 38eeb42 to ee74198 on December 6, 2023
Mogball and others added 5 commits December 20, 2023 16:03

The `allocsize` attribute is weird because it packs two 32-bit values
into a 64-bit value. It also turns out that the passthrough attribute
exporter was using `int`, which handles 64-bit integers incorrectly.

Reviewed By: rriddle

Differential Revision: https://reviews.llvm.org/D156574
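
For context, the packing being described looks roughly like this (a from-memory illustration of LLVM's scheme, not code from the commit; the real version also reserves a sentinel for a missing second index):

    #include <cstdint>

    // allocsize carries the index of the size argument in the high 32 bits
    // and the index of the optional count argument in the low 32 bits, so
    // round-tripping the payload through a plain `int` truncates it.
    uint64_t packAllocSizeArgs(uint32_t elemSizeArg, uint32_t numElemsArg) {
      return (uint64_t(elemSizeArg) << 32) | numElemsArg;
    }
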
Since vector loads and stores from scalar memrefs translate to
llvm.load/store, add the ability to tag said loads and stores as
nontemporal. This mirrors functionality available in
memref.load/store.

Expand the copying of attributes on GPU kernel arguments during LLVM
lowering.

Support copying attributes from values that are already LLVM pointers.

Support copying attributes, like `noundef`, that aren't specific to
(the pointer parts of) arguments.
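
Roughly, the expanded copying amounts to the following sketch (a hypothetical fragment; gpuFunc and llvmFunc stand for the pre- and post-lowering functions):

    // Forward "llvm."-prefixed argument attributes (noalias, noundef, ...)
    // from each gpu.func argument onto the matching llvm.func argument,
    // whether or not the attribute is pointer-specific.
    for (unsigned i = 0, e = gpuFunc.getNumArguments(); i < e; ++i)
      if (DictionaryAttr argDict = gpuFunc.getArgAttrDict(i))
        for (NamedAttribute attr : argDict)
          if (attr.getName().getValue().starts_with("llvm."))
            llvmFunc.setArgAttr(i, attr.getName(), attr.getValue());
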
Add support for !invariant.load metadata (by way of a unit attribute)
to the MLIR representation of llvm.load.

(The message for this last commit is the one quoted in full in the PR description above, so it is not repeated here.)

@krzysz00 krzysz00 merged commit d28f6df into develop Jan 2, 2024
@krzysz00 krzysz00 deleted the memory-analysis branch January 2, 2024 18:23