forked from triton-lang/triton
merge from main #11
Merged
<!---
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.
-->

# New contributor declaration

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `Existing tests are sufficient`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Replacing use of deprecated `ast.Num` class with `ast.Constant`. This change is necessary to allow compatibility with Python 3.14, in which `ast.Num` was removed.

Co-authored-by: Rich Coombs <[email protected]>
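As a rough illustration of the `ast.Num` to `ast.Constant` migration described above, the sketch below shows one way to recognize numeric literals using only `ast.Constant`. The helper name `is_numeric_literal` is hypothetical and is not taken from the Triton sources; it only uses the standard-library `ast` module.

```python
import ast


def is_numeric_literal(node: ast.AST) -> bool:
    """Return True for literals that the old ast.Num check used to match.

    Hypothetical helper for illustration only.
    """
    # ast.Constant also covers strings, bytes, None, Ellipsis and booleans,
    # so the value type has to be checked explicitly. bool is a subclass of
    # int, but True/False were never treated as ast.Num, so exclude it.
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, complex))
        and not isinstance(node.value, bool)
    )


# "x + 42" parses to a BinOp whose right operand is a numeric constant.
expr = ast.parse("x + 42", mode="eval").body
```

Checking `node.value` rather than the removed `node.n` attribute keeps the same code working on both older and newer Python versions.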
Add support for LDS memory barrier ops on gfx1250 and expose them to gluon.
…8633)

This PR fixes nondeterministic results in Triton's atomic RMW unit tests on RDNA. The issue was caused by an unnecessary non-offset LDS store of the atomic return value ("old") that every lane executed to the same LDS address, creating a race without any functional use.

    %16 = atomicrmw add ptr addrspace(1) %dst, i64 %val syncscope("agent") acq_rel
    store i64 %16, ptr addrspace(3) @global_smem ; **<-- non-offset LDS write**
    . . .
    %8 = (lane_id & 63) << 3
    %19 = getelementptr i8, ptr addrspace(3) @global_smem, i32 %8
    store <1 x i64> %21, ptr addrspace(3) %19
    %24 = load i64, ptr addrspace(3) %19

There is no read from `@global_smem` at offset 0 after the first store. Thus, the zero-offset store is dead, but every lane still writes it, causing write-write contention. For lane 0, the same location is later used for the real value.

    **Mismatched elements: 1 / 64 (1.56%)**
    Max absolute difference among violations: 1.87127712
    Max relative difference among violations: 1.52464041
    **ACTUAL:** array([**-0.643921**, -0.221742, 1.451132, -2.439096, 1.336105, 0.748346, 0.725985, -0.604978, 2.061625, 1.040246, -0.511721, -0.546404, -0.163719, 0.279938, -0.730161, -0.21177 , -0.323359, 1.263584,...
    **DESIRED:** array([ **1.227356**, -0.221742, 1.451132, -2.439096, 1.336105, 0.748346, 0.725985, -0.604978, 2.061625, 1.040246, -0.511721, -0.546404, -0.163719, 0.279938, -0.730161, -0.21177 , -0.323359, 1.263584,...
    ============================== 1 failed in 3.73s ===============================

A non-offset LDS store (e.g., writing to LDS[0]) only makes sense in a true scalar scenario, when you deliberately perform one atomic per wave (single output).

**Fix**

We removed/guarded the non-offset LDS write so it is not emitted when it is unsafe or unnecessary:

- No non-offset LDS staging for tensor/multi-output atomics.
- No LDS staging when the atomic return is unused.
…rs (triton-lang#8639)

As part of using the `triton.knobs.runtime.jit_cache_hook`, the `JITFunction` class performs JSON serialization on the specialization data. The serialized specialization data is then expected to be used as part of the `preload()` function, where it will be deserialized and used to compile the Triton kernel.

However, this process fails to account for the following cases:

- When part of the Triton python signature is a Python tuple, the serialization process will transform it into a list (because JSON serializes tuples as lists); the deserialization process does not transform it back into a tuple, leading to a parsing failure when `ast_to_ttir()` is invoked.
- When the constants contain a `tl.constexpr` value, the serialization process will raise an error, because `tl.constexpr` is not serializable.

This PR addresses both of these issues by:

- Applying the reverse transformation in the deserialization from lists to tuples for signatures. We can do this unconditionally because lists are not accepted as part of the signature of a Triton kernel.
- Adding a special case for `constexpr` for constants in the specialization data, so that it can be serialized and deserialized without losing its type.
- Adding a test that is the exact same as `test_passing_nested_tuple_with_constexpr`, but with the JIT hook setup so that we can verify that the serialization/deserialization round-trip works as intended.
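The two failure modes above can be reproduced with the standard `json` module alone. The following is a minimal sketch, not the code from the PR: `ConstexprStandIn`, `encode_constant`, `decode_constant`, `lists_to_tuples`, and the `"__constexpr__"` tag are all hypothetical names chosen for illustration, and the real `tl.constexpr` handling in Triton may look different.

```python
import json


class ConstexprStandIn:
    """Hypothetical stand-in for tl.constexpr: a thin wrapper around a value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, ConstexprStandIn) and self.value == other.value


def encode_constant(obj):
    # json.dumps calls this only for objects it cannot serialize on its own.
    if isinstance(obj, ConstexprStandIn):
        return {"__constexpr__": obj.value}  # assumed tag, not Triton's format
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def decode_constant(d):
    # object_hook: called for every decoded JSON object (dict).
    if set(d) == {"__constexpr__"}:
        return ConstexprStandIn(d["__constexpr__"])
    return d


def lists_to_tuples(obj):
    # Reverse the tuple -> list conversion that JSON applies, recursively.
    if isinstance(obj, list):
        return tuple(lists_to_tuples(item) for item in obj)
    return obj


signature = ("i32", ("fp32", ("i64", "constexpr")))
constants = {"BLOCK": ConstexprStandIn(128)}

payload = json.dumps(
    {"signature": signature, "constants": constants}, default=encode_constant
)
restored = json.loads(payload, object_hook=decode_constant)
restored_signature = lists_to_tuples(restored["signature"])
```

After the round trip, `restored["signature"]` is a nested list, while `restored_signature` compares equal to the original nested tuple. Converting unconditionally is only safe under the assumption stated in the PR description, namely that lists never appear in a kernel signature.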
…8703)

`visit_AugAssign` works by creating fake AST nodes for the binary operation and the assignment and visiting those. Prior to this commit, the nodes lacked the correct `lineno` and `col_offset`, so any errors raised here didn't include the helpful '^' pointer to the offending line of code; this made debugging painful.
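The idea of giving synthesized nodes a source location can be sketched with the standard-library `ast` module. This is not Triton's code generator: `AugAssignExpander` is a hypothetical transformer that handles only plain-name targets, and it uses `ast.copy_location` to carry `lineno` and `col_offset` over from the original node.

```python
import ast


class AugAssignExpander(ast.NodeTransformer):
    """Rewrite `x op= y` into `x = x op y`, keeping source locations.

    Hypothetical illustration; only simple name targets are handled.
    """

    def visit_AugAssign(self, node):
        if not isinstance(node.target, ast.Name):
            return node  # leave subscripts/attributes untouched in this sketch
        # The original target is a Store; the binary op needs a Load copy.
        lhs = ast.copy_location(
            ast.Name(id=node.target.id, ctx=ast.Load()), node.target
        )
        # copy_location copies lineno/col_offset (and the end_* fields) so
        # diagnostics raised on the fake nodes can point at the real source.
        rhs = ast.copy_location(
            ast.BinOp(left=lhs, op=node.op, right=node.value), node
        )
        assign = ast.Assign(targets=[node.target], value=rhs)
        return ast.copy_location(assign, node)


tree = AugAssignExpander().visit(ast.parse("x = 1\nx += 2\n"))
new_assign = tree.body[1]

# The rewritten tree still compiles and runs because every node has a location.
namespace = {}
exec(compile(tree, "<augassign-sketch>", "exec"), namespace)
```

Without the `copy_location` calls, `compile` would reject the synthesized nodes for missing `lineno`, which is the same class of problem the commit message describes for error reporting.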
Consan-generated functions are cached by the signature type. So far the buffer/mbarrier memdesc was part of the signature, and since the types of these will generally differ between the buffers accessed in the kernel, we were ending up with duplicated functions where the only difference was the type of the memdesc. This PR moves the memdesc-to-i64 cast to before the function call, so functions always take an i64 param, reducing the duplication and the code size. This saves some compilation time.

H100 (no warp specialization):

| | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 4,503 | 6,359 | 6,347 |
| make_ttgir | 25,823 | 32,835 | 32,785 |
| make_llir | 249,535 | 2,609,209 | 1,926,120 |
| make_ptx | 124,729 | 808,051 | 642,082 |
| make_cubin | 393,597 | 9,313,567 | 8,659,112 |

GB200 (WS):

| | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 5,120 | 5,063 | 5,322 |
| make_ttgir | 181,326 | 181,617 | 186,404 |
| make_llir | 261,414 | 9,591,225 | 7,421,986 |
| make_ptx | 66,653 | 2,208,114 | 1,878,720 |
| make_cubin | 250,249 | 57,137,738 | 41,390,814 |
Now, the call paths of graph-launched kernels consist of two components:
the **launch context** and the **captured context**.
For example:
```
└─ 40480.000 40.000 test
└─ 40480.000 40.000 <captured_at>
├─ 4256.000 4.000 iter_0
│ ├─ 2176.000 2.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIfEESt5arrayIPcLm1EEEEviT0_T1_
│ ├─ 1152.000 1.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_15CUDAFunctor_addIfEESt5arrayIPcLm3EEEEviT0_T1_
│ └─ 928.000 1.000 foo
```
The example above shows that the `foo` kernel is launched under the
`test` frame and was captured within the `iter_0` frame.
Additional updates include improved tracking of graph node counts,
replacing macros with `constexpr std::array` for activity registration,
distinguishing between cloned and newly created graph nodes, and
initializing stream capture event handling.
…ize` (triton-lang#8713)

We assume function metadata is only initialized once per execution, but `proton.finalize` might be called multiple times, so don't release the metadata map.
```
scopeA.start
{
scopeB.start
scopeB.end
}
scopeA.end
```
If scopeA is scopeB's parent, `scopeA.end` should post-dominate
`scopeB.end`.
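For intuition only, the nesting rule can be checked on a straight-line trace of scope events with a stack. This is a simplified sketch under the assumption of a linear event sequence; it is not the control-flow post-dominance analysis itself, and `scopes_are_well_nested` is a hypothetical helper name.

```python
def scopes_are_well_nested(events):
    """Check that every child scope ends before its parent ends.

    `events` is a list of (scope_name, "start" | "end") pairs in execution
    order. Hypothetical helper; a linear trace stands in for a real CFG.
    """
    stack = []
    for name, kind in events:
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            # The scope being closed must be the innermost open one.
            if not stack or stack[-1] != name:
                return False
            stack.pop()
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    # Every opened scope must also have been closed.
    return not stack


good = [("scopeA", "start"), ("scopeB", "start"),
        ("scopeB", "end"), ("scopeA", "end")]
bad = [("scopeA", "start"), ("scopeB", "start"),
       ("scopeA", "end"), ("scopeB", "end")]
```

Here `good` mirrors the listing above, while `bad` closes `scopeA` before its child `scopeB`, which is the ordering the rule forbids.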
…g#5455) (triton-lang#8717) Signed-off-by: Anatoly Myachev <[email protected]>
…riton-lang#8704)

This change is not strictly NFC, but in practice it behaves as one. Before, we relied on the CTALayout information to populate the CTALayout correctly for NVMMA. Now, we serialise and deserialise a `rank` attribute to this effect. This will still be necessary when we move to representing CTALayouts with LinearLayouts, as this will give the number of output dimensions of the associated LinearLayout (even if empty).

Weirdly enough, I found that on `ws-code-partition.mlir` there are some `nvmma` layouts with `rank = 1`. I think those are wrong, but I didn't want to keep touching things.
Dropped unused header files along the way.
This adds a new layout type that is like auto layout, except that the layout is set automatically by any load or store op to be the layout that the coalesce pass would have chosen.