
Conversation

@jungpark-mlir
Owner

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

SoftwareBoi and others added 19 commits November 11, 2025 12:04

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `Existing tests are sufficient`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)

Replace uses of the deprecated `ast.Num` class with `ast.Constant`. This
change is necessary for compatibility with Python 3.14, in which
`ast.Num` was removed.

Co-authored-by: Rich Coombs <[email protected]>
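
For illustration, here is a minimal sketch of the kind of substitution involved; the visitor below is hypothetical and not the code touched by this change.

```python
import ast

# `ast.Num` has been deprecated since Python 3.8 and was removed in 3.14,
# so numeric literals must now be matched via `ast.Constant`.
class LiteralVisitor(ast.NodeVisitor):
    def visit_Constant(self, node):
        # Previously this check would have lived in `visit_Num`.
        if isinstance(node.value, (int, float)):
            print("numeric literal:", node.value)

LiteralVisitor().visit(ast.parse("x = 1 + 2.5"))
```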
Add support for LDS memory barrier ops on gfx1250 and
expose them to gluon.
…8633)

This PR fixes nondeterministic results in Triton's atomic RMW unit tests
on RDNA. The issue was caused by an unnecessary non-offset LDS store of
the atomic return ("old") that every lane executed to the same LDS
address, creating a race without any functional use.

```llvm
%16 = atomicrmw add ptr addrspace(1) %dst, i64 %val syncscope("agent") acq_rel
store i64 %16, ptr addrspace(3) @global_smem   ; <-- non-offset LDS write

...

%8  = (lane_id & 63) << 3
%19 = getelementptr i8, ptr addrspace(3) @global_smem, i32 %8
store <1 x i64> %21, ptr addrspace(3) %19
%24 = load  i64, ptr addrspace(3) %19
```

There is no read from `@global_smem` at offset 0 after the first store.
Thus, the zero-offset store is dead, but every lane still writes it,
causing write-write contention. For lane 0, the same location is later
used for the real value.

```
Mismatched elements: 1 / 64 (1.56%)
Max absolute difference among violations: 1.87127712
Max relative difference among violations: 1.52464041
ACTUAL:  array([-0.643921, -0.221742,  1.451132, -2.439096,  1.336105,  0.748346,
                 0.725985, -0.604978,  2.061625,  1.040246, -0.511721, -0.546404,
                -0.163719,  0.279938, -0.730161, -0.21177 , -0.323359,  1.263584, ...
DESIRED: array([ 1.227356, -0.221742,  1.451132, -2.439096,  1.336105,  0.748346,
                 0.725985, -0.604978,  2.061625,  1.040246, -0.511721, -0.546404,
                -0.163719,  0.279938, -0.730161, -0.21177 , -0.323359,  1.263584, ...
============================== 1 failed in 3.73s ===============================
```


A non-offset LDS store (e.g., writing to LDS[0]) only makes sense in a
truly scalar scenario, when you deliberately perform one atomic per wave
(single output).

**Fix**
We removed/guarded the non-offset LDS write so that it is not emitted when
it is unsafe or unnecessary:
- No non-offset LDS staging for tensor/multi-output atomics.
- No LDS staging when the atomic return is unused.
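
For context, here is a minimal sketch of the kind of tensor atomic that exercises this path (a hypothetical kernel, not the failing unit test itself): the returned "old" values are live, so the backend must stage one result per lane.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def atomic_add_kernel(dst_ptr, out_ptr, val, BLOCK: tl.constexpr):
    offsets = tl.arange(0, BLOCK)
    old = tl.atomic_add(dst_ptr + offsets, val)  # tensor (multi-output) atomic RMW
    tl.store(out_ptr + offsets, old)             # the atomic return value is used

dst = torch.randn(64, device="cuda")
out = torch.empty_like(dst)
atomic_add_kernel[(1,)](dst, out, 1.0, BLOCK=64)
```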
…rs (triton-lang#8639)

As part of using the `triton.knobs.runtime.jit_cache_hook`, the
`JITFunction` class performs JSON serialization on the specialization
data. The serialized specialization data is then expected to be used as
part of the `preload()` function, where it will be deserialized and used
to compile the Triton kernel.

However, this process fails to account for the following cases:
- When part of the Triton Python signature is a tuple, the serialization
process transforms it into a list (because JSON serializes tuples as
lists); the deserialization process does not transform it back into a
tuple, leading to a parsing failure when `ast_to_ttir()` is invoked.
- When the constants contain a `tl.constexpr` value, the serialization
process raises an error, because `tl.constexpr` is not JSON-serializable.
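
A minimal illustration of the first point; the signature entry below is hypothetical, but the round-trip behaviour is simply how `json` works.

```python
import json

specialization = {"signature": {"x": ("*fp32", "i32")}}   # tuple in the signature

restored = json.loads(json.dumps(specialization))
print(restored["signature"]["x"])                         # ['*fp32', 'i32'] -- now a list

# Reverse transformation on deserialization: lists are never valid in a kernel
# signature, so they can be converted back to tuples unconditionally.
restored["signature"] = {
    k: tuple(v) if isinstance(v, list) else v
    for k, v in restored["signature"].items()
}
print(restored["signature"]["x"])                         # ('*fp32', 'i32')
```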

This PR addresses both of these issues by:
- Applying the reverse transformation (lists back to tuples) to signatures
during deserialization. We can do this unconditionally because lists are
not accepted as part of the signature of a Triton kernel.
- Adding a special case for `constexpr` constants in the specialization
data, so that they can be serialized and deserialized without losing
their type.
- Adding a test identical to `test_passing_nested_tuple_with_constexpr`,
but with the JIT hook set up, so that we can verify that the
serialization/deserialization round-trip works as intended.
…8703)

`visit_AugAssign` works by creating fake AST nodes for the binary
operation and the assignment and visiting those. Prior to this commit,
those nodes lacked the correct `lineno` and `col_offset`, so any errors
raised here didn't include the helpful '^' pointer to the offending line
of code; this made debugging painful.
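
A minimal sketch of the idea (not Triton's actual code generator): build the equivalent BinOp/Assign nodes for `x += 1`, then copy the source locations onto them so error messages can still point at the offending line.

```python
import ast

aug = ast.parse("x += 1").body[0]                     # the AugAssign node
binop = ast.BinOp(left=ast.Name(id=aug.target.id, ctx=ast.Load()),
                  op=aug.op, right=aug.value)
assign = ast.Assign(targets=[aug.target], value=binop)

# Without this, the fake nodes have no lineno/col_offset, and any error raised
# while visiting them loses the '^' pointer into the source.
ast.copy_location(binop, aug)
ast.copy_location(assign, aug)
ast.fix_missing_locations(assign)
print(assign.lineno, assign.col_offset)               # 1 0
```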
Consan-generated functions are cached by signature type. Until now, the
buffer/mbarrier memdesc was part of the signature, and since these types
generally differ between the buffers accessed in a kernel, we ended up
with duplicated functions whose only difference was the type of the
memdesc.
This PR moves the memdesc-to-i64 cast to before the function call, so the
functions always take an i64 parameter, reducing the duplication and the
code size. This saves some compilation time.

H100 (no warp specialization):

| Stage | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 4,503 | 6,359 | 6,347 |
| make_ttgir | 25,823 | 32,835 | 32,785 |
| make_llir | 249,535 | 2,609,209 | 1,926,120 |
| make_ptx | 124,729 | 808,051 | 642,082 |
| make_cubin | 393,597 | 9,313,567 | 8,659,112 |

GB200 (WS):

| Stage | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 5,120 | 5,063 | 5,322 |
| make_ttgir | 181,326 | 181,617 | 186,404 |
| make_llir | 261,414 | 9,591,225 | 7,421,986 |
| make_ptx | 66,653 | 2,208,114 | 1,878,720 |
| make_cubin | 250,249 | 57,137,738 | 41,390,814 |
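
As a rough analogy only (hypothetical Python, not the MLIR implementation): caching generated helpers by parameter type yields one copy per distinct memdesc type, while casting to a single `i64` at the call site lets every call share one instantiation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def make_check_fn(param_type: str):
    # One generated helper per distinct parameter type.
    print(f"instantiating checker({param_type})")
    return lambda value: ("checked", param_type, value)

# Before: each memdesc type gets its own near-identical helper.
make_check_fn("!ttg.memdesc<128x64xf16>")(0x1000)
make_check_fn("!ttg.memdesc<64x64xf32>")(0x2000)

# After: the memdesc is cast to i64 before the call, so all call sites
# share the single i64 variant.
make_check_fn("i64")(0x1000)
make_check_fn("i64")(0x2000)
```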
Now, the call paths of graph-launched kernels consist of two components:
the **launch context** and the **captured context**.
For example:

```
└─ 40480.000 40.000 test
   └─ 40480.000 40.000 <captured_at>
      ├─ 4256.000 4.000 iter_0
      │  ├─ 2176.000 2.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIfEESt5arrayIPcLm1EEEEviT0_T1_
      │  ├─ 1152.000 1.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_15CUDAFunctor_addIfEESt5arrayIPcLm3EEEEviT0_T1_
      │  └─ 928.000 1.000 foo
```

The example above shows that the `foo` kernel is launched under the
`test` frame and was captured within the `iter_0` frame.
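
A rough sketch of how such a call path arises, using plain PyTorch CUDA graph capture; the frame names mirror the tree above and the profiler's own calls are omitted.

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(1024, device="cuda")

def iter_0():
    # Work recorded during capture shows up under the <captured_at> frame.
    x.fill_(1.0)
    x.add_(2.0)

with torch.cuda.graph(g):
    iter_0()

def test():
    # Replaying the graph is the launch context: captured kernels are
    # attributed to both `test` (launch) and `iter_0` (capture).
    g.replay()

test()
torch.cuda.synchronize()
```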

Additional updates include improved tracking of graph node counts,
replacing macros with `constexpr std::array` for activity registration,
distinguishing between cloned and newly created graph nodes, and
initializing stream capture event handling.
…ize` (triton-lang#8713)

We assume function metadata is only initialized once per execution, but
`proton.finalize` might be called multiple times, so don't release the
metadata map.
```
scopeA.start
{
  scopeB.start
  scopeB.end
}
scopeA.end
```

If `scopeA` is `scopeB`'s parent, `scopeA.end` should post-dominate
`scopeB.end`.
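
A minimal sketch of that invariant (hypothetical checker, not Proton's implementation): scope ends must be properly nested, so a child scope always ends before its enclosing parent.

```python
def check_nesting(events):
    stack = []
    for kind, name in events:
        if kind == "start":
            stack.append(name)
        else:
            assert stack and stack[-1] == name, f"{name} ended out of order"
            stack.pop()
    assert not stack, f"unterminated scopes: {stack}"

check_nesting([("start", "scopeA"), ("start", "scopeB"),
               ("end", "scopeB"), ("end", "scopeA")])   # OK
```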
…riton-lang#8704)

It's not strictly NFC, but it is effectively one. Before, we relied on the
CTALayout information to populate the CTALayout correctly for NVMMA.
Now, we serialise and deserialise a `rank` attribute to this effect.

This will still be necessary when we move to representing CTALayouts
with LinearLayouts, as it will give the number of output dimensions of
the associated LinearLayout (even if empty).

Weirdly enough, I found that on `ws-code-partition.mlir` there are some
`nvmma` layouts with `rank = 1`. I think those are wrong, but I didn't
want to keep touching things.
Dropped unused header files along the way.
This adds a new layout type that is like the auto layout, except that any load or store op automatically sets it to the layout the coalesce pass would have chosen.
@jungpark-mlir jungpark-mlir merged commit 66a1cdd into newpp Nov 13, 2025