
Conversation

@jungpark-mlir
Owner

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

SoftwareBoi and others added 19 commits November 11, 2025 12:04

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `Existing tests are sufficient`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)

Replace uses of the deprecated `ast.Num` class with `ast.Constant`. This
change is necessary for compatibility with Python 3.14, in which
`ast.Num` was removed.

Co-authored-by: Rich Coombs <[email protected]>
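
For illustration, here is a minimal sketch of the kind of substitution involved; the visitor below is hypothetical and not the code touched by this change.

```python
import ast

# `ast.Num` has been deprecated since Python 3.8 and was removed in 3.14,
# so numeric literals must now be matched via `ast.Constant`.
class LiteralVisitor(ast.NodeVisitor):
    def visit_Constant(self, node):
        # Previously this check would have lived in `visit_Num`.
        if isinstance(node.value, (int, float)):
            print("numeric literal:", node.value)

LiteralVisitor().visit(ast.parse("x = 1 + 2.5"))
```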
Add support for LDS memory barrier ops on gfx1250 and
expose them to gluon.
…8633)

This PR fixes nondeterministic results in Triton's atomic RMW unit tests
on RDNA. The issue was caused by an unnecessary non-offset LDS store of
the atomic return ("old") that every lane executed to the same LDS
address, creating a race without any functional use.

```llvm
%16 = atomicrmw add ptr addrspace(1) %dst, i64 %val syncscope("agent") acq_rel
store i64 %16, ptr addrspace(3) @global_smem   ; <-- non-offset LDS write

...

%8  = (lane_id & 63) << 3
%19 = getelementptr i8, ptr addrspace(3) @global_smem, i32 %8
store <1 x i64> %21, ptr addrspace(3) %19
%24 = load  i64, ptr addrspace(3) %19
```

There is no read from `@global_smem` at offset 0 after the first store.
Thus, the zero-offset store is dead, but every lane still writes it,
causing write-write contention. For lane 0, the same location is later
used for the real value.

```
Mismatched elements: 1 / 64 (1.56%)
Max absolute difference among violations: 1.87127712
Max relative difference among violations: 1.52464041
ACTUAL:  array([-0.643921, -0.221742,  1.451132, -2.439096,  1.336105,  0.748346,
                 0.725985, -0.604978,  2.061625,  1.040246, -0.511721, -0.546404,
                -0.163719,  0.279938, -0.730161, -0.21177 , -0.323359,  1.263584, ...
DESIRED: array([ 1.227356, -0.221742,  1.451132, -2.439096,  1.336105,  0.748346,
                 0.725985, -0.604978,  2.061625,  1.040246, -0.511721, -0.546404,
                -0.163719,  0.279938, -0.730161, -0.21177 , -0.323359,  1.263584, ...
============================== 1 failed in 3.73s ===============================
```


A non-offset LDS store (e.g., writing to LDS[0]) only makes sense in a
truly scalar scenario, when you deliberately perform one atomic per wave
(single output).

**Fix**
We removed/guarded the non-offset LDS write so that it is not emitted when
it is unsafe or unnecessary:
- No non-offset LDS staging for tensor/multi-output atomics.
- No LDS staging when the atomic return is unused.
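
For context, here is a minimal sketch of the kind of tensor atomic that exercises this path (a hypothetical kernel, not the failing unit test itself): the returned "old" values are live, so the backend must stage one result per lane.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def atomic_add_kernel(dst_ptr, out_ptr, val, BLOCK: tl.constexpr):
    offsets = tl.arange(0, BLOCK)
    old = tl.atomic_add(dst_ptr + offsets, val)  # tensor (multi-output) atomic RMW
    tl.store(out_ptr + offsets, old)             # the atomic return value is used

dst = torch.randn(64, device="cuda")
out = torch.empty_like(dst)
atomic_add_kernel[(1,)](dst, out, 1.0, BLOCK=64)
```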
…rs (triton-lang#8639)

As part of using the `triton.knobs.runtime.jit_cache_hook`, the
`JITFunction` class performs JSON serialization on the specialization
data. The serialized specialization data is then expected to be used as
part of the `preload()` function, where it will be deserialized and used
to compile the Triton kernel.

However, this process fails to account for the following cases:
- When part of the Triton Python signature is a tuple, the serialization
process transforms it into a list (because JSON serializes tuples as
lists); the deserialization process does not transform it back into a
tuple, leading to a parsing failure when `ast_to_ttir()` is invoked.
- When the constants contain a `tl.constexpr` value, the serialization
process raises an error, because `tl.constexpr` is not JSON-serializable.
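
A minimal illustration of the first point; the signature entry below is hypothetical, but the round-trip behaviour is simply how `json` works.

```python
import json

specialization = {"signature": {"x": ("*fp32", "i32")}}   # tuple in the signature

restored = json.loads(json.dumps(specialization))
print(restored["signature"]["x"])                         # ['*fp32', 'i32'] -- now a list

# Reverse transformation on deserialization: lists are never valid in a kernel
# signature, so they can be converted back to tuples unconditionally.
restored["signature"] = {
    k: tuple(v) if isinstance(v, list) else v
    for k, v in restored["signature"].items()
}
print(restored["signature"]["x"])                         # ('*fp32', 'i32')
```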

This PR addresses both of these issues by:
- Applying the reverse transformation (lists back to tuples) to signatures
during deserialization. We can do this unconditionally because lists are
not accepted as part of the signature of a Triton kernel.
- Adding a special case for `constexpr` constants in the specialization
data, so that they can be serialized and deserialized without losing
their type.
- Adding a test identical to `test_passing_nested_tuple_with_constexpr`,
but with the JIT hook set up, so that we can verify that the
serialization/deserialization round-trip works as intended.
…8703)

`visit_AugAssign` works by creating fake AST nodes for the binary
operation and the assignment and visiting those. Prior to this commit,
those nodes lacked the correct `lineno` and `col_offset`, so any errors
raised here didn't include the helpful '^' pointer to the offending line
of code; this made debugging painful.
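
A minimal sketch of the idea (not Triton's actual code generator): build the equivalent BinOp/Assign nodes for `x += 1`, then copy the source locations onto them so error messages can still point at the offending line.

```python
import ast

aug = ast.parse("x += 1").body[0]                     # the AugAssign node
binop = ast.BinOp(left=ast.Name(id=aug.target.id, ctx=ast.Load()),
                  op=aug.op, right=aug.value)
assign = ast.Assign(targets=[aug.target], value=binop)

# Without this, the fake nodes have no lineno/col_offset, and any error raised
# while visiting them loses the '^' pointer into the source.
ast.copy_location(binop, aug)
ast.copy_location(assign, aug)
ast.fix_missing_locations(assign)
print(assign.lineno, assign.col_offset)               # 1 0
```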
Consan-generated functions are cached by signature type. Until now, the
buffer/mbarrier memdesc was part of the signature, and since these types
generally differ between the buffers accessed in a kernel, we ended up
with duplicated functions whose only difference was the type of the
memdesc.
This PR moves the memdesc-to-i64 cast to before the function call, so the
functions always take an i64 parameter, reducing the duplication and the
code size. This saves some compilation time.

H100 (no warp specialization):

| Stage | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 4,503 | 6,359 | 6,347 |
| make_ttgir | 25,823 | 32,835 | 32,785 |
| make_llir | 249,535 | 2,609,209 | 1,926,120 |
| make_ptx | 124,729 | 808,051 | 642,082 |
| make_cubin | 393,597 | 9,313,567 | 8,659,112 |

GB200 (WS):

| Stage | base | with consan | optimizing memdesc_to_i64 |
| --- | --- | --- | --- |
| make_ttir | 5,120 | 5,063 | 5,322 |
| make_ttgir | 181,326 | 181,617 | 186,404 |
| make_llir | 261,414 | 9,591,225 | 7,421,986 |
| make_ptx | 66,653 | 2,208,114 | 1,878,720 |
| make_cubin | 250,249 | 57,137,738 | 41,390,814 |
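
As a rough analogy only (hypothetical Python, not the MLIR implementation): caching generated helpers by parameter type yields one copy per distinct memdesc type, while casting to a single `i64` at the call site lets every call share one instantiation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def make_check_fn(param_type: str):
    # One generated helper per distinct parameter type.
    print(f"instantiating checker({param_type})")
    return lambda value: ("checked", param_type, value)

# Before: each memdesc type gets its own near-identical helper.
make_check_fn("!ttg.memdesc<128x64xf16>")(0x1000)
make_check_fn("!ttg.memdesc<64x64xf32>")(0x2000)

# After: the memdesc is cast to i64 before the call, so all call sites
# share the single i64 variant.
make_check_fn("i64")(0x1000)
make_check_fn("i64")(0x2000)
```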
Now, the call paths of graph-launched kernels consist of two components:
the **launch context** and the **captured context**.
For example:

```
└─ 40480.000 40.000 test
   └─ 40480.000 40.000 <captured_at>
      ├─ 4256.000 4.000 iter_0
      │  ├─ 2176.000 2.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIfEESt5arrayIPcLm1EEEEviT0_T1_
      │  ├─ 1152.000 1.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_15CUDAFunctor_addIfEESt5arrayIPcLm3EEEEviT0_T1_
      │  └─ 928.000 1.000 foo
```

The example above shows that the `foo` kernel is launched under the
`test` frame and was captured within the `iter_0` frame.
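
A rough sketch of how such a call path arises, using plain PyTorch CUDA graph capture; the frame names mirror the tree above and the profiler's own calls are omitted.

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(1024, device="cuda")

def iter_0():
    # Work recorded during capture shows up under the <captured_at> frame.
    x.fill_(1.0)
    x.add_(2.0)

with torch.cuda.graph(g):
    iter_0()

def test():
    # Replaying the graph is the launch context: captured kernels are
    # attributed to both `test` (launch) and `iter_0` (capture).
    g.replay()

test()
torch.cuda.synchronize()
```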

Additional updates include improved tracking of graph node counts,
replacing macros with `constexpr std::array` for activity registration,
distinguishing between cloned and newly created graph nodes, and
initializing stream capture event handling.
…ize` (triton-lang#8713)

We assume function metadata is only initialized once per execution, but
`proton.finalize` might be called multiple times, so don't release the
metadata map.
```
scopeA.start
{
  scopeB.start
  scopeB.end
}
scopeA.end
```

If `scopeA` is `scopeB`'s parent, `scopeA.end` should post-dominate
`scopeB.end`.
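
A minimal sketch of that invariant (hypothetical checker, not Proton's implementation): scope ends must be properly nested, so a child scope always ends before its enclosing parent.

```python
def check_nesting(events):
    stack = []
    for kind, name in events:
        if kind == "start":
            stack.append(name)
        else:
            assert stack and stack[-1] == name, f"{name} ended out of order"
            stack.pop()
    assert not stack, f"unterminated scopes: {stack}"

check_nesting([("start", "scopeA"), ("start", "scopeB"),
               ("end", "scopeB"), ("end", "scopeA")])   # OK
```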
…riton-lang#8704)

It's not strictly NFC, but it is effectively one. Before, we relied on the
CTALayout information to populate the CTALayout correctly for NVMMA.
Now, we serialise and deserialise a `rank` attribute to this effect.

This will still be necessary when we move to representing CTALayouts
with LinearLayouts, as it will give the number of output dimensions of
the associated LinearLayout (even if empty).

Weirdly enough, I found that on `ws-code-partition.mlir` there are some
`nvmma` layouts with `rank = 1`. I think those are wrong, but I didn't
want to keep touching things.
Dropped unused header files along the way.
This adds a new layout type that is like the auto layout, except that any load or store op automatically sets it to the layout the coalesce pass would have chosen.
@jungpark-mlir jungpark-mlir merged commit 66a1cdd into newpp Nov 13, 2025