Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1488,6 +1488,10 @@ struct CollectiveMmaArrayMixedInput<
template <class... TMs>
CUTLASS_DEVICE void
tensormaps_cp_fence_release(TensorMapStorage& shared_tensormaps, cute::tuple<TMs...> const& input_tensormaps) {
if (cute::elect_one_sync()) {
cute::tma_desc_commit_group();
cute::tma_desc_wait_group();
}
Comment on lines +1491 to +1494
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The cute::tma_desc_wait_group() function is a collective operation that all threads in the warp must execute to synchronize. Placing it inside the if (cute::elect_one_sync()) block means only the elected thread will perform the wait, while other threads in the warp will proceed without synchronization. This can lead to a race condition where other threads use stale TMA descriptors, or a deadlock. To fix this, the tma_desc_wait_group() call should be moved outside of the conditional block.

    if (cute::elect_one_sync()) {
      cute::tma_desc_commit_group();
    }
    cute::tma_desc_wait_group();

// Entire warp must do this (i.e. it's aligned)
tma_descriptor_cp_fence_release(get<0>(input_tensormaps), shared_tensormaps.smem_tensormap_A);
tma_descriptor_cp_fence_release(get<1>(input_tensormaps), shared_tensormaps.smem_tensormap_B);
Expand Down
Loading