
Conversation


Copilot AI commented Nov 12, 2025

Plan to fix redundant float-to-float cast in broadcast_in_dim

  • Explore repository and understand the issue
  • Locate the root cause in csrc/ops/alias.cpp broadcast function
  • Create tests to verify the fix
  • Implement the fix to remove redundant set operation
  • Build and test the changes
  • Run code review and security scan
  • Verify the fix resolves the issue
  • Address PR feedback

Issue Summary

When using broadcast_in_dim with broadcast_dims that include all input dimensions (meaning no actual broadcasting is needed), nvFuser was introducing a redundant Set operation (float-to-float cast) in the generated kernel.

Root Cause

In csrc/ops/alias.cpp at lines 1025-1030, the broadcast function was calling set(inp) when n_broadcasts == 0, creating an unnecessary LoadStoreOp in the kernel even though no transformation was needed.
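
For reference, the pre-fix code path looked roughly like this (reconstructed from the diff excerpt quoted later in this thread; the surrounding code and the exact check macro are assumptions):

if (n_broadcasts == 0) {
  auto identity = set(inp); // materializes a LoadStoreOp of type Set, i.e. a float-to-float copy
  NVF_CHECK(
      identity->getValType().value() == ValType::TensorView,
      "Expected identity op, but didn't get a TensorView back.");
  return identity->as<TensorView>();
}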

Solution Implemented

Changed in csrc/ops/alias.cpp:

  • When n_broadcasts == 0, now returns inp directly instead of calling set(inp)
  • This eliminates the redundant Set operation while maintaining correct semantics
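
The replacement, as quoted in the diff excerpt and the reviewer guide below, is a single early return:

if (n_broadcasts == 0) {
  return inp;
}

Because the input TensorView is reused as-is, downstream operations consume it directly and no Set LoadStoreOp appears in the kernel IR.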

Tests Added:

  1. tests/cpp/test_alias.cpp: BroadcastInDimNoRedundantSet - Verifies that no LoadStoreOp with type Set is created when no broadcasting is needed. Updated to add an abs() operation so the fusion has expressions to test.
  2. tests/python/test_python_frontend.py: test_broadcast_in_dim_no_redundant_set - Compares broadcast_in_dim with expand using direct string comparison, since they should be identical when the broadcast is a no-op.

Recent Changes (PR Feedback)

  • Simplified Python test to use direct IR string comparison (assert str(fd_bid) == str(fd_exp))
  • Enhanced C++ test to add an abs() operation after broadcast to ensure fusion has expressions

Impact

  • Eliminates unnecessary Set operations when no broadcasting is performed
  • Improves clarity of generated kernels
  • Minimal code change (replaced 6 lines with 1 line)
  • No risk to existing functionality - only affects the n_broadcasts==0 case
  • Security scan passed with no vulnerabilities found
Original prompt

This section details the original issue you should resolve

<issue_title>Does redundant float-to-float cast via Set impact kernel performance when using broadcast_in_dim?</issue_title>
<issue_description>Could a float->float cast of tensors, which is translated to a Set, potentially impact kernel performance? (cast(tv0, dtype=DataType.Float) when tv0 is already Float)

Let's take a look at two fusions involving a broadcast operation implemented with ops.broadcast_in_dim and ops.expand:

import nvfuser_direct
print(nvfuser_direct.__version__)

from nvfuser_direct import FusionDefinition, DataType

def nvfuser_fusion(fd : FusionDefinition) -> None :
    tv0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False)
    c4 = fd.define_scalar(None, dtype=DataType.Int)
    c5 = fd.define_scalar(None, dtype=DataType.Int)
    tv1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    tv3 = fd.ops.broadcast_in_dim(tv0, (c4, c5), (0, 1))
    tv4 = fd.ops.add(tv3, tv1)
    fd.add_output(tv4)

with FusionDefinition() as fd1:
    nvfuser_fusion(fd1)

print(fd1)
print(fd1.fusion.print_math())

and

from nvfuser_direct import FusionDefinition, DataType
import nvfuser_direct
print(nvfuser_direct.__version__)

def nvfuser_fusion(fd : FusionDefinition) -> None :
    tv0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False)
    c4 = fd.define_scalar(None, dtype=DataType.Int)
    c5 = fd.define_scalar(None, dtype=DataType.Int)
    tv1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    tv3 = fd.ops.expand(tv0, shape=[c4, c5])
    tv4 = fd.ops.add(tv3, tv1)
    fd.add_output(tv4)

with FusionDefinition() as fd2:
    nvfuser_fusion(fd2)

print(fd2)
print(fd2.fusion.print_math())

When the fusion definitions are printed, they differ only by one additional cast operation, tv2 = fd.ops.cast(tv0, dtype=DataType.Float), in the broadcast_in_dim version. The first fusion definition prints as

def nvfuser_fusion(fd : FusionDefinition) -> None :
    tv0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False)
    c4 = fd.define_scalar(None, dtype=DataType.Int)
    c5 = fd.define_scalar(None, dtype=DataType.Int)
    tv1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    tv2 = fd.ops.cast(tv0, dtype=DataType.Float)
    c12 = fd.ops.cast(c4, dtype=DataType.Int)
    c1 = fd.ops.size(tv2, dim=1)
    tv3 = fd.ops.expand(tv2, shape=[c12, c14])
    tv4 = fd.ops.add(tv3, tv1)
    c14 = fd.ops.cast(c5, dtype=DataType.Int)
    fd.add_output(tv4)

while the second one is

def nvfuser_fusion(fd : FusionDefinition) -> None :
    tv0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False)
    c4 = fd.define_scalar(None, dtype=DataType.Int)
    c5 = fd.define_scalar(None, dtype=DataType.Int)
    tv1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    c8 = fd.ops.cast(c4, dtype=DataType.Int)
    c1 = fd.ops.size(tv0, dim=1)
    tv2 = fd.ops.expand(tv0, shape=[c8, c10])
    tv3 = fd.ops.add(tv2, tv1)
    c10 = fd.ops.cast(c5, dtype=DataType.Int)
    fd.add_output(tv3)

Here are their kernel math representations. The first fusion:

Inputs:
  T0_g_float[bS0{1}, iS1{i1}]
  i4
  i5
  T1_g_float[iS2{i6}, iS3{i7}]
Outputs:
  T4_g_float[iS8{i6}, iS9{( __to_index(i5) )}]

%kernel_math {
T2_l_float[bS4{1}, iS5{i1}]
   = Set( T0_g_float[bS0{1}, iS1{i1}], cache_op=Streaming )
i12 = __to_index(i4);
T3_l_float[bS6{1 ex ( __to_index(i4) )}, iS7{( __to_index(i5) )}] = expand( T2_l_float[bS4{1}, iS5{i1}] )
T4_g_float[iS8{i6}, iS9{( __to_index(i5) )}]
   = T3_l_float[bS6{1 ex ( __to_index(i4) )}, iS7{( __to_index(i5) )}]
   + T1_g_float[iS2{i6}, iS3{i7}];
} // %kernel_math 

and the second fusion:

Inputs:
  T0_g_float[bS0{1}, iS1{i1}]
  i4
  i5
  T1_g_float[iS2{i6}, iS3{i7}]
Outputs:
  T3_g_float[iS6{i6}, iS7{( __to_index(i5) )}]

%kernel_math {
i8 = __to_index(i4);
T2_l_float[bS4{1 ex ( __to_index(i4) )}, iS5{( __to_index(i5) )}] = expand( T0_g_float[bS0{1}, iS1{i1}] )
T3_g_float[iS6{i6}, iS7{( __to_index(i5) )}]
   = T2_l_float[bS4{1 ex ( __to_index(i4) )}, iS5{( __to_index(i5) )}]
   + T1_g_float[iS2{i6}, iS3{i7}];
} // %kernel_math 

The difference between the two is in the T2_l_float[bS4{1}, iS5{i1}] = Set( T0_g_float[bS0{1}, iS1{i1}], cache_op=Streaming ) operation.

Ideally, broadcast_in_dim wouldn't introduce unnecessary Set operations into the fusion, even if the performance impact is minimal or nonexistent, to avoid confusion.

I'm using nvFuser version 0.2.34+gitb90eb75.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title from "[WIP] Fix redundant float-to-float cast in kernel performance" to "Remove redundant Set operation from broadcast when no broadcasting occurs" on Nov 12, 2025
Copilot AI requested a review from IvanYashchuk November 12, 2025 18:26
Copilot finished work on behalf of IvanYashchuk November 12, 2025 18:26
@IvanYashchuk IvanYashchuk requested a review from Copilot November 12, 2025 19:50
@github-actions

Description

  • Eliminate redundant Set operation in broadcast when no broadcasting occurs

  • Return input directly in broadcast if n_broadcasts == 0

  • Add C++ and Python tests to verify no redundant LoadStoreOp

  • Ensure broadcast_in_dim and expand generate equivalent IR

Changes walkthrough

Relevant files

Bug fix: csrc/ops/alias.cpp (+1/-5)
Skip Set op when broadcast is a no-op

  • Return input directly when no broadcasting is needed (n_broadcasts == 0)
  • Remove redundant set(inp) call that caused unnecessary float-to-float cast
  • Avoid inserting a Set LoadStoreOp in the kernel IR

Tests: tests/cpp/test_alias.cpp (+36/-0)
Test no Set op in no-op broadcast

  • Add new test BroadcastInDimNoRedundantSet to validate the fix
  • Verify no LoadStoreOp of type Set exists in the fusion IR
  • Confirm correct behavior with testValidate using random input

Tests: tests/python/test_python_frontend.py (+55/-0)
Python test for redundant cast in broadcast

  • Add test test_broadcast_in_dim_no_redundant_set
  • Compare IR from broadcast_in_dim and expand for equivalence
  • Count fd.ops.cast(t occurrences to detect redundant casts
  • Validate both fusions produce the same output

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Direct Return Safety

    Returning the input tensor directly when no broadcasting occurs should be evaluated for any potential side effects on tensor view semantics or fusion optimization passes that might expect a new TensorView instance.

    if (n_broadcasts == 0) {
      return inp;
    }
    Test Coverage

    The new test verifies absence of Set operations but could also explicitly validate that the output tensor is the same as the input tensor in terms of properties when no broadcast occurs.

    TEST_F(AliasTest, BroadcastInDimNoRedundantSet) {
      // Test that broadcast with no actual broadcasting does not introduce
      // a redundant Set operation
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      TensorView* in = makeContigConcreteTensor({2, 3});
      fusion->addInput(in);
    
      // Call broadcast with all dims marked as non-broadcast
      // This should not introduce a Set operation
      std::vector<bool> is_broadcast_dim = {false, false};
      TensorView* out = broadcast(in, is_broadcast_dim);
    
      fusion->addOutput(out);
    
      // Verify that no LoadStoreOp with type Set is in the fusion
      auto exprs = fusion->exprs();
      for (auto expr : exprs) {
        if (auto load_store = dynamic_cast<LoadStoreOp*>(expr)) {
          EXPECT_NE(load_store->opType(), LoadStoreOpType::Set)
              << "Unexpected Set operation found in fusion with no-op broadcast";
        }
      }
    
      // Verify the fusion still works correctly
      FusionExecutorCache executor_cache(std::move(fusion));
      at::Tensor in_tensor =
          at::randn({2, 3}, at::dtype(at::kFloat).device(at::kCUDA));
      auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});
    
      testValidate(
          executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
    }
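    One way to make that explicit, sketched under the assumption (per the alias.cpp change above) that a no-op broadcast now returns the input TensorView itself:

    // Hypothetical extra assertion for the test above: with the fix, the no-op
    // broadcast hands back the same TensorView object it was given.
    EXPECT_EQ(out, in) << "No-op broadcast should return the input TensorView unchanged";
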
    IR Comparison Robustness

    The test relies on string comparison of IR to count cast operations, which may be fragile to formatting changes; a more robust method could be considered.

    def test_broadcast_in_dim_no_redundant_set(self):
        """
        Test that broadcast_in_dim doesn't introduce redundant Set operations
        when all input dimensions are in broadcast_dims (i.e., no actual broadcast).
    
        This verifies the fix for the issue where broadcast_in_dim would create
        a redundant float-to-float cast operation via Set when the input already
        had the correct shape.
        """
        inputs = [
            torch.ones(1, 4, device="cuda"),
            torch.randn(2, 4, device="cuda"),
        ]
    
        def fusion_with_broadcast_in_dim(fd: FusionDefinition):
            t0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True])
            t1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True])
            # broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added
            t2 = fd.ops.broadcast_in_dim(t0, t1.shape(), [0, 1])
            t3 = fd.ops.add(t2, t1)
            fd.add_output(t3)
    
        def fusion_with_expand(fd: FusionDefinition):
            t0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True])
            t1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True])
            # Direct expand without broadcast_in_dim
            t2 = fd.ops.expand(t0, t1.shape())
            t3 = fd.ops.add(t2, t1)
            fd.add_output(t3)
    
        # Execute both fusions and verify they produce the same result
        nvf_out_bid, fd_bid = self.exec_nvfuser(fusion_with_broadcast_in_dim, inputs)
        nvf_out_exp, fd_exp = self.exec_nvfuser(fusion_with_expand, inputs)
    
        # Verify correctness
        eager_out = inputs[0] + inputs[1]
        self.assertEqual(eager_out, nvf_out_bid[0])
        self.assertEqual(eager_out, nvf_out_exp[0])
    
        # Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
        # by comparing the IR string representations
        bid_str = str(fd_bid)
        exp_str = str(fd_exp)
    
        # Count tensor cast operations (not scalar casts)
        bid_tensor_casts = bid_str.count("fd.ops.cast(t")
        exp_tensor_casts = exp_str.count("fd.ops.cast(t")
    
        # They should have the same number of tensor casts
        self.assertEqual(
            bid_tensor_casts,
            exp_tensor_casts,
            f"broadcast_in_dim has {bid_tensor_casts} tensor casts but expand has {exp_tensor_casts}"
        )

    Copilot finished reviewing on behalf of IvanYashchuk November 12, 2025 19:54

    Copilot AI left a comment


    Pull Request Overview

    This PR removes a redundant Set operation that was being inserted by the broadcast function when no actual broadcasting occurs (i.e., when all dimensions in is_broadcast_dim are false).

    Key Changes:

    • Modified the broadcast function to return the input tensor directly when n_broadcasts == 0 instead of wrapping it in a set() operation
    • Added C++ test to verify no LoadStoreOp with type Set is created for no-op broadcasts
    • Added Python test comparing broadcast_in_dim and expand IR to ensure they're equivalent when no broadcasting occurs

    Reviewed Changes

    Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

    File Description
    csrc/ops/alias.cpp Returns input directly when no broadcasts needed, eliminating redundant Set operation
    tests/cpp/test_alias.cpp Adds test verifying no Set operation is created for no-op broadcast and includes necessary header
    tests/python/test_python_frontend.py Adds test comparing broadcast_in_dim and expand IR to ensure no redundant cast operations


    def fusion_with_broadcast_in_dim(fd: FusionDefinition):
    t0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True])
    t1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True])
    # broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added

    Copilot AI Nov 12, 2025


    [nitpick] The comment states "broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added" but could be more precise. While technically correct that no new dimensions are added to the tensor (input is 2D, output is 2D), this comment could be clearer. Consider rewording to: "broadcast_in_dim with all input dims in broadcast_dims means no broadcasting operation occurs" or "broadcast_dims=[0, 1] for a 2D input maps all input dimensions, so no new broadcast dimensions are created".

    Suggested change
    # broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added
    # broadcast_in_dim with all input dims in broadcast_dims means no broadcasting operation occurs

    Collaborator


    The review comment is not correct; a broadcasting operation does occur, and it involves expanding a size-1 dimension.

    Collaborator

    @IvanYashchuk IvanYashchuk left a comment


    What's the process for checking performance impact of pull requests?

    Comment on lines 709 to 716
    # Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
    # by comparing the IR string representations
    bid_str = str(fd_bid)
    exp_str = str(fd_exp)

    # Count tensor cast operations (not scalar casts)
    bid_tensor_casts = bid_str.count("fd.ops.cast(t")
    exp_tensor_casts = exp_str.count("fd.ops.cast(t")
    Collaborator


    Is there a better way to test what nodes a fusion definition consists of?

    Collaborator


    IIUC, the broadcast is a no-op, so the fusion definitions should be the same.

    Suggested change
    - # Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
    - # by comparing the IR string representations
    - bid_str = str(fd_bid)
    - exp_str = str(fd_exp)
    - # Count tensor cast operations (not scalar casts)
    - bid_tensor_casts = bid_str.count("fd.ops.cast(t")
    - exp_tensor_casts = exp_str.count("fd.ops.cast(t")
    + # Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
    + # by comparing the IR string representations
    + assert str(fd_bid) == str(fd_exp)

    Author


    Updated to use direct string comparison as suggested. Since the broadcast is a no-op, the fusion definitions should indeed be identical. Commit: 3b44991

    - identity->getValType().value() == ValType::TensorView,
    - "Expected identity op, but didn't get a TensorView back.");
    - return identity->as<TensorView>();
    + return inp;
    Collaborator


    The change seems reasonable to me.

    @IvanYashchuk
    Collaborator

    @rdspring1, could you please review the original issue and the Copilot-proposed fix?

    @IvanYashchuk
    Collaborator

    IvanYashchuk commented Nov 12, 2025

    !test

    (maybe since I don't have membership in this repo, the command won't work)

    // Call broadcast with all dims marked as non-broadcast
    // This should not introduce a Set operation
    std::vector<bool> is_broadcast_dim = {false, false};
    TensorView* out = broadcast(in, is_broadcast_dim);
    Collaborator


    The fusion is empty, so the verification seems silly.

    Suggested change
    - TensorView* out = broadcast(in, is_broadcast_dim);
    + TensorView* maybe_bcast = broadcast(in, is_broadcast_dim);
    + TensorView* out = abs(maybe_bcast);

    Or just check:
    EXPECT_EQ(fusion->exprs().size(), 0);

    Collaborator


    @copilot, please implement the change according to feedback from Ryan.

    Author


    Added abs() operation after the broadcast and updated the comment as suggested. This ensures the fusion has expressions to verify. Commit: 3b44991

    Collaborator

    @rdspring1 rdspring1 left a comment


    What's the process for checking performance impact of pull requests?

    Normally, we run !test --pybench to check performance differences. For this change, I doubt you'd see any change. I'd use compiler explorer to check the PTX for any differences.

    Copilot AI requested a review from IvanYashchuk November 13, 2025 12:05
    Copilot finished work on behalf of IvanYashchuk November 13, 2025 12:05
    Collaborator

    @jacobhinkle jacobhinkle left a comment


    I don't have any strong objection to this PR. I don't think it's necessary for performance since an additional assignment should get elided by the compiler. But I'm inclined to accept the PR as it simplifies the IR a bit.

    Note that I think the root of the confusion in the case of #5506 is that thunder has a broadcast_in_dim which we lower as a broadcast_in_dim op in the nvfuser executor, but which is really just a broadcast+expand. If it were broadcast+expand everywhere, the expected behavior would be clearer.

    I think we only need one test or the other and I'd probably lean toward keeping just the python test.

    Comment on lines +700 to +707
    # Execute both fusions and verify they produce the same result
    nvf_out_bid, fd_bid = self.exec_nvfuser(fusion_with_broadcast_in_dim, inputs)
    nvf_out_exp, fd_exp = self.exec_nvfuser(fusion_with_expand, inputs)

    # Verify correctness
    eager_out = inputs[0] + inputs[1]
    self.assertEqual(eager_out, nvf_out_bid[0])
    self.assertEqual(eager_out, nvf_out_exp[0])
    Collaborator


    These checks (and inputs) are probably not necessary since the point of the test is just that the IR should match exactly whether we use broadcast_in_dim or expand whenever there is no new broadcast.

