Remove redundant Set operation from broadcast when no broadcasting occurs #5507
base: main
Conversation
Co-authored-by: IvanYashchuk <[email protected]>
Description
| | Relevant files |
|---|---|
| Bug fix | csrc/ops/alias.cpp |
| Tests | tests/cpp/test_alias.cpp, tests/python/test_python_frontend.py |
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review: Direct Return Safety
Pull Request Overview
This PR removes a redundant Set operation that was being inserted by the broadcast function when no actual broadcasting occurs (i.e., when all dimensions in is_broadcast_dim are false).
Key Changes:
- Modified the `broadcast` function to return the input tensor directly when `n_broadcasts == 0` instead of wrapping it in a `set()` operation
- Added a C++ test to verify no LoadStoreOp with type Set is created for no-op broadcasts
- Added a Python test comparing `broadcast_in_dim` and `expand` IR to ensure they're equivalent when no broadcasting occurs
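The core change can be sketched in plain Python (a simplified, hypothetical model of the C++ logic, not nvFuser's actual implementation; `set_op` here stands in for the identity LoadStoreOp):

```python
def set_op(tv):
    # Stand-in for nvFuser's identity LoadStoreOp (Set): an extra
    # node that copies its input without changing it.
    return ("Set", tv)

def broadcast(inp, is_broadcast_dim):
    # Simplified model: count flagged dims; if none are flagged,
    # the op is a no-op, so hand back the input tensor itself
    # instead of wrapping it in an identity Set (the old behavior).
    n_broadcasts = sum(is_broadcast_dim)
    if n_broadcasts == 0:
        return inp  # previously: return set_op(inp)
    return ("broadcast", inp, tuple(is_broadcast_dim))

t0 = ("tensor", "t0")
# No dims flagged: the very same object comes back, no extra node.
assert broadcast(t0, [False, False]) is t0
# A flagged dim still produces a broadcast node.
assert broadcast(t0, [True, False]) == ("broadcast", t0, (True, False))
```

The C++ test below checks exactly this property: with all entries of `is_broadcast_dim` false, no Set expression should appear in the fusion.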
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| csrc/ops/alias.cpp | Returns input directly when no broadcasts needed, eliminating redundant Set operation |
| tests/cpp/test_alias.cpp | Adds test verifying no Set operation is created for no-op broadcast and includes necessary header |
| tests/python/test_python_frontend.py | Adds test comparing broadcast_in_dim and expand IR to ensure no redundant cast operations |
```python
def fusion_with_broadcast_in_dim(fd: FusionDefinition):
    t0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True])
    t1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True])
    # broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added
```
Copilot AI, Nov 12, 2025
[nitpick] The comment states "broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added" but could be more precise. While technically correct that no new dimensions are added to the tensor (input is 2D, output is 2D), this comment could be clearer. Consider rewording to: "broadcast_in_dim with all input dims in broadcast_dims means no broadcasting operation occurs" or "broadcast_dims=[0, 1] for a 2D input maps all input dimensions, so no new broadcast dimensions are created".
```diff
- # broadcast_in_dim with broadcast_dims=[0, 1] means no new dims are added
+ # broadcast_in_dim with all input dims in broadcast_dims means no broadcasting operation occurs
```
The review comment is not correct: a broadcasting operation does occur, and it involves expanding the 1-sized dimension.
IvanYashchuk left a comment
What's the process for checking performance impact of pull requests?
tests/python/test_python_frontend.py
Outdated
```python
# Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
# by comparing the IR string representations
bid_str = str(fd_bid)
exp_str = str(fd_exp)

# Count tensor cast operations (not scalar casts)
bid_tensor_casts = bid_str.count("fd.ops.cast(t")
exp_tensor_casts = exp_str.count("fd.ops.cast(t")
```
Is there a better way to test what nodes a fusion definition consists of?
IIUC, the broadcast is a no-op, so the fusion definitions should be the same.
```diff
  # Check that the broadcast_in_dim fusion doesn't have a redundant Set operation
  # by comparing the IR string representations
- bid_str = str(fd_bid)
- exp_str = str(fd_exp)
- # Count tensor cast operations (not scalar casts)
- bid_tensor_casts = bid_str.count("fd.ops.cast(t")
- exp_tensor_casts = exp_str.count("fd.ops.cast(t")
+ assert str(fd_bid) == str(fd_exp)
```
Updated to use direct string comparison as suggested. Since the broadcast is a no-op, the fusion definitions should indeed be identical. Commit: 3b44991
```diff
-       identity->getValType().value() == ValType::TensorView,
-       "Expected identity op, but didn't get a TensorView back.");
-   return identity->as<TensorView>();
+   return inp;
```
The change seems reasonable to me.
@rdspring1, could you please review the original issue and the Copilot-proposed fix?

`!test` (maybe since I don't have any membership in this repo the command won't work)
tests/cpp/test_alias.cpp
Outdated
```cpp
// Call broadcast with all dims marked as non-broadcast
// This should not introduce a Set operation
std::vector<bool> is_broadcast_dim = {false, false};
TensorView* out = broadcast(in, is_broadcast_dim);
```
The fusion is empty, so the verification seems silly.
```diff
- TensorView* out = broadcast(in, is_broadcast_dim);
+ TensorView* maybe_bcast = broadcast(in, is_broadcast_dim);
+ TensorView* out = abs(maybe_bcast);
```

Or just check:

```cpp
EXPECT_EQ(fusion->exprs().size(), 0);
```
@copilot, please implement the change according to feedback from Ryan.
Added abs() operation after the broadcast and updated the comment as suggested. This ensures the fusion has expressions to verify. Commit: 3b44991
rdspring1 left a comment
> What's the process for checking performance impact of pull requests?

Normally, we run `!test --pybench` to check performance differences. For this change, I doubt you'd see any change. I'd use Compiler Explorer to check the PTX for any differences.
jacobhinkle left a comment
I don't have any strong objection to this PR. I don't think it's necessary for performance since an additional assignment should get elided by the compiler. But I'm inclined to accept the PR as it simplifies the IR a bit.
Note that I think the root of the confusion in the case of #5506 is that Thunder has a `broadcast_in_dim` which we lower as a `broadcast_in_dim` op in the nvFuser executor, but which is really just a broadcast+expand. If it were broadcast+expand everywhere, the expected behavior would be clearer.
I think we only need one test or the other and I'd probably lean toward keeping just the python test.
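The broadcast+expand decomposition mentioned above can be illustrated at the shape level (a pure-Python sketch with a hypothetical helper name, not nvFuser or Thunder code):

```python
def broadcast_in_dim_shape(in_shape, out_shape, broadcast_dims):
    # Step 1 (broadcast): place each input dim at its mapped output
    # position; every unmapped output position becomes a new size-1 dim.
    bcast = [1] * len(out_shape)
    for in_pos, out_pos in enumerate(broadcast_dims):
        bcast[out_pos] = in_shape[in_pos]
    # Step 2 (expand): grow size-1 dims up to the target extents.
    return [o if b == 1 else b for b, o in zip(bcast, out_shape)]

# 2D input with broadcast_dims=[0, 1]: no new dims are inserted, but
# the size-1 leading dim is still expanded, i.e. a broadcast in the
# expand sense, matching the correction earlier in the thread.
assert broadcast_in_dim_shape([1, 4], [3, 4], [0, 1]) == [3, 4]
# 1D input mapped to dim 1 of a 2D output: a new dim is inserted.
assert broadcast_in_dim_shape([4], [3, 4], [1]) == [3, 4]
```

When every input dim appears in `broadcast_dims`, step 1 inserts nothing, which is exactly the case where the old code still emitted a Set.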
```python
# Execute both fusions and verify they produce the same result
nvf_out_bid, fd_bid = self.exec_nvfuser(fusion_with_broadcast_in_dim, inputs)
nvf_out_exp, fd_exp = self.exec_nvfuser(fusion_with_expand, inputs)

# Verify correctness
eager_out = inputs[0] + inputs[1]
self.assertEqual(eager_out, nvf_out_bid[0])
self.assertEqual(eager_out, nvf_out_exp[0])
```
These checks (and inputs) are probably not necessary since the point of the test is just that the IR should match exactly whether we use broadcast_in_dim or expand whenever there is no new broadcast.
Plan to fix redundant float-to-float cast in broadcast_in_dim

File: csrc/ops/alias.cpp, `broadcast` function

Issue Summary

When using `broadcast_in_dim` with broadcast_dims that include all input dimensions (meaning no actual broadcasting is needed), nvFuser was introducing a redundant Set operation (float-to-float cast) in the generated kernel.

Root Cause

In `csrc/ops/alias.cpp` at lines 1025-1030, the `broadcast` function was calling `set(inp)` when `n_broadcasts == 0`, creating an unnecessary LoadStoreOp in the kernel even though no transformation was needed.

Solution Implemented

Changed in csrc/ops/alias.cpp:
- When `n_broadcasts == 0`, the function now returns `inp` directly instead of calling `set(inp)`

Tests Added:
- `BroadcastInDimNoRedundantSet`: verifies no LoadStoreOp with type Set is created when no broadcasting is needed. Updated to add an abs() operation to ensure the fusion has expressions to test.
- `test_broadcast_in_dim_no_redundant_set`: compares broadcast_in_dim with expand using direct string comparison, since they should be identical when the broadcast is a no-op.

Recent Changes (PR Feedback)
- Replaced the cast-counting check with a direct string comparison (`assert str(fd_bid) == str(fd_exp)`)

Impact
Original prompt
This section details the original issue you should resolve
<issue_title>Does redundant float-to-float cast via Set impact kernel performance when using broadcast_in_dim?</issue_title>
<issue_description>Could a float->float cast of tensors, which is translated to `set`, potentially impact kernel performance? (`cast(tv0, dtype=DataType.Float)` when tv0 is already Float)

Let's take a look at two fusions involving a broadcast operation implemented with `ops.broadcast_in_dim` and `ops.expand`.

When the fusion definitions are printed, they differ only in one additional cast operation, `tv2 = fd.ops.cast(tv0, dtype=DataType.Float)`, for the broadcast_in_dim version.

In their kernel math representations, the difference between the two is the `T2_l_float[bS4{1}, iS5{i1}] = Set( T0_g_float[bS0{1}, iS1{i1}], cache_op=Streaming )` operation.

Ideally, broadcast_in_dim wouldn't introduce unnecessary set operations to the fusion, even if the performance impact is minimal or nonexistent, to avoid confusion.

I'm using nvFuser version `0.2.34+gitb90eb75`.</issue_description>

Comments on the Issue (you are @copilot in this section)
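The no-op cast the issue describes can be modeled the same way (an illustrative toy sketch, not nvFuser's API): a cast to the tensor's existing dtype can simply return the input instead of lowering to an identity Set.

```python
def cast(tv, dtype):
    # tv is a dict {"name": ..., "dtype": ...} in this toy model.
    # When the target dtype already matches, eliding the op avoids
    # the identity Set the issue observed in the kernel math.
    if tv["dtype"] == dtype:
        return tv  # no-op cast: return the input unchanged
    return {"name": f"cast({tv['name']})", "dtype": dtype}

t0 = {"name": "t0", "dtype": "Float"}
assert cast(t0, "Float") is t0                  # float->float: elided
assert cast(t0, "Double")["dtype"] == "Double"  # real cast: new node
```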