Fix integer overflow in FIL #7727
Conversation
Do we have before-and-after benchmarks for regular …
@coderabbitai full review
✅ Actions performed: Full review triggered.
📝 Walkthrough: Adds runtime overflow/size guards to FIL CPU and GPU inference paths to prevent per-row output index overflow before parallel/kernel launches, tightens Treelite importer assertion messages by casting indices to int for clearer diagnostics, and updates SPDX years to 2026.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 4 passed, ❌ 1 failed (warning)
@hcho3 Did you run a basic benchmark to test the performance regression? If the performance hit is significant, we should aim to mitigate that. Depending on the severity, we can do that either in a follow-up or try to mitigate the problem in a different way (for example, issue a warning on large inputs).
csadorf left a comment:
We should run a basic benchmark to understand the severity of the potential performance regression.
@csadorf I'm running the benchmark now.
@csadorf I ran a basic benchmark on my end and here is the result. Performance impact of using …
@hcho3 That seems significant, but I was wondering if you benchmarked different models and/or batch sizes, just to have a complete picture of the perf impact?
Yes, I used your benchmark script. The reported performance is an average, and the slowdown is fairly consistent across the board.
I'd argue that the slowdown is outside the acceptable range considering the severity of the bug. Would it be possible to use int64 conditionally, based on the dataset input size?
Yeah, a specialized implementation for large input sizes may be the way to go. Adding a specialized implementation will increase the binary size, though, so we should run a benchmark to measure the performance impact.
Update: I was able to derive an upper bound on …, working backwards from the following formula:
cuml/cpp/include/cuml/fil/detail/infer_kernel/cpu.hpp, lines 143 to 148 in f9928c4
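To make the overflow concrete, here is a hedged sketch of the index arithmetic being discussed. The names (`row`, `num_outputs`, `num_grove`, `out`, `grove`) are illustrative, based on the formula referenced above rather than copied from the actual kernel; the point is only that the 32-bit product wraps once `row * num_outputs * num_grove` exceeds the 32-bit maximum:

```cpp
#include <cstdint>

// Per-row output index as a 32-bit index_type would compute it.
// The multiplication wraps modulo 2^32 for large row counts.
std::uint32_t index32(std::uint32_t row, std::uint32_t num_outputs,
                      std::uint32_t num_grove, std::uint32_t out,
                      std::uint32_t grove)
{
  return row * num_outputs * num_grove + out * num_grove + grove;
}

// The same index computed in 64-bit arithmetic, which does not wrap
// for any realistic input size.
std::uint64_t index64(std::uint64_t row, std::uint64_t num_outputs,
                      std::uint64_t num_grove, std::uint64_t out,
                      std::uint64_t grove)
{
  return row * num_outputs * num_grove + out * num_grove + grove;
}
```

For small inputs the two agree, but e.g. `row = 1073741824` (2^30) with `num_outputs = 8` makes the 32-bit index wrap all the way back to 0, which is exactly the kind of silent corruption the guard is meant to prevent.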
Option 1: Throw an error for large inputs. Why? The upper bound on …
Option 2: Create a specialized implementation for large inputs.
My recommendation is to implement option 1 immediately as a first defensive measure. That's significantly better than the current overflow behavior. Let's capture option 2 in an issue and evaluate its merits and implementation separately.
This may be better discussed on the follow-up issue for Option 2, but I'm wondering if batching would be preferable to a specialized implementation. After implementing Option 1, you'll already have the machinery for detecting what the appropriate batch size should be, so I believe it would be a relatively small change from there. On the other hand, as @hcho3 pointed out, the compile times and binary size are already pretty inflated by all the FIL specializations as is. Doubling that just to get 64-bit index specializations seems like a pretty high cost if batching is an available solution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cpp/include/cuml/fil/detail/infer/gpu.cuh`:
- Around line 217-222: The division can still divide by zero when num_grove
(derived from forest.tree_count()/task_count/etc.) is 0; before computing
max_num_row or doing ASSERT, add a guard that computes the divisor (e.g.,
uint64_t denom = output_count * num_grove) and if denom == 0 handle the
degenerate case (either early return, set a safe max_num_row, or ASSERT with a
clear message) to prevent UB; update the logic around
num_grove/output_count/ceildiv/task_count/threads_per_block so the check covers
the scenario where forest.tree_count() == 0 and use the checked denom in the
max_num_row calculation and subsequent ASSERT.
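The guard the comment asks for can be sketched as follows. This is a minimal illustration, not the actual `gpu.cuh` code: the names `output_count` and `num_grove` mirror the discussion, and returning the 64-bit maximum as a sentinel for a degenerate (zero-tree) model is one of the safe options the comment suggests:

```cpp
#include <cstdint>
#include <limits>

// Compute the largest row count whose per-row output index still fits in a
// 32-bit index_type. If the divisor is zero (degenerate model, e.g.
// forest.tree_count() == 0), skip the division entirely and return a
// sentinel instead of invoking undefined behavior.
std::uint64_t guarded_max_num_row(std::uint64_t output_count,
                                  std::uint64_t num_grove)
{
  std::uint64_t denom = output_count * num_grove;
  if (denom == 0) {
    // Degenerate case: no per-row outputs are written, so no row count can
    // overflow the index. Any row count is "safe".
    return std::numeric_limits<std::uint64_t>::max();
  }
  return std::numeric_limits<std::uint32_t>::max() / denom;
}
```

The caller would then `ASSERT(row_count <= guarded_max_num_row(...))` before the kernel launch, so the check happens once, up front, rather than inside the hot path.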
In general, I think coderabbit led us a little far afield with some of its suggestions here. The limits that it's trying to avoid are either not realistic, guarded against elsewhere, or else precluded by some other limit that we would hit long before we got to these.

My high-level recommendation is that we should validate limits on e.g. the number of trees or degenerate models when we import the model. Sprinkling those checks all throughout the code is a significant departure from the original design goal of doing expensive checks up front and failing fast before we get to actual inference. The Platonic ideal here would probably be that we do our checks up front and then use custom types to ensure that the checks had already been performed in the places we care about them later in the code. E.g. if you need an unsigned integer that you know you can subtract 3 from, you can construct a …

More practically speaking, it's probably sufficient to perform the necessary validation at import time with comments/docs explaining the checks. Regardless, I would recommend doing so in a separate PR. I'll provide inline comments for a few other things in case you want to proceed with the recommendations from coderabbit as is.
wphicks left a comment:
See my other comment for more general thoughts on the latest round of changes, but this review covers the code assuming you want to go ahead with coderabbit's suggestions.
My only other question is whether we have any benchmarks covering this new change. I have less specific performance concerns here, but I always try to benchmark any FIL change that touches the actual inference code, since we used to occasionally miss regressions in legacy FIL.
Co-authored-by: William Hicks <whicks@nvidia.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/cuml/fil/detail/infer_kernel/cpu.hpp`:
- Around line 102-118: The overflow guard runs after allocating output_workspace
which can wrap the 32-bit product and cause std::bad_alloc before the ASSERT;
move the guard block (the computation of max_num_row and ASSERT(row_count <=
max_num_row)) to before the allocation of output_workspace and compute the
allocation size using 64-bit arithmetic (e.g., cast row_count, num_outputs,
num_grove to uint64_t) to check and then only construct output_workspace/compute
task_count once the size is validated, updating any uses of task_count to use
the safe 64-bit-checked value.
- Around line 112-113: Avoid the integer division-by-zero by checking the
denominator before computing max_num_row: compute a uint64_t denom = num_outputs
* static_cast<std::uint64_t>(num_grove) (where num_grove comes from
ceildiv(num_tree, grove_size) and ultimately depends on forest.tree_count()),
and if denom == 0 set max_num_row to 0 (or a safe sentinel) otherwise compute
max_num_row = static_cast<std::uint64_t>(std::numeric_limits<index_type>::max())
/ denom; update the assignment site of max_num_row in the infer_kernel (the line
using ceildiv/num_grove) to use this guarded logic.
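Both inline comments reduce to one pattern: validate the 64-bit-checked size before the workspace allocation, not after it. The sketch below is illustrative only; `row_count`/`num_outputs`/`num_grove` follow the comment's vocabulary, `index_type` is assumed to be 32-bit as in the discussion, and the real `cpu.hpp` code differs:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

using index_type = std::uint32_t;  // assumed 32-bit, per the discussion

// Validate-before-allocate: compute the workspace size in 64-bit arithmetic
// and reject oversized inputs *before* constructing the buffer, so a wrapped
// 32-bit product can never reach the allocator (or std::bad_alloc) first.
std::vector<float> make_output_workspace(std::uint64_t row_count,
                                         std::uint64_t num_outputs,
                                         std::uint64_t num_grove)
{
  std::uint64_t denom = num_outputs * num_grove;
  if (denom != 0) {  // guard the division against a degenerate model
    std::uint64_t max_num_row =
      std::numeric_limits<index_type>::max() / denom;
    if (row_count > max_num_row) {
      throw std::invalid_argument(
        "input is too large; per-row output index would overflow");
    }
  }
  // Safe: the 64-bit product is known to fit index_type (or is 0).
  return std::vector<float>(row_count * denom);
}
```

Only once this check has passed would the real code go on to compute `task_count` and launch the parallel loop, which matches the "fail fast, up front" design goal mentioned elsewhere in this thread.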
---
Duplicate comments:
In `@cpp/include/cuml/fil/detail/infer/gpu.cuh`:
- Around line 220-221: The division can still divide by zero when output_count *
num_grove == 0 (e.g., infer_kind::default_kind with forest.tree_count()==0);
modify the max_num_row computation to guard against a zero divisor by checking
if output_count==0 || num_grove==0 and in that case set max_num_row to
std::numeric_limits<std::uint64_t>::max() (or another safe large value) instead
of performing the division, otherwise perform the existing static_cast division;
update the code around the max_num_row calculation (the variables max_num_row,
output_count, num_grove) to implement this conditional.
I ran the benchmark again. The impact of this code change is a <1% difference in throughput.
csadorf left a comment:
Much better! However, we should use the correct exception type.
/merge |
Closes #45. Port of rapidsai/cuml#7727. Throw an exception when the input is large enough to create integer overflow.

Authors:
- Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: #63
Closes #7711