
Fix integer overflow in FIL#7727

Merged
rapids-bot[bot] merged 22 commits into rapidsai:main from hcho3:fix_fil_overflow
Mar 4, 2026

Conversation

@hcho3
Contributor

@hcho3 hcho3 commented Jan 28, 2026

Closes #7711

  • Check the input size to ensure that the index type does not overflow. If the input is too large, throw an error.

@hcho3 hcho3 requested a review from a team as a code owner January 28, 2026 09:02
@hcho3 hcho3 requested review from dantegd and lowener January 28, 2026 09:03
@hcho3 hcho3 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jan 29, 2026
@wphicks
Contributor

wphicks commented Jan 31, 2026

Do we have before-and-after benchmarks for regular predict calls with this change? I ask because early on I had tried using 64-bit indexes, and the performance hit was significant. If that's not the case anymore, great! On the other hand, if it's still a problem, we could probably do a more targeted change that would fix the bug without requiring 64-bit indexes in performance-critical code. The easiest thing would be to batch any inputs that exceed the cutoff for 32-bit indexes, but we might be able to do something more elegant in the predict_per_tree or apply internals.

@csadorf
Contributor

csadorf commented Feb 3, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Feb 3, 2026

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai Bot commented Feb 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds runtime overflow/size guards to FIL CPU and GPU inference paths to prevent per-row output index overflow before parallel/kernel launches, tightens Treelite importer assertion messages to cast indices to int for clearer diagnostics, and updates SPDX years to 2026.

Changes

Cohort / File(s) Summary
Inference overflow guards (GPU)
cpp/include/cuml/fil/detail/infer/gpu.cuh
Adds chunk_size/task_count logic, computes a 64-bit max_num_row based on index type, outputs and grove count, bounds row_count with an ASSERT before kernel launch; uses 64-bit intermediates; adds raft/core/error.hpp and <cinttypes> includes; SPDX year bumped.
Inference overflow guards (CPU)
cpp/include/cuml/fil/detail/infer_kernel/cpu.hpp
Introduces pre-parallel-region computation of a 64-bit max_num_row to prevent integer overflow when computing per-row outputs and asserts row_count ≤ max_num_row; adds raft/core/error.hpp and <cinttypes> includes; SPDX year bumped.
Treelite assertion message refinements
cpp/include/cuml/fil/treelite_importer.hpp
Tightens assertion/diagnostic messages by casting indices/class IDs with static_cast<int>(...) for clearer printed values; SPDX year bumped.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title 'Fix integer overflow in FIL' directly corresponds to the main objective of the PR, which addresses integer overflow issues in the FIL library.
  • Description check (✅ Passed): The description references issue #7711 and clearly explains the fix: checking input size to prevent index type overflow and throwing an error for oversized inputs.
  • Linked Issues check (✅ Passed): The PR addresses the core requirement from issue #7711 by implementing overflow protection logic with max_num_row bounds and assertions across GPU and CPU inference paths [#7711], matching the discussed Option 1 approach.
  • Out of Scope Changes check (✅ Passed): All changes are narrowly scoped to adding overflow protection: copyright year updates, include additions, and overflow guard logic in GPU/CPU inference paths, all directly addressing the integer overflow bug.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@csadorf
Contributor

csadorf commented Feb 3, 2026

> Do we have before-and-after benchmarks for regular predict calls with this change? I ask because early on I had tried using 64-bit indexes, and the performance hit was significant. If that's not the case anymore, great! On the other hand, if it's still a problem, we could probably do a more targeted change that would fix the bug without requiring 64-bit indexes in performance-critical code. The easiest thing would be to batch any inputs that exceed the cutoff for 32-bit indexes, but we might be able to do something more elegant in the predict_per_tree or apply internals.

@hcho3 Did you run a basic benchmark to test the performance regression? If the performance hit is significant we should aim to mitigate that. Depending on the severity we can do that either in a follow-up or try to mitigate the problem in a different way (for example issue a warning on large inputs).

Contributor

@csadorf csadorf left a comment


We should run a basic benchmark to understand the severity of the potential performance regression.

@hcho3
Contributor Author

hcho3 commented Feb 3, 2026

@csadorf I'm running the benchmark now

@hcho3
Contributor Author

hcho3 commented Feb 4, 2026

@csadorf I ran a basic benchmark on my end; here are the results.

Performance impact of using uint64_t:

  • predict: 2.8% increase in the run time (2.7% decrease in throughput)
  • predict_per_tree: 12.3% increase in the run time (11.0% decrease in the throughput)

@dantegd
Member

dantegd commented Feb 4, 2026

@hcho3 that seems significant, but I was wondering whether you benchmarked different models and/or batch sizes, just to have a complete picture of the perf impact?

@hcho3
Contributor Author

hcho3 commented Feb 4, 2026

Yes, I used your benchmark script. The reported performance is an average, and the slowdown is fairly consistent across the board

@csadorf
Contributor

csadorf commented Feb 4, 2026

I'd argue that the slow-down is outside the acceptable range considering the severity of the bug. Would it be possible to use int64 conditionally based on the dataset input size?

@hcho3
Contributor Author

hcho3 commented Feb 5, 2026

Yeah, a specialized implementation for large input sizes may be the way to go. Adding a specialized implementation will increase the binary size, though, so we should run a benchmark to measure the performance impact.

@hcho3
Contributor Author

hcho3 commented Feb 10, 2026

Update. I was able to derive an upper bound on row_count that's needed to ensure that the 32-bit index variable output_offset does not overflow.

Working backwards from the following formula:

auto output_offset =
  (row_index * num_outputs * num_grove +
   (tree_index % default_num_outputs) * num_grove *
     (infer_type == infer_kind::default_kind) +
   tree_index * num_grove * (infer_type == infer_kind::per_tree) +
   grove_index);
output_workspace[output_offset] += tree_output;

row_count should be no more than the following:

  • max_uint32_val // (num_outputs * num_grove) - 2, when using ordinary predict;
  • max_uint32_val // (num_outputs * num_trees * num_grove) - 3, when using predict_per_tree;
  • max_uint32_val // (num_trees * num_grove) - 2, when using predict_leaf.
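The bounds above can be sketched as a small helper. This is a hypothetical illustration, not the merged code: the function and enum names are invented for the sketch, and it throws on a zero-sized stride (degenerate model) rather than relying on the importer to have rejected it.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Which inference entry point the bound applies to (names assumed for the sketch).
enum class infer_kind { default_kind, per_tree, leaf };

// Largest row_count guaranteed not to overflow a 32-bit output_offset,
// following the three bounds derived above.
std::uint64_t max_num_row(std::uint64_t num_outputs,
                          std::uint64_t num_trees,
                          std::uint64_t num_grove,
                          infer_kind kind)
{
  constexpr auto max_index =
    static_cast<std::uint64_t>(std::numeric_limits<std::uint32_t>::max());
  std::uint64_t denom  = 0;  // per-row stride in output_workspace
  std::uint64_t margin = 0;  // safety margin for the non-row terms of output_offset
  switch (kind) {
    case infer_kind::default_kind: denom = num_outputs * num_grove; margin = 2; break;
    case infer_kind::per_tree: denom = num_outputs * num_trees * num_grove; margin = 3; break;
    case infer_kind::leaf: denom = num_trees * num_grove; margin = 2; break;
  }
  if (denom == 0) {
    // Degenerate model (e.g. zero trees); no meaningful bound exists.
    throw std::invalid_argument("degenerate model: zero-sized output stride");
  }
  return max_index / denom - margin;
}
```

With one output, one grove, and ordinary predict, the bound is simply UINT32_MAX minus the margin of 2; larger strides shrink it proportionally.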

@hcho3
Contributor Author

hcho3 commented Feb 10, 2026

Option 1. Throw an error for large inputs.

Why? The upper bound on row_count is much laxer for the ordinary predict method than for predict_leaf or predict_per_tree. So users would hardly ever see integer overflows when using predict, and the bug only surfaced when we added predict_leaf and predict_per_tree. Given that predict_leaf and predict_per_tree are relatively niche applications (compared to predict), the added cost of carrying an extra 64-bit implementation may be excessive.

Option 2. Create a specialized implementation for large inputs.

@csadorf
Contributor

csadorf commented Feb 10, 2026

> Option 1. Throw an error for large inputs.
>
> Why? The upper bound on row_count is much laxer for the ordinary predict method than for predict_leaf or predict_per_tree. So users would hardly ever see integer overflows when using predict, and the bug only surfaced when we added predict_leaf and predict_per_tree. Given that predict_leaf and predict_per_tree are relatively niche applications (compared to predict), the added cost of carrying an extra 64-bit implementation may be excessive.
>
> Option 2. Create a specialized implementation for large inputs.

My recommendation is to implement option 1 as a first defensive measure immediately. That's significantly better than the current overflow.

Let's capture option 2 in an issue and evaluate merits and implementation of that separately.

@wphicks
Contributor

wphicks commented Feb 10, 2026

This may be better discussed on the follow-up issue for Option 2, but I'm wondering if batching would be preferable to a specialized implementation. After implementing Option 1, you'll already have the machinery for detecting what the appropriate batch size should be, so I believe it would be a relatively small change from there. On the other hand, as @hcho3 pointed out, the compile times and binary size are already pretty inflated by all the FIL specializations as is. Doubling that just to get 64 bit index specializations seems like a pretty high cost if batching is an available solution.


@coderabbitai coderabbitai Bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cpp/include/cuml/fil/detail/infer/gpu.cuh`:
- Around line 217-222: The division can still divide by zero when num_grove
(derived from forest.tree_count()/task_count/etc.) is 0; before computing
max_num_row or doing ASSERT, add a guard that computes the divisor (e.g.,
uint64_t denom = output_count * num_grove) and if denom == 0 handle the
degenerate case (either early return, set a safe max_num_row, or ASSERT with a
clear message) to prevent UB; update the logic around
num_grove/output_count/ceildiv/task_count/threads_per_block so the check covers
the scenario where forest.tree_count() == 0 and use the checked denom in the
max_num_row calculation and subsequent ASSERT.

@wphicks
Contributor

wphicks commented Feb 20, 2026

In general, I think coderabbit led us a little far afield with some of its suggestions here. The limits that it's trying to avoid are either not realistic, guarded against elsewhere, or else precluded by some other limit that we would hit long before we got to these.

My high level recommendation is that we should validate for limits on e.g. the number of trees or degenerate models when we import the model. Sprinkling those checks all throughout the code is a significant departure from the original design goal of doing expensive checks up front and failing fast before we get to actual inference.

The Platonic ideal here would probably be that we do our checks up front and then use custom types to ensure that the checks had already been performed in the places we care about them later in the code. E.g. if you need an unsigned integer that you know you can subtract 3 from, you can construct a uint32_above_2 type early on and pass that into performance critical code. For something like a type that you'll use for keeping track of tree counts, you can bake all such validation into a tree_count_t.

More practically speaking, it's probably sufficient to perform the necessary validation at import time with comments/docs explaining the checks. Regardless, I would recommend doing so in a separate PR.

I'll provide inline comments for a few other things in case you want to proceed with the recommendations from coderabbit as is.

Contributor

@wphicks wphicks left a comment


See my other comment for more general thoughts on the latest round of changes, but this review covers the code assuming you want to go ahead with coderabbit's suggestions.

My only other question is if we have any benchmarks covering this new change? I have less specific performance concerns here but always try to benchmark any FIL change that touches the actual inference code, since we used to occasionally miss regressions in legacy FIL.

@hcho3
Contributor Author

hcho3 commented Feb 21, 2026

@wphicks I created #7821 so that we can add import-time validation to ensure num_trees >= 1, num_outputs >= 1. For now, I will address the remaining comments.

hcho3 and others added 2 commits February 20, 2026 17:41

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/cuml/fil/detail/infer_kernel/cpu.hpp`:
- Around line 102-118: The overflow guard runs after allocating output_workspace
which can wrap the 32-bit product and cause std::bad_alloc before the ASSERT;
move the guard block (the computation of max_num_row and ASSERT(row_count <=
max_num_row)) to before the allocation of output_workspace and compute the
allocation size using 64-bit arithmetic (e.g., cast row_count, num_outputs,
num_grove to uint64_t) to check and then only construct output_workspace/compute
task_count once the size is validated, updating any uses of task_count to use
the safe 64-bit-checked value.
- Around line 112-113: Avoid the integer division-by-zero by checking the
denominator before computing max_num_row: compute a uint64_t denom = num_outputs
* static_cast<std::uint64_t>(num_grove) (where num_grove comes from
ceildiv(num_tree, grove_size) and ultimately depends on forest.tree_count()),
and if denom == 0 set max_num_row to 0 (or a safe sentinel) otherwise compute
max_num_row = static_cast<std::uint64_t>(std::numeric_limits<index_type>::max())
/ denom; update the assignment site of max_num_row in the infer_kernel (the line
using ceildiv/num_grove) to use this guarded logic.

---

Duplicate comments:
In `@cpp/include/cuml/fil/detail/infer/gpu.cuh`:
- Around line 220-221: The division can still divide by zero when output_count *
num_grove == 0 (e.g., infer_kind::default_kind with forest.tree_count()==0);
modify the max_num_row computation to guard against a zero divisor by checking
if output_count==0 || num_grove==0 and in that case set max_num_row to
std::numeric_limits<std::uint64_t>::max() (or another safe large value) instead
of performing the division, otherwise perform the existing static_cast division;
update the code around the max_num_row calculation (the variables max_num_row,
output_count, num_grove) to implement this conditional.
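The guarded division the bot suggests can be sketched as follows. The variable names are taken from the review comment, not from the merged code, and returning the 64-bit maximum for a zero divisor is one possible choice of "safe large value":

```cpp
#include <cstdint>
#include <limits>

// Sketch of the suggested guard: if the per-row output stride would be zero
// (e.g. a forest with zero trees), return the largest possible bound instead
// of dividing by zero, so the subsequent row_count ASSERT can never trip UB.
std::uint64_t bounded_max_num_row(std::uint32_t output_count, std::uint32_t num_grove)
{
  constexpr auto index_max =
    static_cast<std::uint64_t>(std::numeric_limits<std::uint32_t>::max());
  if (output_count == 0 || num_grove == 0) {
    // Zero-stride output: no row index can overflow, so impose no bound.
    return std::numeric_limits<std::uint64_t>::max();
  }
  return index_max / (static_cast<std::uint64_t>(output_count) * num_grove);
}
```

As discussed above, the team ultimately preferred validating degenerate models once at import time (#7821) over sprinkling such guards through the inference paths.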

@hcho3
Contributor Author

hcho3 commented Feb 23, 2026

I ran the benchmark again. This code change makes <1% difference in throughput.

  • predict: 0.7% increase in the run time
  • predict_per_tree: 0.2% increase in the run time

@hcho3 hcho3 requested a review from csadorf February 23, 2026 20:49
Contributor

@csadorf csadorf left a comment


Much better! However, we should use the correct exception type.

@hcho3
Contributor Author

hcho3 commented Mar 4, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 1e6dfb7 into rapidsai:main Mar 4, 2026
99 checks passed
@hcho3 hcho3 deleted the fix_fil_overflow branch March 5, 2026 00:01
hcho3 added a commit to hcho3/nvforest that referenced this pull request Mar 5, 2026
rapids-bot Bot pushed a commit to rapidsai/nvforest that referenced this pull request Mar 5, 2026
Closes #45.

Port of rapidsai/cuml#7727

Throw an exception when the input is large enough to create integer overflow.

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #63

Labels

CUDA/C++, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] predict_per_tree inconsistent behavior

6 participants