[BUG] Fix RandomForest Builder Sampling #7422

Merged: rapids-bot[bot] merged 19 commits into rapidsai:main from tarang-jain:fix-rf-sampling on Nov 7, 2025

@tarang-jain
Contributor

@tarang-jain tarang-jain commented Nov 1, 2025

The initial value passed to SubtractLeft for the mask is 0. Because the sampled column indices are sorted, the first item is compared against 0, so the computed mask always treats feature 0 as already "selected" even though it is not. Instead, we set the initial value to -1.
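The effect of the initial value can be illustrated with a small host-side model. This is only a sketch, not the kernel code: `first_occurrence_flags` and `unique_columns` are hypothetical helpers, and the comparison is assumed to flag an item whenever it differs from its left neighbour, which is what a "subtract left" pass with an inequality operator would do.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a "subtract left" pass over sorted column indices.
// flags[i] is 1 when items[i] differs from its left neighbour, i.e. when it is
// the first occurrence of that value. The first element has no neighbour, so
// it is compared against `predecessor` instead.
// Assumption: the real CustomDifference behaves like this inequality test.
std::vector<int> first_occurrence_flags(const std::vector<int>& items, int predecessor)
{
  std::vector<int> flags(items.size());
  for (std::size_t i = 0; i < items.size(); ++i) {
    const int left = (i == 0) ? predecessor : items[i - 1];
    flags[i]       = (items[i] != left) ? 1 : 0;
  }
  return flags;
}

// Keep only the items flagged as first occurrences.
std::vector<int> unique_columns(const std::vector<int>& sorted_items, int predecessor)
{
  const std::vector<int> flags = first_occurrence_flags(sorted_items, predecessor);
  std::vector<int> out;
  for (std::size_t i = 0; i < sorted_items.size(); ++i) {
    if (flags[i] != 0) { out.push_back(sorted_items[i]); }
  }
  return out;
}
```

In this model, the sorted draw `{0, 0, 2, 5}` loses column 0 when the predecessor is 0, because the first item looks like a duplicate of a value that was never drawn. With a predecessor of -1, which cannot equal any valid column index, column 0 is kept.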

Failing tests that this PR adds to the xfail-list:

  • "sklearn.inspection.tests.test_permutation_importance::test_permutation_importance_correlated_feature_regression_pandas[0.5-1]"
  • "sklearn.inspection.tests.test_permutation_importance::test_permutation_importance_correlated_feature_regression_pandas[0.5-2]"
  • "sklearn.inspection.tests.test_permutation_importance::test_permutation_importance_correlated_feature_regression_pandas[1.0-1]"
  • "sklearn.inspection.tests.test_permutation_importance::test_permutation_importance_correlated_feature_regression_pandas[1.0-2]"

@tarang-jain tarang-jain requested a review from a team as a code owner November 1, 2025 00:13
@tarang-jain tarang-jain requested a review from vyasr November 1, 2025 00:13
@tarang-jain tarang-jain added the bug and non-breaking labels Nov 1, 2025
@csadorf
Contributor

csadorf commented Nov 3, 2025

Would it be possible to add a unit test that covers this?

@csadorf
Contributor

csadorf commented Nov 3, 2025

Can you investigate the test failures, please?

@tarang-jain
Contributor Author

tarang-jain commented Nov 3, 2025

@csadorf those failures are not related to this PR; they look like UMAP failures. I just merged upstream to see if they resolve on their own. As far as testing is concerned, I can potentially add a basic test on the C++ side that checks if every feature is sampled at least once using all the different sampling algorithms.
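A coverage check of this kind could look roughly like the following host-side sketch. It is not the test added in this PR: the `Sampler` type, `every_feature_sampled`, and `reference_sampler` are hypothetical names, and the stand-in sampler is a partial Fisher-Yates shuffle rather than any of the builder's actual sampling algorithms.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sampler interface: returns k column indices out of n_features.
using Sampler = std::function<std::vector<int>(int n_features, int k, std::mt19937& rng)>;

// Returns true if, over n_rounds draws, every feature in [0, n_features) was
// sampled at least once and no out-of-range index was ever produced.
bool every_feature_sampled(
  const Sampler& sample, int n_features, int k, int n_rounds, std::uint32_t seed)
{
  std::mt19937 rng(seed);
  std::vector<bool> seen(n_features, false);
  for (int r = 0; r < n_rounds; ++r) {
    for (int col : sample(n_features, k, rng)) {
      if (col < 0 || col >= n_features) { return false; }
      seen[col] = true;
    }
  }
  for (bool s : seen) {
    if (!s) { return false; }
  }
  return true;
}

// Stand-in sampler (assumes 0 < k <= n_features): partial Fisher-Yates
// shuffle yielding k distinct columns. Modulo bias is ignored for brevity.
std::vector<int> reference_sampler(int n_features, int k, std::mt19937& rng)
{
  std::vector<int> cols(n_features);
  std::iota(cols.begin(), cols.end(), 0);
  for (int i = 0; i < k; ++i) {
    const int j = i + static_cast<int>(rng() % static_cast<std::uint32_t>(n_features - i));
    std::swap(cols[i], cols[j]);
  }
  cols.resize(k);
  return cols;
}
```

A real regression test would plug each of the builder's sampling paths in place of `reference_sampler`. A sampler that can never return column 0, as in the bug described here, would make `every_feature_sampled` return false no matter how many rounds are run.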

@csadorf
Contributor

csadorf commented Nov 3, 2025

> As far as testing is concerned, I can potentially add a basic test on the C++ side that checks if every feature is sampled at least once using all the different sampling algorithms.

Whatever is appropriate, but we should make sure to prevent a future regression.

@tarang-jain tarang-jain requested a review from a team as a code owner November 4, 2025 01:53
@github-actions github-actions Bot added the Cython / Python label Nov 4, 2025
@csadorf
Contributor

csadorf commented Nov 4, 2025

It looks like this PR is introducing a regression in permutation importance as indicated by the scikit-learn upstream tests. I am currently investigating the problem.

@tarang-jain
Contributor Author

There was also a problem in one of the SHAP tests, which had hardcoded values (as you had indicated earlier from the logs) -- that is fixed now.

Contributor

@csadorf csadorf left a comment

This patch appears correct to me, but we probably have a secondary sampling bias issue from setting excess items to n - 1. That is a valid index, so it is guaranteed to be included in the selection whenever we randomly draw fewer than k unique indices in the first sampling iteration. The probability of that is non-zero.
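The size of such a bias can be estimated with a toy model. The sketch below is an assumption-laden simplification, not the kernel's algorithm: it draws k indices with replacement, removes duplicates, and lets the padding value n - 1 join the selection whenever duplicates occurred. `sample_with_padding` and `inclusion_counts` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Toy model (assumes n >= 1 and k >= 1): draw k indices with replacement,
// deduplicate, and if fewer than k unique indices remain, the padding value
// n - 1 ends up in the selection. Modulo bias is ignored for brevity.
std::vector<int> sample_with_padding(int n, int k, std::mt19937& rng)
{
  std::vector<int> items(k);
  for (int& v : items) {
    v = static_cast<int>(rng() % static_cast<std::uint32_t>(n));
  }
  std::sort(items.begin(), items.end());
  items.erase(std::unique(items.begin(), items.end()), items.end());
  const bool had_duplicates = static_cast<int>(items.size()) < k;
  // items is sorted and unique, so n - 1 can only appear as the last element.
  if (had_duplicates && items.back() != n - 1) { items.push_back(n - 1); }
  return items;
}

// Count how often each index is part of the selection over many trials.
std::vector<int> inclusion_counts(int n, int k, int trials, std::uint32_t seed)
{
  std::mt19937 rng(seed);
  std::vector<int> counts(n, 0);
  for (int t = 0; t < trials; ++t) {
    for (int col : sample_with_padding(n, k, rng)) {
      ++counts[col];
    }
  }
  return counts;
}
```

Under this model with n = 8 and k = 4, an ordinary index is included with probability 1 - (7/8)^4, about 0.41, while index n - 1 is also included in every trial that produced duplicates, which brings it to roughly 0.79. Whether the real kernel shows a bias of this size depends on details the model leaves out.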

We should identify a clear MRE to demonstrate these sampling issues and expand our testing to ensure that this critical bug is covered to improve our confidence in correctness and prevent future regressions.

Comment on lines +220 to +222
+ // Use -1 as the initial value since it can't match any valid column index [0, n-1]
  BlockAdjacentDifferenceT(temp_storage.diff)
- .SubtractLeft(items, mask, CustomDifference<IdxT>(), mask[0]);
+ .SubtractLeft(items, mask, CustomDifference<IdxT>(), IdxT(-1));
Contributor

This appears correct to me. The previous implementation compared the first randomly selected column index against the initial value of mask[0], which is always zero. Aside from the fact that comparing against a mask value makes no sense here, this also means column 0 would never be selected: the items array is sorted, so a drawn 0 always lands in the first position and is treated as a duplicate.

@csadorf csadorf linked an issue Nov 5, 2025 that may be closed by this pull request
Comment thread cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh Outdated
@csadorf
Contributor

csadorf commented Nov 5, 2025

Let's add the failing scikit-learn tests to the xfail list. We will remove them as we fix the wider problem in #7448.

@tarang-jain
Contributor Author

This PR has been refactored to only address the issue wherein the first column (feature 0) was not being sampled at all. Other bugs are being tracked separately.

@tarang-jain
Contributor Author

/merge

@rapids-bot rapids-bot Bot merged commit cc3ac08 into rapidsai:main Nov 7, 2025
106 checks passed
Labels

bug, CUDA/C++, Cython / Python, non-breaking

Development

Successfully merging this pull request may close these issues.

[BUG] RandomForest Builder Does Not Sample 0th Feature

4 participants