[BUG] Fix RandomForest Builder Sampling#7422
Conversation
|
Would it be possible to add a unit test that covers this? |
|
Can you investigate the test failures, please? |
|
@csadorf those failures are not related to this PR, it looks like some UMAP failures. Just merged upstream to see if they resolve on their own. As far as testing is concerned, I can potentially add a basic test on the C++ side that checks if every feature is sampled at least once using all the different sampling algorithms. |
Whatever is appropriate, but we should make sure to prevent a future regression. |
…into fix-rf-sampling
|
It looks like this PR is introducing a regression in permutation importance as indicated by the scikit-learn upstream tests. I am currently investigating the problem. |
|
There was also a problem in one of the SHAP tests, which had hardcoded values (as you had indicated earlier from the logs) -- that is fixed now. |
csadorf
left a comment
There was a problem hiding this comment.
This patch appears correct to me, but we probably have a secondary sampling bias issue by setting excess items to n - 1, which is a valid index and thus is guaranteed to be included in the selection whenever we randomly drew less than k unique indices in the first sampling iteration. The probability of that is non-zero.
We should identify a clear MRE to demonstrate these sampling issues and expand our testing to ensure that this critical bug is covered to improve our confidence in correctness and prevent future regressions.
| // Use -1 as the initial value since it can't match any valid column index [0, n-1] | ||
| BlockAdjacentDifferenceT(temp_storage.diff) | ||
| .SubtractLeft(items, mask, CustomDifference<IdxT>(), mask[0]); | ||
| .SubtractLeft(items, mask, CustomDifference<IdxT>(), IdxT(-1)); |
There was a problem hiding this comment.
This appears correct to me. The previous implementation was comparing the first randomly selected column index against the initial value of mask[0] which is always zero. Outside the fact that comparing against a mask value makes absolutely no sense here, this also means it would never be selected, because the items array is sorted.
|
Let's add the failing scikit-learn tests to the xfail list. We will remove them as we fix the wider problem in #7448 . |
|
This PR has been refactored to only address the issue wherein the first column (feature 0) was not being sampled at all. Other bugs are being tracked separately. |
|
/merge |
The initial value chosen for the mask is 0. As a result, the mask computed with
SubtractLeftalways marks feature 0 as "selected" even though it is not. Instead we set it to -1.Failing tests that this PR adds to the xfail-list: