Describe the bug
The RandomForest builder has a sampling bias issue in the feature selection algorithm:
Feature 0 Bias: The initial value chosen for the mask in SubtractLeft is 0. As a result, the mask computed with SubtractLeft is wrong for feature 0. Because the items array is sorted, a sampled column index 0 always sits in the first position, where it is compared against that initial value of 0. The two are equal, so the comparison never identifies feature 0 as unique, and feature 0 is dropped from the selected set even when it was sampled.
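The mechanism can be illustrated with a small CPU model. This is a sketch, not the kernel code: it assumes SubtractLeft flags an item when it differs from its left neighbour and compares the first item against the supplied initial value. The names subtract_left_mask and unique_features are hypothetical and exist only for this illustration.

```python
def subtract_left_mask(sorted_items, initial):
    """Flag each item that differs from its left neighbour.

    The first item has no left neighbour, so it is compared against
    `initial` (an assumed stand-in for the initial value in the kernel).
    """
    mask = []
    previous = initial
    for item in sorted_items:
        mask.append(1 if item != previous else 0)
        previous = item
    return mask


def unique_features(sampled, initial):
    """Sort the sampled column indices and keep only the flagged ones."""
    items = sorted(sampled)
    mask = subtract_left_mask(items, initial)
    return [item for item, flag in zip(items, mask) if flag]


# Oversampled column indices with duplicates; feature 0 was sampled twice.
sampled = [3, 0, 7, 3, 0]
print(unique_features(sampled, initial=0))  # [3, 7]
```

In this model, the leading 0 compares equal to the initial value of 0 and is never flagged, so feature 0 is missing from the result even though it was sampled.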
Steps/Code to reproduce bug
Minimal reproducible example (MRE):
#!/usr/bin/env python3
"""Minimal test for feature 0 sampling bias. Fails if bug is present."""
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from cuml.ensemble import RandomForestClassifier
import cupy as cp
# Dataset where ONLY feature 0 is predictive
np.random.seed(42)
X = np.random.randn(5000, 10)
y = (X[:, 0] > 0).astype(np.int32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train cuML model
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=5, random_state=42)
model.fit(cp.array(X_train, dtype=cp.float32), cp.array(y_train, dtype=cp.int32))
# Predict and check accuracy
pred = cp.asnumpy(model.predict(cp.array(X_test, dtype=cp.float32)))
accuracy = accuracy_score(y_test, pred)
# Should achieve >95% accuracy on this trivial problem
assert accuracy > 0.95, f"Expected accuracy >0.95 but got {accuracy:.4f}. Feature 0 sampling bias detected."
Expected behavior if bug is present:
- Assertion fails with accuracy around 50-60% (essentially random guessing)
- Error message:
AssertionError: Expected accuracy >0.95 but got 0.6027. Feature 0 sampling bias detected.
Expected behavior if bug is fixed:
- Test passes with accuracy >95%
Expected behavior
- All features should have equal probability of being selected (for uniform sampling)
- The feature selection should be unbiased across all features
- No feature should be artificially included or excluded due to initialization or padding values
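One way to make the uniformity expectation checkable is to count how often each feature is selected over many trials and compare against k/n. The sketch below is an assumption about how such a check could look, not an existing cuML test: both samplers are pure-Python stand-ins, and all function names are hypothetical.

```python
import random
from collections import Counter


def selection_frequencies(sampler, n_features, k, trials, seed=0):
    """Return, per feature, the fraction of trials in which it was selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(set(sampler(rng, n_features, k)))
    return [counts[f] / trials for f in range(n_features)]


def is_uniform(freqs, expected, tol=0.02):
    """True if every feature's selection frequency is within tol of expected."""
    return all(abs(f - expected) <= tol for f in freqs)


def reference_sampler(rng, n_features, k):
    """Unbiased stand-in: k distinct features drawn uniformly."""
    return rng.sample(range(n_features), k)


def biased_sampler(rng, n_features, k):
    """Hypothetical biased stand-in that can never return feature 0."""
    return rng.sample(range(1, n_features), k)


n_features, k, trials = 10, 3, 20000
expected = k / n_features  # each feature should appear in 30% of trials
print(is_uniform(selection_frequencies(reference_sampler, n_features, k, trials), expected))  # True
print(is_uniform(selection_frequencies(biased_sampler, n_features, k, trials), expected))  # False
```

The tolerance of 0.02 is an arbitrary choice for this trial count. A real test would need to obtain the selected column indices from the builder, which this sketch does not attempt.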
Environment details (please complete the following information):
- Environment location: Any
- Linux Distro/Architecture: Any
- GPU Model/Driver: Any CUDA-capable GPU
- CUDA: Any supported version
- Method of cuDF & cuML install: Any
- Component: cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh
Additional context
- The issue can be fixed by changing the initial value from mask[0] to IdxT(-1), which cannot match any valid column index in [0, n-1]
- This sampling bias affects feature importances and overall model quality
- Test coverage should be expanded to include this bug, both to improve confidence in correctness and to prevent future regressions
- The affected code is in cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh, in the SubtractLeft operation
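The proposed fix and the kind of regression check it calls for can be sketched on the CPU. This is a simplified model under the assumption that deduplication keeps every sorted item that differs from its left neighbour, with the first item compared against an initial value. dedup_sorted and count_mismatches are hypothetical helpers, and -1 stands in for IdxT(-1).

```python
import random


def dedup_sorted(sampled, initial):
    """Sort, then keep each item that differs from its left neighbour.

    `initial` plays the role of the value compared against the first item.
    """
    unique = []
    previous = initial
    for item in sorted(sampled):
        if item != previous:
            unique.append(item)
        previous = item
    return unique


def count_mismatches(initial, n_features=10, draws=6, trials=2000, seed=0):
    """Count trials where dedup_sorted disagrees with a set-based reference."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        sampled = [rng.randrange(n_features) for _ in range(draws)]
        if dedup_sorted(sampled, initial) != sorted(set(sampled)):
            mismatches += 1
    return mismatches


print(count_mismatches(initial=-1))     # 0: the sentinel never equals a valid index
print(count_mismatches(initial=0) > 0)  # True: trials that sampled feature 0 disagree
```

Comparing against sorted(set(sampled)) gives an independent reference, so the check does not rely on the same neighbour comparison it is meant to validate.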