[BUG] RandomForest Builder Does Not Sample 0th Feature

**Describe the bug**

The RandomForest builder has a sampling bias issue in the feature selection algorithm:

**Feature 0 Bias**: The randomly selected column indices are sorted, and `SubtractLeft` then computes a mask by comparing each item with its left neighbor. The first item has no left neighbor, so it is compared against the initial value passed to `SubtractLeft`, which is `mask[0]` and is always 0. Because the items array is sorted, feature 0 can only appear at the front of the array, where it compares equal to that initial value. As a result, the mask always treats feature 0 as a duplicate (already "selected") even when it has not been selected, so feature 0 is never identified as unique and is never sampled.
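The effect can be illustrated with a small NumPy model of the adjacent-difference step. This is only a sketch of the logic described above, not the CUDA kernel itself: `unique_mask` and its `predecessor` argument are hypothetical names standing in for `SubtractLeft` and its initial value.

```python
import numpy as np


def unique_mask(sorted_items, predecessor):
    """Flag each item that differs from its left neighbor.

    `predecessor` is the value the first item is compared against
    (a stand-in for the initial value given to `SubtractLeft`).
    """
    items = np.asarray(sorted_items)
    left = np.concatenate(([predecessor], items[:-1]))
    return (items != left).astype(np.int32)


# Column indices drawn with replacement, then sorted: [0, 0, 3, 3, 5, 7]
samples = np.sort(np.array([3, 0, 7, 3, 0, 5]))

buggy_mask = unique_mask(samples, predecessor=0)   # current initial value
fixed_mask = unique_mask(samples, predecessor=-1)  # proposed initial value

selected_buggy = samples[buggy_mask == 1]
selected_fixed = samples[fixed_mask == 1]

print("initial value 0: ", selected_buggy)  # feature 0 is dropped
print("initial value -1:", selected_fixed)  # feature 0 is kept
```

With an initial value of 0, the leading 0 in the sorted array looks like a duplicate of its (nonexistent) predecessor; with -1, no valid column index can match.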

**Steps/Code to reproduce bug**

Minimal reproducible example (MRE):

```python
#!/usr/bin/env python3
"""Minimal test for feature 0 sampling bias. Fails if bug is present."""

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from cuml.ensemble import RandomForestClassifier
import cupy as cp

# Dataset where ONLY feature 0 is predictive
np.random.seed(42)
X = np.random.randn(5000, 10)
y = (X[:, 0] > 0).astype(np.int32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train cuML model
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=5, random_state=42)
model.fit(cp.array(X_train, dtype=cp.float32), cp.array(y_train, dtype=cp.int32))

# Predict and check accuracy
pred = cp.asnumpy(model.predict(cp.array(X_test, dtype=cp.float32)))
accuracy = accuracy_score(y_test, pred)

# Should achieve >95% accuracy on this trivial problem
assert accuracy > 0.95, f"Expected accuracy >0.95 but got {accuracy:.4f}. Feature 0 sampling bias detected."
```

**Observed behavior if bug is present:**
- Assertion fails with accuracy around 50-60% (essentially random guessing)
- Error message: `AssertionError: Expected accuracy >0.95 but got 0.6027. Feature 0 sampling bias detected.`

**Expected behavior if bug is fixed:**
- Test passes with accuracy >95%

**Expected behavior**

- All features should have equal probability of being selected (for uniform sampling)
- The feature selection should be unbiased across all features
- No feature should be artificially included or excluded due to initialization or padding values
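One way to check these properties is to count how often each feature ends up selected over many sampling rounds. The sketch below does this on a simplified CPU model (draw with replacement, sort, keep items that differ from their left neighbor); the helper `selection_counts` and the parameters `n_features`, `n_draws`, and `n_trials` are assumptions made for illustration and do not mirror the real kernel's excess-sampling details.

```python
import numpy as np


def selection_counts(predecessor, n_features=10, n_draws=6, n_trials=2000, seed=0):
    """Count how often each feature is selected in a simplified sampler model."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_features, dtype=np.int64)
    for _ in range(n_trials):
        # Draw column indices with replacement and sort them.
        items = np.sort(rng.integers(0, n_features, size=n_draws))
        # Compare each item with its left neighbor; the first item is
        # compared against `predecessor`.
        left = np.concatenate(([predecessor], items[:-1]))
        counts[items[items != left]] += 1
    return counts


buggy_counts = selection_counts(predecessor=0)
fixed_counts = selection_counts(predecessor=-1)

print("initial value 0: ", buggy_counts)
print("initial value -1:", fixed_counts)
```

In this model, feature 0 is never counted when the initial value is 0, while an initial value of -1 gives all features roughly the same selection frequency.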

**Environment details (please complete the following information):**
- Environment location: Any
- Linux Distro/Architecture: Any
- GPU Model/Driver: Any CUDA-capable GPU
- CUDA: Any supported version
- Method of cuDF & cuML install: Any
- Component: `cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh`

**Additional context**

- The issue can be fixed by changing the initial value from `mask[0]` to `IdxT(-1)`, which cannot match any valid column index [0, n-1]
- This sampling bias affects feature importances and overall model quality
- Test coverage should be expanded to catch this bug and prevent future regressions
- The affected code is in `cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh` in the `SubtractLeft` operation


[BUG] RandomForest Builder Does Not Sample 0th Feature #7445
