[BUG] RandomForest Builder Does Not Sample 0th Feature #7445

@csadorf

Describe the bug

The RandomForest builder has a sampling bias issue in the feature selection algorithm:

Feature 0 Bias: The initial value passed to SubtractLeft is mask[0], which is 0. SubtractLeft compares the first entry of the sorted items array against this initial value, so whenever the smallest sampled column index is 0, the comparison reports no difference. Feature 0 is therefore never identified as unique: it is treated as a duplicate and dropped from the selection even when it was actually sampled.
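The effect can be modelled on the host in a few lines of plain Python. This is only a sketch of the behaviour described above, not the CUDA kernel itself: `subtract_left_mask` and `unique_columns` are hypothetical names, and `seed` stands in for the initial value handed to SubtractLeft.

```python
def subtract_left_mask(sorted_items, seed):
    """Flag each item that differs from its left neighbour.

    `seed` plays the role of the left neighbour of the first item
    (the initial value passed to SubtractLeft in this model).
    """
    mask = []
    prev = seed
    for item in sorted_items:
        mask.append(1 if item != prev else 0)
        prev = item
    return mask


def unique_columns(sampled, seed):
    """Sort the sampled column indices and keep those flagged as unique."""
    items = sorted(sampled)
    mask = subtract_left_mask(items, seed)
    return [col for col, keep in zip(items, mask) if keep]


sampled = [3, 0, 7, 3, 0]  # feature 0 was drawn twice

print(unique_columns(sampled, seed=0))   # [3, 7]     feature 0 is lost
print(unique_columns(sampled, seed=-1))  # [0, 3, 7]  feature 0 is kept
```

With a seed of 0, the first sorted item (0) compares equal to the seed and is masked out; with a seed of -1, no valid column index can collide with it.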

Steps/Code to reproduce bug

Minimal reproducible example (MRE):

#!/usr/bin/env python3
"""Minimal test for feature 0 sampling bias. Fails if bug is present."""

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from cuml.ensemble import RandomForestClassifier
import cupy as cp

# Dataset where ONLY feature 0 is predictive
np.random.seed(42)
X = np.random.randn(5000, 10)
y = (X[:, 0] > 0).astype(np.int32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train cuML model
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=5, random_state=42)
model.fit(cp.array(X_train, dtype=cp.float32), cp.array(y_train, dtype=cp.int32))

# Predict and check accuracy
pred = cp.asnumpy(model.predict(cp.array(X_test, dtype=cp.float32)))
accuracy = accuracy_score(y_test, pred)

# Should achieve >95% accuracy on this trivial problem
assert accuracy > 0.95, f"Expected accuracy >0.95 but got {accuracy:.4f}. Feature 0 sampling bias detected."

Observed behavior while the bug is present:

  • Assertion fails with accuracy around 50-60% (essentially random guessing)
  • Error message: AssertionError: Expected accuracy >0.95 but got 0.6027. Feature 0 sampling bias detected.

Expected behavior if bug is fixed:

  • Test passes with accuracy >95%

Expected behavior

  • All features should have equal probability of being selected (for uniform sampling)
  • The feature selection should be unbiased across all features
  • No feature should be artificially included or excluded due to initialization or padding values
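One way to check these properties without a GPU is to count how often each column survives deduplication over many simulated draws. The sketch below is a host-side model, not cuML code: it assumes columns are drawn with replacement, sorted, and deduplicated by comparing each item with its left neighbour, and the names `select_unique` and `selection_counts` are hypothetical.

```python
import random


def select_unique(n_cols, n_draws, seed_value, rng):
    """Draw column indices with replacement, sort, and drop duplicates.

    `seed_value` models the initial value the first item is compared against.
    """
    items = sorted(rng.randrange(n_cols) for _ in range(n_draws))
    selected = []
    prev = seed_value
    for item in items:
        if item != prev:
            selected.append(item)
        prev = item
    return selected


def selection_counts(n_cols, n_draws, seed_value, trials=2000, rng_seed=0):
    """Count how many trials select each column."""
    rng = random.Random(rng_seed)
    counts = [0] * n_cols
    for _ in range(trials):
        for col in select_unique(n_cols, n_draws, seed_value, rng):
            counts[col] += 1
    return counts


buggy = selection_counts(n_cols=10, n_draws=6, seed_value=0)
fixed = selection_counts(n_cols=10, n_draws=6, seed_value=-1)

print("initial value  0:", buggy)  # column 0 is never selected
print("initial value -1:", fixed)  # column 0 is selected like the others
```

In this model, column 0 has a count of zero with an initial value of 0 and a count comparable to the other columns with an initial value of -1; the counts for columns 1 and above are identical in both runs because the same random draws are used.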

Environment details (please complete the following information):

  • Environment location: Any
  • Linux Distro/Architecture: Any
  • GPU Model/Driver: Any CUDA-capable GPU
  • CUDA: Any supported version
  • Method of cuDF & cuML install: Any
  • Component: cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh

Additional context

  • The issue can be fixed by changing the initial value from mask[0] to IdxT(-1), which cannot match any valid column index in [0, n-1]
  • This sampling bias affects feature importances and overall model quality
  • Test coverage should be expanded to include this bug, both to improve confidence in correctness and to prevent future regressions
  • The affected code is the SubtractLeft operation in cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh
