Skip to content

[BUG] Feature Sampling Bias in Random Forest #7448

@csadorf

Description

@csadorf

Summary

The Random Forest implementation exhibits systematic bias in feature selection during tree construction. Lower-indexed features are over-sampled while higher-indexed features are under-sampled or never selected, causing degraded model performance on datasets where important features have higher indices.

Problem Description

When building decision trees with max_features < n_features, the feature sampling algorithm is supposed to uniformly select k features from n total features without replacement. However, the implementation consistently favors features with lower indices.

Observed Behavior

Test results demonstrate clear bias patterns:

Test Case 1: 10 features, sampling 5

  • Features 0-4: Sampled at ratio 2.0 (double the expected rate)
  • Features 5-9: Sampled at ratio 0.0 (never selected)

Test Case 2: 20 features, sampling 10

  • Features 0-9: Sampled at ratio ~2.0
  • Features 10-19: Sampled at ratio 0.0-0.028

Test Case 3: 50 features, sampling 25

  • Features 0-26: Sampled at ratios 1.14-1.87 (over-sampled)
  • Features 27+: Sampled at decreasing ratios from 0.64 to 0.0

Expected behavior: All features should be sampled at ratio ≈ 1.0 (k/n probability).

Root Cause

The bug is located in cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh in the excess_sample_with_replacement_kernel function.

The algorithm works as follows:

  1. Generate random feature samples (with replacement)
  2. Sort the samples by feature index (ascending order)
  3. Identify unique features
  4. Select the first k unique features from the sorted array

The critical error is in step 4: After sorting by feature index, taking the first k unique values always selects the k smallest feature indices that appear in the random sample. This creates systematic bias towards lower-numbered features.

Why This Occurs

The sorting step (line 215) orders features by their index values:

BlockRadixSortT(temp_storage.sort).Sort(items);

Then the selection logic (lines 238-243) takes the first k from the sorted list:

if (mask[i] and col_indices[i] < k) {
  colids[col_offset + col_indices[i]] = items[i];
}

Since items is sorted in ascending order, col_indices[i] < k selects the k smallest feature indices, not a random k features.

Impact

Model Performance

  • Features with lower indices receive disproportionate influence in tree construction
  • Features with higher indices are systematically excluded from consideration
  • Model quality degrades when important features have higher index positions
  • Ensemble diversity is reduced due to similar feature sets across trees

Affected Configurations

  • Any Random Forest training with max_features < n_features
  • Both classification and regression tasks
  • Impact severity scales with feature count and sampling fraction

Reproduction

The bug has been demonstrated locally with comprehensive C++ tests that measure feature sampling frequencies across multiple configurations. Test code and diagnostic utilities will be pushed to the repository shortly.

Expected vs Actual Behavior

For uniform random sampling, each feature should be selected with probability k/n:

  • Expected: All features have selection ratio ≈ 1.0 (within statistical tolerance)
  • Actual: Selection ratio correlates strongly with feature index (lower = higher ratio)

Potential Solutions

The fix likely requires modifying the selection logic in excess_sample_with_replacement_kernel to:

  1. Avoid positional bias: The current approach of taking "first k" after sorting by feature index must be changed
  2. Ensure uniform distribution: Could use:
    • Random shuffling of unique features before selection
    • Hash-based priorities to randomize ordering
    • Direct random sampling from the unique set
  3. Use RAFT's sampling utilities: Consider replacing the custom excess_sample_with_replacement_kernel with RAFT's sample_without_replacement function, which implements proper one-pass sampling without replacement and is likely already well-tested

Related Issues

Metadata

Metadata

Assignees

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions