[BUG] Feature Sampling Bias in Random Forest

## Summary

The Random Forest implementation exhibits systematic bias in feature selection during tree construction. Lower-indexed features are over-sampled while higher-indexed features are under-sampled or never selected, causing degraded model performance on datasets where important features have higher indices.

## Problem Description

When building decision trees with `max_features < n_features`, the feature sampling algorithm is supposed to uniformly select `k` features from `n` total features without replacement. However, the implementation consistently favors features with lower indices.

### Observed Behavior

Test results demonstrate clear bias patterns:

**Test Case 1: 10 features, sampling 5**
- Features 0-4: Sampled at ratio 2.0 (double the expected rate)
- Features 5-9: Sampled at ratio 0.0 (never selected)

**Test Case 2: 20 features, sampling 10**
- Features 0-9: Sampled at ratio ~2.0
- Features 10-19: Sampled at ratio 0.0-0.028

**Test Case 3: 50 features, sampling 25**
- Features 0-26: Sampled at ratios 1.14-1.87 (over-sampled)
- Features 27+: Sampled at decreasing ratios from 0.64 to 0.0

Expected behavior: All features should be sampled at ratio ≈ 1.0 (k/n probability).

## Root Cause

The bug is located in `cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh` in the `excess_sample_with_replacement_kernel` function.

The algorithm works as follows:
1. Generate random feature samples (with replacement)
2. **Sort the samples by feature index** (ascending order)
3. Identify unique features
4. **Select the first k unique features** from the sorted array

The critical error is in step 4: After sorting by feature index, taking the first k unique values always selects the k **smallest** feature indices that appear in the random sample. This creates systematic bias towards lower-numbered features.

### Why This Occurs

The sorting step (line 215) orders features by their index values:
```cuda
BlockRadixSortT(temp_storage.sort).Sort(items);
```

Then the selection logic (lines 238-243) takes the first k from the sorted list:
```cuda
if (mask[i] and col_indices[i] < k) {
  colids[col_offset + col_indices[i]] = items[i];
}
```

Since `items` is sorted in ascending order, `col_indices[i] < k` selects the k smallest feature indices, not a random k features.

## Impact

### Model Performance
- Features with lower indices receive disproportionate influence in tree construction
- Features with higher indices are systematically excluded from consideration
- Model quality degrades when important features have higher index positions
- Ensemble diversity is reduced due to similar feature sets across trees

### Affected Configurations
- Any Random Forest training with `max_features < n_features`
- Both classification and regression tasks
- Impact severity scales with feature count and sampling fraction

## Reproduction

The bug has been demonstrated locally with comprehensive C++ tests that measure feature sampling frequencies across multiple configurations. Test code and diagnostic utilities will be pushed to the repository shortly.

### Expected vs Actual Behavior

For uniform random sampling, each feature should be selected with probability `k/n`:
- **Expected**: All features have selection ratio ≈ 1.0 (within statistical tolerance)
- **Actual**: Selection ratio correlates strongly with feature index (lower = higher ratio)

## Potential Solutions

The fix likely requires modifying the selection logic in `excess_sample_with_replacement_kernel` to:

1. **Avoid positional bias**: The current approach of taking "first k" after sorting by feature index must be changed
2. **Ensure uniform distribution**: Could use:
   - Random shuffling of unique features before selection
   - Hash-based priorities to randomize ordering
   - Direct random sampling from the unique set
3. **Use RAFT's sampling utilities**: Consider replacing the custom `excess_sample_with_replacement_kernel` with RAFT's [`sample_without_replacement`](https://docs.rapids.ai/api/raft/stable/cpp_api/random_sampling_without_replacement/) function, which implements proper one-pass sampling without replacement and is likely already well-tested

## Related Issues

- https://github.com/rapidsai/cuml/issues/7445 - Reports feature 0 and feature n-1 specific biases in the same kernel



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Feature Sampling Bias in Random Forest #7448

Summary

Problem Description

Observed Behavior

Root Cause

Why This Occurs

Impact

Model Performance

Affected Configurations

Reproduction

Expected vs Actual Behavior

Potential Solutions

Related Issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] Feature Sampling Bias in Random Forest #7448

Description

Summary

Problem Description

Observed Behavior

Root Cause

Why This Occurs

Impact

Model Performance

Affected Configurations

Reproduction

Expected vs Actual Behavior

Potential Solutions

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions