[BUG] Errors with 25.06 RandomForest inference on a multi-gpu server #6930

@eordentlich

Describe the bug
With cuML 25.06, I'm seeing errors from RandomForest inference on a multi-GPU server when the current device is set to a device with a non-zero index.

The test script is below. Running python test_rf_mg.py 0 (selecting device 0) works fine, while python test_rf_mg.py 1 fails with a cudaErrorIllegalAddress raised at File "cuml/fil/fil.pyx", line 329, in cuml.fil.fil.ForestInference_impl._predict. If the ForestInference lines in the test script are uncommented, the script still runs fine with device 0, but with device 1 it fails with this error instead:

File "cuml/fil/fil.pyx", line 306, in cuml.fil.fil.ForestInference_impl._predict
RuntimeError: I/O data on different device than model
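
Since the second error complains that the I/O data and the model are on different devices, a quick check of where the input array actually lives may help narrow things down. The following is only a minimal sketch: the helper names `device_id_of` and `check_on_device` are hypothetical, and it assumes a CuPy-style array that exposes its device index as `.device.id`. A stand-in object is used so the sketch runs without a GPU.

```python
from types import SimpleNamespace


def device_id_of(arr):
    """Return the CUDA device index an array lives on.

    Assumption: `arr` is a CuPy-style array exposing `.device.id`.
    """
    return int(arr.device.id)


def check_on_device(arr, expected_id, name="array"):
    """Raise RuntimeError if `arr` is not on the expected device."""
    actual = device_id_of(arr)
    if actual != expected_id:
        raise RuntimeError(
            f"{name} is on device {actual}, expected device {expected_id}"
        )
    return actual


# Stand-in for a CuPy array on device 1 (hypothetical, for illustration only).
fake_X = SimpleNamespace(device=SimpleNamespace(id=1))
print(check_on_device(fake_X, 1, name="X"))
```

In the reproducer, the same call could be made as `check_on_device(X, gpu_id, name="X")` right before `predict`, to confirm the input is on the selected device.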

Steps/Code to reproduce bug
The code below was saved as test_rf_mg.py, the script referred to in the description above:

import cupy as cp
import sys
from rmm.allocators.cupy import rmm_cupy_allocator
cp.cuda.set_allocator(rmm_cupy_allocator)

gpu_id = int(sys.argv[1])
print(f"gpu_id {gpu_id}")
cp.cuda.Device(gpu_id).use()

from cuml.ensemble import RandomForestClassifier as cuRFC

X = cp.random.normal(size=(10,4)).astype(cp.float32)
y = cp.asarray([0,1]*5, dtype=cp.int32)

cuml_model = cuRFC(max_features=1.0, n_streams=1,
                   n_bins=8,
                   n_estimators=40)
                   
cuml_model.fit(X,y)

#from cuml import ForestInference as FI
#tl_model = cuml_model.convert_to_treelite_model()
#fi = FI(treelite_model=tl_model, output_type="cupy", is_classifier=True, device_id=gpu_id)
#cuml_predict = fi.predict(X)
#print("Predicted labels : ", cuml_predict)

cuml_predict = cuml_model.predict(X)
print("Predicted labels : ", cuml_predict)
#Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
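
To compare device indices in one go, the script above can be launched once per device from a small driver. This is a sketch only: the function name `run_on_devices` is hypothetical, and it assumes the script takes the device index as its sole argument, as test_rf_mg.py does. It records each run's exit code and the last line written to stderr.

```python
import subprocess
import sys


def run_on_devices(script, device_ids, python=sys.executable):
    """Run `script <device_id>` once per device.

    Returns {device_id: (exit_code, last_stderr_line)}.
    """
    results = {}
    for dev in device_ids:
        proc = subprocess.run(
            [python, script, str(dev)],
            capture_output=True,
            text=True,
        )
        lines = proc.stderr.strip().splitlines()
        results[dev] = (proc.returncode, lines[-1] if lines else "")
    return results
```

For example, `run_on_devices("test_rf_mg.py", [0, 1])` would report one entry per device, which makes it easy to see which indices fail and with which final error line.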

Expected behavior
No errors when a device with a non-zero index is selected as the current device.
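
Until that is the case, one possible stopgap is to expose only the target GPU to the process through the CUDA_VISIBLE_DEVICES environment variable, so that it shows up as device 0 inside the process. This has not been verified against this particular bug; the sketch below only shows how such an environment could be built, and the helper name `env_for_single_gpu` is hypothetical.

```python
import os


def env_for_single_gpu(gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one GPU.

    Inside a process started with this environment, the chosen GPU
    is enumerated as device 0.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env
```

A child process could then be started with `subprocess.run([sys.executable, "test_rf_mg.py", "0"], env=env_for_single_gpu(1))`, so that physical GPU 1 is used while the script itself selects device 0.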

Environment details (please complete the following information):

  • Environment location: DGX Cloud reserved instance
  • Linux Distro/Architecture: Ubuntu 22.04.2 LTS
  • GPU Model/Driver: 8x A100 80GB/Driver Version: 560.35.05
  • CUDA: 12.6
  • Method of cuDF & cuML install: conda
    • If method of install is [conda], run conda list and include results here

dgx-conda-25.06.txt

Additional context
No errors were observed with 25.04.
