Describe the bug
With cuML 25.06, RandomForest inference fails on a multi-GPU server when the current device is set to a non-zero device index.
Test script below. Running python test_rf_mg.py 0 (device 0) works, while python test_rf_mg.py 1 fails with a cudaErrorIllegalAddress in File "cuml/fil/fil.pyx", line 329, in cuml.fil.fil.ForestInference_impl._predict. If the ForestInference lines in the test script are uncommented, device 0 still works, but device 1 fails with:
File "cuml/fil/fil.pyx", line 306, in cuml.fil.fil.ForestInference_impl._predict
RuntimeError: I/O data on different device than model
Steps/Code to reproduce bug
The code below was saved as test_rf_mg.py, the script referenced in the description above:
import cupy as cp
import sys
from rmm.allocators.cupy import rmm_cupy_allocator
cp.cuda.set_allocator(rmm_cupy_allocator)
gpu_id = int(sys.argv[1])
print(f"gpu_id {gpu_id}")
cp.cuda.Device(gpu_id).use()
from cuml.ensemble import RandomForestClassifier as cuRFC
X = cp.random.normal(size=(10,4)).astype(cp.float32)
y = cp.asarray([0,1]*5, dtype=cp.int32)
cuml_model = cuRFC(max_features=1.0, n_streams=1,
                   n_bins=8,
                   n_estimators=40)
cuml_model.fit(X,y)
#from cuml import ForestInference as FI
#tl_model = cuml_model.convert_to_treelite_model()
#fi = FI(treelite_model=tl_model, output_type="cupy", is_classifier=True, device_id=gpu_id)
#cuml_predict = fi.predict(X)
#print("Predicted labels : ", cuml_predict)
cuml_predict = cuml_model.predict(X)
print("Predicted labels : ", cuml_predict)
#Predicted labels : [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
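To check every GPU on the server rather than only devices 0 and 1, the script can be run once per device index in a fresh process. The helper below is a minimal sketch and not part of the original report; the name run_on_devices and its return format are hypothetical. It only assumes that the script takes the device index as its first argument, as test_rf_mg.py does.

```python
import subprocess
import sys


def run_on_devices(script, device_ids, timeout=600):
    """Run `script <device_id>` once per device and collect the outcomes.

    Hypothetical helper: the name and the result layout are illustrative only.
    """
    results = {}
    for dev in device_ids:
        # A fresh interpreter per device means a CUDA context left unusable
        # by an illegal-address error cannot affect the next run.
        proc = subprocess.run(
            [sys.executable, script, str(dev)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        stderr_lines = proc.stderr.strip().splitlines()
        results[dev] = {
            "returncode": proc.returncode,
            # The last stderr line is usually the exception message.
            "last_stderr_line": stderr_lines[-1] if stderr_lines else "",
        }
    return results
```

For example, run_on_devices("test_rf_mg.py", range(8)) would cover all eight GPUs on this machine; a non-zero returncode marks a device index on which the script failed.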
Expected behavior
No errors when a non-zero device is selected as the current device.
Environment details (please complete the following information):
- Environment location: DGX Cloud reserved instance
- Linux Distro/Architecture: Ubuntu 22.04.2 LTS
- GPU Model/Driver: 8x A100 80GB/Driver Version: 560.35.05
- CUDA: 12.6
- Method of cuDF & cuML install: conda
- If method of install is [conda], run conda list and include results here: dgx-conda-25.06.txt
Additional context
No errors were observed with 25.04.