Add fix for devices that do not have memory resources #6823
rapids-bot[bot] merged 13 commits into rapidsai:branch-25.10
csadorf
left a comment
I think we should target this for 25.08, not 25.06.
else:
    return None
I don't think returning None here is a good idea, because it would lead to TypeErrors in many of our (stress) tests.
What do you recommend? Would returning 0 be a good solution?
The only place we use this function is to set pytest.max_gpu_memory. That variable is implicitly expected to be a nonzero integer number wherever it is used.
I think using None is ok to indicate "unknown", but then we need to make sure to test for that wherever pytest.max_gpu_memory is used.
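A minimal sketch of the guard described above, assuming `pytest.max_gpu_memory` holds the total GPU memory in GB, or `None` when the device does not report it. The helper name `has_enough_gpu_memory` and the choice to treat "unknown" as "not enough" are hypothetical and not taken from this PR.

```python
from typing import Optional


def has_enough_gpu_memory(max_gpu_memory: Optional[int], required_gb: int) -> bool:
    """Return True only when the known GPU memory meets the requirement.

    `None` means the device did not report its memory, so we answer False
    instead of comparing None with an int, which would raise a TypeError.
    """
    if max_gpu_memory is None:
        return False
    return max_gpu_memory >= required_gb
```

A memory-hungry test could then call `pytest.skip(...)` whenever the helper returns `False`, rather than comparing `pytest.max_gpu_memory` against a number directly.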
Have we decided what we would like to do with this change?
@viclafargue I think we should fix this up and merge into branch-25.10.
/ok to test b4f2c85
/ok to test 7f9c957
@viclafargue I think you'll also need to add a pynvml dependency, I think that needs to be in the
    try:
        if device_id and not str(device_id).isnumeric():
            # This means device_id is UUID.
            # This works for both MIG and non-MIG device UUIDs.
            handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(device_id))
            if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
                # Additionally get parent device handle
                # if the device itself is a MIG instance
                handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(
                    handle
                )
        else:
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
        return handle
    except pynvml.NVMLError:
        raise ValueError(f"Invalid device index or UUID: {device_id}")
This seems fairly complicated for what appears to be a rather basic function. Is this really the recommended approach for this?
For general support, yes, but I presume this is for CI only, so not necessarily all of it is required. However, this is a verbatim copy from Dask-CUDA, which is probably the only place this function is tested, so I think it makes sense to have a verbatim copy here, as it will be less of a headache for you.
In the long term, I'd like to have those functions in some shared package so that all RAPIDS projects can piggyback instead of copying verbatim. I've been pushing on that for 2 years, but it has been really hard to convince our management of its value. Perhaps now that we have similar functions copied in like 50 different places, its value will finally become obvious. @quasiben
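For context, here is a rough sketch of how a handle obtained this way might be used to read total memory while tolerating devices that report none. It assumes `pynvml.nvmlDeviceGetMemoryInfo` raises a `pynvml.NVMLError` on such devices and that the result is wanted in GB, rounded up. The function name `total_gpu_memory_gb` is hypothetical, and the `nvml` parameter (the `pynvml` module or a stand-in) is passed in only so the sketch can be exercised without a GPU.

```python
import math


def total_gpu_memory_gb(nvml, handle):
    """Return total device memory in GB (rounded up), or None if unknown.

    `nvml` is expected to be the `pynvml` module, or any object exposing
    `nvmlDeviceGetMemoryInfo` and `NVMLError` (an assumption of this sketch).
    """
    try:
        info = nvml.nvmlDeviceGetMemoryInfo(handle)
    except nvml.NVMLError:
        # Assumed behavior: the device exposes no memory resources to query.
        return None
    return math.ceil(info.total / 2**30)
```

With the real library this would be called as `total_gpu_memory_gb(pynvml, handle)` after `pynvml.nvmlInit()`.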
pentschev
left a comment
LGTM, thanks @viclafargue !
/merge
No description provided.