
Use less GPU memory in test_managed_alloc_driver_undersubscribe.#188

Merged
bdice merged 2 commits into NVIDIA:main from bdice:test-h100
Apr 8, 2025
Conversation

@bdice
Contributor

@bdice bdice commented Apr 2, 2025

Resolves #184.

numba.cuda.tests.cudadrv.test_managed_alloc.TestManagedAlloc.test_managed_alloc_driver_undersubscribe allocates 50% of the H100's memory (40 GB) as managed memory, which causes the test process to be killed.

As far as I can see, there is no specific reason the under-subscribed test needs to allocate that much. Reducing the allocation to 10% of GPU memory fixes the issue.
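To illustrate the sizes involved (this is a sketch, not the actual numba-cuda test code; the function name is hypothetical), the under-subscription test only needs a managed allocation that fits comfortably within GPU memory, so any fraction well below 1.0 exercises the same code path:

```python
# Illustrative sketch (not the actual test): compute a managed allocation
# size as a fraction of total GPU memory. On an 80 GB H100, 50% means a
# 40 GB managed allocation, which got the test process killed; 10% (8 GB)
# still under-subscribes the GPU.

def managed_alloc_size(total_gpu_mem_bytes: int, fraction: float) -> int:
    """Return an allocation size equal to a fraction of total GPU memory."""
    return int(total_gpu_mem_bytes * fraction)

total = 80 * 1024**3                          # 80 GB H100
old_size = managed_alloc_size(total, 0.5)     # previous test behavior
new_size = managed_alloc_size(total, 0.1)     # after this PR
print(old_size // 1024**3, new_size // 1024**3)  # prints "40 8"
```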

@bdice bdice marked this pull request as draft April 2, 2025 02:48
@isVoid
Contributor

isVoid commented Apr 7, 2025

Root cause:
numba.cuda.tests.cudadrv.test_managed_alloc.TestManagedAlloc.test_managed_alloc_driver_undersubscribe allocates 50% of the H100's memory (40 GB) as managed memory, which causes the test process to be killed.

As far as I can see, there is no specific reason the under-subscribed test needs to allocate that much. Reducing the allocation to 10% of GPU memory fixes the issue.

@bdice bdice marked this pull request as ready for review April 7, 2025 23:49
@bdice
Contributor Author

bdice commented Apr 7, 2025

Thanks for the fix @isVoid. Since I created this PR, GitHub does not allow me to approve it, but I think your changes are correct and I would approve them -- feel free to approve on my behalf, if you'd like. There are currently no required reviews on this repository, so this can be merged; I'll leave the timing to your discretion.

edit: I updated the description based on your comment.

@bdice bdice changed the title from Test h100 GPUs. to Test H100 GPUs. Apr 7, 2025
@bdice bdice changed the title from Test H100 GPUs. to Use less GPU memory in test_managed_alloc_driver_undersubscribe. Apr 7, 2025
@kkraus14
Contributor

kkraus14 commented Apr 8, 2025

I do not have power here (yet 😉), but the fix LGTM as well

@bdice
Contributor Author

bdice commented Apr 8, 2025

I’ll go ahead and merge. It seems we have consensus on this small fix.

@bdice bdice merged commit b32dfdb into NVIDIA:main Apr 8, 2025
35 checks passed
jiel-nv pushed a commit to jiel-nv/numba-cuda that referenced this pull request Apr 10, 2025
…DIA#188)

* Test h100 GPUs.

* limit the size of memory allocated

---------

Co-authored-by: isVoid <isVoid@users.noreply.github.com>
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Apr 22, 2025
- Locate nvvm, libdevice, nvrtc, and cudart from nvidia-*-cu12 wheels (NVIDIA#155)
- reinstate test (NVIDIA#226)
- Restore PR NVIDIA#185 (Stop Certain Driver API Discovery for "v2") (NVIDIA#223)
- Report NVRTC builtin operation failures to the user (NVIDIA#196)
- Add Module Setup and Teardown Callback to Linkable Code Interface (NVIDIA#145)
- Test CUDA 12.8. (NVIDIA#187)
- Ensure RTC Bindings Clamp to the Maximum Supported CC (NVIDIA#189)
- Migrate code style to ruff (NVIDIA#170)
- Use less GPU memory in test_managed_alloc_driver_undersubscribe. (NVIDIA#188)
- Update workflows to always use proxy cache. (NVIDIA#191)
@gmarkall gmarkall mentioned this pull request Apr 22, 2025
gmarkall added a commit that referenced this pull request Apr 22, 2025
- Locate nvvm, libdevice, nvrtc, and cudart from nvidia-*-cu12 wheels (#155)
- reinstate test (#226)
- Restore PR #185 (Stop Certain Driver API Discovery for "v2") (#223)
- Report NVRTC builtin operation failures to the user (#196)
- Add Module Setup and Teardown Callback to Linkable Code Interface (#145)
- Test CUDA 12.8. (#187)
- Ensure RTC Bindings Clamp to the Maximum Supported CC (#189)
- Migrate code style to ruff (#170)
- Use less GPU memory in test_managed_alloc_driver_undersubscribe. (#188)
- Update workflows to always use proxy cache. (#191)


Development

Successfully merging this pull request may close these issues.

[BUG] Tests fail with H100 GPU

3 participants