-
Notifications
You must be signed in to change notification settings - Fork 184
[Java] Exception-safe RMM Allocations #1215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
rapids-bot
merged 10 commits into
rapidsai:branch-25.10
from
mythrocks:exception-safe-rmm-allocation
Aug 8, 2025
Merged
Changes from 1 commit
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
b8a4183
[WIP] [Java] Exception-safe RMM Allocations
mythrocks 16c2c6c
Merge remote-tracking branch 'origin/branch-25.10' into exception-saf…
mythrocks 6b8ce7b
Merge remote-tracking branch 'origin/branch-25.10' into exception-saf…
mythrocks 4fd5f8b
Fixed logic error in CagraIndexImpl refactor.
mythrocks aa3518a
Moved RMM Allocations to CloseableRMMAllocation::ctor.
mythrocks 7314065
Merge remote-tracking branch 'origin/branch-25.10' into exception-saf…
mythrocks ad657bb
Merge remote-tracking branch 'origin/branch-25.10' into exception-saf…
mythrocks f0a6566
Review: Use copy-ctor instead of release().
mythrocks 40c56ea
Review: Static allocation for empty bitset vector.
mythrocks f90efc0
Review: Precalculate prefilter constants for TieredIndexImpl.
mythrocks File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a problem here: we pass the dataset "pointer" to IndexReference to hold it and clean it when we are done with the index (see destroyIndex()), so this will lead to a double free.
Your change is good only if we are able to determine that we don't need the dataset device memory after we built the index; however, I was not able to determine if we need it or not; I think we might need it, as it might not be copied over again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see now why you need "release": it's a way to work around that.
I think it is better to avoid it, and simply avoid to use try with resources here.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I haven't conveyed the utility of
CloseableRMMAllocationproperly.Simple case (
CagraIndexImpl, etc.)Consider the following simplified (if contrived) case, representative of how
allocRMMSegment()is used inCagraIndexImpl,TieredIndexImpl, etc.:There are several throwable points between the
alloc()and thefree(). If any of them fire,queriesDPis leaked in__device__memory. This is the simple case thatCloseableRMMAllocationaddresses.The case for
.release()(BruteForceIndexImpl)Similar example as above, except that the
Indexadopts the allocation, and holds it untildestroyIndex()is called.All the perils of the first example appear here as well; there are many throwable points between the allocation and the creation of the
IndexReference. We need a way to clean up the memory allocation if there's any throw before the final commit (toIndexReference).This is why we
release()right before the return.Not using throw-with-resources would mean that we're still open to
__device__memory leaks.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Still, I feel like
release()is not clear; a more explicit way would be to do an old, simple try/catch:let me see if I can think of a different pattern.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you are trying can be seen as similar to C++
unique_ptr; the difference is that we don't have move semantics in Java (as a simple way to transfer ownership).An alternative could be to implement
release()closer to what C++ has, e.g. returning the enclosed object:While I think this models more clearly the "I am transferring ownership" idea, there is one drawback: what if IndexReference or CloseableRMMAllocation ctors throw? This is not a danger here, they are both no-throw ctors (just simple assignements), but still it's a little bit less robust.
I think I still like the explicit try/catch more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to highlight this though:
Leaking device memory seems particularly bad; but how bad is it actually? Like, would be leaking it even after the process is gone, or the OS/device driver will be able to reclaim that memory (like for host memory)?
In any case, calling @ChrisHegarty in to see if we can protect further against this possibility (using cleaners maybe?)
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly right. That's what this is modeled around. This will turn into an RAII wrapper after I have moved the allocation into the constructor.
That's also the pattern we use in https://github.com/rapidsai/cudf, when we transfer ownership of the underlying
__device__memory in a CUDF column.My initial version also had the pointer returned from release.Edit: Looks like it's not just my initial version;
release()does currently return the old pointer when relinquishing the memory. I didn't use it inBruteForceIndexImplbecause the original pointer was already at hand in the same scope. This will get tighter once the RAII change is made.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the leak does not persist beyond the lifetime of the process. That should be reclaimed after the process exits, yes.
But I would like not to make an assumption that the users of
cuvs-javaare short-running processes, and require them to be tolerant of leaking memory. Even if current users might be alright with a leak in exceptional events, a future user might be a long-running application.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not, I was just trying to understand what's the worst case scenario here, and if we need something extra, beyond taking care of exceptions.