Redo memory management for NearestNeighbors#7115
Redo memory management for NearestNeighbors#7115rapids-bot[bot] merged 3 commits intorapidsai:branch-25.10from
Conversation
| cdef uintptr_t _D_ptr = D_ndarr.ptr | ||
| cdef handle_t* handle_ = <handle_t*><size_t>self.handle.getHandle() | ||
| cdef vector[float*] *inputs = new vector[float*]() | ||
| cdef vector[int] *sizes = new vector[int]() |
There was a problem hiding this comment.
Previously these vectors would leak every call to fit.
| def effective_metric_params_(self): | ||
| return self.metric_params or {} | ||
|
|
||
| def __del__(self): |
There was a problem hiding this comment.
__del__ on an estimator itself is a red flag. It's not universally bad, but every case I've seen of it in cuml currently is doing memory management incorrectly and has bugs. If an estimator contains non-trivial state (e.g. things that aren't wrapped in python array classes) that needs to be freed, that state should typically be wrapped in a private class and managed with __dealloc__/__cinit__ or other native methods. Casting pointers to python ints to pass around and hope everything works out is a recipe for bugs.
I'll open followup issues for the other cases where this is still done in cuml.
| "nprobe": 3, | ||
| "M": 0, | ||
| "n_bits": 8, | ||
| } |
There was a problem hiding this comment.
The previous code for generating these default values (in ann.pyx) had lots of loops and logic, but would almost always return {"nlist": 8, "nprobe": 3, "M": 1, "n_bits": 4}, which was invalid for cuvs (M * n_bits must be a multiple of 8).
I'd like to cleanup this parameter handling in a followup (as described by the TODO), but for now I've simplified the logic to use a fixed set of defaults that are always valid. M=0 relies on cuvs to pick a good value for M, and n_bits=8 is the cuvs default. Given that the old code would generate invalid defaults for most inputs I think this is better.
There was a problem hiding this comment.
Fully agree with this comment, is there an issue open for the followup?
| def __getstate__(self): | ||
| state = self.__dict__.copy() | ||
| # TODO: Indices currently aren't pickleable. For now we drop them and | ||
| # recreate them on load. |
There was a problem hiding this comment.
Previously pickling NearestNeighbors/KNeighborsClassifier/KNeighborsRegressor with algorithm in {"rbc", "ivfpq", "ivfflat"} would silently succeed in our tests (where the pointer still existed), but in any realistic use would crash due to invalid pointers on load (this is why you should not store pointers to native objects as integers on python objects).
We now succeed in pickling in these cases, but recreate the indices on load. I don't think this is wrong, but it's maybe not ideal. To meet the sklearn conventions we already need to include the original X in the serialized state; recreating the index on load takes some time - but serializing the index would increase the binary size since the original data is also there. Something to maybe handle in a followup.
There was a problem hiding this comment.
Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?
There was a problem hiding this comment.
I'll create a followup issue. I don't think a warning is necessary in this case - it's not wrong to do it this way, and it's still something we hope to fixup. Warnings IMO are for warning the user about something they can change that will remove the warning (deprecation, parameter out of range moved to threshold, ...). In this case there's not much they can do to adjust the behavior, and it's not giving them incorrect results.
|
Thanks for looking into this. For context, the approximate nearest neighbor section of the code is really old and was at the time implemented as a wrapper around the FAISS library. The C++ layer was then updated to be instead routed to cuVS. But, it looks like automatic parametrization was not updated accordingly. Thanks also for uncovering the memory management issues, some were really bad it seems. Isolating the ANN and RBC indices is a massive improvement for memory management, however I would strongly suggest turning the ANN section of the code into a simple routing to the cuVS Python bindings (removing the C++ layer). The cuVS Python bindings handle more ANN indices types and much more parameter types and should also set the defaults according to what cuVS expects. Additionally, the cuVS Python bindings also handle proper serialization and deserialization (save/load). The pickle file could simply save the path to the file containing the index to deserialize its content. This would add the cuvs Python library as a dependency though, but it may possibly be worth it. |
This is a refactor of how NearestNeighbors manages lifetimes for its internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The previous implementation had several bugs: - Pickling didn't work and could lead to corrupted state, double frees, or segfaults. - Certain operations could lead to double frees of the indices, or memory leaks. - Some temporary locals were allocated with `new` and never freed, leading to small memory leaks every fit call. This PR splits out the indices into their own small classes with robust lifetime management. Storing these as private attributes on the `NearestNeighbors` instances helps remove a bunch of lifetime/memory footguns. It also results in simpler code overall. Note that estimators pickled with these index types will currently rebuild the indices on reload. This is better than what happened before (silent corruption or full on crashes), but still not ideal. I hope to remedy this in follow ups (IVF* indices in cuvs should be able to support this, RBC is less straightforward). While fixing the lifetimes I did end up cleaning up some tangential code, but there should be no user-facing behavior changes beyond the bugfixes noted above.
ae168ad to
20a23b3
Compare
While I'm all for moving things to use the python layer, I don't believe this will (currently) be possible for cuvs. To ensure cuml can be installed from pypi we're planning on static linking libcuvs when building the libcuml wheels. Until cuvs itself can fit on pypi, we'll need to wrap the routines we use ourselves to make this work.
Pickled binaries should generally be self contained so they can be easily serialized between machines. I do hope to use those (C++) routines though to serialize the IVF* indices for pickling in a followup. |
dantegd
left a comment
There was a problem hiding this comment.
Changes look great! Just left a couple of comments
| "nprobe": 3, | ||
| "M": 0, | ||
| "n_bits": 8, | ||
| } |
There was a problem hiding this comment.
Fully agree with this comment, is there an issue open for the followup?
| def __getstate__(self): | ||
| state = self.__dict__.copy() | ||
| # TODO: Indices currently aren't pickleable. For now we drop them and | ||
| # recreate them on load. |
There was a problem hiding this comment.
Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?
| # differences upon reload. For now we check for comparable performance | ||
| # just to ensure things are wired together properly. | ||
| accuracy = (i1 == i2).sum() / i1.size | ||
| assert accuracy >= 0.9 |
There was a problem hiding this comment.
Why 0.9, seems like a safe bound, but was just wondering why the specific choice
There was a problem hiding this comment.
It's high enough but still gives some leeway for nondeterministic results.
|
/merge |
This was fixed in the recent refactor of `NearestNeighbors` (#7115), this PR just adds a test for `kneighbors_graph` across our algorithms. Fixes #5642. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Bradley Dice (https://github.com/bdice) - Dante Gama Dessavre (https://github.com/dantegd) URL: #7125
This finishes up a few remaining tasks for `cuml.neighbors` in #7317. Most of the work for this was already done in #7115. - Simple `__init__` everywhere - Releases the GIL for all calls to `libcuml` - Split long methods into subroutines for readability - Avoid unnecessary conversions in a few spots - General organizational cleanups for readability Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) URL: #7320
This is a refactor of how NearestNeighbors manages lifetimes for its internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The previous implementation had several bugs:
newand never freed, leading to small memory leaks every fit call.This PR splits out the indices into their own small classes with robust lifetime management. Storing these as private attributes on the
NearestNeighborsinstances helps remove a bunch of lifetime/memory footguns. It also results in simpler code overall.Note that estimators pickled with these index types will currently rebuild the indices on reload. This is better than what happened before (silent corruption or full on crashes), but still not ideal. I hope to remedy this in follow ups (IVF* indices in cuvs should be able to support this, RBC is less straightforward).
While fixing the lifetimes I did end up cleaning up some tangential code, but there should be no user-facing behavior changes beyond the bugfixes noted above.
Fixes #7066
Fixes #7067
Fixes #4020 (the issue was already fixed long ago, just ripped out the notes and warning here while I was moving code around).
Fixes #4629
Fixes #4743