Redo memory management for NearestNeighbors by jcrist · Pull Request #7115 · rapidsai/cuml

jcrist · 2025-08-14T14:41:28Z

This is a refactor of how NearestNeighbors manages lifetimes for its internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The previous implementation had several bugs:

Pickling didn't work and could lead to corrupted state, double frees, or segfaults.
Certain operations could lead to double frees of the indices, or memory leaks.
Some temporary locals were allocated with new and never freed, leading to small memory leaks every fit call.

This PR splits out the indices into their own small classes with robust lifetime management. Storing these as private attributes on the NearestNeighbors instances helps remove a bunch of lifetime/memory footguns. It also results in simpler code overall.

Note that estimators pickled with these index types will currently rebuild the indices on reload. This is better than what happened before (silent corruption or full on crashes), but still not ideal. I hope to remedy this in follow ups (IVF* indices in cuvs should be able to support this, RBC is less straightforward).

While fixing the lifetimes I did end up cleaning up some tangential code, but there should be no user-facing behavior changes beyond the bugfixes noted above.

Fixes #7066
Fixes #7067
Fixes #4020 (the issue was already fixed long ago, just ripped out the notes and warning here while I was moving code around).
Fixes #4629
Fixes #4743

jcrist · 2025-08-14T14:43:22Z

-        cdef uintptr_t _D_ptr = D_ndarr.ptr
-        cdef handle_t* handle_ = <handle_t*><size_t>self.handle.getHandle()
-        cdef vector[float*] *inputs = new vector[float*]()
-        cdef vector[int] *sizes = new vector[int]()


Previously these vectors would leak every call to fit.

jcrist · 2025-08-14T14:47:44Z

    def effective_metric_params_(self):
        return self.metric_params or {}

-    def __del__(self):


__del__ on an estimator itself is a red flag. It's not universally bad, but every case I've seen of it in cuml currently is doing memory management incorrectly and has bugs. If an estimator contains non-trivial state (e.g. things that aren't wrapped in python array classes) that needs to be freed, that state should typically be wrapped in a private class and managed with __dealloc__/__cinit__ or other native methods. Casting pointers to python ints to pass around and hope everything works out is a recipe for bugs.

I'll open followup issues for the other cases where this is still done in cuml.

jcrist · 2025-08-14T14:52:26Z

+                "nprobe": 3,
+                "M": 0,
+                "n_bits": 8,
+            }


The previous code for generating these default values (in ann.pyx) had lots of loops and logic, but would almost always return {"nlist": 8, "nprobe": 3, "M": 1, "n_bits": 4}, which was invalid for cuvs (M * n_bits must be a multiple of 8).

I'd like to cleanup this parameter handling in a followup (as described by the TODO), but for now I've simplified the logic to use a fixed set of defaults that are always valid. M=0 relies on cuvs to pick a good value for M, and n_bits=8 is the cuvs default. Given that the old code would generate invalid defaults for most inputs I think this is better.

Fully agree with this comment, is there an issue open for the followup?

I'll open a followup

jcrist · 2025-08-14T14:57:35Z

+    def __getstate__(self):
+        state = self.__dict__.copy()
+        # TODO: Indices currently aren't pickleable. For now we drop them and
+        # recreate them on load.


Previously pickling NearestNeighbors/KNeighborsClassifier/KNeighborsRegressor with algorithm in {"rbc", "ivfpq", "ivfflat"} would silently succeed in our tests (where the pointer still existed), but in any realistic use would crash due to invalid pointers on load (this is why you should not store pointers to native objects as integers on python objects).

We now succeed in pickling in these cases, but recreate the indices on load. I don't think this is wrong, but it's maybe not ideal. To meet the sklearn conventions we already need to include the original X in the serialized state; recreating the index on load takes some time - but serializing the index would increase the binary size since the original data is also there. Something to maybe handle in a followup.

Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?

I'll create a followup issue. I don't think a warning is necessary in this case - it's not wrong to do it this way, and it's still something we hope to fixup. Warnings IMO are for warning the user about something they can change that will remove the warning (deprecation, parameter out of range moved to threshold, ...). In this case there's not much they can do to adjust the behavior, and it's not giving them incorrect results.

viclafargue · 2025-08-19T11:49:55Z

Thanks for looking into this. For context, the approximate nearest neighbor section of the code is really old and was at the time implemented as a wrapper around the FAISS library. The C++ layer was then updated to be instead routed to cuVS. But, it looks like automatic parametrization was not updated accordingly.

Thanks also for uncovering the memory management issues, some were really bad it seems. Isolating the ANN and RBC indices is a massive improvement for memory management, however I would strongly suggest turning the ANN section of the code into a simple routing to the cuVS Python bindings (removing the C++ layer). The cuVS Python bindings handle more ANN indices types and much more parameter types and should also set the defaults according to what cuVS expects. Additionally, the cuVS Python bindings also handle proper serialization and deserialization (save/load). The pickle file could simply save the path to the file containing the index to deserialize its content. This would add the cuvs Python library as a dependency though, but it may possibly be worth it.

This is a refactor of how NearestNeighbors manages lifetimes for its internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The previous implementation had several bugs: - Pickling didn't work and could lead to corrupted state, double frees, or segfaults. - Certain operations could lead to double frees of the indices, or memory leaks. - Some temporary locals were allocated with `new` and never freed, leading to small memory leaks every fit call. This PR splits out the indices into their own small classes with robust lifetime management. Storing these as private attributes on the `NearestNeighbors` instances helps remove a bunch of lifetime/memory footguns. It also results in simpler code overall. Note that estimators pickled with these index types will currently rebuild the indices on reload. This is better than what happened before (silent corruption or full on crashes), but still not ideal. I hope to remedy this in follow ups (IVF* indices in cuvs should be able to support this, RBC is less straightforward). While fixing the lifetimes I did end up cleaning up some tangential code, but there should be no user-facing behavior changes beyond the bugfixes noted above.

jcrist · 2025-08-19T16:18:51Z

however I would strongly suggest turning the ANN section of the code into a simple routing to the cuVS Python bindings (removing the C++ layer).
...
This would add the cuvs Python library as a dependency though, but it may possibly be worth it.

While I'm all for moving things to use the python layer, I don't believe this will (currently) be possible for cuvs. To ensure cuml can be installed from pypi we're planning on static linking libcuvs when building the libcuml wheels. Until cuvs itself can fit on pypi, we'll need to wrap the routines we use ourselves to make this work.

The pickle file could simply save the path to the file containing the index to deserialize its content.

Pickled binaries should generally be self contained so they can be easily serialized between machines. I do hope to use those (C++) routines though to serialize the IVF* indices for pickling in a followup.

dantegd

Changes look great! Just left a couple of comments

dantegd · 2025-08-20T19:27:24Z

+                "nprobe": 3,
+                "M": 0,
+                "n_bits": 8,
+            }


Fully agree with this comment, is there an issue open for the followup?

dantegd · 2025-08-20T19:29:19Z

+    def __getstate__(self):
+        state = self.__dict__.copy()
+        # TODO: Indices currently aren't pickleable. For now we drop them and
+        # recreate them on load.


Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?

dantegd · 2025-08-20T19:31:29Z

+        # differences upon reload. For now we check for comparable performance
+        # just to ensure things are wired together properly.
+        accuracy = (i1 == i2).sum() / i1.size
+        assert accuracy >= 0.9


Why 0.9, seems like a safe bound, but was just wondering why the specific choice

It's high enough but still gives some leeway for nondeterministic results.

jcrist · 2025-08-20T20:32:08Z

/merge

This was fixed in the recent refactor of `NearestNeighbors` (#7115), this PR just adds a test for `kneighbors_graph` across our algorithms. Fixes #5642. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Bradley Dice (https://github.com/bdice) - Dante Gama Dessavre (https://github.com/dantegd) URL: #7125

This finishes up a few remaining tasks for `cuml.neighbors` in #7317. Most of the work for this was already done in #7115. - Simple `__init__` everywhere - Releases the GIL for all calls to `libcuml` - Split long methods into subroutines for readability - Avoid unnecessary conversions in a few spots - General organizational cleanups for readability Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) URL: #7320

jcrist self-assigned this Aug 14, 2025

jcrist requested a review from a team as a code owner August 14, 2025 14:41

jcrist added the Cython / Python Cython or Python issue label Aug 14, 2025

jcrist requested a review from a team as a code owner August 14, 2025 14:41

jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 14, 2025

jcrist requested review from KyleFromNVIDIA, betatim and divyegala August 14, 2025 14:41

github-actions Bot added the CMake label Aug 14, 2025

jcrist commented Aug 14, 2025

View reviewed changes

dantegd reviewed Aug 19, 2025

View reviewed changes

Comment thread python/cuml/tests/test_pickle.py

jcrist added 3 commits August 19, 2025 11:10

Update xfail list

eb94c3e

Update docstring

20a23b3

jcrist force-pushed the refactor-nearest-neighbors branch from ae168ad to 20a23b3 Compare August 19, 2025 16:15

dantegd approved these changes Aug 20, 2025

View reviewed changes

rapids-bot Bot merged commit d44e401 into rapidsai:branch-25.10 Aug 20, 2025
71 checks passed

jcrist deleted the refactor-nearest-neighbors branch August 20, 2025 20:32

This was referenced Aug 20, 2025

Make indices in NearestNeighbors pickleable #7122

Open

Support kneighbors_graph for approximate algos #7125

Merged

csadorf mentioned this pull request Aug 29, 2025

Deprecate/remove get_json/get_summary_text/get_detailed_text in cuml.ensemble? #7159

Closed

This was referenced Oct 7, 2025

[TRACKER] Cleanup python estimator implementations #7317

Open

A few cuml.neighbors cleanups #7320

Merged

Conversation

jcrist commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

viclafargue commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jcrist commented Aug 19, 2025

Uh oh!

dantegd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jcrist commented Aug 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jcrist commented Aug 14, 2025 •

edited

Loading

viclafargue commented Aug 19, 2025 •

edited

Loading