Skip to content

Redo memory management for NearestNeighbors#7115

Merged
rapids-bot[bot] merged 3 commits intorapidsai:branch-25.10from
jcrist:refactor-nearest-neighbors
Aug 20, 2025
Merged

Redo memory management for NearestNeighbors#7115
rapids-bot[bot] merged 3 commits intorapidsai:branch-25.10from
jcrist:refactor-nearest-neighbors

Conversation

@jcrist
Copy link
Copy Markdown
Member

@jcrist jcrist commented Aug 14, 2025

This is a refactor of how NearestNeighbors manages lifetimes for its internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The previous implementation had several bugs:

  • Pickling didn't work and could lead to corrupted state, double frees, or segfaults.
  • Certain operations could lead to double frees of the indices, or memory leaks.
  • Some temporary locals were allocated with new and never freed, leading to small memory leaks every fit call.

This PR splits out the indices into their own small classes with robust lifetime management. Storing these as private attributes on the NearestNeighbors instances helps remove a bunch of lifetime/memory footguns. It also results in simpler code overall.

Note that estimators pickled with these index types will currently rebuild the indices on reload. This is better than what happened before (silent corruption or full on crashes), but still not ideal. I hope to remedy this in follow ups (IVF* indices in cuvs should be able to support this, RBC is less straightforward).

While fixing the lifetimes I did end up cleaning up some tangential code, but there should be no user-facing behavior changes beyond the bugfixes noted above.

Fixes #7066
Fixes #7067
Fixes #4020 (the issue was already fixed long ago, just ripped out the notes and warning here while I was moving code around).
Fixes #4629
Fixes #4743

@jcrist jcrist self-assigned this Aug 14, 2025
@jcrist jcrist requested a review from a team as a code owner August 14, 2025 14:41
@jcrist jcrist added the Cython / Python Cython or Python issue label Aug 14, 2025
@jcrist jcrist requested a review from a team as a code owner August 14, 2025 14:41
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 14, 2025
@github-actions github-actions Bot added the CMake label Aug 14, 2025
cdef uintptr_t _D_ptr = D_ndarr.ptr
cdef handle_t* handle_ = <handle_t*><size_t>self.handle.getHandle()
cdef vector[float*] *inputs = new vector[float*]()
cdef vector[int] *sizes = new vector[int]()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously these vectors would leak every call to fit.

def effective_metric_params_(self):
return self.metric_params or {}

def __del__(self):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__del__ on an estimator itself is a red flag. It's not universally bad, but every case I've seen of it in cuml currently is doing memory management incorrectly and has bugs. If an estimator contains non-trivial state (e.g. things that aren't wrapped in python array classes) that needs to be freed, that state should typically be wrapped in a private class and managed with __dealloc__/__cinit__ or other native methods. Casting pointers to python ints to pass around and hope everything works out is a recipe for bugs.

I'll open followup issues for the other cases where this is still done in cuml.

"nprobe": 3,
"M": 0,
"n_bits": 8,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code for generating these default values (in ann.pyx) had lots of loops and logic, but would almost always return {"nlist": 8, "nprobe": 3, "M": 1, "n_bits": 4}, which was invalid for cuvs (M * n_bits must be a multiple of 8).

I'd like to cleanup this parameter handling in a followup (as described by the TODO), but for now I've simplified the logic to use a fixed set of defaults that are always valid. M=0 relies on cuvs to pick a good value for M, and n_bits=8 is the cuvs default. Given that the old code would generate invalid defaults for most inputs I think this is better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree with this comment, is there an issue open for the followup?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a followup

def __getstate__(self):
state = self.__dict__.copy()
# TODO: Indices currently aren't pickleable. For now we drop them and
# recreate them on load.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously pickling NearestNeighbors/KNeighborsClassifier/KNeighborsRegressor with algorithm in {"rbc", "ivfpq", "ivfflat"} would silently succeed in our tests (where the pointer still existed), but in any realistic use would crash due to invalid pointers on load (this is why you should not store pointers to native objects as integers on python objects).

We now succeed in pickling in these cases, but recreate the indices on load. I don't think this is wrong, but it's maybe not ideal. To meet the sklearn conventions we already need to include the original X in the serialized state; recreating the index on load takes some time - but serializing the index would increase the binary size since the original data is also there. Something to maybe handle in a followup.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create a followup issue. I don't think a warning is necessary in this case - it's not wrong to do it this way, and it's still something we hope to fixup. Warnings IMO are for warning the user about something they can change that will remove the warning (deprecation, parameter out of range moved to threshold, ...). In this case there's not much they can do to adjust the behavior, and it's not giving them incorrect results.

Comment thread python/cuml/tests/test_pickle.py
@viclafargue
Copy link
Copy Markdown
Contributor

viclafargue commented Aug 19, 2025

Thanks for looking into this. For context, the approximate nearest neighbor section of the code is really old and was at the time implemented as a wrapper around the FAISS library. The C++ layer was then updated to be instead routed to cuVS. But, it looks like automatic parametrization was not updated accordingly.

Thanks also for uncovering the memory management issues, some were really bad it seems. Isolating the ANN and RBC indices is a massive improvement for memory management, however I would strongly suggest turning the ANN section of the code into a simple routing to the cuVS Python bindings (removing the C++ layer). The cuVS Python bindings handle more ANN indices types and much more parameter types and should also set the defaults according to what cuVS expects. Additionally, the cuVS Python bindings also handle proper serialization and deserialization (save/load). The pickle file could simply save the path to the file containing the index to deserialize its content. This would add the cuvs Python library as a dependency though, but it may possibly be worth it.

jcrist added 3 commits August 19, 2025 11:10
This is a refactor of how NearestNeighbors manages lifetimes for its
internal state, specifically the fit indices for RBC/IVFPQ/IVFFlat. The
previous implementation had several bugs:

- Pickling didn't work and could lead to corrupted state, double frees,
  or segfaults.
- Certain operations could lead to double frees of the indices, or
  memory leaks.
- Some temporary locals were allocated with `new` and never freed,
  leading to small memory leaks every fit call.

This PR splits out the indices into their own small classes with robust
lifetime management. Storing these as private attributes on the
`NearestNeighbors` instances helps remove a bunch of lifetime/memory
footguns. It also results in simpler code overall.

Note that estimators pickled with these index types will currently
rebuild the indices on reload. This is better than what happened before
(silent corruption or full on crashes), but still not ideal. I hope to
remedy this in follow ups (IVF* indices in cuvs should be able to
support this, RBC is less straightforward).

While fixing the lifetimes I did end up cleaning up some tangential
code, but there should be no user-facing behavior changes beyond the
bugfixes noted above.
@jcrist jcrist force-pushed the refactor-nearest-neighbors branch from ae168ad to 20a23b3 Compare August 19, 2025 16:15
@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Aug 19, 2025

however I would strongly suggest turning the ANN section of the code into a simple routing to the cuVS Python bindings (removing the C++ layer).
...
This would add the cuvs Python library as a dependency though, but it may possibly be worth it.

While I'm all for moving things to use the python layer, I don't believe this will (currently) be possible for cuvs. To ensure cuml can be installed from pypi we're planning on static linking libcuvs when building the libcuml wheels. Until cuvs itself can fit on pypi, we'll need to wrap the routines we use ourselves to make this work.

The pickle file could simply save the path to the file containing the index to deserialize its content.

Pickled binaries should generally be self contained so they can be easily serialized between machines. I do hope to use those (C++) routines though to serialize the IVF* indices for pickling in a followup.

Copy link
Copy Markdown
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great! Just left a couple of comments

"nprobe": 3,
"M": 0,
"n_bits": 8,
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree with this comment, is there an issue open for the followup?

def __getstate__(self):
state = self.__dict__.copy()
# TODO: Indices currently aren't pickleable. For now we drop them and
# recreate them on load.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create an issue, and perhaps it'd be a good idea to raise a warning to users, since many users might not see the warnings in the docs/docstrings and be surprised?

# differences upon reload. For now we check for comparable performance
# just to ensure things are wired together properly.
accuracy = (i1 == i2).sum() / i1.size
assert accuracy >= 0.9
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 0.9, seems like a safe bound, but was just wondering why the specific choice

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's high enough but still gives some leeway for nondeterministic results.

@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Aug 20, 2025

/merge

@rapids-bot rapids-bot Bot merged commit d44e401 into rapidsai:branch-25.10 Aug 20, 2025
71 checks passed
@jcrist jcrist deleted the refactor-nearest-neighbors branch August 20, 2025 20:32
rapids-bot Bot pushed a commit that referenced this pull request Aug 21, 2025
This was fixed in the recent refactor of `NearestNeighbors` (#7115), this PR just adds a test for `kneighbors_graph` across our algorithms.

Fixes #5642.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #7125
rapids-bot Bot pushed a commit that referenced this pull request Oct 10, 2025
This finishes up a few remaining tasks for `cuml.neighbors` in #7317. Most of the work for this was already done in #7115.

- Simple `__init__` everywhere
- Releases the GIL for all calls to `libcuml`
- Split long methods into subroutines for readability
- Avoid unnecessary conversions in a few spots
- General organizational cleanups for readability

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #7320
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

None yet

4 participants