Use RBC from cuVS #6644

Merged
rapids-bot[bot] merged 17 commits into rapidsai:branch-25.06 from
divyegala:cuvs-rbc
May 14, 2025

Conversation

@divyegala
Member

@divyegala divyegala commented May 7, 2025

Depends on rapidsai/cuvs#218. This PR reduces the supported combination of types for the RBC method in dbscan.cu to only <float, int64_t>. This is because it is the only type combination that cuVS compiles RBC for; RBC is otherwise very expensive and slow to compile.

Effects on Binary Size

Tracked here #6626 (comment)

@divyegala divyegala self-assigned this May 7, 2025
@divyegala divyegala requested a review from a team as a code owner May 7, 2025 22:35
@divyegala divyegala added the improvement label May 7, 2025
@divyegala divyegala requested review from a team as code owners May 7, 2025 22:35
@divyegala divyegala added the non-breaking label May 7, 2025
@github-actions github-actions bot added the Cython / Python, CMake, and CUDA/C++ labels May 7, 2025
@divyegala divyegala mentioned this pull request May 7, 2025
5 tasks
@divyegala divyegala changed the title Use RBC from cuVS [DO NOT MERGE] Use RBC from cuVS May 7, 2025
@divyegala divyegala changed the title [DO NOT MERGE] Use RBC from cuVS Use RBC from cuVS May 9, 2025
Contributor

@csadorf csadorf left a comment

Overall looks good, but I have a few questions and requests.

Comment thread cpp/cmake/thirdparty/get_cuvs.cmake Outdated
find_and_configure_cuvs(VERSION ${CUML_MIN_VERSION_cuvs}
FORK rapidsai
PINNED_TAG branch-${CUML_BRANCH_VERSION_cuvs}
PINNED_TAG fea-2408-rbc
Contributor

Can we either block this PR or create an issue to track unpinning, please?

Member Author

Your blocking review is fine, I will let you know when I unpin.

Member Author

Unpinned 9192e3d

Comment on lines +65 to +67
if algorithm == "rbc":
if datatype == np.float64 or out_dtype in ["int32", np.int32]:
pytest.skip("RBC does not support float64 dtype or int32 labels")
Contributor

⚠️ Is that a new limitation? If so then this is a breaking change.

Member Author

Yes, fair enough. Changed the labels.

Comment thread python/cuml/cuml/tests/test_dbscan.py
from libcpp cimport bool
from libcpp.vector cimport vector
from pylibraft.common.handle cimport handle_t
from pylibraft.common.mdspan cimport *
Contributor

I'm aware that this is a common pattern in this codebase, but should we try to avoid wildcard imports in the future?

Member Author

Can you explain why we need to avoid them? Happy to not do this, just want to know for my own knowledge.

Contributor

Quoting the "imports" section from PEP 8:

Wildcard imports (from <module> import *) should be avoided, as they make it unclear which names are present in the namespace, confusing both readers and many automated tools. [...]

It obfuscates what's actually present in the namespace, which makes it harder to understand what exact interface is exposed through the module and what a symbol's provenance is. This might be a Cython pattern that I am not aware of, but for general Python code this is an undisputed anti-pattern.

Member

I think it's less bad and not necessarily an antipattern for cimports. It's not uncommon in cython codebases to have large headers and include them with cimport * (using cimport * is basically the same as a C include). pyarrow does this a bunch, for example.

Qualified imports make it easier to understand what's being pulled in, and also lets linters check when a cimport is no longer needed (I just removed a bunch of unnecessary ones in #6600, for example). I wouldn't block on adding a cimport *, but if the number of included symbols is small, I also think it'd be nicer to spell them out explicitly.
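The provenance concern above can be made concrete with a small, self-contained Python illustration (hypothetical modules, not cuML code): with wildcard-style imports, one module's name silently shadows another's, and nothing in the importing file shows the collision.

```python
import types

# Two hypothetical modules that both define `distance`.
mod_a = types.ModuleType("mod_a")
exec("def distance(x, y):\n    return abs(x - y)", mod_a.__dict__)
mod_b = types.ModuleType("mod_b")
exec("def distance(x, y):\n    return (x - y) ** 2", mod_b.__dict__)

# Simulate `from mod_a import *` followed by `from mod_b import *`:
# each copies every public name into the local namespace.
ns = {}
ns.update({k: v for k, v in vars(mod_a).items() if not k.startswith("_")})
ns.update({k: v for k, v in vars(mod_b).items() if not k.startswith("_")})

# mod_b's `distance` silently replaced mod_a's; nothing here says so.
print(ns["distance"](1, 3))  # 4 (squared), not 2 (absolute)
```

An explicit `from mod_b import distance` would make both the origin and the shadowing visible to readers and linters.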

Contributor

@csadorf csadorf May 9, 2025

@jcrist Thanks for the perspective. Definitely not blocking for this PR. Just wanted to get a take on this.

Member Author

using cimport * is basically the same as a C include

This is the equivalence I was applying. But I removed the wildcard imports 2c30ad1, thanks both!

Member

@jcrist jcrist left a comment

Just a few questions, most of this looks like a straightforward port.

):
if algorithm == "rbc":
if datatype == np.float64 or out_dtype in ["int32", np.int32]:
pytest.skip("RBC does not support float64 dtype or int32 labels")
Member

Is this a reduction in support (and if so, why? the linked PR looked like it just moved what was in raft to cuvs, I'd expect support to remain the same)? Or did this not work before (and would fallback)?

Member

Also, what happens if you run DBSCAN with these params and dtypes? I see something in the c++ layer to log a warning and fallback - is that what's hit here?

Member Author

Is this a reduction in support (and if so, why? the linked PR looked like it just moved what was in raft to cuvs, I'd expect support to remain the same)? Or did this not work before (and would fallback)?

This is a reduction in support, yes. @csadorf also pointed it out here #6644 (comment). The reason RAFT supported it but cuVS does not is that RAFT was header-only, so we could compile for all the types we want, whereas cuVS pre-compiles these types for us. cuVS offers only float support.

Personally, I am fine with us not asking cuVS to provide double support because RBC is extremely expensive to compile. Every unique type combination adds 20 MB of binary size.

Also, what happens if you run DBSCAN with these params and dtypes? I see something in the c++ layer to log a warning and fallback - is that what's hit here?

Yes, the C++ layer logs a warning and provides a fallback. But in the tests we don't want to hit the fallback as the fallback is already tested as part of the param combinations.
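The warn-and-fall-back behavior described here can be sketched in plain Python. This is a hypothetical illustration of the dispatch logic, not cuML's actual code; the function name and return values are invented.

```python
import warnings

import numpy as np

def select_dbscan_algorithm(algorithm, dtype, out_dtype):
    """Hypothetical sketch: cuVS precompiles RBC only for
    <float, int64_t>, so any other combination warns and
    falls back to the brute-force strategy."""
    if algorithm == "rbc" and (dtype != np.float32 or out_dtype != np.int64):
        warnings.warn(
            "RBC supports only float32 data with int64 labels; "
            "falling back to brute force"
        )
        return "brute"
    return algorithm

print(select_dbscan_algorithm("rbc", np.float32, np.int64))  # rbc
print(select_dbscan_algorithm("rbc", np.float64, np.int64))  # brute, with a warning
```

In the tests, skipping the unsupported combinations avoids exercising this fallback path redundantly, since the brute-force strategy is already covered by other parameter combinations.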

Member

Personally I'm fine with dropping this and if the algorithm still runs without it (and warns) then I don't think this is breaking enough to be worthy of a deprecation cycle.

But in the tests we don't want to hit the fallback as the fallback is already tested as part of the param combination

That said, I do think it's worth testing that the fallback actually falls back. Are you saying that the fallback (RBC w/ these datatypes) is run elsewhere and we see the fallback is hit there? Or only that the other algorithm is tested elsewhere? If the latter, then I think we'll want to ensure the former is tested somewhere.

Also FWIW, unless this test fails due to numeric differences or takes a ton of time, I don't see value in skipping it here personally.

Member Author

I see your point, it is the latter. The tests are quick and not really flaky. I'll remove the skip.

Member Author

I "un"skipped the fallback test in our C++ tests because they are faster to run and generally more stable. 5214539

@divyegala divyegala added the breaking label and removed the non-breaking label May 9, 2025
@csadorf
Contributor

csadorf commented May 9, 2025

This is because it is the only type combination that cuVS compiles RBC for; RBC is otherwise very expensive and slow to compile.

What's the expected user impact? Should this go through a deprecation cycle? Is this still an issue if we statically link to cuVS?

@divyegala
Member Author

divyegala commented May 9, 2025

What's the expected user impact?

Nothing apparent, we have a fallback available. While RBC is definitely an optimization over the BRUTE_FORCE strategy, it is just too expensive for us to compile and support 4 different type combinations. I think just float is good enough for almost all use cases. Some context on RBC: it is primarily a low-dimensional optimization and is most performant for 2 or 3 columns in the data matrix.

Should this go through a deprecation cycle?

This is a good question. If you feel strongly about it, then we can. It will most likely delay our PyPI plans though.

Is this still an issue if we statically link to cuVS?

Yes, static or dynamic link does not matter. If cuVS does not provide the type support we can't use it.

@csadorf
Contributor

csadorf commented May 9, 2025

What's the expected user impact?

Nothing apparent, we have a fallback available. While RBC is definitely an optimization over the BRUTE_FORCE strategy, it is just too expensive for us to compile and support 4 different type combinations. I think just float is good enough for almost all use cases. Some context on RBC: it is primarily a low-dimensional optimization and is most performant for 2 or 3 columns in the data matrix.

But here we are actually limiting the datatype, not just the index type, aren't we?

Should this go through a deprecation cycle?

This is a good question. If you feel strongly about it, then we can. It will most likely delay our PyPI plans though.

We need to understand the user impact to be able to weigh that decision. Having a fallback to a less performant method is insufficient mitigation IMO. I would assume that dropping support for float64 datatypes has more than just marginal impact.

@cjnolet I would be interested in your take on this, too. Can we safely assume that most DBSCAN users would either prefer or be fine with working with single-precision datasets?

@github-actions github-actions Bot removed the CMake label May 9, 2025
@divyegala divyegala removed request for a team, bdice and robertmaynard May 9, 2025 21:26
@divyegala divyegala requested review from csadorf and jcrist May 13, 2025 03:00
@divyegala
Member Author

@viclafargue can you review this PR?

Contributor

@csadorf csadorf left a comment

I'm approving, because I am convinced that removing double precision support will have very limited, albeit non-zero impact on users.

That said, for changes of this nature in the future, I would recommend more upfront communication about new limitations and providing users with adequate time to adapt through a deprecation cycle.

While I have some concerns about the implementation process, I understand that this change is necessary to advance cuVS adoption.

@cjnolet
Member

cjnolet commented May 13, 2025

I think just float is good enough for almost all use cases. Some context on RBC: it is primarily a low-dimensional optimization and is most performant for 2 or 3 columns in the data matrix.

Sorry for being late to this discussion @csadorf and @divyegala. No doubt, the decision to go from double + float support to just float is going to have a non-zero impact, but the longer we go and the more we're faced with these expensive decisions about hosting device code for multiple formats, the more I'm thinking we should start moving towards supporting only float across most, if not all, of our algorithms.

  1. The expense of supporting double out of the box is much greater than users understanding that they can normalize and/or scale their vectors in the (tiny) chance they ACTUALLY need double precision.
  2. One reason we opted to support both double and float from the start was that we wanted to be as user friendly as possible.
  3. Another, hidden, reason is that double can be more accurate for certain computations (such as gradients in solvers and distances), which can sometimes eat up the excess precision available in floats.
  4. However, in this latter case, I think the proper way to handle it is to promote to double during those computations and use float everywhere else. Hopefully we will start moving in this direction.
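The "promote internally" idea in point 4 can be illustrated with a small NumPy sketch (illustrative only, not cuML code): data stays in float32, but a precision-sensitive reduction is carried out in float64.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000).astype(np.float32)  # data stored as float32

# Naive: accumulate the sum of squares entirely in float32.
naive = np.float32(0.0)
for v in x:
    naive = np.float32(naive + v * v)

# Promoted: upcast to float64 just for the reduction,
# keeping the stored data in float32.
promoted = float(np.sum(x.astype(np.float64) ** 2))

# High-precision reference for comparison.
ref = math.fsum(float(v) ** 2 for v in x)

print(abs(naive - ref), abs(promoted - ref))  # promoted error is far smaller
```

The storage cost stays at float32 while the accumulation error drops by many orders of magnitude, which is the trade-off being advocated above.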

@csadorf I agree, it's unfortunate this came up as a rather last-minute fix/workaround, and I definitely agree we should have more discussions about how we are going to migrate longer term.

Member

@dantegd dantegd left a comment

After many years developing our algorithms, I agree with @cjnolet's analysis and points.

For changing an algorithm's full support (say, whether RF supports float64 or not) we would definitely need a deprecation cycle and more discussion and analysis, but the impact here is even smaller than that would be, and it seems like an acceptable choice to me.

Comment thread cpp/src/dbscan/dbscan.cuh Outdated
@divyegala
Member Author

/merge

@rapids-bot rapids-bot Bot merged commit a22a259 into rapidsai:branch-25.06 May 14, 2025
92 of 93 checks passed
Contributor

@viclafargue viclafargue left a comment

Apologies for the delayed review, @divyegala. I've gone through it, and everything looks good to me.

@divyegala divyegala linked an issue May 14, 2025 that may be closed by this pull request

Labels

breaking, CUDA/C++, Cython / Python, improvement

Development

Successfully merging this pull request may close these issues.

Reduce object sizes of dbscan and knn

7 participants