Skip to content

Enable HDBSCAN gpu training and cpu inference#6108

Merged
rapids-bot[bot] merged 4 commits intorapidsai:branch-24.12from
divyegala:hdbscan-gpu-train-cpu-infer
Oct 17, 2024
Merged

Enable HDBSCAN gpu training and cpu inference#6108
rapids-bot[bot] merged 4 commits intorapidsai:branch-24.12from
divyegala:hdbscan-gpu-train-cpu-infer

Conversation

@divyegala
Copy link
Copy Markdown
Member

@divyegala divyegala commented Oct 11, 2024

Until now, we supported all combinations of GPU/CPU interoperability except the one mentioned in the title. This was because the CPU HDBSCAN package was missing attribute setters. With scikit-learn-contrib/hdbscan#657, attribute setters are now available which allow us to transfer GPU trained attributes to the CPU model. This feature is available as part of hdbscan=0.8.39

@divyegala divyegala added feature request New feature or request DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels Oct 11, 2024
@divyegala divyegala self-assigned this Oct 11, 2024
@github-actions github-actions Bot added conda conda issue Cython / Python Cython or Python issue labels Oct 11, 2024
Comment thread dependencies.yaml
- statsmodels
- umap-learn==0.5.6
- pynndescent
- setuptools # Needed on Python 3.12 for dask-glm, which requires pkg_resources but Python 3.12 doesn't have setuptools by default
Copy link
Copy Markdown
Member Author

@divyegala divyegala Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dask-glm was removed by PR #6028 so this is now unnecessary

@divyegala divyegala removed the DO NOT MERGE Hold off on merging; see PR for details label Oct 16, 2024
@divyegala divyegala marked this pull request as ready for review October 16, 2024 22:54
@divyegala divyegala requested review from a team as code owners October 16, 2024 22:54
Copy link
Copy Markdown
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me, had a question about a comment but that's all

Comment on lines +172 to +174
# These attributes have to be reassigned to the CPU model
# as the raw arrays because the reference HDBSCAN implementation
# reconstructs the objects from the raw arrays
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens in the setters in the hdbscan library, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially it happens in the getters. Here's the issue, consider CondensedTree object:

  1. We use setter to assign CondensedTree object to self.condensed_tree_ from cuML to hdbscan
  2. The getter for self.condensed_tree_ checks if it has a value already. If it does, it assumes that it is raw numpy arrays and creates another CondensedTree object without any value sanitization

That's why I re-assigned the raw arrays, so when hdbscan library internally calls the getters it reconstructs the object correctly.

@divyegala
Copy link
Copy Markdown
Member Author

/merge

@rapids-bot rapids-bot Bot merged commit fb4c8af into rapidsai:branch-24.12 Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conda conda issue Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants