Run the upstream HDBSCAN tests in CI by jcrist · Pull Request #6995 · rapidsai/cuml

jcrist · 2025-07-09T17:56:27Z

This adds the HDBSCAN upstream test suite to CI.

A few complications:

The upstream hdbscan library has seen a lapse in maintenance lately. Thus, the test suite won't run cleanly (or even import) without sklearn < 1.6. Rather than pinning the sklearn version across the upstream test jobs (which would require a new entry in dependencies.yaml and changes to the test_python_common.sh script), for now we just skip the HDBSCAN tests if that condition isn't met. The oldest deps run will run the HDBSCAN tests, but not the newest deps run.
There are many failures in the HDBSCAN tests, most of which are memory errors. The xfail list here is stable (I've ran it 10s of times to be sure). That said, if you use pytest-randomly to shuffle the test order, you'll see a different set of failures. Not great. I figure it's still worth getting the test infra in place though before taking a look at the underlying memory issues.

Currently on top of #6994.

Without a `pytest.ini`, the upstream tests will use the pytest config for cuml itself (which turns Deprecation/FutureWarning into errors). This can lead to test failures due to things outside our control. This PR adds a small `pytest.ini` to the upstream tests, and changes the pytest invocation so our upstream tests use it. It currently contains the following warning config: - `UnmatchedXfailTests`: error if there's an issue with the xfail list - `FutureWarning` raised by cuml: error if our code warns of a deprecation. - `PytestUnknownMarkWarning`: ignored, our xfail-list plugin creates a bunch of unknown marks, no need to warn about these. Also fixes some annoying UX in the umap/run-tests.sh script where it would `cd` into the directory before executing, meaning any forwarded args (like `--junit-xml`) would have the incorrect relative path.

This adds the HDBSCAN upstream test suite to CI. A few complications: - The upstream `hdbscan` library has seen a lapse in maintenance lately. Thus, the test suite won't run cleanly (or even import) without sklearn < 1.6. Rather than pinning the sklearn version across the upstream test jobs (which would require a new entry in `dependencies.yaml` and changes to the `test_python_common.sh` script), for now we just skip the HDBSCAN tests if that condition isn't met. The oldest deps run will run the HDBSCAN tests, but not the newest deps run. - There are many failures in the HDBSCAN tests, most of which are memory errors. The xfail list here is stable (I've ran it 10s of times to be sure). That said, if you use `pytest-randomly` to shuffle the test order, you'll see a different set of failures. Not great.

csadorf · 2025-07-09T18:24:02Z

Merge after #6845 and #6994.

jcrist · 2025-07-09T18:33:32Z

Hmmm, looks like the memory failures (while consistent on my machine) are different in CI. Two tasks then:

I'll look into marking more of them flaky to attempt to get the infra in.
I'll start on work to open reproducible issues without cuml.accel for some. HDBSCAN cudaErrorInvalidDevice if min_cluster_size >= n_samples #6985 is one already.

jcrist · 2025-07-10T16:22:05Z

Looked into this more this morning. There are 3 categories of failures:

Simple python-level failures (most of these should be quick fixes)
Tests that always fail, and break downstream tests (there's 5 of these, at least on my machine)
Tests that sometimes fail, and break downstream tests (this list seems impossible to nail down, it might be every test)

Given this, I think we won't be able to run these tests (at least in a required manner) until the memory bugs are fixed. A few options I see for now:

Don't merge this PR, only try to split out the always failing tests into new issues that can be resolved without the upstream hdbscan tests or cuml.accel. Try again once those are fixed.
Merge this PR, but remove the part adding this to CI. Gives devs the infra setup so they can run the test suite to help resolve memory issues. I'd still break out issues, but we'd have more stuff already merged in.
Merge this PR, but ignore failures. This could be done by very very liberally marking tests as flaky, or through ignoring the pytest failure in the ci bash script.

Clearly we have some bugs in our hdbscan implementation, these we should prioritize resolving. Likewise, the python-level failures are worth resolving since they're possibly easy fixes and might affect cuml.accel users running hdbscan. The actual hdbscan upstream test suite is of pretty low quality for us though, it's not clear to me that we'd gain a ton of value running it in CI beyond hdbscan smoketests (which we could as easily write ourselves in the cuml_accel_tests/integration/ directory). Given this, I'd lean towards option 1 or 2 above.

csadorf · 2025-07-10T21:15:33Z

I very much prefer option 2. Let's get the infrastructure set up and start working on addressing those issues. I don't think running tests in CI that fail beyond xfails is advantageous as it just creates confusion and noise. We should add some smoke tests to cuml_accel_tests/integration/ either way.

We can't run these consistently, holding off adding them for now.

jcrist · 2025-07-14T22:17:37Z

/merge

jcrist added 2 commits July 9, 2025 10:00

jcrist self-assigned this Jul 9, 2025

jcrist requested review from a team as code owners July 9, 2025 17:56

jcrist requested a review from jameslamb July 9, 2025 17:56

jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 9, 2025

jcrist requested review from cjnolet and dantegd July 9, 2025 17:56

jcrist added the cuml-accel Issues related to cuml.accel label Jul 9, 2025

github-actions Bot added Cython / Python Cython or Python issue ci labels Jul 9, 2025

jcrist requested review from csadorf and removed request for cjnolet July 9, 2025 18:08

csadorf approved these changes Jul 9, 2025

View reviewed changes

WIP

cc40bab

Revert running tests in CI

aafbc18

We can't run these consistently, holding off adding them for now.

github-actions Bot removed the ci label Jul 14, 2025

rapids-bot Bot merged commit f594e08 into rapidsai:branch-25.08 Jul 14, 2025
73 checks passed

jcrist deleted the hdbscan-tests branch July 14, 2025 22:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run the upstream HDBSCAN tests in CI#6995

Run the upstream HDBSCAN tests in CI#6995
rapids-bot[bot] merged 4 commits intorapidsai:branch-25.08from
jcrist:hdbscan-tests

jcrist commented Jul 9, 2025 •

edited

Loading

Uh oh!

csadorf commented Jul 9, 2025

Uh oh!

jcrist commented Jul 9, 2025

Uh oh!

jcrist commented Jul 10, 2025 •

edited

Loading

Uh oh!

csadorf commented Jul 10, 2025

Uh oh!

jcrist commented Jul 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jcrist commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

csadorf commented Jul 9, 2025

Uh oh!

jcrist commented Jul 9, 2025

Uh oh!

jcrist commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

csadorf commented Jul 10, 2025

Uh oh!

jcrist commented Jul 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jcrist commented Jul 9, 2025 •

edited

Loading

jcrist commented Jul 10, 2025 •

edited

Loading