Improved UMAP testing and debugging #7073

Merged
rapids-bot[bot] merged 28 commits into rapidsai:branch-25.10 from viclafargue:improved-umap-testing-and-debugging
Sep 30, 2025
Contributor

@viclafargue viclafargue commented Jul 31, 2025

Answers #7072.

Working on a single PR for now as some metrics are used both for debugging and testing.

This PR adds the following:

A debugging stack for UMAP:

  • Code to run the entire UMAP pipeline (cuML and reference): KNN graph, fuzzy simplicial set, spectral initialization, and simplicial set embedding.
  • Code to compare the reference and cuML runs: KNN recall, fuzzy simplicial set cross-entropy, and, most importantly, a range of metrics for comparing the quality of UMAP embeddings.
  • Code to produce a web report that visualizes everything being measured.
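
The two comparison metrics named above, KNN recall and fuzzy simplicial set cross-entropy, can be sketched in plain Python. This is only an illustration under assumptions: the function names `knn_recall` and `fuzzy_cross_entropy` and the dict-of-edges graph representation are hypothetical and are not taken from this PR, and the cross-entropy uses the fuzzy-set formulation from the UMAP paper, which may differ from what `umap_metrics.py` computes.

```python
import math


def knn_recall(ref_indices, test_indices):
    """Mean fraction of reference neighbors that also appear in the test run.

    Both arguments are lists of neighbor-index lists, one row per point.
    (Hypothetical helper, not the PR's implementation.)
    """
    if len(ref_indices) != len(test_indices):
        raise ValueError("both neighbor lists must cover the same points")
    total = 0.0
    for ref_row, test_row in zip(ref_indices, test_indices):
        ref_set = set(ref_row)
        total += len(ref_set & set(test_row)) / len(ref_set)
    return total / len(ref_indices)


def fuzzy_cross_entropy(p_graph, q_graph, eps=1e-12):
    """Fuzzy-set cross-entropy of q_graph relative to p_graph.

    Each graph maps an (i, j) edge to a membership strength in [0, 1];
    edges missing from a graph are treated as strength 0. Strengths are
    clipped to [eps, 1 - eps] so the logarithms stay finite.
    """
    total = 0.0
    for edge in set(p_graph) | set(q_graph):
        p = min(max(p_graph.get(edge, 0.0), eps), 1.0 - eps)
        q = min(max(q_graph.get(edge, 0.0), eps), 1.0 - eps)
        total += p * math.log(p / q)
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total
```

With this formulation, two identical graphs give a cross-entropy of exactly 0, and identical neighbor lists give a recall of 1.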

Improved testing:
✅ KNN testing
✅ Fuzzy simplicial set testing
✅ Spectral initialization testing
✅ Embedding optimization testing
🚧 cuml-accel specific testing (will give update on #6974)

Usage:

```
$ python python/cuml/cuml/testing/manifold/run_umap_debug.py --dataset "Swiss Roll"

UMAP Quality Assessment Script
==================================================
Implementation: both
Dataset: Swiss Roll
Web report: disabled
==================================================
Running assessment on dataset: Swiss Roll

Processing dataset: Swiss Roll
Running reference UMAP pipeline...
  Computing k-nearest neighbors (KNN) ...
  Computing fuzzy simplicial set ...
  Computing spectral initialization from fuzzy graph ...
  Computing 2-D embedding from fuzzy simplicial set ...
Running cuml UMAP pipeline...
  Computing k-nearest neighbors (KNN) ...
  Computing fuzzy simplicial set ...
  Computing spectral initialization from fuzzy graph ...
  Computing 2-D embedding from fuzzy simplicial set ...

==================================================
METRICS FOR SWISS ROLL (REFERENCE)
==================================================
Local Structure Preservation:
  Trustworthiness: 0.9977
  Continuity: 0.9998

Global Structure Preservation:
  Geodesic Spearman Correlation: 0.8791
  Geodesic Pearson Correlation: 0.8521
  DEMaP: 0.8521

Fuzzy Simplicial Set:
  Cross-entropy: 6543.2301

Topological Features (Betti Numbers):
  High-dim H0: 999
  High-dim H1: 351
  Low-dim H0: 999
  Low-dim H1: 75

==================================================
METRICS FOR SWISS ROLL (CUML)
==================================================
Local Structure Preservation:
  Trustworthiness: 0.9976
  Continuity: 0.9998

Global Structure Preservation:
  Geodesic Spearman Correlation: 0.8798
  Geodesic Pearson Correlation: 0.8674
  DEMaP: 0.8674

Fuzzy Simplicial Set:
  Cross-entropy: 7503.6173

Comparison with Reference Implementation:
  KNN Recall: 1.0000
  Fuzzy Cross-entropy (vs. reference): 0.0000

Topological Features (Betti Numbers):
  High-dim H0: 999
  High-dim H1: 351
  Low-dim H0: 999
  Low-dim H1: 65

==================================================
ANALYSIS COMPLETE!
Web report generation was disabled.
==================================================
```
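
For readers unfamiliar with the "Local Structure Preservation" figures in this output, here is a minimal pure-Python sketch of the standard rank-based definitions of trustworthiness and continuity. It is an illustration only: the helper names are hypothetical, the brute-force ranking is only practical for small inputs, and the code in `umap_metrics.py` may compute these values differently.

```python
import math


def _ranks(points):
    """ranks[i][j] is the position of j among points sorted by distance
    from i (1 = nearest neighbor; the point itself is excluded)."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        row = [0] * n
        for position, j in enumerate(order, start=1):
            row[j] = position
        ranks.append(row)
    return ranks


def _rank_penalty(ranks_a, ranks_b, k):
    """Sum of (rank in A - k) over pairs that are k-neighbors in B
    but not k-neighbors in A."""
    n = len(ranks_a)
    penalty = 0
    for i in range(n):
        for j in range(n):
            if j != i and ranks_b[i][j] <= k and ranks_a[i][j] > k:
                penalty += ranks_a[i][j] - k
    return penalty


def trustworthiness_and_continuity(high, low, k):
    """Return (trustworthiness, continuity) for an embedding `low` of `high`.

    Trustworthiness penalizes neighbors that appear only in the embedding;
    continuity penalizes original neighbors that the embedding lost.
    """
    n = len(high)
    if len(low) != n:
        raise ValueError("high and low must contain the same number of points")
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    r_high, r_low = _ranks(high), _ranks(low)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    trust = 1.0 - norm * _rank_penalty(r_high, r_low, k)
    cont = 1.0 - norm * _rank_penalty(r_low, r_high, k)
    return trust, cont
```

Both scores equal 1 when the embedding preserves every k-neighborhood, which is why values such as 0.9977 and 0.9998 indicate that local structure is almost fully preserved.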
```
$ python python/cuml/cuml/testing/manifold/run_umap_debug.py --dataset "Swiss Roll" --web-report
```

[Image attached to the PR description]

@viclafargue viclafargue requested a review from a team as a code owner July 31, 2025 13:03
@viclafargue viclafargue requested review from betatim and jcrist July 31, 2025 13:03
@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Jul 31, 2025
Contributor

csadorf commented Aug 4, 2025

@viclafargue Please make sure to target branch-25.10.

@csadorf csadorf added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 4, 2025
@viclafargue viclafargue changed the base branch from branch-25.08 to branch-25.10 August 5, 2025 07:33
@divyegala divyegala linked an issue Sep 8, 2025 that may be closed by this pull request
Member

betatim commented Sep 26, 2025

I started taking a look at this and tried running the command from the top comment. You need to install a few extra libraries (plotly, cuvs, ripser) to make it work. It may be worth adding a short README to the directory that explains what is needed and includes the two example commands from the top comment, so that people from the future have an easy time figuring out how to use this.
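
As an illustration of the point above, a script could check for its optional dependencies up front and report what is missing. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module list is taken from this comment, so the actual requirements of `run_umap_debug.py` may differ.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level modules from module_names that cannot be found."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Module names as listed in the review comment (assumed, not verified
    # against the script's imports).
    missing = missing_modules(["plotly", "cuvs", "ripser"])
    if missing:
        print("Missing optional dependencies: " + ", ".join(missing))
    else:
        print("All optional dependencies are available.")
```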

Comment thread python/cuml/cuml/testing/manifold/umap_metrics.py Outdated
Member

betatim commented Sep 29, 2025

I won't get around to thinking about this in time for merging it today. However, I think this is a debugging tool for developers, so merging it is low risk in my opinion.

@csadorf csadorf added the DO NOT MERGE Hold off on merging; see PR for details label Sep 30, 2025
Contributor

@csadorf csadorf left a comment

I think it's a good addition to improve our test quality, but I have a few comments that I'd like to see addressed.

I am not sure that python/cuml/cuml/testing/manifold/ specifically, or this repository in general, is really the right place to maintain this framework unless we make it accessible as an extra entrypoint or add it to the cuML Python API. Its dependencies are also not captured in dependencies.yaml. I think maintaining this in a separate repository is likely more advisable for now.

Comment thread python/cuml/cuml/testing/manifold/README.md Outdated
#### Running Tests

```bash
DATASET_DIR=datasets pytest python/cuml/cuml/testing/manifold/test_umap.py -v
```
Contributor

I ran this and almost all tests failed, because the datasets were not available. Is this expected? Am I supposed to download those separately? If so, how or where?

Contributor Author

@viclafargue viclafargue Sep 30, 2025

Yes, the datasets have to be downloaded separately. When the datasets are missing, the tests should warn the user that they need to download them. I just complemented this with a README update: the download commands are now given in both the tests and the README.
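
The behaviour described here, failing with clear guidance when a dataset is absent, could look roughly like the sketch below. The `DATASET_DIR` variable comes from the README command quoted above; the helper name `resolve_dataset` and the message wording are hypothetical and do not reflect the actual code in `test_umap.py`.

```python
import os
from pathlib import Path


def resolve_dataset(filename, env_var="DATASET_DIR"):
    """Return the path to a dataset file, or raise with download guidance.

    (Hypothetical helper; a test could catch the error and skip instead.)
    """
    root = os.environ.get(env_var)
    if not root:
        raise FileNotFoundError(
            f"{env_var} is not set; point it at the directory that holds "
            "the downloaded datasets (see the README for the commands)."
        )
    path = Path(root) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset '{filename}' was not found in '{root}'; download it "
            "first (see the README for the commands)."
        )
    return path
```

In a pytest suite, a fixture could call such a helper and turn the `FileNotFoundError` into a skip so that the reason shows up in the test report.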

Comment thread python/cuml/cuml/testing/manifold/README.md Outdated
Comment thread python/cuml/cuml/testing/manifold/README.md Outdated
Comment thread python/cuml/cuml/testing/manifold/README.md Outdated
Contributor

@csadorf csadorf left a comment

LGTM! Thanks a lot!

@csadorf csadorf removed the DO NOT MERGE Hold off on merging; see PR for details label Sep 30, 2025
Contributor

csadorf commented Sep 30, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 8388e2a into rapidsai:branch-25.10 Sep 30, 2025
102 checks passed

Labels

Cython / Python: Cython or Python issue
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change

Development

Successfully merging this pull request may close these issues.

Enhance UMAP test coverage and address known bug reports

5 participants