# Improved UMAP testing and debugging #7073
**Merged**: rapids-bot merged 28 commits into `rapidsai:branch-25.10` from `viclafargue:improved-umap-testing-and-debugging` on Sep 30, 2025.
All 28 commits are by viclafargue:

- `2649bf1` Improved UMAP testing and debugging
- `5dadaec` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `bae7a2c` KNN and fuzzy simplicial set testing with large real datasets
- `53b5bef` Adding test_simplicial_set_embedding and updating metrics functions
- `8243565` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `e1d5297` Updating files
- `639e1b1` Improvements
- `3e21137` updating tests
- `9c9054b` Making geodesic metrics computationally tractable
- `6c017c7` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `744f995` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `95e2dc1` Adding spectral initialization tests
- `f98dbfa` improvements
- `add231b` improvements
- `1e557ac` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `c09bb3a` Additional testing of simplicial set embeddings
- `9b39844` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `a3fe0ed` Adding all-neighbors backend
- `7d4359c` Merge remote-tracking branch 'origin/branch-25.10' into improved-umap…
- `71a88c1` Moving n_job argument
- `ba59825` Adding README
- `dd31d2e` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `282b8a8` Answer review
- `f9b21aa` README edit
- `5731e2a` Merge branch 'branch-25.10' into improved-umap-testing-and-debugging
- `345fa3f` Moving code to python/cuml/umap_dev_tools
- `50e071b` update path
- `f60eaef` Adding cuVS as necessary dependency in README doc
# UMAP Testing and Embedding Quality Assessment Tools

This directory provides comprehensive tools for both UMAP implementation validation and embedding quality assessment. It serves data scientists, researchers, and developers who need to evaluate the quality of UMAP embeddings or compare different UMAP implementations.

## Overview

The tools in this directory serve three main purposes:

1. **Implementation Testing** (`test_umap.py`): Rigorous validation of cuML UMAP against reference implementations
2. **Embedding Quality Assessment** (`umap_metrics.py`): Comprehensive evaluation tools for measuring the quality of any UMAP embedding
3. **Implementation Comparison** (`run_umap_debug.py`): Detailed comparison of cuML UMAP against the reference implementation, with debugging support
### For Data Scientists

These tools provide **standardized metrics** to evaluate how well your UMAP embeddings preserve data structure. Use them to **quantify embedding quality**, **optimize parameters**, and **generate publication-ready reports** with comprehensive visualizations.

### For Researchers and Developers

These tools enable **rigorous implementation comparison** and provide detailed algorithmic insights, including **accuracy benchmarking**, **pipeline debugging**, and **topological analysis** using persistent homology.
## Files Description

### Core Testing Files

- **`test_umap.py`**: Main test suite for UMAP functionality with real-world datasets
- **`umap_metrics.py`**: Comprehensive metrics computation library for UMAP quality assessment
- **`run_umap_debug.py`**: Interactive debugging tool for comparing reference vs cuML implementations
- **`toy_datasets.py`**: Synthetic and real dataset generators for testing
- **`web_results_generation.py`**: Web-based interactive report generation
### Standard Testing (`test_umap.py`)

This file contains tests for real-world datasets commonly used in nearest-neighbor search benchmarks:

- **Deep Image 96 Angular**: High-dimensional image features with cosine similarity
- **Fashion-MNIST 784 Euclidean**: Fashion item image embeddings
- **GIST 960 Euclidean**: Image descriptor vectors
- **MNIST 784 Euclidean**: Handwritten digit embeddings
- **SIFT 128 Euclidean**: Scale-invariant feature transform descriptors
#### Key Test Features

- **KNN Accuracy Validation**: Compares k-nearest neighbor search results between cuML and reference implementations, measuring neighbor recall and distance accuracy across different metrics (Euclidean, cosine, etc.)
- **Fuzzy Simplicial Set Verification**: Validates the construction of fuzzy simplicial sets by comparing edge weights, graph topology, and membership probabilities between implementations
- **Spectral Initialization Testing**: Compares spectral embedding initialization methods, ensuring consistent starting points for the optimization process
- **Embedding Quality Assessment**: Measures final embedding quality using trustworthiness, continuity, and other established manifold learning metrics
- **Parameter Robustness Testing**: Validates performance across different UMAP parameters (n_neighbors, min_dist, n_components) and dataset characteristics
- **Implementation Consistency**: Ensures cuML produces statistically equivalent results to the reference implementation within acceptable tolerances
- **Performance Regression Detection**: Catches performance degradations or quality regressions in cuML updates
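The neighbor-recall check described above boils down to counting, per query point, how many reference neighbors the tested implementation recovers. A minimal sketch (the function name `knn_recall` is illustrative, not part of the test suite):

```python
import numpy as np

def knn_recall(ref_indices: np.ndarray, test_indices: np.ndarray) -> float:
    """Fraction of reference neighbors recovered, averaged over queries.

    Both arrays have shape (n_queries, k) and hold neighbor indices.
    """
    k = ref_indices.shape[1]
    hits = [
        len(set(ref_row) & set(test_row))
        for ref_row, test_row in zip(ref_indices, test_indices)
    ]
    return float(np.mean(hits)) / k

# Tiny illustration with hand-built neighbor lists (k = 3):
ref = np.array([[0, 1, 2], [3, 4, 5]])
test = np.array([[0, 2, 7], [3, 4, 5]])  # first query misses one neighbor
print(knn_recall(ref, test))  # 5 of 6 neighbors match -> 0.8333...
```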
#### Running Tests

```bash
DATASET_DIR=datasets pytest python/cuml/cuml/testing/manifold/test_umap.py -v
```
### Embedding Quality Assessment (`run_umap_debug.py`)

Interactive tool for UMAP embedding quality assessment and implementation comparison. Provides **comprehensive quality metrics**, **standardized evaluation benchmarks**, and **publication-ready reports**. Also enables **pipeline debugging** and **detailed implementation analysis** across multiple test datasets.

#### Available Datasets

- **Synthetic**: Swiss Roll, S-Curve, Sphere, Torus, Gaussian Blobs
- **Real**: Iris, Wine, Breast Cancer, Digits, Diabetes
#### Usage Examples

```bash
# Quality assessment with web report
python run_umap_debug.py --implementation cuml --dataset "Swiss Roll" --web-report

# Compare cuML vs reference implementation
python run_umap_debug.py --implementation both --dataset "Swiss Roll" --web-report

# Quick quality check (no web report)
python run_umap_debug.py --dataset "Swiss Roll" --implementation cuml

# List available datasets
python run_umap_debug.py --list-datasets
```
### Quality Metrics Library (`umap_metrics.py`)

This module provides a comprehensive suite of scientifically validated metrics for assessing UMAP embedding quality. These metrics are based on established literature in manifold learning and dimensionality reduction.
#### Local Structure Preservation

These metrics evaluate how well your embedding preserves local neighborhoods and nearest-neighbor relationships:

- **Trustworthiness**: Quantifies how many of the k-nearest neighbors in the embedding were also k-nearest neighbors in the original space (higher is better, range: 0-1)
- **Continuity**: Measures how many of the k-nearest neighbors in the original space remain k-nearest neighbors in the embedding (higher is better, range: 0-1)
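Trustworthiness can be computed with scikit-learn's reference implementation. A short sketch, assuming scikit-learn is available (the 2-D PCA projection here merely stands in for a UMAP embedding):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = load_digits(return_X_y=True)

# A 2-D PCA projection stands in for a UMAP embedding in this sketch.
embedding = PCA(n_components=2).fit_transform(X)

# Fraction of embedded k-NN that were also k-NN in the original space.
score = trustworthiness(X, embedding, n_neighbors=15)
print(f"trustworthiness: {score:.3f}")  # 1.0 would be perfect

# Sanity check: "embedding" the data onto itself preserves neighborhoods.
assert trustworthiness(X, X, n_neighbors=15) > 0.999
```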
#### Global Structure Preservation

These metrics assess how well large-scale data relationships are maintained:

- **Geodesic Spearman Correlation**: Rank correlation between geodesic distances in original space and Euclidean distances in embedding space
- **Geodesic Pearson Correlation (DEMaP)**: Linear correlation between geodesic distances in the original space and Euclidean distances in the embedding
- **Global Structure Score**: Combined measure of how well overall data topology is preserved
||
| #### Fuzzy Simplicial Set Analysis | ||
| For researchers and developers, these metrics analyze the intermediate graph representations: | ||
|
|
||
| - **KL Divergence**: Information-theoretic comparison between high-dimensional and low-dimensional fuzzy graphs | ||
| - **Jaccard Index**: Proportion of edges that overlap between fuzzy simplicial sets | ||
| - **Row-sum L1 Error**: Per-node membership mass differences between graph representations | ||
|
|
||
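The Jaccard index on fuzzy graphs compares sparsity patterns: edges present in both graphs over edges present in either. A minimal sketch with SciPy sparse matrices (the function name `edge_jaccard` is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def edge_jaccard(a: csr_matrix, b: csr_matrix) -> float:
    """Jaccard index of the edge sets (sparsity patterns) of two graphs."""
    a_pat = a.copy()
    a_pat.data[:] = 1.0  # keep the pattern, drop the membership weights
    b_pat = b.copy()
    b_pat.data[:] = 1.0
    inter = a_pat.multiply(b_pat).nnz  # edges present in both graphs
    union = (a_pat + b_pat).nnz        # edges present in either graph
    return inter / union if union else 1.0

# Two tiny graphs sharing 1 of 3 distinct edges:
g1 = csr_matrix(np.array([[0, 0.5, 0.2], [0, 0, 0], [0, 0, 0]]))
g2 = csr_matrix(np.array([[0, 0.9, 0], [0, 0, 0.3], [0, 0, 0]]))
print(edge_jaccard(g1, g2))  # 1 shared edge / 3 total edges -> 0.333...
```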
#### Topology Preservation

Advanced topological analysis using computational topology:

- **Persistent Homology**: Analysis of topological features (holes, connected components) across scales
- **Betti Numbers**: Count of topological features - H0 (connected components) and H1 (loops/cycles)
- **Topological Similarity**: Comparison of persistence diagrams between original and embedded data
#### Interpreting the Metrics

**For Data Scientists:**

- **Trustworthiness & Continuity > 0.9**: Excellent local structure preservation
- **Trustworthiness & Continuity > 0.8**: Good preservation, suitable for most analyses
- **Trustworthiness & Continuity < 0.7**: Poor preservation, consider parameter tuning
- **DEMaP > 0.7**: Good global structure preservation
- **Similar Betti numbers**: Good topological preservation
### Web Report Generation (`web_results_generation.py`)

Creates interactive HTML reports with:

- **Embedding Visualizations**: 2D scatter plots with original data coloring
- **Spectral Initialization Plots**: Visualization of initial embedding states
- **Quality Metrics Tables**: Comprehensive metric comparisons
- **Implementation Comparisons**: Side-by-side reference vs cuML analysis
## Missing Dependencies

The following dependencies are **NOT** present in the conda environment and need to be installed separately:

#### Required for Geodesic Distance Computation

```bash
conda install -c rapidsai-nightly cugraph
```

#### Required for Topology Preservation Metrics

```bash
pip install ripser
```

#### Required for Web Report Generation

```bash
pip install plotly
```
## Data Requirements

### Real Dataset Testing

For tests using real benchmark datasets, set the `DATASET_DIR` environment variable:

```bash
export DATASET_DIR=/path/to/benchmark/datasets
```

Expected dataset format:

- Binary files with `.fbin` extension for base vectors
- Datasets should follow the standard ANN benchmark format
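For orientation, the `.fbin` layout used by common ANN benchmark suites is typically two little-endian int32 header values (vector count and dimension) followed by row-major float32 data. A hedged reader sketch under that assumption (verify the layout against your dataset source):

```python
import os
import tempfile

import numpy as np

def read_fbin(path: str) -> np.ndarray:
    """Read an .fbin file: int32 count, int32 dim, then count*dim float32s.

    Assumes the common ANN-benchmark layout; confirm before relying on it.
    """
    with open(path, "rb") as f:
        n, dim = np.fromfile(f, dtype="<i4", count=2)
        data = np.fromfile(f, dtype="<f4", count=n * dim)
    return data.reshape(n, dim)

# Round-trip a tiny synthetic file to illustrate the layout:
vecs = np.arange(6, dtype="<f4").reshape(2, 3)
with tempfile.NamedTemporaryFile(suffix=".fbin", delete=False) as f:
    np.array(vecs.shape, dtype="<i4").tofile(f)  # header: count, dim
    vecs.tofile(f)                               # payload: float32 rows
    path = f.name
restored = read_fbin(path)
os.remove(path)
print(np.array_equal(restored, vecs))  # True
```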
## Review Discussion

**Reviewer:** I ran this and almost all tests failed, because the datasets were not available. Is this expected? Am I supposed to download those separately? If so, how or where?

**viclafargue:** Yes, the datasets should be downloaded separately. When the datasets are missing, the tests now alert the user that they have to download them. I have complemented this with a README update; the commands are now given both in the test and the README.