# UMAP Testing and Embedding Quality Assessment Tools

This directory provides comprehensive tools for both UMAP implementation validation and embedding quality assessment. It serves data scientists, researchers, and developers who need to evaluate the quality of UMAP embeddings or compare different UMAP implementations.

## Overview

The tools in this directory serve three main purposes:

1. **Implementation Testing** (`test_umap.py`): Rigorous validation of cuML UMAP against reference implementations
2. **Embedding Quality Assessment** (`umap_metrics.py`): Comprehensive evaluation tools for measuring the quality of any UMAP embedding
3. **Implementation Comparison** (`run_umap_debug.py`): Detailed side-by-side analysis of cuML UMAP against the reference implementation, with support for pipeline debugging

### For Data Scientists

These tools provide **standardized metrics** to evaluate how well your UMAP embeddings preserve data structure. Use them to **quantify embedding quality**, **optimize parameters**, and **generate publication-ready reports** with comprehensive visualizations.

### For Researchers and Developers

These tools enable **rigorous implementation comparison** and provide detailed algorithmic insights including **accuracy benchmarking**, **pipeline debugging**, and **topological analysis** using persistent homology.

## Files Description

### Core Testing Files

- **`test_umap.py`**: Main test suite for UMAP functionality with real-world datasets
- **`umap_metrics.py`**: Comprehensive metrics computation library for UMAP quality assessment
- **`run_umap_debug.py`**: Interactive debugging tool for comparing reference vs cuML implementations
- **`toy_datasets.py`**: Synthetic and real dataset generators for testing
- **`web_results_generation.py`**: Web-based interactive report generation

### Standard Testing (`test_umap.py`)

This file contains tests for real-world datasets commonly used in nearest neighbor search benchmarks:

- **Deep Image 96 Angular**: High-dimensional image features with cosine similarity
- **Fashion-MNIST 784 Euclidean**: Fashion item image embeddings
- **GIST 960 Euclidean**: Image descriptor vectors
- **MNIST 784 Euclidean**: Handwritten digit embeddings
- **SIFT 128 Euclidean**: Scale-invariant feature transform descriptors

#### Key Test Features

- **KNN Accuracy Validation**: Compares k-nearest neighbor search results between cuML and reference implementations, measuring neighbor recall and distance accuracy across different metrics (euclidean, cosine, etc.)
- **Fuzzy Simplicial Set Verification**: Validates the construction of fuzzy simplicial sets by comparing edge weights, graph topology, and membership probabilities between implementations
- **Spectral Initialization Testing**: Compares spectral embedding initialization methods, ensuring consistent starting points for the optimization process
- **Embedding Quality Assessment**: Measures final embedding quality using trustworthiness, continuity, and other established manifold learning metrics
- **Parameter Robustness Testing**: Validates performance across different UMAP parameters (n_neighbors, min_dist, n_components) and dataset characteristics
- **Implementation Consistency**: Ensures cuML produces statistically equivalent results to the reference implementation within acceptable tolerances
- **Performance Regression Detection**: Catches performance degradations or quality regressions in cuML updates
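
As an illustration, the neighbor-recall comparison described above can be sketched in a few lines of NumPy (the `knn_recall` helper below is illustrative, not part of the test suite):

```python
import numpy as np

def knn_recall(ref_indices: np.ndarray, test_indices: np.ndarray) -> float:
    """Average fraction of reference neighbors recovered by the tested
    implementation; both inputs are (n_queries, k) neighbor-index arrays."""
    assert ref_indices.shape == test_indices.shape
    hits = sum(
        len(set(ref) & set(test))
        for ref, test in zip(ref_indices, test_indices)
    )
    return hits / ref_indices.size
```

A recall of 1.0 means every reference neighbor was found; the actual tests additionally compare neighbor distances within a tolerance.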

#### Running Tests

```bash
DATASET_DIR=datasets pytest python/cuml/cuml/testing/manifold/test_umap.py -v
```

### Embedding Quality Assessment (`run_umap_debug.py`)

Interactive tool for UMAP embedding quality assessment and implementation comparison. Provides **comprehensive quality metrics**, **standardized evaluation benchmarks**, and **publication-ready reports**. Also enables **pipeline debugging** and **detailed implementation analysis** across multiple test datasets.

#### Available Datasets

- **Synthetic**: Swiss Roll, S-Curve, Sphere, Torus, Gaussian Blobs
- **Real**: Iris, Wine, Breast Cancer, Digits, Diabetes

#### Usage Examples

```bash
# Quality assessment with web report
python run_umap_debug.py --implementation cuml --dataset "Swiss Roll" --web-report

# Compare cuML vs reference implementation
python run_umap_debug.py --implementation both --dataset "Swiss Roll" --web-report

# Quick quality check (no web report)
python run_umap_debug.py --dataset "Swiss Roll" --implementation cuml

# List available datasets
python run_umap_debug.py --list-datasets
```

### Quality Metrics Library (`umap_metrics.py`)

This module provides a comprehensive suite of scientifically validated metrics for assessing UMAP embedding quality. These metrics are based on established literature in manifold learning and dimensionality reduction.

#### Local Structure Preservation
These metrics evaluate how well your embedding preserves local neighborhoods and nearest-neighbor relationships:

- **Trustworthiness**: Quantifies how many of the k-nearest neighbors in the embedding were also k-nearest neighbors in the original space (higher is better, range: 0-1)
- **Continuity**: Measures how many of the k-nearest neighbors in the original space remain k-nearest neighbors in the embedding (higher is better, range: 0-1)
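
For reference, trustworthiness can be computed directly from pairwise distances. Below is a minimal NumPy sketch of the Venna & Kaski formula (`sklearn.manifold.trustworthiness` provides an equivalent, optimized implementation):

```python
import numpy as np

def trustworthiness(X, X_emb, k=5):
    """Trustworthiness T(k): penalizes points that intrude into a point's
    embedding neighborhood without being original-space neighbors."""
    n = X.shape[0]

    def sq_dists(A):
        # pairwise squared Euclidean distances
        s = (A * A).sum(axis=1)
        return s[:, None] + s[None, :] - 2.0 * A @ A.T

    d_hi = sq_dists(X)
    d_lo = sq_dists(X_emb)
    np.fill_diagonal(d_hi, np.inf)
    np.fill_diagonal(d_lo, np.inf)
    # rank of each point in the original space (0 = nearest neighbor)
    ranks_hi = d_hi.argsort(axis=1).argsort(axis=1)
    knn_lo = d_lo.argsort(axis=1)[:, :k]
    penalty = 0.0
    for i in range(n):
        for j in knn_lo[i]:
            r = ranks_hi[i, j]
            if r >= k:  # j is an embedding neighbor but not an original one
                penalty += r - k + 1
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

Continuity is the symmetric counterpart: swap the roles of the original and embedded spaces.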

#### Global Structure Preservation
These metrics assess how well large-scale data relationships are maintained:

- **Geodesic Spearman Correlation**: Rank correlation between geodesic distances in original space and Euclidean distances in embedding space
- **Geodesic Pearson Correlation (DEMaP)**: Linear correlation between geodesic and embedded distances, as used in the DEMaP (Denoised Embedding Manifold Preservation) quality metric
- **Global Structure Score**: Combined measure of how well overall data topology is preserved
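
A minimal sketch of the geodesic rank correlation, assuming geodesics are approximated by shortest paths on a k-NN graph built with SciPy (function name illustrative):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def geodesic_spearman(X, X_emb, k=10):
    """Spearman correlation between k-NN-graph geodesic distances in the
    original space and Euclidean distances in the embedding."""
    n = X.shape[0]
    d = squareform(pdist(X))
    # keep each point's k nearest neighbors (column 0 is the point itself)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros_like(d)
    W[rows, knn.ravel()] = d[rows, knn.ravel()]
    W = np.maximum(W, W.T)  # symmetrize -> undirected graph
    geo = shortest_path(sparse.csr_matrix(W), method="D", directed=False)
    mask = np.isfinite(geo) & ~np.eye(n, dtype=bool)
    emb = squareform(pdist(X_emb))
    rho, _ = spearmanr(geo[mask], emb[mask])
    return rho
```

All-pairs shortest paths are O(n² log n), which is why the actual tests subsample or otherwise keep the geodesic computation tractable on large datasets.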

#### Fuzzy Simplicial Set Analysis
For researchers and developers, these metrics analyze the intermediate graph representations:

- **KL Divergence**: Information-theoretic comparison between high-dimensional and low-dimensional fuzzy graphs
- **Jaccard Index**: Proportion of edges that overlap between fuzzy simplicial sets
- **Row-sum L1 Error**: Per-node membership mass differences between graph representations
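
The graph-level metrics can be sketched against any pair of SciPy sparse adjacency matrices (helper names illustrative):

```python
import numpy as np
from scipy import sparse

def edge_jaccard(A, B):
    """Jaccard index of the edge sets of two fuzzy graphs:
    |E_A ∩ E_B| / |E_A ∪ E_B|, ignoring edge weights."""
    a = (sparse.csr_matrix(A) > 0).astype(np.int8)
    b = (sparse.csr_matrix(B) > 0).astype(np.int8)
    inter = a.multiply(b).nnz
    union = (a + b).nnz  # boolean OR via nonzeros of the sum
    return inter / union if union else 1.0

def rowsum_l1(A, B):
    """Mean per-node membership-mass difference between two graphs."""
    ra = np.asarray(sparse.csr_matrix(A).sum(axis=1)).ravel()
    rb = np.asarray(sparse.csr_matrix(B).sum(axis=1)).ravel()
    return np.abs(ra - rb).mean()
```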

#### Topology Preservation
Advanced topological analysis using computational topology:

- **Persistent Homology**: Analysis of topological features (holes, connected components) across scales
- **Betti Numbers**: Count of topological features - H0 (connected components) and H1 (loops/cycles)
- **Topological Similarity**: Comparison of persistent diagrams between original and embedded data
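
Given persistence diagrams in the ripser-style `(birth, death)` format, Betti numbers at a chosen filtration scale can be read off as below (helper name illustrative):

```python
import numpy as np

def betti_at_scale(diagrams, eps):
    """Betti numbers at filtration scale eps from persistence diagrams.

    `diagrams` is a list of (n_features, 2) arrays of (birth, death)
    pairs, one per homology dimension, e.g. ripser(X)['dgms'].
    A feature counts toward Betti_k if it is born by eps and not yet dead.
    """
    return [
        int(np.sum((dgm[:, 0] <= eps) & (dgm[:, 1] > eps))) if len(dgm) else 0
        for dgm in diagrams
    ]
```

Comparing these counts between the original data and the embedding gives a coarse but interpretable topological-similarity signal.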

#### Interpreting the Metrics

**For Data Scientists:**
- **Trustworthiness & Continuity > 0.9**: Excellent local structure preservation
- **Trustworthiness & Continuity > 0.8**: Good preservation, suitable for most analyses
- **Trustworthiness & Continuity < 0.7**: Poor preservation, consider parameter tuning
- **DEMaP > 0.7**: Good global structure preservation
- **Similar Betti numbers**: Good topological preservation
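
For convenience, the bands above can be encoded in a small helper (illustrative only; the undefined 0.7-0.8 range is labeled "borderline" here as our own reading):

```python
def grade_local_preservation(trustworthiness, continuity):
    """Map trustworthiness/continuity to the qualitative bands above,
    judging by the weaker of the two scores."""
    score = min(trustworthiness, continuity)
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if score >= 0.7:
        return "borderline"
    return "poor"
```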

### Web Report Generation (`web_results_generation.py`)

Creates interactive HTML reports with:

- **Embedding Visualizations**: 2D scatter plots with original data coloring
- **Spectral Initialization Plots**: Visualization of initial embedding states
- **Quality Metrics Tables**: Comprehensive metric comparisons
- **Implementation Comparisons**: Side-by-side reference vs cuML analysis

## Missing Dependencies

The following dependencies are **NOT** present in the conda environment and need to be installed separately:

#### Required for Geodesic Distance Computation
```bash
conda install -c rapidsai-nightly cugraph
```

#### Required for Topology Preservation Metrics
```bash
pip install ripser
```

#### Required for Web Report Generation
```bash
pip install plotly
```

## Data Requirements

### Real Dataset Testing

For tests using real benchmark datasets, set the `DATASET_DIR` environment variable:

```bash
export DATASET_DIR=/path/to/benchmark/datasets
```

Expected dataset format:
- Binary files with `.fbin` extension for base vectors
- Datasets should follow the standard ANN benchmark format