Systematic discovery of statistically anomalous quasar spectra in DESI DR1 using unsupervised machine learning.
This project consumes the Analysis-Ready Dataset (ARD) built by desi-cosmic-void-galaxies to perform large-scale anomaly detection across ~1.6 million QSO spectra. Using a Variational Autoencoder architecture, the project aims to systematically identify rare physical states, unknown object classes, and unexpected phenomena that traditional catalog queries would miss.
Current Status: Skeletal. Repository structure established; awaiting ARD completion (Phases 05-06 in the upstream project).
This section provides context for those less familiar with spectral anomaly detection. If you already know autoencoders and outlier detection, skip to Data Dependencies.
Traditional astronomical analysis starts with known categories: we query for quasars, select by redshift, filter by emission line properties. This approach is powerful but inherently limited: it only finds what we already know to look for.
Anomaly detection inverts this: instead of asking "which objects match my criteria?", we ask "which objects don't fit the normal pattern?" This enables discovery of:
- Rare physical states: Changing-look quasars mid-transition, extreme BAL systems
- Misclassified objects: Sources incorrectly labeled as QSOs that represent something else entirely
- Unexpected phenomena: Novel spectral signatures not yet cataloged in the literature
- Pipeline artifacts: Systematic data reduction issues worth feeding back to the collaboration
A Variational Autoencoder (VAE) learns to compress spectra into a low-dimensional latent space and reconstruct them. Most spectra compress and reconstruct well; they're "normal" in the statistical sense. Anomalies are spectra the model struggles with:
- High reconstruction error: The output doesn't match the input
- Latent space isolation: The compressed representation is far from other objects
- High KL divergence: The encoding deviates strongly from the distribution the latent space was trained to follow
By combining these metrics, we identify spectra that warrant human inspection: the "unknown unknowns" hiding in 1.6 million objects.
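The three metrics above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the upstream Tier 2 implementation: the function names are hypothetical, the KL term assumes a diagonal-Gaussian encoder with a standard-normal prior (the common VAE setup), latent isolation is approximated by a mean k-nearest-neighbour distance rather than an Isolation Forest, and the rank-averaging rule is just one possible way to combine the metrics.

```python
import math

def reconstruction_mse(spectrum, reconstruction):
    """Mean squared error between an input spectrum and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(spectrum, reconstruction)) / len(spectrum)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ); assumes a standard-normal prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def latent_isolation(z, others, k=3):
    """Mean Euclidean distance from z to its k nearest neighbours (a stand-in
    for the Isolation Forest score; chosen here only for simplicity)."""
    dists = sorted(math.dist(z, o) for o in others)
    return sum(dists[:k]) / k

def percentile_ranks(values):
    """Map each value to its rank scaled to [0, 1]; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(values) - 1) if len(values) > 1 else 0.0
    return ranks

def combined_score(mse, kl, isolation):
    """Average the per-metric percentile ranks; higher means more anomalous.
    Equal weighting is an assumption, not a project decision."""
    columns = [percentile_ranks(c) for c in (mse, kl, isolation)]
    return [sum(col[i] for col in columns) / len(columns) for i in range(len(mse))]
```

Rank-averaging keeps any single metric from dominating because of its units; a weighted sum of standardized scores would be an equally reasonable choice.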
DESI DR1 is the largest uniform spectroscopic QSO sample ever assembled. At this scale, even rare phenomena (1-in-10,000 events) yield on the order of 160 candidates among ~1.6 million spectra. Combined with the ARD's pre-computed properties and spectral embeddings, this enables systematic discovery at a scale not previously possible.
This project is an ARD consumer: it does not perform primary data ingestion. All catalog data and spectral embeddings come from the upstream ARD factory.
| Source | Repository | What We Use |
|---|---|---|
| DESI DR1 ARD | desi-cosmic-void-galaxies | QSO catalog + spectral embeddings |
| Column | Source | Purpose |
|---|---|---|
| TARGETID | DESI Core | Object identifier |
| Z_HELIO | DESI Core | Redshift for rest-frame transformation |
| SPECTYPE | DESI Core | QSO selection |
| BAL_PROB | AGN VAC | Known BAL flagging |
| LATENT_VEC | Tier 2 compute | 16-D spectral embedding |
| RECON_MSE | Tier 2 compute | Reconstruction error |
| ANOMALY_SCORE | Tier 2 compute | Isolation Forest score |
| Asset | Location | Purpose |
|---|---|---|
| QSO Parquet tiles | proj-fs02 network share | Raw spectra for validation |
| Linkage index | PostgreSQL (proj-pg01) | TARGETID → tile mapping |
Note: The core ML metrics (LATENT_VEC, RECON_MSE, ANOMALY_SCORE) are computed upstream as Tier 2 ARD columns. This project focuses on candidate validation and scientific interpretation rather than model training.
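The column table lists Z_HELIO as the redshift used for the rest-frame transformation. A minimal sketch of that step is below; the function name is hypothetical and the wavelength grid is assumed to be a plain list of observed-frame wavelengths in ångströms. Flux rescaling is deliberately left out.

```python
def to_rest_frame(wavelengths_obs, z):
    """Shift observed-frame wavelengths to the rest frame.

    Uses lambda_rest = lambda_obs / (1 + z), where z would come from the
    Z_HELIO column. Flux densities are not rescaled in this sketch.
    """
    if z <= -1.0:
        raise ValueError("redshift must be greater than -1")
    return [w / (1.0 + z) for w in wavelengths_obs]
```

For example, a feature observed at 4647 Å in a z = 2 quasar maps to 1549 Å in the rest frame, which is where C IV emission is expected.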
The analysis leverages pre-computed embeddings from the ARD, focusing on candidate triage and scientific follow-up.
Query the ARD for high-anomaly objects:
SELECT targetid, z_helio, recon_mse, anomaly_score, bal_prob
FROM ard.qso_ard
WHERE anomaly_score > threshold
ORDER BY anomaly_score DESC;
Apply multi-metric filtering:
- Reconstruction error above population threshold
- Latent space isolation (Isolation Forest)
- Exclude known BAL systems (or flag for separate analysis)
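The filtering steps above could look roughly like the following once the query results are in memory. This is a sketch under stated assumptions: rows are plain dictionaries keyed by the lowercase column names used in the SQL query, the 99th-percentile cut on `recon_mse`, the `score_threshold` default, and the `bal_cut` of 0.5 are placeholder values rather than project decisions, and higher `anomaly_score` is taken to mean more anomalous, as the query implies.

```python
from statistics import quantiles

def triage(rows, mse_percentile=99, score_threshold=0.5, bal_cut=0.5):
    """Split rows into (candidates, bal_flagged), each sorted by anomaly score.

    All three thresholds are illustrative placeholders.
    """
    # Population threshold on reconstruction error: a percentile of recon_mse.
    cuts = quantiles([r["recon_mse"] for r in rows], n=100, method="inclusive")
    mse_cut = cuts[mse_percentile - 1]

    candidates, bal_flagged = [], []
    for r in rows:
        if r["recon_mse"] <= mse_cut or r["anomaly_score"] <= score_threshold:
            continue
        # Known BAL systems are set aside for separate analysis, not discarded.
        # A missing bal_prob (None) is treated as "not a known BAL".
        is_bal = (r.get("bal_prob") or 0.0) >= bal_cut
        (bal_flagged if is_bal else candidates).append(r)

    by_score = lambda r: r["anomaly_score"]
    return (sorted(candidates, key=by_score, reverse=True),
            sorted(bal_flagged, key=by_score, reverse=True))
```

Keeping the BAL objects in a second list rather than dropping them preserves the option of analysing them separately.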
For top candidates:
- Retrieve spectra from Parquet tiles
- Generate diagnostic plots (spectrum, reconstruction, residuals)
- Human classification: genuine anomaly vs artifact vs known phenomenon
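The residuals panel of a diagnostic plot can be backed by a small numeric summary to help reviewers prioritise. The sketch below is an assumption-laden illustration: it presumes each spectrum comes with a flux array, the model reconstruction, and per-pixel inverse variance, and the 3-sigma cut and function names are hypothetical. A long contiguous run of outlier pixels hints at a broad spectral feature, while isolated single-pixel spikes are more typical of artifacts; this is a heuristic and does not replace human classification.

```python
import math

def normalized_residuals(flux, recon, ivar):
    """(flux - recon) * sqrt(ivar) per pixel; pixels with ivar <= 0 become None."""
    return [(f - r) * math.sqrt(w) if w > 0 else None
            for f, r, w in zip(flux, recon, ivar)]

def residual_summary(chi, n_sigma=3.0):
    """Summarise normalized residuals for triage (the threshold is a placeholder)."""
    n_good = sum(c is not None for c in chi)
    outlier = [c is not None and abs(c) > n_sigma for c in chi]
    longest = run = 0
    for flag in outlier:
        run = run + 1 if flag else 0  # masked pixels also break a run
        longest = max(longest, run)
    return {
        "n_good": n_good,
        "outlier_fraction": sum(outlier) / n_good if n_good else 0.0,
        "longest_outlier_run": longest,
    }
```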
For validated anomalies:
- Multi-wavelength cross-match (WISE, GALEX, X-ray catalogs)
- Literature search for prior observations
- Physical interpretation and categorization
- Curated anomaly catalog with classifications
- Discovery papers for novel phenomena
- Public release as community resource
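The multi-wavelength cross-match step listed above is, at its core, a positional match on the sky. A brute-force sketch follows. It assumes each candidate and each external catalog entry carries `ra` and `dec` in degrees (sky coordinates are not among the ARD columns listed earlier, so their availability is an assumption), and the 2-arcsecond radius is a placeholder, not a tuned value.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine form) between two positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def nearest_match(ra, dec, catalog, radius_arcsec=2.0):
    """Return (entry, separation) for the closest entry within the radius, else None.

    `catalog` is a list of dicts with hypothetical 'ra' and 'dec' keys (degrees).
    """
    best = None
    for entry in catalog:
        sep = angular_separation_arcsec(ra, dec, entry["ra"], entry["dec"])
        if sep <= radius_arcsec and (best is None or sep < best[1]):
            best = (entry, sep)
    return best
```

A linear scan is adequate for a few hundred validated candidates against a small catalog extract; for full-catalog matching, an indexed approach such as astropy's `SkyCoord.match_to_catalog_sky` would likely be the better fit.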
graph TD
subgraph "Upstream ARD"
A1[desi-cosmic-void-galaxies<br/>ARD Factory] --> A2[ard.qso_ard<br/>Materialized Table]
A1 --> A3[Tier 2 Compute<br/>Embeddings + Scores]
A1 --> A4[Parquet Spectral Tiles<br/>proj-fs02]
end
subgraph "This Project"
A2 --> B1[Candidate Ranking<br/>Multi-Metric Query]
A3 --> B1
B1 --> B2[Visual Validation<br/>Human Review]
A4 --> B2
    B2 --> B3[Cross-Match<br/>Multi-λ Context]
B3 --> B4[Classification<br/>Physical Interpretation]
end
subgraph "Outputs"
B4 --> C1[Anomaly Catalog<br/>Public Release]
B4 --> C2[Discovery Papers<br/>Novel Phenomena]
end
style A1 fill:#336791,color:#fff
style A2 fill:#4ecdc4
style A3 fill:#fff3e0
style C1 fill:#c8e6c9
| Phase | Name | Status | Blocker |
|---|---|---|---|
| - | Repository Setup | ✅ Complete | - |
| - | ARD Dependency | ⏳ Waiting | Upstream Phase 05-06 |
| - | Tier 2 Embeddings | ⏳ Waiting | Upstream Phase 07 |
| 01 | Candidate Ranking | ⬜ Not Started | Embeddings available |
| 02 | Visual Validation | ⬜ Not Started | Phase 01 |
| 03 | Cross-Match | ⬜ Not Started | Phase 02 |
| 04 | Catalog Release | ⬜ Not Started | Phase 03 |
Before work begins on this project:
- ARD Phase 05-06 must complete (validates the QSO catalog)
- ARD Phase 07 (Tier 2 compute) must generate spectral embeddings
- LATENT_VEC, RECON_MSE, ANOMALY_SCORE columns must be populated
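The last prerequisite can be checked mechanically once the ARD is queryable. The sketch below assumes a sample of rows has already been fetched as dictionaries keyed by lowercase column names, with SQL NULLs arriving as `None`; the function names and the default requirement of 100% population are hypothetical.

```python
# Tier 2 columns this project depends on (lowercase, as in the SQL query).
REQUIRED_COLUMNS = ("latent_vec", "recon_mse", "anomaly_score")

def population_report(rows, columns=REQUIRED_COLUMNS):
    """Fraction of rows with a non-null value for each required column."""
    total = len(rows)
    return {
        c: (sum(r.get(c) is not None for r in rows) / total if total else 0.0)
        for c in columns
    }

def ready(rows, min_fraction=1.0):
    """True when every required column meets the minimum populated fraction."""
    return all(f >= min_fraction for f in population_report(rows).values())
```

An empty sample reports 0.0 for every column, so `ready` stays false until there is data to inspect.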
desi-qso-anomaly-detection/
├── docs/                      # Documentation
│   ├── data-science-infrastructure.md
│   └── documentation-standards/
├── src/                       # Source code (to be developed)
├── scripts/                   # Analysis pipelines (to be developed)
├── notebooks/                 # Validation notebooks (to be developed)
├── web/                       # Validation UI (to be developed)
├── tests/                     # Unit tests (to be developed)
├── work-logs/                 # Milestone documentation
│   └── 01-ideation-and-setup/
├── data/                      # Local data cache (gitignored)
├── scratch/                   # Session checkpoints
└── README.md                  # This file
This project runs on the Proxmox Astronomy Lab cluster.
| Resource | Node | Purpose |
|---|---|---|
| PostgreSQL 16 | proj-pg01 | ARD queries, candidate ranking |
| Spectral tiles | proj-fs02 | QSO spectra for validation |
| GPU compute | radio-gpu01 | Embedding inference (if needed locally) |
| Python processing | proj-dp01 | Validation pipeline |
| Project | Role | Status |
|---|---|---|
| desi-cosmic-void-galaxies | ARD provider (upstream) | Active |
| desi-quasar-outflows | Outflow energetics (consumer) | Skeletal |
| This repo | Anomaly detection (consumer) | Skeletal |
| Resource | Description |
|---|---|
| DESI DR1 Portal | Official data documentation |
| Spender | Spectral autoencoder architecture |
| AGN/QSO VAC | BAL flags and QSO properties |
This project is licensed under the MIT License; see LICENSE for details.
- DESI Collaboration: Data Release 1 public data
- Spender development team: Spectral embedding architecture
- AGN/QSO VAC team: BAL identification and QSO properties
Last Updated: December 29, 2025 | Status: Skeletal (Awaiting ARD + Embeddings)

