PRS Optimization: Complete Polygenic Risk Score Pipeline Infrastructure#41

Merged
Zacharyr41 merged 19 commits into main from feature/prs-optimization-features
Jan 5, 2026
@Zacharyr41

Overview

This PR introduces a comprehensive infrastructure for Polygenic Risk Score (PRS) calculation workflows, adding ~19,000 lines of production-ready code across 87 files. The implementation targets clinical-grade standards and is covered by a suite of 1843 tests (1841 passed, 2 skipped).

Key Features

1. GWAS Summary Statistics Import (GWAS-SSF Standard)

  • Full compliance with GWAS-SSF specification
  • Automatic strand-ambiguous variant detection (A/T, C/G)
  • Effect allele frequency validation and harmonization
  • Binary trait support with odds ratio handling
  • Batch import with progress tracking

2. Genotype Data Storage with Dosage Support

  • Efficient storage of imputed dosages (0.0-2.0 scale)
  • Genotype probability triplets (GP field)
  • Phased haplotype support
  • Chromosome-partitioned tables for query performance
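
A minimal sketch of how a dosage on the 0.0-2.0 scale relates to a GP triplet, assuming the usual biallelic convention that GP holds (P(0/0), P(0/1), P(1/1)). The function name and tolerance are hypothetical and are not taken from this PR's loader.

```python
def dosage_from_gp(gp: tuple[float, float, float], tol: float = 1e-2) -> float:
    """Return the expected ALT-allele count for one sample at one variant.

    gp is (P(hom ref), P(het), P(hom alt)); the probabilities should sum to 1.
    """
    p_hom_ref, p_het, p_hom_alt = gp
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > tol:
        raise ValueError(f"GP triplet sums to {total:.4f}, expected 1.0")
    # One ALT allele for a heterozygote, two for a homozygous-ALT call.
    return p_het + 2.0 * p_hom_alt
```

A confident heterozygous call such as (0.0, 1.0, 0.0) yields a dosage of 1.0, while an uncertain call lands between the integer genotypes.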

3. Reference Panel Integration

  • HapMap3 variants: Gold-standard ~1.3M SNPs for PRS
  • LD block annotations: Population-specific (EUR, AFR, EAS, SAS, AMR)
  • Automatic genome build detection (GRCh37/GRCh38)

4. PGS Catalog Integration

  • Direct import from PGS Catalog harmonized files
  • Effect weight normalization (beta ↔ OR conversion)
  • Multi-ancestry score support
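
For reference, the beta ↔ OR conversion is the natural-log relationship beta = ln(OR). The sketch below shows that relationship only; the helper names are hypothetical and may not match the normalization code in this PR.

```python
import math


def or_to_beta(odds_ratio: float) -> float:
    """Convert an odds ratio to a log-odds effect size (beta)."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio)


def beta_to_or(beta: float) -> float:
    """Convert a log-odds effect size (beta) back to an odds ratio."""
    return math.exp(beta)
```

An odds ratio of 1.0 maps to a beta of 0.0, so a variant with no effect contributes nothing to an additive score.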

5. Quality Control Pipeline

  • Variant QC: Call rate, HWE, MAF filtering
  • Sample QC: Heterozygosity, missingness, sex check
  • Ancestry-aware thresholds
  • SQL-native QC functions for in-database filtering
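
The variant-level metrics can be illustrated from genotype counts alone. This is a sketch under assumptions: it uses a plain chi-square statistic for HWE (the PR does not say which test it uses), and the type, function names and default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantQC:
    call_rate: float
    maf: float
    hwe_chi2: float


def variant_qc(n_hom_ref: int, n_het: int, n_hom_alt: int, n_missing: int) -> VariantQC:
    """Compute call rate, minor allele frequency and an HWE chi-square statistic."""
    n_called = n_hom_ref + n_het + n_hom_alt
    if n_called == 0:
        raise ValueError("no called genotypes")
    call_rate = n_called / (n_called + n_missing)
    alt_freq = (n_het + 2 * n_hom_alt) / (2 * n_called)
    maf = min(alt_freq, 1.0 - alt_freq)
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    p, q = 1.0 - alt_freq, alt_freq
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return VariantQC(call_rate, maf, chi2)


def passes(qc: VariantQC, min_call_rate: float, min_maf: float, max_hwe_chi2: float) -> bool:
    """Apply caller-supplied thresholds; no defaults are implied by the PR."""
    return qc.call_rate >= min_call_rate and qc.maf >= min_maf and qc.hwe_chi2 <= max_hwe_chi2
```

Taking the thresholds as arguments keeps the sketch compatible with ancestry-aware cutoffs, since the caller can pass a different set per population.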

6. Materialized Views for PRS Queries

  • prs_candidate_variants: Pre-joined variants × GWAS × weights
  • sample_prs_components: Per-sample score building blocks
  • Automatic refresh with dependency tracking
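
The score these views feed is, in its simplest additive form, the sum of effect weight × dosage over the variants present in both inputs. The sketch below assumes that form and uses plain dictionaries keyed by variant ID; it does not reproduce the views' SQL, and the function name is hypothetical.

```python
def polygenic_score(dosages: dict[str, float], weights: dict[str, float]) -> tuple[float, int]:
    """Return (score, number of matched variants) for one sample.

    dosages maps variant ID -> effect-allele dosage (0.0-2.0);
    weights maps variant ID -> effect weight.
    """
    matched = dosages.keys() & weights.keys()
    score = sum(dosages[v] * weights[v] for v in matched)
    return score, len(matched)
```

Returning the matched-variant count alongside the score makes it easy to spot samples whose score was built from only a fraction of the weights.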

7. Export Functions

  • PRSice-2 format (.valid, .snp files)
  • PLINK score format
  • LDpred2 input preparation
  • Direct PRS calculation export
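
As an illustration of the PLINK score format, PLINK's `--score` option reads a text file with a variant ID column, an allele column and a score column. The sketch below writes such a file with a header row; the column names and the function name are assumptions, and the layout produced by this PR's `export_plink_score` is not shown here.

```python
import csv
from typing import Iterable, TextIO


def write_plink_score(rows: Iterable[tuple[str, str, float]], handle: TextIO) -> int:
    """Write (variant_id, effect_allele, weight) rows as a tab-separated score file.

    Returns the number of data rows written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SNP", "A1", "BETA"])  # header names are illustrative
    count = 0
    for variant_id, effect_allele, weight in rows:
        writer.writerow([variant_id, effect_allele, f"{weight:.6g}"])
        count += 1
    return count
```

With a header row present, the file would be passed to PLINK with the column positions and the `header` modifier, for example `--score scores.tsv 1 2 3 header`.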

8. SQL Validation Functions

  • Allele complementation (is_complement)
  • Strand ambiguity detection (is_strand_ambiguous)
  • Effect allele harmonization (harmonize_effect_allele)
  • Chromosome normalization (normalize_chromosome)
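
The logic behind these four functions can be sketched in Python. The names mirror the SQL functions listed above, but the signatures, return values and edge-case handling are assumptions; in particular this version handles single-base alleles only and does not attempt strand flips.

```python
from typing import Optional

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_complement(a: str, b: str) -> bool:
    """True if b is the complementary base of a."""
    return _COMPLEMENT.get(a.upper()) == b.upper()


def is_strand_ambiguous(ref: str, alt: str) -> bool:
    """True for A/T and C/G SNPs, whose alleles look identical on either strand."""
    return is_complement(ref, alt)


def normalize_chromosome(chrom: str) -> str:
    """Strip a 'chr' prefix, upper-case, and map 'M' to 'MT'."""
    c = chrom.strip()
    if c.lower().startswith("chr"):
        c = c[3:]
    c = c.upper()
    return "MT" if c == "M" else c


def harmonize_effect_allele(
    effect: str, other: str, ref: str, alt: str, beta: float
) -> Optional[float]:
    """Return beta oriented to the ALT allele, or None if the alleles do not match."""
    effect, other, ref, alt = (x.upper() for x in (effect, other, ref, alt))
    if (effect, other) == (alt, ref):
        return beta
    if (effect, other) == (ref, alt):
        return -beta  # effect allele is REF, so flip the sign
    return None
```

Strand-ambiguous variants are flagged rather than harmonized here because an A/T or C/G pair cannot be distinguished from its own reverse complement by alleles alone.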

Schema Changes

| Table | Purpose |
|---|---|
| gwas_summary_stats | GWAS effect sizes, p-values, frequencies |
| studies | GWAS study metadata (trait, ancestry, N) |
| genotype_dosages | Imputed dosage values per sample×variant |
| prs_weights | Effect weights from PGS Catalog |
| prs_scores | Score metadata and versions |
| hapmap3_variants | Reference panel variants |
| ld_blocks | LD block boundaries by population |
| sample_qc_metrics | Per-sample QC statistics |
| variant_qc_metrics | Per-variant QC statistics |
| population_frequencies | Multi-ancestry allele frequencies |

CLI Commands Added

# Import GWAS summary statistics
vcf-pg-loader gwas import --study GCST90002357 --trait Height summary_stats.tsv

# Import PGS Catalog scores
vcf-pg-loader prs import-pgs PGS000001_hmPOS_GRCh38.txt

# Load genotype dosages
vcf-pg-loader genotypes load --format dosage imputed.vcf.gz

# Run sample QC
vcf-pg-loader qc sample --het-threshold 0.2 --call-rate 0.98

# Export for PRSice-2
vcf-pg-loader export prsice --score PGS000001 --output ./prsice_input

Performance

  • Chromosome partitioning enables parallel scans across 25 partitions
  • Materialized views reduce PRS query time from minutes to seconds
  • Batch loading with COPY protocol for high-throughput imports
  • Indexed lookups on (chrom, pos, ref, alt) for variant matching

Testing

  • 1843 tests run (1841 passed, 2 skipped)
  • Integration tests use testcontainers for PostgreSQL
  • Comprehensive edge case coverage (strand flips, missing data, multi-allelic)
  • Performance benchmarks for realistic data volumes

Documentation

Breaking Changes

None. All changes are additive.

Test Plan

  • All 1843 tests run locally (1841 passed, 2 skipped)
  • Pre-commit hooks (ruff, ruff-format) pass
  • CI pipeline passes
  • Manual verification of GWAS import with real GWAS-SSF file
  • Manual verification of PGS Catalog import

🤖 Generated with Claude Code

Zacharyr41 and others added 18 commits January 3, 2026 20:39
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
GCST* identifiers are public GWAS Catalog study accessions, not secrets.

Add SampleQCConfig dataclass to allow overriding hardcoded thresholds
for sex inference, call rate, contamination, and X chromosome PAR region
while preserving backward compatibility through default values.
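
A sketch of the pattern this commit describes: a dataclass whose defaults preserve existing behavior while letting callers override individual thresholds. Every field name and default here is hypothetical. The 0.98 call rate echoes the CLI example above, the 0.2 and 0.8 F-statistic cutoffs follow PLINK's `--check-sex` defaults, and the contamination value is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleQCConfig:
    """Hypothetical threshold bundle; the real class may define other fields."""

    min_call_rate: float = 0.98
    female_max_f: float = 0.2  # X-chromosome F at or below this -> female
    male_min_f: float = 0.8  # X-chromosome F at or above this -> male
    max_contamination: float = 0.03  # illustrative value only


def infer_sex(x_inbreeding_f: float, config: SampleQCConfig = SampleQCConfig()) -> str:
    """Classify a sample from its X-chromosome inbreeding coefficient."""
    if x_inbreeding_f >= config.male_min_f:
        return "male"
    if x_inbreeding_f <= config.female_max_f:
        return "female"
    return "ambiguous"
```

Callers that pass no config get the defaults, which is what keeps the change backward compatible; a caller with different requirements passes, for example, `SampleQCConfig(male_min_f=0.7)`.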

- Add HapMap3Downloader class for fetching reference panel from LDpred2 figshare
- Add download-reference CLI command with caching and checksum support
- Integrate cached downloads with load-reference command
- Add httpx dependency for async HTTP downloads
- Update CLI and schema documentation
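
A sketch of the cache-plus-checksum check such a downloader might perform before re-fetching a file. It assumes SHA-256 and local file paths; the commit does not state which digest `HapMap3Downloader` uses, and these helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large reference panels are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached(path: Path, expected_sha256: str) -> bool:
    """True if the file exists and its digest matches the expected value."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

A download step would call `is_cached` first and fetch only when it returns False, then verify the digest again after writing the file.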

- Add LDBlockDownloader for downloading Berisa & Pickrell (2016) LD blocks
- Support EUR, AFR, ASN populations from ldetect-data Bitbucket
- Extend download-reference CLI with --population option for ld-blocks
- Update load-reference to check cache first and suggest download
- Add 27 tests covering config, checksum, downloader, and CLI
- Update documentation (cli-reference, reference-tables, prs-workflows)

- Change LEFT JOIN to INNER JOIN for gwas_summary_stats so only variants
  with effect estimates are included (no NULL betas)
- Add gnomad_afr_af and gnomad_eas_af columns alongside gnomad_nfe_af
  for multi-ancestry PRS support
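
The effect of the join change can be shown on a toy dataset. This sketch uses an in-memory SQLite database rather than PostgreSQL, and the two-column tables are simplified stand-ins for the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variants (variant_id TEXT PRIMARY KEY);
    CREATE TABLE gwas_summary_stats (variant_id TEXT, beta REAL);
    INSERT INTO variants VALUES ('rs1'), ('rs2'), ('rs3');
    INSERT INTO gwas_summary_stats VALUES ('rs1', 0.12), ('rs3', -0.05);
    """
)

QUERY = """
    SELECT v.variant_id, g.beta
    FROM variants v
    {join} JOIN gwas_summary_stats g ON g.variant_id = v.variant_id
    ORDER BY v.variant_id
"""

# LEFT JOIN keeps rs2 with a NULL beta; INNER JOIN drops it.
left_rows = conn.execute(QUERY.format(join="LEFT")).fetchall()
inner_rows = conn.execute(QUERY.format(join="INNER")).fetchall()
conn.close()
```

Here `left_rows` contains `('rs2', None)` while `inner_rows` holds only the two variants that have an effect estimate, which is the behavior the commit describes.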

@Zacharyr41 Zacharyr41 force-pushed the feature/prs-optimization-features branch from 35eb734 to 45a4e27 Compare January 5, 2026 03:29

@Zacharyr41 Zacharyr41 left a comment

Reviewed these. I will upload the last round shortly. I made some commits as I went through, and it now looks good.

- Fix silent error swallowing in export_plink_score
- Add shared variant matching utils with consistent chromosome normalization
- Fix resource leak in genotype loader (try/finally for VCF close)
- Add input validators for study_accession and genome_build
- Update GWAS and PRS loaders to use shared utilities
- Add docs note about materialized view GWAS dependency
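
A sketch of what input validators for `study_accession` and `genome_build` could look like. It assumes accessions follow the `GCST` plus digits form seen in the CLI example and that only the two builds named in this PR are accepted; the function names and error messages are hypothetical.

```python
import re

_STUDY_ACCESSION = re.compile(r"GCST\d+")
_GENOME_BUILDS = frozenset({"GRCh37", "GRCh38"})


def validate_study_accession(value: str) -> str:
    """Return the trimmed accession, or raise ValueError if it is malformed."""
    candidate = value.strip()
    if not _STUDY_ACCESSION.fullmatch(candidate):
        raise ValueError(f"invalid study accession: {value!r}")
    return candidate


def validate_genome_build(value: str) -> str:
    """Return the trimmed build name, or raise ValueError if it is unsupported."""
    candidate = value.strip()
    if candidate not in _GENOME_BUILDS:
        raise ValueError(f"unsupported genome build: {value!r}")
    return candidate
```

Using `fullmatch` rather than `match` rejects values with trailing characters, so a string that merely begins with a valid accession does not pass.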

@Zacharyr41 Zacharyr41 merged commit 62825a8 into main Jan 5, 2026
19 checks passed