vcf-pg-loader

High-performance VCF to PostgreSQL loader with clinical-grade compliance, purpose-built for Polygenic Risk Score (PRS) research.

Video Tutorials

What is a VCF File?

vcf-pg-loader Demo

PRS Research Features

vcf-pg-loader provides a complete data infrastructure for polygenic risk score research:

| Feature | Description |
|---|---|
| GWAS Summary Statistics | Import and query GWAS results in GWAS-SSF standard format |
| PGS Catalog Integration | Load PRS weights directly from PGS Catalog scoring files |
| HapMap3 Reference Panel | Built-in support for HapMap3 SNPs used by PRS-CS, LDpred2, and other methods |
| LD Block Annotations | Berisa & Pickrell (2016) LD blocks for Bayesian PRS methods |
| Multi-Ancestry Frequencies | Population-specific allele frequencies from gnomAD for ancestry-aware PRS |
| Genotype Dosages | Imputation dosages and genotype probabilities (GP) for accurate PRS calculation |
| Sample QC Metrics | Call rate, het/hom ratio, Ti/Tv, sex inference, and contamination checks |
| Variant QC Metrics | HWE exact test, INFO score, call rate, MAF computed at load time |
| Materialized Views | Pre-computed PRS-ready variant sets with concurrent refresh |
| Export to PRS Tools | Direct export to PLINK, PRS-CS, LDpred2, and PRSice-2 formats |

Features

Core VCF Loading

  • Streaming VCF parsing with cyvcf2 for memory-efficient processing
  • Variant normalization using the vt algorithm (left-align and trim)
  • Number=A/R/G field handling - proper per-ALT extraction during multi-allelic decomposition
  • Binary COPY protocol via asyncpg for maximum insert performance
  • Chromosome-partitioned tables for efficient region queries
  • Human and non-human genome support - chromosome enum for human, TEXT for others

PRS Data Management

  • GWAS summary statistics - import GWAS-SSF format files with study metadata
  • PGS Catalog weights - load scoring files with automatic variant matching
  • Reference panels - HapMap3 SNP sets for LD-aware PRS methods
  • LD block definitions - genome partitioning for PRS-CS and SBayesR
  • Population frequencies - multi-ancestry AF from gnomAD, 1000 Genomes
  • Genotype storage - hash-partitioned with dosage and GP support
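
As an illustration of why dosages and genotype probabilities matter for PRS, the sketch below derives an expected ALT-allele dosage from a biallelic GP triple and sums weighted dosages into a score. This is the standard textbook calculation, not the loader's internal code; `dosage_from_gp` and `prs_score` are hypothetical names.

```python
def dosage_from_gp(gp):
    """Expected ALT allele count from GP = (P(0/0), P(0/1), P(1/1))."""
    _p_hom_ref, p_het, p_hom_alt = gp
    return p_het + 2.0 * p_hom_alt


def prs_score(weights, dosages):
    """Sum of effect weight times dosage over variants present in both mappings."""
    shared = weights.keys() & dosages.keys()
    return sum(weights[variant] * dosages[variant] for variant in shared)
```

A hard call of 0/1 always contributes a dosage of exactly 1, whereas a GP of (0.1, 0.7, 0.2) contributes 1.1, so the imputation uncertainty carries through to the score.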

Quality Control

  • Variant QC - HWE p-value, INFO score, call rate, AAF/MAF/MAC
  • Sample QC - call rate, het/hom ratio, Ti/Tv, F coefficient, sex inference
  • Materialized views - pre-filtered PRS candidate variants
  • SQL functions - HWE exact test, allele harmonization in-database
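
The variant QC quantities listed above can be sketched from hard-call genotypes as follows. This is only an illustration of the definitions, assuming diploid calls encoded as ALT allele counts (0, 1, 2) with `None` for missing; the function name `variant_qc` is hypothetical and the tool's own computation may differ.

```python
def variant_qc(genotypes):
    """Compute call rate, AAF, MAF, and MAC for one variant.

    genotypes: list of ALT allele counts per sample (0, 1, or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    n_samples = len(genotypes)
    n_alleles = 2 * len(called)          # diploid: two alleles per called sample
    alt_count = sum(called)

    call_rate = len(called) / n_samples if n_samples else 0.0
    if n_alleles == 0:
        # No called genotypes: frequencies are undefined.
        return {"call_rate": call_rate, "aaf": None, "maf": None, "mac": 0}

    aaf = alt_count / n_alleles                    # alternate allele frequency
    maf = min(aaf, 1.0 - aaf)                      # minor allele frequency
    mac = min(alt_count, n_alleles - alt_count)    # minor allele count
    return {"call_rate": call_rate, "aaf": aaf, "maf": maf, "mac": mac}
```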

Infrastructure

  • Audit trail with load batch tracking and validation
  • CLI interface with Typer for easy operation
  • TOML configuration - file-based configuration with CLI overrides
  • Progress reporting - real-time progress bar with rich
  • Docker support - multi-stage Dockerfile and docker-compose for development
  • Zero-config database - auto-managed PostgreSQL via Docker, no setup required

Installation

Bioconda (Recommended)

conda install -c conda-forge -c bioconda vcf-pg-loader

PyPI

pip install vcf-pg-loader

Quick Install Script

curl -fsSL https://raw.githubusercontent.com/Zacharyr41/vcf-pg-loader/main/install.sh | bash

This installs vcf-pg-loader and all dependencies (Python, Docker) automatically.

From Source

git clone https://github.com/Zacharyr41/vcf-pg-loader.git
cd vcf-pg-loader
uv pip install -e ".[dev]"

nf-core Module

The vcfpgloader/load module is available in nf-core/modules for use in Nextflow pipelines.

Installation

nf-core modules install vcfpgloader/load

Usage

include { VCFPGLOADER_LOAD } from '../modules/nf-core/vcfpgloader/load/main'

workflow {
    ch_input = Channel.of([
        [ id: 'sample1', family: 'FAM001' ],  // meta map
        file('sample1.vcf.gz'),                // vcf
        file('sample1.vcf.gz.tbi'),            // tbi
        'localhost',                           // db_host
        5432,                                  // db_port
        'variants_db',                         // db_name
        'postgres',                            // db_user
        'public'                               // db_schema
    ])

    VCFPGLOADER_LOAD(ch_input)

    // Access outputs
    VCFPGLOADER_LOAD.out.report     // JSON report with loading statistics
    VCFPGLOADER_LOAD.out.log        // Detailed loading log
    VCFPGLOADER_LOAD.out.row_count  // Number of variants loaded
}

Database Password

Set PGPASSWORD via environment variable or Nextflow secrets:

// nextflow.config - Option 1: Environment variable
env {
    PGPASSWORD = System.getenv('PGPASSWORD')
}

// nextflow.config - Option 2: Nextflow secrets
env {
    PGPASSWORD = secrets.PGPASSWORD
}

Configuration

Customize batch size and other options via ext directives:

// nextflow.config
process {
    withName: 'VCFPGLOADER_LOAD' {
        ext.batch_size = '50000'  // variants per batch (default: 10000)
        ext.args = '--normalize'  // additional CLI arguments
    }
}

Outputs

Channel Description
report JSON file with loading statistics (variants loaded, elapsed time, throughput)
log Detailed loading log with warnings/errors
row_count Integer count of variants successfully loaded
versions Tool version for MultiQC reporting

Verify Installation

vcf-pg-loader doctor

Quick Start

Zero-Config Mode (Easiest)

No PostgreSQL setup required: vcf-pg-loader starts and manages a local PostgreSQL container through Docker automatically.

# Load a VCF file (auto-starts PostgreSQL in Docker)
vcf-pg-loader load sample.vcf.gz

# Check database status
vcf-pg-loader db status

# Open psql shell to query data
vcf-pg-loader db shell

With Your Own PostgreSQL

# Initialize database schema
vcf-pg-loader init-db --db postgresql://user:pass@localhost/variants

# Load a VCF file
vcf-pg-loader load sample.vcf.gz --db postgresql://user:pass@localhost/variants

# Validate a completed load
vcf-pg-loader validate <load-batch-id> --db postgresql://user:pass@localhost/variants

Additional Options

# Load without normalization
vcf-pg-loader load sample.vcf.gz --no-normalize

# Load non-human VCF (e.g., SARS-CoV-2)
vcf-pg-loader load sarscov2.vcf.gz --no-human-genome

# Initialize for non-human genomes
vcf-pg-loader init-db --db postgresql://... --no-human-genome

PRS Workflow Quick Start

# 1. Load imputed VCF with genotype dosages
vcf-pg-loader load imputed.vcf.gz --db postgresql://localhost/prs_db

# 2. Import GWAS summary statistics
vcf-pg-loader import-gwas gwas_sumstats.tsv \
    --study-id GCST90012345 \
    --trait "Type 2 Diabetes" \
    --db postgresql://localhost/prs_db

# 3. Load PGS Catalog weights
vcf-pg-loader import-pgs PGS000001_hmPOS_GRCh38.txt \
    --db postgresql://localhost/prs_db

# 4. Load HapMap3 reference panel
vcf-pg-loader load-reference hapmap3.tsv \
    --panel-name hapmap3 \
    --db postgresql://localhost/prs_db

# 5. Annotate variants with LD blocks
vcf-pg-loader annotate-ld-blocks \
    --population EUR \
    --db postgresql://localhost/prs_db

# 6. Compute sample QC metrics
vcf-pg-loader compute-sample-qc \
    --db postgresql://localhost/prs_db

# 7. Refresh materialized views for fast queries
vcf-pg-loader refresh-views --db postgresql://localhost/prs_db

# 8. Export to PRS-CS format
vcf-pg-loader export-prs-cs \
    --study-id 1 \
    --output gwas_prscs.txt \
    --hapmap3-only \
    --db postgresql://localhost/prs_db

CLI Commands

load

Load a VCF file into PostgreSQL.

vcf-pg-loader load <vcf_path> [OPTIONS]

Options:
  --db, -d                        PostgreSQL connection URL (omit for auto-managed DB)
  --batch, -b                     Records per batch [default: 50000]
  --workers, -w                   Parallel workers [default: 8]
  --normalize/--no-normalize      Normalize variants using vt algorithm [default: normalize]
  --drop-indexes/--keep-indexes   Drop indexes during load [default: drop-indexes]
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]
  --config, -c                    TOML configuration file
  --verbose, -v                   Enable verbose logging (DEBUG level)
  --quiet, -q                     Suppress non-error output
  --progress/--no-progress        Show progress bar [default: progress]
  --force, -f                     Force reload even if file was already loaded
  --hipaa-mode/--no-hipaa-mode    Enable/disable HIPAA compliance features [default: enabled]

When --db is omitted, vcf-pg-loader automatically uses a managed PostgreSQL container.

Normalization: When enabled (default), variants are left-aligned and trimmed following the vt algorithm. This ensures consistent representation across different variant callers.
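
A minimal sketch of left-align-and-trim for a biallelic variant, following the published vt normalization procedure, is shown here for orientation. It assumes the reference sequence is available as an in-memory string and uses 1-based positions; the function name `normalize` is hypothetical and this is not the loader's actual implementation.

```python
def normalize(pos, ref, alt, reference):
    """Left-align and trim one biallelic variant.

    pos:       1-based position of the first base of ref
    reference: chromosome sequence as a string (assumption: fits in memory)
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend a variant at position 1")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

With the reference `GGGCACACACAGGG`, a CA deletion reported as `(9, "ACA", "A")` normalizes to `(3, "GCA", "G")`, the same representation any other caller's placement of that deletion within the repeat would reduce to.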

Genome Type: Human genome mode uses a PostgreSQL enum for chromosomes (chr1-22, X, Y, M), which provides validation and efficient storage. Non-human mode uses a TEXT column instead, so arbitrary chromosome or contig names can be stored.

HIPAA Mode: By default, vcf-pg-loader runs with HIPAA compliance features enabled:

  • TLS required for database connections
  • Sample ID anonymization
  • VCF header sanitization to remove PHI

For local development without PHI, use --no-hipaa-mode to disable all compliance features:

vcf-pg-loader load sample.vcf.gz --no-hipaa-mode

validate

Validate a completed load by checking record counts and duplicates.

vcf-pg-loader validate <load_batch_id> [OPTIONS]

Options:
  --db, -d    PostgreSQL connection URL
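
The two checks named above, record counts and duplicates, can be pictured with the following sketch. It assumes a variant is identified by a `(chrom, pos, ref, alt)` key and that the keys have already been read back from the database; the function name `validate_load` and the shape of the result are hypothetical, not the command's actual output.

```python
from collections import Counter


def validate_load(expected_count, loaded_keys):
    """Compare the loaded row count to the expected one and list duplicate keys.

    loaded_keys: iterable of (chrom, pos, ref, alt) tuples, one per loaded row.
    """
    counts = Counter(loaded_keys)
    loaded_count = sum(counts.values())
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    count_matches = loaded_count == expected_count
    return {
        "loaded_count": loaded_count,
        "count_matches": count_matches,
        "duplicates": duplicates,
        "ok": count_matches and not duplicates,
    }
```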

init-db

Initialize the database schema (tables, indexes, extensions).

vcf-pg-loader init-db [OPTIONS]

Options:
  --db, -d                          PostgreSQL connection URL
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]

Important: The genome type must match between init-db and load commands. Use --no-human-genome for both when loading non-human VCFs.

benchmark

Run performance benchmarks on VCF parsing and loading.

vcf-pg-loader benchmark [OPTIONS]

Options:
  --vcf, -f        Path to VCF file (uses built-in fixture if omitted)
  --synthetic, -s  Generate synthetic VCF with N variants
  --db, -d         PostgreSQL URL (omit for parsing-only benchmark)
  --batch, -b      Batch size [default: 50000]
  --normalize/--no-normalize  Test with/without normalization
  --json           Output results as JSON (for CI integration)
  --quiet, -q      Minimal output

Examples:

# Quick benchmark with built-in fixture (~2.6K variants)
vcf-pg-loader benchmark

# Generate and benchmark 100K synthetic variants
vcf-pg-loader benchmark --synthetic 100000

# Benchmark a specific VCF file
vcf-pg-loader benchmark --vcf /path/to/sample.vcf.gz

# Full benchmark including database loading
vcf-pg-loader benchmark --synthetic 50000 --db postgresql://localhost/variants

# JSON output for CI/scripting
vcf-pg-loader benchmark --synthetic 10000 --json

Sample output:

Benchmark Results (synthetic)
  Variants: 100,000
  Batch size: 50,000
  Normalized: True

Parsing: 100,000 variants in 0.94s (106,000/sec)

doctor

Check system dependencies and diagnose issues.

vcf-pg-loader doctor

# Example output:
Dependency Check
  Python         3.12.4   OK
  cyvcf2         0.30.22  OK
  asyncpg        0.29.0   OK
  Docker         24.0.5   OK
  Docker daemon  running  OK

db

Manage the local PostgreSQL database (Docker-based).

vcf-pg-loader db start   # Start PostgreSQL container
vcf-pg-loader db stop    # Stop the container
vcf-pg-loader db status  # Show running status and connection URL
vcf-pg-loader db url     # Print connection URL (for scripts)
vcf-pg-loader db shell   # Open psql shell
vcf-pg-loader db reset   # Remove container and all data

Architecture

Components

  1. VCFHeaderParser - Parses VCF headers via cyvcf2's native API to extract INFO/FORMAT field definitions
  2. VCFStreamingParser - Memory-efficient streaming iterator that yields batches of VariantRecord objects
  3. VariantParser - Handles per-variant parsing with Number=A/R/G field extraction for multi-allelic decomposition
  4. VCFLoader - Orchestrates loading with asyncpg binary COPY protocol
  5. SchemaManager - Manages PostgreSQL schema creation and index management

Data Flow

VCF File → VCFStreamingParser → Batch Buffer → asyncpg COPY → PostgreSQL
                ↓
         VCFHeaderParser (field metadata)
                ↓
         VariantParser (Number=A/R/G extraction)
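
The batching step in this flow can be sketched with a plain generator: records are pulled from the parser one at a time and handed on in fixed-size lists, so only one batch is held in memory at once. The name `iter_batches` is hypothetical, and the real pipeline passes each batch to the asyncpg COPY step rather than returning it.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return          # source exhausted
        yield batch
```

Because the source is consumed lazily, peak memory depends on the batch size rather than on the size of the VCF file.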

Citations and Acknowledgments

This project was inspired by and builds upon several foundational tools in the genomics community:

Primary References

Slivar - Rapid variant filtering:

Pedersen, B.S., Brown, J.M., Dashnow, H. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. npj Genom. Med. 6, 60 (2021). https://doi.org/10.1038/s41525-021-00227-3

GEMINI - Original SQL-based VCF database:

Paila, U., Chapman, B.A., Kirchner, R., & Quinlan, A.R. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153 (2013). https://doi.org/10.1371/journal.pcbi.1003153

cyvcf2 - Python VCF parsing:

Pedersen, B.S. & Quinlan, A.R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33(12), 1867–1869 (2017). https://doi.org/10.1093/bioinformatics/btx057

Configuration

vcf-pg-loader supports TOML configuration files for persistent settings:

# vcf-pg-loader.toml
[vcf_pg_loader]
batch_size = 25000
workers = 16
normalize = true
drop_indexes = true
human_genome = true
log_level = "INFO"

Use with the --config flag:

vcf-pg-loader load sample.vcf.gz --config vcf-pg-loader.toml

CLI arguments override config file values.

Docker

Using Docker Compose (recommended for development)

# Start PostgreSQL and run a load
docker-compose up -d postgres
docker-compose run vcf-pg-loader load /data/sample.vcf.gz --db postgresql://vcfloader:vcfloader@postgres:5432/variants

# Or build and run standalone
docker build -t vcf-pg-loader .
docker run vcf-pg-loader --help

Docker Compose Services

  • postgres: PostgreSQL 16 with health checks
  • vcf-pg-loader: The loader application

Mount your VCF files to /data in the container.

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=vcf_pg_loader

# Run only unit tests (skip integration)
uv run pytest -m "not integration"

Code Quality

# Lint
uv run ruff check src tests

# Type check
uv run mypy src

Documentation

Getting Started

Schema Reference

Background

License

MIT - See LICENSE for details.
