vcf-pg-loader

High-performance VCF-to-PostgreSQL loader with HIPAA-oriented compliance features, purpose-built for polygenic risk score (PRS) research.

Video Tutorials

What is a VCF File?

vcf-pg-loader Demo

PRS Research Features

vcf-pg-loader provides a complete data infrastructure for polygenic risk score research:

Feature	Description
GWAS Summary Statistics	Import and query GWAS results in GWAS-SSF standard format
PGS Catalog Integration	Load PRS weights directly from PGS Catalog scoring files
HapMap3 Reference Panel	Built-in support for HapMap3 SNPs used by PRS-CS, LDpred2, and other methods
LD Block Annotations	Berisa & Pickrell (2016) LD blocks for Bayesian PRS methods
Multi-Ancestry Frequencies	Population-specific allele frequencies from gnomAD for ancestry-aware PRS
Genotype Dosages	Imputation dosages and genotype probabilities (GP) for accurate PRS calculation
Sample QC Metrics	Call rate, het/hom ratio, Ti/Tv, sex inference, and contamination checks
Variant QC Metrics	HWE exact test, INFO score, call rate, MAF computed at load time
Materialized Views	Pre-computed PRS-ready variant sets with concurrent refresh
Export to PRS Tools	Direct export to PLINK, PRS-CS, LDpred2, and PRSice-2 formats
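
To make the Genotype Dosages row concrete, here is a minimal sketch of how an ALT-allele dosage can be derived from a VCF GP triple and combined with per-variant weights into a score. The function names `dosage_from_gp` and `polygenic_score` are illustrative assumptions, not part of vcf-pg-loader's API, and the tool's own computation may differ.

```python
from typing import Sequence


def dosage_from_gp(gp: Sequence[float]) -> float:
    """Expected ALT allele count from a diploid GP triple.

    gp is (P(0/0), P(0/1), P(1/1)) as stored in the VCF FORMAT/GP field.
    """
    p_ref, p_het, p_alt = gp
    total = p_ref + p_het + p_alt
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Renormalise in case rounding left the sum slightly off 1.
    return (p_het + 2.0 * p_alt) / total


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele dosages (hypothetical helper)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))
```

Using dosages rather than hard calls keeps imputation uncertainty in the score: a genotype with GP (0.1, 0.2, 0.7) contributes 1.6 copies instead of being rounded to 2.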

Features

Core VCF Loading

Streaming VCF parsing with cyvcf2 for memory-efficient processing
Variant normalization using the vt algorithm (left-align and trim)
Number=A/R/G field handling - proper per-ALT extraction during multi-allelic decomposition
Binary COPY protocol via asyncpg for maximum insert performance
Chromosome-partitioned tables for efficient region queries
Human and non-human genome support - chromosome enum for human, TEXT for others
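
The Number=A/R/G handling mentioned above can be sketched as follows. This is an illustration of the indexing rules in the VCF specification for diploid genotypes, not the project's actual code; `split_field` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def split_field(values: Sequence, number: str, alt_index: int, n_alts: int) -> List:
    """Pick the values belonging to one ALT allele of a multi-allelic record.

    alt_index is 1-based (1 = first ALT). Assumes diploid genotypes for Number=G.
    """
    if not 1 <= alt_index <= n_alts:
        raise ValueError("alt_index out of range")
    if number == "A":
        # One value per ALT allele.
        expected = n_alts
        picked = [alt_index - 1]
    elif number == "R":
        # One value per allele, REF first: keep REF and the chosen ALT.
        expected = n_alts + 1
        picked = [0, alt_index]
    elif number == "G":
        # One value per genotype; genotype j/k (j <= k) sits at k*(k+1)/2 + j.
        expected = (n_alts + 1) * (n_alts + 2) // 2
        base = alt_index * (alt_index + 1) // 2
        picked = [0, base, base + alt_index]  # 0/0, 0/i, i/i
    else:
        # Other Number types are not allele-specific; pass through unchanged.
        return list(values)
    if len(values) != expected:
        raise ValueError(f"Number={number} expects {expected} values, got {len(values)}")
    return [values[i] for i in picked]
```

For a site with two ALT alleles, the second decomposed record keeps the REF/ALT2 depths from an R field and the 0/0, 0/2, and 2/2 entries from a G field.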

PRS Data Management

GWAS summary statistics - import GWAS-SSF format files with study metadata
PGS Catalog weights - load scoring files with automatic variant matching
Reference panels - HapMap3 SNP sets for LD-aware PRS methods
LD block definitions - genome partitioning for PRS-CS and SBayesR
Population frequencies - multi-ancestry AF from gnomAD, 1000 Genomes
Genotype storage - hash-partitioned with dosage and GP support
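
Matching scoring-file variants to loaded variants usually requires allele harmonization. The sketch below shows one common approach (direct match, allele swap, strand flip, and flagging strand-ambiguous pairs). It is an assumption about how such matching can work, not a description of vcf-pg-loader's matching logic; `harmonize` and its status labels are hypothetical.

```python
from typing import Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(allele: str) -> str:
    """Reverse-complement an allele; unknown bases become N."""
    return "".join(_COMPLEMENT.get(b, "N") for b in reversed(allele.upper()))


def harmonize(effect: str, other: str, ref: str, alt: str) -> Tuple[str, int]:
    """Compare scoring-file alleles with a stored REF/ALT pair.

    Returns (status, sign). sign is the multiplier for the weight when the
    score is computed on ALT dosage (+1 if the effect allele is ALT, -1 if it
    is REF, ignoring the constant offset), or 0 when the variant is unusable.
    """
    effect, other, ref, alt = (a.upper() for a in (effect, other, ref, alt))
    # A/T and C/G pairs look identical on both strands, so flag them first.
    if reverse_complement(effect) == other:
        return "ambiguous", 0
    if (effect, other) == (alt, ref):
        return "match", 1
    if (effect, other) == (ref, alt):
        return "swap", -1
    flipped = (reverse_complement(effect), reverse_complement(other))
    if flipped == (alt, ref):
        return "flip", 1
    if flipped == (ref, alt):
        return "flip_swap", -1
    return "mismatch", 0
```

Dropping ambiguous pairs is a conservative choice; some pipelines instead resolve them using allele frequencies.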

Quality Control

Variant QC - HWE p-value, INFO score, call rate, AAF/MAF/MAC
Sample QC - call rate, het/hom ratio, Ti/Tv, F coefficient, sex inference
Materialized views - pre-filtered PRS candidate variants
SQL functions - HWE exact test, allele harmonization in-database
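
As an illustration of the variant QC metrics listed above, the following sketch computes a two-sided HWE exact-test p-value and the MAF from genotype counts. It enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities no larger than the observed one. The project implements its test as an SQL function, so this Python version is only an assumed equivalent; the function names are hypothetical.

```python
import math


def _log_prob(n_het: int, n_a: int, n: int) -> float:
    """log P(n_het heterozygotes | n_a copies of allele A in n diploid samples)."""
    n_b = 2 * n - n_a
    n_aa = (n_a - n_het) // 2
    n_bb = (n_b - n_het) // 2

    def lf(k: int) -> float:
        return math.lgamma(k + 1)  # log(k!)

    return (n_het * math.log(2) + lf(n) - lf(n_aa) - lf(n_het) - lf(n_bb)
            + lf(n_a) + lf(n_b) - lf(2 * n))


def hwe_exact_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Two-sided exact-test p-value for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab
    observed = _log_prob(n_ab, n_a, n)
    max_het = min(n_a, 2 * n - n_a)
    p = 0.0
    # Heterozygote counts must share parity with the allele count.
    for het in range(n_a % 2, max_het + 1, 2):
        lp = _log_prob(het, n_a, n)
        if lp <= observed + 1e-9:  # small tolerance for float rounding
            p += math.exp(lp)
    return min(1.0, p)


def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of the rarer allele among called genotypes."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    return min(freq_a, 1.0 - freq_a)
```

Working in log space with `math.lgamma` avoids overflow for cohorts with many thousands of samples.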

Infrastructure

Audit trail with load batch tracking and validation
CLI interface with Typer for easy operation
TOML configuration - file-based configuration with CLI overrides
Progress reporting - real-time progress bar with rich
Docker support - multi-stage Dockerfile and docker-compose for development
Zero-config database - auto-managed PostgreSQL via Docker, no setup required

Installation

Bioconda (Recommended)

conda install -c conda-forge -c bioconda vcf-pg-loader

PyPI

pip install vcf-pg-loader

Quick Install Script

curl -fsSL https://raw.githubusercontent.com/Zacharyr41/vcf-pg-loader/main/install.sh | bash

This installs vcf-pg-loader and all dependencies (Python, Docker) automatically.

From Source

git clone https://github.com/Zacharyr41/vcf-pg-loader.git
cd vcf-pg-loader
uv pip install -e ".[dev]"

nf-core Module

The vcfpgloader/load module is available in nf-core/modules for use in Nextflow pipelines.

Installation

nf-core modules install vcfpgloader/load

Usage

include { VCFPGLOADER_LOAD } from '../modules/nf-core/vcfpgloader/load/main'

workflow {
    ch_input = Channel.of([
        [ id: 'sample1', family: 'FAM001' ],  // meta map
        file('sample1.vcf.gz'),                // vcf
        file('sample1.vcf.gz.tbi'),            // tbi
        'localhost',                           // db_host
        5432,                                  // db_port
        'variants_db',                         // db_name
        'postgres',                            // db_user
        'public'                               // db_schema
    ])

    VCFPGLOADER_LOAD(ch_input)

    // Access outputs
    VCFPGLOADER_LOAD.out.report     // JSON report with loading statistics
    VCFPGLOADER_LOAD.out.log        // Detailed loading log
    VCFPGLOADER_LOAD.out.row_count  // Number of variants loaded
}

Database Password

Set PGPASSWORD via environment variable or Nextflow secrets:

// nextflow.config - Option 1: Environment variable
env {
    PGPASSWORD = System.getenv('PGPASSWORD')
}

// nextflow.config - Option 2: Nextflow secrets
env {
    PGPASSWORD = secrets.PGPASSWORD
}

Configuration

Customize batch size and other options via ext directives:

// nextflow.config
process {
    withName: 'VCFPGLOADER_LOAD' {
        ext.batch_size = '50000'  // variants per batch (default: 10000)
        ext.args = '--normalize'  // additional CLI arguments
    }
}

Outputs

Channel	Description
`report`	JSON file with loading statistics (variants loaded, elapsed time, throughput)
`log`	Detailed loading log with warnings/errors
`row_count`	Integer count of variants successfully loaded
`versions`	Tool version for MultiQC reporting

Verify Installation

vcf-pg-loader doctor

Quick Start

Zero-Config Mode (Easiest)

No PostgreSQL setup required - vcf-pg-loader manages a local database automatically:

# Load a VCF file (auto-starts PostgreSQL in Docker)
vcf-pg-loader load sample.vcf.gz

# Check database status
vcf-pg-loader db status

# Open psql shell to query data
vcf-pg-loader db shell

With Your Own PostgreSQL

# Initialize database schema
vcf-pg-loader init-db --db postgresql://user:pass@localhost/variants

# Load a VCF file
vcf-pg-loader load sample.vcf.gz --db postgresql://user:pass@localhost/variants

# Validate a completed load
vcf-pg-loader validate <load-batch-id> --db postgresql://user:pass@localhost/variants

Additional Options

# Load without normalization
vcf-pg-loader load sample.vcf.gz --no-normalize

# Load non-human VCF (e.g., SARS-CoV-2)
vcf-pg-loader load sarscov2.vcf.gz --no-human-genome

# Initialize for non-human genomes
vcf-pg-loader init-db --db postgresql://... --no-human-genome

PRS Workflow Quick Start

# 1. Load imputed VCF with genotype dosages
vcf-pg-loader load imputed.vcf.gz --db postgresql://localhost/prs_db

# 2. Import GWAS summary statistics
vcf-pg-loader import-gwas gwas_sumstats.tsv \
    --study-id GCST90012345 \
    --trait "Type 2 Diabetes" \
    --db postgresql://localhost/prs_db

# 3. Load PGS Catalog weights
vcf-pg-loader import-pgs PGS000001_hmPOS_GRCh38.txt \
    --db postgresql://localhost/prs_db

# 4. Load HapMap3 reference panel
vcf-pg-loader load-reference hapmap3.tsv \
    --panel-name hapmap3 \
    --db postgresql://localhost/prs_db

# 5. Annotate variants with LD blocks
vcf-pg-loader annotate-ld-blocks \
    --population EUR \
    --db postgresql://localhost/prs_db

# 6. Compute sample QC metrics
vcf-pg-loader compute-sample-qc \
    --db postgresql://localhost/prs_db

# 7. Refresh materialized views for fast queries
vcf-pg-loader refresh-views --db postgresql://localhost/prs_db

# 8. Export to PRS-CS format
vcf-pg-loader export-prs-cs \
    --study-id 1 \
    --output gwas_prscs.txt \
    --hapmap3-only \
    --db postgresql://localhost/prs_db

CLI Commands

`load`

Load a VCF file into PostgreSQL.

vcf-pg-loader load <vcf_path> [OPTIONS]

Options:
  --db, -d                        PostgreSQL connection URL (omit for auto-managed DB)
  --batch, -b                     Records per batch [default: 50000]
  --workers, -w                   Parallel workers [default: 8]
  --normalize/--no-normalize      Normalize variants using vt algorithm [default: normalize]
  --drop-indexes/--keep-indexes   Drop indexes during load [default: drop-indexes]
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]
  --config, -c                    TOML configuration file
  --verbose, -v                   Enable verbose logging (DEBUG level)
  --quiet, -q                     Suppress non-error output
  --progress/--no-progress        Show progress bar [default: progress]
  --force, -f                     Force reload even if file was already loaded
  --hipaa-mode/--no-hipaa-mode    Enable/disable HIPAA compliance features [default: enabled]

When --db is omitted, vcf-pg-loader automatically starts and uses its managed PostgreSQL container in Docker.

Normalization: When enabled (default), variants are left-aligned and trimmed following the vt algorithm. This ensures consistent representation across different variant callers.
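
The left-align-and-trim procedure can be sketched as below for a single biallelic variant. This follows the published vt normalization steps but is not the project's implementation; `normalize` is a hypothetical name, and the reference is assumed to be available as an in-memory string for simplicity.

```python
from typing import Tuple


def normalize(pos: int, ref: str, alt: str, reference: str) -> Tuple[int, str, str]:
    """Left-align and trim one REF/ALT pair.

    pos is 1-based; reference is the chromosome sequence as a plain string.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return pos, ref, alt  # not a variant; nothing to normalize
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot left-extend past the start of the sequence")
            pos -= 1
            base = reference[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, on the reference `GGGCACACACAGGG`, a CA-repeat deletion reported as position 8 `CAC>C` shifts to position 3 `GCA>G`, so callers that place the deletion at different repeat units end up with the same stored record.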

Genome Type: Human genome mode uses a PostgreSQL enum for chromosomes (chr1-22, X, Y, M) which provides validation and efficient storage. Non-human mode uses TEXT to support arbitrary chromosome/contig names.

HIPAA Mode: By default, vcf-pg-loader runs with HIPAA compliance features enabled:

TLS required for database connections
Sample ID anonymization
VCF header sanitization to remove PHI

For local development without PHI, use --no-hipaa-mode to disable all compliance features:

vcf-pg-loader load sample.vcf.gz --no-hipaa-mode

`validate`

Validate a completed load by checking record counts and duplicates.

vcf-pg-loader validate <load_batch_id> [OPTIONS]

Options:
  --db, -d    PostgreSQL connection URL
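
Conceptually, the two checks named above (record counts and duplicates) amount to something like the following sketch. The real command runs against the database and may check more; `validate_batch`, `ValidationResult`, and the key layout are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), assumed key


@dataclass
class ValidationResult:
    expected: int
    loaded: int
    duplicates: List[VariantKey]

    @property
    def ok(self) -> bool:
        # A load passes when the counts agree and no key appears twice.
        return self.expected == self.loaded and not self.duplicates


def validate_batch(expected_count: int, loaded_keys: Iterable[VariantKey]) -> ValidationResult:
    """Compare the expected record count with what was loaded and list duplicates."""
    counts = Counter(loaded_keys)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return ValidationResult(expected_count, sum(counts.values()), duplicates)
```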

`init-db`

Initialize the database schema (tables, indexes, extensions).

vcf-pg-loader init-db [OPTIONS]

Options:
  --db, -d                          PostgreSQL connection URL
  --human-genome/--no-human-genome  Use human chromosome enum type [default: human-genome]

Important: The genome type must match between init-db and load commands. Use --no-human-genome for both when loading non-human VCFs.

`benchmark`

Run performance benchmarks on VCF parsing and loading.

vcf-pg-loader benchmark [OPTIONS]

Options:
  --vcf, -f        Path to VCF file (uses built-in fixture if omitted)
  --synthetic, -s  Generate synthetic VCF with N variants
  --db, -d         PostgreSQL URL (omit for parsing-only benchmark)
  --batch, -b      Batch size [default: 50000]
  --normalize/--no-normalize  Test with/without normalization
  --json           Output results as JSON (for CI integration)
  --quiet, -q      Minimal output

Examples:

# Quick benchmark with built-in fixture (~2.6K variants)
vcf-pg-loader benchmark

# Generate and benchmark 100K synthetic variants
vcf-pg-loader benchmark --synthetic 100000

# Benchmark a specific VCF file
vcf-pg-loader benchmark --vcf /path/to/sample.vcf.gz

# Full benchmark including database loading
vcf-pg-loader benchmark --synthetic 50000 --db postgresql://localhost/variants

# JSON output for CI/scripting
vcf-pg-loader benchmark --synthetic 10000 --json

Sample output:

Benchmark Results (synthetic)
  Variants: 100,000
  Batch size: 50,000
  Normalized: True

Parsing: 100,000 variants in 0.94s (106,000/sec)

`doctor`

Check system dependencies and diagnose issues.

vcf-pg-loader doctor

# Example output:
Dependency Check
  Python         3.12.4   OK
  cyvcf2         0.30.22  OK
  asyncpg        0.29.0   OK
  Docker         24.0.5   OK
  Docker daemon  running  OK

`db`

Manage the local PostgreSQL database (Docker-based).

vcf-pg-loader db start   # Start PostgreSQL container
vcf-pg-loader db stop    # Stop the container
vcf-pg-loader db status  # Show running status and connection URL
vcf-pg-loader db url     # Print connection URL (for scripts)
vcf-pg-loader db shell   # Open psql shell
vcf-pg-loader db reset   # Remove container and all data

Architecture

Components

VCFHeaderParser - Parses VCF headers via cyvcf2's native API to extract INFO/FORMAT field definitions
VCFStreamingParser - Memory-efficient streaming iterator that yields batches of VariantRecord objects
VariantParser - Handles per-variant parsing with Number=A/R/G field extraction for multi-allelic decomposition
VCFLoader - Orchestrates loading with asyncpg binary COPY protocol
SchemaManager - Manages PostgreSQL schema creation and index management

Data Flow

VCF File → VCFStreamingParser → Batch Buffer → asyncpg COPY → PostgreSQL
                ↓
         VCFHeaderParser (field metadata)
                ↓
         VariantParser (Number=A/R/G extraction)
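
The batch-buffer-to-COPY step in this flow could look roughly like the sketch below. `copy_records_to_table` is asyncpg's bulk-copy method; the table name, column names, and the `batched` and `copy_batches` helpers are assumptions made for illustration and do not reflect the project's schema or code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), hypothetical row shape


def batched(records: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `size` records without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


async def copy_batches(dsn: str, records: Iterable[Row], size: int = 50_000) -> int:
    """Stream records into PostgreSQL one COPY per batch; returns rows sent."""
    import asyncpg  # imported lazily so `batched` works without the driver

    conn = await asyncpg.connect(dsn)
    try:
        total = 0
        for batch in batched(records, size):
            await conn.copy_records_to_table(
                "variants",  # hypothetical table name
                records=batch,
                columns=["chrom", "pos", "ref", "alt"],  # hypothetical columns
            )
            total += len(batch)
        return total
    finally:
        await conn.close()
```

With a reachable database, this could be driven by `asyncio.run(copy_batches(dsn, rows))`. Bounding each batch keeps memory use flat regardless of VCF size.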

Citations and Acknowledgments

This project was inspired by and builds upon several foundational tools in the genomics community:

Primary References

Slivar - Rapid variant filtering:

Pedersen, B.S., Brown, J.M., Dashnow, H. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. npj Genom. Med. 6, 60 (2021). https://doi.org/10.1038/s41525-021-00227-3

GEMINI - Original SQL-based VCF database:

Paila, U., Chapman, B.A., Kirchner, R., & Quinlan, A.R. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153 (2013). https://doi.org/10.1371/journal.pcbi.1003153

cyvcf2 - Python VCF parsing:

Pedersen, B.S. & Quinlan, A.R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33(12), 1867–1869 (2017). https://doi.org/10.1093/bioinformatics/btx057

Supporting Tools

vcf2db: https://github.com/quinlan-lab/vcf2db
VCF Format: Danecek et al. (2011) https://doi.org/10.1093/bioinformatics/btr330
bcftools/HTSlib: Danecek et al. (2021) https://doi.org/10.1093/gigascience/giab008
GIAB Benchmarks: Zook et al. (2019) https://doi.org/10.1038/s41587-019-0074-6

Configuration

vcf-pg-loader supports TOML configuration files for persistent settings:

# vcf-pg-loader.toml
[vcf_pg_loader]
batch_size = 25000
workers = 16
normalize = true
drop_indexes = true
human_genome = true
log_level = "INFO"

Use with the --config flag:

vcf-pg-loader load sample.vcf.gz --config vcf-pg-loader.toml

CLI arguments override config file values.

Docker

Using Docker Compose (recommended for development)

# Start PostgreSQL and run a load
docker-compose up -d postgres
docker-compose run vcf-pg-loader load /data/sample.vcf.gz --db postgresql://vcfloader:vcfloader@postgres:5432/variants

# Or build and run standalone
docker build -t vcf-pg-loader .
docker run vcf-pg-loader --help

Docker Compose Services

postgres: PostgreSQL 16 with health checks
vcf-pg-loader: The loader application

Mount your VCF files to /data in the container.

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=vcf_pg_loader

# Run only unit tests (skip integration)
uv run pytest -m "not integration"

Code Quality

# Lint
uv run ruff check src tests

# Type check
uv run mypy src

Documentation

Getting Started

CLI Reference - Complete command-line documentation
PRS Workflows - End-to-end PRS analysis pipelines

Schema Reference

Schema Overview - Complete database schema with ER diagrams
PRS Tables - PGS scores and weights storage
GWAS Tables - Summary statistics (GWAS-SSF)
Reference Tables - HapMap3, LD blocks
Genotypes Tables - Individual-level data
QC Tables - Sample and variant QC metrics
Materialized Views - Pre-computed PRS query results

Background

Genomics Concepts - Understanding VCF data for non-geneticists
Glossary of Terms - Technical terminology reference
Architecture - Detailed system design and implementation

License

MIT - See LICENSE for details.
