
fix(trimgalore): drop process label to process_low #11531

Merged

SPPearce merged 1 commit into nf-core:master from pinin4fjords:pinin4fjords/trimgalore-process-low on May 5, 2026

Conversation

@pinin4fjords
Member

@pinin4fjords pinin4fjords commented May 5, 2026

Summary

Drops the TRIMGALORE process label from process_high (12 cpus / 72 GB / 16 h) to process_low (2 cpus / 12 GB / 4 h, scaling with task.attempt).
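The change itself is a one-line label swap in the module definition (modules/nf-core/trimgalore/main.nf); roughly, with surrounding lines shown for context only:

```diff
 process TRIMGALORE {
     tag "$meta.id"
-    label 'process_high'
+    label 'process_low'
```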

Why

trim-galore 2.x is a Rust binary that streams reads, so memory stays flat with input size rather than scaling with read count. The process_high ceiling was inherited from the Perl-based 0.6.x era and is now massively over-provisioned, starving shared HPC schedulers for no benefit.

Empirical data (30M PE on nf-core/rnaseq)

| Metric        | Observed                  | process_low budget |
| ------------- | ------------------------- | ------------------ |
| peak_rss      | ~100 MB                   | 12 GB (~120×)      |
| realtime      | 1.25–2.0 min (median 1.5) | 4 h (~80×)         |
| cpus consumed | 1 worker thread           | 2 cpus (≥1 worker) |

The script already auto-derives --cores from task.cpus and caps the worker count at 8, so over-allocating cpus doesn't help anyway.
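For reference, the derivation in the module's script block looks roughly like this (a paraphrase from memory, not a verbatim copy of main.nf; the single-end offset and exact caps may differ slightly):

```groovy
// Sketch of the existing --cores derivation in the TRIMGALORE module
def cores = 1
if (task.cpus) {
    cores = (task.cpus as int) - 4                      // reserve cpus for pigz / validation helpers
    if (meta.single_end) cores = (task.cpus as int) - 3 // single-end needs one helper fewer
    if (cores < 1) cores = 1                            // always request at least one worker
    if (cores > 8) cores = 8                            // trim_galore stops scaling past 8 workers
}
```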

Why process_low and not process_single?

process_single (1 cpu / 6 GB / 4 h) is the only smaller standard bucket. For trim_galore's worker-thread math both yield 1 worker (since `cores = max(1, task.cpus - 4)` for paired-end data), so the trimming parallelism is identical. The runtime difference comes from the surrounding I/O pipeline:

  • A paired trim_galore invocation spawns ~6–8 helper processes (cutadapt for R1/R2, pigz reader/writer pairs, validator).
  • On 1 cpu they all contend for a single core; on 2 cpus the OS can overlap I/O with the worker thread, and that I/O accounts for most of the wall-time on real data. The ~1.5 min realtime above came from a 2-cpu run; dropping to 1 cpu would likely sacrifice some of that.
  • process_single also semantically signals "single-threaded by nature" (utilities, parsers, R scripts). trim_galore is genuinely multi-process even when only running one trimming worker, so process_low reads more honestly for "small resource ceiling, still parallel I/O".
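For context, here is roughly what the two candidate buckets look like in a pipeline's conf/base.config (a sketch; the exact wrapper, check_max vs. resourceLimits, depends on the nf-core template version):

```groovy
// Sketch of the standard nf-core resource buckets (conf/base.config)
process {
    withLabel:process_single {
        cpus   = 1
        memory = { 6.GB * task.attempt }
        time   = { 4.h  * task.attempt }
    }
    withLabel:process_low {
        cpus   = { 2     * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h   * task.attempt }
    }
}
```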

Users with bespoke needs (huge inputs, custom adapter detection, etc.) can still override resources at the pipeline level.
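For example, a site that routinely trims very large runs could restore a bigger allocation for just this process with a standard Nextflow process-selector override (hypothetical values):

```groovy
// custom.config -- site-level override for TRIMGALORE only (hypothetical values)
process {
    withName: 'TRIMGALORE' {
        cpus   = 12
        memory = 24.GB
        time   = 8.h
    }
}
```

Supplied at runtime with `nextflow run ... -c custom.config`, this takes precedence over the module label.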

What's not changing

  • Container, environment, output channels, args - all unchanged.
  • Module-level snapshots - unaffected; the label isn't captured in snap content.

Test plan

  • Module-level nf-test passes against the new label.
  • Downstream pipelines (rnaseq, methylseq, atacseq, ...) still complete on real data with the new ceiling.

trim-galore 2.x is a Rust binary that streams reads, so memory stays
flat with input size. Empirical 30M PE benchmark (rnaseq pipeline):
- peak_rss ~100 MB
- realtime ~1.5 min median, ~2 min max

The previous `process_high` label (12 cpus / 72 GB / 16 h) is
massively over-provisioned for the new implementation and starves
shared HPC schedulers. `process_low` (2 cpus / 12 GB / 4 h, scaling
with task.attempt) gives ~120x memory headroom and ~80x runtime
headroom over observed peaks at 30M PE, comfortably absorbing the
200M+ PE inputs that pipelines actually see in production.

The script's own `--cores` calculation derives worker count from
`task.cpus` and caps at 8, so allocating more than the label's
2 cpus (which yields 1 worker thread paired) gives diminishing
returns; users with bespoke needs can still override `cpus`
downstream.
@SPPearce SPPearce added this pull request to the merge queue May 5, 2026
Merged via the queue into nf-core:master with commit 7ced6ac May 5, 2026
105 of 111 checks passed
@pinin4fjords
Member Author

Thanks @SPPearce !

@pinin4fjords pinin4fjords deleted the pinin4fjords/trimgalore-process-low branch May 5, 2026 14:10
@FelixKrueger
Contributor

As a comment on this, the invocation was `--cores 8`, which uses:

  • ~12 CPUs
  • ~100 MB of RAM
  • and processes roughly 20M read pairs per minute

Please note that the threading model is N+4, so using `--cores 2` uses 6 cores, while `--cores 8` uses 12 cores.

Using `--cores 2` is still good time-wise, but the wall-clock time can be reduced by a further 75% by using 12 CPUs (`--cores 8`).

@SPPearce
Contributor

SPPearce commented May 5, 2026

> Please note that the threading model is N+4, so using `--cores 2` uses 6 cores, while `--cores 8` uses 12 cores.

Uh?
