Inputs included as outputs are needlessly unstaged from the scratch directory #3995

@robsyme

Bug report

Many users set the scratch true directive in part to minimize the size of the shared work directory, so that only the files needed by downstream tasks and by the resume mechanism are saved there.

When a process's output glob pattern also matches an input file, that input file is unnecessarily copied back into the shared work directory.

Steps to reproduce the problem

Given main.nf:

process GreedyOutputGlob {
    scratch true
    input: path(csv)
    output: path("*.csv")
    script: "cp $csv out.csv"
}

workflow {
    Channel.fromPath("data/in.csv")
    | GreedyOutputGlob
    | view
}

Note that the in.csv file is copied back to the shared work directory:

❯ nextflow run .      
N E X T F L O W  ~  version 23.04.1
Launching `./main.nf` [hopeful_church] DSL2 - revision: 06d2458686
executor >  local (1)
[42/2fa08b] process > GreedyOutputGlob (1) [100%] 1 of 1 ✔
/private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed/out.csv


❯ tree work 
work
└── 42
    └── 2fa08b2ef83cd1799c58833592deed
        ├── in.csv
        └── out.csv

3 directories, 2 files
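
A quick way to confirm the duplication in other runs is to look for input names that exist as regular files in the task work directory. The helper below is only a sketch: `find_unstaged_inputs` is a hypothetical name, not part of Nextflow, and it assumes that a copied-back input appears as a regular file rather than a symlink.

```shell
# Hypothetical helper (not part of Nextflow): print the inputs that were
# copied into a task work directory as regular files.
# Usage: find_unstaged_inputs <task_work_dir> <input_name>...
find_unstaged_inputs() {
    local workdir="$1"
    shift
    local input
    for input in "$@"; do
        # Assumption: a regular, non-symlink file carrying an input's name
        # is a duplicate that was copied back from the scratch directory.
        if [[ -f "$workdir/$input" && ! -L "$workdir/$input" ]]; then
            echo "$input"
        fi
    done
}
```

For the run above, calling it with the task work directory and `in.csv` would be expected to print `in.csv`.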

This is because the nxf_unstage command uses the output glob pattern directly, without regard to the input files:

# ...
for name in $(eval "ls -1d *.csv" | sort | uniq); do
    nxf_fs_copy "$name" /private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed || true
done
# ...
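
One possible direction is to filter the glob matches against the names of the staged inputs before copying. The following is a minimal sketch of that idea, not the actual Nextflow implementation: `unstage_outputs` and its arguments are hypothetical, and plain `cp` stands in for `nxf_fs_copy`.

```shell
# Hypothetical sketch (not Nextflow's code): copy glob matches to the work
# directory, skipping any name that is a staged input.
# Usage: unstage_outputs <dest_dir> <glob_pattern> [input_name...]
unstage_outputs() {
    local dest="$1" pattern="$2"
    shift 2
    local inputs=("$@")   # names of staged input files to exclude
    local name input skip
    # Same listing approach as the generated snippet above.
    for name in $(eval "ls -1d $pattern" 2>/dev/null | sort | uniq); do
        skip=false
        for input in "${inputs[@]}"; do
            if [[ "$name" == "$input" ]]; then
                skip=true
                break
            fi
        done
        if [[ "$skip" == true ]]; then
            continue
        fi
        # Plain cp is used here as a stand-in for nxf_fs_copy.
        cp -fR "$name" "$dest" || true
    done
}
```

Run from the scratch directory as `unstage_outputs "$workdir" "*.csv" in.csv`, this would copy out.csv but leave in.csv behind. Like the original loop, it relies on word splitting and so does not handle file names containing whitespace.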

Expected behaviour and actual behaviour

Currently, any input file that matches the output glob is copied back to the shared work directory. To spare users from storing duplicated input files, it would be better if Nextflow excluded input files from this copy, unless includeInputs: true is set on the output in the output: block.

Environment

  • Nextflow version: 23.04.1
  • Java version: openjdk version "17.0.5" 2022-10-18
  • Operating system: all
  • Bash version: all
