Bug report
Many users set the scratch true directive partly to minimize the size of the shared work directory, so that only the files needed for downstream tasks and for the resume mechanism are saved there.
When a process's output glob pattern also matches an input file, that input file is unnecessarily copied back into the shared work directory.
Steps to reproduce the problem
Given main.nf:
process GreedyOutputGlob {
    scratch true
    input: path(csv)
    output: path("*.csv")
    script: "cp $csv out.csv"
}

workflow {
    Channel.fromPath("data/in.csv")
        | GreedyOutputGlob
        | view
}
Note that the in.csv file is copied back to the shared work directory:
❯ nextflow run .
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [hopeful_church] DSL2 - revision: 06d2458686
executor > local (1)
[42/2fa08b] process > GreedyOutputGlob (1) [100%] 1 of 1 ✔
/private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed/out.csv
❯ tree work
work
└── 42
    └── 2fa08b2ef83cd1799c58833592deed
        ├── in.csv
        └── out.csv
3 directories, 2 files
This is because the nxf_unstage command uses the output glob pattern directly, without regard to the input files:
# ...
for name in $(eval "ls -1d *.csv" | sort | uniq); do
    nxf_fs_copy "$name" /private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed || true
done
# ...
Expected behaviour and actual behaviour
To spare users from storing duplicated input files, Nextflow should exclude input files from being copied back to the shared work directory, unless includeInputs: true is set on the matching output in the output: block. Currently, any input file that matches the output glob is copied back regardless.
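As a rough illustration of the requested behaviour, the unstage loop could skip any glob match whose name is also a staged input. This is only a sketch under assumptions, not Nextflow's actual implementation: the function name `unstage_outputs` and its argument layout are hypothetical, and plain `cp -fRL` stands in for `nxf_fs_copy`.

```shell
# Hypothetical sketch: copy files matching an output glob into the work
# directory, skipping any name that was staged as an input.
# Usage: unstage_outputs <target_dir> <glob_pattern> [input_name ...]
unstage_outputs() {
    target="$1"
    pattern="$2"
    shift 2   # remaining arguments are the staged input names to exclude

    # Same expansion as the generated loop shown earlier; like that loop,
    # it does not handle file names containing whitespace.
    for name in $(eval "ls -1d $pattern" 2>/dev/null | sort | uniq); do
        skip=0
        for input in "$@"; do
            if [ "$name" = "$input" ]; then
                skip=1
                break
            fi
        done
        if [ "$skip" -eq 1 ]; then
            continue   # matched an input file: do not copy it back
        fi
        # Stand-in for nxf_fs_copy (assumption).
        cp -fRL "$name" "$target" || true
    done
}
```

With the reproduction above, calling `unstage_outputs "$workdir" "*.csv" in.csv` from the scratch directory would copy only out.csv. An output declared with includeInputs: true would simply pass no input names to exclude.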
Environment
- Nextflow version: 23.04.1
- Java version: openjdk version "17.0.5" 2022-10-18
- Operating system: all
- Bash version: all