Have you checked the docs?
Description of the bug
We are using GATK4SPARK_APPLYBQSR in one of our pipelines with the intervals option, which means many GATK4SPARK_APPLYBQSR tasks run concurrently for a single sample. However, we noticed that the work directory was very large after a run and discovered that the input BAM was being copied from the scratch directory into the work directory alongside the output BAM. We believe the cause is the output declaration not being specific enough in its glob when the scratch directory is used (process.scratch = true).
This was reported in #3995 but was ultimately not fixed.
The current output declaration is:
tuple val(meta), path("*.bam") , emit: bam, optional: true
The input BAM will not be included in the output channel, but it is still copied to the task work directory. This is mentioned in the Nextflow docs; I have copied the relevant section below.
Although the input files matching a glob output declaration are not included in the resulting output channel, these files may still be transferred from the task scratch directory to the original task work directory. Therefore, to avoid unnecessary file copies, avoid using loose wildcards when defining output files, e.g. path '*'. Instead, use a prefix or a suffix to restrict the set of matching files to only the expected ones, e.g. path 'prefix_*.sorted.bam'.
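To illustrate the behaviour the docs describe, here is a minimal, hypothetical process (the process name and file names are made up for this sketch, not taken from the module):

```nextflow
// Hypothetical minimal example -- not the actual module code.
process LOOSE_GLOB_EXAMPLE {
    scratch true                  // task runs in a scratch directory

    input:
    tuple val(meta), path(bam)    // e.g. sample.bam is staged into scratch

    output:
    // '*.bam' also matches the staged input, so with scratch enabled both
    // sample.bam and sample.recal.bam are copied back to the work directory,
    // even though only sample.recal.bam enters the output channel.
    tuple val(meta), path("*.bam"), emit: bam

    script:
    """
    touch ${bam.baseName}.recal.bam
    """
}
```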
The output block should be changed to something like this in order to avoid this issue.
output:
tuple val(meta), path("${prefix}.bam") , emit: bam, optional: true
tuple val(meta), path("${prefix}.cram"), emit: cram, optional: true
path "versions.yml" , emit: versions
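For context, in nf-core modules the prefix is typically derived from task.ext.prefix in the script block, so the tightened output block would sit in the module roughly like this (a sketch based on the usual nf-core module layout, not the exact module source; the command line is abbreviated):

```nextflow
process GATK4SPARK_APPLYBQSR {
    output:
    tuple val(meta), path("${prefix}.bam") , emit: bam, optional: true
    tuple val(meta), path("${prefix}.cram"), emit: cram, optional: true
    path "versions.yml"                    , emit: versions

    script:
    // nf-core convention: prefix defaults to the sample id; assigned without
    // 'def' so it is visible to the output declarations above
    prefix = task.ext.prefix ?: "${meta.id}"
    """
    gatk ApplyBQSRSpark --input ${bam} --output ${prefix}.bam
    """
}
```

Because the glob now carries the prefix, only files the task itself produced match, and the staged input is no longer copied back from scratch.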
The same issue was brought up for another module in #3504 and fixed in the same way.
This issue may be present in other modules such as GATK4_ADDORREPLACEREADGROUPS. We should probably be more specific in our outputs rather than globbing everything with *.bam or *.cram, since with scratch enabled this inflates disk usage in the work directory. I'm going to make a PR for GATK4SPARK_APPLYBQSR for now, but be on the lookout for other modules that are like this.
Also, this issue may be present in sarek as well, causing the work directory to balloon on HPC systems that use scratch.
System information
nextflow version 24.10.5.5935
CentOS Linux release 7.9.2009 (Core)
HPC with LSF scheduler
Singularity 3.3.0