Have you checked the docs?
Description of the bug
We are using GATK4SPARK_APPLYBQSR in one of our pipelines with the intervals option, which means many GATK4SPARK_APPLYBQSR tasks run concurrently for a single sample. However, we noticed that the work directory was very large after a run and discovered that the input BAM was being copied from the scratch directory into the work directory alongside the output BAM. We believe the cause is the output declaration not being specific enough in its glob when the scratch directory is used (process.scratch = true).
This was reported in #3995 but was ultimately not fixed.
The current output declaration is:
tuple val(meta), path("*.bam") , emit: bam, optional: true
The input BAM will not be included in the output channel, but it is still copied to the task work directory. This is mentioned in the Nextflow docs; I have copied the relevant section below.
Although the input files matching a glob output declaration are not included in the resulting output channel, these files may still be transferred from the task scratch directory to the original task work directory. Therefore, to avoid unnecessary file copies, avoid using loose wildcards when defining output files, e.g. path '*'. Instead, use a prefix or a suffix to restrict the set of matching files to only the expected ones, e.g. path 'prefix_*.sorted.bam'.
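To illustrate the behaviour the docs describe, here is a minimal, hypothetical process (the process name and file names are made up for this sketch, not taken from the module):

```nextflow
// Hypothetical minimal example -- not the actual module code.
process LOOSE_GLOB_EXAMPLE {
    scratch true                  // task runs in a scratch directory

    input:
    tuple val(meta), path(bam)    // e.g. sample.bam is staged into scratch

    output:
    // '*.bam' also matches the staged input, so with scratch enabled both
    // sample.bam and sample.recal.bam are copied back to the work directory,
    // even though only sample.recal.bam enters the output channel.
    tuple val(meta), path("*.bam"), emit: bam

    script:
    """
    touch ${bam.baseName}.recal.bam
    """
}
```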
The output block should be changed to something like this in order to avoid this issue.
output:
tuple val(meta), path("${prefix}.bam") , emit: bam, optional: true
tuple val(meta), path("${prefix}.cram"), emit: cram, optional: true
path "versions.yml" , emit: versions
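For context, in nf-core modules the prefix is typically derived from task.ext.prefix in the script block, so the tightened output block would sit in the module roughly like this (a sketch based on the usual nf-core module layout, not the exact module source; the command line is abbreviated):

```nextflow
process GATK4SPARK_APPLYBQSR {
    output:
    tuple val(meta), path("${prefix}.bam") , emit: bam, optional: true
    tuple val(meta), path("${prefix}.cram"), emit: cram, optional: true
    path "versions.yml"                    , emit: versions

    script:
    // nf-core convention: prefix defaults to the sample id; assigned without
    // 'def' so it is visible to the output declarations above
    prefix = task.ext.prefix ?: "${meta.id}"
    """
    gatk ApplyBQSRSpark --input ${bam} --output ${prefix}.bam
    """
}
```

Because the glob now carries the prefix, only files the task itself produced match, and the staged input is no longer copied back from scratch.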
The same issue was brought up for another module in #3504 and fixed in the same way.
This issue may be present in other modules such as GATK4_ADDORREPLACEREADGROUPS. We should probably be more specific in our outputs rather than globbing everything with *.bam or *.cram, since with scratch enabled this inflates disk usage in the work directory. I'm going to make a PR for GATK4SPARK_APPLYBQSR for now, but be on the lookout for other modules that are like this.
Also, this issue may be present in sarek as well, causing the work directory to balloon on HPC systems that use scratch.
System information
nextflow version 24.10.5.5935
CentOS Linux release 7.9.2009 (Core)
HPC with LSF scheduler
Singularity 3.3.0