Skip to content

Add Binette#11563

Open
prototaxites wants to merge 3 commits intonf-core:masterfrom
prototaxites:binette
Open

Add Binette#11563
prototaxites wants to merge 3 commits intonf-core:masterfrom
prototaxites:binette

Conversation

@prototaxites
Copy link
Copy Markdown
Contributor

Add binette module for metagenome bin refinement. Unfortunate test depends on the large CheckM2 database...

@ochkalova I only realised as I made this PR you have a branch for Binette - happy to collaborate/work on your branch if you prefer!

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

Copy link
Copy Markdown
Contributor

@ochkalova ochkalova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for doing this module! I never got to finishing it 😅

I left one request for an additional input arg.

Also, in my branch there is a 1.1 Mb test dataset that allows to make a test run without real CheckM2 DB. It still not super quick and requires a couple of minutes to finish, but I think it's better than downloading full CheckM2 DB... There is 2 tests for 2 options of input_type, I think we can only keep one of them.

The peculiarity here is to push this test data to nf-tests (that's why I never finished this module)

If you have capacity to it and update the test, would be great.

Comment on lines +10 to +12
input:
tuple val(meta) , path(contig2bin), path(contigs), path(proteins)
tuple val(meta2), path(checkm2_db)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binnete can process fasta input as well. I would add one more input to define if input is a tsv of folder with fasta files. I think it can be convenient.

Suggested change
input:
tuple val(meta) , path(contig2bin), path(contigs), path(proteins)
tuple val(meta2), path(checkm2_db)
input:
tuple val(meta) , path(input_binning, stageAs: "input_binning/*"), path(contigs), path(proteins)
val input_type
tuple val(meta2), path(checkm2_db)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add this tomorrow!

Comment on lines +28 to +37
"""
binette \\
--contig2bin_tables ${contig2bin} \\
--contigs ${contigs} \\
${proteins_input} \\
--checkm2_db ${checkm2_db} \\
--threads ${task.cpus} \\
--prefix ${prefix} \\
--outdir . \\
${args}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
binette \\
--contig2bin_tables ${contig2bin} \\
--contigs ${contigs} \\
${proteins_input} \\
--checkm2_db ${checkm2_db} \\
--threads ${task.cpus} \\
--prefix ${prefix} \\
--outdir . \\
${args}
def input_arg = ""
if (input_type == 'fasta') {
input_arg = "--bin_dirs"
} else if (input_type == 'tsv') {
input_arg = "--contig2bin_tables"
} else {
error "Invalid input_type: ${input_type}. Must be 'fasta' or 'tsv'"
}
"""
binette \\
${input_arg} input_binning/* \\
--contigs ${contigs} \\
${proteins_input} \\
--checkm2_db ${checkm2_db} \\
--threads ${task.cpus} \\
--prefix ${prefix} \\
--outdir . \\
${args}

--outdir . \\
${args}

find final_bins/ -maxdepth 1 -name "*.fa" -type f -exec gzip {} \\;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a nf-core standard to compress fasta? I don't know if gz files are convenient for a regular user 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a general nf-core recommendation to gzip files (https://nf-co.re/docs/specifications/components/modules/general#compression-of-input-and-output-files), but perhaps more importantly all the other nf-core binning modules also write gzipped fasta so its consistent for downstream use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants