Google Colaboratory notebooks for protein structure predictions for Discoba species

These Colaboratory notebooks are ideal for predicting structures of Trypanosoma and Leishmania species, along with other kinetoplastids, euglenids, diplonemids, etc.

To use a notebook, follow the link to the notebook code then click on the "Open in Colab" badge to run the notebook in Google Colab.

Get started

To carry out a protein structure prediction, open the DiscobaAlphaFold2HMMERv3 notebook (listed under Protein structure predictions below). Enter your protein sequence (query_sequence) and name (jobname), then press Ctrl + F9 or select Runtime > Run All. A prediction will probably take a couple of hours, depending on protein length.

You need to make sure that Google Colab is using GPU hardware acceleration. Go to Runtime > Change runtime type and make sure GPU is selected for Hardware accelerator. If you get an error you have probably run out of GPU memory: try predicting just part of your protein.
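If a full-length protein runs out of GPU memory, one way to predict just part of it is to cut the sequence down before pasting it into query_sequence. The snippet below is a minimal sketch of that step, not part of the notebooks: the helper name `subsequence` and the example sequence are hypothetical, and residue numbers are assumed to be 1-based and inclusive.

```python
def subsequence(query_sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence."""
    # Remove spaces and line breaks that often come along when pasting a sequence.
    seq = "".join(query_sequence.split()).upper()
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"range {start}-{end} is outside 1-{len(seq)}")
    # Convert the 1-based inclusive range to a Python slice.
    return seq[start - 1:end]


# Hypothetical example sequence, not a real Discoba protein.
full_length = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"
n_terminal_part = subsequence(full_length, 1, 10)
print(n_terminal_part)  # MSTNPKPQRK
```

The returned string can be pasted into query_sequence in place of the full-length sequence.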

Multiple sequence alignment

These notebooks generate a multiple sequence alignment as an a3m alignment file, suitable for use in other protein structure prediction pipelines.

DiscobaHMMER: For generating an a3m alignment file using a HMMER search of the Discoba protein database.

DiscobaMMSeqs2: For generating an a3m using an MMSeqs2 search of the Discoba protein database.

Note that these generate an alignment based only on Discoba sequences, not full eukaryotic diversity. You can upload the resulting a3m alignment file to, for example, the official ColabFold notebook to make a structure prediction.
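Before uploading an a3m file elsewhere it can be useful to check what it contains. The sketch below assumes the usual a3m conventions: FASTA-style records, the query as the first record, and lowercase letters marking insertions relative to the query. The function names `read_a3m` and `match_columns` and the example alignment are hypothetical and are not taken from the notebooks.

```python
def read_a3m(text):
    """Parse a3m text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' lines, which some tools prepend to a3m files.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def match_columns(sequence):
    """Drop lowercase insertions, keeping match columns and '-' gaps."""
    return "".join(c for c in sequence if not c.islower())


# Hypothetical three-sequence alignment for illustration only.
example = ">query\nMKTAY\n>hit1\nMK-AY\n>hit2\nMKgtTAY\n"
records = read_a3m(example)
print(len(records), "sequences")
for name, seq in records:
    print(name, match_columns(seq))
```

After insertions are removed, every record should have the same length as the query; a mismatch suggests the file is truncated or is not in a3m format.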

Discoba database generation

The transcriptomes folder contains the code necessary to fetch and build the predicted protein sequences of all Discoba species in the Discoba protein sequence database. Clone the repository, then run fetchAll.sh to build the entire database; this will take some time. The build requires curl, gzip, perl, python, nodejs, cutadapt, jellyfish, bowtie2, bwa and samtools. It fetches and builds Trinity, which requires autoconf, libbz2-dev and liblzma-dev, and CD-HIT, which should require no further dependencies. Other tools are either scripts or are fetched as precompiled packages.
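Because the build takes a long time, it may be worth confirming that the required command-line tools are installed before running fetchAll.sh. The following is a rough sketch using only the Python standard library. It assumes each tool is installed under the executable name listed above, which may not hold on every system (for example, nodejs is sometimes installed as node), and it does not check the Trinity build libraries.

```python
import shutil

# Tool names as listed in the text above; executable names are an assumption.
REQUIRED_TOOLS = [
    "curl", "gzip", "perl", "python", "nodejs",
    "cutadapt", "jellyfish", "bowtie2", "bwa", "samtools",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```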

Discoba database

The complete Discoba protein sequence database is available for download from Zenodo.org as discoba.fasta.gz, with version information in discobaStats.txt. A mirrored copy of discoba.fasta.gz is available via WheelerLab.net, which may be faster to download depending on where you are in the world.
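Once downloaded, the database can be read without decompressing it to disk. The sketch below counts sequences and residues in a gzip-compressed FASTA file. It assumes the file has already been saved locally as discoba.fasta.gz, and `fasta_stats` is a hypothetical helper name rather than anything provided by this repository.

```python
import gzip


def fasta_stats(path):
    """Return (number of sequences, total residues) for a gzipped FASTA file."""
    n_sequences = 0
    n_residues = 0
    # "rt" opens the compressed file in text mode so lines are read as str.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_sequences += 1
            else:
                n_residues += len(line)
    return n_sequences, n_residues
```

For example, `fasta_stats("discoba.fasta.gz")` gives counts that can be compared with the version information in discobaStats.txt, assuming that file reports comparable totals.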

Protein structure predictions

DiscobaAlphaFold2HMMERv3: For prediction of protein structures using the latest version of ColabFold modified to also include a HMMER search of the Discoba protein sequence database (recommended!).

Acknowledgments

This work builds on the excellent ColabFold and would not be possible without the work by Sergey Ovchinnikov (@sokrypton), Milot Mirdita (@milot_mirdita) and Martin Steinegger (@thesteinegger).

Referencing

If you use structure predictions made using this resource, please cite:

Wheeler RJ. "A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure" PLoS One, doi: 10.1371/journal.pone.0259871 (2021)

Mirdita M, Ovchinnikov S and Steinegger M. "ColabFold - Making protein folding accessible to all." bioRxiv, doi: 10.1101/2021.08.15.456425 (2021)

Please also cite the relevant AlphaFold and database references; see the ColabFold documentation for full information.

If you only use the multiple sequence alignments then please just cite Wheeler 2021.
