These Colaboratory notebooks are ideal for predicting structures of Trypanosoma and Leishmania species, along with other kinetoplastids, euglenids, diplonemids, etc.
To use a notebook, follow the link to the notebook code then click on the "Open in Colab" badge to run the notebook in Google Colab.
Go here to carry out a protein structure prediction.
Enter your protein sequence (query_sequence) and a name for the job (jobname), then press Ctrl + F9 or select Runtime > Run All. The prediction will probably take a couple of hours, depending on protein length.
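Checking the two inputs before starting a run of a couple of hours can save time. The following is a minimal sketch and not part of the notebook: `clean_inputs` is a hypothetical helper, and its rules (strip whitespace, convert to upper case, allow only the 20 standard amino acid letters, keep the job name file-system safe) are assumptions about what counts as sensible input.

```python
import re

# The 20 standard amino acid one-letter codes (assumed to be the accepted alphabet).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_inputs(query_sequence: str, jobname: str) -> tuple[str, str]:
    """Return a normalised (sequence, jobname) pair, or raise ValueError."""
    # Remove spaces and line breaks left over from pasting a FASTA body.
    sequence = re.sub(r"\s+", "", query_sequence).upper()
    if not sequence:
        raise ValueError("query_sequence is empty")

    unexpected = sorted(set(sequence) - STANDARD_AA)
    if unexpected:
        raise ValueError("unexpected characters in sequence: " + "".join(unexpected))

    # Replace runs of characters that are awkward in file names with "_".
    safe_name = re.sub(r"[^A-Za-z0-9_-]+", "_", jobname.strip()).strip("_")
    if not safe_name:
        raise ValueError("jobname is empty after cleaning")

    return sequence, safe_name
```

The returned values can then be pasted into the query_sequence and jobname fields.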
You need to make sure that Google Colab is using GPU hardware acceleration. Go to Runtime > Change runtime type and make sure GPU is selected for Hardware accelerator. If you get an error you have probably run out of GPU memory: try predicting just part of your protein.
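If a long protein runs out of GPU memory, one way to predict just part of it is to cut the sequence into overlapping fragments and submit each one as a separate job. This is a sketch under stated assumptions: `split_sequence` is a hypothetical helper, and the default window and overlap sizes are arbitrary illustrative values, not limits taken from the notebook.

```python
def split_sequence(sequence: str, window: int = 800, overlap: int = 100) -> list[tuple[int, str]]:
    """Split a sequence into overlapping fragments.

    Returns (start, fragment) pairs, where start is the 1-based position
    of the fragment's first residue in the full-length sequence.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if not sequence:
        return []

    step = window - overlap
    fragments = []
    start = 0
    while True:
        fragments.append((start + 1, sequence[start:start + window]))
        # Stop once the current fragment reaches the end of the sequence.
        if start + window >= len(sequence):
            break
        start += step
    return fragments
```

Overlapping fragments make it easier to compare the predictions where they meet. Cutting within a folded domain is likely to harm the prediction, so choose boundaries in predicted linker regions where possible.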
These notebooks generate a multiple sequence alignment as an a3m alignment file, suitable for use in other protein structure prediction pipelines.
DiscobaHMMER: For generating an a3m alignment file using a HMMER search of the Discoba protein database.
DiscobaMMSeqs2: For generating an a3m alignment file using an MMseqs2 search of the Discoba protein database.
Note that these generate an alignment based only on Discoba sequences, not full eukaryotic diversity. You can upload the resulting a3m alignment file to, for example, the official ColabFold notebook to make a structure prediction.
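Before uploading an a3m file elsewhere, it can be useful to check how many sequences it contains. The sketch below assumes the usual a3m conventions: FASTA-style records with the query first, and lower-case letters marking insertions relative to the query. `read_a3m` and `aligned_columns` are hypothetical helper names, and skipping lines that start with `#` is an assumption to tolerate comment lines that some tools add.

```python
def read_a3m(text: str) -> list[tuple[str, str]]:
    """Parse a3m text into (header, sequence) pairs, query first."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_columns(sequence: str) -> str:
    """Drop lower-case insertion states, leaving one character per query column."""
    return "".join(c for c in sequence if not c.islower())
```

After removing insertions, every record should have the same length as the query; a mismatch suggests a truncated or damaged file.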
The transcriptomes folder contains the code necessary to fetch and build the predicted protein sequences of all Discoba species in the Discoba protein sequence database. Clone the repository, then run fetchAll.sh to build the entire database. This will take some time. The build requires curl, gzip, perl, python, nodejs, cutadapt, jellyfish, bowtie2, bwa and samtools. It fetches and builds trinity, which requires autoconf, libbz2-dev and liblzma-dev, and cdhit, which should require no further dependencies. Other tools are either scripts or fetched as precompiled packages.
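Because the build takes a long time, it may be worth confirming that the required command-line tools are on the PATH before running fetchAll.sh. This sketch is not part of the repository: `missing_tools` is a hypothetical helper, and the executable names are assumptions taken from the dependency list (on some systems python is installed as `python3` and nodejs as `node`). It does not check tool versions or the libbz2-dev and liblzma-dev libraries.

```python
import shutil

# Executable names assumed from the dependency list; adjust for your system.
REQUIRED_TOOLS = [
    "curl", "gzip", "perl", "python", "nodejs", "cutadapt",
    "jellyfish", "bowtie2", "bwa", "samtools", "autoconf",
]


def missing_tools(tools, lookup=shutil.which):
    """Return the tools for which lookup() finds no executable on the PATH."""
    return [tool for tool in tools if lookup(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found.")
```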
The complete Discoba protein sequence database is available for download from Zenodo.org: discoba.fasta.gz and version information in discobaStats.txt. A mirrored copy is available via WheelerLab.net, which may be faster to download depending on where you are in the world: discoba.fasta.gz.
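After downloading, a quick way to confirm that the file is intact is to count the records in it and compare the result with the version information. The sketch below assumes discoba.fasta.gz is a gzip-compressed FASTA file in the working directory; `fasta_summary` is a hypothetical helper, and the layout of discobaStats.txt is not assumed here, so the comparison is left to the reader.

```python
import gzip


def fasta_summary(path):
    """Return (number of sequences, total residues) for a gzipped FASTA file."""
    n_sequences = 0
    n_residues = 0
    # Read line by line so the whole database never has to fit in memory.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                n_sequences += 1
            else:
                n_residues += len(line.strip())
    return n_sequences, n_residues


if __name__ == "__main__":
    sequences, residues = fasta_summary("discoba.fasta.gz")
    print(f"{sequences} sequences, {residues} residues")
```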
DiscobaAlphaFold2HMMERv3: For prediction of protein structures using the latest version of ColabFold modified to also include a HMMER search of the Discoba protein sequence database (recommended!).
This work builds on the excellent ColabFold and would not be possible without the work by Sergey Ovchinnikov (@sokrypton), Milot Mirdita (@milot_mirdita) and Martin Steinegger (@thesteinegger).
If you use structure predictions made using this resource, please cite:
Wheeler RJ. "A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure" PLoS One, doi: 10.1371/journal.pone.0259871 (2021)
Mirdita M, Ovchinnikov S and Steinegger M. "ColabFold - Making protein folding accessible to all." bioRxiv, doi: 10.1101/2021.08.15.456425 (2021)
Please also cite the relevant AlphaFold and database references; see the ColabFold documentation for full information.
If you only use the multiple sequence alignments then please just cite Wheeler 2021.