Preserve Raw ColabFold MSA Output Files #35
qurat-ul-ain95 wants to merge 1 commit into aqlaboratory:main from
jnwei left a comment
First, thank you for the detailed investigations and suggested improvements to the ColabFold pipeline. We really value your contributions to making the ColabFold MSA pipeline more reusable for other applications.
Overall, the PR looks great with very thorough test examples and good documentation.
I have one general question: it looks like the goal of these changes is to save the alignments as individual a3m files, one per query, so that they can be reused for future OpenFold3 predictions or as alignments for other applications. Is this correct?
If that is the case, I believe the following runner.yml settings will save the alignments as a3m files, one per sequence, labeled by the rep_id, which is the hash of the query sequence.
msa_computation_settings:
  msa_file_format: a3m  # npz by default
  cleanup_msa_dir: false
  msa_output_directory: /path/to/msas
The code that handles processing of colabfold MSAs into a3m files can be found here. I think this has some similar functionality to the added _organize_raw_main_outputs_by_query?
I do like the refactoring in this PR and I think it makes sense to use these changes in some of the colabfold parsing functions. @gnikolenyi what do you think?
def test_parse_multiple_m_values(self, tmp_path):
    """Test parsing a3m file with multiple M values separated by null bytes."""
    # Real a3m files have null bytes (\x00) before new M value headers
    a3m_content = ">101\nSEQUENCE1\n>UniRef100_A0A123\nMATCH1\n\x00>102\nSEQUENCE2\n>UniRef100_B0B456\nMATCH2\n\x00>103\nSEQUENCE3\n>UniRef100_C0C789\nMATCH3\n"
nit: Consider adding textwrap.dedent here and in other test examples for easier readability
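One way this could look, as a minimal sketch rather than the PR's final code: the fixture is written as a triple-quoted string passed through textwrap.dedent, with the null-byte separators kept as explicit \x00 escapes. The variable name a3m_content mirrors the test above; the layout is only an illustration.

```python
import textwrap

# Same content as the one-line literal in the test, one record line per source line.
# The backslash after the opening quotes suppresses the leading newline.
a3m_content = textwrap.dedent(
    """\
    >101
    SEQUENCE1
    >UniRef100_A0A123
    MATCH1
    \x00>102
    SEQUENCE2
    >UniRef100_B0B456
    MATCH2
    \x00>103
    SEQUENCE3
    >UniRef100_C0C789
    MATCH3
    """
)
```

The \x00 escapes sit after the common indentation, so dedent strips only the leading spaces and leaves the null bytes directly in front of the next M header.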
Summary
These changes preserve raw ColabFold MSA output files (.a3m files) if the user disables cleanup_msa_dir. Previously, the raw directory containing batch MSA files was always deleted, regardless of the cleanup_msa_dir setting, preventing users from accessing the raw ColabFold MSA data for inspection or reuse.
The pipeline sends batched requests to ColabFold containing multiple sequences. The change suggested in this PR separates the resulting files by the unique sequence identifier rep_id and stores them in a new raw_colabfold_output folder, which persists between runs.
Changes
- _parse_a3m_file_by_m() function: utility function to parse batch A3M files and extract individual MSA sections by M value (ColabFold's internal sequence identifier). This function handles:
  - null bytes (\x00) that separate M sections in batch files
- _organize_raw_main_outputs_by_query() method: extracts individual MSA sections from batch A3M files and saves them in a persistent directory raw_colabfold_output:
  - raw_colabfold_output/{rep_id}/{filename}.a3m
  - uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m for each rep_id
- class TestParseA3mFileByM contains a few pytests to check _parse_a3m_file_by_m() functionality.
Closes #31
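The splitting and reorganizing described in the list above can be sketched roughly as follows. This is not the code in the PR: the function names split_a3m_by_m and write_sections_by_rep_id, and the m_to_rep_id mapping argument, are hypothetical. It assumes that sections are separated by null bytes and that the first header line of each section carries the M value.

```python
from pathlib import Path


def split_a3m_by_m(text: str) -> dict[str, str]:
    """Split a batch A3M string into one A3M string per M value.

    Hypothetical helper: assumes sections are separated by null bytes and
    that the first header line of each section (">101", ">102", ...) holds
    the M value. A repeated M value overwrites the earlier section.
    """
    sections: dict[str, str] = {}
    for chunk in text.split("\x00"):
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # skip empty pieces, e.g. after a trailing null byte
        header = chunk.split("\n", 1)[0]
        tokens = header[1:].split()
        if not header.startswith(">") or not tokens:
            raise ValueError(f"section does not start with an M header: {header!r}")
        sections[tokens[0]] = chunk + "\n"
    return sections


def write_sections_by_rep_id(batch_file, m_to_rep_id, out_root):
    """Write each section to out_root/{rep_id}/{batch file name}.

    m_to_rep_id is an assumed mapping from M value to rep_id, supplied by
    the caller; how the real pipeline tracks this mapping is not shown here.
    """
    batch_file = Path(batch_file)
    written = []
    for m_value, section in split_a3m_by_m(batch_file.read_text()).items():
        target = Path(out_root) / m_to_rep_id[m_value] / batch_file.name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(section)
        written.append(target)
    return written
```

Calling write_sections_by_rep_id once per batch file (for example uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m) with out_root set to a raw_colabfold_output directory would produce the raw_colabfold_output/{rep_id}/{filename}.a3m layout listed above.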