fix: chain template alignments auth labelling (inference)#117
- auth chain ids vs labelled chain ids - add a test that confirms this
| @@ -0,0 +1,62 @@ | |||
| 101 1b27_B 1.0 110 0 0 1 110 1 110 1.022e-45 160 110M | |||
There was a problem hiding this comment.
This needs to be generated at test-time, I don't want to be adding files into the repo
There was a problem hiding this comment.
I think it would be helpful to include an explicit colabfold_template.m8 sample for a few reasons.
- It allows the user to directly see the expected inputs to our template parser function
- Generating this file at test time would require a call to the colabfold msa server, since we don't use the colabfold binary. It is good practice to avoid dependencies on web services during tests to avoid latency and test breakages due to service unavailability.
If the main concern is adding an additional file, I would suggest adding the example content as a text block in the module using textwrap (example). This particular example could be made much shorter, e.g. we could include just 5 alignment lines, including the offending 1rnb_A line.
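The inline-text option could look roughly like the sketch below. It embeds the one alignment row visible in this diff with `textwrap.dedent` and writes it to a temporary directory. `parse_m8_targets` and `write_m8_fixture` are hypothetical names used only for illustration, not the project's actual parser, and the row is split on whitespace as displayed here (the real file may be tab-separated).

```python
import tempfile
import textwrap
from pathlib import Path

# One alignment row copied from the diff above. A real fixture would add a
# few more rows, including the offending 1rnb_A line.
M8_SAMPLE = textwrap.dedent(
    """\
    101 1b27_B 1.0 110 0 0 1 110 1 110 1.022e-45 160 110M
    """
)


def write_m8_fixture(directory):
    """Write the inline sample to <directory>/colabfold_template.m8."""
    path = Path(directory) / "colabfold_template.m8"
    path.write_text(M8_SAMPLE)
    return path


def parse_m8_targets(m8_path):
    """Hypothetical helper: return (entry_id, chain_id) for each row."""
    targets = []
    for line in Path(m8_path).read_text().splitlines():
        if not line.strip():
            continue
        target = line.split()[1]  # second column, e.g. "1b27_B"
        entry_id, chain_id = target.split("_", 1)
        targets.append((entry_id, chain_id))
    return targets


with tempfile.TemporaryDirectory() as tmp_dir:
    targets = parse_m8_targets(write_m8_fixture(tmp_dir))
```

In a pytest test, the built-in `tmp_path` fixture could replace the explicit `tempfile` call.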
There was a problem hiding this comment.
I agree with Jennifer, we already have some smaller files in the repo for this purpose under openfold3/tests/test_data. I think it should be fine to have this there.
| label_to_author = get_label_to_author_chain_id_dict(cif_file) | ||
| author_to_label = {v: k for k, v in label_to_author.items()} | ||
| label_chain_id = author_to_label[template.chain_id] |
There was a problem hiding this comment.
This is the re-mapping from "auth" chains IDs to "label" chain IDs... very wishful in terms of inputs not being pathological
There was a problem hiding this comment.
So if there are multiple label IDs mapping to the same author ID, this will just always map an author ID to the last label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.
A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
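A minimal sketch of that sorted inversion, assuming the mapping is a plain `dict[str, str]`. `invert_label_to_author` is a hypothetical name. Note one deliberate difference from the dict comprehension in the diff: with `setdefault`, the alphabetically first label ID wins for a shared author ID instead of the last one in iteration order.

```python
def invert_label_to_author(label_to_author):
    """Invert label -> author into author -> label deterministically.

    When several label IDs share one author ID, the alphabetically first
    label ID is kept. This is an arbitrary but reproducible choice.
    """
    author_to_label = {}
    for label_id in sorted(label_to_author):
        author_id = label_to_author[label_id]
        # setdefault keeps the first label seen for this author ID
        author_to_label.setdefault(author_id, label_id)
    return author_to_label


# Insertion order is deliberately scrambled to show it no longer matters.
example = invert_label_to_author({"B": "A", "A": "A", "C": "B"})
```

Whichever label is chosen to win, sorting first makes the outcome independent of how the upstream dict happened to be built.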
jnwei
left a comment
There was a problem hiding this comment.
Overall this is great! This was a hard bug to pin down.
A few drive by comments regarding setting up the tests with colabfold web services.
| template = templates[16] | ||
| assert template.chain_id == "A" and template.entry_id == "1rnb" | ||
|
|
||
| fetch( |
There was a problem hiding this comment.
Can we mock this call instead of explicitly calling the RCSB database using fetch?
As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.
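For illustration, mocking the download with `unittest.mock` might look like this. `TemplateStore` and its methods are hypothetical stand-ins for whatever wraps the real `fetch` call, and the fake CIF text is a two-line placeholder rather than a real entry.

```python
from unittest import mock


class TemplateStore:
    """Hypothetical wrapper around the code that downloads template CIFs."""

    def fetch(self, pdb_id):
        # The real implementation would hit the network; fail loudly if a
        # unit test ever reaches it.
        raise RuntimeError("network access attempted in a unit test")

    def data_block_name(self, pdb_id):
        """Return the name from the leading 'data_XXXX' line of the CIF."""
        first_line = self.fetch(pdb_id).splitlines()[0]
        return first_line.removeprefix("data_")


def test_data_block_name_without_network():
    fake_cif = "data_1RNB\n#\n"
    with mock.patch.object(
        TemplateStore, "fetch", return_value=fake_cif
    ) as fake_fetch:
        assert TemplateStore().data_block_name("1rnb") == "1RNB"
    # The patched method was called once, with the requested ID.
    fake_fetch.assert_called_once_with("1rnb")
```

The same pattern works with `mock.patch("package.module.fetch")` when the call is a module-level function; the target must be the name as imported in the module under test.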
There was a problem hiding this comment.
Switched to just a cif file as fixture
gnikolenyi
left a comment
There was a problem hiding this comment.
Looks good so far. Added some comments on top of Jennifer's.
One thing I am missing is the actual mapping being done after the colabfold pipeline pulled the templates. I see you have the primitive but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unittests. Could you please also add the remapping to the colabfold pipeline itself?
|
@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time. Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures but we don't have the tooling setup atm.
This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping. |
|
Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context
Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs. |
|
@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly: For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC Note the following:
|
jnwei
left a comment
There was a problem hiding this comment.
My understanding of the latest update from @gnikolenyi is that there are two main changes:
- We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author labeled ids. These cifs are stored temporarily
- The cifs for the templates are downloaded again in the template pipeline for template feature processing.
No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.
|
This is a huge step in the right direction but without extensive tests, I'm quite wary. Will review ASAP. |
|
@gnikolenyi I have somewhat changed this code, the biggest change is skipping the CIF download and using the RCSB API to get the mapping instead (phew!). Outstanding items
|
gnikolenyi
left a comment
There was a problem hiding this comment.
Great idea using the RCSB API instead of a CIF download. Can you add a mock for the TestFetchAuthorToLabelChainIds test instead of directly pinging the server? Generally, we don't want dependencies like this in unit tests (great lesson from @jnwei!). After that it should be good to go on my end
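A rough sketch of such a mock, assuming the lookup goes through `urllib.request.urlopen`. Everything here is a placeholder: the URL is not the real RCSB endpoint, the JSON shape is invented, and `fetch_author_to_label_chain_ids` is a guessed name. Only the 3jaj S1 -> JC pair comes from the logs discussed in this thread.

```python
import io
import json
import urllib.request
from unittest import mock

# Placeholder endpoint, NOT the real RCSB API.
CHAIN_MAP_URL = "https://example.invalid/chain-map/{pdb_id}"


def fetch_author_to_label_chain_ids(pdb_id):
    """Hypothetical lookup returning {author_chain_id: label_chain_id}."""
    url = CHAIN_MAP_URL.format(pdb_id=pdb_id)
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # Invented payload shape: {"chains": [{"author": ..., "label": ...}]}
    return {row["author"]: row["label"] for row in payload["chains"]}


def test_fetch_author_to_label_chain_ids_offline():
    body = json.dumps({"chains": [{"author": "S1", "label": "JC"}]}).encode()
    # io.BytesIO is file-like and a context manager, so it can stand in
    # for the HTTP response object.
    with mock.patch(
        "urllib.request.urlopen", return_value=io.BytesIO(body)
    ) as fake_urlopen:
        assert fetch_author_to_label_chain_ids("3jaj") == {"S1": "JC"}
    fake_urlopen.assert_called_once_with(
        "https://example.invalid/chain-map/3jaj"
    )
```

If the real code uses a different HTTP client, the patch target changes but the structure of the test stays the same.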
Summary
Hopefully helps to solve #101
Changes
So far wrote a test that reproduced the failure, and then added a "fix"
Related Issues
Testing
Other Notes