fix: chain template alignments auth labelling (inference) #117

Open
jandom wants to merge 11 commits into main from
jandom/2026-02/fix/chain-template-alignments-auth-labelling

Conversation

@jandom
Collaborator

@jandom jandom commented Feb 7, 2026

Summary

Hopefully helps to solve #101

Changes

So far, I have written a test that reproduces the failure and then added a "fix".

Related Issues

Testing

Other Notes

- auth chain ids vs labelled chain ids
- add a test that confirms this
@jandom jandom requested a review from gnikolenyi February 7, 2026 15:44
@jandom jandom self-assigned this Feb 7, 2026
@@ -0,0 +1,62 @@
101 1b27_B 1.0 110 0 0 1 110 1 110 1.022e-45 160 110M
Collaborator Author

This needs to be generated at test time; I don't want to add files to the repo.

Contributor

I think it could be helpful to include an explicit colabfold_template.m8 sample for a few reasons.

  1. It allows the user to directly see the expected inputs to our template parser function
  2. Generating this file at test time would require a call to the colabfold msa server, since we don't use the colabfold binary. It is good practice to avoid dependencies on web services during tests to avoid latency and test breakages due to service unavailability.

If the main concern is adding an additional file, I would suggest adding the example content as a text block in the module using textwrap (example). This particular example could be made much shorter, e.g. we could include just 5 alignment lines, including the offending 1rnb_A line.
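
A minimal sketch of the inline-fixture idea, assuming the m8 columns are whitespace-separated and the second column holds `<entry>_<chain>`. The single alignment line is the one from this diff; `M8_SAMPLE` and `parse_template_ids` are hypothetical names, not the repo's parser:

```python
import textwrap

# Inline fixture: the alignment line below is copied from this PR's diff.
# Further rows (e.g. the offending 1rnb_A row) would be added the same way.
M8_SAMPLE = textwrap.dedent(
    """\
    101 1b27_B 1.0 110 0 0 1 110 1 110 1.022e-45 160 110M
    """
)


def parse_template_ids(m8_text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: extract (entry_id, chain_id) pairs from m8 text."""
    hits = []
    for line in m8_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        target = line.split()[1]  # assumed: second column is "<entry>_<chain>"
        entry_id, chain_id = target.rsplit("_", 1)
        hits.append((entry_id, chain_id))
    return hits
```

In a test, `M8_SAMPLE` could be written to pytest's `tmp_path` if the parser under test expects a file rather than a string.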

Collaborator

I agree with Jennifer, we already have some smaller files in the repo for this purpose under openfold3/tests/test_data. I think it should be fine to have this there.

Comment on lines +62 to +64
label_to_author = get_label_to_author_chain_id_dict(cif_file)
author_to_label = {v: k for k, v in label_to_author.items()}
label_chain_id = author_to_label[template.chain_id]
Collaborator Author

This is the re-mapping from "auth" chain IDs to "label" chain IDs... it optimistically assumes the inputs are not pathological

Collaborator

So if there are multiple label IDs mapping to the same author ID, this will always map an author ID to the last label ID encountered. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.

A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
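
A rough sketch of the sorted inversion, assuming `get_label_to_author_chain_id_dict` returns a plain `dict[str, str]`. The helper name `invert_label_to_author` is hypothetical, and keeping the first label in sorted order (instead of the last) is only one possible policy:

```python
def invert_label_to_author(label_to_author: dict[str, str]) -> dict[str, str]:
    """Invert label -> author into author -> label, independent of dict order.

    When several label IDs share one author ID, the first label ID in sorted
    order wins. Note that sorted() is lexicographic ("AA" sorts before "B"),
    which may differ from the order of chains in the mmCIF file.
    """
    author_to_label: dict[str, str] = {}
    for label_id in sorted(label_to_author):
        author_id = label_to_author[label_id]
        # setdefault keeps the entry already stored for this author ID
        author_to_label.setdefault(author_id, label_id)
    return author_to_label


# Toy mapping (made up): labels A and B both belong to author chain A.
toy = {"A": "A", "B": "A", "C": "B"}
naive = {v: k for k, v in toy.items()}  # last label wins: {"A": "B", "B": "C"}
stable = invert_label_to_author(toy)    # first sorted label wins
```

Whichever policy is chosen, writing it down in the docstring covers the "document this behavior" point.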

Contributor

@jnwei jnwei left a comment

Overall this is great! This was a hard bug to pin down.

A few drive-by comments on setting up the tests with colabfold web services.

template = templates[16]
assert template.chain_id == "A" and template.entry_id == "1rnb"

fetch(
Contributor

Can we mock this call instead of explicitly calling the RCSB database using fetch?

As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.
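
One way this could look, as a sketch only: `StructureClient`, `fetch_cif`, and `first_data_block_name` are hypothetical stand-ins for whatever code performs the download, and the CIF text is a stub, not a real entry. `unittest.mock.patch.object` replaces the network call for the duration of the test:

```python
from unittest.mock import patch


class StructureClient:
    """Hypothetical wrapper around the code that downloads a CIF file."""

    def fetch_cif(self, entry_id: str) -> str:
        # The real method would hit the network; unit tests must never get here.
        raise RuntimeError("network access is not allowed in unit tests")


def first_data_block_name(client: StructureClient, entry_id: str) -> str:
    """Return the name of the first data block ("data_XXXX") in the CIF text."""
    cif_text = client.fetch_cif(entry_id)
    return cif_text.splitlines()[0].removeprefix("data_")


def test_first_data_block_name_without_network() -> None:
    fake_cif = "data_1RNB\n#\n"  # stub content, not a real structure
    with patch.object(StructureClient, "fetch_cif", return_value=fake_cif) as mocked:
        assert first_data_block_name(StructureClient(), "1rnb") == "1RNB"
    # The mock records the call, so the test can also check the requested ID.
    mocked.assert_called_once_with("1rnb")
```

A checked-in CIF fixture achieves the same isolation; the mock is mainly useful when the download is buried inside the function under test.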

Collaborator Author

Switched to using a CIF file as a fixture.

Collaborator

@gnikolenyi gnikolenyi left a comment

Looks good so far. Added some comments on top of Jennifer's.

One thing I am missing is the actual mapping being done after the colabfold pipeline has pulled the templates. I see you have the primitive, but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unit tests. Could you please also add the remapping to the colabfold pipeline itself?

@jandom jandom marked this pull request as draft February 10, 2026 14:38
@jandom
Collaborator Author

jandom commented Feb 10, 2026

@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time.

Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures, but we don't have the tooling set up at the moment.

> Could you please also add the remapping to the colabfold pipeline itself?

This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping.

@jandom jandom requested review from gnikolenyi and jnwei February 11, 2026 18:00
@jandom jandom requested a review from ljarosch February 12, 2026 16:47
@jandom
Collaborator Author

jandom commented Feb 12, 2026

Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context

  • this only occurs at inference and only when using colabfold
  • at training and at manual inference, we provide the correctly formatted templates

Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs.
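
To make that risk concrete, here is a small sketch with a made-up mapping (not taken from any real entry). Both function names are hypothetical; the point is only that an ID which is valid in both namespaces gets silently rewritten when the code assumes author IDs:

```python
def remap_as_author_id(chain_id: str, author_to_label: dict[str, str]) -> str:
    """Treat chain_id as an author ID, which is what the code now assumes."""
    return author_to_label.get(chain_id, chain_id)


def is_ambiguous(chain_id: str, author_to_label: dict[str, str]) -> bool:
    """True if chain_id is also a valid label ID that remapping would change."""
    label_ids = set(author_to_label.values())
    return chain_id in label_ids and author_to_label.get(chain_id, chain_id) != chain_id


# Made-up mapping: author D -> label A, author A -> label C.
toy_author_to_label = {"D": "A", "A": "C"}

# A caller who correctly passes label ID "A" ends up with label "C" instead.
remapped = remap_as_author_id("A", toy_author_to_label)
```

A check like `is_ambiguous` could at least log a warning for such cases, although it cannot tell which namespace the caller meant.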

@gnikolenyi
Collaborator

@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly:

Submitting 5 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 750/750 [elapsed: 00:02 remaining: 00:00]
Downloading template CIFs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:04<00:00,  8.53it/s]
Remapped 4uyk: author A -> label A
Remapped 1914: author A -> label A
Remapped 1e8o: author A -> label A
Remapped 3jaj: author S1 -> label JC
Remapped 5aox: author D -> label D
Remapped 4ue5: author E -> label E
Remapped 1e8s: author A -> label A
Remapped 4uyk: author B -> label B
Remapped 7nfx: author t -> label WA
Remapped 5aox: author B -> label B
Remapped 1914: author A -> label A
Remapped 4uyj: author B -> label B
Remapped 5aox: author E -> label E
Remapped 1e8o: author D -> label D
Remapped 7obr: author t -> label WA
Remapped 1e8o: author B -> label B
Remapped 4ue5: author B -> label B
Remapped 2w9j: author B -> label B
Remapped 2w9j: author A -> label A
Remapped 4gnx: author A -> label A
Remapped 3kdf: author A -> label A
Remapped 7uy6: author F -> label E
Remapped 7lmb: author F -> label G
Remapped 6d6v: author F -> label C
Remapped 4gnx: author B -> label B
Remapped 2pqa: author C -> label C
Remapped 2pqa: author A -> label A
Remapped 1l1o: author E -> label E
Remapped 2z6k: author A -> label A
Remapped 1quq: author A -> label A
Remapped 2pi2: author C -> label C
Remapped 6i52: author B -> label B
Remapped 2z6k: author B -> label B
Remapped 1quq: author C -> label C
Remapped 3kdf: author D -> label B
Remapped 4joi: author B -> label B
Remapped 4joi: author A -> label A
Remapped 7u5c: author F -> label F
Remapped 6w6w: author C -> label D
Remapped 8d0k: author B -> label B
Remapped 8c5y: author B -> label B
Remapped 4gnx: author C -> label C
Remapped 1jmc: author A -> label B
Remapped 1fgu: author B -> label B
Remapped 6i52: author C -> label C
Remapped 1l1o: author C -> label C
Remapped 1l1o: author F -> label F
Remapped 1ynx: author A -> label A
Remapped 8aaj: author A -> label A
Remapped 8c5z: author A -> label A
Remapped 8oej: author A -> label A
Remapped 8oej: author D -> label D
Remapped 6d6v: author D -> label A
Remapped 8c5y: author J -> label J
Remapped 3u50: author C -> label A
Remapped 1o7i: author B -> label B
Remapped 7wcg: author A -> label A
Remapped 3dm3: author B -> label B
Remapped 2k5v: author A -> label A
Submitting 2 paired MSA queries to the Colabfold MSA server...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
Computing paired MSAs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.92s/it]

For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC

Note the following:

  • only the first 25 templates (at most) are parsed and remapped from auth to label asym IDs, to reduce the number of CIF files that need to be downloaded - we can expose this as an argument to the runner in a later PR
  • the cif files are deduplicated within an inference run and only the non-redundant set of cifs are downloaded, but across multiple runs, the cif files are re-downloaded
  • the cif files are not reused for the actual template processing, which would require a more significant refactor - both this and the re-download across inference runs are areas which we can optimize later
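
The cap and the within-run deduplication described above could be sketched roughly as follows, assuming template IDs have the `<entry>_<chain>` form seen in the m8 file. `select_entries_to_fetch` and `max_templates` are hypothetical names, not the ones used in the pipeline:

```python
def select_entries_to_fetch(template_ids: list[str], max_templates: int = 25) -> list[str]:
    """Return the unique entry IDs among the first `max_templates` template hits."""
    considered = template_ids[:max_templates]
    entry_ids = (template_id.rsplit("_", 1)[0] for template_id in considered)
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(entry_ids))


hits = ["4uyk_A", "1e8o_A", "4uyk_B", "1e8o_D"]
to_download = select_entries_to_fetch(hits)
```

Reusing downloads across runs would need an on-disk cache keyed by entry ID, which this sketch does not attempt.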

Contributor

@jnwei jnwei left a comment

My understanding of the latest update from @gnikolenyi is that there are two main changes:

  • We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author labeled ids. These cifs are stored temporarily
  • The cifs for the templates are downloaded again in the template pipeline for template feature processing.

No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.

@jandom
Collaborator Author

jandom commented Mar 13, 2026

This is a huge step in the right direction, but without extensive tests I'm quite wary. Will review ASAP.

@jandom jandom requested a review from jnwei March 18, 2026 15:49
@jandom jandom marked this pull request as ready for review March 18, 2026 15:50
@jandom jandom added the safe-to-test label Mar 18, 2026
@jandom jandom added and removed the safe-to-test label Mar 18, 2026
@jandom
Collaborator Author

jandom commented Mar 18, 2026

@gnikolenyi I have somewhat changed this code; the biggest change is skipping the CIF download and using the RCSB API to get the mapping instead (phew!).

Outstanding items

  • remove the original helper methods for the re-mapping from the inference pipeline (everything is now in the colabfold pipeline)
  • fix some of the mypy annotations, just because this code is so messy

@jandom jandom added and removed the safe-to-test label Mar 18, 2026
Collaborator

@gnikolenyi gnikolenyi left a comment

Great idea using the RCSB API instead of a CIF download. Can you add a mock for the TestFetchAuthorToLabelChainIds test instead of directly pinging the server? Generally, we don't want dependencies like this in unit tests (great lesson from @jnwei!). After that, it should be good to go on my end.

Labels

safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing.

3 participants