fix: chain template alignments auth labelling (inference) by jandom · Pull Request #117 · aqlaboratory/openfold-3

jandom · 2026-02-07T15:44:23Z

Summary

Hopefully helps to solve #101

Changes

So far, I wrote a test that reproduces the failure and then added a "fix".

Related Issues

Testing

Other Notes

- auth chain ids vs labelled chain ids - add a test that confirms this

jandom · 2026-02-07T15:44:52Z

openfold3/tests/core/data/pipelines/preprocessing/colabfold_template.m8

@@ -0,0 +1,62 @@
+101	1b27_B	1.0	110	0	0	1	110	1	110	1.022e-45	160	110M


This needs to be generated at test time; I don't want to be adding files to the repo.

jandom · 2026-02-07T15:45:56Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        label_to_author = get_label_to_author_chain_id_dict(cif_file)
+        author_to_label = {v: k for k, v in label_to_author.items()}
+        label_chain_id = author_to_label[template.chain_id]


This is the remapping from "auth" chain IDs to "label" chain IDs. It is very wishful in assuming that the inputs are not pathological.

jnwei

Overall this is great! This was a hard bug to pin down.

A few drive-by comments regarding setting up the tests with colabfold web services.

jnwei · 2026-02-09T04:12:19Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        template = templates[16]
+        assert template.chain_id == "A" and template.entry_id == "1rnb"
+
+        fetch(


Can we mock this call instead of explicitly calling the RCSB database using fetch?

As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.

Switched to just a cif file as fixture

jnwei · 2026-02-09T04:22:55Z

openfold3/tests/core/data/pipelines/preprocessing/colabfold_template.m8

@@ -0,0 +1,62 @@
+101	1b27_B	1.0	110	0	0	1	110	1	110	1.022e-45	160	110M


I think it could be helpful to include an explicit colabfold_template.m8 sample, for a few reasons:

- It allows the user to directly see the expected inputs to our template parser function.
- Generating this file at test time would require a call to the colabfold msa server, since we don't use the colabfold binary. It is good practice to avoid dependencies on web services during tests, to avoid latency and test breakages due to service unavailability.

If the main concern is adding an additional file, I would suggest adding the example content as a text block in the module using textwrap (example). This particular example could be made much shorter, e.g. we could include just 5 alignment lines, including the offending 1rnb_A line.
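As a rough illustration of the inline-fixture idea, the sketch below embeds the single alignment row shown in this diff with textwrap.dedent and reads the template entry and chain from it. The helper name parse_m8_targets is hypothetical, and the assumption that the second tab-separated column has the form `<entry>_<chain>` is inferred from the sample row, not taken from the project's parser.

```python
import textwrap

# One row copied from the diff above; a real fixture would add a few more,
# including the 1rnb_A row mentioned in the review (not reproduced here).
SAMPLE_M8 = textwrap.dedent(
    """\
    101\t1b27_B\t1.0\t110\t0\t0\t1\t110\t1\t110\t1.022e-45\t160\t110M
    """
)


def parse_m8_targets(m8_text: str) -> list[tuple[str, str]]:
    """Return (entry_id, chain_id) pairs from tab-separated alignment rows.

    Hypothetical helper for illustration only.
    """
    targets = []
    for line in m8_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # Assumption: column 2 is "<entry>_<chain>", e.g. "1b27_B".
        entry_id, chain_id = fields[1].split("_", 1)
        targets.append((entry_id, chain_id))
    return targets


print(parse_m8_targets(SAMPLE_M8))
```

Keeping the rows in the test module makes the parser's expected input visible next to the assertions, at the cost of a slightly longer test file.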

gnikolenyi

Looks good so far. Added some comments on top of Jennifer's.

One thing I am missing is the actual mapping being done after the colabfold pipeline has pulled the templates. I see you have the primitive, but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unit tests. Could you please also add the remapping to the colabfold pipeline itself?

gnikolenyi · 2026-02-09T22:51:31Z

openfold3/tests/core/data/pipelines/preprocessing/colabfold_template.m8

@@ -0,0 +1,62 @@
+101	1b27_B	1.0	110	0	0	1	110	1	110	1.022e-45	160	110M


I agree with Jennifer, we already have some smaller files in the repo for this purpose under openfold3/tests/test_data. I think it should be fine to have this there.

gnikolenyi · 2026-02-09T22:56:23Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        label_to_author = get_label_to_author_chain_id_dict(cif_file)
+        author_to_label = {v: k for k, v in label_to_author.items()}
+        label_chain_id = author_to_label[template.chain_id]


So if there are multiple label IDs mapping to the same author ID, this will just always map an author ID to the last label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.

A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so that we are not relying on the dict ordering for this mapping.
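A minimal sketch of the sorting suggestion, assuming label_to_author is a plain dict of label chain ID to author chain ID. The function name invert_label_to_author is hypothetical, and keeping the first label in sorted order (rather than the last) is an arbitrary choice made here for illustration; either works as long as it is documented.

```python
def invert_label_to_author(label_to_author: dict[str, str]) -> dict[str, str]:
    """Invert label -> author into author -> label deterministically.

    When several label IDs share one author ID, the lexicographically
    smallest label ID wins, regardless of dict insertion order.
    """
    author_to_label: dict[str, str] = {}
    for label_id in sorted(label_to_author):
        author_id = label_to_author[label_id]
        # setdefault only stores the first label seen for each author ID.
        author_to_label.setdefault(author_id, label_id)
    return author_to_label


# Two label chains ("A" and "B") that share the author chain ID "X".
forward = {"B": "X", "A": "X"}
reverse = {"A": "X", "B": "X"}

# The plain comprehension depends on insertion order ...
print({v: k for k, v in forward.items()}, {v: k for k, v in reverse.items()})
# ... while the sorted version does not.
print(invert_label_to_author(forward), invert_label_to_author(reverse))
```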

jandom · 2026-02-10T14:40:17Z

@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time.

Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures but we don't have the tooling setup atm.

> Could you please also add the remapping to the colabfold pipeline itself?

This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping.

jandom · 2026-02-12T16:49:38Z

Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context:

- this only occurs at inference and only when using colabfold
- at training and at manual inference, we provide the correctly formatted templates

Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs.

gnikolenyi · 2026-03-13T04:32:44Z

@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly:

Submitting 5 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 750/750 [elapsed: 00:02 remaining: 00:00]
Downloading template CIFs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:04<00:00,  8.53it/s]
Remapped 4uyk: author A -> label A
Remapped 1914: author A -> label A
Remapped 1e8o: author A -> label A
Remapped 3jaj: author S1 -> label JC
Remapped 5aox: author D -> label D
Remapped 4ue5: author E -> label E
Remapped 1e8s: author A -> label A
Remapped 4uyk: author B -> label B
Remapped 7nfx: author t -> label WA
Remapped 5aox: author B -> label B
Remapped 1914: author A -> label A
Remapped 4uyj: author B -> label B
Remapped 5aox: author E -> label E
Remapped 1e8o: author D -> label D
Remapped 7obr: author t -> label WA
Remapped 1e8o: author B -> label B
Remapped 4ue5: author B -> label B
Remapped 2w9j: author B -> label B
Remapped 2w9j: author A -> label A
Remapped 4gnx: author A -> label A
Remapped 3kdf: author A -> label A
Remapped 7uy6: author F -> label E
Remapped 7lmb: author F -> label G
Remapped 6d6v: author F -> label C
Remapped 4gnx: author B -> label B
Remapped 2pqa: author C -> label C
Remapped 2pqa: author A -> label A
Remapped 1l1o: author E -> label E
Remapped 2z6k: author A -> label A
Remapped 1quq: author A -> label A
Remapped 2pi2: author C -> label C
Remapped 6i52: author B -> label B
Remapped 2z6k: author B -> label B
Remapped 1quq: author C -> label C
Remapped 3kdf: author D -> label B
Remapped 4joi: author B -> label B
Remapped 4joi: author A -> label A
Remapped 7u5c: author F -> label F
Remapped 6w6w: author C -> label D
Remapped 8d0k: author B -> label B
Remapped 8c5y: author B -> label B
Remapped 4gnx: author C -> label C
Remapped 1jmc: author A -> label B
Remapped 1fgu: author B -> label B
Remapped 6i52: author C -> label C
Remapped 1l1o: author C -> label C
Remapped 1l1o: author F -> label F
Remapped 1ynx: author A -> label A
Remapped 8aaj: author A -> label A
Remapped 8c5z: author A -> label A
Remapped 8oej: author A -> label A
Remapped 8oej: author D -> label D
Remapped 6d6v: author D -> label A
Remapped 8c5y: author J -> label J
Remapped 3u50: author C -> label A
Remapped 1o7i: author B -> label B
Remapped 7wcg: author A -> label A
Remapped 3dm3: author B -> label B
Remapped 2k5v: author A -> label A
Submitting 2 paired MSA queries to the Colabfold MSA server...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
Computing paired MSAs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.92s/it]

For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC
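Most lines in the log are identity remappings, so it can help to filter for the ones where the author and label IDs actually differ. The sketch below does that for a four-line excerpt copied from the log above; the regular expression is written against the line format seen in that log and the function name changed_remappings is hypothetical.

```python
import re

# Matches lines such as "Remapped 3jaj: author S1 -> label JC".
REMAP_LINE = re.compile(r"Remapped (\S+): author (\S+) -> label (\S+)")


def changed_remappings(log_text: str) -> list[tuple[str, str, str]]:
    """Return (entry_id, author_id, label_id) where the two IDs differ."""
    changed = []
    for match in REMAP_LINE.finditer(log_text):
        entry_id, author_id, label_id = match.groups()
        if author_id != label_id:
            changed.append((entry_id, author_id, label_id))
    return changed


LOG_EXCERPT = """\
Remapped 4uyk: author A -> label A
Remapped 3jaj: author S1 -> label JC
Remapped 5aox: author D -> label D
Remapped 7nfx: author t -> label WA
"""

print(changed_remappings(LOG_EXCERPT))
```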

Note the following:

- only up to the first 25 templates are parsed and remapped from auth to label asym ids, to reduce the number of cif files that need to be downloaded - we can expose this as an argument to the runner in a later PR
- the cif files are deduplicated within an inference run and only the non-redundant set of cifs is downloaded, but across multiple runs the cif files are re-downloaded
- the cif files are not reused for the actual template processing, which would require a more significant refactor - both this and the re-download across inference runs are areas which we can optimize later
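The first two notes can be sketched as follows, assuming template hits arrive as an ordered list of (entry_id, author_chain_id) pairs. The function name select_entries_to_download and the max_templates parameter are hypothetical; only the default of 25 comes from the note above.

```python
def select_entries_to_download(
    hits: list[tuple[str, str]], max_templates: int = 25
) -> tuple[list[tuple[str, str]], list[str]]:
    """Cap the template hits and list each entry ID to download only once.

    Returns the kept hits and the order-preserving, de-duplicated entry IDs.
    """
    kept = hits[:max_templates]
    # dict.fromkeys drops repeats while preserving first-seen order.
    unique_entries = list(dict.fromkeys(entry_id for entry_id, _ in kept))
    return kept, unique_entries


# Entry and chain IDs taken from the log above, in an arbitrary order.
hits = [("4uyk", "A"), ("5aox", "D"), ("4uyk", "B"), ("5aox", "B"), ("1e8o", "A")]

kept, to_download = select_entries_to_download(hits, max_templates=4)
print(kept)
print(to_download)
```

Here four hits are kept but only two structures need to be fetched, since two chains each come from 4uyk and 5aox.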

jnwei

My understanding of the latest update from @gnikolenyi is that there are two main changes:

- We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author labeled ids. These cifs are stored temporarily.
- The cifs for the templates are downloaded again in the template pipeline for template feature processing.

No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.

jandom · 2026-03-13T08:35:58Z

This is a huge step in the right direction, but without extensive tests I'm quite wary. Will review ASAP.

jandom · 2026-03-18T15:51:46Z

@gnikolenyi I have somewhat changed this code; the biggest change is skipping the CIF download and using the RCSB API to get the mapping instead (phew!).

Outstanding items

- remove the original helper methods for the re-mapping from the inference pipeline (everything is now in the colabfold)
- fix some of the mypy annotations, just because this code is so messy

gnikolenyi

Great idea using the RCSB API instead of a CIF download. Can you add a mock for the TestFetchAuthorToLabelChainIds test instead of directly pinging the server? Generally, we don't want dependencies like this in unit tests (great lesson from @jnwei!). After that it should be good to go on my end
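One way to take the server out of such a test is to patch the HTTP call with unittest.mock, as in the sketch below. Everything about the remote side is a placeholder: API_URL, the "chains" payload shape, and the function fetch_author_to_label_chain_ids are hypothetical stand-ins and do not describe the real RCSB API or the code in this PR. Only the 3jaj S1 -> JC pair is taken from the log earlier in this thread.

```python
import json
import urllib.request
from unittest import mock

# Hypothetical endpoint; the real API URL and response schema may differ.
API_URL = "https://example.invalid/chains/{entry_id}"


def fetch_author_to_label_chain_ids(entry_id: str) -> dict[str, str]:
    """Fetch an author -> label chain ID mapping (illustrative only)."""
    with urllib.request.urlopen(API_URL.format(entry_id=entry_id)) as response:
        payload = json.loads(response.read())
    return {c["auth_asym_id"]: c["label_asym_id"] for c in payload["chains"]}


def run_offline_check() -> dict[str, str]:
    """Call the fetcher with urlopen patched, so no request is sent."""
    payload = {"chains": [{"auth_asym_id": "S1", "label_asym_id": "JC"}]}
    fake_response = mock.MagicMock()
    # urlopen is used as a context manager, so configure __enter__'s result.
    fake_response.__enter__.return_value.read.return_value = json.dumps(
        payload
    ).encode()
    with mock.patch(
        "urllib.request.urlopen", return_value=fake_response
    ) as fake_urlopen:
        result = fetch_author_to_label_chain_ids("3jaj")
    # The fetcher should have requested exactly one URL for this entry.
    fake_urlopen.assert_called_once_with(API_URL.format(entry_id="3jaj"))
    return result


print(run_offline_check())
```

The same pattern applies if the code uses a different HTTP client: patch the function where it is looked up and return a canned payload.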

jandom added 2 commits February 7, 2026 15:21

fix #101: template chain alignment

561018c

- auth chain ids vs labelled chain ids - add a test that confims this

further tweak, and maybe working now

0787889

jandom requested a review from gnikolenyi February 7, 2026 15:44

jandom self-assigned this Feb 7, 2026

jandom commented Feb 7, 2026

jnwei requested changes Feb 9, 2026

gnikolenyi requested changes Feb 9, 2026

jandom marked this pull request as draft February 10, 2026 14:38

jandom added 3 commits February 10, 2026 15:17

rename templates

c3c14a5

run a linter

080d0d0

review: comments and improvements

e57d303

jandom requested review from gnikolenyi and jnwei February 11, 2026 18:00

simpler code, happier

865ec86

jandom requested a review from ljarosch February 12, 2026 16:47

Add remapping logic to the colabfold pipeline and remove from templat…

b29f1f6

…e pipeline.

jnwei reviewed Mar 13, 2026

refactor the PR slightly

a78196f

jandom requested a review from jnwei March 18, 2026 15:49

jandom marked this pull request as ready for review March 18, 2026 15:50

jandom added the safe-to-test label Mar 18, 2026

Merge branch 'public-main' into jandom/2026-02/fix/chain-template-ali…

be1fc4e

…gnments-auth-labelling

jandom added and removed the safe-to-test label Mar 18, 2026

jandom added 2 commits March 18, 2026 16:50

fix the test_colabfold_msa

6f7d487

fix: TEST_DIR location

126bcca

jandom added and removed the safe-to-test label Mar 18, 2026

gnikolenyi requested changes Mar 21, 2026
