
feat(inference): Add support for custom residue numbering (resolves #58) #69

Open

chodec wants to merge 1 commit into aqlaboratory:main from chodec:feature/residue-start-numbering

Conversation

Contributor

@chodec chodec commented Dec 6, 2025

This pull request implements support for custom residue numbering in the inference output, resolving issue #58.

Summary of Changes

The primary goal was to allow users to define specific residue numbering in the input JSON, rather than relying on the default numbering starting from 1. This includes support for non-sequential numbers and PDB-style insertion codes (e.g., '103A').

The implementation required coordinated changes across three key modules:

  1. Input Definition (inference_query_format.py):
    • Added two optional fields to the Chain class: starting_residue_number (a simple offset) and residue_ids (an explicit list of string IDs).
  2. Data Processing (inference.py):
    • Implemented logic that ensures residue_ids takes precedence over starting_residue_number. If a valid explicit list is provided, it is used; otherwise, a sequential list is generated based on the start number. The final list is stored in the data batch.
  3. Output Writing (Post-processing in writer.py):
    • Implemented the static method OF3OutputWriter._renumber_atom_array. This method executes after model inference but before writing the PDB/mmCIF file.
    • It uses regular expressions (re) to safely parse string IDs (e.g., separating '103A' into the integer ID 103 and the insertion code 'A').
    • The new IDs are applied directly to the Biotite AtomArray's res_id and ins_code annotations. This ensures the output structure reflects the desired numbering without affecting core model calculations.
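The insertion-code parsing described in step 3 could be sketched as follows. This is a minimal sketch, assuming the behavior described above; `parse_residue_id` is an illustrative helper name, not the PR's actual function:

```python
import re

def parse_residue_id(res_id: str) -> tuple[int, str]:
    """Split a PDB-style residue ID such as '103A' into its integer part
    and optional one-letter insertion code.

    Hypothetical helper mirroring the parsing described above; the name
    and signature are illustrative, not the PR's actual API.
    """
    match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", res_id)
    if match is None:
        raise ValueError(f"Invalid residue ID: {res_id!r}")
    return int(match.group(1)), match.group(2)
```

With this scheme, `'103A'` yields the integer ID 103 and insertion code `'A'`, while a plain `'42'` yields 42 with an empty insertion code.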

Related Issues

Resolves: #58


Testing and Validation

Note on testing: due to local environment configuration issues (missing model checkpoints), it was not possible to perform an end-to-end test run.

However, the logic has been manually validated to ensure:

  • Priority and Consistency: The implementation correctly prioritizes residue_ids and handles sequence-length mismatches by defaulting to standard numbering (1, 2, 3, ...).
  • Parsing Robustness: The regex parsing logic in writer.py correctly extracts insertion codes, which is critical for PDB compliance.
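The precedence rules above could be sketched roughly like this (illustrative function and parameter names, not the actual implementation):

```python
def resolve_residue_ids(sequence_length, residue_ids=None, starting_residue_number=None):
    """Sketch of the precedence rules described above (illustrative names).

    An explicit residue_ids list of the right length wins; a length
    mismatch falls back to default 1-based numbering; otherwise a
    sequential list is generated from the start number (default 1).
    """
    if residue_ids is not None:
        if len(residue_ids) == sequence_length:
            return [str(r) for r in residue_ids]
        # Mismatched list: fall back to default numbering
        return [str(i + 1) for i in range(sequence_length)]
    start = int(starting_residue_number) if starting_residue_number is not None else 1
    return [str(start + i) for i in range(sequence_length)]
```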

Contributor

@jnwei jnwei left a comment

Hi @chodec,

Thank you for working on this PR. Custom numbering is a tricky feature to get right, but as #58 indicates, it would be very useful for researchers.

I took a first pass over this PR today, and I have a few suggestions / questions:

  • First, a check of my understanding: the current implementation adds a new method for creating a custom residue ID list (get_custom_residue_ids) to the InferenceDataset class. It looks like the intention is for the new residue IDs to be read by the output writer, but I don't see where the new residue IDs are added to the batch?

  • I would recommend that the custom residue_id list be created upon construction of the Chain class in inference_query_format.py rather than being generated in the InferenceDataset. This way, the logic around parsing the residue ids can be kept in one place, rather than adding extra logic to the InferenceDataset.

    • For an example of how to use pydantic validators to create the residue_id list given an input that is either a full list or an int, you might be able to borrow the logic used in InferenceExperimentSettings to generate random seeds from a list or an initial integer seed here
    • The InferenceDataset can then be used to create batch features of the custom residue list if it is provided in a chain.
  • Could you please add unit tests for the creation of the custom residue numbering? I think it could be helpful to have two tests:
    • One test for generating the optional residue_id list in the Chain class, perhaps added here
    • One test for writing the outputs, which could be added here

  • I would guess that some of the examples you used for manual validation of the numbering might be suitable test cases.
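The validator-based approach suggested above could look roughly like this. This is a sketch assuming pydantic v2; the trimmed-down Chain model and its validator name are illustrative, and only the field names follow the PR description:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class Chain(BaseModel):
    """Hypothetical, trimmed-down Chain model sketching the suggestion above:
    generate residue_ids at validation time from either an explicit list or
    a starting number. Field names follow the PR description; everything
    else is illustrative.
    """

    sequence: str
    residue_ids: Optional[list[str]] = None
    starting_residue_number: Optional[str] = None

    @model_validator(mode="after")
    def _generate_residue_ids(self):
        # If no explicit list was given, derive one from the start number
        # (defaulting to 1), one ID per residue in the sequence.
        if self.residue_ids is None:
            start = int(self.starting_residue_number or 1)
            self.residue_ids = [str(start + i) for i in range(len(self.sequence))]
        return self
```

This keeps the parsing logic in one place: by the time the model is constructed, residue_ids is always populated, and downstream code never needs to re-derive it.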

Also assigning @ljarosch to review, as he has more experience working with biotite and renumbering chains and may have additional suggestions regarding the organization.

Please let us know if you have any questions, and thank you again for your work on this issue!

```python
elif out_fmt == "npz":
    np.savez_compressed(out_file_full, **full_confidence_scores)

# openfold3/core/runners/writer.py
```
Contributor

Please remove this stray comment

@jnwei jnwei requested a review from ljarosch December 8, 2025 11:21
@chodec chodec marked this pull request as draft December 12, 2025 15:26
@lucajovine

Hello, any development on this front?

@chodec
Contributor Author

chodec commented Feb 6, 2026

Hi @lucajovine, really sorry for the late reply. Between the holidays, a work crunch, and now my university exams, I completely lost track of this. I'm just heading out on vacation now, but I'll jump back into it as soon as I'm back. Thanks for your patience!

@lucajovine

No worries and thanks again!

…qlaboratory#58)

Implements custom residue numbering feature with support for:
- Explicit residue_ids list (e.g., ['1A', '2', '3B'])
- Starting residue number offset
- Insertion codes in PDB format

Changes:
- Add residue_ids and starting_residue_number fields to Chain model
- Add validation and generation logic in Chain._validate_and_generate_residue_ids
- Add custom_residue_ids to batch features in InferenceDataset
- Add _renumber_atom_array method in OF3OutputWriter
- Add unit tests for Chain validation and writer renumbering

Resolves: aqlaboratory#58
@chodec chodec force-pushed the feature/residue-start-numbering branch from eabfc13 to 43bc59b on February 17, 2026 15:57
@chodec
Contributor Author

chodec commented Feb 17, 2026

Hi @jnwei and @ljarosch,

I've finally implemented all the requested changes, sorry for the long delay :D

  1. Moved logic to Chain model - residue_ids are now generated in Chain._validate_and_generate_residue_ids() using Pydantic model_validator as suggested
  2. InferenceDataset simplified - now just extracts the already-validated residue_ids from chains
  3. Added unit tests:
    • test_chain_residue_id_generation() in test_structure_from_query.py
    • Writer tests in test_writer.py
  4. Rebased on latest upstream/main

The implementation now follows the pattern from InferenceExperimentSettings for generating values from either a list or an initial value.

Hopefully it will be helpful now.

@chodec chodec marked this pull request as ready for review February 17, 2026 15:59
Contributor

@jnwei jnwei left a comment

@chodec Thank you very much for your work on this PR and for refactoring the residue ID construction into the Chain definition. I also really appreciate the additional tests and attention to detail, especially for getting the insertion-code parsing right.

I can help run an end-to-end test for this PR some time next week. But for now, I wanted to provide some suggestions for the organization of the tests and default behavior.

@ljarosch Please take a quick look, especially at the _renumber_atom_array logic in OF3OutputWriter.

Comment on lines +16 to +23

```python
from pydantic import model_validator, Field

from pydantic import (
    BaseModel,
    BeforeValidator,
    DirectoryPath,
    FilePath,
    field_serializer,
    field_serializer
```
Contributor

@jnwei jnwei Feb 26, 2026

nit: The imports seem to have been separated into two groups from pydantic. Can these import statements be merged together?

A linting tool should be able to fix the import statements. For this project, we use ruff with the settings in pyproject.toml.

Contributor

Please revert these changes to tests/__init__.py in the final submission

Contributor

Please revert these changes to openfold3/__init__.py

```python
else:
    # Default to numbering starting from 1
    res_ids = [str(1 + i) for i in range(sequence_length)]
data["residue_ids"] = res_ids
```
Contributor

I think it would be preferable to leave residue_ids as None if there is no custom residue numbering / start number provided by the user.

I see that later, this custom residue_ids field is passed to the batch, which is then used to trigger the renumber_residue_ids later. If residue_ids is left blank, then default behavior for inference would remain the same (i.e. it would not perform the renumbering step).
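The suggested gating could be sketched as follows (hypothetical helper and batch-key names; the actual batch layout may differ):

```python
def should_renumber(batch: dict) -> bool:
    """Sketch of the suggested default behavior: the writer only performs
    the renumbering step when the batch actually carries custom residue
    IDs. Helper and key names are illustrative, not the PR's actual API.
    """
    return batch.get("custom_residue_ids") is not None
```

Leaving the field as None when the user supplies nothing means the default inference path is byte-for-byte unchanged.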

Contributor

Would you mind making a new copy of query_multimer.json with the desired residue_ids. Perhaps you could save it as examples/example_inference_inputs/query_multimer_custom_numbering.json

```python
)

@staticmethod
def _renumber_atom_array(
```
Contributor

@ljarosch Do you have any comments about this method? I assume we have other parts of the data pipeline that would also require renumbering / reannotation of atom arrays. Would it make sense to add this function as one of the pipelines?
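For illustration, the per-atom side of such a renumbering step could be sketched with plain NumPy arrays standing in for the AtomArray's res_id and ins_code annotations. The function name is illustrative, and it assumes custom IDs are supplied in ascending order of the original residue numbers:

```python
import re

import numpy as np


def renumber_atoms(res_id: np.ndarray, custom_ids: list[str]):
    """NumPy-only sketch of the renumbering step discussed above: map one
    custom string ID per residue onto per-atom res_id / ins_code arrays
    (mirroring the annotations a Biotite AtomArray stores).

    Assumes custom_ids are ordered by ascending original residue number;
    names are illustrative, not the PR's actual implementation.
    """
    unique_ids = np.unique(res_id)  # original residue numbers, sorted
    assert len(unique_ids) == len(custom_ids)
    new_res_id = np.empty_like(res_id)
    ins_code = np.full(res_id.shape, "", dtype="U1")
    for old, custom in zip(unique_ids, custom_ids):
        match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", custom)
        assert match is not None, f"Invalid residue ID: {custom!r}"
        mask = res_id == old
        new_res_id[mask] = int(match.group(1))  # numeric part
        ins_code[mask] = match.group(2)         # insertion code, may be ""
    return new_res_id, ins_code
```

For example, atoms with original residue numbers [1, 1, 2, 3, 3] and custom IDs ["100", "103A", "104"] would come out with res_id [100, 100, 103, 104, 104] and insertion codes ["", "", "A", "", ""].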

```python
)
def test_structure_from_query(query: Query, ground_truth_file: Path):
    """Tests that the generated structure and reference molecules matches gt."""
def test_structure_with_ref_mols_from_query(query, ground_truth_file):
```
Contributor

The changes to test_structure_with_ref_mols_from_query seem unrelated to this PR; could we revert them?

```python
chain_default = Chain.model_validate(base_params)
assert chain_default.residue_ids == ['1', '2', '3']

params_start = base_params.copy()
```
Contributor

I would recommend creating a separate test for each test case, rather than updating the base_params. This way, if a test fails, it is easy to see which specific use case has failed.

Perhaps an organization that is something like this (pseudocode, not tested):

```python
class TestCustomResidueIDGeneration:
    def base_params(self):  # could potentially be a pytest.fixture instead
        return {
            "molecule_type": MoleculeType.PROTEIN,
            "chain_ids": ["A"],
            "sequence": "AAA",
        }

    def test_base_definition(self):
        chain_default = Chain.model_validate(self.base_params())
        assert chain_default.residue_ids == ['1', '2', '3']

    def test_residue_id_starting_number(self):
        params_start = {**self.base_params(), 'starting_residue_number': "100"}
        chain_start = Chain.model_validate(params_start)
        assert chain_start.residue_ids == ['100', '101', '102']
```

```python
assert writer.failed_count == 1
assert writer.success_count == 0

def test_renumber_atom_array_with_insertion_codes(self):
```
Contributor

Would it be possible to explicitly write out the atom array? Similar to the dummy_array created here? https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/conftest.py#L10-L24

I find that it is much easier to check the input example when large inputs such as atom_arrays are written explicitly rather than constructed piece by piece.

jnwei pushed a commit that referenced this pull request Mar 13, 2026