
feat(inference): Add support for custom residue numbering (resolves #58) #69

Open

chodec wants to merge 1 commit into aqlaboratory:main from chodec:feature/residue-start-numbering

Conversation

Contributor

@chodec chodec commented Dec 6, 2025

This pull request implements support for custom residue numbering in the inference output, resolving issue #58.

Summary of Changes

The primary goal was to allow users to define specific residue numbering in the input JSON, rather than relying on the default numbering starting from 1. This includes support for non-sequential numbers and PDB-style insertion codes (e.g., '103A').

The implementation required coordinated changes across three key modules:

  1. Input Definition (inference_query_format.py):
    • Added two optional fields to the Chain class: starting_residue_number (a simple offset) and residue_ids (an explicit list of string IDs).
  2. Data Processing (inference.py):
    • Implemented logic that ensures residue_ids takes precedence over starting_residue_number. If a valid explicit list is provided, it is used; otherwise, a sequential list is generated based on the start number. The final list is stored in the data batch.
  3. Output Writing (Post-processing in writer.py):
    • Implemented the static method OF3OutputWriter._renumber_atom_array. This method executes after model inference but before writing the PDB/mmCIF file.
    • It uses regular expressions (re) to safely parse string IDs (e.g., separating '103A' into the integer ID 103 and the insertion code 'A').
    • The new IDs are applied directly to the Biotite AtomArray's res_id and ins_code annotations. This ensures the output structure reflects the desired numbering without affecting core model calculations.
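The insertion-code parsing described in step 3 could be sketched as follows. This is a minimal sketch, assuming the behavior described above; `parse_residue_id` is an illustrative helper name, not the PR's actual function:

```python
import re

def parse_residue_id(res_id: str) -> tuple[int, str]:
    """Split a PDB-style residue ID such as '103A' into its integer part
    and optional one-letter insertion code.

    Hypothetical helper mirroring the parsing described above; the name
    and signature are illustrative, not the PR's actual API.
    """
    match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", res_id)
    if match is None:
        raise ValueError(f"Invalid residue ID: {res_id!r}")
    return int(match.group(1)), match.group(2)
```

With this scheme, `'103A'` yields the integer ID 103 and insertion code `'A'`, while a plain `'42'` yields 42 with an empty insertion code.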

Related Issues

Resolves: #58


Testing and Validation

Note on testing: due to local environment configuration issues (missing model checkpoints), it was not possible to perform an end-to-end test run.

However, the logic has been manually validated to ensure:

  • Priority and Consistency: The implementation correctly prioritizes residue_ids and handles sequence-length mismatches by defaulting to standard numbering (1, 2, 3, ...).
  • Parsing Robustness: The regex parsing logic in writer.py correctly extracts insertion codes, which is critical for PDB compliance.
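The precedence rules above could be sketched roughly like this (illustrative function and parameter names, not the actual implementation):

```python
def resolve_residue_ids(sequence_length, residue_ids=None, starting_residue_number=None):
    """Sketch of the precedence rules described above (illustrative names).

    An explicit residue_ids list of the right length wins; a length
    mismatch falls back to default 1-based numbering; otherwise a
    sequential list is generated from the start number (default 1).
    """
    if residue_ids is not None:
        if len(residue_ids) == sequence_length:
            return [str(r) for r in residue_ids]
        # Mismatched list: fall back to default numbering
        return [str(i + 1) for i in range(sequence_length)]
    start = int(starting_residue_number) if starting_residue_number is not None else 1
    return [str(start + i) for i in range(sequence_length)]
```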

Contributor

@jnwei jnwei left a comment

Hi @chodec,

Thank you for working on this PR. Custom numbering is a tricky feature to get right, but as #58 indicates, it would be very useful for researchers.

I took a first pass over this PR today, and I have a few suggestions / questions:

  • First, a check of my understanding: the current implementation adds a new method for creating a custom residue ID list (get_custom_residue_ids) to the InferenceDataset class. It looks like the intention is for the new residue IDs to be read by the output writer, but I don't see where the new residue IDs are added to the batch?

  • I would recommend that the custom residue_id list be created upon construction of the Chain class in inference_query_format.py rather than being generated in the InferenceDataset. This way, the logic around parsing the residue ids can be kept in one place, rather than adding extra logic to the InferenceDataset.

    • For an example of how to use pydantic validators to create the residue_id list given an input that is either a full list or an int, you might be able to borrow the logic used in InferenceExperimentSettings to generate random seeds from a list or an initial integer seed here
    • The InferenceDataset can then be used to create batch features of the custom residue list if it is provided in a chain.
  • Could you please add unit tests for the creation of the custom residue numbering? I think it could be helpful to have two tests:
    • One test for generating the optional residue_id list in the Chain class, perhaps added here
    • One test for writing the outputs, which could be added here

  • I would guess that some of the examples you used for manual validation of the numbering might be suitable test cases.
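The validator-based approach suggested above could look roughly like this. This is a sketch assuming pydantic v2; the trimmed-down Chain model and its validator name are illustrative, and only the field names follow the PR description:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class Chain(BaseModel):
    """Hypothetical, trimmed-down Chain model sketching the suggestion above:
    generate residue_ids at validation time from either an explicit list or
    a starting number. Field names follow the PR description; everything
    else is illustrative.
    """

    sequence: str
    residue_ids: Optional[list[str]] = None
    starting_residue_number: Optional[str] = None

    @model_validator(mode="after")
    def _generate_residue_ids(self):
        # If no explicit list was given, derive one from the start number
        # (defaulting to 1), one ID per residue in the sequence.
        if self.residue_ids is None:
            start = int(self.starting_residue_number or 1)
            self.residue_ids = [str(start + i) for i in range(len(self.sequence))]
        return self
```

This keeps the parsing logic in one place: by the time the model is constructed, residue_ids is always populated, and downstream code never needs to re-derive it.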

Also assigning @ljarosch to review, as he has more experience working with biotite and renumbering chains and may have additional suggestions regarding the organization.

Please let us know if you have any questions, and thank you again for your work on this issue!

```python
elif out_fmt == "npz":
    np.savez_compressed(out_file_full, **full_confidence_scores)

# openfold3/core/runners/writer.py
```
Contributor

Please remove this stray comment

@jnwei jnwei requested a review from ljarosch December 8, 2025 11:21
@chodec chodec marked this pull request as draft December 12, 2025 15:26
@lucajovine

Hello, any development on this front?

@chodec
Contributor Author

chodec commented Feb 6, 2026

Hi @lucajovine, really sorry for the late reply. Between the holidays, a work crunch, and now my university exams, I completely lost track of this. I'm just heading out on vacation now, but I'll jump back into it as soon as I'm back. Thanks for your patience!

@lucajovine

No worries and thanks again!

…qlaboratory#58)

Implements custom residue numbering feature with support for:
- Explicit residue_ids list (e.g., ['1A', '2', '3B'])
- Starting residue number offset
- Insertion codes in PDB format

Changes:
- Add residue_ids and starting_residue_number fields to Chain model
- Add validation and generation logic in Chain._validate_and_generate_residue_ids
- Add custom_residue_ids to batch features in InferenceDataset
- Add _renumber_atom_array method in OF3OutputWriter
- Add unit tests for Chain validation and writer renumbering

Resolves: aqlaboratory#58
@chodec chodec force-pushed the feature/residue-start-numbering branch from eabfc13 to 43bc59b on February 17, 2026 15:57
@chodec
Contributor Author

chodec commented Feb 17, 2026

Hi @jnwei and @ljarosch,

I've finally implemented all the requested changes, sorry for the long delay :D

  1. Moved logic to Chain model - residue_ids are now generated in Chain._validate_and_generate_residue_ids() using Pydantic model_validator as suggested
  2. InferenceDataset simplified - now just extracts the already-validated residue_ids from chains
  3. Added unit tests:
    • test_chain_residue_id_generation() in test_structure_from_query.py
    • Writer tests in test_writer.py
  4. Rebased on latest upstream/main

The implementation now follows the pattern from InferenceExperimentSettings for generating values from either a list or an initial value.

Hopefully it will be helpful now.

@chodec chodec marked this pull request as ready for review February 17, 2026 15:59
Contributor

@jnwei jnwei left a comment

@chodec Thank you very much for your work on this PR and for refactoring the residue ID construction into the Chain definition. I also really appreciate the additional tests and attention to detail, especially for getting the insertion-code parsing right.

I can help run an end-to-end test for this PR some time next week. But for now, I wanted to provide some suggestions for the organization of the tests and default behavior.

@ljarosch Please take a quick look, especially at the _renumber_atom_array logic in OF3OutputWriter.

Comment on lines +16 to +23

```python
from pydantic import model_validator, Field

from pydantic import (
    BaseModel,
    BeforeValidator,
    DirectoryPath,
    FilePath,
    field_serializer,
    field_serializer
```
Contributor

@jnwei jnwei Feb 26, 2026

nit: The imports seem to have been separated into two groups from pydantic. Can these import statements be merged together?

A linting tool should be able to fix the import statements. For this project, we use ruff with the settings in pyproject.toml.

Contributor

Please revert these changes to tests/__init__.py in the final submission

Contributor

Please revert these changes to openfold3/__init__.py

```python
else:
    # Default to numbering starting from 1
    res_ids = [str(1 + i) for i in range(sequence_length)]
data["residue_ids"] = res_ids
```
Contributor

I think it would be preferable to leave residue_ids as None if there is no custom residue numbering / start number provided by the user.

I see that later, this custom residue_ids field is passed to the batch, which is then used to trigger the renumber_residue_ids later. If residue_ids is left blank, then default behavior for inference would remain the same (i.e. it would not perform the renumbering step).
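The suggested gating could be sketched as follows (hypothetical helper and batch-key names; the actual batch layout may differ):

```python
def should_renumber(batch: dict) -> bool:
    """Sketch of the suggested default behavior: the writer only performs
    the renumbering step when the batch actually carries custom residue
    IDs. Helper and key names are illustrative, not the PR's actual API.
    """
    return batch.get("custom_residue_ids") is not None
```

Leaving the field as None when the user supplies nothing means the default inference path is byte-for-byte unchanged.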

Contributor

Would you mind making a new copy of query_multimer.json with the desired residue_ids. Perhaps you could save it as examples/example_inference_inputs/query_multimer_custom_numbering.json

```python
)

@staticmethod
def _renumber_atom_array(
```
Contributor

@ljarosch Do you have any comments about this method? I assume we have other parts of the data pipeline that would also require renumbering / reannotation of atom arrays. Would it make sense to add this function as one of the pipelines?
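For illustration, the per-atom side of such a renumbering step could be sketched with plain NumPy arrays standing in for the AtomArray's res_id and ins_code annotations. The function name is illustrative, and it assumes custom IDs are supplied in ascending order of the original residue numbers:

```python
import re

import numpy as np


def renumber_atoms(res_id: np.ndarray, custom_ids: list[str]):
    """NumPy-only sketch of the renumbering step discussed above: map one
    custom string ID per residue onto per-atom res_id / ins_code arrays
    (mirroring the annotations a Biotite AtomArray stores).

    Assumes custom_ids are ordered by ascending original residue number;
    names are illustrative, not the PR's actual implementation.
    """
    unique_ids = np.unique(res_id)  # original residue numbers, sorted
    assert len(unique_ids) == len(custom_ids)
    new_res_id = np.empty_like(res_id)
    ins_code = np.full(res_id.shape, "", dtype="U1")
    for old, custom in zip(unique_ids, custom_ids):
        match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", custom)
        assert match is not None, f"Invalid residue ID: {custom!r}"
        mask = res_id == old
        new_res_id[mask] = int(match.group(1))  # numeric part
        ins_code[mask] = match.group(2)         # insertion code, may be ""
    return new_res_id, ins_code
```

For example, atoms with original residue numbers [1, 1, 2, 3, 3] and custom IDs ["100", "103A", "104"] would come out with res_id [100, 100, 103, 104, 104] and insertion codes ["", "", "A", "", ""].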

```python
)
def test_structure_from_query(query: Query, ground_truth_file: Path):
    """Tests that the generated structure and reference molecules matches gt."""
def test_structure_with_ref_mols_from_query(query, ground_truth_file):
```
Contributor

The changes to test_structure_with_ref_mols_from_query seem unrelated to this PR; could we revert them?

```python
chain_default = Chain.model_validate(base_params)
assert chain_default.residue_ids == ['1', '2', '3']

params_start = base_params.copy()
```
Contributor

I would recommend creating a separate test for each test case, rather than updating the base_params. This way, if a test fails, it is easy to see which specific use case has failed.

Perhaps an organization that is something like this (pseudocode, not tested):

```python
class TestCustomResidueIDGeneration:
    def base_params(self):  # could potentially be a pytest.fixture instead
        return {
            "molecule_type": MoleculeType.PROTEIN,
            "chain_ids": ["A"],
            "sequence": "AAA",
        }

    def test_base_definition(self):
        chain_default = Chain.model_validate(self.base_params())
        assert chain_default.residue_ids == ['1', '2', '3']

    def test_residue_id_starting_number(self):
        params_start = {**self.base_params(), 'starting_residue_number': "100"}
        chain_start = Chain.model_validate(params_start)
        assert chain_start.residue_ids == ['100', '101', '102']
```

```python
assert writer.failed_count == 1
assert writer.success_count == 0

def test_renumber_atom_array_with_insertion_codes(self):
```
Contributor

Would it be possible to explicitly write out the atom array? Similar to the dummy_array created here? https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/conftest.py#L10-L24

I find that it is much easier to check the input example when large inputs such as atom_arrays are written explicitly rather than constructed piece by piece.

jnwei pushed a commit that referenced this pull request Mar 13, 2026