[BUG] Broken test_tokenizer.py tests for biotite>=1.3 #96

@jnwei

Describe the bug
Tests for the tokenization pipeline fail with biotite >=1.3 because of changes to [biotite.structure.get_residue_starts()](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/residues.py#L39).

In particular, residue starts are now also identified by changes in the sym_id property.
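To illustrate the effect, here is a minimal pure-Python sketch of segment-start detection. It is not biotite's implementation: `segment_starts` and the toy annotations are hypothetical and only mimic the idea that a new segment begins wherever any of the listed categories changes.

```python
def segment_starts(annotations, categories):
    """Return indices where any of the given annotation categories changes."""
    n_atoms = len(next(iter(annotations.values())))
    starts = [0] if n_atoms else []
    for i in range(1, n_atoms):
        if any(annotations[c][i] != annotations[c][i - 1] for c in categories):
            starts.append(i)
    return starts


# Hypothetical toy data: 6 atoms in two residues, with a sym_id change
# in the middle of the second residue.
toy = {
    "chain_id": ["A"] * 6,
    "res_id": [1, 1, 2, 2, 2, 2],
    "ins_code": [""] * 6,
    "sym_id": [1, 1, 1, 1, 2, 2],
}

without_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code"])
with_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code", "sym_id"])
print(without_sym)  # [0, 2]
print(with_sym)     # [0, 2, 4]
```

Including `sym_id` in the compared categories adds a start at index 4, which is the same kind of extra start seen in the failing tests.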

To Reproduce

  1. Install default openfold3 with biotite == 1.2, e.g.
    pip install openfold3[dev]

  2. Verify that tests pass for [test_tokenizer.py](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py), e.g.
    pytest openfold3/tests/test_tokenizer.py

  3. Update biotite to version 1.3 or greater, e.g.
    pip install -U biotite==1.3

  4. Observe that tests fail for [test_tokenizer.py](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py).

Expected behavior
OpenFold-3 tests should pass with biotite >=1.3. We should reexamine our tokenization pipelines to see whether anything needs to change, or whether updating our test examples to account for the new behavior of get_residue_starts is sufficient.
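If the test examples end up depending on the installed biotite version, one option is a small version gate. This is only a sketch: `splits_on_sym_id` is a hypothetical helper, and the 1.3 threshold is taken from the behavior described in this issue.

```python
import re


def splits_on_sym_id(biotite_version: str) -> bool:
    """Return True if this biotite version is expected to split residues on sym_id.

    Assumption from this issue: the behavior changed in biotite 1.3.
    """
    match = re.match(r"(\d+)\.(\d+)", biotite_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {biotite_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 3)


print(splits_on_sym_id("1.2.0"))  # False
print(splits_on_sym_id("1.3.0"))  # True
```

The version string could come from `importlib.metadata.version("biotite")`. Comparing integer tuples rather than strings keeps a version like 1.10 ordered after 1.3.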

Stack trace

FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[1ema] - AssertionError: AtomArrays have different values for: token_id.
FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[5tdj] - AssertionError: AtomArrays have different values for: token_id.
FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[6znc] - AssertionError: AtomArrays have different values for: token_id.

Additional context

Sample code that traces the difference in token starts back to the sym_id annotation:

from pathlib import Path

import biotite.structure as struc
import numpy as np
# Assumed import location for get_segment_starts in biotite>=1.3
from biotite.structure.segments import get_segment_starts
from openfold3.core.data.io.structure.atom_array import read_atomarray_from_npz

TEST_DIR = Path("openfold3/tests/test_data/tokenization")
pdb_id = "1ema"
input_path = TEST_DIR / "inputs" / f"{pdb_id}_raw_bonds_unfiltered.npz"
output_path = TEST_DIR / "outputs" / f"{pdb_id}_tokenized_bonds_unfiltered.npz"

arr_in = read_atomarray_from_npz(input_path)
arr_out = read_atomarray_from_npz(output_path)

# Expected residue starts, taken from the stored tokenized output
expected_residue_starts = np.unique(arr_out.res_id, return_index=True)[1]
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, ...])

# Actual residue starts for the atoms in question
struc.get_residue_starts(arr_in, add_exclusive_stop=False)
# in biotite>=1.3, 42 is an extra residue start
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
# in biotite==1.2, the residue starts match the output
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, 75, ...])

# Matches the call made in v1.3 by `struc.get_residue_starts()`.
# If we remove `sym_id`, we get the correct residue starts.
get_segment_starts(arr_in, add_exclusive_stop=True, equal_categories=["chain_id", "res_id", "ins_code", "sym_id"])
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
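To pinpoint where the two versions disagree, the start indices can be compared directly. The sketch below uses only the leading values printed above (the arrays are truncated in the output, so this is not the full data), and `diff_starts` is a hypothetical helper name.

```python
def diff_starts(expected, actual):
    """Return (extra, missing) start indices of `actual` relative to `expected`."""
    expected_set, actual_set = set(expected), set(actual)
    extra = sorted(actual_set - expected_set)
    missing = sorted(expected_set - actual_set)
    return extra, missing


# Leading values copied from the output above (truncated there with "...")
expected = [0, 8, 14, 23, 27, 36, 45, 53, 64, 71]
actual = [0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71]

extra, missing = diff_starts(expected, actual)
print(extra)    # [42]
print(missing)  # []
```

Inspecting the annotations of `arr_in` around each extra index (here 42) should show which category changes at that atom.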

Labels: bug, data preprocessing
