Description
Describe the bug
The tokenization pipeline tests fail with biotite >= 1.3 because of a change in [biotite.structure.get_residue_starts()](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/residues.py#L39).
In particular, residue starts are now also identified by changes in the `sym_id` annotation.
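To illustrate why an extra annotation category can add a residue start, here is a minimal pure-Python sketch. It is not biotite's implementation: `segment_starts` is a hypothetical helper and the toy annotation values are made up, assuming only that a start is recorded wherever any listed category differs from the previous atom.

```python
def segment_starts(annotations, categories):
    """Return atom indices where any listed category differs from the previous atom.

    Hypothetical stand-in for illustration only, not biotite's code.
    """
    n_atoms = len(next(iter(annotations.values())))
    starts = [0] if n_atoms else []
    for i in range(1, n_atoms):
        if any(annotations[c][i] != annotations[c][i - 1] for c in categories):
            starts.append(i)
    return starts


# Toy annotations for six atoms (made-up values, not taken from 1ema)
toy = {
    "chain_id": ["A"] * 6,
    "res_id": [10, 10, 10, 11, 11, 11],
    "ins_code": [""] * 6,
    "sym_id": ["1", "1", "2", "2", "2", "2"],
}

without_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code"])
with_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code", "sym_id"])

print(without_sym)  # [0, 3]
print(with_sym)     # [0, 2, 3]
```

In the toy data, `sym_id` changes at atom 2 while `res_id` stays the same, so including `sym_id` splits what was previously one residue into two segments.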
To Reproduce
1. Install default openfold3 with biotite == 1.2, e.g. `pip install openfold3[dev]`
2. Verify that tests pass for [test_tokenizer.py](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py), e.g. `pytest openfold3/tests/test_tokenizer.py`
3. Update biotite to version 1.3 or greater, e.g. `pip install -U biotite==1.3`
4. Observe that tests fail for [test_tokenizer.py](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py).
Expected behavior
Openfold-3 tests should pass on biotite >= 1.3. We should reexamine our tokenization pipelines to see whether anything needs to change, or whether updating our test examples to account for the new behavior of `get_residue_starts` is sufficient.
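If the tests end up needing to distinguish the two behaviors, one option is a small version gate. The following is only a sketch under assumptions: both helper names are hypothetical, and the 1.3 threshold is taken from this report rather than from the biotite changelog.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.3.0'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def residue_starts_use_sym_id(biotite_version: str) -> bool:
    """Assumption from this issue: sym_id affects residue starts from 1.3 on."""
    return parse_major_minor(biotite_version) >= (1, 3)


print(residue_starts_use_sym_id("1.2.0"))  # False
print(residue_starts_use_sym_id("1.3.0"))  # True
```

The installed version string can be obtained with `importlib.metadata.version("biotite")` from the standard library.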
Stack trace
FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[1ema] - AssertionError: AtomArrays have different values for: token_id.
FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[5tdj] - AssertionError: AtomArrays have different values for: token_id.
FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[6znc] - AssertionError: AtomArrays have different values for: token_id.
Additional context
Sample code that traces the difference in residue starts to the `sym_id` annotation:
from pathlib import Path

import numpy as np
import biotite.structure as struc
# Import path assumed: get_segment_starts lives in biotite/structure/segments.py
from biotite.structure.segments import get_segment_starts
from openfold3.core.data.io.structure.atom_array import read_atomarray_from_npz

TEST_DIR = Path("openfold3/tests/test_data/tokenization")
pdb_id = "1ema"
input_path = TEST_DIR / "inputs" / f"{pdb_id}_raw_bonds_unfiltered.npz"
output_path = TEST_DIR / "outputs" / f"{pdb_id}_tokenized_bonds_unfiltered.npz"
arr_in = read_atomarray_from_npz(input_path)
arr_out = read_atomarray_from_npz(output_path)

# Expected residue starts, derived from the stored tokenizer output
expected_residue_starts = np.unique(arr_out.res_id, return_index=True)[1]
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, ...])

# Actual residue starts computed from the input atoms
struc.get_residue_starts(arr_in, add_exclusive_stop=False)
# biotite>=1.3: 42 is an extra residue start
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
# biotite==1.2: the residue starts match the expected output
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, 75, ...])

# Matches the v1.3 call in `struc.get_residue_starts()`.
# With `sym_id` in equal_categories we get the extra start at 42;
# if we remove `sym_id`, we get the correct residue starts.
get_segment_starts(arr_in, add_exclusive_stop=True, equal_categories=["chain_id", "res_id", "ins_code", "sym_id"])
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
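The extra start can also be located programmatically rather than by eye. A minimal sketch, using only the truncated prefixes of the arrays printed above (the full arrays are longer):

```python
# Prefixes copied from the output above; the real arrays continue past 71.
expected = [0, 8, 14, 23, 27, 36, 45, 53, 64, 71]
actual = [0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71]

# Starts reported by biotite>=1.3 that the stored output does not contain
extra = sorted(set(actual) - set(expected))
# Starts in the stored output that biotite>=1.3 no longer reports
missing = sorted(set(expected) - set(actual))

print(extra)    # [42]
print(missing)  # []
```

Inspecting `arr_in.sym_id` around the reported indices should then show whether each extra start coincides with a `sym_id` change.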