[BUG] Broken test_tokenizer.py tests for biotite>=1.3

**Describe the bug**
Tests for the tokenization pipeline fail with biotite >= 1.3 because of changes in [`biotite.structure.get_residue_starts()`](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/residues.py#L39).

In particular, residue starts are now also identified by changes in the `sym_id` annotation.
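
To illustrate the effect, here is a minimal pure-Python sketch of segment-start detection. It is not biotite's implementation: the `segment_starts` helper and the toy annotation values are hypothetical and only show how adding `sym_id` to the compared categories can introduce an extra start inside what used to be a single residue.

```python
def segment_starts(annotations, categories):
    """Return the indices at which any of the given categories changes value."""
    n_atoms = len(annotations[categories[0]])
    starts = [0] if n_atoms else []
    for i in range(1, n_atoms):
        if any(annotations[c][i] != annotations[c][i - 1] for c in categories):
            starts.append(i)
    return starts


# Toy annotations (hypothetical, not taken from 1ema): `sym_id` changes at
# atom 2, in the middle of the residue with res_id == 1.
toy = {
    "chain_id": ["A"] * 6,
    "res_id": [1, 1, 1, 2, 2, 2],
    "ins_code": [""] * 6,
    "sym_id": [0, 0, 1, 1, 1, 1],
}

without_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code"])
with_sym = segment_starts(toy, ["chain_id", "res_id", "ins_code", "sym_id"])

print(without_sym)  # [0, 3]
print(with_sym)     # [0, 2, 3]
```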

**To Reproduce**
1. Install default openfold3 with `biotite == 1.2`, e.g.
`pip install openfold3[dev]`

2. Verify that tests pass for [`test_tokenizer.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py), e.g.
`pytest openfold3/tests/test_tokenizer.py`

3. Update biotite to version 1.3 or greater, e.g.
`pip install -U biotite==1.3`

4. Observe that tests fail for [`test_tokenizer.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_tokenizer.py).
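
When following the steps above, it can help to confirm which biotite version is actually active in the environment. The sketch below uses only the standard library; the helper names are hypothetical, and the `(1, 3)` threshold is simply the version boundary reported in this issue.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Extract (major, minor) from a version string such as '1.3.0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def affected_by_sym_id_change(version_string):
    """True for versions at or above 1.3, where this issue reports the failures."""
    return parse_major_minor(version_string) >= (1, 3)


try:
    installed = version("biotite")
    print(f"biotite {installed}: affected = {affected_by_sym_id_change(installed)}")
except PackageNotFoundError:
    print("biotite is not installed in this environment")
```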


**Expected behavior**
Openfold-3 tests should pass with biotite >= 1.3. We should re-examine our tokenization pipeline to determine whether it needs changes, or whether updating [our test examples](https://github.com/aqlaboratory/openfold-3/tree/main/openfold3/tests/test_data/tokenization) to account for the new behavior of `get_residue_starts` is sufficient.


**Stack trace**

> ```
> FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[1ema] - AssertionError: AtomArrays have different values for: token_id.
> FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[5tdj] - AssertionError: AtomArrays have different values for: token_id.
> FAILED openfold3/tests/test_tokenizer.py::test_tokenizer[6znc] - AssertionError: AtomArrays have different values for: token_id.
> ```


**Additional context**

Sample code that traces the difference in residue starts to the `sym_id` annotation:

```python
from pathlib import Path

import biotite.structure as struc
import numpy as np

# Import path assumed for biotite>=1.3; adjust if `get_segment_starts` lives elsewhere
from biotite.structure.segments import get_segment_starts
from openfold3.core.data.io.structure.atom_array import read_atomarray_from_npz

TEST_DIR = Path("openfold3/tests/test_data/tokenization")
pdb_id = "1ema"
input_path = TEST_DIR / "inputs" / f"{pdb_id}_raw_bonds_unfiltered.npz"
output_path = TEST_DIR / "outputs" / f"{pdb_id}_tokenized_bonds_unfiltered.npz"

arr_in = read_atomarray_from_npz(input_path)
arr_out = read_atomarray_from_npz(output_path)

# Expected residue starts, derived from the stored reference output
expected_residue_starts = np.unique(arr_out.res_id, return_index=True)[1]
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, ...])

# Actual residue starts for the atoms in question
struc.get_residue_starts(arr_in, add_exclusive_stop=False)
# biotite>=1.3: 42 is an extra residue start
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
# biotite==1.2: the residue starts match the expected output
# array([ 0, 8, 14, 23, 27, 36, 45, 53, 64, 71, 75, ...])

# Mirrors the v1.3 call behind `struc.get_residue_starts()`.
# If we remove `sym_id` from the categories, we get the correct residue starts.
get_segment_starts(
    arr_in,
    add_exclusive_stop=True,
    equal_categories=["chain_id", "res_id", "ins_code", "sym_id"],
)
# array([ 0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71, ...])
```
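
To pinpoint where two sets of residue starts diverge, a small set difference is enough. The sketch below uses only the leading indices quoted in the truncated outputs above, so it covers just that prefix of the 1ema arrays; the `diff_starts` helper is hypothetical.

```python
# Leading residue starts copied from the truncated outputs quoted in this issue
expected_starts = [0, 8, 14, 23, 27, 36, 45, 53, 64, 71]
actual_starts = [0, 8, 14, 23, 27, 36, 42, 45, 53, 64, 71]


def diff_starts(expected, actual):
    """Return (extra, missing) start indices of `actual` relative to `expected`."""
    expected_set, actual_set = set(expected), set(actual)
    extra = sorted(actual_set - expected_set)
    missing = sorted(expected_set - actual_set)
    return extra, missing


extra, missing = diff_starts(expected_starts, actual_starts)
print("extra:", extra)      # extra: [42]
print("missing:", missing)  # missing: []
```

With real data, the same comparison could be run on the full arrays returned by `np.unique(...)` and `struc.get_residue_starts(...)` after converting them to lists.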

