Problem while parsing bcif files written by py-mmcif

This week, the RCSB PDB switched its BCIF encoding from the Mol* package to [py-mmcif](https://github.com/rcsb/py-mmcif). Biotite fails to parse these BCIF files when accessing any empty category. This often happens for `label_alt_id`, for example.

This issue was also reported here: https://github.com/rcsb/py-mmcif/issues/44.

To reproduce it, download the file from http://models.rcsb.org/2uzi.bcif and then run:

```python
from biotite.structure.io.pdbx import BinaryCIFFile, get_structure

x = BinaryCIFFile.read('2uzi.bcif')
get_structure(x, model=1, use_author_fields=False)
```

The issue seems to be in decoding a `StringArrayEncoding`. The [comment in that issue](https://github.com/rcsb/py-mmcif/issues/44#issuecomment-3361207698) analyses the problem in some more depth.

I am not sure whether py-mmcif is fully compliant with the BCIF spec, but other decoders parse these files without issue: py-mmcif, Mol* and ciftools-java. They appear to be lenient toward this condition, so perhaps Biotite should be lenient as well.

Looking a little into the code, this is what ciftools-java (which closely follows the original Mol* implementation) does to process the `offsets` array:

https://github.com/rcsb/ciftools-java/blob/5d5fd56dcf4675ca7695e5977b7f9f1501dcbb4e/src/main/java/org/rcsb/cif/binary/encoding/StringArrayEncoding.java#L78-L82

That is, even when the `offsets` array has length 1, it still creates an empty string for the first position.

Could something similar be done in biotite?
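To make the idea concrete, here is a minimal sketch of what such a lenient decoder could look like in Python. It is not Biotite's actual code: the function name `decode_string_array_lenient` is hypothetical, the inputs are assumed to be the already-decoded `offsets` and `indices` arrays plus the concatenated string data, and the `+ 1` index shift is my assumption, modelled on the Mol*-style decoders.

```python
def decode_string_array_lenient(string_data, offsets, indices):
    """Hypothetical lenient decoder for a BCIF string array (sketch only).

    string_data: all unique strings concatenated into one str
    offsets:     already-decoded boundaries of the unique strings
    indices:     already-decoded index of the unique string for each row
    """
    # Position 0 of the lookup table is always an empty string, even when
    # `offsets` has a single element and therefore delimits no substring.
    strings = [""]
    for start, stop in zip(offsets[:-1], offsets[1:]):
        strings.append(string_data[start:stop])
    # Assumption: shift every index by one, so that an index of -1
    # resolves to the empty string at position 0.
    return [strings[i + 1] for i in indices]


# A column with no string content: `offsets` has length 1
empty_column = decode_string_array_lenient("", [0], [-1, -1, -1])

# A regular column mixing real values and empty entries
mixed_column = decode_string_array_lenient("AB", [0, 1, 2], [0, 1, -1, 0])
```

With this layout the lookup table is never empty, so a column without any string content decodes to a list of empty strings instead of raising.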
