Problem while parsing bcif files written by py-mmcif #831

@josemduarte

This week, the RCSB PDB switched its BCIF encoder from the Mol* package to py-mmcif. Biotite fails to parse these BCIF files when accessing any empty category; this often happens for label_alt_id, for example.

This issue was also reported here: rcsb/py-mmcif#44 .

To reproduce it, download the file from http://models.rcsb.org/2uzi.bcif and then run:

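from biotite.structure.io.pdbx import BinaryCIFFile, get_structure

# Read the BinaryCIF file downloaded from models.rcsb.org
x = BinaryCIFFile.read('2uzi.bcif')
# Building the structure is the step that fails for these files
get_structure(x, model=1, use_author_fields=False)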

The issue seems to be in decoding a StringArrayEncoding. A comment in the py-mmcif issue linked above analyses the problem in more depth.

I am not sure whether py-mmcif is fully compliant with the BCIF spec, but other decoders (py-mmcif, Mol*, ciftools-java) parse these files without issue, so they must be lenient towards this condition. Perhaps Biotite should be lenient as well.

Looking a little bit into the code, this is what ciftools-java (which follows quite closely the original Mol* implementation) does to process the offsets array:

https://github.com/rcsb/ciftools-java/blob/5d5fd56dcf4675ca7695e5977b7f9f1501dcbb4e/src/main/java/org/rcsb/cif/binary/encoding/StringArrayEncoding.java#L78-L82

That is, when the offsets array has length 1, it still creates an empty string for the first position.
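To make the idea concrete, here is a minimal pure-Python sketch of a lenient string-array decoder. It is not Biotite's actual code and not a translation of the linked ciftools-java lines; the function name `decode_string_array` and its parameters are hypothetical, and the convention that index -1 maps to the empty string is an assumption based on how I understand the Mol*-style decoders to behave.

```python
from typing import List, Sequence


def decode_string_array(
    string_data: str, offsets: Sequence[int], indices: Sequence[int]
) -> List[str]:
    """Hypothetical lenient decoder for a string-array-style encoding.

    Assumption: index -1 denotes an empty/missing string, so the lookup
    table always starts with "" even if `offsets` has only one element.
    """
    # Position 0 of the table is always the empty string.
    table = [""]
    # Each consecutive pair of offsets delimits one unique string.
    # With a single offset this loop does not run and the table stays [""].
    for start, end in zip(offsets, offsets[1:]):
        table.append(string_data[start:end])

    result = []
    for i in indices:
        # Guard against indices outside the table instead of silently
        # wrapping around via Python's negative indexing.
        if not -1 <= i < len(table) - 1:
            raise ValueError(f"index {i} is out of range")
        result.append(table[i + 1])
    return result
```

With this layout, a column whose offsets array is `[0]`, whose string data is empty, and whose indices are all -1 decodes to a list of empty strings instead of raising an error.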

Could something similar be done in biotite?
