This week, the RCSB PDB switched its BCIF encoding from the Mol* package to py-mmcif. Biotite fails to parse these BCIF files when accessing any empty category, which often happens for label_alt_id.
This issue was also reported here: rcsb/py-mmcif#44.
To reproduce it, download the file from http://models.rcsb.org/2uzi.bcif and then run:
from biotite.structure.io.pdbx import BinaryCIFFile, get_structure
x = BinaryCIFFile.read('2uzi.bcif')
get_structure(x, model=1, use_author_fields=False)
The issue seems to be in decoding a StringArrayEncoding. The comment in that issue analyses the problem in some more depth.
I am not sure if py-mmcif is fully compliant with the BCIF spec, but other decoders have no issue parsing these files: py-mmcif, Mol*, ciftools-java. They appear to be lenient towards this condition, so perhaps Biotite should be lenient as well.
Looking a little bit into the code, this is what ciftools-java (which follows quite closely the original Mol* implementation) does to process the offsets array:
https://github.com/rcsb/ciftools-java/blob/5d5fd56dcf4675ca7695e5977b7f9f1501dcbb4e/src/main/java/org/rcsb/cif/binary/encoding/StringArrayEncoding.java#L78-L82
i.e. when the offsets array has length 1, it still creates an empty string for the first position.
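To make the idea concrete, here is a rough pure-Python sketch of that lenient lookup. It is not Biotite's actual decoder and not a port of the Java code; the function name `decode_string_array_lenient` and its arguments are hypothetical, and it assumes the offsets and indices have already been decoded into plain integer sequences. The only point is that the lookup table always starts with an empty string, so an index of -1 resolves to "" even when the offsets array has a single element:

```python
from typing import List, Sequence


def decode_string_array_lenient(
    string_data: str,
    offsets: Sequence[int],
    indices: Sequence[int],
) -> List[str]:
    """Hypothetical helper: resolve indices into strings, tolerating degenerate offsets."""
    # Slot 0 of the table is always the empty string, so it exists
    # even if `offsets` has only one element (or none at all).
    table = [""]
    for i in range(1, len(offsets)):
        table.append(string_data[offsets[i - 1]:offsets[i]])
    # Index k refers to table[k + 1]; index -1 therefore maps to "".
    return [table[index + 1] for index in indices]


# Regular case: two distinct strings plus one empty entry.
print(decode_string_array_lenient("ab", [0, 1, 2], [0, 1, -1, 0]))
# Degenerate case: empty string data and a single offset.
print(decode_string_array_lenient("", [0], [-1, -1, -1]))
```

Whether the files in question actually use -1 as the index for these empty columns is an assumption on my side; the comment in the linked issue has the details.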
Could something similar be done in biotite?