Fix handling of oper_expression in get_assemblies by davegrays · Pull Request #792 · biotite-dev/biotite

davegrays · 2025-04-28T00:22:57Z

structure.io.pdbx.get_assembly works in most cases but makes assumptions that are often violated, leading to chain instances that are given the same sym_id when they should be different.

for example, using this code:

from biotite.database import rcsb
import biotite.structure.io.pdbx as pdbx

ciffile = rcsb.fetch(pdb_id, "cif")
cif = pdbx.CIFFile.read(ciffile)
assembly_gen_category = cif.block["pdbx_struct_assembly_gen"]
array = pdbx.get_assembly(cif.block, assembly_id="1")

print("assembly_id, op_expr, asym_id_expr")
print(list(zip(
    assembly_gen_category["assembly_id"].as_array(str),
    assembly_gen_category["oper_expression"].as_array(str),
    assembly_gen_category["asym_id_list"].as_array(str),
)))
print("chain_ids:", np.unique(array.sym_id))

gives me the following output for pdb_id 1nqb assembly 1, which is correct:

assembly_id, op_expr, asym_id_expr
[('1', '1,2,3', 'A,B,C,D')]
chain_ids: array([0, 1, 2])

but the following for pdb_id 4zxb, which creates all chain instances with sym_id 0 when there should be both 0 and 1

assembly_id, op_expr, asym_id_expr
[('1', '1', 'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O'), ('1', '2', 'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O')]
chain_ids: array([0])

The fix is to aggregate the oper_expressions across rows by assembly_id, rather than iterating through them separately (which overwrites the previous sym_id each time)

codspeed-hq · 2025-04-28T00:53:45Z

CodSpeed Performance Report

Merging #792 will not alter performance

_{Comparing davegrays:fix/get-assemblies (9cac0dd) with main (5377013)}

Summary

✅ 59 untouched benchmarks

davegrays · 2025-04-28T06:37:51Z

FYI looks like the CI is failing due to some checks on the new test data that I added (4zxb.bcif, downloaded directly from the pdb landing page). All the tests on the existing data are working tho, as is the test I added, so I'm assuming there's some additional requirements about test data that I'm unaware of. Can you give me any pointers?

padix-key · 2025-04-28T14:36:04Z

Thanks for this important fix! Regarding the failing tests, there are indeed some additional requirements:

Just for completeness, please add the corresponding .cif file as well, then also the other CIF tests can profit from the new entry as well.
Could you also add a small description of the purpose for adding this file (e.g. Multiple entries in pdbx_struct_assembly_gen for the same assembly_id) to https://github.com/biotite-dev/biotite/tree/main/tests/structure/data/README.rst?
One of the tests fails due to missing 3Di sequences. The instructions how to generate them are in https://github.com/biotite-dev/biotite/blob/main/tests/structure/data/alphabet/README.rst. As I have foldseek already installed, I can also generate them, if you prefer.

The last failing test is test_sequence.py::test_pdbx_sequence_consistency, which I assume is specific to this new entry. I can further look into this, but for now it seems that 4zxb introduces not one but two edge cases, that have not been covered before.

davegrays · 2025-04-28T21:30:30Z

I'll take care of the first two bullet points. Can you point me to your foldseek version that you're using? I don't see this listed anywhere in the requirements

davegrays · 2025-04-29T01:15:24Z

@padix-key I've come across more exceptions that fail even for this new logic that I've defined. Have a look for example at assembly "2" from pdb_id 1ncb

assembly_id, op_expr, asym_id_expr
[('1', '1,2,3,4', 'A,B,C,D,E,F,G,H,I,J'), ('2', '1,2,3,4', 'A,C,D,E,F,G,H,J'), ('2', '5,6,7,8', 'A,B,C,D,E,F,G,H,I,J')]

This violates an even more basic assumption of the test I've rewritten: that all asymmetric units of a given assembly have the same chains. Actually I'm not sure why this assumption was there to begin with.

The fix is to generalize the logic even further, by aggregating and applying all the oper_expressions per chain, and eliminate the requirement regarding that false assumption.

LMK if that makes sense to you or you have any concerns.

padix-key · 2025-04-29T18:02:31Z

Can you point me to your foldseek version that you're using? I don't see this listed anywhere in the requirements.

This is true, the reason is that foldseek is not strictly required for the development process, as the generated foldseek reference values are part of the repo. Nevertheless, the version should be specified in the README. When I generated the 3Di sequences I used version 9-427df8a

padix-key · 2025-04-29T18:17:22Z

Actually I'm not sure why this assumption was there to begin with.

This assumption is clearly wrong, however I assume your current version already fixes this: For 1ncb the chain_ops look like

A [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
C [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
D [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
E [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
F [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
G [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
H [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
J [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
B [('5',), ('6',), ('7',), ('8',)]
I [('5',), ('6',), ('7',), ('8',)]

That fits

_pdbx_struct_assembly_gen.assembly_id 
_pdbx_struct_assembly_gen.oper_expression 
_pdbx_struct_assembly_gen.asym_id_list 
2 1,2,3,4 A,C,D,E,F,G,H,J     
2 5,6,7,8 A,B,C,D,E,F,G,H,I,J

right?

I skimmed through the generated assembly and this looked also correct:

import biotite.structure.io.pdbx as pdbx
import biotite.database.rcsb as rcsb
import biotite.structure as struc

pdbx_file = pdbx.CIFFile.read(rcsb.fetch("1ncb", "cif", "."))
assembly = pdbx.get_assembly(pdbx_file, model=1, assembly_id="2", use_author_fields=False)
for atom in assembly[struc.get_chain_starts(assembly)]:
    print(atom.chain_id, atom.sym_id)

Output:

A 0
A 1
A 2
A 3
A 4
A 5
A 6
A 7
C 0
C 1
C 2
C 3
C 4
C 5
C 6
C 7
D 0
D 1
D 2
D 3
D 4
D 5
D 6
D 7
E 0
E 1
E 2
E 3
E 4
E 5
E 6
E 7
F 0
F 1
F 2
F 3
F 4
F 5
F 6
F 7
G 0
G 1
G 2
G 3
G 4
G 5
G 6
G 7
H 0
H 1
H 2
H 3
H 4
H 5
H 6
H 7
J 0
J 1
J 2
J 3
J 4
J 5
J 6
J 7
B 0
B 1
B 2
B 3
I 0
I 1
I 2
I 3

One additional nice feature would be if the assembly was sorted by sym_id instead of chain_id.

Do I miss something here?

davegrays · 2025-04-30T04:52:16Z

@padix-key yes your understanding of the structure for 1ncb is correct, and the current changes produce the correct output. My last comment was just addressing the fact that I made an additional commit that day in order to handle those scenarios.

ok, i can sort the final assembly by sym_id. i assume we would want it sorted by sym_id, then chain_id, then res_id, yes? Anything else to sort on? I'm kind of assuming the atom ordering within residues might get scrambled and that we'd want to avoid that

padix-key · 2025-04-30T06:54:44Z

I would just sort it by sym_id and keep the remaining atom order as it is. This could be achieved with something like

struc.concatenate(
    [assembly[assembly.sym_id == sym_id] for sym_id in range(max_sym_id + 1)]
)

davegrays · 2025-05-01T02:33:40Z

@padix-key - I think I've addressed all your points now.

Let me know if you have time to look into the failing test_sequence.py::test_pdbx_sequence_consistency test. Otherwise I can find a different pdb that passes and we can debug this edge case in another MR

davegrays · 2025-05-02T00:10:02Z

@padix-key alright, I've resolved test_sequence.py::test_pdbx_sequence_consistency as well

TL;DR There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing. I think the sequence test should be skipped for such examples.

The failure occurs on chain C of 4zxb. None of the residues in the entity sequence are non-standard, but some of them are not resolved (33/188). Here's the alignment it produces:

QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATL
QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATL

TADKSSSTAYMH---L-SEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAP--------SVTLGC
TADKSSSTAYMHLSSLTSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPVCGDTTGSSVTLGC

LVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKK-----------
LVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTSS-TWPSQSITCNVAHPASSTKVDKK

-----------
IEPRGPTIKPC

This is clearly not a good alignment anywhere after the first missing residue. And I can confirm that adding back some of the missing residues (but not all of them) from the end of the sequence fixes it. Here's code to reproduce:

myref = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSSLTSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTSSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPC")

myseq = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKK")
best_aln, best_identity = _find_best_match(myseq, {"C": myref})
print(best_identity)

0.93048128342246

myseq = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKKIEPRGPTIKPC")
best_aln, best_identity = _find_best_match(myseq, {"C": myref})
print(best_identity)

1.0

LMK if this all makes sense. All the tests should be passing now. Maybe if the alignment tool can better handle missing residues in the future, then we can also drop the skip?

padix-key

There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing

Yes, you are right. The problem was that terminal gaps we not penalized, so the alignment algorithm introduced some mismatches to avoid gap penalties. I added the required change to this PR.

Furthermore, I improved the performance in get_assembly() a bit.

Thanks again for fixing this and analyzing the quirks!
Would you like to have a look at my changes, before I click merge?

davegrays · 2025-05-02T17:11:33Z

Looks great! Thanks for the clean up operation :)

…

On Fri, May 2, 2025 at 8:00 AM Patrick Kunzmann ***@***.***> wrote: ***@***.**** approved this pull request. There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing Yes, you are right. The problem was that terminal gaps we not penalized, so the alignment algorithm introduced some mismatches to avoid gap penalties. I added the required change to this PR. Furthermore, I improved the performance in get_assembly() a bit. Thanks again for fixing this and analyzing the quirks! — Reply to this email directly, view it on GitHub <#792 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACHIWG4PVE5EEAL5FATVQSD24OCAXAVCNFSM6AAAAAB37FJYTCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDQMJSGIZTOMBQHE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

davegrays added 2 commits April 27, 2025 17:17

Fix handling of oper_expression in get_assemblies

396927d

formatting

4fe816b

davegrays added 3 commits April 28, 2025 04:40

formatting fix

040aa41

add test

ff4221c

add test data

1316b41

davegrays added 4 commits April 28, 2025 23:31

even more general way of applying op_exprs

f1a64cd

add more test data

e5da6ae

remove n/a assertion from sym_id test

8dff349

formatting

f801f6b

davegrays added 2 commits May 1, 2025 02:25

sort get_assembly output by sym_id

7aa059d

foldseek fasta test file

b0ed34a

fix sort by sym_id

e6e45fe

davegrays force-pushed the fix/get-assemblies branch from 1041773 to a8446e9 Compare May 1, 2025 23:40

skip 4zxb due to incorrect alignment

7cc3efa

davegrays force-pushed the fix/get-assemblies branch from a8446e9 to 7cc3efa Compare May 1, 2025 23:40

bug fix sorting an AtomArray or Stack

ebab690

davegrays force-pushed the fix/get-assemblies branch from cce76e8 to ebab690 Compare May 2, 2025 00:09

davegrays temporarily deployed to publish May 2, 2025 00:31 — with GitHub Actions Inactive

Make alignment resilient to missing terminal residues

09de73c

More efficient assembly concatenation

4d3c197

padix-key temporarily deployed to publish May 2, 2025 14:48 — with GitHub Actions Inactive

More efficient assembly sorting

9cac0dd

padix-key approved these changes May 2, 2025

View reviewed changes

padix-key temporarily deployed to publish May 2, 2025 15:13 — with GitHub Actions Inactive

padix-key merged commit 25ac56c into biotite-dev:main May 2, 2025
24 checks passed

davegrays deleted the fix/get-assemblies branch May 2, 2025 23:06

Conversation

davegrays commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq bot commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #792 will not alter performance

Summary

Uh oh!

davegrays commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

padix-key commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davegrays commented Apr 28, 2025

Uh oh!

davegrays commented Apr 29, 2025

Uh oh!

padix-key commented Apr 29, 2025

Uh oh!

padix-key commented Apr 29, 2025

Uh oh!

davegrays commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

padix-key commented Apr 30, 2025

Uh oh!

davegrays commented May 1, 2025

Uh oh!

davegrays commented May 2, 2025

Uh oh!

padix-key left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davegrays commented May 2, 2025 via email

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

davegrays commented Apr 28, 2025 •

edited

Loading

codspeed-hq bot commented Apr 28, 2025 •

edited

Loading

davegrays commented Apr 28, 2025 •

edited

Loading

padix-key commented Apr 28, 2025 •

edited

Loading

davegrays commented Apr 30, 2025 •

edited

Loading

padix-key left a comment •

edited

Loading