Skip to content

Fix handling of oper_expression in get_assemblies#792

Merged
padix-key merged 17 commits intobiotite-dev:mainfrom
davegrays:fix/get-assemblies
May 2, 2025
Merged

Fix handling of oper_expression in get_assemblies#792
padix-key merged 17 commits intobiotite-dev:mainfrom
davegrays:fix/get-assemblies

Conversation

@davegrays
Copy link
Copy Markdown
Contributor

@davegrays davegrays commented Apr 28, 2025

structure.io.pdbx.get_assembly works in most cases but makes assumptions that are often violated, leading to chain instances that are given the same sym_id when they should be different.

for example, using this code:

from biotite.database import rcsb
import biotite.structure.io.pdbx as pdbx

ciffile = rcsb.fetch(pdb_id, "cif")
cif = pdbx.CIFFile.read(ciffile)
assembly_gen_category = cif.block["pdbx_struct_assembly_gen"]
array = pdbx.get_assembly(cif.block, assembly_id="1")

print("assembly_id, op_expr, asym_id_expr")
print(list(zip(
    assembly_gen_category["assembly_id"].as_array(str),
    assembly_gen_category["oper_expression"].as_array(str),
    assembly_gen_category["asym_id_list"].as_array(str),
)))
print("chain_ids:", np.unique(array.sym_id))

gives me the following output for pdb_id 1nqb assembly 1, which is correct:

assembly_id, op_expr, asym_id_expr
[('1', '1,2,3', 'A,B,C,D')]
chain_ids: array([0, 1, 2])

but the following for pdb_id 4zxb, which creates all chain instances with sym_id 0 when there should be both 0 and 1

assembly_id, op_expr, asym_id_expr
[('1', '1', 'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O'), ('1', '2', 'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O')]
chain_ids: array([0])

The fix is to aggregate the oper_expressions across rows by assembly_id, rather than iterating through them separately (which overwrites the previous sym_id each time)

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 28, 2025

CodSpeed Performance Report

Merging #792 will not alter performance

Comparing davegrays:fix/get-assemblies (9cac0dd) with main (5377013)

Summary

✅ 59 untouched benchmarks

@davegrays
Copy link
Copy Markdown
Contributor Author

davegrays commented Apr 28, 2025

FYI looks like the CI is failing due to some checks on the new test data that I added (4zxb.bcif, downloaded directly from the pdb landing page). All the tests on the existing data are working tho, as is the test I added, so I'm assuming there's some additional requirements about test data that I'm unaware of. Can you give me any pointers?

@padix-key
Copy link
Copy Markdown
Member

padix-key commented Apr 28, 2025

Thanks for this important fix! Regarding the failing tests, there are indeed some additional requirements:

  • Just for completeness, please add the corresponding .cif file as well, then also the other CIF tests can profit from the new entry as well.
  • Could you also add a small description of the purpose for adding this file (e.g. Multiple entries in pdbx_struct_assembly_gen for the same assembly_id) to https://github.com/biotite-dev/biotite/tree/main/tests/structure/data/README.rst?
  • One of the tests fails due to missing 3Di sequences. The instructions how to generate them are in https://github.com/biotite-dev/biotite/blob/main/tests/structure/data/alphabet/README.rst. As I have foldseek already installed, I can also generate them, if you prefer.

The last failing test is test_sequence.py::test_pdbx_sequence_consistency, which I assume is specific to this new entry. I can further look into this, but for now it seems that 4zxb introduces not one but two edge cases, that have not been covered before.

@davegrays
Copy link
Copy Markdown
Contributor Author

I'll take care of the first two bullet points. Can you point me to your foldseek version that you're using? I don't see this listed anywhere in the requirements

@davegrays
Copy link
Copy Markdown
Contributor Author

@padix-key I've come across more exceptions that fail even for this new logic that I've defined. Have a look for example at assembly "2" from pdb_id 1ncb

assembly_id, op_expr, asym_id_expr
[('1', '1,2,3,4', 'A,B,C,D,E,F,G,H,I,J'), ('2', '1,2,3,4', 'A,C,D,E,F,G,H,J'), ('2', '5,6,7,8', 'A,B,C,D,E,F,G,H,I,J')]

This violates an even more basic assumption of the test I've rewritten: that all asymmetric units of a given assembly have the same chains. Actually I'm not sure why this assumption was there to begin with.

The fix is to generalize the logic even further, by aggregating and applying all the oper_expressions per chain, and eliminate the requirement regarding that false assumption.

LMK if that makes sense to you or you have any concerns.

@padix-key
Copy link
Copy Markdown
Member

Can you point me to your foldseek version that you're using? I don't see this listed anywhere in the requirements.

This is true, the reason is that foldseek is not strictly required for the development process, as the generated foldseek reference values are part of the repo. Nevertheless, the version should be specified in the README. When I generated the 3Di sequences I used version 9-427df8a

@padix-key
Copy link
Copy Markdown
Member

Actually I'm not sure why this assumption was there to begin with.

This assumption is clearly wrong, however I assume your current version already fixes this: For 1ncb the chain_ops look like

A [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
C [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
D [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
E [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
F [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
G [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
H [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
J [('1',), ('2',), ('3',), ('4',), ('5',), ('6',), ('7',), ('8',)]
B [('5',), ('6',), ('7',), ('8',)]
I [('5',), ('6',), ('7',), ('8',)]

That fits

_pdbx_struct_assembly_gen.assembly_id 
_pdbx_struct_assembly_gen.oper_expression 
_pdbx_struct_assembly_gen.asym_id_list 
2 1,2,3,4 A,C,D,E,F,G,H,J     
2 5,6,7,8 A,B,C,D,E,F,G,H,I,J 

right?

I skimmed through the generated assembly and this looked also correct:

import biotite.structure.io.pdbx as pdbx
import biotite.database.rcsb as rcsb
import biotite.structure as struc

pdbx_file = pdbx.CIFFile.read(rcsb.fetch("1ncb", "cif", "."))
assembly = pdbx.get_assembly(pdbx_file, model=1, assembly_id="2", use_author_fields=False)
for atom in assembly[struc.get_chain_starts(assembly)]:
    print(atom.chain_id, atom.sym_id)

Output:

A 0
A 1
A 2
A 3
A 4
A 5
A 6
A 7
C 0
C 1
C 2
C 3
C 4
C 5
C 6
C 7
D 0
D 1
D 2
D 3
D 4
D 5
D 6
D 7
E 0
E 1
E 2
E 3
E 4
E 5
E 6
E 7
F 0
F 1
F 2
F 3
F 4
F 5
F 6
F 7
G 0
G 1
G 2
G 3
G 4
G 5
G 6
G 7
H 0
H 1
H 2
H 3
H 4
H 5
H 6
H 7
J 0
J 1
J 2
J 3
J 4
J 5
J 6
J 7
B 0
B 1
B 2
B 3
I 0
I 1
I 2
I 3

One additional nice feature would be if the assembly was sorted by sym_id instead of chain_id.

Do I miss something here?

@davegrays
Copy link
Copy Markdown
Contributor Author

davegrays commented Apr 30, 2025

@padix-key yes your understanding of the structure for 1ncb is correct, and the current changes produce the correct output. My last comment was just addressing the fact that I made an additional commit that day in order to handle those scenarios.

ok, i can sort the final assembly by sym_id. i assume we would want it sorted by sym_id, then chain_id, then res_id, yes? Anything else to sort on? I'm kind of assuming the atom ordering within residues might get scrambled and that we'd want to avoid that

@padix-key
Copy link
Copy Markdown
Member

I would just sort it by sym_id and keep the remaining atom order as it is. This could be achieved with something like

struc.concatenate(
    [assembly[assembly.sym_id == sym_id] for sym_id in range(max_sym_id + 1)]
)

@davegrays
Copy link
Copy Markdown
Contributor Author

@padix-key - I think I've addressed all your points now.

Let me know if you have time to look into the failing test_sequence.py::test_pdbx_sequence_consistency test. Otherwise I can find a different pdb that passes and we can debug this edge case in another MR

@davegrays davegrays force-pushed the fix/get-assemblies branch from 1041773 to a8446e9 Compare May 1, 2025 23:40
@davegrays davegrays force-pushed the fix/get-assemblies branch from a8446e9 to 7cc3efa Compare May 1, 2025 23:40
@davegrays davegrays force-pushed the fix/get-assemblies branch from cce76e8 to ebab690 Compare May 2, 2025 00:09
@davegrays
Copy link
Copy Markdown
Contributor Author

@padix-key alright, I've resolved test_sequence.py::test_pdbx_sequence_consistency as well

TL;DR There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing. I think the sequence test should be skipped for such examples.

The failure occurs on chain C of 4zxb. None of the residues in the entity sequence are non-standard, but some of them are not resolved (33/188). Here's the alignment it produces:

QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATL
QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATL

TADKSSSTAYMH---L-SEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAP--------SVTLGC
TADKSSSTAYMHLSSLTSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPVCGDTTGSSVTLGC

LVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKK-----------
LVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTSS-TWPSQSITCNVAHPASSTKVDKK

-----------
IEPRGPTIKPC

This is clearly not a good alignment anywhere after the first missing residue. And I can confirm that adding back some of the missing residues (but not all of them) from the end of the sequence fixes it. Here's code to reproduce:

myref = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSSLTSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTSSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPC")

myseq = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKK")
best_aln, best_identity = _find_best_match(myseq, {"C": myref})
print(best_identity)

0.93048128342246

myseq = seq.ProteinSequence("QVQLQQSGPELVKPGALVKISCKASGYTFTNYDIHWVKQRPGQGLEWIGWIYPGDGSTKYNEKFKGKATLTADKSSSTAYMHLSEKSAVYFCAREWAYWGQGTLVTVSAAKTTAPSVYPLAPSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVTCNVAHPASSTKVDKKIEPRGPTIKPC")
best_aln, best_identity = _find_best_match(myseq, {"C": myref})
print(best_identity)

1.0

LMK if this all makes sense. All the tests should be passing now. Maybe if the alignment tool can better handle missing residues in the future, then we can also drop the skip?

Copy link
Copy Markdown
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing

Yes, you are right. The problem was that terminal gaps we not penalized, so the alignment algorithm introduced some mismatches to avoid gap penalties. I added the required change to this PR.

Furthermore, I improved the performance in get_assembly() a bit.

Thanks again for fixing this and analyzing the quirks!
Would you like to have a look at my changes, before I click merge?

@davegrays
Copy link
Copy Markdown
Contributor Author

davegrays commented May 2, 2025 via email

@padix-key padix-key merged commit 25ac56c into biotite-dev:main May 2, 2025
24 checks passed
@davegrays davegrays deleted the fix/get-assemblies branch May 2, 2025 23:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants