Fix handling of oper_expression in get_assemblies#792
Fix handling of oper_expression in get_assemblies#792padix-key merged 17 commits intobiotite-dev:mainfrom
Conversation
CodSpeed Performance ReportMerging #792 will not alter performanceComparing Summary
|
|
FYI looks like the CI is failing due to some checks on the new test data that I added ( |
|
Thanks for this important fix! Regarding the failing tests, there are indeed some additional requirements:
The last failing test is |
|
I'll take care of the first two bullet points. Can you point me to your foldseek version that you're using? I don't see this listed anywhere in the requirements |
|
@padix-key I've come across more exceptions that fail even for this new logic that I've defined. Have a look for example at assembly "2" from pdb_id This violates an even more basic assumption of the test I've rewritten: that all asymmetric units of a given assembly have the same chains. Actually I'm not sure why this assumption was there to begin with. The fix is to generalize the logic even further, by aggregating and applying all the oper_expressions per chain, and eliminate the requirement regarding that false assumption. LMK if that makes sense to you or you have any concerns. |
This is true, the reason is that |
This assumption is clearly wrong, however I assume your current version already fixes this: For That fits right? I skimmed through the generated assembly and this looked also correct: import biotite.structure.io.pdbx as pdbx
import biotite.database.rcsb as rcsb
import biotite.structure as struc
pdbx_file = pdbx.CIFFile.read(rcsb.fetch("1ncb", "cif", "."))
assembly = pdbx.get_assembly(pdbx_file, model=1, assembly_id="2", use_author_fields=False)
for atom in assembly[struc.get_chain_starts(assembly)]:
print(atom.chain_id, atom.sym_id)Output: One additional nice feature would be if the assembly was sorted by Do I miss something here? |
|
@padix-key yes your understanding of the structure for 1ncb is correct, and the current changes produce the correct output. My last comment was just addressing the fact that I made an additional commit that day in order to handle those scenarios. ok, i can sort the final assembly by sym_id. i assume we would want it sorted by sym_id, then chain_id, then res_id, yes? Anything else to sort on? I'm kind of assuming the atom ordering within residues might get scrambled and that we'd want to avoid that |
|
I would just sort it by struc.concatenate(
[assembly[assembly.sym_id == sym_id] for sym_id in range(max_sym_id + 1)]
) |
|
@padix-key - I think I've addressed all your points now. Let me know if you have time to look into the failing |
1041773 to
a8446e9
Compare
a8446e9 to
7cc3efa
Compare
cce76e8 to
ebab690
Compare
|
@padix-key alright, I've resolved TL;DR There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing. I think the sequence test should be skipped for such examples. The failure occurs on chain C of 4zxb. None of the residues in the entity sequence are non-standard, but some of them are not resolved (33/188). Here's the alignment it produces: This is clearly not a good alignment anywhere after the first missing residue. And I can confirm that adding back some of the missing residues (but not all of them) from the end of the sequence fixes it. Here's code to reproduce: LMK if this all makes sense. All the tests should be passing now. Maybe if the alignment tool can better handle missing residues in the future, then we can also drop the skip? |
There was a problem hiding this comment.
There is in fact nothing wrong or aberrant with the test data, but the alignments fail if too many residues are missing
Yes, you are right. The problem was that terminal gaps we not penalized, so the alignment algorithm introduced some mismatches to avoid gap penalties. I added the required change to this PR.
Furthermore, I improved the performance in get_assembly() a bit.
Thanks again for fixing this and analyzing the quirks!
Would you like to have a look at my changes, before I click merge?
|
Looks great! Thanks for the clean up operation :)
…On Fri, May 2, 2025 at 8:00 AM Patrick Kunzmann ***@***.***> wrote:
***@***.**** approved this pull request.
There is in fact nothing wrong or aberrant with the test data, but the
alignments fail if too many residues are missing
Yes, you are right. The problem was that terminal gaps we not penalized,
so the alignment algorithm introduced some mismatches to avoid gap
penalties. I added the required change to this PR.
Furthermore, I improved the performance in get_assembly() a bit.
Thanks again for fixing this and analyzing the quirks!
—
Reply to this email directly, view it on GitHub
<#792 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACHIWG4PVE5EEAL5FATVQSD24OCAXAVCNFSM6AAAAAB37FJYTCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDQMJSGIZTOMBQHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
structure.io.pdbx.get_assemblyworks in most cases but makes assumptions that are often violated, leading to chain instances that are given the same sym_id when they should be different.for example, using this code:
gives me the following output for pdb_id
1nqbassembly 1, which is correct:but the following for pdb_id
4zxb, which creates all chain instances with sym_id 0 when there should be both 0 and 1The fix is to aggregate the oper_expressions across rows by assembly_id, rather than iterating through them separately (which overwrites the previous sym_id each time)