Skip to content

fix: CCD/SMILES inconsistency#95

Merged
jandom merged 11 commits intomainfrom
jandom/2026-01/fix/issue-94-ccd-smiles-inconsistency
Jan 30, 2026
Merged

fix: CCD/SMILES inconsistency#95
jandom merged 11 commits intomainfrom
jandom/2026-01/fix/issue-94-ccd-smiles-inconsistency

Conversation

@jandom
Copy link
Collaborator

@jandom jandom commented Jan 21, 2026

Summary

Fixes #94

Changes

  • Added tests that show the problem

Related Issues

Testing

Other Notes

@jandom jandom self-assigned this Jan 21, 2026
@jandom jandom requested review from jnwei and ljarosch January 22, 2026 11:28
@jandom jandom marked this pull request as ready for review January 22, 2026 11:28
@jandom jandom added the safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. label Jan 22, 2026

def setup_biotite_ccd() -> None:
logger.info("Starting Biotite CCD setup...")
if not biotite.setup_ccd.OUTPUT_CCD.exists():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we actually expect this condition to ever trigger? Biotite ships a CCD during its own setup, so I'd always expect one to exist there. Imo the problem in Pat's case isn't that there is no Biotite CCD downloaded, it is that the ligands he is working with are so recent that they won't be found in outdated CCDs, which is why one needs to trigger a manual update.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually – empirically I found that I need to run this and then tests start to pass. The installed biotite somehow comes with a different CCD file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no I think you're partially right actually – there is some default CCD file that installs with biotite but it's out-of-date and running the setup command updates it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is caused by our version constraint of biotite 1.2.0, which likely ships an outdated CCD. I can confirm that a non-constrained install of biotite doesn't run into this problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still a general problem though, of course we can ameliorate it for some time by installing a fresh CCD on setup, but the PDB will keep adding new ligands weekly so you can quickly run into the same problem again. It might be reasonable enough to just include a helpful error message when a ligand could not be found in the CCD that suggests running the setup_ccd command.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and we'd need to get rid of the guard because biotite.setup_ccd.OUTPUT_CCD.exists() should always return true after biotite was installed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is great!

Is there a way to fetch the date / version of the CCD file? Then setup_ccd could prompt the user if they want to download a newer version when it is available.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative, we can get an ETag/last modified of the remote CCD file – but it's not the same file as biotite saves, because it does some modification prior. So we can't do a simple ETag1 == ETag2.

Instead I opted for a "stale ETag list" and if the file on disk is stale, it'll re-download.

I excluded all solutions with a sidecar file, this will only make it more messy.

This is overall not great because we hard-code a md5 sum in code but until we move beyond biotite==1.2.0, I don't see a better way.

@jnwei jnwei added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Jan 22, 2026
Copy link
Contributor

@jnwei jnwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition, clean way to update the CCD automatically upon openfold3 setup.

@jandom jandom requested review from jnwei and ljarosch January 22, 2026 14:00
@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Jan 22, 2026
Copy link
Collaborator

@ljarosch ljarosch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I like that we're cleaner about this now and making our tests more robust. However how would this fix this issue for users like Pat at inference time? Is it because this update function will run at setup?

Generally we will keep running into this problem, this fix will update the CCD once, but if you wait a month and try to inference a recent compound you will run into the same issue again. I think it would make sense to add a more prominent pointer to python -m biotite.setup_ccd to our docs, and also reference it in the error message when a CCD entry can't be found.

- Setup a directory for OpenFold3 model parameters [default: `~/.openfold3`]
- Writes the path to `$OPENFOLD_CACHE/ckpt_path`
- Download the model parameters, if the parameter file does not already exist
- Download and setup the [Chemical Component Dictionary (CCD) with Biotite](https://www.biotite-python.org/latest/apidoc/biotite.structure.info.get_ccd.html)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the pointer of this link is super helpful, this is just a function to work with Biotite's CCD representation in code. Why not just add python -m biotite.setup_ccd if the intent is to show users how to download and set it up?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could surface this as work-around but that's really our abstraction leaking. One day we might remove biotite, and the users should be none the wiser about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to provide users who might be unfamiliar with the CCD some context about what it is since it is now included in our top level Installation documentation and how biotite is used in relation to the CCD.

We can certainly replace this link with a more general pointer to the CCD (perhaps this one: https://www.wwpdb.org/data/ccd) if that would be more helpful. This might make more sense if we now plan to provide the CCD file ourselves

@jandom
Copy link
Collaborator Author

jandom commented Jan 27, 2026

You do have a point about this working "once only" – only if the bcif has the stale hash, will this ever update, and after-wards it'll just be stuck. But I don't see a way around it other than by introducing state, which I really-really don't want to do.

@jandom
Copy link
Collaborator Author

jandom commented Jan 27, 2026

Sorry, i know this is becoming a PR turned RFC nightmare. Here is what i think we should do, because there is a real limitation to the approach in this PR, in that the update will happen once-only.

Context

  • biotite periodically will update their CCD database, this will happen without our control, and might tail the CCD
  • missing CCD codes cause inference issues, where users need to swap-up "missing" CCD codes for smiles (Pat's problem)
  • we're baked into a particular biotite version for the time being - see [BUG] Broken test_tokenizer.py tests for biotite>=1.3 #96
  • there is no obvious way to detect a stale user CCD relative to the PDB, because biotite performs a transform internally, we can't use the ETag

Options

  1. detect a stale CCD by hash (this PR), will correctly update a stale CCD inside biotite but works only once
  2. manage the CCD file ourselves, this avoids hard-coding anything in code but requires us to manage a file on S3 (not too bad?). We then can control the update logic via ETags and pull in an update version when needed.

@jandom jandom requested a review from ljarosch January 28, 2026 09:30
Copy link
Contributor

@jnwei jnwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice solution! Provides an easy way to provide an updated CCD bcif ourselves and inserting into biotite's default CCD. The only potential issue I can think of is if a structure gets added to the CCD that biotite cannot support, but that seems like a very narrow edge case.

Two requests:

  • Could we add a small script / workflow for how to update the CCD file to the bucket?
  • I added a suggestion to make the test for smiles / CCD comparision more robust.

- Setup a directory for OpenFold3 model parameters [default: `~/.openfold3`]
- Writes the path to `$OPENFOLD_CACHE/ckpt_path`
- Download the model parameters, if the parameter file does not already exist
- Download and setup the [Chemical Component Dictionary (CCD) with Biotite](https://www.biotite-python.org/latest/apidoc/biotite.structure.info.get_ccd.html)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to provide users who might be unfamiliar with the CCD some context about what it is since it is now included in our top level Installation documentation and how biotite is used in relation to the CCD.

We can certainly replace this link with a more general pointer to the CCD (perhaps this one: https://www.wwpdb.org/data/ccd) if that would be more helpful. This might make more sense if we now plan to provide the CCD file ourselves

struct_from_smiles = structure_with_ref_mol_from_smiles(smiles, chain_id="X")
struct_from_ccd = structure_with_ref_mol_from_ccd_code(ccd_code, chain_id="X")

assert len(struct_from_smiles.atom_array) == len(struct_from_ccd.atom_array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick suggestions to make this test more comprehensive:

  • Would it make sense to use the custom_assert_utils.assert_atom_array_equal utility here? Or would we expect some differences in atom ordering / bond ordering so that this assert would fail.

  • If assert_atom_array_equal is not a good fit here, perhaps we could add a test to check the number of carbons or some other aggregate property.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it'd be really nice indeed if we could use that – but it currently doesn't work, just tried empirically (atm atom names different between smiles and CCD, eg C01 vs C1) – maybe we can just match on element count?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah figures. I think the atom counter is good enough here for now, thanks for adding it.

@jandom jandom requested a review from jnwei January 29, 2026 14:11
@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Jan 29, 2026
Copy link
Contributor

@jnwei jnwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

struct_from_smiles = structure_with_ref_mol_from_smiles(smiles, chain_id="X")
struct_from_ccd = structure_with_ref_mol_from_ccd_code(ccd_code, chain_id="X")

assert len(struct_from_smiles.atom_array) == len(struct_from_ccd.atom_array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah figures. I think the atom counter is good enough here for now, thanks for adding it.

@jandom jandom merged commit 359e6fd into main Jan 30, 2026
2 checks passed
@jandom jandom deleted the jandom/2026-01/fix/issue-94-ccd-smiles-inconsistency branch January 30, 2026 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Inference fails with some CCD codes but works with SMILES

3 participants