-
Notifications
You must be signed in to change notification settings - Fork 84
fix: CCD/SMILES inconsistency #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
jandom
merged 11 commits into
main
from
jandom/2026-01/fix/issue-94-ccd-smiles-inconsistency
Jan 30, 2026
Merged
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
e645c37
fix: CCD/SMILES inconsistency
jandom cafc71b
create a setup_biotite_ccd and update docs
jandom bbbfe6a
Merge branch 'main' into jandom/2026-01/fix/issue-94-ccd-smiles-incon…
jandom c9fb62e
force the download (debug)
jandom 5a6393c
force the download (debug)
jandom 3bc3f92
use a remote S3 reference file to trigger download
jandom 04b1fe5
better logging message
jandom 85f6ec0
remove fixme, better imports
jandom 3a21a38
review: comments from Jennifer
jandom 2be65b8
review: script to update CCD on S3
jandom 30bca1b
factor-out duplication
jandom File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| # Copyright 2025 AlQuraishi Laboratory | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """S3 utilities for checksum comparison without downloading.""" | ||
|
|
||
| import base64 | ||
| from pathlib import Path | ||
|
|
||
| import boto3 | ||
| from awscrt import checksums | ||
|
|
||
|
|
||
| def get_s3_checksum(bucket: str, key: str) -> str | None: | ||
| """Get CRC64NVME checksum from S3 object metadata (HEAD request, no download).""" | ||
| s3 = boto3.client("s3") | ||
| response = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED") | ||
|
|
||
| if "ChecksumCRC64NVME" in response: | ||
| return response["ChecksumCRC64NVME"] | ||
| return None | ||
|
|
||
|
|
||
| def compute_local_crc64nvme_base64(filepath: Path) -> str: | ||
| """Compute CRC64NVME of local file, return as base64 (S3 format).""" | ||
| crc = 0 | ||
| with open(filepath, "rb") as f: | ||
| for chunk in iter(lambda: f.read(8192), b""): | ||
| crc = checksums.crc64nvme(chunk, crc) | ||
| # Convert to bytes (big-endian) and base64 encode | ||
| crc_bytes = crc.to_bytes(8, byteorder="big") | ||
| return base64.b64encode(crc_bytes).decode() | ||
|
|
||
|
|
||
| def s3_file_matches_local(local_path: Path, bucket: str, key: str) -> bool: | ||
| """ | ||
| Compare local file with S3 object using CRC64NVME checksum. | ||
|
|
||
| Returns True if files match, False if they differ or comparison fails. | ||
| """ | ||
| if not local_path.exists(): | ||
| return False | ||
|
|
||
| s3_checksum = get_s3_checksum(bucket, key) | ||
| if s3_checksum is None: | ||
| # S3 object has no checksum, cannot compare | ||
| return False | ||
|
|
||
| local_checksum = compute_local_crc64nvme_base64(local_path) | ||
| return local_checksum == s3_checksum |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| #!/usr/bin/env python3 | ||
| # Copyright 2025 AlQuraishi Laboratory | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """ | ||
| Script to update the Chemical Component Dictionary (CCD). | ||
|
|
||
| Downloads the latest CCD from WWPDB, processes it to BinaryCIF format, | ||
| and uploads to S3. | ||
| """ | ||
|
|
||
| import argparse | ||
| import logging | ||
| from pathlib import Path | ||
|
|
||
| import biotite.setup_ccd | ||
| import boto3 | ||
|
|
||
| from openfold3.core.utils.s3 import compute_local_crc64nvme_base64 | ||
| from openfold3.setup_openfold import S3_BUCKET, S3_KEY | ||
|
|
||
| logging.basicConfig(level=logging.INFO, format="%(message)s") | ||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def upload_to_s3(local_path: Path, bucket: str, key: str) -> None: | ||
| """Upload file to S3 with CRC64NVME checksum.""" | ||
| logger.info(f"Uploading {local_path} to s3://{bucket}/{key}...") | ||
|
|
||
| checksum = compute_local_crc64nvme_base64(local_path) | ||
| logger.info(f"Local file checksum (CRC64NVME): {checksum}") | ||
|
|
||
| s3 = boto3.client("s3") | ||
| with open(local_path, "rb") as f: | ||
| s3.put_object( | ||
| Bucket=bucket, | ||
| Key=key, | ||
| Body=f, | ||
| ChecksumAlgorithm="CRC64NVME", | ||
| ChecksumCRC64NVME=checksum, | ||
| ) | ||
|
|
||
| logger.info(f"Upload complete: s3://{bucket}/{key}") | ||
|
|
||
|
|
||
| def main() -> None: | ||
| parser = argparse.ArgumentParser( | ||
| description="Update CCD: download from WWPDB, process, and upload to S3" | ||
| ) | ||
| parser.add_argument( | ||
| "--bucket", | ||
| type=str, | ||
| default=S3_BUCKET, | ||
| help=f"S3 bucket name (default: {S3_BUCKET})", | ||
| ) | ||
| parser.add_argument( | ||
| "--key", | ||
| type=str, | ||
| default=S3_KEY, | ||
| help=f"S3 object key (default: {S3_KEY})", | ||
| ) | ||
| parser.add_argument( | ||
| "--skip-upload", | ||
| action="store_true", | ||
| help="Skip S3 upload (useful for local testing)", | ||
| ) | ||
| args = parser.parse_args() | ||
|
|
||
| # Download and process CCD using biotite's setup_ccd | ||
| logger.info("Downloading and processing CCD from WWPDB...") | ||
| biotite.setup_ccd.main() | ||
| output_path = biotite.setup_ccd.OUTPUT_CCD | ||
| logger.info(f"CCD processed and saved to: {output_path}") | ||
|
|
||
| if not args.skip_upload: | ||
| upload_to_s3(output_path, args.bucket, args.key) | ||
| else: | ||
| logger.info("Skipping S3 upload (--skip-upload flag set)") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
50 changes: 50 additions & 0 deletions
50
openfold3/tests/core/data/primitives/structure/test_query.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| from collections import Counter | ||
|
|
||
| import pytest | ||
|
|
||
| from openfold3.core.data.primitives.structure.query import ( | ||
| structure_with_ref_mol_from_ccd_code, | ||
| structure_with_ref_mol_from_smiles, | ||
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "smiles, ccd_code", | ||
| [ | ||
| # Simple test cases | ||
| ("CCO", "EOH"), # Ethanol | ||
| # Pat Walter's CYP substrates | ||
| ("CC(C)(C)C(=O)Nc1cc(ccc1n2ccnc2)C(F)(F)F", "A1ASV"), # cyp3a4_9bv5 | ||
| ("Cc1nc(cs1)c2ccc(cc2)n3c(cnn3)c4ccc(cc4)OC", "A1ASU"), # cyp3a4_9bv6 | ||
| ("c1ccc(c(c1)CCC(=O)NC[C@@H]2Cc3cccc(c3O2)c4cccnc4)Cl", "A1AST"), # cyp3a4_9bv7 | ||
| ("c1cc(c(cc1C(F)(F)F)NC(=O)C2CCC2)n3ccnc3", "A1ASS"), # cyp3a4_9bv8 | ||
| ("c1ccc(c(c1)NC(=O)Nc2cc(ccc2n3ccnc3)C(F)(F)F)Cl", "A1ASR"), # cyp3a4_9bv9 | ||
| ("c1ccc(c(c1)CCCl)NC(=O)Nc2cc(ccc2n3ccnc3)C(F)(F)F", "A1ASQ"), # cyp3a4_9bva | ||
| ("c1ccc(c(c1)CC(=O)Nc2cc(ccc2n3ccnc3)C(F)(F)F)Cl", "A1ASP"), # cyp3a4_9bvb | ||
| ( | ||
| "c1ccc(c(c1)N(Cc2cccc(c2)O)C(=O)Nc3cc(ccc3n4ccnc4)C(F)(F)F)Cl", | ||
| "A1ASO", | ||
| ), # cyp3a4_9bvc | ||
| ( | ||
| "c1ccc(c(c1)NC(=O)N(Cc2cccc(c2)O)c3cc(ccc3n4ccnc4)C(F)(F)F)Cl", | ||
| "A1BNX", | ||
| ), # cyp3a4_9ms1 | ||
| ( | ||
| "c1ccc(cc1)C(c2ccccc2)([C@@H]3CCN(C3)CCc4ccc5c(c4)CCO5)C(=O)N", | ||
| "A1CIW", | ||
| ), # cyp3a4_9plk | ||
| ], | ||
| ) | ||
| def test_consistent_structure_from_smiles_and_ccd_code(smiles, ccd_code): | ||
| struct_from_smiles = structure_with_ref_mol_from_smiles(smiles, chain_id="X") | ||
| struct_from_ccd = structure_with_ref_mol_from_ccd_code(ccd_code, chain_id="X") | ||
|
|
||
| # Ideally, one day, we'll be able to do just this | ||
| # from openfold3.tests import custom_assert_utils | ||
| # custom_assert_utils.assert_atomarray_equal(struct_from_smiles.atom_array, struct_from_ccd.atom_array) | ||
|
|
||
| assert len(struct_from_smiles.atom_array) == len(struct_from_ccd.atom_array) | ||
|
|
||
| assert Counter(struct_from_smiles.atom_array.element) == Counter( | ||
| struct_from_ccd.atom_array.element | ||
| ) | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick suggestions to make this test more comprehensive:
Would it make sense to use the
custom_assert_utils.assert_atom_array_equalutility here? Or would we expect some differences in atom ordering / bond ordering so that this assert would fail.If
assert_atom_array_equalis not a good fit here, perhaps we could add a test to check the number of carbons or some other aggregate property.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it'd be really nice indeed if we could use that – but it currently doesn't work, just tried empirically (atm atom names different between smiles and CCD, eg C01 vs C1) – maybe we can just match on element count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah figures. I think the atom counter is good enough here for now, thanks for adding it.