Description
Describe the bug
This bug, reported by Pat Walters, is not really a surprise. It relates to how we inject a custom CCD .bcif file into biotite: the injection doesn't happen at inference time, and there is currently no way to configure it.
To Reproduce
Here is the config that breaks (chain C is specified by CCD code):
{
"queries": {
"cyp3a4_9bv5": {
"chains": [
{
"sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
"molecule_type": "protein",
"chain_ids": [
"A"
]
},
{
"smiles": "C=CC1=C(C)/C2=C/c3c(C)c(CCC(=O)O)c4[n]3[Fe@SP2]35<-[N]2=C1/C=c1/c(C)c(C=C)/c([n]13)=C/C1=[N]->5/C(=C\\4)C(CCC(=O)O)=C1C",
"molecule_type": "ligand",
"chain_ids": [
"B"
]
},
{
"ccd_codes": "A1ASV",
"molecule_type": "ligand",
"chain_ids": [
"C"
]
}
]
}
}
}
Here is the config that works (chain C is specified by SMILES instead):
{
"queries": {
"cyp3a4_9bv5": {
"chains": [
{
"sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
"molecule_type": "protein",
"chain_ids": [
"A"
]
},
{
"smiles": "C=CC1=C(C)/C2=C/c3c(C)c(CCC(=O)O)c4[n]3[Fe@SP2]35<-[N]2=C1/C=c1/c(C)c(C=C)/c([n]13)=C/C1=[N]->5/C(=C\\4)C(CCC(=O)O)=C1C",
"molecule_type": "ligand",
"chain_ids": [
"B"
]
},
{
"smiles": "CC(C)(C)C(=O)Nc1cc(ccc1n2ccnc2)C(F)(F)F",
"molecule_type": "ligand",
"chain_ids": [
"C"
]
}
]
}
}
}
Expected behavior
Ideally, a ligand should behave the same whether it is given as a CCD code or as SMILES (Boltz has tens of issues around inconsistencies of this kind).
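Until the two input paths behave consistently, the difference between the two configs above can be applied mechanically. The following is only a sketch of that workaround, not part of OpenFold3: `replace_ccd_with_smiles` and `smiles_by_ccd` are hypothetical names, and the SMILES string is the one taken from the working config.

```python
import copy


def replace_ccd_with_smiles(config: dict, smiles_by_ccd: dict) -> dict:
    """Return a copy of a query config in which ligand chains given by CCD code
    are rewritten as SMILES chains, for every code present in smiles_by_ccd."""
    patched = copy.deepcopy(config)  # leave the caller's config untouched
    for query in patched.get("queries", {}).values():
        for chain in query.get("chains", []):
            code = chain.get("ccd_codes")
            if chain.get("molecule_type") != "ligand" or code is None:
                continue
            # The configs above store a single code as a string; also accept
            # a one-element list (assumption about alternative input shapes).
            if isinstance(code, list):
                if len(code) != 1:
                    continue
                code = code[0]
            if code in smiles_by_ccd:
                del chain["ccd_codes"]
                chain["smiles"] = smiles_by_ccd[code]
    return patched


# SMILES for A1ASV as used in the working config above
SMILES_BY_CCD = {"A1ASV": "CC(C)(C)C(=O)Nc1cc(ccc1n2ccnc2)C(F)(F)F"}
```

Loading the breaking config with `json.load`, passing it through `replace_ccd_with_smiles(config, SMILES_BY_CCD)` and writing the result back out should yield the equivalent of the working config.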
Configuration (please complete the following information):
- DGX
Additional context
I saw this snippet in inference.py and thought we could provide the CCD file at inference time:
# Parse CCD
if dataset_config.ccd_file_path is not None:
    logger.debug("Parsing CCD file.")
    self.ccd = pdbx.CIFFile.read(dataset_config.ccd_file_path)
else:
    self.ccd = BiotiteCCDWrapper()
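We can, but the CCD loaded here is not what ends up being used when structure_with_ref_mol_from_ccd_code is called. As the stack trace shows, that call goes through biotite's own internal CCD (get_ccd() in biotite.structure.info), which self.ccd does not affect.
Instead, the hack that worked was: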
# Parse CCD
if dataset_config.ccd_file_path is not None:
    logger.debug("Parsing CCD file.")
    self.ccd = pdbx.CIFFile.read(dataset_config.ccd_file_path)
else:
    self.ccd = BiotiteCCDWrapper()
# Hack: point biotite's internal CCD at a locally prepared BinaryCIF file
import biotite.structure as struc
struc.info.set_ccd_path("/home/jandom/workspace/openfold-3/components.bcif")
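A less hard-coded version of the hack above might look like the sketch below. It assumes a hypothetical, optional path to the prepared .bcif file (for example from a new config field, which does not exist today) and only calls biotite's `set_ccd_path` when that path is set. The `set_ccd_path` parameter of the helper is also hypothetical and exists only so the function can be exercised without biotite installed.

```python
from pathlib import Path


def configure_biotite_ccd(ccd_bcif_path, set_ccd_path=None) -> bool:
    """Point biotite's internal CCD at a custom BinaryCIF file, if one is given.

    Returns True when the CCD path was overridden and False when biotite's
    bundled CCD is left in place.
    """
    if ccd_bcif_path is None:
        return False  # nothing configured: keep biotite's default CCD

    path = Path(ccd_bcif_path)
    if not path.is_file():
        raise FileNotFoundError(f"Custom CCD file not found: {path}")

    if set_ccd_path is None:
        # Imported lazily so this module can be loaded without biotite
        import biotite.structure.info as info

        set_ccd_path = info.set_ccd_path

    set_ccd_path(path)
    return True
```

Whether a single override early in dataset construction is enough depends on when OpenFold3's own caches (such as the one behind `get_residue_cached` in the trace) are first populated, so the call would presumably need to happen before any CCD lookup.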
This is not ideal, because it requires downloading the CCD and converting it to a .bcif file first:
wget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
gunzip components.cif.gz
python scripts/data_preprocessing/preprocess_ccd_biotite.py components.cif components.bcif
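Before running the conversion, it can be worth confirming that the downloaded file actually contains the code in question. The sketch below relies on each component in components.cif starting with its own `data_<CODE>` block header. `ccd_codes_in_cif` is a hypothetical helper, not something provided by OpenFold3 or biotite.

```python
import gzip
from pathlib import Path


def ccd_codes_in_cif(cif_path, wanted):
    """Return the subset of `wanted` CCD codes that have a data block in a
    text-format components.cif (optionally still gzip-compressed)."""
    wanted_upper = {code.upper() for code in wanted}
    found = set()
    path = Path(cif_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Each component starts with a header line such as "data_A1ASV"
            if line.startswith("data_"):
                code = line[len("data_"):].strip().upper()
                if code in wanted_upper:
                    found.add(code)
                    if found == wanted_upper:
                        break  # all requested codes seen, stop scanning
    return found
```

For example, `ccd_codes_in_cif("components.cif", ["A1ASV"])` returning an empty set would indicate that the downloaded dictionary itself lacks the ligand, rather than the problem being in the .bcif conversion or the injection into biotite.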
Stack trace
run_openfold predict --query_json=research/2024-q1/pat-walters/data/inputs/ccd_codes/cyp3a4_9bv5.json --output_dir=./output/pat/cyp3a4_9bv5/
/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)
warnings.warn(
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
/home/jandom/workspace/openfold-3/openfold3/core/data/tools/colabfold_msa_server.py:331: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
tar_gz.extractall(path)
No complexes found for paired MSA generation. Skipping...
Preprocessing templates...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=267696) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
WARNING:openfold3.core.data.framework.single_datasets.inference:----------------------------------------
Failed to process cyp3a4_9bv5 with preferredException type: KeyError
Traceback: Traceback (most recent call last):
File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/info/atoms.py", line 82, in residue
component = get_component(
get_ccd(),
res_name=res_name,
allow_missing_coord=allow_missing_coord,
)
File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/io/pdbx/convert.py", line 1330, in get_component
raise KeyError(
...<2 lines>...
)
KeyError: "No rows with residue name 'A1ASV' found in 'chem_comp_atom' category"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 338, in __getitem__
features = self.create_all_features(query)
File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 295, in create_all_features
structure_objs = self.get_structure_with_ref_mols(
query=query,
)
File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 191, in get_structure_with_ref_mols
atom_array, processed_reference_molecules = structure_with_ref_mols_from_query(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
query=query,
^^^^^^^^^^^^
)
^
File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 603, in structure_with_ref_mols_from_query
structure_with_ref_mol_from_ccd_code(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
ccd_code=chain.ccd_codes[0],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chain_id=chain_id,
^^^^^^^^^^^^^^^^^^
)
^
File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 465, in structure_with_ref_mol_from_ccd_code
atom_array = atom_array_from_ccd_code(
ccd_code,
...<2 lines>...
molecule_type=MoleculeType.LIGAND,
)
File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 121, in atom_array_from_ccd_code
res_array = get_residue_cached(ccd_code)
File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/info/atoms.py", line 88, in residue
raise KeyError(f"No atom information found for residue '{res_name}' in CCD")
KeyError: "No atom information found for residue 'A1ASV' in CCD"