[BUG] Inference fails with some CCD codes but works with SMILES #94

@jandom

Describe the bug

This bug was reported by Pat Walters and is not really a surprise. It relates to how we inject a custom CCD .bcif file into biotite: that injection does not happen at inference time, and it cannot currently be configured.

To Reproduce

Here is the config that breaks (chain C is given as a CCD code):

{
  "queries": {
    "cyp3a4_9bv5": {
      "chains": [
        {
          "sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
          "molecule_type": "protein",
          "chain_ids": [
            "A"
          ]
        },
        {
          "smiles": "C=CC1=C(C)/C2=C/c3c(C)c(CCC(=O)O)c4[n]3[Fe@SP2]35<-[N]2=C1/C=c1/c(C)c(C=C)/c([n]13)=C/C1=[N]->5/C(=C\\4)C(CCC(=O)O)=C1C",
          "molecule_type": "ligand",
          "chain_ids": [
            "B"
          ]
        },
        {
          "ccd_codes": "A1ASV",
          "molecule_type": "ligand",
          "chain_ids": [
            "C"
          ]
        }
      ]
    }
  }
}

Here is the config that works (chain C is given as SMILES):

{
  "queries": {
    "cyp3a4_9bv5": {
      "chains": [
        {
          "sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
          "molecule_type": "protein",
          "chain_ids": [
            "A"
          ]
        },
        {
          "smiles": "C=CC1=C(C)/C2=C/c3c(C)c(CCC(=O)O)c4[n]3[Fe@SP2]35<-[N]2=C1/C=c1/c(C)c(C=C)/c([n]13)=C/C1=[N]->5/C(=C\\4)C(CCC(=O)O)=C1C",
          "molecule_type": "ligand",
          "chain_ids": [
            "B"
          ]
        },
        {
          "smiles": "CC(C)(C)C(=O)Nc1cc(ccc1n2ccnc2)C(F)(F)F",
          "molecule_type": "ligand",
          "chain_ids": [
            "C"
          ]
        }
      ]
    }
  }
}
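Until CCD and SMILES inputs behave consistently, one stopgap is to rewrite the failing chain as SMILES before running prediction. This is a minimal sketch, assuming the query layout shown above. `replace_ccd_with_smiles` and the `smiles_by_ccd` mapping are hypothetical helpers, not part of OpenFold, and the SMILES for each code has to be supplied by the user:

```python
import copy
import json


def replace_ccd_with_smiles(query_set, smiles_by_ccd):
    """Return a copy of the query set with mapped CCD ligand chains rewritten to SMILES."""
    patched = copy.deepcopy(query_set)
    for query in patched["queries"].values():
        for chain in query["chains"]:
            codes = chain.get("ccd_codes")
            # Accept either a single code string or a one-element list.
            if isinstance(codes, list):
                code = codes[0] if len(codes) == 1 else None
            else:
                code = codes
            # Only touch ligand chains whose code has a user-supplied SMILES.
            if chain.get("molecule_type") == "ligand" and code in smiles_by_ccd:
                del chain["ccd_codes"]
                chain["smiles"] = smiles_by_ccd[code]
    return patched


# Chain C from the failing config, with the SMILES taken from the working config.
example = {
    "queries": {
        "cyp3a4_9bv5": {
            "chains": [
                {"ccd_codes": "A1ASV", "molecule_type": "ligand", "chain_ids": ["C"]}
            ]
        }
    }
}
patched = replace_ccd_with_smiles(
    example, {"A1ASV": "CC(C)(C)C(=O)Nc1cc(ccc1n2ccnc2)C(F)(F)F"}
)
print(json.dumps(patched, indent=2))
```

The input is deep-copied, so the original query stays available for comparison once the CCD path is fixed.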

Expected behavior

Ideally, a ligand should behave the same whether it is given as a CCD code or as SMILES (Boltz has tens of issues around inconsistencies of this kind).

Configuration (please complete the following information):

  • DGX

Additional context

I saw this snippet in inference.py and thought we could provide the CCD file at inference time:

        # Parse CCD
        if dataset_config.ccd_file_path is not None:
            logger.debug("Parsing CCD file.")
            self.ccd = pdbx.CIFFile.read(dataset_config.ccd_file_path)
        else:
            self.ccd = BiotiteCCDWrapper()

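But while we can set ccd_file_path, self.ccd is not what ends up being used when we call structure_with_ref_mol_from_ccd_code. As the stack trace shows, that call resolves the code through biotite's own CCD lookup (biotite/structure/info/atoms.py), which never sees self.ccd.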
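The mismatch can be illustrated with a toy model. This is not OpenFold or biotite code, and every name in it is hypothetical. It only shows the shape of the problem: a custom CCD stored on the dataset instance has no effect when the lookup reads module-level state instead.

```python
# Toy model only: hypothetical names, not OpenFold or biotite code.
_GLOBAL_CCD = {"HEM": "heme atoms"}  # stands in for the library's bundled CCD


def lookup_residue(code):
    """Stands in for a library lookup that only consults module-level state."""
    try:
        return _GLOBAL_CCD[code]
    except KeyError:
        raise KeyError(
            f"No atom information found for residue '{code}' in CCD"
        ) from None


class ToyInferenceDataset:
    def __init__(self, custom_ccd=None):
        # Stored on the instance, like self.ccd in the snippet above...
        self.ccd = custom_ccd if custom_ccd is not None else dict(_GLOBAL_CCD)

    def build_ligand(self, code):
        # ...but the lookup never receives self.ccd, so custom entries are ignored.
        return lookup_residue(code)


dataset = ToyInferenceDataset(custom_ccd={"A1ASV": "new ligand atoms"})
try:
    dataset.build_ligand("A1ASV")
    outcome = "resolved"
except KeyError as err:
    outcome = f"KeyError: {err}"
print(outcome)
```

In this toy, the code is present in `dataset.ccd` yet the lookup still fails, which mirrors the behaviour described above.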

Instead, the hack that worked was to also point biotite at a locally prepared .bcif file:

        # Parse CCD
        if dataset_config.ccd_file_path is not None:
            logger.debug("Parsing CCD file.")
            self.ccd = pdbx.CIFFile.read(dataset_config.ccd_file_path)
        else:
            self.ccd = BiotiteCCDWrapper()

        # Hack: point biotite's own CCD lookup at a locally prepared .bcif file
        import biotite.structure.info as strucinfo
        strucinfo.set_ccd_path("/home/jandom/workspace/openfold-3/components.bcif")

This is not ideal, because it hard-codes a local path and requires downloading and preparing the .bcif file first, like so:

wget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
gunzip components.cif.gz
python scripts/data_preprocessing/preprocess_ccd_biotite.py components.cif components.bcif
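One way the hard-coded path could become configurable is a small hook that applies the override only when a path is set. This is a sketch under assumptions, not the project's implementation: `apply_ccd_override` is a hypothetical helper, and the setter is passed in as an argument so the logic can be exercised without biotite installed.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def apply_ccd_override(
    ccd_path: Optional[str], set_ccd_path: Callable[[str], None]
) -> bool:
    """Point the CCD lookup at a user-supplied .bcif file, if one is configured.

    Returns True when the override was applied, False when no path was given.
    """
    if ccd_path is None:
        return False  # keep the library's bundled CCD
    path = Path(ccd_path).expanduser()
    if not path.is_file():
        # Fail early with a clear message instead of a KeyError deep in the pipeline.
        raise FileNotFoundError(f"CCD file not found: {path}")
    set_ccd_path(str(path))
    return True


# Demo with a recording stand-in for the real setter.
calls = []
with tempfile.NamedTemporaryFile(suffix=".bcif") as tmp:
    applied = apply_ccd_override(tmp.name, calls.append)
skipped = apply_ccd_override(None, calls.append)
print(applied, skipped, len(calls))
```

In the real pipeline the setter would presumably be `biotite.structure.info.set_ccd_path`, as in the hack above, and the path would come from a new config field. Whether the existing `ccd_file_path` field could be reused is an open question, since the snippet above reads it as a CIF file rather than a .bcif.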

Stack trace

run_openfold predict     --query_json=research/2024-q1/pat-walters/data/inputs/ccd_codes/cyp3a4_9bv5.json     --output_dir=./output/pat/cyp3a4_9bv5/
/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
/home/jandom/workspace/openfold-3/openfold3/core/data/tools/colabfold_msa_server.py:331: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
  tar_gz.extractall(path)
No complexes found for paired MSA generation. Skipping...
Preprocessing templates...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=267696) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
WARNING:openfold3.core.data.framework.single_datasets.inference:----------------------------------------
Failed to process cyp3a4_9bv5 with preferredException type: KeyError
Traceback: Traceback (most recent call last):
  File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/info/atoms.py", line 82, in residue
    component = get_component(
        get_ccd(),
        res_name=res_name,
        allow_missing_coord=allow_missing_coord,
    )
  File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/io/pdbx/convert.py", line 1330, in get_component
    raise KeyError(
    ...<2 lines>...
    )
KeyError: "No rows with residue name 'A1ASV' found in 'chem_comp_atom' category"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 338, in __getitem__
    features = self.create_all_features(query)
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 295, in create_all_features
    structure_objs = self.get_structure_with_ref_mols(
        query=query,
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 191, in get_structure_with_ref_mols
    atom_array, processed_reference_molecules = structure_with_ref_mols_from_query(
                                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        query=query,
        ^^^^^^^^^^^^
    )
    ^
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 603, in structure_with_ref_mols_from_query
    structure_with_ref_mol_from_ccd_code(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        ccd_code=chain.ccd_codes[0],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        chain_id=chain_id,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 465, in structure_with_ref_mol_from_ccd_code
    atom_array = atom_array_from_ccd_code(
        ccd_code,
    ...<2 lines>...
        molecule_type=MoleculeType.LIGAND,
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/primitives/structure/query.py", line 121, in atom_array_from_ccd_code
    res_array = get_residue_cached(ccd_code)
  File "/home/jandom/micromamba/envs/openfold3-env/lib/python3.13/site-packages/biotite/structure/info/atoms.py", line 88, in residue
    raise KeyError(f"No atom information found for residue '{res_name}' in CCD")
KeyError: "No atom information found for residue 'A1ASV' in CCD"

Labels: bug, inference