DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis
This repository provides implementations and code examples for DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis. DeePFAS projects raw MS/MS data into the latent space of chemical structures for PFAS (Per- and Polyfluoroalkyl Substances) identification, facilitating the inference of structurally similar compounds by comparing spectra to multiple candidate molecules within this latent chemical space.
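The idea of comparing a spectrum against candidate molecules in a latent space can be illustrated with a toy ranking routine. This is only a sketch under stated assumptions: the similarity metric (cosine similarity), the function names `cosine_similarity` and `rank_candidates`, and the 3-dimensional vectors are illustrative choices, not taken from the DeePFAS code base, which may use a different metric and much larger embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidates, top_k=3):
    """Return the top_k (identifier, score) pairs, most similar first.

    candidates maps a molecule identifier to its latent vector.
    """
    scored = [
        (name, cosine_similarity(query_embedding, vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings with made-up numbers (assumption, for illustration only).
candidates = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
}
query = [0.8, 0.2, 0.0]  # stands in for a spectrum projected into the latent space
ranking = rank_candidates(query, candidates, top_k=2)
```

In this toy setting, `ranking` lists the candidates whose latent vectors point in the direction closest to the query, which is the intuition behind retrieving structurally similar compounds.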
Run the following commands to install DeePFAS:
git clone git@github.com:CMDM-Lab/DeePFAS.git
conda create -n DeePFAS python=3.10.0 --yes
conda activate DeePFAS
cd DeePFAS/DeePFAS
pip install -r requirements.txt
Option 1: Download the models automatically by running the following shell script.
cd DeePFAS/DeePFAS
./download_models.sh
Option 2: Manually download the model parameters from https://zenodo.org/records/15083140
- Copy ae_best_model.pt into the ae/ae_saved directory.
- Copy the DeePFAS model files into the DeePFAS/deepfas_saved directory.
- The default model is set to deepfas_r2_over_best_model.pt. You can change the model by modifying the save_model_path parameter in DeePFAS/config/deepfas_config.json.
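Changing the save_model_path parameter can also be scripted rather than edited by hand. The helper below is a minimal sketch: the function name `set_model_path` is hypothetical, and it assumes that save_model_path is a top-level key of the JSON config, which should be checked against the actual file.

```python
import json
from pathlib import Path


def set_model_path(config_path, model_path, key="save_model_path"):
    """Point `key` in a JSON config at `model_path`; return the previous value.

    Assumes `key` sits at the top level of the JSON document (unverified
    assumption about deepfas_config.json).
    """
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    if key not in config:
        raise KeyError(f"{key!r} not found at the top level of {config_file}")
    previous = config[key]
    config[key] = model_path
    config_file.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return previous


# Example call (the model path is a placeholder, not a real file name):
# set_model_path("DeePFAS/config/deepfas_config.json", "path/to/your_model.pt")
```

Returning the previous value makes it easy to restore the default model afterwards.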
The wastewater sample was provided by Yi-Ju Chen. For details, see the article "Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry".
cd DeePFAS/DeePFAS
./download_wwtp3.sh
The PFAS standard mixtures were provided by Yi-Ju Chen. For details, see the article "Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry".
cd DeePFAS/DeePFAS
./download_std_150.sh
The NIST PFAS Database (version 1.1) is a public database that can be downloaded from https://data.nist.gov/od/id/mds2-2905 in SQLite format. The MGF-format file was extracted and converted by Heng Wang.
cd DeePFAS/DeePFAS
./download_nist_pfas.sh
A small molecule database, mol_dataset/mol_database.hdf5, contains approximately 50,000 molecules
for rapid testing and PFAS annotation. A larger molecule database with chemical embeddings
is available on Hugging Face.
To get started with PFAS annotation, convert your MS/MS spectra to .mgf format and run the script test_deepfas.sh:
cd DeePFAS/DeePFAS
./test_deepfas.sh
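Before running the annotation, it can help to confirm that every spectrum in the .mgf file carries the fields this README marks as necessary (title, mslevel, pepmass, precursor_type, collision_energy) and that collision_energy uses one of the two documented formats. The checker below is a sketch using only the standard library. The function names are hypothetical, and it assumes the usual MGF layout (BEGIN IONS / KEY=value lines / peak lines / END IONS) with keys compared case-insensitively.

```python
import re

REQUIRED_KEYS = ("title", "mslevel", "pepmass", "precursor_type", "collision_energy")


def parse_collision_energy(value):
    """Return ('NCE', 37.5) for 'NCE=37.5%' or ('ACE', 12.0) for a plain number."""
    text = str(value).strip()
    match = re.fullmatch(r"NCE\s*=\s*([0-9]*\.?[0-9]+)\s*%", text, flags=re.IGNORECASE)
    if match:
        return "NCE", float(match.group(1))
    return "ACE", float(text)  # raises ValueError if the value is not numeric


def check_mgf_text(text):
    """Return a list of (spectrum_index, problem) tuples found in MGF text."""
    problems = []
    params, n_peaks, index, inside = {}, 0, -1, False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            params, n_peaks, inside = {}, 0, True
            index += 1
        elif line == "END IONS" and inside:
            for key in REQUIRED_KEYS:
                if key not in params:
                    problems.append((index, f"missing {key}"))
            if "collision_energy" in params:
                try:
                    parse_collision_energy(params["collision_energy"])
                except ValueError:
                    problems.append((index, "unparseable collision_energy"))
            if n_peaks == 0:
                problems.append((index, "no peaks"))
            inside = False
        elif inside and line:
            if "=" in line:
                # Split on the first '=' only, so 'NCE=37.5%' stays intact.
                key, _, value = line.partition("=")
                params[key.strip().lower()] = value.strip()
            else:
                n_peaks += 1  # treat any other non-empty line as a peak
    return problems


sample = """BEGIN IONS
TITLE=0
MSLEVEL=2
PEPMASS=562.957580566406
PRECURSOR_TYPE=[M-H]-
COLLISION_ENERGY=NCE=37.5%
11.1 0.1
23.23 1.0
END IONS
BEGIN IONS
TITLE=1
PEPMASS=400.0
COLLISION_ENERGY=high
END IONS
"""
problems = check_mgf_text(sample)
```

For a real file, the text can be read with `open("spectra.mgf", encoding="utf-8").read()` and passed to `check_mgf_text`; an empty list means no problems were found by these particular checks.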
The .mgf file can be written with the Python package pyteomics:
from pyteomics import mgf
import numpy as np
data = []
intensity = [0.1, 1.0, 0.3, 0.4]
m_z = [11.1, 23.23, 111.44, 55.2]
spectrum = {
'params': {
# identifier of spectra in .mgf file (necessary)
'title': 0,
# ms level (necessary)
'mslevel': 2,
# precursor m/z (necessary)
'pepmass': 562.957580566406,
# adduct type (necessary)
'precursor_type': '[M-H]-',
# canonical SMILES (necessary only in eval mode, otherwise unnecessary)
'canonicalsmiles': 'O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F',
# collision energy (necessary)
# absolute collision energy (ACE) format: 'collision_energy': 12
# normalized collision energy (NCE) format: 'collision_energy': 'NCE=37.5%'
'collision_energy': 'NCE=37.5%'
},
# m/z array (necessary)
'm/z array': np.array(m_z),
# intensity array (necessary)
'intensity array': np.array(intensity)
}
data.append(spectrum)
mgf.write(data, 'spectra.mgf', file_mode='w', write_charges=False)
The molecule library and its chemical embeddings are stored in .hdf5 format to save storage space. Set dataset_path in gen_latent_space_config.json to the path of your molecule file, then run:
cd DeePFAS/DeePFAS
python3 ae/gen_latent_space.py \
--deepfas_config_pth DeePFAS/config/deepfas_config.json \
--ae_config_pth ae/config/gen_latent_space_config.json \
--latent_space_out_pth your_file_name.hdf5 \
--chunk_size 100000 \
--compression_level 9
- Download Models: Make sure you have downloaded the required models mentioned in the first section.
- Load MS2 Spectra: Click "Load MS2 Spectra (.mgf)" to import the input MS2 spectra in .mgf format.
- Load Molecule Database: Click "Load Molecule Database (.hdf5)" to load the reference molecular database. You can use the sample database provided at mol_dataset/mol_database.hdf5.
- Set Output Directory: Click "Load Output Results Dir" to choose the folder where results will be saved.
cd DeePFAS/DeePFAS
# (Linux / MacOS)
./run_DeePFAS.sh
# (Windows)
run_DeePFAS.bat
