CMDM-Lab/DeePFAS

DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis

This repository provides implementations and code examples for DeePFAS. DeePFAS projects raw MS/MS data into the latent space of chemical structures to identify per- and polyfluoroalkyl substances (PFAS). It infers structurally similar compounds by comparing each spectrum to multiple candidate molecules within this latent chemical space.
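To make the idea of comparing a spectrum to candidate molecules in a latent space concrete, here is a minimal sketch of ranking candidates by similarity to a query embedding. It is not the DeePFAS implementation: the cosine metric, the function names, and the three-dimensional toy vectors are all assumptions chosen for illustration, and the repository may use a different distance and much larger embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: in practice the query would come from a spectrum
# encoder and the candidates from a molecule database.
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "candidate_A": [1.0, 0.0, 0.0],
    "candidate_B": [0.0, 1.0, 0.0],
    "candidate_C": [0.7, 0.7, 0.0],
}

ranking = rank_candidates(query_embedding, candidate_embeddings, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The point of the sketch is that annotation becomes a nearest-neighbor search: the spectrum never has to match a reference spectrum directly, only land close to plausible structures.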

Getting started

Installation

Run the following commands to install DeePFAS:

git clone git@github.com:CMDM-Lab/DeePFAS.git

conda create -n DeePFAS python=3.10.0 --yes
conda activate DeePFAS
cd DeePFAS/DeePFAS

pip install -r requirements.txt

Quickstart

Download pretrained models

Option 1: Download the models automatically by running the following shell script.

cd DeePFAS/DeePFAS
./download_models.sh

Option 2: Manually download the model parameters from https://zenodo.org/records/15083140

  • Copy ae_best_model.pt into the ae/ae_saved directory.
  • Copy the DeePFAS model files into the DeePFAS/deepfas_saved directory.
  • The default model is deepfas_r2_over_best_model.pt. To use a different model, change the save_model_path parameter in DeePFAS/config/deepfas_config.json.
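The save_model_path setting mentioned above can also be changed from a script instead of by hand. The sketch below is a hypothetical helper, not part of the repository. Because the layout of deepfas_config.json is not described here, it assumes nothing about where the key sits and simply updates every occurrence of the key in the nested JSON. The replacement value is a placeholder; whether the real setting expects a file name or a full path is also an assumption to verify against the existing config.

```python
import json
from pathlib import Path


def set_config_value(node, key, value):
    """Set every occurrence of `key` in a nested JSON structure.

    Returns the number of entries that were updated.
    """
    changed = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                changed += 1
            else:
                changed += set_config_value(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += set_config_value(item, key, value)
    return changed


def update_config_file(config_path, key, value):
    """Rewrite a JSON config file with `key` set to `value`."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    if set_config_value(config, key, value) == 0:
        raise KeyError(f"{key!r} not found in {config_path}")
    path.write_text(json.dumps(config, indent=4))


# Example call (placeholder model path, run from the repository root):
# update_config_file("DeePFAS/config/deepfas_config.json",
#                    "save_model_path", "path/to/your_model.pt")
```

Raising an error when the key is missing avoids silently writing back an unchanged config.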

Download the mass spectra of a wastewater sample (WWTP3)

The wastewater sample was provided by Yi-Ju Chen. For details, see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry.

cd DeePFAS/DeePFAS
./download_wwtp3.sh

Download the mass spectra of PFAS standard mixtures (std_150)

The PFAS standard mixtures were provided by Yi-Ju Chen. For details, see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry.

cd DeePFAS/DeePFAS
./download_std_150.sh

Download the mass spectra of the NIST PFAS database in MGF (Mascot Generic Format)

The NIST PFAS Database (version 1.1) is a public database that can be downloaded from https://data.nist.gov/od/id/mds2-2905 in SQLite format. The MGF file was extracted and converted by Heng Wang.

cd DeePFAS/DeePFAS
./download_nist_pfas.sh

Download the PubChem molecule database with chemical embeddings generated by the autoencoder

A small molecule database, mol_dataset/mol_database.hdf5, includes approximately 50,000 molecules for rapid testing and PFAS annotation. A larger molecule database with chemical embeddings is available on Hugging Face.

PFAS annotation

Convert your MS/MS spectra to .mgf format, then run the script test_deepfas.sh to start PFAS annotation:

cd DeePFAS/DeePFAS
./test_deepfas.sh

Convert MS/MS spectra data to .mgf format

The .mgf file can be written with the Python package pyteomics:

from pyteomics import mgf
import numpy as np
data = []

intensity = [0.1, 1.0, 0.3, 0.4]
m_z = [11.1, 23.23, 111.44, 55.2]
spectrum = {
    'params': {
        # identifier of spectra in .mgf file (necessary)
        'title': 0,
        # ms level (necessary)
        'mslevel': 2,
        # precursor m/z (necessary)
        'pepmass': 562.957580566406,
        # adduct type (necessary)
        'precursor_type': '[M-H]-',
        # canonical SMILES (necessary only in eval mode, otherwise optional)
        'canonicalsmiles': 'O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F',
        # collision energy (necessary)
        # absolute collision energy (ACE) format: 'collision_energy': 12
        # normalized collision energy (NCE) format: 'collision_energy': 'NCE=37.5%'
        'collision_energy': 'NCE=37.5%'
    },
    # m/z array (necessary)
    'm/z array': np.array(m_z),
    # intensity array (necessary)
    'intensity array': np.array(intensity)
}


data.append(spectrum)
mgf.write(data, 'spectra.mgf', file_mode='w', write_charges=False)
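Before writing many spectra, it can help to check that each record carries the fields marked as necessary above. The following is a minimal sketch of such a check. The helper check_spectrum and its messages are hypothetical and not part of DeePFAS or pyteomics; the list of required keys is taken from the comments in the example above, and the array checks are general sanity checks rather than documented DeePFAS requirements.

```python
# Keys marked "necessary" in the example above.
REQUIRED_PARAMS = ("title", "mslevel", "pepmass", "precursor_type", "collision_energy")


def check_spectrum(spectrum, eval_mode=False):
    """Return a list of problems found in a pyteomics-style spectrum dict."""
    problems = []
    params = spectrum.get("params", {})
    required = REQUIRED_PARAMS + (("canonicalsmiles",) if eval_mode else ())
    for key in required:
        if key not in params:
            problems.append(f"missing param: {key}")

    mz = spectrum.get("m/z array")
    intensity = spectrum.get("intensity array")
    if mz is None or intensity is None:
        problems.append("missing m/z array or intensity array")
    elif len(mz) != len(intensity):
        problems.append("m/z and intensity arrays differ in length")
    else:
        if any(value <= 0 for value in mz):
            problems.append("m/z values must be positive")
        if any(value < 0 for value in intensity):
            problems.append("intensities must be non-negative")
    return problems


# Works with plain lists as well as NumPy arrays.
example = {
    "params": {
        "title": 0,
        "mslevel": 2,
        "pepmass": 562.957580566406,
        "precursor_type": "[M-H]-",
        "collision_energy": "NCE=37.5%",
    },
    "m/z array": [11.1, 23.23, 111.44, 55.2],
    "intensity array": [0.1, 1.0, 0.3, 0.4],
}
print(check_spectrum(example))
```

Calling check_spectrum(spectrum, eval_mode=True) additionally requires canonicalsmiles, matching the note that SMILES are only needed for evaluation.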

Generate customized molecule library

The molecule library and its chemical embeddings are stored in .hdf5 format to save storage space. Before running the command below, set dataset_path in ae/config/gen_latent_space_config.json to the path of your molecule file.

cd DeePFAS/DeePFAS
python3 ae/gen_latent_space.py \
 --deepfas_config_pth DeePFAS/config/deepfas_config.json \
 --ae_config_pth ae/config/gen_latent_space_config.json \
 --latent_space_out_pth your_file_name.hdf5 \
 --chunk_size 100000 \
 --compression_level 9
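If you generate several libraries, the command above can be assembled from Python. The sketch below only builds the argument list using the flags and default values shown above; the function name and the output file names are hypothetical. It does not change dataset_path, so the config still has to point at the right molecule file before each run.

```python
import shlex


def build_gen_latent_space_cmd(
    out_path,
    chunk_size=100000,
    compression_level=9,
    deepfas_config="DeePFAS/config/deepfas_config.json",
    ae_config="ae/config/gen_latent_space_config.json",
):
    """Build the argument list for ae/gen_latent_space.py (flags from the README)."""
    return [
        "python3", "ae/gen_latent_space.py",
        "--deepfas_config_pth", deepfas_config,
        "--ae_config_pth", ae_config,
        "--latent_space_out_pth", out_path,
        "--chunk_size", str(chunk_size),
        "--compression_level", str(compression_level),
    ]


# Hypothetical output names, printed as shell commands for inspection.
for out_name in ("library_a.hdf5", "library_b.hdf5"):
    cmd = build_gen_latent_space_cmd(out_name)
    print(shlex.join(cmd))
    # To execute from DeePFAS/DeePFAS:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a path contains spaces.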

A simple GUI

  1. Download Models
    Make sure you have downloaded the required models as described under Download pretrained models.

  2. Load MS2 Spectra
    Click Load MS2 Spectra (.mgf) to import the input MS2 spectra in .mgf format.

  3. Load Molecule Database
    Click Load Molecule Database (.hdf5) to load the reference molecular database.
    You can use the sample database provided at: mol_dataset/mol_database.hdf5

  4. Set Output Directory
    Click Load Output Results Dir to choose the folder where results will be saved.

cd DeePFAS/DeePFAS
# (Linux / MacOS)
./run_DeePFAS.sh
# (Windows)
run_DeePFAS.bat

About

Encoding MS/MS spectra to chemical representation for identification of PFAS
