P2DFlow

P2DFlow is a protein ensemble generative model with SE(3) flow matching, built on ESMFold. The ensembles it generates can aid in understanding protein functions across various scenarios.

Technical details and evaluation results are provided in our paper:

P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching

The code has also been uploaded to Hugging Face.

Installation

In an environment with CUDA 11.7, run:

conda env create -f environment.yml

To activate the environment, run:

conda activate P2DFlow

Prepare Dataset

(Tip: if you want to use the data we have preprocessed, go directly to `3. Process selected dataset`; if you prefer to process the data from scratch or to work with your own data, start from the beginning.)

1. Download raw ATLAS dataset

(i) Download the Analysis & MDs dataset from ATLAS or Baidu Cloud Drive, or fetch it with ./dataset/download.py by running:

python ./dataset/download.py

We will use .pdb and .xtc files for the following calculation.

2. Calculate the 'approximate energy' and select representative structures

(i) Use gaussian_kde to calculate the 'approximate energy'. First put all the files from the previous step in ./dataset, following the layout of ATLAS_init_example in Google Drive, then run:

python ./dataset/traj_analyse_select.py

This produces the selected representative structures in the select directory and traj_info_select.csv, which records the 'approximate energy'.
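As a rough illustration of the idea behind a KDE-based 'approximate energy', the sketch below estimates a density over a one-dimensional per-frame feature with Gaussian kernels and turns it into an energy-like score as -ln(density). It is a standard-library stand-in for scipy.stats.gaussian_kde and is not the code in traj_analyse_select.py; the feature values, the Scott's-rule bandwidth, the -ln(density) convention, and all function names are assumptions made for illustration.

```python
import math
import statistics

def gaussian_kde_1d(samples, bandwidth=None):
    """Return a density function estimated from 1-D samples with Gaussian kernels."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if bandwidth is None:
        # Scott's rule for 1-D data: h = sample_std * n^(-1/5)
        bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def approximate_energy(samples):
    """Energy-like score per sample: -ln(estimated density). Assumed convention."""
    density = gaussian_kde_1d(samples)
    return [-math.log(density(x)) for x in samples]

def select_lowest_energy(samples, k):
    """Indices of the k samples with the lowest approximate energy."""
    energies = approximate_energy(samples)
    return sorted(range(len(samples)), key=lambda i: energies[i])[:k]

# Hypothetical per-frame feature, e.g. one collective coordinate per MD frame.
feature = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 3.0]
energies = approximate_energy(feature)
picked = select_lowest_energy(feature, k=3)
```

Frames in densely sampled regions get a low score and are picked first, while the isolated frame at 3.0 gets the highest score.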

3. Process selected dataset

(i) Download the selected dataset from Google Drive (or produce it with the two steps above). The file is named selected_dataset_v1.tar or selected_dataset_v2.tar: 'v1' selects ~10 structures from each MD simulation and 'v2' selects ~100. Decompress it using:

tar -xvf selected_dataset_v1.tar
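If you prefer to unpack the archive from Python, or are unsure whether the downloaded file is gzip-compressed, the standard-library tarfile module can detect the compression on its own. A minimal sketch follows; the function name and the destination directory are illustrative assumptions.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir="."):
    """Extract a plain or compressed tar archive and return its member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" opens plain, gzip, bz2, or xz tar archives transparently.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive.getmembers():
            # Refuse entries that would be written outside dest_dir.
            target = (dest / member.name).resolve()
            if dest != target and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
        return archive.getnames()

# Example (paths are placeholders):
# extract_archive("selected_dataset_v1.tar", "./dataset")
```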

(ii) Preprocess the .pdb files into .pkl files, compute node and pair representations using ESM-2, predict the static structure using ESMFold, and produce a merged .csv file:

python ./data/process_pdb_files.py --pdb_dir ${pdb_dir} --write_dir ${write_dir} --traj_info_file ${traj_info_file} --valid_seq_file ${valid_seq_file} --merged_output_file ${merged_output_file}

This produces the .pkl files (which are large) and metadata_merged.csv. If you are using your own data, first split the dataset to obtain a validation set and pass it as ${valid_seq_file}; an example is ./inference/valid_seq.csv. The processed data will be similar to ATLAS_processed_example.tar.gz in Google Drive.
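For your own data, one simple way to produce a validation list is a seeded random split over protein names. The sketch below is only an illustration: the column name `file`, the 10% validation fraction, and the helper names are hypothetical, so check ./inference/valid_seq.csv for the layout the pipeline actually expects before using it.

```python
import csv
import random

def split_validation(names, valid_fraction=0.1, seed=0):
    """Return (train_names, valid_names) from a seeded shuffle of unique names."""
    unique = sorted(set(names))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_valid = max(1, round(len(unique) * valid_fraction))
    return sorted(unique[n_valid:]), sorted(unique[:n_valid])

def write_name_csv(path, names, column="file"):
    """Write one name per row under a single header column (hypothetical layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        writer.writerows([name] for name in names)

# Example (names and output path are placeholders):
# train, valid = split_validation(["prot_a", "prot_b", "prot_c"])
# write_name_csv("my_valid_seq.csv", valid)
```

A fixed seed keeps the split reproducible. A purely random split does not account for sequence similarity between training and validation proteins, so a stricter split may be preferable for benchmarking.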

Model weights

Download the pretrained checkpoint, pretrained.ckpt, from Google Drive and put it in the ./weights folder. You can use the pretrained weights for inference.

Training

To train P2DFlow, first make sure you have prepared the dataset as described in Prepare Dataset and placed it in the right folder, then modify ./configs/base.yaml (especially csv_path). After this, run:

python experiments/train_se3_flows.py

The checkpoints are saved in ./ckpt.

Inference

To run inference for specified protein sequences, first modify ./configs/inference.yaml (especially ckpt_path and validset_path), then run:

python experiments/inference_se3_flows.py

The results are written to ./inference_outputs/weights/.

Evaluation

To evaluate metrics related to validity, fidelity and dynamics, run:

python ./analysis/eval_result.py --pred_org_dir ${pred_org_dir} --valid_csv_file ${valid_csv_file} --pred_merge_dir ${pred_merge_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir}

To evaluate PCA, run:

python ./analysis/pca_analyse.py --pred_pdb_dir ${pred_pdb_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir}

The evaluation results will be similar to evaluation_example in Google Drive.
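When running both evaluation scripts over several prediction folders, it can be convenient to assemble the commands from Python. The helper below is a sketch: only the script paths and flag names come from the commands above, while the helper name and every directory value are placeholders to replace with your own paths.

```python
import shlex
import sys

def build_command(script, **options):
    """Build an argv list: python <script> --key value ... (None values are skipped)."""
    argv = [sys.executable, script]
    for key, value in options.items():
        if value is not None:
            argv += [f"--{key}", str(value)]
    return argv

# All paths below are placeholders.
eval_cmd = build_command(
    "./analysis/eval_result.py",
    pred_org_dir="./path/to/pred_org",
    valid_csv_file="./path/to/valid.csv",
    pred_merge_dir="./path/to/pred_merge",
    target_dir="./path/to/target",
    crystal_dir="./path/to/crystal",
)
pca_cmd = build_command(
    "./analysis/pca_analyse.py",
    pred_pdb_dir="./path/to/pred_pdb",
    target_dir="./path/to/target",
    crystal_dir="./path/to/crystal",
)

# Print the commands for inspection before running them.
print(shlex.join(eval_cmd))
print(shlex.join(pca_cmd))
# To execute: import subprocess; subprocess.run(eval_cmd, check=True)
```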

License

This project is licensed under the terms of the GPL-3.0 license.

Citation

@article{jin2025p2dflow,
  title={P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching},
  author={Yaowei Jin and Qi Huang and Ziyang Song and Mingyue Zheng and Dan Teng and Qian Shi},
  journal={Journal of Chemical Theory and Computation},
  year={2025}
}
