Online prediction of other models and prediction tasks can be found here.
The Anaconda package manager is recommended for building the training environment. To pre-train or fine-tune models, please ensure that you have an Nvidia GPU and the corresponding drivers installed. For inference, devices without an Nvidia GPU (CPU only, AMD GPU, Apple Silicon, etc.) are also acceptable.
1.1 Download and install Anaconda package manager
conda create -n llms python=3.11
conda activate llms

If you want to pre-train or fine-tune models, make sure you are using Nvidia GPU(s).
Install the Nvidia driver and the corresponding version of the CUDA driver (> 11.0; we used CUDA 12.1).
PyTorch (>=2.0) built for the corresponding CUDA version should also be installed.
We recommend using pip to install the required Python packages. Please make sure the CUDA and PyTorch versions match; the CUDA version used in our test environment is 12.1. Please refer to the official website for the detailed PyTorch installation tutorial.
pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cu121

If you just want to use the models for inference (prediction), you can install the PyTorch GPU version (above), or install the PyTorch CPU version if your machine has no Nvidia GPU.

pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cpu

Next, install the other required dependencies.
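Before doing so, you can optionally verify the PyTorch installation with a one-line check (for the CPU build, the CUDA check will simply print False):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"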
git clone --recursive https://github.com/zhangtaolab/Plant_DNA_LLMs
cd Plant_DNA_LLMs
pip install -r requirements.txt

(Optional) If you want to train a Mamba model, you need to install several extra dependencies; you will also need an Nvidia GPU.
pip install 'causal-conv1d<=1.3'
pip install 'mamba-ssm<2'

git-lfs is required for downloading large models and datasets; for git-lfs installation, please refer to git-lfs install.
If git-lfs is installed, running the following command

$ git lfs version

will print a message like this:

git-lfs/3.3.0 (GitHub; linux amd64; go 1.19.8)

To pretrain our DNA models from scratch, please first download the desired pretrained models from HuggingFace or ModelScope to your local machine. You can use git clone (which may require git-lfs to be installed) to retrieve the model, or directly download the model from the website.
In the activated llms python environment, use the model_pretrain_from_scratch.py script to pretrain a model using your own dataset.
Before training the model, please prepare a training dataset that contains the split sequences (shorter than 2000 bp) from the target genomes. (Detailed information can be found in the supplemental notes of our manuscript.)
The dataset file should look like this.
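A minimal sketch of what the plain-text pretraining file might look like (hypothetical, truncated sequences; one sequence per line, each shorter than 2000 bp):

ATGGCTTCTCCTGAAGACGTTCAGGTT...TCTTGA
CCGGATTACGTGAACTTGATCCATAGA...AAGTAA
TTTCAGGATCCAAATGGCTAGCTTGCT...CTACGT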
- Here is the list of pretrained models
We use the DNAGPT model as an example to perform pretraining.
# prepare an output directory
mkdir pretrain
# download pretrain model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE models/plant-dnagpt-BPE
# prepare your own dataset for pretraining, data can be stored at the data directory
# example: data/pretrain_data.txt

- Note: If downloading from HuggingFace encounters a network error, please try to download the model from ModelScope or switch to the accelerated mirror before downloading.
# Download with git
git clone https://hf-mirror.com/[organization_name/repo_name]
# Download with huggingface-cli
export HF_ENDPOINT="https://hf-mirror.com"
huggingface-cli download [organization_name/repo_name]

After preparing the model and dataset, use the following script to pretrain the model.
python model_pretrain_from_scratch.py \
--model_name_or_path models/plant-dnagpt-BPE \
--train_data data/pretrain_data.txt \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 24 \
--num_train_epochs 5 \
--learning_rate 1e-4 \
--warmup_ratio 0.05 \
--bf16 \
--logging_strategy steps \
--logging_steps 100 \
--save_strategy steps \
--save_steps 500 \
--output_dir pretrain/dnagpt-BPE_updated

In this script:
- --model_name_or_path: Path to the foundation model you downloaded
- --train_data: Path to the train dataset
- --per_device_train_batch_size: Batch size for training the model
- --gradient_accumulation_steps: Number of update steps to accumulate the gradients for before performing a backward/update pass (see the example below)
- --num_train_epochs: Epochs for training the model (you can also train the model by steps, in which case you should change the strategies for save, logging and evaluation)
- --learning_rate: Learning rate for training the model
- --warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to learning_rate
- --bf16: Use bf16 precision for training
- --logging_strategy: Strategy for logging training information, can be epoch or steps
- --logging_steps: Steps for logging training information
- --save_strategy: Strategy for saving the model, can be epoch or steps
- --save_steps: Steps for saving model checkpoints
- --output_dir: Where to save the pretrained model
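For example, with the settings above each optimizer update sees an effective batch of 16 × 24 = 384 sequences per device (per_device_train_batch_size × gradient_accumulation_steps); multiply by the number of GPUs when training on several devices.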
Detailed descriptions of the arguments can be found here.
Finally, wait for the progress bar to complete; the pretrained model will be saved in the pretrain/dnagpt-BPE_updated directory. This directory will contain checkpoint directories, a runs directory, and the saved pretrained model.
To fine-tune the plant DNA LLMs, please first download the desired models from HuggingFace or ModelScope to your local machine. You can use git clone (which may require git-lfs to be installed) to retrieve the model, or directly download the model from the website.
In the activated llms python environment, use the model_finetune.py script to fine-tune a model for a downstream task.
Our script accepts .csv format data (separated by ,) as input. When preparing the training data, please make sure the data contain a header and at least these two columns:
sequence,label
Where sequence is the input sequence, and label is the corresponding label for the sequence.
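A minimal sketch of such a file (sequences truncated, labels hypothetical; depending on the task, the label column may hold a class index for classification or a numeric value for regression):

sequence,label
TTACTAAATTTATAACGATTTTTTATC...GGCAAAACCT,1
GGGAAAAAGTGAACTCCATTGTTTTTT...TGCACGTTA,0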
We also provide several plant genomic datasets for fine-tuning on HuggingFace and ModelScope.
- Here is the list of pretrained models
We use the Plant DNAGPT model as an example to fine-tune a model for active core promoter prediction.
First, download a pretrained model and the corresponding dataset from HuggingFace or ModelScope:
# prepare an output directory
mkdir finetune
# download pretrain model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE models/plant-dnagpt-BPE
# download train dataset
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters data/plant-multi-species-core-promoters

- Note: If downloading from HuggingFace encounters a network error, please try to download the model/dataset from ModelScope or switch to the accelerated mirror before downloading.
# Download with git
git clone https://hf-mirror.com/[organization_name/repo_name]
# Download with huggingface-cli
export HF_ENDPOINT="https://hf-mirror.com"
huggingface-cli download [organization_name/repo_name]

After preparing the model and dataset, use the following script to fine-tune the model (here is a promoter prediction example):
python model_finetune.py \
--model_name_or_path models/plant-dnagpt-BPE \
--train_data data/plant-multi-species-core-promoters/train.csv \
--test_data data/plant-multi-species-core-promoters/test.csv \
--eval_data data/plant-multi-species-core-promoters/dev.csv \
--train_task classification \
--labels 'Not promoter;Core promoter' \
--run_name plant_dnagpt_BPE_promoter \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--load_best_model_at_end \
--metric_for_best_model 'f1' \
--save_strategy epoch \
--logging_strategy epoch \
--evaluation_strategy epoch \
--output_dir finetune/plant-dnagpt-BPE-promoter

In this script:
- --model_name_or_path: Path to the foundation model you downloaded
- --train_data: Path to the train dataset
- --test_data: Path to the test dataset, omit it if no test data available
- --dev_data: Path to the validation dataset, omit it if no validation data available
- --train_task: Determine the task type, should be classification, multi-classification or regression (a regression example is sketched after this list)
- --labels: Set the labels for the classification task, separated by ;
- --run_name: Name of the fine-tuned model
- --per_device_train_batch_size: Batch size for training the model
- --per_device_eval_batch_size: Batch size for evaluating the model
- --learning_rate: Learning rate for training the model
- --num_train_epochs: Epochs for training the model (you can also train the model by steps, in which case you should change the strategies for save, logging and evaluation)
- --load_best_model_at_end: Whether to load the model with the best performance on the evaluated data, default is True
- --metric_for_best_model: Which metric to use to determine the best model, default is loss; can be accuracy, precision, recall, f1 or matthews_correlation for classification tasks, and r2 or spearmanr for regression tasks
- --save_strategy: Strategy for saving the model, can be epoch or steps
- --logging_strategy: Strategy for logging training information, can be epoch or steps
- --evaluation_strategy: Strategy for evaluating the model, can be epoch or steps
- --output_dir: Where to save the fine-tuned model
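For a regression task (for example, promoter strength), the same script can be reused with a numeric label column. Below is a hedged sketch only; the dataset paths, run name and output directory are hypothetical, and all flags follow the descriptions above:

python model_finetune.py \
--model_name_or_path models/plant-dnagpt-BPE \
--train_data data/my-promoter-strength/train.csv \
--eval_data data/my-promoter-strength/dev.csv \
--train_task regression \
--run_name plant_dnagpt_BPE_promoter_strength \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--metric_for_best_model 'r2' \
--save_strategy epoch \
--logging_strategy epoch \
--evaluation_strategy epoch \
--output_dir finetune/plant-dnagpt-BPE-promoter-strength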
Detailed descriptions of the arguments can be found here.
Finally, wait for the progress bar to complete; the fine-tuned model will be saved in the finetune/plant-dnagpt-BPE-promoter directory. This directory will contain a checkpoint directory, a runs directory, and the saved fine-tuned model.
To use a fine-tuned model for inference, please first download the desired models from HuggingFace or ModelScope to your local machine, or provide a model trained by yourself.
- Here are the recommended models for different genomic tasks
| Genomic task | Recommended model | Link |
|---|---|---|
| Core promoters | Plant DNAGPT 6mer | Huggingface / Modelscope |
| Sequence conservation | Plant DNAMamba 6mer | Huggingface / Modelscope |
| H3K27ac | Plant DNAMamba 6mer | Huggingface / Modelscope |
| H3K27me3 | Plant DNAMamba 4mer | Huggingface / Modelscope |
| H3K4me3 | Plant DNAMamba 5mer | Huggingface / Modelscope |
| lncRNAs | Plant DNAGemma 6mer | Huggingface / Modelscope |
| Open chromatin | Plant DNAMamba BPE | Huggingface / Modelscope |
| Promoter strength (leaf) | Plant NT singlebase | Huggingface / Modelscope |
| Promoter strength (protoplast) | Plant DNAGemma singlebase | Huggingface / Modelscope |
- Here is the full list of fine-tuned models
We use the Plant DNAGPT model as an example to predict active core promoters in plants.
First, download a fine-tuned model and the corresponding dataset from HuggingFace or ModelScope:
# prepare a work directory
mkdir inference
# download fine-tuned model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter models/plant-dnagpt-BPE-promoter
# download train dataset
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters data/plant-multi-species-core-promoters

We provide a script named model_inference.py for model inference.
Here is an example that uses the script to predict active core promoters:
# (method 1) Inference with local model, directly input a sequence
python model_inference.py -m models/plant-dnagpt-BPE-promoter -s 'TTACTAAATTTATAACGATTTTTTATCTAACTTTAGCTCATCAATCTTTACCGTGTCAAAATTTAGTGCCAAGAAGCAGACATGGCCCGATGATCTTTTACCCTGTTTTCATAGCTCGCGAGCCGCGACCTGTGTCCAACCTCAACGGTCACTGCAGTCCCAGCACCTCAGCAGCCTGCGCCTGCCATACCCCCTCCCCCACCCACCCACACACACCATCCGGGCCCACGGTGGGACCCAGATGTCATGCGCTGTACGGGCGAGCAACTAGCCCCCACCTCTTCCCAAGAGGCAAAACCT'
# (method 2) Inference with local model, provide a file contains multiple sequences to predict
python model_inference.py -m models/plant-dnagpt-BPE-promoter -f data/plant-multi-species-core-promoters/test.csv -o inference/promoter_predict_results.txt
# (method 3) Inference with an online model (Auto download the model trained by us from huggingface or modelscope)
python model_inference.py -m zhangtaolab/plant-dnagpt-BPE-promoter -ms huggingface -s 'GGGAAAAAGTGAACTCCATTGTTTTTTCACGCTAAGCAGACCACAATTGCTGCTTGGTACGAAAAGAAAACCGAACCCTTTCACCCACGCACAACTCCATCTCCATTAGCATGGACAGAACACCGTAGATTGAACGCGGGAGGCAACAGGCTAAATCGTCCGTTCAGCCAAAACGGAATCATGGGCTGTTTTTCCAGAAGGCTCCGTGTCGTGTGGTTGTGGTCCAAAAACGAAAAAGAAAGAAAAAAGAAAACCCTTCCCAAGACGTGAAGAAAAGCAATGCGATGCTGATGCACGTTA'

In this script:
- -m: Path to the fine-tuned model that is used for inference
- -s: Input DNA sequence; only nucleotides A, C, G, T, N are acceptable
- -f: Input file that contains multiple sequences, one line for each sequence. If you want to keep more information, a file with , or \t separator is acceptable, but a header containing a sequence column must be specified.
- -ms: Download the model from huggingface or modelscope if the model is not local. The format of the model name is zhangtaolab/model-name; users can copy the model name here.
The output contains the original sequence and the input sequence length. If the task type is classification, the predicted label and the probability of each label will be provided; if the task type is regression, a predicted score will be provided.
Environment deployment for LLMs may be an arduous job. To simplify this process, we also provide a docker version of our model inference code.
The images of the docker version are available here, and the usage of the docker implementation is shown below.
For GPU inference (with an Nvidia GPU), please pull the image with the gpu tag, and make sure your computer has the Nvidia Container Toolkit installed.
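One common way to check that the toolkit works is to run Nvidia's sample workload; it should print your GPU information:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi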
First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAMamba model as an example to predict active core promoters.
# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnamamba-BPE-promoter

Then download the corresponding dataset. If users have their own data, they can also prepare a custom dataset based on the previously mentioned inference data format.
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

Once the model and dataset are ready, pull our model inference image with docker and test if it works.
docker pull zhangtaolab/plant_llms_inference:gpu
docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -h

usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
[-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
[-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]
Script for Plant DNA Large Language Models (LLMs) inference
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-m MODEL Model path (should contain both model and tokenizer)
-f FILE File contains sequences that need to be classified
-s SEQUENCE One sequence that need to be classified
-t THRESHOLD Threshold for defining as True class (Default: 0.5)
-l MAX_LENGTH Max length of tokenized sequence (Default: 512)
-bs BATCH_SIZE Batch size for classification (Default: 1)
-p SAMPLE Subsampling for testing (Default: 1e7)
-seed SEED Random seed for subsampling (Default: None)
-d {cpu,gpu,mps,auto}
Choose CPU or GPU to do inference (require specific
drivers) (Default: auto)
-o OUTFILE Prediction results (Default: stdout)
-n Whether or not save the runtime locally (Default:
False)
Example:
docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -f seqfile.csv -o output.txt
docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt

If the preceding information is displayed, the image has been downloaded and the inference script can run normally. Inference is performed below using the previously prepared model and dataset.
docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -m /home/llms/plant-dnamamba-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt

After the inference progress bar completes, check the output file predict_results.txt in the current local directory, which contains the prediction results for each sequence in the input file.
For CPU inference, please pull the image with the cpu tag; this image supports computers without an Nvidia GPU, such as CPU-only machines or Apple M-series Silicon. (Note that inference of the DNAMamba model is not supported in CPU mode.)
First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAGPT model as an example to predict active core promoters.
# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter

Then download the corresponding dataset. If users have their own data, they can also prepare a custom dataset based on the previously mentioned inference data format.
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

Once the model and dataset are ready, pull our model inference image with docker and test if it works.
docker pull zhangtaolab/plant_llms_inference:cpu
docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -h

usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
[-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
[-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]
Script for Plant DNA Large Language Models (LLMs) inference
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-m MODEL Model path (should contain both model and tokenizer)
-f FILE File contains sequences that need to be classified
-s SEQUENCE One sequence that need to be classified
-t THRESHOLD Threshold for defining as True class (Default: 0.5)
-l MAX_LENGTH Max length of tokenized sequence (Default: 512)
-bs BATCH_SIZE Batch size for classification (Default: 1)
-p SAMPLE Subsampling for testing (Default: 1e7)
-seed SEED Random seed for subsampling (Default: None)
-d {cpu,gpu,mps,auto}
Choose CPU or GPU to do inference (require specific
drivers) (Default: auto)
-o OUTFILE Prediction results (Default: stdout)
-n Whether or not save the runtime locally (Default:
False)
Example:
docker run -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -f seqfile.csv -o output.txt
docker run -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt

If the preceding information is displayed, the image has been downloaded and the inference script can run normally. Inference is performed below using the previously prepared model and dataset.
docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -m /home/llms/plant-dnagpt-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt

After the inference progress bar completes, check the output file predict_results.txt in the current local directory, which contains the prediction results for each sequence in the input file.
- The detailed usage is the same as in the Inference section.
To make it easier for users to apply the models to DNA analysis tasks, we also provide online prediction platforms.
Please refer to the online prediction platform.
For developers who want to use our inference code in a Jupyter Notebook or elsewhere, we developed a simple API package, pdllib, which allows users to directly call the inference functions.
In addition, we provide a demo that shows the usage of our API; see notebook/inference_demo.ipynb.
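To open the demo locally, you can run Jupyter from the activated llms environment (assuming Jupyter is installed; if not, install it with pip first):

pip install notebook
jupyter notebook notebook/inference_demo.ipynb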
- Liu GQ, Chen L, Wu YC, Han YS, Bao Y, Zhang T*. PDLLMs: A group of tailored DNA large language models for analyzing plant genomes. Molecular Plant 2025, 18(2):175-178
