`wtsi-pbsc`: a PacBio long-read single-cell RNA-seq processing pipeline for large-scale analyses based on Nextflow

This Nextflow-based pipeline performs common large-scale preprocessing of PacBio single-cell RNA-seq data prepared with the Kinnex library preparation kit. Although most of the tools used in the initial preprocessing steps are fairly easy to run, I wrote this pipeline to make it easier to analyse a large number of samples with Nextflow.

Specifying pipeline parameters

To run the pipeline you need to create:

A parameter file `params.yaml`: a YAML file containing the parameters required to run the different steps. Note that each module requires a different set of parameters (detailed in each section below). The YAML syntax for specifying these parameters is:

```yaml
# Comment
param_path: "/path/to/file.txt"
param_integer: 20
```

An executor configuration file `exec.config`: specifies runtime parameters such as the memory and CPUs to request for each step, as well as the executor, queue name and error strategy. You can find more on the Nextflow documentation website here. We also provide a file `sanger.config` with the parameters that we used for isogut, although the appropriate runtime parameters may need to change per dataset and per execution environment.

Components of `wtsi-pbsc`

`wtsi-pbsc` consists of four modules that represent different stages of preprocessing. Each module can be run independently provided the input files and parameters are correctly specified (e.g. using the Nextflow option `-entry fltnc`). Each step needs to be run with a set of parameters specified in the `params.yaml` file.
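As a rough illustration of running a single module, the sketch below assembles a `nextflow run` command line. `-params-file`, `-c` and `-entry` are standard Nextflow options; the script name `isoseq2.nf`, the default file names and the helper `build_command` are assumptions made for this example, so adjust them to your checkout.

```python
import shlex


def build_command(entry, script="isoseq2.nf",
                  params_file="params.yaml", exec_config="exec.config"):
    """Return the argument list for running one module of the pipeline.

    The script name and file names are assumptions for illustration only.
    """
    return [
        "nextflow", "run", script,
        "-params-file", params_file,  # pipeline parameters (YAML)
        "-c", exec_config,            # executor and runtime configuration
        "-entry", entry,              # run only this named workflow
    ]


# Print a copy-pasteable command for the fltnc module.
print(shlex.join(build_command("fltnc")))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.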

1- Step 1: HiFi reads to FLTNC reads

Using module `fltnc`. `fltnc` takes BAM files with HiFi reads (usually what you get from sequencing) and produces full-length tagged non-concatemer (FLTNC) reads.

2- Step 2: Barcode correction and PCR deduplication

3- Step 3: Mapping to the genome using pbmm2 (which is a PacBio wrapper around minimap2)

4- Step 4: Quantification using IsoQuant

As quantification can be extremely time- and memory-consuming, we implement several parallelisation strategies to speed up the process.

No parallelisation:

IsoQuant runs on all samples jointly, using all regions of the genome together. Suitable for a small number of samples (default IsoQuant behaviour).

By chromosome:

IsoQuant runs on all samples jointly, but split by chromosome. Suitable for a moderate number of samples (e.g. < 20).

By chunk:

IsoQuant runs on all samples jointly, but each chromosome is further split into smaller chunks. Split points are placed within regions of zero coverage across all samples. Suitable for a large number of samples (e.g. < 60).
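The idea of placing split points inside zero-coverage regions can be sketched as follows. This is a simplified illustration and not the pipeline's actual implementation: it assumes the alignments on one chromosome, pooled across all samples, are given as half-open `(start, end)` intervals covering each read's full span, and the function names are hypothetical.

```python
def zero_coverage_gaps(intervals, chrom_length):
    """Return half-open (start, end) regions covered by no interval."""
    gaps = []
    covered_until = 0
    for start, end in sorted(intervals):
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    if covered_until < chrom_length:
        gaps.append((covered_until, chrom_length))
    return gaps


def split_chromosome(intervals, chrom_length, n_chunks):
    """Split [0, chrom_length) into at most n_chunks pieces.

    Each cut is placed at the midpoint of the zero-coverage gap closest to
    an evenly spaced target position, so no interval spans two chunks.
    """
    gaps = zero_coverage_gaps(intervals, chrom_length)
    # Candidate cut positions: gap midpoints strictly inside the chromosome.
    candidates = [(s + e) // 2 for s, e in gaps
                  if 0 < (s + e) // 2 < chrom_length]
    cuts = set()
    for i in range(1, n_chunks):
        target = i * chrom_length / n_chunks
        if candidates:
            cuts.add(min(candidates, key=lambda c: abs(c - target)))
    bounds = [0] + sorted(cuts) + [chrom_length]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy example: four pooled read intervals on a chromosome of length 1000.
reads = [(0, 100), (50, 150), (300, 400), (700, 800)]
print(split_chromosome(reads, 1000, 3))
```

If two targets resolve to the same gap, the sketch returns fewer chunks than requested rather than cutting through covered sequence.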

Two-pass approach:

A two-pass approach. First, IsoQuant quantifies known isoforms on a sample-by-sample basis (i.e. using IsoQuant's command-line argument `--no_model_construction`). Second, IsoQuant pools all remaining reads (i.e. those assigned as `inconsistent`, `inconsistent_ambiguous` and `inconsistent_non_intronic`) and runs on them with isoform discovery turned on.
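The read selection between the two passes can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline's code: it assumes the first-pass assignments are already available per sample as `(read_id, assignment_type)` pairs, and the function name and input layout are hypothetical.

```python
# Assignment categories named above as the reads carried into the second pass.
SECOND_PASS_TYPES = frozenset({
    "inconsistent",
    "inconsistent_ambiguous",
    "inconsistent_non_intronic",
})


def select_second_pass_reads(assignments_by_sample):
    """Map each sample to the read IDs to pool for the second pass.

    assignments_by_sample: dict of sample_id -> iterable of
    (read_id, assignment_type) pairs from the first pass (assumed layout).
    """
    selected = {}
    for sample_id, assignments in assignments_by_sample.items():
        selected[sample_id] = {
            read_id
            for read_id, assignment_type in assignments
            if assignment_type in SECOND_PASS_TYPES
        }
    return selected


# Toy first-pass results for two samples.
first_pass = {
    "S1": [("r1", "unique"), ("r2", "inconsistent")],
    "S2": [("r3", "inconsistent_non_intronic"), ("r4", "ambiguous")],
}
print(select_second_pass_reads(first_pass))
```

Keeping the selection per sample makes it possible to extract the matching reads from each sample's BAM before pooling them.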

Parameters in `params.yaml`:

| Parameter | Description |
| --- | --- |
| `input_samples_path` | Path to a comma-separated file with a header `sample_id`,`long_read_path`. Each entry should contain the `sample_id` and the path to the input BAM file with HiFi reads. |
| `exclude_samples` | File containing at least one column (`sample_id`) and listing the samples to be excluded from analysis. |
| `skera_primers` | Path to primers that link concatenated reads. Required for `skera` to de-concatenate reads. Download here. |
| `tenx_primers` | Path to 10X 3'/5' primers. Required for `lima` to remove 10X primers. Download here. |
| `threeprime_whitelist` | 10X cell barcode whitelist. Download here. |
| `barcode_correction_method` | Method used by `isoseq` to correct cell barcodes. Available options: `percentile`, `knee`. Default is `percentile`, which worked better on our data. |
| `barcode_correction_percentile` | If `barcode_correction_method` is specified, use this to set a threshold for cell calling. |
| `gtf_f` | GTF file used for quantification. This pipeline was tested with GENCODE v46. |
| `genome_fasta_f` | Genome in FASTA format. |
| `min_polya_length` | Minimum polyA tail length. Reads with shorter polyA tails will be removed. Recommended: 20. |
| `results_output` | Path to output directory. Note that BAM and other types of files will be stored inside a subdirectory `qc`. |
| `chunks` | Number of chunks used for the parallelisation of IsoQuant. |
| `isoquant_exclusion_regions_bed` | Regions to exclude from quantification. In the data directory, V(D)J regions are provided as they are problematic for quantification in a large number of samples. Any BED file can be provided. |
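Before launching a large run it can help to check the sample sheet passed as `input_samples_path`. The sketch below is not part of the pipeline: the function name is hypothetical, and rejecting duplicated `sample_id` values is an assumption of this example rather than a documented requirement.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "long_read_path")


def read_sample_sheet(handle, excluded=()):
    """Parse the sample sheet and return {sample_id: long_read_path}.

    Raises ValueError on a missing column or a duplicated sample_id
    (the duplicate check is an assumption of this sketch).
    Samples listed in `excluded` are skipped.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    samples = {}
    for row in reader:
        sample_id = row["sample_id"].strip()
        if sample_id in excluded:
            continue
        if sample_id in samples:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        samples[sample_id] = row["long_read_path"].strip()
    return samples


# Toy sample sheet; in practice open the file given by input_samples_path.
sheet = io.StringIO(
    "sample_id,long_read_path\n"
    "S1,/path/to/S1.hifi.bam\n"
    "S2,/path/to/S2.hifi.bam\n"
)
print(read_sample_sheet(sheet, excluded={"S2"}))
```

The `excluded` argument mirrors the role of `exclude_samples`: pass it the set of `sample_id` values read from that file.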
