A Bash pipeline for CEL-Seq2 data processing that:
- Reads a manifest CSV describing your samples
- Extracts cell barcodes/UMIs (via concatenator.py)
- Optionally builds a STAR index (if one does not exist)
- Generates intron/exon BED files from a GTF (if they do not exist)
- Runs Trim Galore/cutadapt, STAR, and intron/exon counting
- Produces an end-of-run MultiQC report summarizing quality across steps
The script is designed to work with the tools found in your current environment (on PATH). You can still override any tool path via flags or environment variables when needed.
.
├── CEL-Seq2_pipeline.sh # main script
├── tools/
│ ├── concatenator.py
│ ├── get_IntronsExons_fromGTF_mod.py
│ └── getIntronsExons.sh
├── data/
│ └── manifest.csv # sample,raw_fq_dir,cbc_file
└── results/ # outputs per sample + MultiQC report
- Linux or macOS; Bash 4+; Python 3.8+
- Tools available on PATH (or provide paths via flags): STAR, bedtools, samtools, trim_galore, cutadapt, multiqc
- Python scripts in tools/:
  - concatenator.py (barcode/UMI extraction)
  - get_IntronsExons_fromGTF_mod.py (makes intron/exon BEDs from GTF)
Create an isolated environment with all dependencies:
# with conda
conda create -n celseq -c conda-forge -c bioconda \
python=3.10 \
star \
bedtools \
samtools \
trim-galore \
cutadapt \
multiqc \
pandas \
numpy
conda activate celseq
(Optionally pin specific versions for strict reproducibility.)
This repository includes a default CBC whitelist file named Celseq2_cbc_file.txt.
If it suits your experiment, you can point the cbc_file column in your manifest directly to this file.
Example manifest using the bundled CBC file (relative path from repo root):
sample,raw_fq_dir,cbc_file
S1,/path/to/fastqs,./Celseq2_cbc_file.txt
S2,/path/to/fastqs,./Celseq2_cbc_file.txt
You can also provide a custom CBC file per sample by giving its path in the cbc_file column.
- Manifest CSV with header sample,raw_fq_dir,cbc_file:
  sample,raw_fq_dir,cbc_file
  S1,/path/to/fastqs,/path/to/S1_barcodes.txt
  S2,/path/to/fastqs,/path/to/S2_barcodes.txt
- FASTQs for each sample must be named <sample>_R1.fq.gz and <sample>_R2.fq.gz inside raw_fq_dir.
- Genome FASTA and GTF should correspond to the same genome build/annotation.
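The input rules above can be checked before a run. The following is a minimal sketch and not part of the pipeline: the function name check_manifest is hypothetical, and it assumes only what this README states (the three-column header, and FASTQs named <sample>_R1.fq.gz and <sample>_R2.fq.gz inside raw_fq_dir).

```python
import csv
from pathlib import Path

# Header documented in this README.
EXPECTED_HEADER = ["sample", "raw_fq_dir", "cbc_file"]


def check_manifest(manifest_path):
    """Return a list of problems found in the manifest (empty if none).

    Hypothetical helper for illustration; the pipeline itself may
    validate its inputs differently.
    """
    problems = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_HEADER:
            return [f"unexpected header: {reader.fieldnames}"]
        for row in reader:
            # DictReader marks short rows with None values and long rows
            # with a None key, so both cases are caught here.
            if None in row or None in row.values():
                problems.append(f"line {reader.line_num}: expected 3 columns")
                continue
            sample = row["sample"].strip()
            fq_dir = Path(row["raw_fq_dir"].strip())
            # Both mates must exist under the documented naming scheme.
            for mate in ("R1", "R2"):
                fastq = fq_dir / f"{sample}_{mate}.fq.gz"
                if not fastq.is_file():
                    problems.append(f"{sample}: missing {fastq}")
            cbc_file = Path(row["cbc_file"].strip())
            if not cbc_file.is_file():
                problems.append(f"{sample}: missing CBC file {cbc_file}")
    return problems
```

Calling check_manifest("data/manifest.csv") returns an empty list when every sample has both FASTQs and a readable CBC file; otherwise each entry names the sample and the missing path.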
Get help:
./CEL-Seq2_pipeline.sh --help
If you provide only the FASTA and GTF, the script will:
- build a STAR index (if not already present),
- generate intron/exon BED files from the GTF,
- run the full pipeline, and
- create a MultiQC report at the end.
./CEL-Seq2_pipeline.sh \
  --manifest data/manifest.csv \
  --genome-fasta /ref/Mus_musculus.GRCm39.dna.primary_assembly.fa \
  --gtf /ref/Mus_musculus.GRCm39.112.gtf \
  --threads 16
Artifacts:
- results/<sample>/ with logs and outputs
- results/multiqc/multiqc_report.html
If you already have a STAR index and the intron and exon BED files:
- Point STAR to the index with --genome-index.
- Point the pipeline to the location and basename of your BED files using --exint-dir and --exint-basename.
Important: The intron/exon BED files must be generated by the tools/get_IntronsExons_fromGTF_mod.py script included in this repository to ensure compatibility with the counting step. Name them <basename>_introns.bed and <basename>_exons.bed and keep both together in the same directory.
Example:
# Suppose you have:
# STAR index: /ref/STAR_Index/
# Introns BED: /ref/exint/Mousem39EXINt_introns.bed
# Exons BED: /ref/exint/Mousem39EXINt_exons.bed
./CEL-Seq2_pipeline.sh \
  --manifest data/manifest.csv \
  --genome-index /ref/STAR_Index \
  --exint-dir /ref/exint \
  --exint-basename Mousem39EXINt \
  --threads 16
In this mode, the pipeline detects the existing BEDs at:
- /ref/exint/Mousem39EXINt_introns.bed
- /ref/exint/Mousem39EXINt_exons.bed
and skips regenerating them.
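The naming convention for the BED files can be sketched as follows. This only illustrates how --exint-dir and --exint-basename combine into the two expected paths; the actual detection logic lives in CEL-Seq2_pipeline.sh and is not shown in this README, and both function names are hypothetical.

```python
from pathlib import Path


def expected_bed_paths(exint_dir, exint_basename):
    """Build the intron and exon BED paths from the directory and basename."""
    base = Path(exint_dir)
    return (
        base / f"{exint_basename}_introns.bed",
        base / f"{exint_basename}_exons.bed",
    )


def beds_present(exint_dir, exint_basename):
    """Return True only if both BED files exist in the same directory."""
    return all(
        path.is_file() for path in expected_bed_paths(exint_dir, exint_basename)
    )
```

With the example values above, expected_bed_paths("/ref/exint", "Mousem39EXINt") yields the two paths listed, and beds_present reports whether both are already on disk.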
By default the script uses tools from your current environment (PATH). You can override any tool via flags or environment variables:
Flags:
--star /path/to/STAR
--bedtools /path/to/bedtools
--samtools /path/to/samtools
--trim-galore /path/to/trim_galore
--cutadapt /path/to/cutadapt
--multiqc /path/to/multiqc
--multiqc-config /path/to/multiqc_config.yaml
--no-multiqc # disable MultiQC
Environment variables (equivalent effect):
STAR_BIN=/path/to/STAR \
BEDTOOLS_BIN=/path/to/bedtools \
SAMTOOLS_BIN=/path/to/samtools \
TRIM_GALORE=/path/to/trim_galore \
CUTADAPT_BIN=/path/to/cutadapt \
MULTIQC_BIN=/path/to/multiqc \
MULTIQC_CONFIG=/path/to/multiqc_config.yaml \
RUN_MULTIQC=true \
./CEL-Seq2_pipeline.sh ...
For each sample:
results/<sample>/
├── logs/ # step-by-step logs
├── <sample>_cbc.fastq.gz # barcode/UMI-processed reads (trimmed versions also present)
├── <sample>_Aligned.sortedByCoord.out.bam
└── ... # additional per-step outputs
The pipeline produces three gene-expression matrices:
- coutc: read counts per gene (raw read-level counts).
- coutb: observed UMI counts per gene (deduplicated UMIs).
- coutt: estimated transcript counts per gene (see Nature Methods).
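To make the difference between coutb and coutt concrete, here is a sketch of the saturation correction commonly applied to UMI data: with K possible UMI sequences and b observed UMIs, the estimated transcript count is t = -K ln(1 - b/K). This README does not state the exact formula or UMI length the pipeline uses, so both are assumptions here: K = 4096 corresponds to 6-nt UMIs, and the function name is hypothetical.

```python
import math


def estimate_transcripts(umi_count, n_umis=4096):
    """Estimate transcript count from an observed UMI count.

    Assumption: n_umis = 4**6 = 4096 (6-nt UMIs). Adjust it to the UMI
    length of your protocol. Raising on saturation is a choice made for
    this sketch, not documented pipeline behaviour.
    """
    if not 0 <= umi_count < n_umis:
        raise ValueError("umi_count must satisfy 0 <= umi_count < n_umis")
    # log1p keeps precision when umi_count is much smaller than n_umis.
    return -n_umis * math.log1p(-umi_count / n_umis)
```

For small counts the estimate stays close to the observed UMI count; the correction grows as the observed count approaches K, where distinct transcripts increasingly share the same UMI.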
Global reports:
results/multiqc/multiqc_report.html
- Keep FASTA, GTF, and (if used) prebuilt STAR index in sync (same assembly + annotation build).
- If your BED files don’t match the provided GTF, regenerate them with:
  python3 tools/get_IntronsExons_fromGTF_mod.py /ref/Mus_musculus.GRCm39.112.gtf Mousem39EXINt
  # Produces: Mousem39EXINt_introns.bed and Mousem39EXINt_exons.bed in the current directory
- If a tool isn’t found, either install it into your environment or pass the corresponding --tool flag.
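One quick, partial check that a FASTA and a GTF belong together is to compare their sequence names, since a mismatch such as "chr1" versus "1" is a typical symptom of mixed sources. The sketch below is an illustration with hypothetical function names; matching names are necessary but not sufficient to prove that the assembly and annotation builds agree.

```python
import gzip


def _open_text(path):
    """Open plain or gzip-compressed text files for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def fasta_sequence_names(fasta_path):
    """Collect the first word of every FASTA header line."""
    names = set()
    with _open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_sequence_names(gtf_path):
    """Collect the first column of every non-comment GTF line."""
    names = set()
    with _open_text(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def contigs_missing_from_fasta(fasta_path, gtf_path):
    """Return GTF sequence names that have no matching FASTA record."""
    return gtf_sequence_names(gtf_path) - fasta_sequence_names(fasta_path)
```

An empty result means every annotated contig has a sequence in the FASTA; a result containing most of the GTF names usually points to a naming or build mismatch.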
- Tool not found: install via conda/mamba, or point the script to the binary with a --tool flag.
- STAR index missing/corrupt: delete and let the script regenerate, or pass a valid --genome-index.
- Mismatched annotation: ensure FASTA, GTF, and BED files all correspond to the same genome build.
- No MultiQC output: make sure multiqc is installed on PATH, or disable with --no-multiqc.
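When a "tool not found" error appears, it can help to see which executables are actually resolvable. The sketch below is a hypothetical helper, not part of the pipeline: it uses the executable names and environment variable names listed in this README, and it assumes that a set environment variable takes priority over PATH, which may not match how the script itself resolves tools.

```python
import os
import shutil

# Executable name -> override variable, both as listed in this README.
TOOLS = {
    "STAR": "STAR_BIN",
    "bedtools": "BEDTOOLS_BIN",
    "samtools": "SAMTOOLS_BIN",
    "trim_galore": "TRIM_GALORE",
    "cutadapt": "CUTADAPT_BIN",
    "multiqc": "MULTIQC_BIN",
}


def resolve_tools(environ=None, search_path=None):
    """Map each tool to the executable that would be used, or None.

    Assumption: the override variable, if set, wins over a PATH lookup.
    search_path is an optional PATH-style string used instead of PATH.
    """
    environ = os.environ if environ is None else environ
    resolved = {}
    for name, env_var in TOOLS.items():
        candidate = environ.get(env_var) or name
        # shutil.which accepts a bare name or an explicit path.
        resolved[name] = shutil.which(candidate, path=search_path)
    return resolved
```

Printing the result of resolve_tools() inside the activated environment shows a path for every tool that can be found and None for those that still need installing or an explicit override.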
This project derives from Mouse gastruloids transcriptomics analysis (GPL-3.0) by Anna Alemany and contributors.
Upstream repository: https://github.com/anna-alemany/mouseGastruloids_scRNAseq_tomoseq
See ATTRIBUTION.md for details of changes and file mapping.
Related study: S. C. van den Brink et al. (Nature, 2020) on scRNA-seq and tomo-seq in mouse gastruloids.