CEL-Seq2 Mapping Pipeline

A Bash pipeline for CEL-Seq2 data processing that:

Reads a manifest CSV describing your samples
Extracts cell barcodes/UMIs (via concatenator.py)
Optionally builds a STAR index (if one does not exist)
Generates introns/exons BEDs from a GTF (if they do not exist)
Runs Trim Galore/cutadapt, STAR, and intron/exon counting
Produces an end-of-run MultiQC report summarizing quality across steps

By default the script uses the tools found on your PATH. You can override any tool path with a flag or an environment variable when needed.

Contents (expected layout)

.
├── CEL-Seq2_pipeline.sh              # main script
├── tools/
│   ├── concatenator.py
│   ├── get_IntronsExons_fromGTF_mod.py
│   └── getIntronsExons.sh
├── data/
│   └── manifest.csv                  # sample,raw_fq_dir,cbc_file
└── results/                          # outputs per sample + MultiQC report

Requirements

Linux or macOS; Bash 4+; Python 3.8+
Tools available on PATH (or provide paths via flags):
- STAR, bedtools, samtools, trim_galore, cutadapt, multiqc
Python scripts in tools/:
- concatenator.py (barcode/UMI extraction)
- get_IntronsExons_fromGTF_mod.py (makes intron/exon BEDs from GTF)
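
Before a first run it can help to confirm that the required tools resolve on PATH. The following is a minimal sketch, not part of the pipeline: `check_tools` is a hypothetical helper name, and it only tests whether each command can be found, not whether its version is suitable.

```shell
#!/usr/bin/env bash
# Sketch only: check_tools is a hypothetical helper, not part of
# CEL-Seq2_pipeline.sh. It reports which commands resolve on PATH.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found.
    return "$missing"
}

# Usage (tool names taken from the requirements list above):
#   check_tools STAR bedtools samtools trim_galore cutadapt multiqc python3
```

The function returns a non-zero status when anything is missing, so it can gate a wrapper script.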

Install with conda/mamba (recommended)

Create an isolated environment with all dependencies:

# with conda (mamba accepts the same arguments: replace "conda create" with "mamba create")
conda create -n celseq -c conda-forge -c bioconda \
    python=3.10 \
    star \
    bedtools \
    samtools \
    trim-galore \
    cutadapt \
    multiqc \
    pandas \
    numpy

conda activate celseq

(Optionally pin specific versions for strict reproducibility.)

Input conventions

Note on the CBC file (`cbc_file` column)

This repository includes a default CBC whitelist file named Celseq2_cbc_file.txt. If it suits your experiment, you can point the cbc_file column in your manifest directly to this file.

Example manifest using the bundled CBC file (relative path from repo root):

sample,raw_fq_dir,cbc_file
S1,/path/to/fastqs,./Celseq2_cbc_file.txt
S2,/path/to/fastqs,./Celseq2_cbc_file.txt

You can also provide a custom CBC file per sample by giving its path in the cbc_file column.

Manifest CSV with header: sample,raw_fq_dir,cbc_file

sample,raw_fq_dir,cbc_file
S1,/path/to/fastqs,/path/to/S1_barcodes.txt
S2,/path/to/fastqs,/path/to/S2_barcodes.txt

FASTQs for each sample must be named: <sample>_R1.fq.gz and <sample>_R2.fq.gz inside raw_fq_dir.
Genome FASTA and GTF should correspond to the same genome build/annotation.
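
The naming conventions above can be checked before launching a long run. The sketch below assumes a simple manifest with no quoted fields or embedded commas; `validate_manifest` is a hypothetical helper, and the pipeline itself may perform different or additional checks.

```shell
#!/usr/bin/env bash
# Sketch only: validate_manifest is a hypothetical helper, not part of
# CEL-Seq2_pipeline.sh. It checks the conventions described above.
validate_manifest() {
    local manifest="$1" errors=0
    local header sample raw_fq_dir cbc_file read_end fastq
    if [ ! -f "$manifest" ]; then
        echo "manifest not found: $manifest" >&2
        return 1
    fi
    {
        # Discard the header row (sample,raw_fq_dir,cbc_file).
        IFS= read -r header || true
        # Split each remaining row on commas; the extra test handles a
        # final row that lacks a trailing newline.
        while IFS=, read -r sample raw_fq_dir cbc_file || [ -n "$sample" ]; do
            if [ -z "$sample" ]; then continue; fi
            for read_end in R1 R2; do
                fastq="${raw_fq_dir}/${sample}_${read_end}.fq.gz"
                if [ ! -f "$fastq" ]; then
                    echo "missing FASTQ: $fastq" >&2
                    errors=$((errors + 1))
                fi
            done
            if [ ! -f "$cbc_file" ]; then
                echo "missing CBC file for $sample: $cbc_file" >&2
                errors=$((errors + 1))
            fi
        done
    } < "$manifest"
    # Succeed only if every expected file was found.
    [ "$errors" -eq 0 ]
}

# Usage:
#   validate_manifest data/manifest.csv && echo "manifest looks consistent"
```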

Running the pipeline

Get help:

./CEL-Seq2_pipeline.sh --help

Example A — Minimal run (FASTA + GTF only)

If you provide only the FASTA and GTF, the script will:

build a STAR index (if not already present),
generate introns/exons BEDs from the GTF,
run the full pipeline, and
create a MultiQC report at the end.

./CEL-Seq2_pipeline.sh \
    --manifest data/manifest.csv \
    --genome-fasta /ref/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    --gtf /ref/Mus_musculus.GRCm39.112.gtf \
    --threads 16

Artifacts:

results/<sample>/ with logs and outputs
results/multiqc/multiqc_report.html

Example B — Use an existing STAR index + existing intron/exon BEDs

If you already have a STAR index and the intron and exon BED files:

Point STAR to the index with --genome-index.
Point the pipeline to the location and basename of your BED files using --exint-dir and --exint-basename.

Important: The intron/exon BED files must be generated by the tools/get_IntronsExons_fromGTF_mod.py script included in this repository to ensure compatibility with the counting step. Name them as <basename>_introns.bed and <basename>_exons.bed and keep both together in the same directory.

Example:

# Suppose you have:
#   STAR index:           /ref/STAR_Index/
#   Introns BED:          /ref/exint/Mousem39EXINt_introns.bed
#   Exons BED:            /ref/exint/Mousem39EXINt_exons.bed

./CEL-Seq2_pipeline.sh \
    --manifest data/manifest.csv \
    --genome-index /ref/STAR_Index \
    --exint-dir /ref/exint \
    --exint-basename Mousem39EXINt \
    --threads 16

In this mode, the pipeline detects the existing BEDs at:

/ref/exint/Mousem39EXINt_introns.bed
/ref/exint/Mousem39EXINt_exons.bed

and skips regenerating them.
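
The <basename>_introns.bed / <basename>_exons.bed convention can be verified by hand before a run. The sketch below is an illustration of that naming rule only; `exint_beds_present` is a hypothetical helper, and the detection logic inside the pipeline script may differ.

```shell
#!/usr/bin/env bash
# Sketch only: exint_beds_present is a hypothetical helper that mirrors the
# naming convention described above (it is not the pipeline's own check).
exint_beds_present() {
    local dir="$1" base="$2" kind
    for kind in introns exons; do
        # -s: the file must exist and be non-empty.
        if [ ! -s "${dir}/${base}_${kind}.bed" ]; then
            return 1
        fi
    done
    return 0
}

# Usage:
#   if exint_beds_present /ref/exint Mousem39EXINt; then
#       echo "both BED files found"
#   fi
```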

Tool paths & overrides

By default the script uses tools from your current environment (PATH). You can override any tool via flags or environment variables:

Flags:

--star /path/to/STAR
--bedtools /path/to/bedtools
--samtools /path/to/samtools
--trim-galore /path/to/trim_galore
--cutadapt /path/to/cutadapt
--multiqc /path/to/multiqc
--multiqc-config /path/to/multiqc_config.yaml
--no-multiqc                 # disable MultiQC

Environment variables (equivalent effect):

STAR_BIN=/path/to/STAR \
BEDTOOLS_BIN=/path/to/bedtools \
SAMTOOLS_BIN=/path/to/samtools \
TRIM_GALORE=/path/to/trim_galore \
CUTADAPT_BIN=/path/to/cutadapt \
MULTIQC_BIN=/path/to/multiqc \
MULTIQC_CONFIG=/path/to/multiqc_config.yaml \
RUN_MULTIQC=true \
./CEL-Seq2_pipeline.sh ...
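
For readers wrapping the pipeline in their own scripts, one common way to combine a flag, an environment variable, and a PATH default is sketched below. The precedence shown (flag first, then environment variable, then the bare command name) is an assumption made for illustration; this README does not state how the script resolves a conflict between the two, and `resolve_tool` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_tool is a hypothetical helper. The precedence
# (flag > environment variable > name looked up on PATH) is an assumption,
# not documented behaviour of CEL-Seq2_pipeline.sh.
# Arguments: FLAG_VALUE ENV_VALUE DEFAULT_NAME
resolve_tool() {
    local flag_value="$1" env_value="$2" default_name="$3"
    if [ -n "$flag_value" ]; then
        printf '%s\n' "$flag_value"
    elif [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
    else
        printf '%s\n' "$default_name"
    fi
}

# Usage (star_flag would hold the value passed via --star, if any):
#   star_bin="$(resolve_tool "${star_flag:-}" "${STAR_BIN:-}" STAR)"
```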

Outputs

For each sample:

results/<sample>/
├── logs/                         # step-by-step logs
├── <sample>_cbc.fastq.gz         # barcode/UMI-processed reads (trimmed versions also present)
├── <sample>_Aligned.sortedByCoord.out.bam
└── ...                           # additional per-step outputs

Output gene expression matrices

The pipeline produces three gene-expression matrices:

coutc — read counts per gene (raw read-level counts).
coutb — observed UMI counts per gene (deduplicated UMIs).
coutt — estimated transcript counts per gene (see Nature Methods).
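
To give a feel for how coutb and coutt can relate, the sketch below applies a widely used UMI-saturation correction: with K possible UMI sequences and b observed UMIs for a gene, the estimated transcript count is t = K * ln(K / (K - b)). This is an assumption for illustration; the exact formula used by this pipeline is not stated here. The default K = 4096 assumes a 6-nt UMI (4^6), and `umi_to_transcripts` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: umi_to_transcripts is a hypothetical helper. It applies
# t = K * ln(K / (K - b)), a common UMI-saturation correction; whether the
# pipeline uses exactly this formula is an assumption.
# Arguments: B (observed UMIs) and optional K (possible UMIs; default 4096,
# which assumes a 6-nt UMI).
umi_to_transcripts() {
    awk -v b="$1" -v K="${2:-4096}" 'BEGIN {
        if (b >= K) {
            # All UMIs observed: the estimate diverges, so report an error.
            print "UMI count saturated (b >= K)" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", K * log(K / (K - b))
    }'
}

# Usage:
#   umi_to_transcripts 100        # 100 observed UMIs, K = 4096
#   umi_to_transcripts 100 65536  # same count with an 8-nt UMI
```

For small counts the estimate stays close to the observed UMI count; the correction grows as b approaches K.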

Global reports:

results/multiqc/multiqc_report.html

Notes & tips

Keep FASTA, GTF, and (if used) prebuilt STAR index in sync (same assembly + annotation build).

If your BED files don’t match the provided GTF, regenerate them with:

python3 tools/get_IntronsExons_fromGTF_mod.py /ref/Mus_musculus.GRCm39.112.gtf Mousem39EXINt
# Produces: Mousem39EXINt_introns.bed and Mousem39EXINt_exons.bed in the current directory

If a tool isn’t found, either install it into your environment or pass its path with the matching flag (for example --star or --samtools; see Tool paths & overrides).

Troubleshooting

Tool not found: install it via conda/mamba, or point the script to the binary with the matching flag (for example --star /path/to/STAR).
STAR index missing/corrupt: delete and let the script regenerate, or pass a valid --genome-index.
Mismatched annotation: ensure FASTA, GTF, and BED files all correspond to the same genome build.
No MultiQC output: make sure multiqc is installed on PATH, or disable with --no-multiqc.

Credits

This project derives from Mouse gastruloids transcriptomics analysis (GPL-3.0) by Anna Alemany and contributors.
Upstream repository: https://github.com/anna-alemany/mouseGastruloids_scRNAseq_tomoseq
See ATTRIBUTION.md for details of changes and file mapping.

Related study: S. C. van den Brink et al. (Nature, 2020) on scRNA-seq and tomo-seq in mouse gastruloids.
