moutazhelal/celseq2-mapping-wrapper

CEL-Seq2 Mapping Pipeline

A Bash pipeline for CEL-Seq2 data processing that:

  • Reads a manifest CSV describing your samples
  • Extracts cell barcodes/UMIs (via concatenator.py)
  • Optionally builds a STAR index (if one does not exist)
  • Generates introns/exons BEDs from a GTF (if they do not exist)
  • Runs Trim Galore/cutadapt, STAR, and intron/exon counting
  • Produces an end-of-run MultiQC report summarizing quality across steps

The script uses the tools found in your current environment (on PATH) by default. You can override any tool path with flags or environment variables when needed.


Contents (expected layout)

.
├── CEL-Seq2_pipeline.sh              # main script
├── tools/
│   ├── concatenator.py
│   ├── get_IntronsExons_fromGTF_mod.py
│   └── getIntronsExons.sh
├── data/
│   └── manifest.csv                  # sample,raw_fq_dir,cbc_file
└── results/                          # outputs per sample + MultiQC report

Requirements

  • Linux or macOS; Bash 4+; Python 3.8+
  • Tools available on PATH (or provide paths via flags):
    • STAR, bedtools, samtools, trim_galore, cutadapt, multiqc
  • Python scripts in tools/:
    • concatenator.py (barcode/UMI extraction)
    • get_IntronsExons_fromGTF_mod.py (makes intron/exon BEDs from GTF)

Install with conda/mamba (recommended)

Create an isolated environment with all dependencies:

# with conda (mamba users: replace `conda` with `mamba`)
conda create -n celseq -c conda-forge -c bioconda \
    python=3.10 \
    star \
    bedtools \
    samtools \
    trim-galore \
    cutadapt \
    multiqc \
    pandas \
    numpy

conda activate celseq

(Optionally pin specific versions for strict reproducibility.)


Input conventions

Note on the CBC file (cbc_file column)

This repository includes a default CBC whitelist file named Celseq2_cbc_file.txt. If it suits your experiment, you can point the cbc_file column in your manifest directly to this file.

Example manifest using the bundled CBC file (relative path from repo root):

sample,raw_fq_dir,cbc_file
S1,/path/to/fastqs,./Celseq2_cbc_file.txt
S2,/path/to/fastqs,./Celseq2_cbc_file.txt

You can also provide a custom CBC file per sample by giving its path in the cbc_file column.

  • Manifest CSV with header: sample,raw_fq_dir,cbc_file
    sample,raw_fq_dir,cbc_file
    S1,/path/to/fastqs,/path/to/S1_barcodes.txt
    S2,/path/to/fastqs,/path/to/S2_barcodes.txt
  • FASTQs for each sample must be named: <sample>_R1.fq.gz and <sample>_R2.fq.gz inside raw_fq_dir.
  • Genome FASTA and GTF should correspond to the same genome build/annotation.
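
The conventions above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: `check_manifest` is a hypothetical helper that only verifies the header, the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming, and that each CBC file exists.

```python
import csv
from pathlib import Path

# Header required by the manifest convention described in this README.
EXPECTED_HEADER = ["sample", "raw_fq_dir", "cbc_file"]


def check_manifest(manifest_path):
    """Return a list of human-readable problems found in the manifest.

    Hypothetical helper for illustration; the pipeline itself may
    validate its inputs differently.
    """
    problems = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_HEADER:
            return [f"unexpected header: {reader.fieldnames}"]
        for row in reader:
            # DictReader stores surplus columns under the key None and
            # fills missing columns with the value None.
            if None in row or None in row.values():
                problems.append(f"line {reader.line_num}: wrong number of columns")
                continue
            sample = row["sample"]
            fq_dir = Path(row["raw_fq_dir"])
            for mate in ("R1", "R2"):
                fastq = fq_dir / f"{sample}_{mate}.fq.gz"
                if not fastq.is_file():
                    problems.append(f"{sample}: missing {fastq}")
            if not Path(row["cbc_file"]).is_file():
                problems.append(f"{sample}: missing CBC file {row['cbc_file']}")
    return problems
```

Calling `check_manifest("data/manifest.csv")` returns an empty list when every sample has both FASTQ files and a readable CBC file.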

Running the pipeline

Get help:

./CEL-Seq2_pipeline.sh --help

Example A — Minimal run (FASTA + GTF only)

If you provide only the FASTA and GTF, the script will:

  • build a STAR index (if not already present),
  • generate introns/exons BEDs from the GTF,
  • run the full pipeline, and
  • create a MultiQC report at the end.

./CEL-Seq2_pipeline.sh \
    --manifest data/manifest.csv \
    --genome-fasta /ref/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    --gtf /ref/Mus_musculus.GRCm39.112.gtf \
    --threads 16

Artifacts:

  • results/<sample>/ with logs and outputs
  • results/multiqc/multiqc_report.html

Example B — Use an existing STAR index + existing intron/exon BEDs

If you already have a STAR index and the intron and exon BED files:

  • Point STAR to the index with --genome-index.
  • Point the pipeline to the location and basename of your BED files using --exint-dir and --exint-basename.

Important: The intron/exon BED files must be generated by the tools/get_IntronsExons_fromGTF_mod.py script included in this repository to ensure compatibility with the counting step. Name them as <basename>_introns.bed and <basename>_exons.bed and keep both together in the same directory.

Example:

# Suppose you have:
#   STAR index:           /ref/STAR_Index/
#   Introns BED:          /ref/exint/Mousem39EXINt_introns.bed
#   Exons BED:            /ref/exint/Mousem39EXINt_exons.bed

./CEL-Seq2_pipeline.sh \
    --manifest data/manifest.csv \
    --genome-index /ref/STAR_Index \
    --exint-dir /ref/exint \
    --exint-basename Mousem39EXINt \
    --threads 16

In this mode, the pipeline detects the existing BEDs at the paths below and skips regenerating them:

  • /ref/exint/Mousem39EXINt_introns.bed
  • /ref/exint/Mousem39EXINt_exons.bed
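
The naming rule for the BED files (`<basename>_introns.bed` and `<basename>_exons.bed` in one directory) can be sketched as follows. The function names `expected_bed_paths` and `beds_present` are hypothetical and only illustrate the convention; how the script itself performs the check is not shown in this README.

```python
from pathlib import Path


def expected_bed_paths(exint_dir, exint_basename):
    """Map 'introns'/'exons' to the BED path implied by the naming rule."""
    base = Path(exint_dir)
    return {
        kind: base / f"{exint_basename}_{kind}.bed"
        for kind in ("introns", "exons")
    }


def beds_present(exint_dir, exint_basename):
    """True only if both BED files exist, i.e. regeneration could be skipped."""
    paths = expected_bed_paths(exint_dir, exint_basename)
    return all(path.is_file() for path in paths.values())
```

With the values from the example, `expected_bed_paths("/ref/exint", "Mousem39EXINt")` yields the two paths listed above.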

Tool paths & overrides

By default the script uses tools from your current environment (PATH). You can override any tool via flags or environment variables:

Flags:

--star /path/to/STAR
--bedtools /path/to/bedtools
--samtools /path/to/samtools
--trim-galore /path/to/trim_galore
--cutadapt /path/to/cutadapt
--multiqc /path/to/multiqc
--multiqc-config /path/to/multiqc_config.yaml
--no-multiqc                 # disable MultiQC

Environment variables (equivalent effect):

STAR_BIN=/path/to/STAR \
BEDTOOLS_BIN=/path/to/bedtools \
SAMTOOLS_BIN=/path/to/samtools \
TRIM_GALORE=/path/to/trim_galore \
CUTADAPT_BIN=/path/to/cutadapt \
MULTIQC_BIN=/path/to/multiqc \
MULTIQC_CONFIG=/path/to/multiqc_config.yaml \
RUN_MULTIQC=true \
./CEL-Seq2_pipeline.sh ...
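
To see which binary an override would select, the lookup can be approximated as below. This is a sketch under an assumption: the precedence flag, then environment variable, then PATH is a guess at the script's behavior, not something this README states. `resolve_tool` is a hypothetical helper; the environment variable names are the ones listed above.

```python
import os
import shutil

# Default command name -> environment variable, as listed in this README.
TOOLS = {
    "STAR": "STAR_BIN",
    "bedtools": "BEDTOOLS_BIN",
    "samtools": "SAMTOOLS_BIN",
    "trim_galore": "TRIM_GALORE",
    "cutadapt": "CUTADAPT_BIN",
    "multiqc": "MULTIQC_BIN",
}


def resolve_tool(name, flag_value=None, env=None):
    """Return the executable path for a tool, or None if it cannot be found.

    Assumed precedence: explicit flag value, then environment variable,
    then a plain PATH lookup of the default command name.
    """
    env = os.environ if env is None else env
    candidate = flag_value or env.get(TOOLS[name]) or name
    # shutil.which accepts both bare command names and explicit paths,
    # and returns None when the target is missing or not executable.
    return shutil.which(candidate)
```

Looping over `TOOLS` and reporting every name for which `resolve_tool(name)` is `None` gives a quick pre-flight check for missing dependencies.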

Outputs

For each sample:

results/<sample>/
├── logs/                         # step-by-step logs
├── <sample>_cbc.fastq.gz         # barcode/UMI-processed reads (trimmed versions also present)
├── <sample>_Aligned.sortedByCoord.out.bam
└── ...                           # additional per-step outputs

Output gene expression matrices

The pipeline produces three gene-expression matrices:

  • coutc — read counts per gene (raw read-level counts).
  • coutb — observed UMI counts per gene (deduplicated UMIs).
  • coutt — estimated transcript counts per gene (see Nature Methods).
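
The relationship between coutb and coutt can be illustrated with a common UMI-collision correction. This sketch rests on assumptions that this README does not confirm: that coutt uses the standard saturation formula, and that the UMI is 6 nt long (4^6 = 4096 possible UMIs). If t transcripts each carry a random UMI out of K, the expected number of distinct UMIs is K(1 - (1 - 1/K)^t); solving for t gives the estimate below. `umis_to_transcripts` is a hypothetical name.

```python
import math


def umis_to_transcripts(observed_umis, umi_length=6):
    """Estimate transcript count from the number of distinct UMIs observed.

    Inverts b = K * (1 - (1 - 1/K)**t) for t, where K = 4**umi_length.
    The 6-nt UMI default is an assumption, not taken from the pipeline.
    """
    n_umis = 4 ** umi_length
    if observed_umis >= n_umis:
        # All UMIs seen: the estimate diverges, so cap just below saturation.
        observed_umis = n_umis - 1e-3
    return math.log(1 - observed_umis / n_umis) / math.log(1 - 1 / n_umis)
```

For small counts the estimate is close to the observed UMI count; it grows faster than the UMI count as a gene approaches saturation of the UMI space.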

Global reports:

results/multiqc/multiqc_report.html

Notes & tips

  • Keep FASTA, GTF, and (if used) prebuilt STAR index in sync (same assembly + annotation build).
  • If your BED files don’t match the provided GTF, regenerate them with:
    python3 tools/get_IntronsExons_fromGTF_mod.py /ref/Mus_musculus.GRCm39.112.gtf Mousem39EXINt
    # Produces: Mousem39EXINt_introns.bed and Mousem39EXINt_exons.bed in the current directory
  • If a tool isn’t found, either install it into your environment or pass its path with the matching flag (for example --star or --samtools).

Troubleshooting

  • Tool not found: install it via conda/mamba, or point the script to the binary with the matching flag (for example --star /path/to/STAR).
  • STAR index missing/corrupt: delete and let the script regenerate, or pass a valid --genome-index.
  • Mismatched annotation: ensure FASTA, GTF, and BED files all correspond to the same genome build.
  • No MultiQC output: make sure multiqc is installed on PATH, or disable with --no-multiqc.

Credits

This project derives from Mouse gastruloids transcriptomics analysis (GPL-3.0) by Anna Alemany and contributors.

Upstream repository: https://github.com/anna-alemany/mouseGastruloids_scRNAseq_tomoseq

See ATTRIBUTION.md for details of changes and file mapping.

Related study: S. C. van den Brink et al. (Nature, 2020) on scRNA-seq and tomo-seq in mouse gastruloids.
