Single-cell and single-cell lineage tracing analysis in Python (soon a Nextflow pipeline).
The aim of this project is to provide a toolkit for the exploration of scRNA-seq data. These tools perform common single-cell analysis tasks (i.e., data pre-processing and integration, cell clustering and annotation (*)) with the following features:
- Multiple methods for each analytical step, to ensure that, for a given task, the best performing method has a fair chance of being selected
- Evaluation steps, to benchmark method performance at each task (N.B., even without any ground-truth reference available)
- Fine control at minimal effort, thanks to flexible Command Line Interfaces (CLIs) through which users may perform either individual analytical tasks or a complete analysis of their data, with minimal (or no) coding required
- Focus on gene modules and signatures scoring
- Focus on classification models, used to prioritize "distinguishing features" (i.e., transcriptional features with high discriminative power in classification tasks defined over categorical cell metadata) and to validate clustering results (*)
- Utilities for lentiviral-based single-cell lineage tracing (sclt) data analysis
- Scalability, to thousands of cells
- Automatic handling of folder creation/removal: individual CLIs create and populate folders at a user-defined location, and switch among analysis versions, without the user having to handle input/output operations manually
- Graphical User Interfaces (GUIs), to visually explore results
For the time being, the main Cellula workflow implements the following tasks:
- (By sample) cell and gene Quality Control (QC), followed by expression matrices merging (`qc.py`), data pre-processing (`pp.py`) and batch effects assessment (`kBET.py`)
- (Optional, if needed) correction of batch effects (`integration.py`), followed by the assembly of the final pre-processed data (`integration_evaluation.py`)
- (Leiden) cell clustering at multiple, tunable resolutions, coupled to cluster markers computation (`clustering.py`)
- Clustering solutions evaluation and choice (`clustering_diagnostics.py`)
- Signatures (i.e., gene sets, either defined by the user or retrieved by data-driven approaches) scoring (`signatures.py`)
- Distinguishing features ranking, through Differential Expression (DE) and classification methods (`dist_features.py`)
- Interactive exploration of the results (`cellula_app.py`)
Cellula has been designed for command-line usage. However, individual functions and classes can also be imported by users looking for even more flexibility.
Complete documentation (with tutorials and so on) will be provided once the project reaches sufficient stability to be packaged and released (we will get there soon :)).
For now, the following serves as a simple quickstart, including instructions to install (*) and run Cellula. To get a better understanding of individual CLIs, modules, classes and functions, consult the CLIs' help messages, source code comments and docstrings, and the (temporary) documentation provided in this repo.
Even though Cellula cannot yet be installed from source, it is already possible to download its code and make it work on a local machine or an HPC cluster with a few simple commands.
To do that, first clone this repo locally:

```bash
git clone [email protected]:andrecossa5/Cellula.git
# or git clone https://github.com/andrecossa5/Cellula.git
```

Then, `cd` to `./Cellula/envs` and create the conda environment for your operating system (N.B.: Cellula has been tested only on Linux and macOS machines. In `Cellula/envs` you can find different OSX and Linux `.yml` files, storing recipes for both OS environments. `mamba` is used here for performance reasons, but `conda` works fine as well).
For a Linux machine:

```bash
cd ./Cellula/envs
mamba env create -f environment_Linux.yml -n cellula_example
```

After that, you have to link the cloned repository path to your newly created environment:

```bash
mamba activate cellula_example
mamba develop . # you have to be in the cloned repo path
```

That's it. To check whether you are able to run Cellula's code, run the newly installed Python interpreter

```bash
python
```

and

```python
import Cellula
```

If you see no errors, you are ready to go.
To begin a new single-cell analysis, `cd` to a location of choice on your machine and create a new folder. This folder will host all data and results of your project. We will refer to this main folder by its absolute path, which we will assign to a bash environment variable, `$path_main`.

```bash
cd <your_choice_here>
mkdir $main_folder_name
cd $main_folder_name
path_main=`pwd`/
```

Once in `$path_main`, we need to set up this folder for the analysis. At the bare minimum, the user needs to create two new folders in `$path_main`, `matrices` and `data`:
- `matrices` hosts all sample matrices for the project (i.e., [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) or [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) outputs, including, for each sample, 3 files: `barcodes.tsv.gz`, `features.tsv.gz` and `matrix.mtx.gz`). If one has to deal with sclt data, each sample directory needs to store additional lentiviral-clone info (see below). In this repo, `test_data` contains a minimal example of a `matrices` folder, with data from two samples, a and b. Please follow the same directory structure to create your `matrices` folder with your data.
- `data` will host all the intermediate files from Cellula analysis steps. In the simplest case, one can just initialize this as an empty folder. However, one may want to include other project-specific data, e.g., a list of curated gene sets to score. In this repo, the `test_data` folder contains a simple example of how `data` needs to be structured in this case, with 6 manually curated gene sets stored in `data/curated_signatures` in `.txt` format (see the sketch right after this list for an assumed file layout). Please use the same folder structure with your data.
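For instance, to add your own gene set, one could create a new `.txt` file in `data/curated_signatures`. The layout below is an assumption (plain text, one gene symbol per line, as is common for curated signatures); check the files shipped in `test_data` for the exact format Cellula expects.

```bash
# hypothetical custom signature; assumed format: one gene symbol per line
cat > data/curated_signatures/My_signature.txt <<EOF
MKI67
TOP2A
CCNB1
EOF
```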
To set up your `$path_main` folder:

- `cd` to `$path_main`
- create and fill the `matrices` and `data` folders (for the following demo, just copy/link `test_data/matrices` and `test_data/data` to `$path_main`)
- `cd` to your Cellula repository clone, `cd` to the `scripts` folder, and run:

```bash
bash prepare_folder.sh $path_main
```

You should be able to see two new folders created at `$path_main`: `results_and_plots` and `runs`.
A properly configured `$path_main` folder for a Cellula analysis should look something like this (using `tree`):

```
├── data
│ ├── curated_signatures
│ │ ├── Inflammation.txt
│ │ ├── Invasion.txt
│ │ ├── Metastasis.txt
│ │ ├── Proliferation.txt
│ │ ├── Quiescence.txt
│ │ └── Stemness.txt
│ └── removed_cells
├── matrices
│ ├── a
│ │ └── filtered_gene_bc_matrix
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ └── b
│ └── filtered_gene_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── results_and_plots
│ ├── clustering
│ ├── dist_features
│ ├── pp
│ ├── signatures
│ └── vizualization
│ ├── QC
│ ├── clustering
│ ├── dist_features
│ ├── pp
│ └── signatures
└── runs
```

With `$path_main` correctly configured, we can proceed with the analysis.
We will first perform Quality Control and matrix pre-processing.
Note 1: In a single Cellula workflow (i.e., its CLI calls and results) one chooses a unique set of options for each task. These options will likely affect the final results. Therefore, one is commonly interested in varying them and comparing the resulting outputs without losing previously computed analyses. To this end, all Cellula CLIs have a `--version` (or `-v`) argument to activate and write to a specific version folder. This way, a single place (i.e., the main folder) can store and organize all the results obtained on the same data with different, user-defined strategies. Run the same task changing `-v` and see how the `$path_main` folder structure is modified (see the sketch below). We are currently implementing a new CLI to create a new branch starting from another, existing one (i.e., without having to re-run all the steps from `qc.py` onwards).
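As a sketch, using the same CLI calls that appear later in this quickstart, one could re-run the first steps under a second version name (the version name and the `--n_HVGs` value below are illustrative):

```bash
# illustrative: same workflow, different version name and n_HVGs value
python qc.py -p $path_main -v alternative --mode filtered --qc_mode seurat
python pp.py -p $path_main -v alternative --norm scanpy --n_HVGs 5000 --score scanpy --embs
```

Results for the 'default' and 'alternative' versions would then live side by side under `$path_main`, without overwriting each other.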
Note 2: One might want to inspect every output of a Cellula CLI before running the next one, or might want to run an entire analysis with the fewest CLI calls possible, inspecting results only at the end. In this quickstart, we propose a recipe for the second scenario, but human inspection is always encouraged (especially at this stage of the project).
For now, all CLIs must be called from the `Cellula/scripts` directory (i.e., one still has to `cd` to this folder to launch these scripts, including within a batch job on an HPC cluster).
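As an illustration only (the scheduler directives below are hypothetical and depend on your cluster; the `python` calls are the ones used in this quickstart), such a batch job could look like this:

```bash
#!/bin/bash
#SBATCH --job-name=cellula_demo    # hypothetical SLURM directives: adapt to your scheduler
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# activate the environment created during installation
# (assumes conda/mamba has been initialized in the batch shell)
mamba activate cellula_example

path_main=/path/to/your/project/   # your own $path_main
cd /path/to/Cellula/scripts        # CLIs must be launched from the scripts folder

python qc.py -p $path_main -v default --mode filtered --qc_mode seurat
python pp.py -p $path_main -v default --norm scanpy --n_HVGs 2000 --score scanpy --embs
# ...followed by the remaining CLI calls shown below
```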
To perform cell and gene QC and merge expression data, run:
```bash
python qc.py -p $path_main -v default --mode filtered --qc_mode seurat
```

Here we have explicitly activated the 'default' version. You should see two newly created files in `data/default`, `QC.h5ad` and `cells_meta.csv`. At this point, you have two choices:
- Go on with pre-processing (as we will do in this demo), using the default cell metadata
- Format your cell metadata before pre-processing, adding/removing columns to `cells_meta.csv`. This is important for complex single-cell studies in which cells/samples are grouped by a number of categorical covariates dependent on the study design. The newly formatted `cells_meta.csv` file will be read by `pp.py` afterwards if `--custom_meta` is specified (see the sketch right after this list).
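As a minimal sketch (the column names below are hypothetical and depend on your study design; only `pandas` is assumed), one could add a custom covariate to `cells_meta.csv` before re-running pre-processing with `--custom_meta`:

```python
import pandas as pd

# read the cell metadata written by qc.py (assuming cell barcodes sit in the first column)
meta = pd.read_csv('data/default/cells_meta.csv', index_col=0)

# add a hypothetical study-design covariate, here derived from an assumed 'sample' column
meta['treatment'] = meta['sample'].map({'a': 'untreated', 'b': 'treated'})

# overwrite the file so that pp.py can pick it up when --custom_meta is specified
meta.to_csv('data/default/cells_meta.csv')
```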
For this demo, we will go with the default cell metadata, and run:

```bash
python pp.py -p $path_main -v default --norm scanpy --n_HVGs 2000 --score scanpy --embs
```

After pre-processing, in this case, we will skip the batch effects evaluation and data integration sections, as the a and b samples come from the same experiment, lab and sequencing run (tutorials on how to handle more complicated situations, leveraging Cellula's functionalities in full, will soon be available). Here, we will choose to retain the original 'PCA' embedding, obtained by reducing (and scaling) the full gene expression matrix to the top 2000 highly variable genes (HVGs), a common choice in single-cell analysis (see the `pp.py`, `kBET.py` and integration scripts for further details and alternatives). This data representation will be used for kNN construction, multiple-resolution clustering and markers computation. All clustering solutions will then be evaluated for their quality. These three steps (i.e., the choice of a cell representation to go with, clustering, and initial clustering diagnostics) can be obtained by running:
```bash
python integration_diagnostics.py -p $path_main -v default --chosen scaled:original
python clustering.py -p $path_main -v default --range 0.2:1.0 --markers
python clustering_diagnostics.py -p $path_main -v default
```

The user can inspect the clustering and clustering visualization folders to examine the properties of the "best" clustering solutions obtained, and then choose one to perform the last steps of the Cellula workflow. In this case, we will select the 30_NN_30_0.29 solution.

```bash
python clustering_diagnostics.py -p $path_main -v default --chosen 30_NN_30_0.29
```

Lastly, we will retrieve and score potentially meaningful gene sets in our data, and we will search for features (i.e., single genes, Principal Components or gene set scores) able to distinguish groups of cells in our data. First, we will retrieve and score gene sets with:
```bash
python signatures.py -p $path_main -v default --Hotspot --barkley --wu --curated --scoring scanpy
```

Then, we will look for distinguishing features. Specifically, here we will look for features discriminating individual samples and Leiden clusters (from the chosen solution) with respect to all other cells. We will make use of DE and classification models for both tasks. To do that, we need to pass a configuration file to `dist_features.py`, encoding all the info needed to retrieve the cell groups and specifying the types of features and models to use to rank distinguishing features.
For this demo, we will pass the example configuration file stored in `test_data/contrasts`, `sample_and_leiden.yml`. `dist_features.py` looks for `.yml` files in `$path_main/contrasts/`, so:
- Create a `contrasts` folder in `$path_main`
- Copy `test_data/contrasts/sample_and_leiden.yml` into `$path_main/contrasts/` (e.g., with the commands below)
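For example (assuming the current directory is `$path_main` and `<path-to-your-Cellula-clone>` stands for your local repository clone):

```bash
mkdir -p $path_main/contrasts
cp <path-to-your-Cellula-clone>/test_data/contrasts/sample_and_leiden.yml $path_main/contrasts/
```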
and run:

```bash
python dist_features.py -p $path_main -v default --contrasts sample_and_leiden
```

If you want to explore other distinguishing features, just create and pass your own file: arbitrary analyses can be specified by changing the provided `.yml` file.
For example, consider the case in which one is interested in the features distinguishing cells of clusters 0 and 1 in sample a from those of clusters 0 and 1 in sample b, using both DE and classification, and all the available feature types. The related `.yml` file would look something like this:

```yml
custom: # Contrast "family" name
  <example_query>: # Name you would like to give to the new contrast
    query:
      a: leiden in ["0", "1"] & sample == "a" # Cell groups. <name> : string eval expression
      b: leiden in ["0", "1"] & sample == "b"
    methods: # Methods used
      DE: wilcoxon # Uses only genes, by default
      ML:
        features: # Features to use as input of ML models
          - genes # may be added, but requires >> time in full mode
          - PCs
          - signatures # (i.e., gene sets from signatures.py)
        models: # Classifiers used
          - logit
          - xgboost
        mode: fast # Training mode. For a full hyperparameters optimization, write 'full' here
```

Save your custom `.yml` files in `$path_main/contrasts/` and pass them to `dist_features.py` to run your analyses.
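For instance, assuming the file above was saved as `$path_main/contrasts/custom.yml` (the file name is up to you), and assuming `--contrasts` takes the `.yml` base name as in the demo above, one would run:

```bash
# hypothetical file name; --contrasts is assumed to take the .yml base name
python dist_features.py -p $path_main -v default --contrasts custom
```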
Lastly, Cellula comes with two streamlit GUIs to interactively explore its outputs. This may also be useful for exploration by non-computational users. Indeed, once Cellula has been set up (see the Installation paragraph) and run on some data (see the demo above), results can be shared and queried as follows (always within the same environment created during installation):

- `cd` to your Cellula repository clone, `cd` to the `scripts` folder, and run:

  ```bash
  python prepare_archive.py -p $path_main -n <your-project-name-here>
  ```

  This will generate a `<your-project-name-here>.tar.gz` file in `$path_main` that contains all the info needed by the GUIs.
- Upload the `<your-project-name-here>.tar.gz` file to `<some-path-here>` and un-tar the archive:

  ```bash
  tar -xf <your-project-name-here>.tar.gz
  rm <your-project-name-here>.tar.gz
  ```

- In the same environment, `cd` to the locally cloned Cellula repo, `cd` to `apps` and launch one of the two GUIs by running:

  ```bash
  streamlit run cellula_app.py <some-path-here>
  ```

A multi-page GUI will automatically start in your web browser.
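N.B.: depending on your streamlit version, arguments intended for the app script may need to be separated from streamlit's own options with `--`, e.g.:

```bash
streamlit run cellula_app.py -- <some-path-here>
```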
This repository is organized as follows:

```
.
├── Cellula
├── apps
├── docs
├── envs
├── scripts
└── tests
```

- `envs` contains the `.yml` files of the conda environments needed for package setup.
- `docs` contains all documentation files.
- `tests` contains all package unit tests.
- `apps` contains the `.py` scripts that launch the streamlit GUIs.
- `scripts` contains all the CLIs that make up the Cellula workflow.
- `Cellula` contains all the modules needed by `scripts`.
- This is still a preliminary version of this project, undergoing major and minor refactoring.
- The `Cellula.drawio.html` sketch represents the data flow across Cellula CLIs, along with their dependencies.
- `tests`, `docs` and `setup.py` have yet to be fully implemented.