diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/DEanalysis/de_rstudio.md
new file mode 100644
index 000000000..f3adcc30b
--- /dev/null
+++ b/docs/usage/DEanalysis/de_rstudio.md
@@ -0,0 +1,609 @@
+---
+order: 4
+---
+
+# Differential Analysis with DESeq2
+
+In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.
+
+## Launching the RStudio environment
+
+Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
+
+```bash
+sudo rstudio-server start && sleep 5000
+```
+
+A pop-up will appear and by clicking on **Open**, we will be redirected to the RStudio login page. By inserting the username and the password reported below, we will be able to connect to RStudio:
+
+```bash
+Username: gitpod
+Password: pass
+```
+
+:::note
+Using `sleep` will keep the Gitpod session active, preventing disconnection and providing enough time to complete our analysis without interruptions
+:::
+
+## Differential Expression Analysis
+
+As in all analysis, firstly we need to create a new project:
+
+1. Go to the **File** menu and select **New Project**;
+
+2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
+
+
+ { width="400" }
+
+
+3. The new project will be automatically opened in RStudio.
+
+We can check whether we are in the correct working directory with `getwd()`. The path `/workspace/gitpod/training/DE_analysis/` should appear on your console.
+To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type:
+
+```r
+#### Differential expression analysis with DESeq2 ####
+```
+
+and save the file as **de_script.R**. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this:
+
+
+ { width="600" }
+
+
+The analysis requires several R packages. To utilise them, we need to load the following libraries:
+
+```r
+#### Loading libraries ####
+
+# tidyverse: collection of R packages for data manipulation, visualization and modeling
+
+library("tidyverse")
+
+# DESeq2: package for differential gene expression analysis
+
+library("DESeq2")
+
+# pheatmap: package for creating heatmaps, which will be used to visualise the results
+
+install.packages("pheatmap") # To install the package missing in the current RStudio env
+
+library("pheatmap")
+
+# RColorBrewer: package for creating color palettes, which will be used to customise the heatmaps
+
+library("RColorBrewer")
+
+# ggrepel: package that provides geoms for ggplot2 to repel overlapping text labels in the plots
+
+library("ggrepel")
+```
+
+and the pre-computed DESeq2 object (`dds`) generated by the nfcore/rnaseq pipeline. In this tutorial we will analyse the `dds` object generated by running the alignment with **STAR** and the quantification with **Salmon**:
+
+```r
+#### Import the dds obtained from nfcore/rnaseq ####
+
+load("/workspace/gitpod/training/results_star_salmon/star_salmon/deseq2_qc/deseq2.dds.RData")
+```
+
+Alternatively, a user could choose to analyse the the `dds` object generated by running only **Salmon** for both lightweight alignment and quantification.
+
+In DESEq2, the `dds` object is a central data structure that contains the following components:
+
+- `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample;
+
+- `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata;
+
+- `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2 fold change.
+
+All these components can be checked with specific commands:
+
+```r
+#### dds inspection ####
+
+head(counts(dds)) # to check the raw counts
+
+colData(dds) # to check the sample info
+
+design(dds) # to check the design formula
+```
+
+The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata.
+
+```r
+#### Creation of metadata starting from the dds colData ####
+
+metadata <- DataFrame(
+ sample = colData(dds)$sample,
+ condition = colData(dds)$Group1,
+ replica = colData(dds)$Group2
+)
+
+# Assign names to rows of metadata
+
+rownames(metadata) <- colnames(counts(dds))
+
+# Fill the dds colData with the generated metadata
+
+colData(dds) <- metadata
+```
+
+:::note
+With this operation we also eliminate the `sizeFactors` already estimated by the nfcore/rnaseq pipeline.
+:::
+
+To avoid errors in DESeq2 is essential to check that sample names match between the `colData` and the `countData`, and that the sample are in the correct order:
+
+```r
+#### Check that sample names match in both files ####
+
+all(colnames(dds$counts) %in% rownames(metadata)) # Must be TRUE
+
+all(colnames(dds$counts) == rownames(metadata)) # Must be TRUE
+```
+
+Now that everything is setted up, we can proceed to generate a new DESeq2 object with the corrected metadata and the right design:
+
+```r
+#### Creation of a new dds ####
+
+dds_new <- DESeqDataSet(dds, design = ~ condition)
+
+# dds inspection
+
+head(counts(dds_new)) # to check the raw counts
+
+colData(dds_new) # to check the sample info
+
+design(dds_new) # to check the design formula
+```
+
+Comparing the structure of the newly created dds (`dds_new`) with the one automatically produced by the pipeline (`dds`), we can observe the differences:
+
+
+ { width="400" }
+
+
+Before running the different steps of the analysis, a good practice consists in pre-filtering the genes to remove those with very low counts. This is useful to improve computional efficiency and enhance interpretability. In general, it is reasonable to keep only genes with a sum counts of at least 10 for a minimal number of 3 samples:
+
+```r
+#### Pre-filtering ####
+
+# Select a minimal number of samples = 3
+
+smallestGroupSize <- 3
+
+# Select genes with a sum counts of at least 10 in 3 samples
+
+keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize
+
+# Keep only the genes that pass the threshold
+
+dds_filtered <- dds_new[keep,]
+```
+
+Now, it is time to run the differential expression analysis with the `DESeq()` function:
+
+```r
+#### Run the DESeq2 analysis ####
+
+dds_final <- DESeq(dds_filtered)
+```
+
+The `DESeq()` function is a high-level wrapper that simplifies the process of differential expression analysis by combining multiple steps into a single function call. This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include:
+
+- `estimateSizeFactors`: to normalise the count data;
+
+- `estimateDispersion`: to estimate the dispersion;
+
+- `nbinomWaldTest`: to perform differential expression test.
+
+The individual functions can be carried out also singularly as shown below:
+
+```r
+#### Differential expression analysis step-by-step ####
+
+dds_final <- estimateSizeFactors(dds_filtered)
+
+dds_final <- estimateDispersions(dds_final)
+
+dds_final <- nbinomWaldTest(dds_final)
+```
+
+The next step in the DESeq2 workflow is to perform quality control (QC) analysis on our data. This analysis is crucial for identifying potential issues ensuring that the data are suitable for downstream analysis. For QC analysis, it is useful to work with transformed versions of the count data, `variance-stabilised (vst)` or `regularised log-transformed (rlog)` counts. While, the rlog is more robust to outliers and extreme values, vst is computationally faster and so preferred for larger dataset.
+
+:::warning
+These transformations are used for visualisation purposes, while DESeq2 requires raw counts for differential expression analysis.
+:::
+
+```r
+#### Transform normalised counts for data visualisation ####
+# A user can choose among vst and rlog. In this tutorial we will work with rlog transformed data.
+
+rld <- rlog(dds_final, blind = TRUE)
+```
+
+The `rlog` and the `vst` transformations have an argument, **blind** that can be set to:
+
+- **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;
+
+- **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.
+
+Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input.
+Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA:
+
+```r
+#### Plot PCA ####
+
+pca_plot <- plotPCA(rld, intgroup = "condition")
+
+# Save the plot
+
+ggsave("de_results/pca_plot.png", plot = pca_plot, width = 6, height = 5, dpi = 300)
+```
+
+The second essential step in QC analysis is hierarchical clustering. Although DESeq2 does not have a built-in function for this analysis, we can use the `pheatmap()` function from the **pheatmap** package.
+We will extract the matrix of rlog-transformed counts from the `rld` object (pheatmap input), compute pairwise correlations and plot the heatmap:
+
+```r
+#### Plot sample to sample distance (hierarchical clustering) ####
+
+# Extract the matrix of rlog-transformed counts from the rld object
+
+sampleDists <- dist(t(assay(rld))) # Calculate pairwise distances between samples using the dist() function with Euclidean distance as the default method. By transposing the matrix with t(), we ensure that samples become rows and genes become columns, so that the dist function computes pairwise distances between samples.
+
+# Convert distances to a matrix
+
+sampleDistMatrix <- as.matrix(sampleDists)
+
+# Set the row and column names of the distance matrix
+
+rownames(sampleDistMatrix) <- paste(rld$condition, rld$replica, sep = "_")
+
+colnames(sampleDistMatrix) <- paste(rld$condition, rld$replica, sep = "_")
+
+# Define a color palette for the heatmap
+
+colors <- colorRampPalette(rev(brewer.pal(9, "Greens")))(255) # function from RColorBrewer package
+
+# Create the heatmap
+
+clustering_plot <- pheatmap(sampleDistMatrix,
+ clustering_distance_rows = sampleDists,
+ clustering_distance_cols = sampleDists,
+ col = colors,
+ fontsize_col = 8,
+ fontsize_row = 8)
+
+# Save the plot
+
+ggsave("de_results/clustering_plot.png", plot = clustering_plot, width = 6, height = 5, dpi = 300)
+```
+
+The normalised counts stored in the `dds` can be inspected with the `counts()` function and saved in our results folder:
+
+```r
+#### Inspect the normalised counts ####
+
+# Display the first few rows of the normalised counts to inspect the data
+
+head(counts(dds_final, normalized = TRUE))
+
+# Display the first few rows of the raw counts (not normalised) to compare with the normalised counts
+
+head(counts(dds_final))
+
+# Convert the normalised counts from the DESeq2 object to a tibble
+
+normalised_counts <- as_tibble(counts(dds_final, normalized = TRUE))
+
+# Add a column for gene names to the normalised counts tibble
+
+normalised_counts$gene <- rownames(counts(dds_final))
+
+# Relocate the gene column to the first position
+
+normalised_counts <- normalised_counts %>%
+ relocate(gene, .before = control_rep1)
+
+# Save the normalised counts
+
+write.csv(normalised_counts, file = "de_results/normalised_counts.csv")
+```
+
+The `results()` function in DESeq2 is used to extract the results of the DE analysis. This function takes the `dds` object as input and returns a DataFrame containing the results of the analysis:
+
+- **baseMean**: the average expression level of the gene across all samples;
+
+- **log2FoldChange**: the log2 fold change of the gene between the condition of interest and the reference level;
+
+- **lfcSE**: the standard error of the log2 fold change;
+
+- **stat**: the Wald statistic, which is used to calculate the p-value;
+
+- **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis --> the gene is differentially expressed;
+
+- **padj**: the adjusted p-value, which takes into account multiple testing corrections, (Benjamini-Hochberg method default) to control the false discovery rate;
+
+The `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast (specific comparison between two or more levels).
+
+:::tip
+The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level.
+:::
+
+Notice that in this tutorial the contrast is already correctly specified.
+
+```r
+#### Extract results table from the dds object ####
+
+res <- results(dds_final)
+
+# Visualise the results
+
+head(res)
+
+# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count
+
+summary(res)
+
+# DESeq2 function to extract the name of the contrast
+
+resultsNames(dds_final)
+
+# res <- results(dds, contrast = c("design_formula", "condition_of_interest", "reference_condition"))
+# Command to set the contrast, if necessary
+
+# Store the res object inside another variable because the original res file will be required for other functions
+
+res_viz <- res
+
+# Add gene names as a new column to the results table
+
+res_viz$gene <- rownames(res)
+
+# Convert the results to a tibble for easier manipulation and relocate the gene column to the first position
+
+res_viz <- as_tibble(res_viz) %>%
+ relocate(gene, .before = baseMean)
+
+# Save the results table
+
+write.csv(res_viz, file = "de_results/de_result_table.csv")
+```
+
+In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.
+
+```r
+#### Extract significant DE genes from the results ####
+
+# Filter the results to include only significantly DE genes with a padj less than 0.05 and a log2FoldChange of at least 1 or -1
+
+resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1)
+
+# Convert the results to a tibble for easier manipulation and relocate the gene column to the first position
+
+resSig <- as_tibble(resSig) %>%
+ relocate(gene, .before = baseMean)
+
+# Order the significant genes by their adjusted p-value (padj) in ascending order
+
+resSig <- resSig[order(resSig$padj),]
+
+# Display the final results table of significant genes
+
+resSig
+
+# Save the significant DE genes
+
+write.csv(resSig, file = "de_results/sig_de_genes.csv")
+```
+
+## Plot the results
+
+Now that we have obtained the results of the differential expression analysis, it's time to visualise the data to gain a deeper understanding of the biological processes that are affected by the experimental conditions. Visualisation is a crucial step in RNA-seq analysis, as it allows us to identify patterns and trends in the data that may not be immediately apparent from the numerical results.
+
+In the following sections, we will explore different types of plots that are commonly used to visualise the results of RNA-seq analysis, including:
+
+- **MA plot**: scatter plot commonly utilised to visualise the results of the DE analysis for all the samples. The plot displays the mean of the normalised counts on the x-axis and the log2 fold change on the y-axis. This allows the visualisation of the relationship between the magnitude of the fold change and the mean expression level of the genes. Genes that are differentially expressed will appear farthest from the horizontal line, while genes with low expression levels will appear closer to the line.
+
+```r
+#### MA plot ####
+
+# The MA plot is not a ggplot, so we have to save it in a different way
+
+# Open a graphics device to save the plot as a PNG file
+
+png("MA_plot.png", width = 1500, height = 100, res = 300)
+
+# Generate the MA plot (it will be saved to the file instead of displayed on screen)
+
+plotMA(res, ylim = c(-2, 2))
+
+# Close the device to save the file
+
+dev.off()
+```
+
+- **counts plot**: plot of the normalised counts for a single gene across the different conditions in your experiment. It’s particularly useful for visualising the expression levels of specific genes of interest and comparing them across sample groups.
+
+```r
+#### Plot a specific gene in this case ENSG00000142192, a DE gene ####
+
+png("de_results/plotCounts.png", width = 1000, height = 1200, res = 300)
+
+plotCounts(dds_final, gene = "ENSG00000142192")
+
+dev.off()
+```
+
+**heatmap**: plot of the normalised counts for all the significant genes obtained with the `pheatmap()` function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone.
+
+```r
+#### Heatmap ####
+
+# Extract only the first column (gene names) from the result object containing the significant genes
+
+significant_genes <- resSig[, 1]
+
+# Extract normalised counts for significant genes from the normalised counts matrix and convert the gene column to row names
+
+significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>%
+ column_to_rownames("gene")
+
+# Create the heatmap using pheatmap
+
+heatmap <- pheatmap(significant_counts,
+ cluster_rows = TRUE,
+ fontsize = 8,
+ scale = "row",
+ fontsize_row = 8,
+ height = 10)
+
+# Save the plot
+
+ggsave("de_results/heatmap.png", plot = heatmap, width = 6, height = 5, dpi = 300)
+
+```
+
+- **volcano plot**: scatter plot that displays the log2 fold change on the x-axis and the log transformed padj on the y-axis. This allows for the visualisation of both the magnitude and significance of the changes in gene expression between two conditions. Genes that are differentially expressed (i.e., have a large log2 fold change) and are statistically significant (i.e., have a low padj) will appear in the left (downregulated genes) or in the right (upregulated genes) corners of the plot making easier their identification.
+
+```r
+#### Volcano plot ####
+
+# Convert the results to a tibble and add a column indicating differential expression status
+
+res_tb <- as_tibble(res) %>%
+ mutate(diffexpressed = case_when(
+ log2FoldChange > 1 & padj < 0.05 ~ 'upregulated',
+ log2FoldChange < -1 & padj < 0.05 ~ 'downregulated',
+ TRUE ~ 'not_de'))
+
+# Add a new column with gene names
+
+res_tb$gene <- rownames(res)
+
+# Relocate the gene column to the first position
+
+res_tb <- res_tb %>%
+ relocate(gene, .before = baseMean)
+
+# Order the table by padj and add a new column for gene labels
+
+res_tb <- res_tb %>% arrange(padj) %>%
+ mutate(genelabels = "")
+
+# Label the top 5 most significant genes
+
+res_tb$genelabels[1:5] <- res_tb$gene[1:5]
+
+# Create a volcano plot using ggplot2
+
+volcano_plot <- ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) +
+ geom_point(size = 0.6) +
+ geom_text_repel(aes(label = genelabels), size = 2.5, max.overlaps = Inf) +
+ ggtitle("DE genes treatment versus control") +
+ geom_vline(xintercept = c(-1, 1), col = "black", linetype = 'dashed', linewidth = 0.2) +
+ geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed', linewidth = 0.2) +
+ theme(plot.title = element_text(size = rel(1.25), hjust = 0.5),
+ axis.title = element_text(size = rel(1))) +
+ scale_color_manual(values = c("upregulated" = "red",
+ "downregulated" = "blue",
+ "not_de" = "grey")) +
+ labs(color = 'DE genes') +
+ xlim(-3,5)
+
+# Save the plot
+
+ggsave("de_results/volcano_plot.png", plot = volcano_plot, width = 6, height = 5, dpi = 300)
+```
+
+## Functional analysis
+
+The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways), and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
+
+In this tutorial, we will explore an enrichment analysis technique known as Over-Representation Analysis (ORA), a powerful tool for identifying biological pathways or processes significantly enriched within the list of DE genes. The underlying statistic behind ORA is the **hypergeometric test**, which considers three key components:
+
+- **Universe**: the background list of genes (for example the genes annotated in a genome);
+
+- **GeneSet**: a collection of genes annotated by a reference database (such as Gene Ontology), and known to be involved in a particular biological pathway or process;
+
+- **Gene List**: the differentially expressed genes.
+
+The hypergeometric test calculates the probability of observing a certain number of genes from the gene set (pathway or process) within the gene list (DE genes) by chance.
+An important aspect of this analysis is the concept of **membership**. It defines the relationship between DE genes and genes from the analysed gene set. By knowing which genes belong to which pathway/process, we can determine whether the observed overlap between DE genes and the particular pathway/process is greater than what would be expected by random chance.
+
+```r
+#### Enrichment analysis (ORA) ####
+
+# Loading libraries
+
+# clusterProfiler: package for enrichment analysis
+
+library(clusterProfiler)
+
+# org.Hs.eg.db: package for the human gene annotation database
+
+library(org.Hs.eg.db)
+
+# cowplot: package for combining multiple plots
+
+install.packages("cowplot") # To install the package missing in the current RStudio env
+
+library(cowplot)
+
+# Prepare gene list
+# Extract the log2 fold change values from the results data frame
+
+gene_list <- res$log2FoldChange
+
+# Name the vector with the corresponding gene identifiers
+
+names(gene_list) <- res$gene
+
+# Sort the list in decreasing order (required for clusterProfiler)
+
+gene_list <- sort(gene_list, decreasing = TRUE)
+
+# Extract the significantly differentially expressed genes from the results data frame
+
+res_genes <- resSig$gene
+
+# Run GO enrichment analysis using the enrichGO function
+
+go_enrich <- enrichGO(
+ gene = res_genes, # Genes of interest
+ universe = names(gene_list), # Background gene set
+ OrgDb = org.Hs.eg.db, # Annotation database
+ keyType = 'ENSEMBL', # Key type for gene identifiers
+ readable = TRUE, # Convert gene IDs to gene names
+ ont = "ALL", # Ontology: can be "BP", "MF", "CC", or "ALL"
+ pvalueCutoff = 0.05, # P-value cutoff for significance
+ qvalueCutoff = 0.10 # Q-value cutoff for significance
+)
+
+# Create a bar plot of the top enriched GO terms
+
+barplot <- barplot(
+ go_enrich,
+ title = "Enrichment analysis barplot",
+ font.size = 8
+)
+
+# Create a dot plot of the top enriched GO terms
+
+dotplot <- dotplot(
+ go_enrich,
+ title = "Enrichment analysis dotplot",
+ font.size = 8
+)
+
+# Combine the bar plot and dot plot into a single plot grid
+
+go_plot <- plot_grid(barplot, dotplot, col = 2)
+
+# Save the plot
+
+ggsave("de_results/go_plot.png", plot = go_plot, width = 13, height = 6, dpi = 300)
+```
diff --git a/docs/usage/DEanalysis/img/DESeq_function.png b/docs/usage/DEanalysis/img/DESeq_function.png
new file mode 100644
index 000000000..0d4abd5fb
Binary files /dev/null and b/docs/usage/DEanalysis/img/DESeq_function.png differ
diff --git a/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png b/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png
new file mode 100644
index 000000000..f90da8214
Binary files /dev/null and b/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png differ
diff --git a/docs/usage/DEanalysis/img/MA_plot.png b/docs/usage/DEanalysis/img/MA_plot.png
new file mode 100644
index 000000000..2891bfd4a
Binary files /dev/null and b/docs/usage/DEanalysis/img/MA_plot.png differ
diff --git a/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png b/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png
new file mode 100644
index 000000000..c278dffe2
Binary files /dev/null and b/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png differ
diff --git a/docs/usage/DEanalysis/img/count_distribution.png b/docs/usage/DEanalysis/img/count_distribution.png
new file mode 100644
index 000000000..213ea68ab
Binary files /dev/null and b/docs/usage/DEanalysis/img/count_distribution.png differ
diff --git a/docs/usage/DEanalysis/img/dds_comparison.png b/docs/usage/DEanalysis/img/dds_comparison.png
new file mode 100644
index 000000000..cc940808a
Binary files /dev/null and b/docs/usage/DEanalysis/img/dds_comparison.png differ
diff --git a/docs/usage/DEanalysis/img/dispersion_estimates.png b/docs/usage/DEanalysis/img/dispersion_estimates.png
new file mode 100644
index 000000000..2b50a84ff
Binary files /dev/null and b/docs/usage/DEanalysis/img/dispersion_estimates.png differ
diff --git a/docs/usage/DEanalysis/img/enrichment_plot.png b/docs/usage/DEanalysis/img/enrichment_plot.png
new file mode 100644
index 000000000..f073bdc3f
Binary files /dev/null and b/docs/usage/DEanalysis/img/enrichment_plot.png differ
diff --git a/docs/usage/DEanalysis/img/heatmap_de_genes.png b/docs/usage/DEanalysis/img/heatmap_de_genes.png
new file mode 100644
index 000000000..656e00934
Binary files /dev/null and b/docs/usage/DEanalysis/img/heatmap_de_genes.png differ
diff --git a/docs/usage/DEanalysis/img/hierarchical_clustering.png b/docs/usage/DEanalysis/img/hierarchical_clustering.png
new file mode 100644
index 000000000..ece9bc431
Binary files /dev/null and b/docs/usage/DEanalysis/img/hierarchical_clustering.png differ
diff --git a/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png b/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png
new file mode 100644
index 000000000..0dbf23f81
Binary files /dev/null and b/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png differ
diff --git a/docs/usage/DEanalysis/img/overdispersion.png b/docs/usage/DEanalysis/img/overdispersion.png
new file mode 100644
index 000000000..0ba5bae17
Binary files /dev/null and b/docs/usage/DEanalysis/img/overdispersion.png differ
diff --git a/docs/usage/DEanalysis/img/pca_plot.png b/docs/usage/DEanalysis/img/pca_plot.png
new file mode 100644
index 000000000..73008fe90
Binary files /dev/null and b/docs/usage/DEanalysis/img/pca_plot.png differ
diff --git a/docs/usage/DEanalysis/img/plotCounts.png b/docs/usage/DEanalysis/img/plotCounts.png
new file mode 100644
index 000000000..2865d0575
Binary files /dev/null and b/docs/usage/DEanalysis/img/plotCounts.png differ
diff --git a/docs/usage/DEanalysis/img/project_R.png b/docs/usage/DEanalysis/img/project_R.png
new file mode 100644
index 000000000..6514107a4
Binary files /dev/null and b/docs/usage/DEanalysis/img/project_R.png differ
diff --git a/docs/usage/DEanalysis/img/volcanoplot.png b/docs/usage/DEanalysis/img/volcanoplot.png
new file mode 100644
index 000000000..fa72cb9b5
Binary files /dev/null and b/docs/usage/DEanalysis/img/volcanoplot.png differ
diff --git a/docs/usage/DEanalysis/img/workdir_RStudio.png b/docs/usage/DEanalysis/img/workdir_RStudio.png
new file mode 100644
index 000000000..06b0374ab
Binary files /dev/null and b/docs/usage/DEanalysis/img/workdir_RStudio.png differ
diff --git a/docs/usage/DEanalysis/index.md b/docs/usage/DEanalysis/index.md
new file mode 100644
index 000000000..56d9d5e24
--- /dev/null
+++ b/docs/usage/DEanalysis/index.md
@@ -0,0 +1,44 @@
+---
+order: 1
+---
+
+# Welcome
+
+These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).
+
+In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. You will learn how to use the rnaseq pipeline to carry out this data-intensive workflow efficiently. We will cover topics such as configuration of the pipeline, code execution and data interpretation.
+
+Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow.
+
+By the end of this workshop, you will be able to:
+
+- analyse simple NGS datasets with the nf-core/rnaseq workflow
+- understand the key concepts behind RNAseq differential expression analysis
+- customise some of its features for your own analyses
+- integrate different sources of information to interpret the results
+
+Let's get started!
+
+## Running with Gitpod
+
+In order to run this using GitPod, please make sure:
+
+1. You have a GitHub account: if not, create one [here](https://github.com/signup)
+2. Once you have a GitHub account, sign up for GitPod using your GitHub user [here](https://gitpod.io/login/) choosing "continue with GitHub".
+
+Now you're all set and can use the following button to launch the service:
+
+[](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial)
+
+## Additional documentation
+
+- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
+- You can find additional training on [these pages](https://training.nextflow.io)
+
+## Credits & Copyright
+
+This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024. Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
+
+The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community.
+
+All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/).
diff --git a/docs/usage/DEanalysis/interpretation.md b/docs/usage/DEanalysis/interpretation.md
new file mode 100644
index 000000000..476f627cd
--- /dev/null
+++ b/docs/usage/DEanalysis/interpretation.md
@@ -0,0 +1,91 @@
+---
+order: 5
+---
+
+# Interpretation
+
+Once DE genes have been identified, the next crucial step is to interpret the results. This involves the inspection of tables and plots generated during the analysis to understand the biological significance of the data. In this part of the tutorial, we will explore the results by discussing the significant DE genes and we will examine various plots generated during the analysis.
+
+:::note
+The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study.
+:::
+
+## Quality control plots
+
+The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: `sample`, `condition`, and `replica`. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age, and developmental stage.
+
+
+ { width="400" }
+
+
+By plotting the PCA on the PC1 and PC2 axes, using `condition` as the main variable of interest, we can quickly identify the primary source of variation in our data. By accounting for this variation in our design model, we should be able to detect more differentially expressed genes related to `condition`. When working with real data, it's often useful to plot the data using different variables to explore how much variation is explained by the first two PCs. Depending on the results, it may be informative to examine variation on additional PC axes, such as PC3 and PC4, to gain a more comprehensive understanding of the data.
+
+Next, we will examine the hierarchical clustering plot to explore the relationships between samples based on their gene expression profiles. The heatmap is organized such that samples with similar expression profiles are close to each other, allowing us to identify patterns in the data.
+
+
+ { width="400" }
+
+
+Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples.
+
+Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis.
+
+## Differential expression results
+
+In this part of the tutorial, we will examine plots that are generated after the differential expression analysis. These plots are not quality control plots, but rather plots that help us to interpret the results.
+After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot.
+
+
+ { width="500" }
+
+
+By default, genes are coloured in blue if the padj is less than 0.1 and the log2 fold change greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. At this stage, we have not yet applied a filter to select only significant DE genes, which we define as those with a padj value less than 0.5 and a log2 fold change of at least 1 or -1.
+
+After filtering our genes of interest according to our threshold, let's have a look to our significatnt genes:
+
+```tsv
+gene baseMean log2FoldChange lfcSE stat pvalue padj
+ENSG00000205726 121645.5908 2.894480 0.1515387 19.100600 2.496005e-81 5.840651e-79
+ENSG00000142192 51065.3192 3.025489 0.1891258 15.997230 1.335883e-57 1.562983e-55
+ENSG00000142156 20805.8078 2.977705 0.2159277 13.790287 2.915972e-43 2.274458e-41
+ENSG00000159231 458.9277 -1.194777 0.3058100 -3.906926 9.347790e-05 5.468457e-03
+ENSG00000156282 481.7624 1.095272 0.2969594 3.688289 2.257672e-04 1.056590e-02
+```
+
+To gain a comprehensive overview of the transcriptional profile, the volcano plot represents a highly informative tool.
+
+
+ { width="400"}
+
+
+The treatment induced differential expression in five genes: one downregulated and four upregulated. This plot visually represents the numerical results reported in the table above.
+
+After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.
+
+
+ { width="400" }
+
+
+In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.
+
+Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.
+
+
+ { width="400" }
+
+
+By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualisation allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.
+
+## Over Representation Analysis (ORA)
+
+Finally, we can attempt to assign biological significance to our differentially expressed genes through **Over Representation Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes.
+
+
+ { width="400" }
+
+
+The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.
+
+## Conclusions
+
+In this tutorial, we have walked through the steps of the RNA-seq analysis, from launching the nfcore/rnaseq pipeline to interpreting differential expression results. You learned how the data are generated, identified differentially expressed genes, and conducted enrichment analysis. By following this tutorial, you should now be able to use the nfcore/rnaseq pipeline and perform differential expression analysis with DESeq2, interpreting the results within the context of your research.
diff --git a/docs/usage/DEanalysis/rnaseq.md b/docs/usage/DEanalysis/rnaseq.md
new file mode 100644
index 000000000..9fd0e123e
--- /dev/null
+++ b/docs/usage/DEanalysis/rnaseq.md
@@ -0,0 +1,138 @@
+---
+order: 3
+---
+
+# The nf-core/rnaseq pipeline
+
+In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnaseq](https://nf-co.re/rnaseq/3.14.0).
+
+## Overview
+
+The pipeline is organised following the diffent blocks shown below: pre-processing, traditional alignment (or lightweight alignment) and quantification, post-processing and final QC.
+
+
+ { width="1000"}
+
+
+In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step:
+
+- traditional alignment and quantification (stage 2);
+
+- lightweight alignment and quantification (stage 3).
+
+## Experimental Design
+
+The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression.
+
+:::warning
+This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time.
+:::
+
+To obtain optimal results, it is crucial to balance the number of biological replicates and the sequencing depth. While increasing the depth of sequencing enhances the ability to detect genes with low expression levels, there is a plateau beyond which no further benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps to establish a suitable log2 fold change threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results, and increasing the reliability of the findings.
+
+## Library design
+
+RNA-seq library design involves important decisions, particularly the choice between paired-end and single-end sequencing. Paired-end sequencing offers insights into structural variations and transcript isoforms, significantly improving mapping accuracy for longer transcripts and repetitive regions. In contrast, single-end sequencing—where only one end of the fragment is sequenced—can be a more cost-effective option while still delivering high-quality data for gene expression analysis. The choice between these two methods ultimately depends on the research question and experimental objectives. Paired-end sequencing is ideal for identifying novel transcripts or characterizing isoforms, whereas single-end sequencing is often sufficient for quantifying gene expression. Factors such as the type of RNA (e.g., mRNA or total RNA), read length, budget, and available computational resources also influence this decision.
+
+## Reference genome
+
+nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes).
+
+Before starting the analysis, the users might want to check whether the genome they need is part of this collection. They also might want to consider downloading the reference locally, when running on premises: this would be useful for multiple runs and to speed up the analysis. In this case the parameter `--igenomes_base` might be used to pass the root directory of the downloaded references.
+
+One might also need to use custom files: in this case the users might either provide specific parameters at command line (`--fasta` option followed by the genome of choiche), or create a config file adding a new section to the `genome` object. See [here](https://nf-co.re/docs/usage/reference_genomes#custom-genomes) for more details.
+
+In this tutorial we will edit the config file, since the data we will be using have been simulated on chromosome 21 of the Human GRCh38 reference, and we have prepared genome fasta and genome index containing only this chromosome locally.
+
+The two files are `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa` and `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa.fai`, respectively.
+
+## Reference annoation
+
+The reference annotation plays a crucial role in the RNA-seq analysis. Without a high-quality reference annotation, RNA-seq analysis would result in inaccurate or incomplete results. The reference annotation provides a precise guide for aligning sequencing reads to specific genomic regions, allowing to identify genes, transcripts, and regulatory elements, as well as novel transcripts and alternative splicing events.
+
+nf-core pipelines make use of the Illumina iGenomes collection also as [reference annotation](https://nf-co.re/docs/usage/reference_genomes).
+The reference annotations are vastly out of date with respect to current annotations and miss certain features. So, the general recommendation is to download a newest annotation version compatible with the genome. A user can utilize the `--gtf` or the `--gff` options to specify the annottation files of choiche, or create a config file adding a new section to the `genome` object.
+
+Similarly to the approach utilised for the genome, in this tutorial we will edit the config file. The annotation files include only the annotated transcripts on chromosome 21 of the Human GRCh38 reference genome and we have already prepared these files locally.
+
+The two files are `/workspace/gitpod/training/data/refs/gencode_v29_chr21.gff` and `/workspace/gitpod/training/data/refs/gencode_v29_transcripts_chr21.fa`, respectively.
+
+## Input files
+
+The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines.
+The format is described in the [rnaseq usage page](https://nf-co.re/rnaseq/3.14.0/docs/usage).
+
+The input file is `/workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv`.
+
+## Running nf-core/rnaseq
+
+In the following sections we will:
+
+- prepare our references;
+
+- set our computational resources in order to be able to run the pipeline on a gitpod VM;
+
+- edit the optional settings;
+
+- run the pipeline.
+
+## Reference and annotation files
+
+Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome.
+It is sufficient to add the following code to the `parameters` directive in the config.
+
+```groovy title="nextflow.config"
+igenomes_base = '/workspace/gitpod/training/data/refs/'
+genomes {
+ 'GRCh38chr21' {
+ fasta = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta"
+ fasta_fai = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta.fai"
+ gff = "${params.igenomes_base}/gencode_v29_chr21_parsed.gff"
+ transcript_fasta = "${params.igenomes_base}/gencode.v29.transcripts_chr21.fa"
+ star_index = "${params.igenomes_base}/star_index_chr21.tar.gz"
+ salmon_index = "${params.igenomes_base}/salmon_index_chr21.tar.gz"
+ }
+}
+```
+
+To speed up the analysis we will include the `star_index` and the `salmon_index` in the config. These files have already been created locally.
+
+## Computing resources
+
+Based on the choices we made when starting up the gitpod environment, we recommend to use the following additional parameters.
+They can also be added to the parameters directive in the config file we just edited.
+
+```groovy title="nextflow.config"
+params {
+ max_cpus = 2
+ max_memory = '6.5GB'
+ max_time = '2.h'
+}
+```
+
+## Launching the pipeline
+
+Now we are ready to launch the pipeline, and we can use the following command line:
+
+```bash
+nextflow run nf-core/rnaseq -r 3.12.0 \
+--input /workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv \
+--outdir ./results_star_salmon \
+--genome GRCh38chr21 \
+--aligner star_salmon \
+--pseudo_aligner salmon \
+--skip_biotype_qc \
+--skip_stringtie \
+--skip_bigwig \
+--skip_umi_extract \
+--skip_trimming \
+--skip_fastqc \
+--skip_markduplicates \
+--skip_dupradar \
+--skip_rseqc \
+--skip_qualimap
+```
+
+Notice that we will run the pipeline with STAR as aligner and Salmon in alignment-based mode to quantify gene expression. We will also run the pipeline with Salmon in quasi-mapping mode to perform a lightweight alignment and quantification.
+
+The `skip` parameters were inserted to reduce the running time.
diff --git a/docs/usage/DEanalysis/theory.md b/docs/usage/DEanalysis/theory.md
new file mode 100644
index 000000000..dd932fbb6
--- /dev/null
+++ b/docs/usage/DEanalysis/theory.md
@@ -0,0 +1,169 @@
+---
+order: 2
+---
+
+# RNAseq Analysis
+
+Before we dive into the nf-core pipeline for analysing RNA-sequencing data, it's worth looking at some theoretical aspects of RNA-seq.
+
+## Overview
+
+Given the central role of RNA in a wide range of molecular functions, RNA-seq has emerged as a powerful tool for measuring the presence and levels of RNA species in biological samples. The technique is based on next-generation sequencing (NGS) technologies and is now considered the gold standard in the field of transcriptomics.
+
+After RNA extraction and reverse transcription into complementary DNA (cDNA), the biological material is sequenced, generating NGS "reads" that correspond to the RNA captured in a specific cell, tissue, or organ at a given time. The sequencing data is then bioinformatically processed through a typical workflow summarised in the diagram below:
+
+
+ { width="1000" }
+
+
+In the scheme, we can identify three key phases in the workflow:
+
+- **data pre-processing**: raw reads are handled to remove adapters or contaminants enhancing their quality;
+
+- **traditional or lightweight alignment and quantification**: reads are mapped to a reference genome, and gene abundance is estimated. The workflow can also follow an alternative route based on lightweight alignment and quantification, reducing the time required for the analysis.
+
+- **differential expression analysis**: differentially expressed genes are identified using statistical tests, annotated, and visualised.
+
+Depending on the user's needs, the workflow can include additional downstream analyses such as functional enrichment analysis, co-expression analysis, and integration with other omics data.
+
+## Pre-processing
+
+The pre-processing of sequencing reads from RNA-seq data is a critical step to ensure the quality and accuracy of downstream analysis. The raw reads obtained from the sequencer are stored as [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files, which contain both the sequence data and quality scores. The initial processing step involves evaluating the quality of the reads to identify potential issues such as adapter contamination, sequencing errors, or low-quality bases. The presence of adapters (short DNA sequences ligated to the ends of DNA fragments during library preparation) is firstly detected through comparison with known adapter sequences or by using algorithms that identify adapter-like sequences. These sequences are then removed in a process known as **read trimming**. Next, reads containing contaminants (genomic DNA and/or rRNA), and those with low-quality bases are filtered out. Finally, the quality of the filtered reads is checked to ensure their suitability for downstream processing.
+
+## Alignment (or lightweight alignment) and quantification
+
+In the RNA-seq workflow, the alignment step involves mapping sequencing reads to a reference genome or transcriptome to determine the position and orientation of each read relative to the reference sequence.
+
+Errors, gaps, regions of poor sequence quality, insertions/deletions (INDELs), as well as duplicated, and repetitive regions in the reference sequence, make this step challenging. Addressing these issues by choosing a high-quality reference and an appropriate aligner is essential for accurate results. A crucial component in the alignment step is the [annotation](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff) file, which can be in General Feature Format (GFF) or Gene Transfer Format (GTF). These files contain key information about the location, and structure of genes and transcripts, playing an essential role in accurate mapping and gene expression quantification. Additionally, RNA-seq data often includes reads that span exon-exon junctions, and the annotation files provide information about splice junctions, allowing for the detection of different isoforms.
+
+The alignment and quantification steps can follow two different approaches depending on user preferences:
+
+- traditional alignment and quantification;
+
+- lightweight alignment and quantification.
+
+In RNA-seq analysis, traditional alignment is often performed with **splice-aware aligners**, which are designed to recognise the splicing process. These aligners can align reads across known splice junctions and detect novel splice sites or alternative splicing events. Popular splice-aware aligners include [STAR](https://github.com/alexdobin/STAR), and [HISAT2](https://github.com/DaehwanKimLab/hisat2). Once alignment is complete, the following step is quantification, which involves estimating the abundance (number of reads) associated with each gene. Tools like [featureCounts](https://subread.sourceforge.net/featureCounts.html), [HTSeq](https://htseq.readthedocs.io), [Salmon](http://combine-lab.github.io/salmon), and [RSEM](http://deweylab.github.io/RSEM) are commonly used for this purpose.
+
+Traditional alignment tools are time-consuming and require substantial computational resources, particularly in terms of memory, and CPU usage. Alternatively, methods based on **lightweight alignment** allows for faster analysis by determining the likely origin of reads without a full, base-by-base alignment. Lightweight alignment tools like [Kallisto](https://pachterlab.github.io/kallisto/about.html), [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish), and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) avoid full-scale alignment, providing quantification estimates more quickly than traditional splice-aware algorithms, while still maintaining high accuracy. The quantification results from these tools are often called **pseudocounts** or **abundance estimates**, which can be used for downstream analysis.
+
+:::info
+Salmon can be run in two different modes: alignment-based mode or quasi-mapping mode. In the first mode, Salmon estimates abudance starting from `bam` file obtained with the chosen aligner. In the second mode, Salmon requires a reference transcriptome and raw reads to perform both mapping and quantification.
+:::
+
+## Differential expression (DE) analysis with DESeq2
+
+Differential expression analysis is a statistical method for comparing gene expression levels between different experimental conditions, such as disease vs healthy (e.g., tumour tissue vs healthy tissue), treatment vs control (e.g., a sample treated with a specific stimulus, drug, or compound vs an untreated sample), and tissue vs tissue (e.g., brain vs heart). Differential expression analysis aims to assess, for each gene, whether the observed differences in expression between groups are statistically significant, accounting for the variation observed within groups (replicates). This part of the analysis is typically performed in R using various packages, such as [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), and [limma](https://bioconductor.org/packages/release/bioc/html/limma.html). This tutorial focuses on **DESeq2**, a popular R package known for its robustness. For more detailed information, see the [DESeq2 vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
+
+The analysis begins with the input data, which generally consist of a count matrix obtained during the alignment and quantification step, summarising the expression levels of the different genes in each sample of the dataset. The rows of the matrix typically correspond to genes, while the columns represent the samples. Another essential prerequisite is a metadata table describing the samples.
+
+### Quality Control
+
+To ensure robust differential expression results, it is common to start by exploring the sources of variation in the data. DESeq2 provides quality control tools, including Principal Component Analysis (PCA) and hierarchical clustering. PCA is used to reduce the dimensionality of the data, allowing the visualisation of the samples in a lower-dimensional space. In contrast, hierarchical clustering shows the correlation of gene expression for all pairwise combinations of samples in the dataset. These methods can identify groups of samples with similar behaviour, as well as potential outliers or unexpected patterns.
+
+Quality control in DESeq2 typically uses **variance-stabilised (vst)** or **regularised log-transformed (rlog)** counts. Since raw counts in RNA-seq follow a discrete distribution, which is not suitable for many statistical and machine learning algorithms that assume continuous distributions, transformations such as `vst` and `rlog` are applied to stabilise the variance across genes. This stabilisation ensures that, after transformation, genes with both low and high expression levels have variances that are more comparable, making the data more suitable for downstream analyses.
+
+Finally, it is often beneficial to filter out genes that are unlikely to show differential expression, such as those with zero counts across samples or extreme count outliers. By filtering out these genes, we can increase the sensitivity of our differential expression analysis, and reduce the risk of false positives.
+
+:::note
+The `vst` or `rlog` transformations are applied to the normalised counts stored in the `dds` object, which is generated by running either the `DESeq()` or `estimateSizeFactors()` function. Since the estimation of size factors is an early step in the `DESeq()` function, the transformations are generally applied immediately afterwards.
+:::
+
+### Design Formula
+
+The design formula specifies the sources of variation that the statistical model needs to account for. It defines how the counts will be related to the experimental conditions, allowing the software to model the relationship between gene expression and the factors of interest, such as treatment, time points or batch effects. It is important to specify the main factor of interest as well as additional covariates in the design formula.
+
+A well-defined design formula can account for potential confounding variables. For instance, if the experiment includes multiple conditions, specifying these in the design helps to isolate the effect of the primary variable of interest on gene expression. An example is provided below:
+
+```r
+# Basic design with a single factor of interest
+
+design = ~ condition
+
+# Design including covariates to control for sex and developmental stage
+
+design = ~ sex + developmental_stage + condition
+```
+
+:::tip
+In R, the tilde (`~`) is used in formula syntax to specify relationships between variables in statistical models. Here, it indicates that gene counts (dependent variable) will be modelled as a function of the specified variables (predictors).
+:::
+
+The results will not be affected by the order of variables but the common practice is to specify the main source of variation in the last position of the design formula.
+
+### Differential Expression Analysis
+
+RNA-seq data typically contain a large number of genes with low expression counts, indicating that many genes are expressed at very low levels across samples. At the same time, RNA-seq data exhibit a skewed distribution with a long right tail due to the absence of an upper limit for gene expression levels. This means that while most genes have low to moderate expression levels, a small number are expressed at high levels. Accurate statistical modelling must therefore account for this distribution to avoid misleading conclusions.
+
+
+ { width="400"}
+
+
+The core of the differential expression analysis is the `DESeq()` function, a wrapper that streamlines several key steps into a single command. The different functions are listed below:
+
+
+ { width="400"}
+
+
+:::note
+While `DESeq()` combines these steps, a user could choose to perform each function separately to have more control over the whole process.
+:::
+
+The different steps are explained in detail below:
+
+1. **Normalisation**: since DESeq2 compares counts between sample groups for the same gene, it does not need to adjust for gene length. However, it is essential to account for variations in sequencing depth and RNA composition among samples. To normalise the data, DESeq2 utilises **size factors**.
+
+The size factors are calculated using the **median ratio** method:
+
+- **Calculate the geometric mean**: for each gene, compute the geometric mean of its counts across all samples. This gives a row-wise geometric mean for each gene;
+
+- **Calculate ratios**: divide each gene's count by its geometric mean to obtain a ratio for each sample;
+
+- **Determine size factors**: for each sample, take the median of these ratios (column-wise) to derive the size factors;
+
+- **Normalise counts**: divide each raw count by the corresponding sample size factor to generate normalised counts.
+
+:::warning
+While normalised counts are useful for downstream visualisation of results, they should not be used as input for DESeq2. Instead, DESeq2 requires raw count data in the form of a matrix of integer values.
+:::
+
+2. **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The dispersion estimates indicate the variance in gene expression at a specific mean expression level. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean).
+
+
+ { width="400"}
+
+
+DESeq2 addresses this issue by employing the **negative binomial distribution**, which generalises the Poisson distribution by introducing an additional dispersion parameter. This parameter quantifies the extra variability present in RNA-seq data, providing a more realistic representation than the Poisson distribution, which assumes mean = variance. DESeq2 starts by estimating the **common dispersion**, a single estimate of dispersion applicable to all genes in the dataset. This estimate provides a baseline for variability across all genes in the dataset. Next, DESeq2 estimates **gene-wise dispersion**, a separate estimate of dispersion for each individual gene, taking into account that different genes may exhibit varying levels of expression variability due to biological differences.
+The dispersion parameter (α) is related to the mean (μ), and variance of the data, as described by the equation:
+
+$$
+Var = μ + α ⋅ μ²
+$$
+
+A key feature of DESeq2's dispersion estimates is their negative correlation with the mean, and positive correlation with the variance. Genes with low expression tend to have higher dispersion values, while genes with high expression tend to have lower dispersion.
+
+3. **Mean-dispersion relationship**: this process, known as dispersion fitting, models the relationship between the mean expression level of a gene, and its dispersion. DESeq2 assumes that genes with similar expression profiles share similar dispersion patterns, and leverages this information to refine the estimates identifying a trend in the dispersion estimates across genes. The fitted curve, typically a smooth curve, describes how dispersion changes as a function of the mean expression level.
+
+4. **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, and makes the dispersion estimates more reliable. The strength of the shrinkage depends on the sample size (more samples = less shrinkage), and how close the initial estimates are to the fitted curve.
+
+
+ { width="400"}
+
+
+The initial estimates (black dots) are shrunk toward the fitted curve (red line) to obtain the final estimates (blue dots). However, genes with exceptionally high dispersion values are not shrunk, as they likely deviate from the model assumptions exhibiting elevated variability due to biological or technical factors. Shrinking these values could lead to false positives.
+
+5. **Fitting model and testing**: the initial step in hypothesis testing involves defining a **null hypothesis** for each gene. In DESeq2, The null hypothesis states that there is no difference in expression between the sample groups (log2 fold change == 0). Next, DESeq applies a statistical test to assess whether the null hypothesis is true.
+ DESeq2 fits a **generalised linear model (GLM)** to the normalised counts using the calculated size factors and final dispersion estimates. A GLM is a flexible extension of linear regression that models the relationship between a response variable (normalised counts), and predictors (e.g., condition).
+ By using a **negative binomial distribution**, DESeq2's GLM can handle the additional variability in gene expression counts that often occurs in RNA-seq data. Once the model is fit, coefficients are estimated for each sample group along with their standard errors. These coefficients represent the estimated log2 fold changes between groups, and serve as the basis for hypothesis testing, using either a **Wald test** or a **Likelihood Ratio Test (LRT)**, depending on the experimental design:
+
+- **Wald Test**: The Wald test is ideal for simpler experimental designs, such as comparing two conditions (e.g., treated vs untreated). In DESeq2, the Wald test is implemented by calculating the log2 fold change and dividing it by its standard error to obtain a z-statistic. This z-statistic is then evaluated against a standard normal distribution, leading to the computation of a p-value that indicates the likelihood of observing a z-statistic as extreme as the calculated value under the null hypothesis. When the p-value is small, we reject the null hypothesis, concluding that the gene is differentially expressed.
+
+- **Likelihood Ratio Test (LRT)**: The LRT is more suitable for complex experimental designs involving multiple variables. It compares the fit of two nested models: the full model, which includes the factor of interest (e.g., treatment), and the reduced model, which excludes that factor. This approach allows DESeq2 to account for confounding variables and to isolate the effect of specific variables on gene expression. In contrast to the Wald test, which assesses up- or down-regulation between two conditions, the LRT identifies genes that exhibit changes in expression in any direction across multiple sample classes, thereby making it more appropriate for multi-factor analyses.
+
+Regardless of the utilised test, for each gene DESeq2 computes a log2 fold change along with an associated p-value. The p-value is the result of a single test (single gene), but in the case of RNA-seq data thousands of genes are tested, increasing the false positive rate.
+To account for this, DESeq2 employs multiple test correction methods (the Benjamini-Hochberg procedure is the default) to adjust the p-values, and control the false discovery rate (FDR). The FDR is the expected proportion of false positives among the identified significant results.
+
+:::info
+By setting the FDR cutoff to < 0.05, 5% of genes identified as differentially expressed are expected to be false positives. For instance, if 400 genes are identified as differentially expressed with an FDR cutoff of 0.05, you would expect 20 of them to be false positives.
+:::
+
+After identifying DE genes using DESeq2, it is essential to interpret the biological significance of these genes through functional analysis. This involves examining the roles of the differentially expressed genes in various biological processes, molecular functions, and pathways, providing insights into the underlying mechanisms driving the observed changes in gene expression. This interpretation can help in discovering pathways involved in disease or identifying potential therapeutic targets. Different tools are available to carry out these functional analyses, such as [Gene Ontology](https://geneontology.org), [Reactome](https://reactome.org/), [KEGG](https://www.genome.jp/kegg), [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html), [g:Profiler](https://biit.cs.ut.ee/gprofiler), and [WikiPathways](https://www.wikipathways.org).