Decoding Discrepancies: A Comparative Analysis of Seurat and Scanpy in Single-Cell RNA Sequencing Workflows

Introduction

Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology, enabling gene expression analysis at the cellular level. This has led to a deeper understanding of cellular heterogeneity and complex biological processes. The analysis of scRNA-seq data relies heavily on computational tools, with Seurat and Scanpy being the most widely used packages. These platforms are generally perceived to implement similar workflows, but a closer examination reveals considerable differences in their underlying algorithms and methods, leading to variability in the final results. This article delves into the nuances of Seurat and Scanpy, highlighting the key differences and their implications for scRNA-seq data analysis.

The Standard scRNA-seq Workflow

The standard scRNA-seq workflow involves several key steps, starting with converting raw read data into a cell-gene count matrix. Each entry of this matrix, denoted X, gives the number of RNA transcripts of gene g expressed by cell i. The workflow typically includes:

Filtering: Removal of poor-quality cells and minimally expressed genes.
Normalization: Adjustment for non-biological sources of variability, such as sequencing depth, technical noise, library size, and batch effects. Log normalization is handled identically by Seurat and Scanpy, producing equivalent output given the same input matrices.
Highly Variable Gene (HVG) Selection: Identification of genes with significant expression variability across cells, crucial for dimensionality reduction and downstream analysis. The two programs use different default algorithms for HVG selection, and the resulting gene sets overlap with a Jaccard index (intersection over union between two sets) of only 0.22.
Scaling: Scaling of gene expression values to a mean of zero and variance of one across cells.
Dimensionality Reduction: Application of Principal Component Analysis (PCA) to reduce the number of variables while retaining the most important information. PCA plots showed noticeable differences in the plotted positions of each cell on the PC1-2 space, although the same general shape of the plot is preserved. The Scree plots also displayed differences, most notably with the proportion of variance explained by the first PC differing by 0.1.
k-Nearest Neighbors (KNN) Graph Construction: Building a graph that represents the relationships between cells based on their gene expression profiles. Both the content and size of each neighborhood per cell differed greatly. The median Jaccard index between the neighborhood of each cell from Seurat and Scanpy was 0.11, and the median degree ratio (Seurat/Scanpy) magnitude was 2.05.
Clustering: Grouping cells with similar gene expression patterns into distinct clusters. Clustering with default settings also resulted in differences in output, as seen by the discordance in the alluvial plot and the Adjusted Rand Index (ARI) of 0.53.
Non-linear Dimensionality Reduction: Using techniques like t-distributed stochastic neighbor embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) to visualize the data in a lower-dimensional space. UMAP plots visually showed some differences in the shapes of local and neighboring clusters, even when controlling for global shifts or rotations.
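
The Adjusted Rand Index cited for the clustering step can be computed directly from two label vectors. The following is a minimal sketch in plain Python, not the code used to produce the 0.53 figure; the function name and the toy labels are hypothetical. An ARI of 1 means the two partitions are identical up to relabeling, and values near 0 mean agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")
    # Count pairs of cells placed together in both partitions, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (pairs_both - expected) / (max_index - expected)


# Hypothetical cluster labels for six cells from each package.
seurat_labels = [0, 0, 0, 1, 1, 2]
scanpy_labels = ["a", "a", "b", "b", "c", "c"]
print(adjusted_rand_index(seurat_labels, scanpy_labels))
```

Because the index only looks at which cells are grouped together, the cluster names themselves (integers in one package, strings in another) do not matter.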

Seurat vs. Scanpy: A Detailed Comparison

Seurat, written in R, was one of the first comprehensive platforms for scRNA-seq analysis. Scanpy, a Python-based tool developed later, offers a similar set of features. While both tools accept a cell-gene count matrix as input, generated by packages like Cell Ranger and kallisto-bustools (kb), their internal algorithms and default settings differ significantly.

Input and Count Matrix Generation

The input to Seurat and Scanpy is a cell-gene count matrix, with two popular packages for count matrix generation being Cell Ranger and kallisto-bustools (kb). Cell Ranger, developed by 10x Genomics, is specifically optimized for processing data from the Chromium platform, providing a solution that includes barcode processing, read alignment (using the STAR aligner), and gene expression analysis. It is popular for its user-friendliness and seamless integration with 10x Genomics data. However, Cell Ranger’s robustness comes with the trade-off of high computational demands, particularly for larger datasets. On the other hand, kb is an open-source alternative to Cell Ranger known for its efficiency and speed. kb-python is a wrapper around kallisto and bustools, which pseudoaligns reads to produce a barcode, unique molecular identifier (UMI), set (BUS) file, which is then processed into a cell-by-gene count matrix. Utilizing the kallisto pseudoalignment algorithm and the bustools toolkit, kb provides a fast, lightweight solution for quantifying transcript abundances and handling BUS files. This efficiency makes it particularly suitable for environments with constrained computational resources. Additionally, kb is accurate and stands out for its flexibility, allowing researchers to tailor the analysis pipeline to a broader range of experimental designs and research needs.

Filtering and Normalization

There was no difference in cell or gene filtering between the packages after filtering UMIs, minimum genes per cell, minimum cells per gene, and maximum mitochondrial gene content. Furthermore, given the same matrices as input, Seurat and Scanpy handled log normalization identically as well, producing equivalent output.
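
As an illustration of what log normalization does, here is a minimal sketch in plain Python. It is not either package's implementation: the function name is hypothetical, and the scale factor of 10,000 is an assumption (a commonly used value). Whatever value is chosen, it has to be the same in both tools for the outputs to match.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize a cell-by-gene count matrix given as a list of rows.

    For each cell: divide every count by the cell's total, multiply by
    scale_factor (an assumed value), then apply log(1 + x).
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all zeros instead of dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(x / total * scale_factor) for x in cell])
    return normalized


# Toy matrix: two cells, three genes.
toy_counts = [[1, 3, 0], [10, 30, 0]]
print(log_normalize(toy_counts))
```

The two toy cells have the same relative composition but a tenfold difference in depth, so they normalize to the same values, which is the point of the depth adjustment.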

Highly Variable Gene Selection and PCA

The programs use different default algorithms for HVG selection, and the resulting gene sets overlapped with a Jaccard index (intersection over union between two sets) of only 0.22. PCA also yielded different results when run with default parameters. The PCA plots showed noticeable differences in the position of each cell in the PC1-2 space, although the general shape of the plot was preserved. The Scree plots also differed, most notably in the proportion of variance explained by the first PC, which differed by 0.1. The eigenvectors differed as well: the angle between the first PC vectors had a sine of 0.1, the angle between the second PCs had a sine of 0.5 (i.e., 30 degrees apart), and PCs 3 and beyond were nearly orthogonal. All of these differences could be resolved by standardizing the HVG set and adjusting the clipping and regression settings applied prior to PCA.
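
Both summary statistics in this section are simple to reproduce. The sketch below is illustrative only: the function names and the example gene sets are hypothetical, not taken from the analysis. It computes a Jaccard index between two gene sets and converts the sine of an angle between two PC vectors into degrees.

```python
import math


def jaccard_index(set_a, set_b):
    """Intersection over union of two collections of gene names."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


def angle_from_sine_degrees(sine):
    """Angle in degrees (between 0 and 90) whose sine is the given value."""
    return math.degrees(math.asin(sine))


# Hypothetical HVG lists from each package.
hvgs_seurat = {"CD3E", "MS4A1", "LYZ", "NKG7"}
hvgs_scanpy = {"CD3E", "LYZ", "PPBP", "GNLY"}
print(jaccard_index(hvgs_seurat, hvgs_scanpy))  # 2 shared genes out of 6 in total

print(angle_from_sine_degrees(0.5))  # about 30 degrees, as for the second PCs
print(angle_from_sine_degrees(0.1))  # about 5.7 degrees, as for the first PCs
```

A sine close to 1 corresponds to an angle close to 90 degrees, which is what "nearly orthogonal" means for the later PCs.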

KNN/SNN Graph Construction and Clustering

Next, the packages differed substantially in their production of an SNN graph. Both the content and size of each cell's neighborhood differed greatly. The median Jaccard index between the neighborhood of each cell from Seurat and Scanpy was 0.11, and the median degree ratio (Seurat/Scanpy) magnitude was 2.05. The degree ratio for each cell was nearly always greater than 1, indicating that Seurat, by default, yields more highly connected SNN graphs than Scanpy. Clustering with default settings also resulted in differences in output, as seen by the discordance in the alluvial plot and the Adjusted Rand Index (ARI) of 0.53. UMAP plots visually showed some differences in the shapes of local and neighboring clusters, even when controlling for global shifts or rotations.
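
The per-cell graph comparison can be sketched as follows. This is a hedged illustration rather than the original analysis code: it assumes each graph has been exported as a mapping from cell ID to a set of neighbor IDs, the function name is hypothetical, and it uses the plain ratio of neighborhood sizes, which may not match exactly how the reported "magnitude" was defined.

```python
from statistics import median


def neighborhood_agreement(graph_a, graph_b):
    """Compare two neighbor graphs given as {cell_id: set_of_neighbor_ids}.

    Returns (median per-cell Jaccard index, median per-cell degree ratio a/b),
    computed over the cells present in both graphs.
    """
    shared_cells = graph_a.keys() & graph_b.keys()
    jaccards, degree_ratios = [], []
    for cell in shared_cells:
        nbrs_a, nbrs_b = set(graph_a[cell]), set(graph_b[cell])
        union = nbrs_a | nbrs_b
        if union:
            jaccards.append(len(nbrs_a & nbrs_b) / len(union))
        if nbrs_a and nbrs_b:
            degree_ratios.append(len(nbrs_a) / len(nbrs_b))
    if not jaccards or not degree_ratios:
        raise ValueError("no comparable neighborhoods found")
    return median(jaccards), median(degree_ratios)


# Toy graphs for three cells; the first graph is more densely connected.
graph_seurat = {"c1": {"c2", "c3", "c4", "c5"}, "c2": {"c1", "c3"}, "c3": {"c1", "c2"}}
graph_scanpy = {"c1": {"c2", "c3"}, "c2": {"c1"}, "c3": {"c1", "c2"}}
print(neighborhood_agreement(graph_seurat, graph_scanpy))
```

A median degree ratio above 1 in this toy example mirrors the reported pattern of the first graph being more highly connected than the second.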

Differential Expression Analysis

Upon DE analysis, Seurat and Scanpy overlapped with a Jaccard index of 0.62 for their significant marker genes (i.e., the total set of genes with adjusted p-value < 0.05 across all clusters), but Seurat had approximately 50% more significant marker genes than Scanpy. The difference in significant marker genes is a result of a few differences in default settings between packages. First, each package implements the Wilcoxon function separately, with Seurat requiring tie correction and Scanpy by default omitting tie correction. Additionally, each package adjusts p-values differently by default: Seurat with Bonferroni multiple testing correction, and Scanpy with Benjamini-Hochberg multiple testing correction. Finally, Seurat, by default, filters markers by p-value, percentage of cells per group possessing the gene, and log-fold change (logFC) prior to performing the Wilcoxon rank-sum test; Scanpy does not perform this type of filtering without invoking additional functions. Setting the function arguments of Scanpy to match those of Seurat (filtering, tie correction, Bonferroni correction) for DE analysis improved the Jaccard index of significant marker gene overlap to 0.73, and additionally providing the same cluster assignments to both packages further improved the Jaccard index to 0.99. The remaining 1% of genes differ as a result of differences in logFC calculation discussed later.
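
To make the effect of the multiple testing correction concrete, the sketch below implements the textbook Bonferroni and Benjamini-Hochberg adjustments in plain Python. It is an illustration under stated assumptions, not either package's code: the function names are hypothetical, and the number of tests is taken to be the number of p-values supplied, which may differ from how each package counts tests.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjustment (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy raw p-values for four genes.
raw = [0.01, 0.02, 0.03, 0.04]
n_bonf = sum(p < 0.05 for p in bonferroni(raw))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(raw))
print(n_bonf, n_bh)  # 1 gene passes under Bonferroni, 4 under Benjamini-Hochberg
```

With the same raw p-values, the choice of correction alone can change which genes cross the 0.05 threshold, which is why the correction method has to be aligned before comparing marker lists.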

Seurat and Scanpy compute logFC differently as well. Comparing each analogous gene per cluster across packages resulted in a concordance correlation coefficient (CCC) of 0.98 and a PCA fit line with a slope of 1, indicating strong overall agreement across packages. Briefly, CCC measures the agreement between two variables in terms of both correlation and variance. However, the scatterplot of logFC values revealed marked differences for a subset of markers. Specifically, for 4,109 out of 135,185 markers (about 3%), Scanpy reported a logFC near ±30 for a gene in a cluster while Seurat reported a logFC near 0. The adjusted p-values also differed between Seurat and Scanpy. With default function arguments, Seurat reported p-values either less than or similar to those of Scanpy, but never substantially greater. Most p-values were near the maximum of 1, but there was a wide degree of variability. A considerable number of p-values were far from the y=x line, including those below 1e-50 for Seurat but near 1 for Scanpy. In total, 20% of markers had their p-values flip across the p=0.05 threshold between packages, with flips roughly evenly split between the two directions (i.e., significant only in Seurat, or significant only in Scanpy). When the Scanpy function arguments were aligned with those of Seurat, virtually all differences in adjusted p-value disappeared.
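
For readers unfamiliar with CCC, the sketch below implements the standard formula (Lin's concordance correlation coefficient) in plain Python. The function name and the toy values are hypothetical, and this is not the code behind the reported 0.98.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between paired values.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (divide-by-n) variances and covariance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least two values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        return 1.0  # both inputs are the same constant
    return 2 * cov / denom


# Toy logFC values: the second set is the first shifted by +1.
logfc_a = [1.0, 2.0, 3.0]
logfc_b = [2.0, 3.0, 4.0]
print(concordance_correlation(logfc_a, logfc_b))  # 4/7, despite perfect correlation
```

The toy example shows why CCC is stricter than Pearson correlation: a constant offset leaves the correlation at 1 but lowers the CCC, because the points no longer fall on the y=x line.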

Impact of Variability

The observed differences between Seurat and Scanpy can have a significant impact on the interpretation of scRNA-seq results. The extent of these differences is approximately equivalent to the variability introduced by sequencing less than 5% of the reads or analyzing less than 20% of the cell population. This level of variability can lead to inconsistencies in downstream analyses, such as identifying marker genes and defining cell types.


The Cost of scRNA-seq

A standard scRNA-seq experiment costs thousands of dollars, with exact pricing influenced largely by data size. While it is difficult to provide an exact cost because of variability between methods, a typical sequencing kit is estimated to cost in the range of hundreds to thousands of dollars, and sequencing adds up to an additional $5 per million reads. The necessary number of reads per cell for high-quality data depends on the context of the experiment, but as an example, Cell Ranger typically recommends 20,000 read pairs per cell for its v3 technologies, and 50,000 read pairs per cell for its v2 technologies. Sample preparation also has substantial costs, often requiring precious patient samples, or maintenance of cell or animal lines for months to years in preparation for experimental analysis. A standard 10x Genomics scRNA-seq experiment sequences tens of millions to billions of reads, with a recommended cell count ranging from 500 to more than 10,000 depending on the context. These estimates do not factor in additional costs including labor, experimental setup, and follow-up analysis.
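
As a back-of-the-envelope check on the sequencing component alone, the figures above can be combined as follows. This sketch assumes that each read pair is billed as one read and uses the upper-bound rate of $5 per million reads; the function name is hypothetical, and kit, sample preparation, and labor costs are not included.

```python
def sequencing_cost_usd(n_cells, read_pairs_per_cell, usd_per_million_reads=5.0):
    """Return (total reads, sequencing cost in USD) for one experiment.

    Assumes each read pair is billed as a single read.
    """
    total_reads = n_cells * read_pairs_per_cell
    cost = total_reads / 1_000_000 * usd_per_million_reads
    return total_reads, cost


# 10,000 cells at the two recommended depths quoted above.
print(sequencing_cost_usd(10_000, 20_000))  # 200 million reads, up to $1,000
print(sequencing_cost_usd(10_000, 50_000))  # 500 million reads, up to $2,500
```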
