scuttle 1.4.0
scuttle provides various low-level utilities for single-cell RNA-seq data analysis, typically used at the start of the analysis workflows or within high-level functions in other packages. This vignette will discuss one of the earliest steps, namely, quality control (QC) to remove damaged cells.
To demonstrate, we will obtain the classic Zeisel dataset from the scRNAseq package.
In this case, the dataset is provided as a SingleCellExperiment
object.
However, most scuttle functions can also be used with raw expression matrices
or instances of the more general SummarizedExperiment
class.
library(scRNAseq)
sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat
Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
The perCellQCMetrics()
function computes a variety of basic cell-level metrics,
including sum
, total number of counts for the cell (i.e., the library size);
and detected
, the number of features for the cell that have counts above the detection limit (default of zero).
Low values for either metric are indicative of poor cDNA capture.
library(scuttle)
is.mito <- grep("mt-", rownames(sce))
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
summary(per.cell$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2574 8130 12913 14954 19284 63505
summary(per.cell$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 785 2484 3656 3777 4929 8167
In addition, we can compute subsets_X_percent
, percentage of counts that come from the feature control set named X
;
and altexps_Y_percent
, the percentage of all counts that come from an alternative feature set Y
.
Here, X
contains the mitochondrial genes and Y
contains the set of spike-ins.
High proportions for either metric are indicative of cell damage.
summary(per.cell$subsets_Mito_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.992 6.653 7.956 10.290 56.955
summary(per.cell$altexps_ERCC_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.912 14.186 20.699 23.199 29.950 76.881
It is often convenient to store this in the colData()
of our SingleCellExperiment
object for future reference.
One can either do so manually:
colData(sce) <- cbind(colData(sce), per.cell)
… or just use the addPerCellQCMetrics()
function, which does this automatically:
sce2 <- addPerCellQCMetrics(sce, subsets=list(Mito=is.mito))
colnames(colData(sce2))
## [1] "tissue" "group #"
## [3] "total mRNA mol" "well"
## [5] "sex" "age"
## [7] "diameter" "cell_id"
## [9] "level1class" "level2class"
## [11] "sum" "detected"
## [13] "subsets_Mito_sum" "subsets_Mito_detected"
## [15] "subsets_Mito_percent" "altexps_ERCC_sum"
## [17] "altexps_ERCC_detected" "altexps_ERCC_percent"
## [19] "altexps_repeat_sum" "altexps_repeat_detected"
## [21] "altexps_repeat_percent" "total"
## [23] "sum" "detected"
## [25] "subsets_Mito_sum" "subsets_Mito_detected"
## [27] "subsets_Mito_percent" "altexps_ERCC_sum"
## [29] "altexps_ERCC_detected" "altexps_ERCC_percent"
## [31] "altexps_repeat_sum" "altexps_repeat_detected"
## [33] "altexps_repeat_percent" "total"
We identify low-quality cells by setting a threshold on each of these metrics using the isOutlier()
function.
This defines the threshold at a certain number of median absolute deviations (MADs) away from the median;
values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells.
Here, we define small outliers (using type="lower"
) for the log-total counts at 3 MADs from the median.
low.total <- isOutlier(per.cell$sum, type="lower", log=TRUE)
summary(low.total)
## Mode FALSE
## logical 3005
Note that the attributes of the isOutlier()
output contain the thresholds, if this is of interest.
attr(low.total, "threshold")
## lower higher
## 1928.56 Inf
Advanced users can set batch=
to compute outliers within each batch, avoiding inflated MADs due to batch effects.
low.total.batched <- isOutlier(per.cell$sum, type="lower", log=TRUE, batch=sce$tissue)
summary(low.total.batched)
## Mode FALSE TRUE
## logical 2996 9
In cases where entire batches contain a majority of low-quality cells, we can set subset=
to only use high-quality batches for computing the thresholds.
Those thresholds are then extrapolated back to the low-quality batches for some more stringent QC.
We could manually apply isOutlier()
to all of our metrics, but it is easier to use perCellQCFilters()
to do this for us.
This identifies low-quality cells as those that are low outliers for the log-total count, low outliers for the log-number of detected features,
or high outliers for the percentage of counts in specified gene sets (e.g., mitochondrial genes, spike-in transcripts).
The discard
column contains the final call for whether a particular cell should be discarded.
# An example with just mitochondrial filters.
qc.stats <- perCellQCFilters(per.cell, sub.fields="subsets_Mito_percent")
colSums(as.matrix(qc.stats))
## low_lib_size low_n_features high_subsets_Mito_percent
## 0 3 128
## discard
## 131
# Another example with mitochondrial + spike-in filters.
qc.stats2 <- perCellQCFilters(per.cell,
sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent"))
colSums(as.matrix(qc.stats2))
## low_lib_size low_n_features high_subsets_Mito_percent
## 0 3 128
## high_altexps_ERCC_percent discard
## 65 189
For typical scRNA-seq applications, quickPerCellQC()
will wrap the perCellQCMetrics()
and perCellQCFilters()
calls.
This returns a filtered SingleCellExperiment
that can be immediately used in downstream analyses.
filtered <- quickPerCellQC(sce, subsets=list(Mito=is.mito), sub.fields="subsets_Mito_percent")
filtered
## class: SingleCellExperiment
## dim: 20006 2874
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(2874): 1772071015_C02 1772071017_G12 ... 1772063068_D01
## 1772066098_A12
## colData names(38): tissue group # ... high_subsets_Mito_percent discard
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat
The outlier-based approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, differences in total RNA content between cell types. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the book for more details.
Some basic feature-level statistics are computed by the perFeatureQCMetrics()
function.
This includes mean
, the mean count of the gene/feature across all cells;
detected
, the percentage of cells with non-zero counts for each gene;
subsets_Y_ratio
, ratio of mean counts between the cell control set named Y and all cells.
# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(sce, subsets=list(Empty=1:10))
summary(per.feat$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007 0.0097 0.1338 0.7475 0.5763 732.1524
summary(per.feat$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03328 0.76539 9.01830 18.87800 31.24792 99.96672
summary(per.feat$subsets_Empty_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.601 1.872 2.016 300.500
A more refined calculation of the average is provided by the calculateAverage()
function,
which adjusts the counts by the relative library size (or size factor) prior to taking the mean.
ave <- calculateAverage(sce)
summary(ave)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002 0.0109 0.1443 0.7475 0.5674 850.6880
These metrics tend to be more useful for informing the analyst about the overall behavior of the experiment, rather than being explicitly used to filter out genes. For example, one would hope that the most abundant genes are the “usual suspects”, e.g., ribosomal proteins, actin, histones.
Users may wish to filter out very lowly or non-expressed genes to reduce the size of the dataset prior to downstream processing. However, be sure to publish the unfiltered count matrix! It is very difficult to re-use a count matrix where the features are filtered; information from the filtered features cannot be recovered, complicating integration with other datasets.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scran_1.22.0 ensembldb_2.18.0
## [3] AnnotationFilter_1.18.0 GenomicFeatures_1.46.0
## [5] AnnotationDbi_1.56.0 scuttle_1.4.0
## [7] scRNAseq_2.7.2 SingleCellExperiment_1.16.0
## [9] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [11] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
## [13] IRanges_2.28.0 S4Vectors_0.32.0
## [15] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [17] matrixStats_0.61.0 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] rjson_0.2.20 ellipsis_0.3.2
## [3] bluster_1.4.0 XVector_0.34.0
## [5] BiocNeighbors_1.12.0 bit64_4.0.5
## [7] interactiveDisplayBase_1.32.0 fansi_0.5.0
## [9] xml2_1.3.2 sparseMatrixStats_1.6.0
## [11] cachem_1.0.6 knitr_1.36
## [13] jsonlite_1.7.2 Rsamtools_2.10.0
## [15] cluster_2.1.2 dbplyr_2.1.1
## [17] png_0.1-7 shiny_1.7.1
## [19] BiocManager_1.30.16 compiler_4.1.1
## [21] httr_1.4.2 dqrng_0.3.0
## [23] assertthat_0.2.1 Matrix_1.3-4
## [25] fastmap_1.1.0 lazyeval_0.2.2
## [27] limma_3.50.0 later_1.3.0
## [29] BiocSingular_1.10.0 htmltools_0.5.2
## [31] prettyunits_1.1.1 tools_4.1.1
## [33] igraph_1.2.7 rsvd_1.0.5
## [35] glue_1.4.2 GenomeInfoDbData_1.2.7
## [37] dplyr_1.0.7 rappdirs_0.3.3
## [39] Rcpp_1.0.7 jquerylib_0.1.4
## [41] vctrs_0.3.8 Biostrings_2.62.0
## [43] ExperimentHub_2.2.0 rtracklayer_1.54.0
## [45] DelayedMatrixStats_1.16.0 xfun_0.27
## [47] stringr_1.4.0 beachmat_2.10.0
## [49] mime_0.12 lifecycle_1.0.1
## [51] irlba_2.3.3 restfulr_0.0.13
## [53] statmod_1.4.36 XML_3.99-0.8
## [55] edgeR_3.36.0 AnnotationHub_3.2.0
## [57] zlibbioc_1.40.0 hms_1.1.1
## [59] promises_1.2.0.1 ProtGenerics_1.26.0
## [61] parallel_4.1.1 yaml_2.2.1
## [63] curl_4.3.2 memoise_2.0.0
## [65] sass_0.4.0 biomaRt_2.50.0
## [67] stringi_1.7.5 RSQLite_2.2.8
## [69] BiocVersion_3.14.0 BiocIO_1.4.0
## [71] ScaledMatrix_1.2.0 filelock_1.0.2
## [73] BiocParallel_1.28.0 rlang_0.4.12
## [75] pkgconfig_2.0.3 bitops_1.0-7
## [77] evaluate_0.14 lattice_0.20-45
## [79] purrr_0.3.4 GenomicAlignments_1.30.0
## [81] bit_4.0.4 tidyselect_1.1.1
## [83] magrittr_2.0.1 bookdown_0.24
## [85] R6_2.5.1 generics_0.1.1
## [87] metapod_1.2.0 DelayedArray_0.20.0
## [89] DBI_1.1.1 pillar_1.6.4
## [91] withr_2.4.2 KEGGREST_1.34.0
## [93] RCurl_1.98-1.5 tibble_3.1.5
## [95] crayon_1.4.1 utf8_1.2.2
## [97] BiocFileCache_2.2.0 rmarkdown_2.11
## [99] progress_1.2.2 locfit_1.5-9.4
## [101] grid_4.1.1 blob_1.2.2
## [103] digest_0.6.28 xtable_1.8-4
## [105] httpuv_1.6.3 bslib_0.3.1