Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples, as described in the previous workflows. The second class uses the fact that the same amount of spike-in RNA was added to each cell (Lun et al. 2017). Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.
The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If the majority of genes are expected to be DE and there is no reliable house-keeping set, spike-in normalization may be the only option for removing cell-specific biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. In any particular cell, an increase in the amount of endogenous RNA will not increase spike-in coverage (with or without library quantification). Thus, the former will not be represented as part of the bias in the latter, which means that the effects of total RNA content on expression will not be removed upon scaling. With non-DE normalization, an increase in RNA content will systematically increase the expression of all genes in the non-DE subset, such that it will be treated as bias and removed.
We demonstrate the use of spike-in normalization on a dataset involving different cell types – namely, mouse embryonic stem cells (mESCs) and mouse embryonic fibroblasts (MEFs) (Islam et al. 2011). The count table was obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file using the accession number GSE29087.
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
islam.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
"GSE29nnn/GSE29087/suppl/GSE29087_L139_expression_tab.txt.gz"))
We load the counts into R, using colClasses
to speed up read.table
by pre-defining the type of each column.
We also specify the rows corresponding to spike-in transcripts.
library(SingleCellExperiment)
counts <- read.table(islam.fname,
colClasses=c(list("character", NULL, NULL, NULL, NULL, NULL, NULL),
rep("integer", 96)), skip=6, sep='\t', row.names=1)
is.spike <- grep("SPIKE", rownames(counts))
sce.islam <- SingleCellExperiment(list(counts=as.matrix(counts)))
isSpike(sce.islam, "spike") <- is.spike
dim(sce.islam)
## [1] 22936 96
We perform some quality control to remove low-quality cells using the calculateQCMetrics
function.
Outliers are identified within each cell type to avoid issues with systematic differences in the metrics between cell types.
The negative control wells do not contain any cells and are useful for quality control (as they should manifest as outliers for the various metrics), but need to be removed prior to downstream analysis.
library(scater)
sce.islam <- calculateQCMetrics(sce.islam)
sce.islam$grouping <- rep(c("mESC", "MEF", "Neg"), c(48, 44, 4))
libsize.drop <- isOutlier(sce.islam$total_counts, nmads=3, type="lower",
log=TRUE, batch=sce.islam$grouping)
feature.drop <- isOutlier(sce.islam$total_features_by_counts, nmads=3, type="lower",
log=TRUE, batch=sce.islam$grouping)
spike.drop <- isOutlier(sce.islam$pct_counts_spike, nmads=3, type="higher",
batch=sce.islam$grouping)
sce.islam <- sce.islam[,!(libsize.drop | feature.drop |
spike.drop | sce.islam$grouping=="Neg")]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
BySpike=sum(spike.drop), Remaining=ncol(sce.islam))
## ByLibSize ByFeature BySpike Remaining
## 1 4 6 12 77
We apply the computeSpikeFactors
method to estimate size factors for all cells.
This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells.
Here, we set general.use=TRUE
as we intend to apply the spike-in factors to all counts.
library(scran)
sce.islam <- computeSpikeFactors(sce.islam, general.use=TRUE)
head(sizeFactors(sce.islam))
## [1] 1.1486524 1.1274936 0.4285218 1.1011014 0.6646450 1.5932121
head(sizeFactors(sce.islam, "spike")) # same as general size factors.
## [1] 1.1486524 1.1274936 0.4285218 1.1011014 0.6646450 1.5932121
Running normalize
will use the spike-in-based size factors to compute normalized log-expression values.
Unlike the previous analyses, we do not have to define separate size factors for the spike-in transcripts.
This is because the relevant factors are already being used for all genes and spike-in transcripts when general.use=TRUE
.
(The exception is if the experiment uses multiple spike-in sets that behave differently and need to be normalized separately.)
sce.islam <- normalize(sce.islam)
For comparison, we also compute the deconvolution size factors for this data set (Lun, Bach, and Marioni 2016).
deconv.sf <- computeSumFactors(sce.islam, sf.out=TRUE, cluster=sce.islam$grouping)
head(deconv.sf)
## [1] 0.06424166 0.12172399 0.15547626 0.12876863 0.05680651 0.07888561
We observe a negative correlation between the two sets of size factors (Figure 1). This is because MEFs contain more endogenous RNA, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the deconvolution size factors). If the spike-in size factors were applied to the counts, the expression values in MEFs would be scaled up while expression in mESCs would be scaled down. However, the opposite would occur if deconvolution size factors were used.
colours <- c(mESC="red", MEF="grey")
plot(sizeFactors(sce.islam), deconv.sf, col=colours[sce.islam$grouping], pch=16,
log="xy", xlab="Size factor (spike-in)", ylab="Size factor (deconvolution)")
legend("bottomleft", col=colours, legend=names(colours), pch=16)
Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In the HSC and brain analyses, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, Islam et al. (2011) observe a 5-fold difference in total RNA between mESCs and MEFs. Similarly, the total RNA in a cell changes across phases of the cell cycle (Buettner et al. 2015). Spike-in normalization will preserve these differences in total RNA content such that the corresponding biological groups can be easily resolved in downstream analyses.
Comments from Aaron:
min.mean
) to compute the deconvolution size factors.
This avoids problems with discreteness as mentioned in our previous uses of computeSumFactors
.sf.out=TRUE
will directly return the size factors, rather than a SingleCellExperiment
object containing those factors.
This is more convenient when only the size factors are required for further analysis.All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scRNAseq_1.8.0
## [2] edgeR_3.24.3
## [3] Matrix_1.2-15
## [4] org.Hs.eg.db_3.7.0
## [5] EnsDb.Hsapiens.v86_2.99.0
## [6] ensembldb_2.6.3
## [7] AnnotationFilter_1.6.0
## [8] DropletUtils_1.2.2
## [9] pheatmap_1.0.12
## [10] cluster_2.0.7-1
## [11] dynamicTreeCut_1.63-1
## [12] limma_3.38.3
## [13] scran_1.10.2
## [14] scater_1.10.1
## [15] ggplot2_3.1.0
## [16] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
## [17] GenomicFeatures_1.34.1
## [18] org.Mm.eg.db_3.7.0
## [19] AnnotationDbi_1.44.0
## [20] SingleCellExperiment_1.4.1
## [21] SummarizedExperiment_1.12.0
## [22] DelayedArray_0.8.0
## [23] BiocParallel_1.16.5
## [24] matrixStats_0.54.0
## [25] Biobase_2.42.0
## [26] GenomicRanges_1.34.0
## [27] GenomeInfoDb_1.18.1
## [28] IRanges_2.16.0
## [29] S4Vectors_0.20.1
## [30] BiocGenerics_0.28.0
## [31] bindrcpp_0.2.2
## [32] BiocFileCache_1.6.0
## [33] dbplyr_1.2.2
## [34] knitr_1.21
## [35] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.2.0 plyr_1.8.4
## [3] igraph_1.2.2 lazyeval_0.2.1
## [5] splines_3.5.2 sp_1.3-1
## [7] digest_0.6.18 htmltools_0.3.6
## [9] viridis_0.5.1 magrittr_1.5
## [11] memoise_1.1.0 openxlsx_4.1.0
## [13] Biostrings_2.50.2 prettyunits_1.0.2
## [15] colorspace_1.3-2 blob_1.1.1
## [17] rappdirs_0.3.1 rrcov_1.4-7
## [19] haven_2.0.0 xfun_0.4
## [21] dplyr_0.7.8 crayon_1.3.4
## [23] RCurl_1.95-4.11 bindr_0.1.1
## [25] survival_2.43-3 zoo_1.8-4
## [27] glue_1.3.0 gtable_0.2.0
## [29] zlibbioc_1.28.0 XVector_0.22.0
## [31] kernlab_0.9-27 car_3.0-2
## [33] Rhdf5lib_1.4.2 prabclus_2.2-6
## [35] DEoptimR_1.0-8 HDF5Array_1.10.1
## [37] abind_1.4-5 VIM_4.7.0
## [39] scales_1.0.0 sgeostat_1.0-27
## [41] mvtnorm_1.0-8 DBI_1.0.0
## [43] GGally_1.4.0 sROC_0.1-2
## [45] Rcpp_1.0.0 laeken_0.4.6
## [47] viridisLite_0.3.0 progress_1.2.0
## [49] foreign_0.8-71 bit_1.1-14
## [51] mclust_5.4.2 truncnorm_1.0-8
## [53] vcd_1.4-4 httr_1.4.0
## [55] fpc_2.1-11.1 RColorBrewer_1.1-2
## [57] modeltools_0.2-22 NADA_1.6-1
## [59] flexmix_2.3-14 pkgconfig_2.0.2
## [61] reshape_0.8.8 XML_3.98-1.16
## [63] nnet_7.3-12 locfit_1.5-9.1
## [65] tidyselect_0.2.5 labeling_0.3
## [67] rlang_0.3.1 reshape2_1.4.3
## [69] cellranger_1.1.0 munsell_0.5.0
## [71] tools_3.5.2 RSQLite_2.1.1
## [73] pls_2.7-0 cvTools_0.3.2
## [75] evaluate_0.12 stringr_1.3.1
## [77] yaml_2.2.0 bit64_0.9-7
## [79] zip_1.0.0 robustbase_0.93-3
## [81] purrr_0.2.5 biomaRt_2.38.0
## [83] compiler_3.5.2 beeswarm_0.2.3
## [85] curl_3.2 e1071_1.7-0
## [87] zCompositions_1.1.2 tibble_2.0.0
## [89] statmod_1.4.30 robCompositions_2.0.9
## [91] pcaPP_1.9-73 stringi_1.2.4
## [93] highr_0.7 forcats_0.3.0
## [95] trimcluster_0.1-2.1 lattice_0.20-38
## [97] ProtGenerics_1.14.0 pillar_1.3.1
## [99] BiocManager_1.30.4 lmtest_0.9-36
## [101] BiocNeighbors_1.0.0 data.table_1.11.8
## [103] cowplot_0.9.4 bitops_1.0-6
## [105] irlba_2.3.2 rtracklayer_1.42.1
## [107] R6_2.3.0 bookdown_0.9
## [109] KernSmooth_2.23-15 gridExtra_2.3
## [111] rio_0.5.16 vipor_0.4.5
## [113] boot_1.3-20 MASS_7.3-51.1
## [115] assertthat_0.2.0 rhdf5_2.26.2
## [117] withr_2.1.2 GenomicAlignments_1.18.1
## [119] Rsamtools_1.34.0 GenomeInfoDbData_1.2.0
## [121] diptest_0.75-7 hms_0.4.2
## [123] grid_3.5.2 class_7.3-15
## [125] rmarkdown_1.11 DelayedMatrixStats_1.4.0
## [127] carData_3.0-2 mvoutlier_2.0.9
## [129] Rtsne_0.15 ggbeeswarm_0.6.0
Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2):155–60.
Islam, S., U. Kjallquist, A. Moliner, P. Zajac, J. B. Fan, P. Lonnerberg, and S. Linnarsson. 2011. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.” Genome Res. 21 (7):1160–7.
Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.