Here, we describe a few additional analyses that can be performed with single-cell RNA sequencing data. This includes detection of significant correlations between genes and regressing out the effect of cell cycle from the gene expression matrix.
Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (Buettner et al. 2015). Rather, more sophisticated strategies are required, one of which is demonstrated below using data from a study of T Helper 2 (TH2) cells (Mahata et al. 2014).
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
mahata.fname <- bfcrpath(bfc,
"http://www.nature.com/nbt/journal/v33/n2/extref/nbt.3102-S7.xlsx")
Buettner et al. (2015) have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of https://dx.doi.org/10.1038/nbt.3102).
library(readxl)
incoming <- as.data.frame(read_excel(mahata.fname, sheet=1))
rownames(incoming) <- incoming[,1]
incoming <- incoming[,-1]
incoming <- incoming[,!duplicated(colnames(incoming))] # Remove duplicated genes.
sce.th2 <- SingleCellExperiment(list(logcounts=t(incoming)))
We empirically identify the cell cycle phase using the pair-based classifier in cyclone
.
The majority of cells in Figure 2 seem to lie in G1 phase, with small numbers of cells in the other phases.
library(org.Mm.eg.db)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.th2), keytype="SYMBOL", column="ENSEMBL")
set.seed(100)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
package="scran"))
assignments <- cyclone(sce.th2, mm.pairs, gene.names=ensembl, assay.type="logcounts")
plot(assignments$score$G1, assignments$score$G2M,
xlab="G1 score", ylab="G2/M score", pch=16)
We can block directly on the phase scores in downstream analyses. This is more graduated than using a strict assignment of each cell to a specific phase, as the magnitude of the score considers the uncertainty of the assignment. The phase covariates in the design matrix will absorb any phase-related effects on expression such that they will not affect estimation of the effects of other experimental factors. Users should also ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.
design <- model.matrix(~ G1 + G2M, assignments$score)
fit.block <- trendVar(sce.th2, design=design, parametric=TRUE, use.spikes=NA)
dec.block <- decomposeVar(sce.th2, fit.block)
library(limma)
sce.th2.block <- sce.th2
assay(sce.th2.block, "corrected") <- removeBatchEffect(
logcounts(sce.th2), covariates=design[,-1])
sce.th2.block <- denoisePCA(sce.th2.block, technical=dec.block,
assay.type="corrected")
dim(reducedDim(sce.th2.block, "PCA"))
## [1] 81 5
The result of blocking on design
is visualized with some PCA plots in Figure 3.
Before removal, the distribution of cells along the first two principal components is strongly associated with their G1 and G2/M scores.
This is no longer the case after removal, which suggests that the cell cycle effect has been mitigated.
sce.th2$G1score <- sce.th2.block$G1score <- assignments$score$G1
sce.th2$G2Mscore <- sce.th2.block$G2Mscore <- assignments$score$G2M
# Without blocking on phase score.
fit <- trendVar(sce.th2, parametric=TRUE, use.spikes=NA)
sce.th2 <- denoisePCA(sce.th2, technical=fit$trend)
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
out <- plotReducedDim(sce.th2, use_dimred="PCA", ncomponents=2, colour_by="G1score",
size_by="G2Mscore") + fontsize + ggtitle("Before removal")
# After blocking on the phase score.
out2 <- plotReducedDim(sce.th2.block, use_dimred="PCA", ncomponents=2,
colour_by="G1score", size_by="G2Mscore") + fontsize +
ggtitle("After removal")
multiplot(out, out2, cols=2)
As an aside, this dataset contains cells at various stages of differentiation (Mahata et al. 2014). This is an ideal use case for diffusion maps which perform dimensionality reduction along a continuous process. In Figure 4, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to TH2 differentiation, given that a key regulator Gata3 (Zhu et al. 2006) changes in expression from left to right.
plotDiffusionMap(sce.th2.block, colour_by="Gata3",
run_args=list(use_dimred="PCA", sigma=25)) + fontsize
All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] limma_3.42.0 org.Mm.eg.db_3.10.0
## [3] AnnotationDbi_1.48.0 scran_1.14.5
## [5] scater_1.14.4 ggplot2_3.2.1
## [7] SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.0
## [9] DelayedArray_0.12.0 BiocParallel_1.20.0
## [11] matrixStats_0.55.0 Biobase_2.46.0
## [13] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
## [15] IRanges_2.20.0 S4Vectors_0.24.0
## [17] BiocGenerics_0.32.0 readxl_1.3.1
## [19] R.utils_2.9.0 R.oo_1.23.0
## [21] R.methodsS3_1.7.1 BiocFileCache_1.10.2
## [23] dbplyr_1.4.2 knitr_1.26
## [25] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 colorspace_1.4-1 RcppEigen_0.3.3.7.0
## [4] class_7.3-15 rio_0.5.16 XVector_0.26.0
## [7] RcppHNSW_0.2.0 BiocNeighbors_1.4.1 proxy_0.4-23
## [10] hexbin_1.28.0 farver_2.0.1 bit64_0.9-7
## [13] RSpectra_0.15-0 ranger_0.11.2 codetools_0.2-16
## [16] robustbase_0.93-5 zeallot_0.1.0 BiocManager_1.30.10
## [19] compiler_3.6.1 httr_1.4.1 ggplot.multistats_1.0.0
## [22] dqrng_0.2.1 backports_1.1.5 assertthat_0.2.1
## [25] Matrix_1.2-17 lazyeval_0.2.2 BiocSingular_1.2.0
## [28] htmltools_0.4.0 tools_3.6.1 rsvd_1.0.2
## [31] igraph_1.2.4.1 gtable_0.3.0 glue_1.3.1
## [34] GenomeInfoDbData_1.2.2 dplyr_0.8.3 ggthemes_4.2.0
## [37] rappdirs_0.3.1 Rcpp_1.0.3 carData_3.0-3
## [40] cellranger_1.1.0 vctrs_0.2.0 DelayedMatrixStats_1.8.0
## [43] lmtest_0.9-37 laeken_0.5.0 xfun_0.11
## [46] stringr_1.4.0 openxlsx_4.1.3 lifecycle_0.1.0
## [49] irlba_2.3.3 statmod_1.4.32 edgeR_3.28.0
## [52] DEoptimR_1.0-8 zoo_1.8-6 zlibbioc_1.32.0
## [55] MASS_7.3-51.4 scales_1.1.0 VIM_4.8.0
## [58] pcaMethods_1.78.0 hms_0.5.2 yaml_2.2.0
## [61] curl_4.2 memoise_1.1.0 gridExtra_2.3
## [64] stringi_1.4.3 RSQLite_2.1.2 highr_0.8
## [67] knn.covertree_1.0 e1071_1.7-2 destiny_3.0.0
## [70] TTR_0.23-5 boot_1.3-23 zip_2.0.4
## [73] rlang_0.4.1 pkgconfig_2.0.3 bitops_1.0-6
## [76] evaluate_0.14 lattice_0.20-38 purrr_0.3.3
## [79] labeling_0.3 cowplot_1.0.0 bit_1.1-14
## [82] tidyselect_0.2.5 magrittr_1.5 bookdown_0.15
## [85] R6_2.4.1 DBI_1.0.0 pillar_1.4.2
## [88] haven_2.2.0 foreign_0.8-72 withr_2.1.2
## [91] xts_0.11-2 scatterplot3d_0.3-41 abind_1.4-5
## [94] RCurl_1.95-4.12 sp_1.3-2 nnet_7.3-12
## [97] tibble_2.1.3 crayon_1.3.4 car_3.0-5
## [100] rmarkdown_1.17 viridis_0.5.1 locfit_1.5-9.1
## [103] grid_3.6.1 data.table_1.12.6 blob_1.2.0
## [106] forcats_0.4.0 vcd_1.4-4 digest_0.6.22
## [109] tidyr_1.0.0 munsell_0.5.0 beeswarm_0.2.3
## [112] viridisLite_0.3.0 smoother_1.1 vipor_0.4.5
Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3):129–57.
Bourgon, R., R. Gentleman, and W. Huber. 2010. “Independent filtering increases detection power for high-throughput experiments.” Proc. Natl. Acad. Sci. U.S.A. 107 (21):9546–51.
Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2):155–60.
Mahata, B., X. Zhang, A. A. Kołodziejczyk, V. Proserpio, L. Haim-Vilmovsky, A. E. Taylor, D. Hebenstreit, et al. 2014. “Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis.” Cell Rep. 7 (4):1130–42.
Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9:Article 39.
Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3):751–54.
Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6):712–24.
Zhu, J., H. Yamane, J. Cote-Sierra, L. Guo, and W. E. Paul. 2006. “GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.” Cell Res. 16 (1):3–10.
3 Comments on filtering by abundance
Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference. In applications involving hypothesis testing, these genes typically do not provide enough evidence to reject the null hypothesis yet they still increase the severity of the multiple testing correction. The discreteness of the counts may also interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in many RNA-seq analysis pipelines before the application of downstream methods.
The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness and to avoid zeroes, e.g., for normalization purposes. By comparison, the filter statistic for hypothesis testing is mainly required to be independent of the test statistic under the null hypothesis (Bourgon, Gentleman, and Huber 2010). Given these differences in priorities, we (or the relevant function) will filter at each step as appropriate, rather than applying a single filter for the entire analysis. For example,
computeSumFactors()
will apply a somewhat stringent filter based on the average count, whiletrendVar()
will apply a relatively relaxed filter based on the average log-expression. Other applications will not do any abundance-based filtering at all (e.g.,denoisePCA()
) to preserve biological signal from lowly expressed genes.Nonetheless, if global filtering is desired, it is simple to achieve by simply subsetting the
SingleCellExperiment
object. The example below demonstrates how we could remove genes with average counts less than 1 in the HSC dataset. The number ofTRUE
values indemo.keep
corresponds to the number of retained rows/genes after filtering.