1 Introduction

Single-cell RNA sequencing (scRNA-seq) is a widely used technique for profiling gene expression in individual cells. This allows molecular biology to be studied at a resolution that cannot be matched by bulk sequencing of cell populations. The scran package implements methods to perform low-level processing of scRNA-seq data, including cell cycle phase assignment, scaling normalization, variance modelling and testing for corrrelated genes. This vignette provides brief descriptions of these methods and some toy examples to demonstrate their use.

Note: A more comprehensive description of the use of scran (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

2 Setting up the data

We start off with a count matrix where each row is a gene and each column is a cell. These can be obtained by mapping read sequences to a reference genome, and then counting the number of reads mapped to the exons of each gene. (See, for example, the Rsubread package to do both of these tasks.) Alternatively, pseudo-alignment methods can be used to quantify the abundance of each transcript in each cell. For simplicity, we will pull out an existing dataset from the scRNAseq package.

library(scRNAseq)
sce <- GrunPancreasData()
sce
## class: SingleCellExperiment 
## dim: 20064 1728 
## metadata(0):
## assays(1): counts
## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): donor sample
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

This particular dataset is taken from a study of the human pancreas with the CEL-seq protocol (Grun et al. 2016). It is provided as a SingleCellExperiment object (from the SingleCellExperiment package), which contains the raw data and various annotations. We perform some cursory quality control to remove cells with low total counts or high spike-in percentages:

library(scater)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, percent_subsets="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
summary(qcfilter$discard)
##    Mode   FALSE    TRUE 
## logical    1291     437

3 Normalizing cell-specific biases

3.1 Based on the gene counts

Cell-specific biases are normalized using the computeSumFactors method, which implements the deconvolution strategy for scaling normalization (A. T. Lun, Bach, and Marioni 2016). This computes size factors that are used to scale the counts in each cell. The assumption is that most genes are not differentially expressed (DE) between cells, such that any differences in expression across the majority of genes represents some technical bias that should be removed.

library(scran)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
summary(sizeFactors(sce))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.006722 0.442629 0.801312 1.000000 1.295809 9.227575

For larger data sets, clustering should be performed with the quickCluster function before normalization. Briefly, cells are grouped into clusters of similar expression; normalization is applied within each cluster to compute size factors for each cell; and the factors are rescaled by normalization between clusters. This reduces the risk of violating the above assumption when many genes are DE between clusters in a heterogeneous population.

Note that computeSumFactors will automatically remove low-abundance genes, which provides some protection against zero or negative size factor estimates. We also assume that quality control on the cells has already been performed, as low-quality cells with few expressed genes can often have negative size factor estimates.

3.2 Based on the spike-in counts

An alternative approach is to normalize based on the spike-in counts (Lun et al. 2017). The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells. The main practical difference is that spike-in normalization preserves differences in total RNA content between cells, whereas computeSumFactors and other non-DE methods do not.

sce2 <- computeSpikeFactors(sce, "ERCC")
summary(sizeFactors(sce2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01041 0.57760 0.88667 1.00000 1.27679 7.43466

3.3 Computing normalized expression values

Normalized expression values are calculated using the logNormCounts() method from scater (McCarthy et al. 2017). This will use the deconvolution size factors for the endogenous genes, and the spike-in-based size factors for the spike-in transcripts. Each expression value can be interpreted as a log-transformed “normalized count”, and can be used in downstream applications like clustering or dimensionality reduction.

sce <- logNormCounts(sce)

4 Variance modelling

We identify genes that drive biological heterogeneity in the data set by modelling the per-gene variance. The aim is use a subset of highly variable genes in downstream analyses like clustering, to improve resolution by removing genes driven by technical noise. We decompose the total variance of each gene into its biological and technical components by fitting a trend to the endogenous variances (A. T. Lun, McCarthy, and Marioni 2016). The fitted value of the trend is used as an estimate of the technical component, and we subtract the fitted value from the total variance to obtain the biological component for each gene.

dec <- modelGeneVar(sce)
plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance")
curve(metadata(dec)$trend(x), col="blue", add=TRUE)

If we have spike-ins, we can use them to fit the trend instead. This provides a more direct estimate of the technical variance and avoids assuming that most genes do not exhibit biological variaility.

dec2 <- modelGeneVarWithSpikes(sce, 'ERCC')
plot(dec2$mean, dec2$total, xlab="Mean log-expression", ylab="Variance")
points(metadata(dec2)$mean, metadata(dec2)$var, col="red")
curve(metadata(dec2)$trend(x), col="blue", add=TRUE)

If we have some uninteresting factors of variation, we can block on these using block=. This will perform the trend fitting and decomposition within each block before combining the statistics across blocks for output. Statistics for each individual block can also be extracted for further inspection.

dec3 <- modelGeneVar(sce, block=sce$donor)
per.block <- dec3$per.block
par(mfrow=c(3, 2))
for (i in seq_along(per.block)) {
    decX <- per.block[[i]]
    plot(decX$mean, decX$total, xlab="Mean log-expression", 
        ylab="Variance", main=names(per.block)[i])
    curve(metadata(decX)$trend(x), col="blue", add=TRUE)
}

We can then extract some top genes for use in downstream procedures. This is usually done by passing the selected subset of genes to the subset.row argument (or equivalent) in the desired downstream function, as shown below for the PCA step.

# Get the top 10% of genes.
top.hvgs <- getTopHVGs(dec, prop=0.1)

# Get the top 2000 genes.
top.hvgs2 <- getTopHVGs(dec, n=2000)

# Get all genes with positive biological components.
top.hvgs3 <- getTopHVGs(dec, var.threshold=0)

# Get all genes with FDR below 5%.
top.hvgs4 <- getTopHVGs(dec, fdr.threshold=0.05)

# Running the PCA with the 10% of HVGs.
sce <- runPCA(sce, subset_row=top.hvgs)
reducedDimNames(sce)
## [1] "PCA"

5 Automated PC choice

Principal components analysis is commonly performed to denoise and compact the data prior to downstream analysis. A common question is how many PCs to retain; more PCs will capture more biological signal at the cost of retaining more noise and requiring more computational work. One approach to choosing the number of PCs is to use the technical component estimates to determine the proportion of variance that should be retained. This is implemented in denoisePCA(), which takes the estimates returned by modelGeneVar() or friends. (For greater accuracy, we use the fit with the spikes; we also subset to only the top HVGs to remove noise.)

sced <- denoisePCA(sce, dec2, subset.row=getTopHVGs(dec2, prop=0.1))
ncol(reducedDim(sced, "PCA"))
## [1] 50

Another approach is based on the assumption that each subpopulation should be separated from each other on a different axis of variation. Thus, we choose the number of PCs that is not less than the number of subpopulations (which are unknown, of course, so we use the number of clusters as a proxy). It is then a simple to subset the dimensionality reduction result to the desired number of PCs.

output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <- reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
npcs
## [1] 14

6 Graph-based clustering

Clustering of scRNA-seq data is commonly performed with graph-based methods due to their relative scalability and robustness. scran provides several graph construction methods based on shared nearest neighbors (Xu and Su 2015) through the buildSNNGraph() function. This is most commonly generated from the selected PCs, after which methods from the igraph package can be used to identify clusters.

# In this case, using the PCs that we chose from getClusteredPCs().
g <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g)$membership
sce$cluster <- factor(cluster)
table(sce$cluster)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  79 285  64 170 166 164 124  32  57  63  63  24

We can then use methods from scater to visualize this on a \(t\)-SNE plot, as shown below.

sce <- runTSNE(sce, dimred="PCAsub")
plotTSNE(sce, colour_by="cluster", text_by="cluster")

Another diagnostic is to examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many cluster_* functions). We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated. Off-diagonal entries indicate that some clusters are closely related, which is useful to know for checking that they are annotated consistently.

ratio <- clusterModularity(g, cluster, as.ratio=TRUE)

library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
    col=rev(heat.colors(100)))

By default, buildSNNGraph() uses the mode of shared neighbor weighting described by Xu and Su (2015), but other weighting methods (e.g., the Jaccard index) are also available by setting type=. An unweighted \(k\)-nearest neighbor graph can also be constructed with buildKNNGraph().

7 Identifying marker genes

The findMarkers() wrapper function will perform some simple differential expression tests between pairs of clusters to identify potential marker genes for each cluster. For each cluster, we perform \(t\)-tests to identify genes that are DE in each cluster compared to at least one other cluster. All pairwise tests are combined into a single ranking by simply taking the top genes from each pairwise comparison. For example, if we take all genes with Top <= 5, this is equivalent to the union of the top 5 genes from each pairwise comparison. This aims to provide a set of genes that is guaranteed to be able to distinguish the chosen cluster from all others.

markers <- findMarkers(sce, sce$cluster)
markers[[1]][,1:3]
## DataFrame with 20064 rows and 3 columns
##                          Top               p.value                   FDR
##                    <integer>             <numeric>             <numeric>
## CPE__chr4                  1  1.72176035756373e-62  1.43939165892329e-59
## KCNQ1OT1__chr11            1  2.07621300887932e-61  1.60219760808285e-58
## LOC100131257__chr7         1  2.66948101461403e-43  7.87653927606118e-41
## PGM5P2__chr9               1  2.67376429046488e-51  1.27729539818779e-48
## SCG2__chr2                 1 1.41985653427671e-104 2.84880015037282e-100
## ...                      ...                   ...                   ...
## TLX1NB__chr10          19918                     1                     1
## TNFRSF17__chr16        19946                     1                     1
## TSPAN32__chr11         19973                     1                     1
## WFDC11__chr20          20015                     1                     1
## XKR7__chr20            20025                     1                     1

We can modify the tests by passing a variety of arguments to findMarkers(). For example, the code below will perform Wilcoxon tests instead of \(t\)-tests; only identify genes that are upregulated in the target cluster compared to each other cluster; and require a minimum log2-fold change of 1 to be considered significant.

wmarkers <- findMarkers(sce, sce$cluster, 
    test.type="wilcox", direction="up", lfc=1)
wmarkers[[1]][,1:3]
## DataFrame with 20064 rows and 3 columns
##                          Top              p.value                  FDR
##                    <integer>            <numeric>            <numeric>
## FBLIM1__chr1               1 7.55506575286656e-30 2.16549770379307e-26
## KCNQ1OT1__chr11            1  1.3037949839962e-37 1.51912378444456e-33
## PGM5P2__chr9               1 1.46542988650341e-35  9.8007950809348e-32
## UGDH-AS1__chr4             1 1.51427809454203e-37 1.51912378444456e-33
## LOC100131257__chr7         2 7.66838656887684e-33 3.07717016235892e-29
## ...                      ...                  ...                  ...
## TLX1NB__chr10          19918                    1                    1
## TNFRSF17__chr16        19946                    1                    1
## TSPAN32__chr11         19973                    1                    1
## WFDC11__chr20          20015                    1                    1
## XKR7__chr20            20025                    1                    1

We can also modify how the statistics are combined across pairwise comparisons. Setting pval.type="all" requires a gene to be DE between each cluster and every other cluster (rather than any other cluster, as is the default with pval.type="any"). This is a more stringent definition that can yield a more focused set of markers but may also fail to detect any markers in the presence of overclustering.

markers <- findMarkers(sce, sce$cluster, pval.type="all")
markers[[1]][,1:2]
## DataFrame with 20064 rows and 2 columns
##                                 p.value                  FDR
##                               <numeric>            <numeric>
## UGDH-AS1__chr4     8.36194812717426e-20 1.67774127223624e-15
## KCNQ1OT1__chr11    2.72693541696133e-19 2.42652105077295e-15
## LOC100131257__chr7 3.62817142759114e-19 2.42652105077295e-15
## TFDP2__chr3        5.98639305311369e-19 3.00277475544182e-15
## LOC643406__chr20   1.09250908390331e-18 4.38402045188719e-15
## ...                                 ...                  ...
## ZP1__chr11                            1                    1
## ZP3__chr7                             1                    1
## ZPBP__chr7                            1                    1
## ZSCAN1__chr19                         1                    1
## ZSCAN20__chr1                         1                    1

8 Detecting correlated genes

Another useful procedure is to identify significant pairwise correlations between pairs of HVGs. The idea is to distinguish between HVGs caused by random stochasticity, and those that are driving systematic heterogeneity, e.g., between subpopulations. Correlations are computed in the correlatePairs method using a slightly modified version of Spearman’s rho. Testing is performed against the null hypothesis of independent genes, using a permutation method in correlateNull to construct a null distribution.

# Using the first 200 HVs, which are the most interesting anyway.
of.interest <- top.hvgs[1:200]
cor.pairs <- correlatePairs(sce, subset.row=of.interest)
cor.pairs
## DataFrame with 19900 rows and 6 columns
##                gene1             gene2                  rho            p.value
##          <character>       <character>            <numeric>          <numeric>
## 1     UGDH-AS1__chr4   KCNQ1OT1__chr11    0.830409572989877 1.999998000002e-06
## 2          GCG__chr2        TTR__chr18    0.798357336766403 1.999998000002e-06
## 3        REG1A__chr2       PRSS1__chr7    0.784339048912704 1.999998000002e-06
## 4        REG1A__chr2      SPINK1__chr5    0.780994260179146 1.999998000002e-06
## 5        REG1A__chr2       REG1B__chr2    0.760322041823354 1.999998000002e-06
## ...              ...               ...                  ...                ...
## 19896     VIM__chr10 ANKRD20A9P__chr13 5.94148616291629e-05  0.998997001002999
## 19897   PCSK2__chr20        ORC4__chr2  4.8552206469983e-05  0.999283000716999
## 19898      IL8__chr4       UGGT1__chr2 9.44748445112069e-06     0.999565000435
## 19899     ELF3__chr1    METTL21A__chr2 3.53904217566944e-05     0.999699000301
## 19900  SLC30A8__chr8   LOC90834__chr22 1.61231862737779e-05     0.999733000267
##                       FDR   limited
##                 <numeric> <logical>
## 1     3.9188617762938e-06      TRUE
## 2     3.9188617762938e-06      TRUE
## 3     3.9188617762938e-06      TRUE
## 4     3.9188617762938e-06      TRUE
## 5     3.9188617762938e-06      TRUE
## ...                   ...       ...
## 19896   0.999197844790897     FALSE
## 19897   0.999433669109327     FALSE
## 19898   0.999665469326389     FALSE
## 19899      0.999733000267     FALSE
## 19900      0.999733000267     FALSE

As with variance estimation, if uninteresting substructure is present, this should be blocked on using the block= argument in both correlateNull and correlatePairs. This avoids strong correlations due to the blocking factor.

cor.pairs2 <- correlatePairs(sce, subset.row=of.interest, block=sce$donor)

The pairs can be used for choosing marker genes in experimental validation, and to construct gene-gene association networks. In other situations, the pairs may not be of direct interest - rather, we just want to know whether a gene is correlated with any other gene. This is often the case if we are to select a set of correlated HVGs for use in downstream steps like clustering or dimensionality reduction. To do so, we use correlateGenes() to compute a single set of statistics for each gene, rather than for each pair.

cor.genes <- correlateGenes(cor.pairs)
cor.genes
## DataFrame with 200 rows and 5 columns
##                gene               rho              p.value                  FDR
##         <character>         <numeric>            <numeric>            <numeric>
## 1    UGDH-AS1__chr4 0.830409572989877 6.74575596610844e-06 8.99434128814459e-06
## 2         GCG__chr2 0.798357336766403 2.92646766176763e-06 6.46050810811457e-06
## 3       REG1A__chr2 0.784339048912704 3.06153540000306e-06 6.46050810811457e-06
## 4        TTR__chr18 0.798357336766403 2.70748028571699e-06 6.46050810811457e-06
## 5       CHGB__chr20 0.751585010712636  2.7260246712356e-06 6.46050810811457e-06
## ...             ...               ...                  ...                  ...
## 196       FN1__chr2 0.164283584259597 3.61817820000362e-05 3.61817820000362e-05
## 197      MAFA__chr8 0.362234935737863 5.76811017391881e-06 7.95601403299147e-06
## 198        F3__chr1 0.378165190325267 3.68518150000369e-06 6.46418063992851e-06
## 199    SRXN1__chr20 0.340372056884118 3.06153540000306e-05 3.07692000000308e-05
## 200 KDM4A-AS1__chr1 0.330622091196779 1.02051180000102e-05  1.0742229473695e-05
##       limited
##     <logical>
## 1        TRUE
## 2        TRUE
## 3        TRUE
## 4        TRUE
## 5        TRUE
## ...       ...
## 196      TRUE
## 197      TRUE
## 198      TRUE
## 199      TRUE
## 200      TRUE

Significant correlations are defined at a false discovery rate (FDR) threshold of, e.g., 5%. Note that the p-values are calculated by permutation and will have a lower bound. If there were insufficient permutation iterations, a warning will be issued suggesting that more iterations be performed.

9 Converting to other formats

The SingleCellExperiment object can be easily converted into other formats using the convertTo method. This allows analyses to be performed using other pipelines and packages. For example, if DE analyses were to be performed using edgeR, the count data in sce could be used to construct a DGEList.

y <- convertTo(sce, type="edgeR")

By default, rows corresponding to spike-in transcripts are dropped when get.spikes=FALSE. As such, the rows of y may not correspond directly to the rows of sce – users should match by row name to ensure correct cross-referencing between objects. Normalization factors are also automatically computed from the size factors.

The same conversion strategy roughly applies to the other supported formats. DE analyses can be performed using DESeq2 by converting the object to a DESeqDataSet. Cells can be ordered on pseudotime with monocle by converting the object to a CellDataSet (in this case, normalized unlogged expression values are stored).

10 Getting help

Further information can be obtained by examining the documentation for each function (e.g., ?convertTo); reading the Orchestrating Single Cell Analysis book; or asking for help on the Bioconductor support site (please read the posting guide beforehand).

11 Session information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             scater_1.14.6              
##  [3] ggplot2_3.2.1               scRNAseq_2.0.2             
##  [5] scran_1.14.6                SingleCellExperiment_1.8.0 
##  [7] SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
##  [9] BiocParallel_1.20.1         matrixStats_0.55.0         
## [11] Biobase_2.46.0              GenomicRanges_1.38.0       
## [13] GenomeInfoDb_1.22.0         IRanges_2.20.2             
## [15] S4Vectors_0.24.3            BiocGenerics_0.32.0        
## [17] knitr_1.27.2                BiocStyle_2.14.4           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] RColorBrewer_1.1-2            httr_1.4.1                   
##  [5] tools_3.6.2                   R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.1.0                     lazyeval_0.2.2               
## [11] colorspace_1.4-1              withr_2.1.2                  
## [13] tidyselect_1.0.0              gridExtra_2.3                
## [15] bit_1.1-15.1                  curl_4.3                     
## [17] compiler_3.6.2                BiocNeighbors_1.4.1          
## [19] labeling_0.3                  bookdown_0.17                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.23                
## [25] rmarkdown_2.1                 XVector_0.26.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] fastmap_1.0.1                 dbplyr_1.4.2                 
## [31] limma_3.42.2                  rlang_0.4.4                  
## [33] RSQLite_2.2.0                 shiny_1.4.0                  
## [35] DelayedMatrixStats_1.8.0      farver_2.0.3                 
## [37] dplyr_0.8.4                   RCurl_1.98-1.1               
## [39] magrittr_1.5                  BiocSingular_1.2.1           
## [41] GenomeInfoDbData_1.2.2        Matrix_1.2-18                
## [43] Rcpp_1.0.3                    ggbeeswarm_0.6.0             
## [45] munsell_0.5.0                 viridis_0.5.1                
## [47] lifecycle_0.1.0               stringi_1.4.5                
## [49] yaml_2.2.1                    edgeR_3.28.0                 
## [51] zlibbioc_1.32.0               Rtsne_0.15                   
## [53] BiocFileCache_1.10.2          AnnotationHub_2.18.0         
## [55] grid_3.6.2                    blob_1.2.1                   
## [57] promises_1.1.0                dqrng_0.2.1                  
## [59] ExperimentHub_1.12.0          crayon_1.3.4                 
## [61] lattice_0.20-38               cowplot_1.0.0                
## [63] magick_2.3                    locfit_1.5-9.1               
## [65] pillar_1.4.3                  igraph_1.2.4.2               
## [67] glue_1.3.1                    BiocVersion_3.10.1           
## [69] evaluate_0.14                 BiocManager_1.30.10          
## [71] httpuv_1.5.2                  vctrs_0.2.2                  
## [73] gtable_0.3.0                  purrr_0.3.3                  
## [75] assertthat_0.2.1              xfun_0.12                    
## [77] mime_0.8                      rsvd_1.0.2                   
## [79] xtable_1.8-4                  later_1.0.0                  
## [81] viridisLite_0.3.0             tibble_2.1.3                 
## [83] AnnotationDbi_1.48.0          beeswarm_0.2.3               
## [85] memoise_1.1.0                 statmod_1.4.33               
## [87] interactiveDisplayBase_1.24.0

References

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Lun, A. T., D. J. McCarthy, and J. C. Marioni. 2016. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res 5:2122.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12):1974–80.