SingleCellExperiment 1.8.0
The SingleCellExperiment
class is a light-weight container for single-cell genomics data.
It extends the RangedSummarizedExperiment
class and follows similar conventions,
i.e., rows should represent features (genes, transcripts, genomic regions) and columns should represent cells.
It provides methods for storing dimensionality reduction results and data for alternative feature sets (e.g., synthetic spike-in transcripts, antibody tags).
It is the central data structure for Bioconductor single-cell packages like scater and scran.
SingleCellExperiment
objects can be created via the constructor of the same name:
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
An alternative approach is via coercion from SummarizedExperiment
objects.
se <- SummarizedExperiment(list(counts=counts))
as(se, "SingleCellExperiment")
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
To demonstrate the use of the class, we will the Allen data set from the scRNAseq package.
library(scRNAseq)
sce <- ReprocessedAllenData("tophat_counts")
sce
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(1): tophat_counts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC
The set of operations that can be applied to a RangedSummarizedExperiment
are also applicable to any instance of a SingleCellExperiment
.
This includes access to assay data via assay()
, column metadata with colData()
, and so on.
We compute log-transformed normalized expression values from the count matrix.
(We note that many of these steps can be performed as one-liners from the scater package,
but we will show them here in full to demonstrate the capabilities of the SingleCellExperiment
class.)
counts <- assay(sce, "tophat_counts")
libsizes <- colSums(counts)
size.factors <- libsizes/mean(libsizes)
logcounts(sce) <- log2(t(t(counts)/size.factors) + 1)
assayNames(sce)
## [1] "tophat_counts" "logcounts"
We obtain the PCA and t-SNE representations of the data and add them to the object with the reducedDims()<-
method.
pca_data <- prcomp(t(logcounts(sce)), rank=50)
library(Rtsne)
set.seed(5252)
tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE)
reducedDims(sce) <- list(PCA=pca_data$x, TSNE=tsne_data$Y)
sce
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(2): tophat_counts logcounts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
## altExpNames(1): ERCC
The stored coordinates can be retrieved by name or by numerical index. Each row of the coordinate matrix is assumed to correspond to a cell, while each column represents a dimension.
reducedDims(sce)
## List of length 2
## names(2): PCA TSNE
reducedDimNames(sce)
## [1] "PCA" "TSNE"
head(reducedDim(sce, "PCA")[,1:2])
## PC1 PC2
## SRR2140028 -70.23670 33.78880
## SRR2140022 -41.00366 49.76633
## SRR2140055 -133.70763 7.68870
## SRR2140083 -36.26099 113.18806
## SRR2139991 -100.86439 15.94740
## SRR2140067 -97.71210 35.77881
head(reducedDim(sce, "TSNE")[,1:2])
## [,1] [,2]
## SRR2140028 -5.089887 -0.9326747
## SRR2140022 -5.021462 0.9825430
## SRR2140055 7.877810 -1.7014442
## SRR2140083 -3.446139 8.4331700
## SRR2139991 -4.332190 2.0750396
## SRR2140067 -5.229462 -3.1371659
Any subsetting by column of sce_sub
will also lead to subsetting of the dimensionality reduction results by cell.
dim(reducedDim(sce, "PCA"))
## [1] 379 50
dim(reducedDim(sce[,1:10], "PCA"))
## [1] 10 50
In the SingleCellExperiment
, users can assign arbitrary names to entries of assays
.
To assist interoperability between packages, we provide some suggestions for what the names should be for particular types of data:
counts
: Raw count data, e.g., number of reads or transcripts for a particular gene.normcounts
: Normalized values on the same scale as the original counts.
For example, counts divided by cell-specific size factors that are centred at unity.logcounts
: Log-transformed counts or count-like values.
In most cases, this will be defined as log-transformed normcounts
, e.g., using log base 2 and a pseudo-count of 1.cpm
: Counts-per-million.
This is the read count for each gene in each cell, divided by the library size of each cell in millions.tpm
: Transcripts-per-million.
This is the number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).Each of these suggested names has an appropriate getter/setter method for convenient manipulation of the SingleCellExperiment
.
For example, we can take the (very specifically named) tophat_counts
name and assign it to counts
instead:
counts(sce) <- assay(sce, "tophat_counts")
sce
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(3): tophat_counts logcounts counts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
## altExpNames(1): ERCC
dim(counts(sce))
## [1] 20816 379
This means that functions expecting count data can simply call counts()
without worrying about package-specific naming conventions.
Many scRNA-seq experiments contain sequencing data for multiple feature types beyond the endogenous genes:
Such features can be stored inside the SingleCellExperiment
via the concept of “alternative Experiments”.
These are nested SummarizedExperiment
instances that are guaranteed to have the same number and ordering of columns as the SingleCellExperiment
itself.
Data for endogenous genes and other features can thus be kept separate, which is often desirable as they need to be processed differently.
To illustrate, consider the case of the spike-in transcripts in the Allen data.
We move the corresponding rows out of the assays of the main SingleCellExperiment
and into a nested alternative Experiment.
is.spike <- grepl("^ERCC-", rownames(sce))
sce <- splitAltExps(sce, ifelse(is.spike, "ERCC", "gene"))
altExpNames(sce)
## [1] "ERCC"
The altExp()
method returns a self-contained SingleCellExperiment
instance containing only the spike-in transcripts.
altExp(sce)
## class: SingleCellExperiment
## dim: 92 379
## metadata(2): SuppInfo which_qc
## assays(1): tophat_counts
## rownames(92): ERCC-00002 ERCC-00003 ... ERCC-00170 ERCC-00171
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
Each alternative Experiment can have a different set of assays from the main SingleCellExperiment
.
This is useful in cases where the other feature types must be normalized or transformed differently.
Similarly, the alternative Experiments can have different rowData()
from the main object.
rowData(altExp(sce))$concentration <- runif(nrow(altExp(sce)))
rowData(altExp(sce))
## DataFrame with 92 rows and 1 column
## concentration
## <numeric>
## ERCC-00002 0.034970022039488
## ERCC-00003 0.979286155896261
## ERCC-00004 0.171510147629306
## ERCC-00009 0.577605556230992
## ERCC-00012 0.965110752964392
## ... ...
## ERCC-00164 0.294522685930133
## ERCC-00165 0.444945453200489
## ERCC-00168 0.282102683791891
## ERCC-00170 0.0817114380188286
## ERCC-00171 0.417851015692577
rowData(sce)
## DataFrame with 20816 rows and 0 columns
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.15 scRNAseq_1.99.8
## [3] SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.0
## [5] DelayedArray_0.12.0 BiocParallel_1.20.0
## [7] matrixStats_0.55.0 Biobase_2.46.0
## [9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
## [11] IRanges_2.20.0 S4Vectors_0.24.0
## [13] BiocGenerics_0.32.0 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 lattice_0.20-38
## [3] assertthat_0.2.1 zeallot_0.1.0
## [5] digest_0.6.22 mime_0.7
## [7] BiocFileCache_1.10.0 R6_2.4.0
## [9] backports_1.1.5 RSQLite_2.1.2
## [11] evaluate_0.14 httr_1.4.1
## [13] pillar_1.4.2 zlibbioc_1.32.0
## [15] rlang_0.4.1 curl_4.2
## [17] blob_1.2.0 Matrix_1.2-17
## [19] rmarkdown_1.16 AnnotationHub_2.18.0
## [21] stringr_1.4.0 RCurl_1.95-4.12
## [23] bit_1.1-14 shiny_1.4.0
## [25] httpuv_1.5.2 compiler_3.6.1
## [27] xfun_0.10 pkgconfig_2.0.3
## [29] htmltools_0.4.0 tidyselect_0.2.5
## [31] tibble_2.1.3 GenomeInfoDbData_1.2.2
## [33] interactiveDisplayBase_1.24.0 bookdown_0.14
## [35] later_1.0.0 crayon_1.3.4
## [37] dplyr_0.8.3 dbplyr_1.4.2
## [39] bitops_1.0-6 rappdirs_0.3.1
## [41] grid_3.6.1 xtable_1.8-4
## [43] DBI_1.0.0 magrittr_1.5
## [45] stringi_1.4.3 XVector_0.26.0
## [47] promises_1.1.0 vctrs_0.2.0
## [49] tools_3.6.1 bit64_0.9-7
## [51] glue_1.3.1 BiocVersion_3.10.1
## [53] purrr_0.3.3 fastmap_1.0.1
## [55] yaml_2.2.0 AnnotationDbi_1.48.0
## [57] BiocManager_1.30.9 ExperimentHub_1.12.0
## [59] memoise_1.1.0 knitr_1.25