To install the package, please use the following.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("mbkmeans")
This vignette provides an introductory example on how
to work with the mbkmeans
package, which contains an
implementation of the mini-batch k-means algorithm
proposed in (Sculley 2010) for large single-cell
sequencing data.
The main function to be used by the users is mbkmeans
.
This is implemented as an S4 generic and methods are
implemented for matrix
, Matrix
, HDF5Matrix
,
DelayedMatrix
, SummarizedExperiment
, and
SingleCellExperiment
.
Most of this work was inspired by the
MiniBatchKmeans()
function implemented in the
ClusterR
R package and we re-use many of the
C++ functions implemented there.
Our main contribution here is to provide an
interface to the DelayedArray
and HDF5Array
packages, allowing the user to run the mini-batch
k-means algorithm on data that do not fit
entirely in memory.
The motivation for this work is the clustering of
large single-cell RNA-sequencing (scRNA-seq) datasets,
and hence the main focus is on Bioconductor’s
SingleCellExperiment
and SummarizedExperiment
data container. For this reason, mbkmeans
assumes a
data representation typical of genomic data, in which
genes (variables) are in the rows and cells
(observations) are in the column. This is contrary to
most other statistical applications, and notably to
the stats::kmeans()
and ClusterR::MiniBatchKmeans()
functions that assume observations in rows.
We provide a lower-level mini_batch()
function that
expects observations in rows and is expected to be a direct
a replacement of ClusterR::MiniBatchKmeans()
for on-disk
data representations such as HDF5
files. The rest of
this document shows the typical use case through the
mbkmeans()
interface, users interested in the
mini_batch()
function should refer to its manual page.
To illustrate a typical use case, we use the
pbmc4k
dataset of the
TENxPBMCData
package.
This dataset contains a set of about 4,000 cells from
peripheral blood from a healthy donor and is expected
to contain many types or clusters of cell.
Note that in this vignette, we do not aim at identifying biologically meaningful clusters here (that would entail a more sophisticated normalization of data and dimensionality reduction), but insted we only aim to show how to run mini-batch k-means on a large HDF5-backed matrix.
We normalize the data simply by scaling for the total
number of counts using scater
and select the 1,000
most variable genes and a random set of 100 cells to
speed-up computations.
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")
set.seed(1034)
idx <- sample(seq_len(NCOL(tenx_pbmc4k)), 100)
sce <- tenx_pbmc4k[, idx]
#normalization
sce <- normalize(sce)
vars <- rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)
sce1000 <- sce[names(vars)[1:1000],]
sce1000
## class: SingleCellExperiment
## dim: 1000 100
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1000): ENSG00000090382 ENSG00000204287 ... ENSG00000117748
## ENSG00000135968
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
mbkmeans
The main function, mbkmeans()
, returns a
list object including centroids
,
WCSS_per_cluster
(where WCSS stands for
within-cluster-sum-of-squares),
best_initialization
, iters_per_initiazation
and Clusters
.
It takes any matrix-like object as input, such
as SummarizedExperiment
, SingleCellExperiment
,
matrix
, DelayedMatrix
and HDF5Matrix
.
In this example, the input is a SingleCellExperiment
object.
res <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts")
The number of clusters (such as k in the
k-means algorithm) is set through the clusters
argument. In this case, we set clusters = 5
for
no particular reason. For SingleCellExperiment
objects, the function provides the reduceMethod
and whichAssay
arguments. The reduceMethod
argument
should specify the dimensionality reduction slot
to use for the clustering, and the default is
“PCA”. Note that this does not perform PCA but
only looks at a slot called “PCA” already stored
in the object. Alternatively, one can specify
whichAssay
as the assay to use as input to
mini-batch k-means. This is used only when
reduceMethod
option is NA
. See ?mbkmeans
for more details.
There are additional arguements in mbkmeans()
function that make the function more flexible
and suitable for more situations.
The size of the mini batches is set through the
batch_size
argument. The default value uses the
blocksize()
function. The blocksize()
function
considers both the number of columns in the dataset
and the amount of RAM on the current matchine to
calculate as large of a batch size as reasonable for
the RAM available to the session. The calculation uses
get_ram()
function in benchmarkme
package. See
the benchmarkme
vignette for more details.
batchsize <- blocksize(sce1000)
batchsize
## [1] 100
In this case, as the whole data fits in memory, the default batch size would be a single batch of size 100.
The performance of mini-batch k-means greatly depends on the process of initialization. We implemented two different initialization methods:
kmeans++
, as proposed in (Arthur and Vassilvitskii 2007). The default is “kmeans++”.The percentage of data to use for the initialization
centroids is set through the init_fraction
argument,
which should be larger than 0 and less than 1, with
default value of 0.25.
res_random <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random")
table(res$Clusters, res_random$Clusters)
##
## 1 2 3 4 5
## 1 0 0 0 0 1
## 2 7 12 0 2 0
## 3 9 0 41 0 0
## 4 0 0 0 0 27
## 5 0 0 1 0 0
Note that if we set init_fraction=1
,
initializer = "random"
, and batch_size=ncol(x)
,
we recover the classic k-means algorithm.
res_full <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random",
batch_size = ncol(sce1000))
res_classic <- kmeans(t(logcounts(sce1000)), centers = 5)
table(res_full$Clusters, res_classic$cluster)
##
## 1 2 3 4 5
## 1 0 0 14 0 42
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 7 0 0 9
## 5 10 0 0 16 0
Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms, 1027–35. Society for Industrial; Applied Mathematics.
Sculley, David. 2010. “Web-Scale K-Means Clustering.” In Proceedings of the 19th International Conference on World Wide Web, 1177–8. ACM.