clusterMNN {batchelor}    R Documentation

Cluster-based MNN

Description

Perform MNN correction based on cluster centroids, then use the corrected centroid coordinates to correct the per-cell expression values with a variable-bandwidth Gaussian kernel.

Usage

clusterMNN(
  ...,
  batch = NULL,
  restrict = NULL,
  clusters = NULL,
  cos.norm = TRUE,
  merge.order = NULL,
  auto.merge = FALSE,
  min.batch.skip = 0,
  subset.row = NULL,
  correct.all = FALSE,
  assay.type = "logcounts",
  BSPARAM = ExactParam(),
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

Arguments

...

One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the assay.type assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together.

If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so batch should also be specified.

Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside each list were passed directly to ....

batch

A factor specifying the batch of origin for all cells when only a single object is supplied in .... This is ignored if multiple objects are present.

restrict

A list of length equal to the number of objects in .... Each entry of the list corresponds to one batch and specifies the cells to use when computing the correction.

clusters

A list of length equal to the number of objects in ..., containing the assigned cluster for each cell in each batch. If NULL, clusters are generated by k-means clustering, with the number of centers equal to the square root of the number of cells in each batch.

cos.norm

A logical scalar indicating whether cosine normalization should be performed on the input data prior to PCA.

merge.order

An integer vector containing the linear merge order of batches in .... Alternatively, a list of lists representing a tree structure specifying a hierarchical merge order.

auto.merge

Logical scalar indicating whether to automatically identify the “best” merge order.

min.batch.skip

Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step.

subset.row

A vector specifying which features to use for correction.

correct.all

Logical scalar indicating whether a rotation matrix should be computed for genes not in subset.row.

assay.type

A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA. The default, ExactParam(), performs an exact decomposition; see multiBatchPCA for details.

BNPARAM

A BiocNeighborParam object specifying the nearest neighbor algorithm.

BPPARAM

A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized.
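
Two of the defaults described above, cosine normalization (cos.norm) and k-means clustering when clusters is NULL, can be illustrated in base R. This is only a sketch under stated assumptions, not the code used by clusterMNN: the mock matrix x is made up, "root" is read as the square root, and rounding the number of centers up is a choice made here. The two steps are shown independently, with no claim about the order in which the function applies them.

```r
# Illustrative base-R sketch; not the batchelor internals.
set.seed(100)
x <- matrix(rnorm(2000), nrow = 20)   # mock data: 20 genes (rows), 100 cells (columns)

# Cosine normalization: scale each cell (column) to unit L2 norm.
l2 <- sqrt(colSums(x^2))
x.norm <- sweep(x, 2, l2, "/")

# Default-style clustering: k-means on the cells, with the number of centers
# set to the square root of the number of cells (rounded up; an assumption).
n.centers <- ceiling(sqrt(ncol(x)))
clust <- kmeans(t(x), centers = n.centers, iter.max = 50)$cluster
```

A vector such as clust, one per batch, is what each entry of the clusters list is expected to look like.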

Details

This function is motivated by the scenario where each batch has been clustered separately and each cluster has already been annotated with some meaningful biological state. We want to identify which biological states match each other across batches; this is achieved by identifying mutual nearest neighbors based on the cluster centroids with reducedMNN.

MNN pairs are identified with k=1 to ensure that each cluster has no more than one match in another batch. This reduces the risk of inadvertently merging together different clusters from the same batch. By comparison, higher values of k may result in many-to-one mappings between batches such that the correction will implicitly force different clusters together.
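
The effect of k=1 can be seen in a small base-R sketch that finds mutual nearest neighbors between two sets of centroids. The centroid matrices cent1 and cent2 are hypothetical, and the search is a brute-force illustration rather than the reducedMNN implementation.

```r
set.seed(42)
# Hypothetical centroids: rows are clusters, columns are dimensions.
cent1 <- matrix(rnorm(8), ncol = 2)   # 4 clusters in batch 1
cent2 <- matrix(rnorm(6), ncol = 2)   # 3 clusters in batch 2

# Euclidean distances between every batch-1 and batch-2 centroid.
n1 <- nrow(cent1)
d <- as.matrix(dist(rbind(cent1, cent2)))[seq_len(n1), -seq_len(n1), drop = FALSE]

# Nearest neighbor (k = 1) in the other batch, in each direction.
nn.1to2 <- apply(d, 1, which.min)   # closest batch-2 cluster for each batch-1 cluster
nn.2to1 <- apply(d, 2, which.min)   # closest batch-1 cluster for each batch-2 cluster

# A pair is mutual if each cluster is the other's nearest neighbor.
is.mutual <- nn.2to1[nn.1to2] == seq_len(n1)
mnn.pairs <- data.frame(
    first = unname(which(is.mutual)),
    second = unname(nn.1to2[is.mutual])
)
```

Because each cluster has a single nearest neighbor in the other batch, no cluster can appear in more than one row of mnn.pairs.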

Using this guarantee of no-more-than-one mappings across batches, we can form meta-clusters by identifying all components of the resulting MNN graph. Each meta-cluster can be considered to represent some biological state (e.g., cell type), and all of its constituents are the matching clusters within each batch.
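
Forming meta-clusters amounts to finding the connected components of a graph whose nodes are clusters and whose edges are MNN pairs. The following base-R sketch does this by simple label propagation; the cluster labels and pairs are invented for illustration, and this is not necessarily how clusterMNN computes the components.

```r
# Hypothetical MNN pairs between clusters, one row per pair. The labels
# encode batch and cluster name purely for illustration.
pairs <- data.frame(
    first  = c("b1.A", "b2.A", "b1.B"),
    second = c("b2.A", "b3.A", "b3.B"),
    stringsAsFactors = FALSE
)
nodes <- c("b1.A", "b1.B", "b1.C", "b2.A", "b3.A", "b3.B")

# Label propagation: start with one label per cluster, then repeatedly give
# both ends of every edge the smaller of their two labels until nothing changes.
meta <- setNames(seq_along(nodes), nodes)
repeat {
    old <- meta
    for (i in seq_len(nrow(pairs))) {
        ends <- c(pairs$first[i], pairs$second[i])
        meta[ends] <- min(meta[ends])
    }
    if (identical(old, meta)) break
}

# Clusters grouped by meta-cluster; unmatched clusters form singletons.
split(names(meta), meta)
```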

As an extra courtesy, clusterMNN will also compute corrected values for each cell. This is done by applying a Gaussian kernel to the correction vectors for the centroids, where the bandwidth is proportional to the distance between that cell and the centroid of its assigned cluster. This yields a smooth correction function that avoids edge effects at cluster boundaries.
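
One way to picture the per-cell correction is as a kernel-weighted average of the centroid correction vectors. The base-R sketch below rests on several assumptions: the helper correct_cell, its scale argument, the exact kernel form and all input values are hypothetical, and the real function may differ in its details.

```r
set.seed(1)
# Hypothetical inputs in a 2-dimensional space.
centroids <- rbind(c(0, 0), c(5, 0), c(0, 5))   # one row per cluster
shift <- rbind(c(1, 0), c(0, 1), c(-1, -1))     # correction vector per centroid
assigned <- sample(3, 30, replace = TRUE)       # assigned cluster of each cell
cells <- centroids[assigned, ] + matrix(rnorm(60, sd = 0.5), ncol = 2)

# Hypothetical helper: corrects one cell with a Gaussian kernel over centroids.
correct_cell <- function(cell, own, centroids, shift, scale = 1) {
    # Bandwidth proportional to the distance from the cell to its own centroid
    # (floored to avoid division by zero for a cell sitting on the centroid).
    bandwidth <- max(scale * sqrt(sum((cell - centroids[own, ])^2)), 1e-8)
    d2 <- rowSums(sweep(centroids, 2, cell)^2)  # squared distance to each centroid
    # Subtracting min(d2) leaves the normalized weights unchanged but avoids underflow.
    w <- exp(-(d2 - min(d2)) / (2 * bandwidth^2))
    cell + colSums(w * shift) / sum(w)          # weighted average of the shifts
}

corrected <- t(vapply(
    seq_len(nrow(cells)),
    function(i) correct_cell(cells[i, ], assigned[i], centroids, shift),
    numeric(2)
))
```

Cells close to their own centroid get a narrow kernel and are moved almost entirely by that centroid's correction vector, while cells far from it blend the vectors of several centroids, which is what smooths the correction across cluster boundaries.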

Value

A SingleCellExperiment containing per-cell expression values where each row is a gene and each column is a cell. This has the same format as the output of fastMNN but with an additional cluster field in the colData containing the cluster identity of each cell. The metadata contains:

Author(s)

Aaron Lun

References

Lun ATL (2019). Cluster-based mutual nearest neighbors correction. https://marionilab.github.io/FurtherMNN2018/theory/clusters.html

Lun ATL (2019). A discussion of the known failure points of the fastMNN algorithm. https://marionilab.github.io/FurtherMNN2018/theory/failure.html

See Also

reducedMNN, which is used internally to perform the correction.

Examples

library(batchelor)

# Mocking up some data for multiple batches
# (1000 genes, 3 cell types with distinct mean profiles):
means <- matrix(rnorm(3000), ncol=3)
colnames(means) <- LETTERS[1:3]

B1 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)]
B1 <- B1 + rnorm(length(B1))

B2 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)]
B2 <- B2 + rnorm(length(B2)) + rnorm(nrow(B2)) # batch effect.

# Applying the correction with some made-up clusters:
cluster1 <- kmeans(t(B1), centers=10)$cluster
cluster2 <- kmeans(t(B2), centers=10)$cluster
out <- clusterMNN(B1, B2, clusters=list(cluster1, cluster2))

# Plotting the first two corrected dimensions, colored by batch:
rd <- reducedDim(out, "corrected")
plot(rd[,1], rd[,2], col=out$batch)


[Package batchelor version 1.4.0 Index]