multiBatchNorm {batchelor} | R Documentation |
Perform scaling normalization within each batch to provide comparable results to the lowest-coverage batch.
multiBatchNorm(
  ...,
  batch = NULL,
  assay.type = "counts",
  norm.args = list(),
  min.mean = 1,
  subset.row = NULL,
  normalize.all = FALSE,
  preserve.single = TRUE
)
... |
One or more SingleCellExperiment objects containing counts and size factors. Each object should contain the same number of rows, corresponding to the same genes in the same order. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, |
batch |
A factor specifying the batch of origin for all cells when only a single object is supplied in |
assay.type |
A string specifying which assay contains the counts. |
norm.args |
A named list of further arguments to pass to |
min.mean |
A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. |
subset.row |
A vector specifying which features to use for normalization. |
normalize.all |
A logical scalar indicating whether normalized values should be returned for all genes. |
preserve.single |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in |
When performing integrative analyses of multiple batches, the batches often differ substantially in coverage. This function removes systematic differences in coverage across batches to simplify downstream comparisons. It does so by rescaling the size factors using median-based normalization on the ratio of the average counts between batches. This is roughly equivalent to the between-cluster normalization described by Lun et al. (2016).
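The median-based rescaling can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code used inside multiBatchNorm: the simulated batches, the helper functions centred_sf and adjusted_ave, and the simple abundance filter are all hypothetical choices made for this example.

```r
set.seed(42)
ngenes <- 500

# Hypothetical per-gene abundances shared by both batches.
base <- rgamma(ngenes, shape = 2, rate = 0.2)

# Batch 2 is simulated with roughly 5-fold higher coverage than batch 1.
b1 <- matrix(rpois(ngenes * 100, lambda = base), nrow = ngenes)
b2 <- matrix(rpois(ngenes * 40, lambda = 5 * base), nrow = ngenes)

# Library size factors, centred at unity within each batch (an assumption;
# any within-batch size factors could be used instead).
centred_sf <- function(counts) {
    lib <- colSums(counts)
    lib / mean(lib)
}
sf1 <- centred_sf(b1)
sf2 <- centred_sf(b2)

# Size factor-adjusted average count of each gene within a batch.
adjusted_ave <- function(counts, sf) rowMeans(sweep(counts, 2, sf, "/"))
ave1 <- adjusted_ave(b1, sf1)
ave2 <- adjusted_ave(b2, sf2)

# Simplified abundance filter: keep genes whose mean of the two batch
# averages is at least min.mean.
min.mean <- 1
keep <- (ave1 + ave2) / 2 >= min.mean

# Median ratio of average counts between batches (close to 5 here).
ratio <- median(ave2[keep] / ave1[keep])

# Multiply size factors so that the lowest-coverage batch is left untouched
# and the deeper batch is scaled down to match it.
rescale <- c(1, ratio) / min(1, ratio)
sf1_new <- sf1 * rescale[1]
sf2_new <- sf2 * rescale[2]

# After rescaling, the adjusted averages are comparable across batches.
new1 <- adjusted_ave(b1, sf1_new)
new2 <- adjusted_ave(b2, sf2_new)
median(new2[keep] / new1[keep])
```

Multiplying the size factors of the deeper batch by the median ratio divides its normalized counts by the same amount, which is what brings it down to the coverage of the shallower batch.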
This function will adjust the size factors so that counts in high-coverage batches are scaled downwards to match the coverage of the most shallow batch.
The logNormCounts function will then add the same pseudo-count to all batches before log-transformation.
By scaling downwards, we favour stronger squeezing of log-fold changes from the pseudo-count, mitigating any technical differences in variance between batches.
Of course, genuine biological differences will also be shrunk, but this is less of an issue for upregulated genes with large counts.
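The squeezing effect of the pseudo-count can be seen with a small base R calculation. This sketch assumes a pseudo-count of 1 and a hypothetical two-fold difference in expression; the function name shrunken_lfc and the chosen mean counts are illustrative only.

```r
# Log2-fold change observed after adding a pseudo-count, for a gene whose
# true expression differs by `fold` around a given scaled mean count.
shrunken_lfc <- function(mean_count, fold = 2, pseudo = 1) {
    log2(fold * mean_count + pseudo) - log2(mean_count + pseudo)
}

# The same two-fold change, evaluated at progressively lower coverage.
means <- c(100, 10, 1)
lfc <- shrunken_lfc(means)
round(lfc, 2)
```

The true log2-fold change is 1 in every case, but the value after the pseudo-count moves further towards zero as the scaled counts get smaller. Scaling a deep batch down to the coverage of a shallow one therefore applies comparable shrinkage to both, and genes with large counts are the least affected.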
For comparison, imagine if we ran logNormCounts separately in each batch prior to correction.
In most cases, size factors will be computed within each batch; batch-specific application in logNormCounts will not account for scaling differences between batches.
In contrast, multiBatchNorm will rescale the size factors so that they are comparable across batches.
This removes at least one difference between batches to facilitate easier correction.
Only genes with library size-adjusted average counts greater than min.mean will be used for computing the rescaling factors.
This improves precision and avoids problems with discreteness.
By default, we use min.mean=1, which is usually satisfactory but may need to be lowered for very sparse datasets.
Users can also set subset.row to restrict the set of genes used for computing the rescaling factors.
By default, normalized values will only be returned for genes specified in the subset.
Setting normalize.all=TRUE will return normalized values for all genes.
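The options above might be used as in the following sketch. It reuses the simulated data from the example at the end of this page; the gene subset (chosen), the lowered min.mean value of 0.1 and the output names are arbitrary choices for illustration, and the row counts noted in the comments follow from the behaviour described above rather than from a recorded run.

```r
library(batchelor)
library(SingleCellExperiment)

set.seed(1)

# Two simulated batches with 500 genes each, as in the example below.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# Hypothetical subset of genes used to compute the rescaling factors.
chosen <- 1:200

# Default: normalized values are returned only for the chosen genes.
sub.only <- multiBatchNorm(sce1, sce2, subset.row=chosen)
nrow(sub.only[[1]])  # expected to equal length(chosen)

# normalize.all=TRUE: same rescaling factors, but values for every gene.
all.genes <- multiBatchNorm(sce1, sce2, subset.row=chosen, normalize.all=TRUE)
nrow(all.genes[[1]])  # expected to equal nrow(sce1)

# A lower abundance threshold, as might be needed for very sparse data.
sparse <- multiBatchNorm(sce1, sce2, min.mean=0.1)
```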
A list of SingleCellExperiment objects with normalized log-expression values in the "logcounts" assay (depending on values in norm.args).
Each object contains cells from a single batch.
If preserve.single=TRUE and ... contains only one SingleCellExperiment, that object is returned with an additional "logcounts" assay containing normalized log-expression values.
The order of cells is not changed.
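The single-object case might look like the following sketch, where both batches are stored in one SingleCellExperiment and distinguished by a batch factor. The simulated counts, the library size-based size factors and the batch labels "A" and "B" are assumptions made for this illustration.

```r
library(batchelor)
library(SingleCellExperiment)

set.seed(2)

# One combined count matrix: 100 cells from batch A, then 40 from batch B.
counts <- cbind(
    matrix(rnbinom(50000, mu=10, size=1), ncol=100),
    matrix(rnbinom(20000, mu=50, size=1), ncol=40)
)
sce <- SingleCellExperiment(list(counts=counts))

# Library size factors, used here purely as a simple stand-in.
sizeFactors(sce) <- colSums(counts) / mean(colSums(counts))

# Batch of origin for every cell in the single object.
batch <- factor(rep(c("A", "B"), c(100, 40)))

# With preserve.single=TRUE (the default), a single object is returned.
single <- multiBatchNorm(sce, batch=batch)
assayNames(single)
ncol(single)
```

Because the cells stay in their original order, per-cell annotations such as the batch factor can be reused on the output without reordering.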
Rescaling is only performed on endogenous genes in each SingleCellExperiment object.
If any spike-in transcripts are present in the altExps, their abundances will not be rescaled here, and are no longer directly comparable to the rescaled abundances of the genes.
This is usually not a major problem, as spike-ins are rarely used during the batch correction itself.
However, users should not attempt to perform variance modelling with the spike-ins on the output of this function.
Aaron Lun
Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html
mnnCorrect and fastMNN, for methods that can benefit from rescaling.
logNormCounts, for the calculation of log-transformed normalized expression values.
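library(batchelor)
library(SingleCellExperiment)

# Batch 1: 500 genes by 100 cells, simulated at lower coverage.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

# Batch 2: 500 genes by 40 cells, simulated at higher coverage.
d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# Rescale the size factors and inspect them for each batch.
out <- multiBatchNorm(sce1, sce2)
summary(sizeFactors(out[[1]]))
summary(sizeFactors(out[[2]]))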