emptyDrops {DropletUtils} | R Documentation |
Distinguish between droplets containing cells and ambient RNA in a droplet-based single-cell RNA sequencing experiment.
testEmptyDrops(m, lower=100, niters=10000, test.ambient=FALSE, ignore=NULL, alpha=NULL, BPPARAM=SerialParam()) emptyDrops(m, lower=100, retain=NULL, barcode.args=list(), ...)
m |
A numeric matrix object - usually a dgTMatrix or dgCMatrix - containing droplet data prior to any filtering or cell calling. Columns represent barcoded droplets, rows represent genes. |
lower |
A numeric scalar specifying the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. |
niters |
An integer scalar specifying the number of iterations to use for the Monte Carlo p-value calculations. |
test.ambient |
A logical scalar indicating whether results should be returned for barcodes with totals less than or equal to |
ignore |
A numeric scalar specifying the lower bound on the total UMI count, at or below which barcodes will be ignored (see Details for how this differs from |
alpha |
A numeric scalar specifying the scaling parameter for the Dirichlet-multinomial sampling scheme. |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be used to compute p-values. |
retain |
A numeric scalar specifying the threshold for the total UMI count above which all barcodes are assumed to contain cells. |
barcode.args |
Further arguments to pass to |
... |
Further arguments to pass to |
testEmptyDrops
will return a DataFrame with the following components:
Total
:Integer, the total UMI count for each barcode.
LogProb
:Numeric, the log-probability of observing the barcode's count vector under the null model.
PValue
:Numeric, the Monte Carlo p-value against the null model.
Limited
:Logical, indicating whether a lower p-value could be obtained by increasing niters
.
emptyDrops
will return a DataFrame like testEmptyDrops
, with an additional FDR
field.
The metadata of the output DataFrame will contains the ambient profile in ambient
, the estimated/specified value of alpha
, the specified value of lower
and the number of iterations in niters
.
For emptyDrops
, the metadata will also contain the retention threshold in retain
.
testEmptyDrops
The testEmptyDrops
function first obtains an estimate of the composition of the ambient pool of RNA based on the barcodes with total UMI counts less than or equal to lower
.
This assumes that a cell-containing droplet would generally have higher total counts than empty droplets containing RNA from the ambient pool.
Counts for the low-count barcodes are pooled together, and an estimate of the proportion vector for the ambient pool is calculated using goodTuringProportions
.
The count vector for each barcode above lower
is then tested for a significant deviation from these proportions.
Then, testEmptyDrops
will test each barcode for significant deviations from the ambient profile.
The null hypothesis is that transcript molecules are included into droplets by multinomial sampling from the ambient profile.
For each barcode, the probability of obtaining its count vector based on the null model is computed.
Then, niters
count vectors are simulated from the null model.
The proportion of simulated vectors with probabilities lower than the observed multinomial probability for that barcode is used to calculate the p-value.
We use this Monte Carlo approach as an exact multinomial p-value is difficult to calculate.
However, the p-value is lower-bounded by the value of niters
(Phipson and Smyth, 2010), which can result in loss of power if niters
is too small.
Users can check whether this loss of power has any practical consequence by checking the Limited
field in the output.
If any barcodes have Limited=TRUE
but does not reject the null hypothesis, it suggests that niters
should be increased.
The stability of the Monte Carlo $p$-values depends on niters
, which is only set to a default of 10000 for speed.
Larger values improve stability with the only cost being that of time, so users should set niters
to the largest value they are willing to wait for.
The ignore
argument can also be set to ignore barcodes with total counts less than or equal to ignore
.
This differs from the lower
argument in that the ignored barcodes are not necessarily used to compute the ambient profile.
Users can interpret ignore
as the minimum total count required for a barcode to be considered as a potential cell.
In contrast, lower
is the maximum total count below which all barcodes are assumed to be empty droplets.
emptyDrops
The emptyDrops
function identifies droplets that are likely to contain cells by calling testEmptyDrops
.
The Benjamini-Hochberg correction is applied to the Monte Carlo p-values to correct for multiple testing.
Cells can then be defined by taking all barcodes with significantly non-ambient profiles, e.g., at a false discovery rate of 0.1%.
Barcodes that contain more than retain
total counts are always retained.
This ensures that large cells with profiles that are very similar to the ambient pool are not inadvertently discarded.
If retain
is not specified, it is set to the total count at the knee point detected by barcodeRanks
.
Manual specification of retain
may be useful if the knee point was not correctly identified in complex log-rank curves.
Users can also set retain=Inf
to disable automatic retention of barcodes with large totals.
All barcodes with total counts above retain
are assigned p-values of zero during correction, reflecting our assumption that they are true positives.
This ensures that their Monte Carlo p-values do not affect the correction of other genes, and also means that they will have FDR values of zero.
However, their original Monte Carlo p-values are still reported in the output, as these may be useful for diagnostic purposes.
This function will also attempt to detect whether m
contains non-integer values by seeing if the column and row sums are discrete.
If such values are present, m
is first round
ed to the nearest integer value before proceeding.
This may be relevant when the count matrix is generated from pseudo-alignment methods like Alevin (see the tximeta package for details).
Rounding is performed as discrete count values are necessary for the Good-Turing algorithm and for the multinomial simulations.
In general, users should call emptyDrops
rather than testEmptyDrops
.
The latter is a “no frills” version that is largely intended for use within other functions.
If alpha
is set to a positive number, sampling is assumed to follow a Dirichlet-multinomial (DM) distribution.
The parameter vector of the DM distribution is defined as the estimated ambient profile scaled by alpha
.
Smaller values of alpha
model overdispersion in the counts, due to dependencies in sampling between molecules.
If alpha=NULL
, a maximum likelihood estimate is obtained from the count profiles for all barcodes with totals less than or equal to lower
.
If alpha=Inf
, the sampling of molecules is modelled with a multinomial distribution.
Users can check whether the model is suitable by extracting the p-values for all barcodes with test.ambient=TRUE
.
Under the null hypothesis, the p-values for presumed ambient barcodes (i.e., with total counts below lower
) should be uniformly distributed.
Skews in the p-value distribution are indicative of an inaccuracy in the model and/or its estimates (of alpha
or the ambient profile).
NA
values in the resultsWe assume that libraries with total UMI counts below lower
correspond to empty droplets.
These are used to estimate the ambient expression profile against which the remaining libraries are tested.
Under this definition, these low-count libraries cannot be cell-containing droplets and are excluded from the hypothesis testing.
However, it is still desirable for the number of rows of the output DataFrame to be the same as ncol(m)
.
This allows easy subsetting of m
based on a logical vector constructed from the output (e.g., to retain all FDR values below a threshold).
Thus, the rows for the excluded barcodes are filled in with NA
values for all fields in the output.
If test.ambient=FALSE
, non-NA
statistics will be reported for all barcodes.
This is occasionally useful for diagnostics to ensure that the p-values are well-calibrated for barcodes below lower
.
Specifically, if the null hypothesis were true, p-values for low-count barcodes should have a uniform distribution.
Any strong peaks in the p-values near zero indicate that emptyDrops
is not controlling the FDR correctly.
Aaron Lun
Lun A, Riesenfeld S, Andrews T, Dao TP, Gomes T, participants in the 1st Human Cell Atlas Jamboree, Marioni JC (2018). Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. biorXiv.
Phipson B, Smyth GK (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9:Article 39.
# Mocking up some data: set.seed(0) my.counts <- DropletUtils:::simCounts() # Identify likely cell-containing droplets. out <- emptyDrops(my.counts) out is.cell <- out$FDR <= 0.01 sum(is.cell, na.rm=TRUE) # Check if p-values are lower-bounded by 'niters' # (increase 'niters' if any Limited==TRUE and Sig==FALSE) table(Sig=is.cell, Limited=out$Limited)