scDblFinder {scDblFinder}    R Documentation

scDblFinder

Description

Identification of doublets in single-cell RNAseq directly from counts using overclustering-based generation of artificial doublets.

Usage

scDblFinder(sce, artificialDoublets = NULL, clusters = NULL,
  clust.method = c("louvain", "overcluster", "fast_greedy"),
  samples = NULL, minClusSize = min(50, ncol(sce)/5),
  maxClusSize = NULL, nfeatures = 1000, dims = 20, dbr = NULL,
  dbr.sd = 0.015, k = 20, clust.graph.type = c("snn", "knn"),
  fullTable = FALSE, verbose = is.null(samples),
  score = c("weighted", "ratio", "hybrid"), BPPARAM = SerialParam())

Arguments

sce

A SummarizedExperiment-class

artificialDoublets

The approximate number of artificial doublets to create. If NULL, will be the maximum of the number of cells and '5*nbClusters^2'.
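As an illustration of the default described above, the following base R sketch computes the number of artificial doublets from a cell count and a cluster count. The helper name 'default_n_artificial' is hypothetical and is not part of the package; it only restates the documented rule.

```r
# Hypothetical helper (not exported by scDblFinder): restates the documented
# default for 'artificialDoublets' when it is left as NULL.
default_n_artificial <- function(n_cells, n_clusters) {
  max(n_cells, 5 * n_clusters^2)
}

default_n_artificial(3000, 10)  # 5 * 10^2 = 500, so the cell count wins: 3000
default_n_artificial(1000, 20)  # 5 * 20^2 = 2000, which exceeds 1000: 2000
```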

clusters

The optional cluster assignments (if omitted, will run clustering). This is used to make doublets more efficiently. 'clusters' should either be a vector of labels for each cell, or the name of a colData column of 'sce'.

clust.method

The clustering method if 'clusters' is not given.

samples

A vector of the same length as cells (or the name of a column of 'colData(sce)'), indicating to which sample each cell belongs. Here, a sample is understood as being processed independently. If omitted, doublets will be searched for with all cells together. If given, doublets will be searched for independently for each sample, which is preferable if they represent different captures. If your samples were multiplexed using cell hashes, what you want to give here are the different batches/wells (i.e. independent captures, since doublets cannot arise across them) rather than biological samples.

minClusSize

The minimum cluster size for 'quickCluster'/'overcluster' (default: the smaller of 50 and one fifth of the number of cells); ignored if 'clusters' is given.
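The default given in Usage, 'min(50, ncol(sce)/5)', can be checked with a short base R sketch. The helper name 'default_min_clus_size' is hypothetical and only mirrors that expression.

```r
# Hypothetical helper (not part of scDblFinder): mirrors the default
# expression for 'minClusSize' shown in Usage.
default_min_clus_size <- function(n_cells) {
  min(50, n_cells / 5)
}

default_min_clus_size(50)    # small dataset: 50 / 5 = 10
default_min_clus_size(5000)  # large dataset: capped at 50
```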

maxClusSize

The maximum cluster size for 'overcluster'. Ignored if 'clusters' is given. If NA, clustering will be performed using 'quickCluster', otherwise via 'overcluster'. If missing, the default value will be estimated by 'overcluster'.

nfeatures

The number of top features to use (default 1000).

dims

The number of dimensions used to build the network (default 20).

dbr

The expected doublet rate. By default this is assumed to be 1% per thousand cells captured (so 4% among 4000 cells), which is appropriate for 10x datasets.
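The default rate of 1% per thousand captured cells can be written out as a short base R sketch. The helper name 'expected_dbr' is hypothetical, and the sketch only restates the rule above; it does not reproduce how the package applies 'dbr' internally.

```r
# Hypothetical helper (not part of scDblFinder): 1% expected doublets
# per thousand captured cells, as described for the default 'dbr'.
expected_dbr <- function(n_cells) {
  0.01 * n_cells / 1000
}

expected_dbr(4000)   # 0.04, i.e. 4% for a 4000-cell capture
expected_dbr(10000)  # 0.1, i.e. 10% for a 10000-cell capture
```

A value computed this way could be passed explicitly as 'dbr' when the capture technology is known to deviate from the 10x assumption.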

dbr.sd

The standard deviation of the doublet rate, defaults to 0.015.

k

Number of nearest neighbors (for KNN graph).

clust.graph.type

Either 'snn' or 'knn'.

fullTable

Logical; whether to return the full table including artificial doublets, rather than the table for real cells only (default).

verbose

Logical; whether to print messages and the thresholding plot.

score

Score to use for final classification; either 'weighted' (default), 'ratio' or 'hybrid' (includes information about library size and detection rate).

BPPARAM

Used for multithreading when splitting by samples (i.e. when 'samples' is not NULL); otherwise passed to the subsequent PCA and K/SNN calculations.

Value

The 'sce' object with the following additional colData columns: 'scDblFinder.ratio' (ratio of artificial doublets among neighbors), 'scDblFinder.weighted' (the ratio of artificial doublets among neighbors weighted by their distance), 'scDblFinder.score' (the final score used, by default the same as 'scDblFinder.weighted'), and 'scDblFinder.class' (whether the cell is called as 'doublet' or 'singlet'). Alternatively, if 'fullTable=TRUE', a data.frame will be returned with information about real and artificial cells.

Examples

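library(SingleCellExperiment)
# Simulate a 50 x 50 count matrix: each row is drawn from a Poisson
# distribution with a mean between 0 and 5
m <- t(sapply( seq(from=0, to=5, length.out=50),
               FUN=function(x) rpois(50,x) ) )
sce <- SingleCellExperiment( list(counts=m) )
# Run doublet detection, then tabulate the singlet/doublet calls
sce <- scDblFinder(sce, verbose=FALSE)
table(sce$scDblFinder.class)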


[Package scDblFinder version 1.2.0 Index]