trainSingleR {SingleR} | R Documentation |
Train the SingleR classifier on one or more reference datasets with known labels.
trainSingleR( ref, labels, genes = "de", sd.thresh = 1, de.method = c("classic", "wilcox", "t"), de.n = NULL, de.args = list(), assay.type = "logcounts", check.missing = TRUE, BNPARAM = KmknnParam() )
ref |
A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples). Each row should be named with the gene name. In general, the expression values are expected to be log-transformed, see Details. Alternatively, a SummarizedExperiment object containing such a matrix. Alternatively, a list or List of SummarizedExperiment objects or numeric matrices containing multiple references, in which case the row names are expected to be the same across all objects. |
labels |
A character vector or factor of known labels for all samples in Alternatively, if |
genes |
A string specifying the feature selection method to be used, see Details. Alternatively, a list of lists of character vectors containing DE genes between pairs of labels. Alternatively, a list of character vectors containing marker genes for each label. |
sd.thresh |
A numeric scalar specifying the minimum threshold on the standard deviation per gene.
Only used when |
de.method |
String specifying how DE genes should be detected between pairs of labels.
Defaults to |
de.n |
An integer scalar specifying the number of DE genes to use when |
de.args |
Named list of additional arguments to pass to |
assay.type |
An integer scalar or string specifying the assay of |
check.missing |
Logical scalar indicating whether rows should be checked for missing values (and if found, removed). |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for building nearest neighbor indices. |
This function uses a training data set to select interesting features and construct nearest neighbor indices in rank space.
The resulting objects can be re-used across multiple classification steps with different test data sets via classifySingleR
.
This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.
Several options are available for feature selection:
genes="de"
identifies genes that are differentially expressed between labels.
This is done by identifying the median expression within each label, and computing differences between medians for each pair of labels.
For each label, the top de.n
genes with the largest differences compared to another label are chosen as markers to distinguish the two labels.
The set of all features is defined as the union of markers from all pairwise comparisons.
genes="sd"
identifies genes that are highly variable across labels.
This is done by identifying the median expression within each label, and computing the standard deviation in the medians across all labels.
The set of all features is defined as those genes with standard deviations above sd.thresh
.
genes="all"
will not perform any feature selection.
If genes="de"
or "sd"
, the expression values are expected to be log-transformed and normalized.
For a single reference, a List is returned containing:
common.genes
:A character vector of all genes that were chosen by the designated feature selection method.
nn.indices
:A List of BiocNeighborIndex objects containing pre-constructed neighbor search indices.
original.exprs
:A List of numeric matrices where each matrix contains all cells for a particular label.
extra
:A List of additional information on the feature selection, for use by classifySingleR
.
This includes the selection method in genes
and method-specific structures that can be re-used during classification.
For multiple references, a List of Lists is returned where each internal List corresponds to a reference in ref
and has the same structure as described above.
Rather than relying on the in-built feature selection, users can pass in their own features of interest to genes
.
The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels.
For example:
genes <- list( A = list(A = character(0), B = "GENE_1", C = c("GENE_2", "GENE_3")), B = list(A = "GENE_100", B = character(0), C = "GENE_200"), C = list(A = c("GENE_4", "GENE_5"), B = "GENE_5", C = character(0)) )
If we consider the entry genes$A$B
, this contains marker genes for label "A"
against label "B"
.
That is, these genes are upregulated in "A"
compared to "B"
.
The outer list should have one list per label, and each inner list should have one character vector per label.
(Obviously, a label cannot have markers against itself, so this is just set to character(0)
.)
Alternatively, genes
can be a named list of character vectors containing per-label markers.
For example:
genes <- list( A = c("GENE_1", "GENE_2", "GENE_3"), B = c("GENE_100", "GENE_200"), C = c("GENE_4", "GENE_5") )
The entry genes$A
represent the genes that are upregulated in A
compared to some or all other labels.
This allows the function to handle pre-defined marker lists for specific cell populations.
However, it obviously captures less information than marker sets for the pairwise comparisons.
If genes
explicitly contains gene identities (as character vectors), ref
can be the raw counts or any monotonic transformation thereof.
The default SingleR policy for dealing with multiple references is to perform the classification for each reference separately and combine the results (see ?combineResults
for an explanation).
To this end, if ref
is a list with multiple references, marker genes are identified separately within each reference when de="genes"
or "sd"
.
This is almost equivalent to running trainSingleR
on each reference separately, except that the final common
set of genes consists of the union of common genes across all references.
We take the union to ensure that correlations are computed from the same set of genes across reference and are thus reasonably comparable in combineResults
.
The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind.
It may not be effective for single-cell reference data where it is not uncommon to have more than 50% zero counts for a given gene such that the median is also zero for each group.
Users are recommended to either set de.method
to another DE ranking method, or detect markers externally and pass a list of markers to genes
(see Examples).
Aaron Lun, based on the original SingleR
code by Dvir Aran.
classifySingleR
, where the output of this function gets used.
combineResults
, to combine results from multiple references.
############################## ## Mocking up training data ## ############################## Ngroups <- 5 Ngenes <- 1000 means <- matrix(rnorm(Ngenes*Ngroups), nrow=Ngenes) means[1:900,] <- 0 colnames(means) <- LETTERS[1:5] g <- rep(LETTERS[1:5], each=4) ref <- SummarizedExperiment( list(counts=matrix(rpois(1000*length(g), lambda=10*2^means[,g]), ncol=length(g))), colData=DataFrame(label=g) ) rownames(ref) <- sprintf("GENE_%s", seq_len(nrow(ref))) ######################## ## Doing the training ## ######################## # Normalizing and log-transforming for automated marker detection. ref <- scater::logNormCounts(ref) trained <- trainSingleR(ref, ref$label) trained trained$nn.indices length(trained$common.genes) # Alternatively, computing and supplying a set of label-specific markers. by.t <- scran::pairwiseTTests(assay(ref, 2), ref$label, direction="up") markers <- scran::getTopMarkers(by.t[[1]], by.t[[2]], n=10) trained <- trainSingleR(ref, ref$label, genes=markers) length(trained$common.genes)