IdTaxa {DECIPHER} | R Documentation |
Classifies sequences according to a training set by assigning a confidence to taxonomic labels for each taxonomic level.
IdTaxa(test, trainingSet, type = "extended", strand = "both", threshold = 60, bootstraps = 100, samples = L^0.47, minDescend = 0.98, fullLength = 0, processors = 1, verbose = TRUE)
test |
An |
trainingSet |
An object of class |
type |
Character string indicating the type of output desired. This should be (an abbreviation of) one of |
strand |
Character string indicating the orientation of the |
threshold |
Numeric specifying the confidence at which to truncate the output taxonomic classifications. Lower values of |
bootstraps |
Integer giving the maximum number of bootstrap replicates to perform for each sequence. The number of bootstrap replicates is set automatically such that (on average) 99% of k-mers are sampled in each |
samples |
A function or call written as a function of ‘L’, which will evaluate to a numeric vector the same length as ‘L’. Typically of the form “ |
minDescend |
Numeric giving the minimum fraction of |
fullLength |
Numeric specifying the fold-difference in sequence lengths between sequences in |
processors |
The number of processors to use, or |
verbose |
Logical indicating whether to display progress. |
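As a rough illustration of the samples argument, the sketch below evaluates the default expression L^0.47 from the usage line for a few values of L. The values of L and the alternative function alt_samples are hypothetical and chosen only for this example; the sketch uses base R and does not reproduce how IdTaxa evaluates the argument internally.

```r
# Illustrative values of L (hypothetical, chosen only for this sketch)
L <- c(100, 500, 1000)

# The default from the usage line, written as a call in terms of L
default_samples <- quote(L^0.47)
n_default <- eval(default_samples, list(L = L))

# A hypothetical alternative, written as a function of L
alt_samples <- function(L) 2 * L^0.5
n_alt <- alt_samples(L)

# Both forms yield one numeric value per element of L
stopifnot(is.numeric(n_default), length(n_default) == length(L))
stopifnot(is.numeric(n_alt), length(n_alt) == length(L))

round(n_default, 1)
round(n_alt, 1)
```

Either form satisfies the stated requirement that the result be a numeric vector the same length as L.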
Sequences in test are each assigned a taxonomic classification based on the trainingSet created with LearnTaxa. Each taxonomic level is given a confidence between 0% and 100%, and the taxonomy is truncated where confidence drops below threshold. If the taxonomic classification was truncated, the last group is labeled with “unclassified_” followed by the final taxon's name. Note that the reported confidence is not a p-value but does directly relate to a given classification's probability of being wrong. The default threshold of 60% is intended to minimize the rate of incorrect classifications. Lower values of threshold (e.g., 50%) may be preferred to increase the taxonomic depth of classifications. Values of 60% or 50% are recommended for nucleotide sequences and 50% or 40% for amino acid sequences.
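The truncation rule described above can be sketched in base R. The helper truncate_taxonomy and the taxa and confidences below are hypothetical, made up for illustration; this is not DECIPHER's implementation, and it assumes the first (root) level is always retained.

```r
# Hypothetical helper, not part of DECIPHER: cut a classification at the
# first level whose confidence falls below the threshold.
truncate_taxonomy <- function(taxon, confidence, threshold = 60) {
  below <- which(confidence < threshold)
  if (length(below) == 0L) {
    return(taxon)  # nothing to truncate
  }
  # Assumption: the first (root) level is always kept
  n_keep <- max(below[1L] - 1L, 1L)
  kept <- taxon[seq_len(n_keep)]
  # Label the truncated end with "unclassified_" plus the final kept taxon
  c(kept, paste0("unclassified_", kept[n_keep]))
}

# Made-up classification for a single sequence
taxon <- c("Root", "Bacteria", "Firmicutes", "Bacilli", "Lactobacillales")
confidence <- c(100, 92.1, 75.4, 58.3, 41.0)

at60 <- truncate_taxonomy(taxon, confidence, threshold = 60)
at50 <- truncate_taxonomy(taxon, confidence, threshold = 50)
at60  # ends in "unclassified_Firmicutes"
at50  # ends in "unclassified_Bacilli"
```

Lowering the threshold from 60 to 50 retains one more level in this made-up case, which is the trade-off between depth and error rate described above.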
If type is "extended" (the default) then an object of class Taxa and subclass Test is returned. This is stored as a list with elements corresponding to their respective sequence in test. Each list element contains components:
taxon |
A character vector containing the taxa to which the sequence was assigned. |
confidence |
A numeric vector giving the corresponding percent confidence for each taxon. |
rank |
If the classifier was trained with a set of |
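To show how a list of this shape might be summarized, the sketch below builds a small stand-in by hand and pulls out the deepest taxon and its confidence for each sequence. The object ids_like, its sequence names, and its values are hypothetical; a real result would come from IdTaxa, and only the taxon and confidence components described above are assumed.

```r
# Hand-built stand-in for an "extended" result (hypothetical values)
ids_like <- list(
  seq1 = list(taxon = c("Root", "Bacteria", "Firmicutes"),
              confidence = c(100, 95.2, 71.8)),
  seq2 = list(taxon = c("Root", "Bacteria", "Proteobacteria"),
              confidence = c(100, 88.0, 64.5))
)

# Deepest assigned taxon and its confidence, one per sequence
deepest <- vapply(ids_like,
                  function(x) x$taxon[length(x$taxon)],
                  character(1))
deepest_conf <- vapply(ids_like,
                       function(x) x$confidence[length(x$confidence)],
                       numeric(1))

summary_table <- data.frame(sequence = names(ids_like),
                            taxon = unname(deepest),
                            confidence = unname(deepest_conf),
                            stringsAsFactors = FALSE)
summary_table
```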
If type is "collapsed" then a character vector is returned with the taxonomic assignment for each sequence. This takes the repeating form “Taxon name [rank, confidence%]; ...” if ranks were supplied during training, or “Taxon name [confidence%]; ...” otherwise.
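The repeating form above can be imitated in base R as follows. The helper collapse_assignment, the rank names, and the rounding to one decimal place are assumptions made for this sketch, not DECIPHER's own formatting code.

```r
# Hypothetical helper: build a "collapsed" style string for one sequence.
# Rounding to `digits` decimal places is an assumption of this sketch.
collapse_assignment <- function(taxon, confidence, rank = NULL, digits = 1) {
  conf <- paste0(round(confidence, digits), "%")
  label <- if (is.null(rank)) conf else paste(rank, conf, sep = ", ")
  paste0(taxon, " [", label, "]", collapse = "; ")
}

# Made-up assignment, with and without ranks
with_rank <- collapse_assignment(c("Root", "Bacteria"),
                                 c(100, 95.2),
                                 rank = c("rootrank", "domain"))
without_rank <- collapse_assignment(c("Root", "Bacteria"), c(100, 95.2))

with_rank     # "Root [rootrank, 100%]; Bacteria [domain, 95.2%]"
without_rank  # "Root [100%]; Bacteria [95.2%]"
```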
Erik Wright eswright@pitt.edu
Murali, A., et al. (2018). IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome, 6, 140. https://doi.org/10.1186/s40168-018-0521-5
library(DECIPHER)

data("TrainingSet_16S")

# import test sequences
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# remove any gaps in the sequences
dna <- RemoveGaps(dna)

# classify the test sequences
ids <- IdTaxa(dna, TrainingSet_16S, strand="top")
ids

# view the results
plot(ids, TrainingSet_16S)