runTests {ClassifyR}                                              R Documentation

Reproducibly Run Various Kinds of Cross-Validation

Description

Enables classification schemes such as ordinary 10-fold, 100 permutations of 5-fold, and leave-one-out cross-validation to be run. Parallel processing is possible by leveraging the package BiocParallel.

Pre-validation is possible and is activated by specifying a list named "prevalidated" within params; the functions in that list are used on the pre-validated data table. The other named elements of params correspond to the other assays, each of which is added as a pre-validated vector to the clinical table.
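
As a rough sketch, the cross-validation schemes mentioned above correspond to the following kinds of calls, assuming that a matrix measurements and a factor classes are already available and that the default selection, training and prediction parameters are acceptable (the data set and classification names are placeholders):

  # Ordinary 10-fold cross-validation (a single partitioning into folds).
  runTests(measurements, classes, datasetName = "Example",
           classificationName = "Default Pipeline",
           validation = "fold", folds = 10)

  # 100 permutations of 5-fold cross-validation.
  runTests(measurements, classes, datasetName = "Example",
           classificationName = "Default Pipeline",
           validation = "permute", permutations = 100, folds = 5)

  # Leave-one-out cross-validation.
  runTests(measurements, classes, datasetName = "Example",
           classificationName = "Default Pipeline",
           validation = "leaveOut", leave = 1)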

Usage

## S4 method for signature 'matrix'
runTests(measurements, classes, ...)

## S4 method for signature 'DataFrame'
runTests(measurements, classes,
         balancing = c("downsample", "upsample", "none"), featureSets = NULL,
         metaFeatures = NULL, minimumOverlapPercent = 80,
         datasetName, classificationName,
         validation = c("permute", "leaveOut", "fold"),
         permutePartition = c("fold", "split"),
         permutations = 100, percent = 25, folds = 5, leave = 2,
         seed, parallelParams = bpparam(),
         params = list(SelectParams(), TrainParams(), PredictParams()), verbose = 1)

## S4 method for signature 'MultiAssayExperiment'
runTests(measurements, targets = names(measurements), ...)

## S4 method for signature 'MultiAssayExperiment'
runTestsEasyHard(measurements, balancing = c("downsample", "upsample", "none"),
                 easyDatasetID = "clinical", hardDatasetID = names(measurements)[1],
                 featureSets = NULL, metaFeatures = NULL, minimumOverlapPercent = 80,
                 datasetName = NULL, classificationName = "Easy-Hard Classifier",
                 validation = c("permute", "leaveOut", "fold"),
                 permutePartition = c("fold", "split"),
                 permutations = 100, percent = 25, folds = 5, leave = 2,
                 seed, parallelParams = bpparam(), ..., verbose = 1)

Arguments

measurements

Either a matrix, DataFrame or MultiAssayExperiment containing the training data. For a matrix, the rows are features and the columns are samples. The sample identifiers must be present as the column names of the matrix or the row names of the DataFrame. If pre-validation is activated by naming one of the lists in params "prevalidated", measurements must be a MultiAssayExperiment.

classes

Either a vector of class labels of class factor, of the same length as the number of samples in measurements, or, if measurements is a DataFrame, a character vector of length 1 containing the name of the column of measurements that stores the classes. Not used if measurements is a MultiAssayExperiment object.

balancing

A character vector of length 1 specifying how to balance the samples in the training set so that each class ends up with an equal number of samples. "downsample" samples from each class that has more samples than the smallest class, so that every class ends up with as many samples as the smallest class. "upsample" samples with replacement from every class that has fewer samples than the largest class, so that all classes end up with as many samples as the largest class. Lastly, "none" means that no downsampling or upsampling is done and the class imbalance is preserved.

featureSets

An object of type FeatureSetCollection which defines sets of features or sets of edges.

metaFeatures

Either NULL or a DataFrame which has meta-features of the numeric data of interest.

minimumOverlapPercent

If featureSets stores sets of features, the minimum percentage of feature IDs in a set that must be present in measurements for that set to be retained in the analysis. If featureSets stores sets of network edges, the minimum percentage of edges with both vertex IDs found in measurements that a set must have to be retained in the analysis.

targets

If measurements is a MultiAssayExperiment, the names of the data tables to be used. "clinical" is also a valid value and specifies that numeric variables from the clinical data table will be used.

...

For runTests, variables not used by the matrix or MultiAssayExperiment methods that are passed to and used by the DataFrame method. For runTestsEasyHard, easyClassifierParams and hardClassifierParams to be passed to easyHardClassifierTrain.

datasetName

A name associated with the data set used.

classificationName

A name associated with the classification.

validation

Default: "permute". "permute" for repeated permuting. "leaveOut" for leaving all possible combinations of k samples as test samples. "fold" for folding of the data set (no resampling).

permutePartition

Default: "fold". Either "fold" or "split". Only applicable if validation is "permute". If "fold", then the samples are split into folds and in each iteration one is used as the test set. If "split", the samples are split into two groups, the sizes being based on the percent value. One group is used as the training set, the other is the test set.

permutations

Default: 100. Relevant when permuting is used. The number of times to reorder the samples before splitting or folding them.

percent

Default: 25. Used when permutation with the split method is chosen. The percentage of samples to be in the test set.

folds

Default: 5. Relevant when repeated permutations are done and permutePartition is set to "fold" or when validation is set to "fold". The number of folds to break the data set into. Each fold is used once as the test set.

leave

Default: 2. Relevant when leave-k-out cross-validation is used. The number of samples to leave for testing.

seed

The random number generator used for repeated resampling will use this seed, if it is provided. This allows the results of repeated runs on the same input data to be reproduced.

parallelParams

An object of class MulticoreParam or SnowParam.

params

A list of objects of class TransformParams, SelectParams, TrainParams or PredictParams. The order of the objects in the list determines the order in which the stages of classification are done. It may also be a list of such lists for pre-validation; in that case, each list must be named and one of them must be named "prevalidated", which specifies the functions to use on the pre-validated data table. A sketch of this structure is shown after the argument descriptions.

easyDatasetID

The name of a data set in measurements or "clinical" to indicate the patient information in the column data be used.

hardDatasetID

The name of a data set in measurements, different from the value of easyDatasetID, to be used for classifying the samples not classified by the easy classifier.

verbose

Default: 1. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages.
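
As a sketch of the pre-validation structure of params described above, assuming measurements is a MultiAssayExperiment with an assay named "RNA" (the assay name and the choice of stages in each inner list are illustrative only):

  # Each inner list is named after an assay of the MultiAssayExperiment and
  # exactly one element must be named "prevalidated"; it specifies the
  # functions to use on the pre-validated data table.
  prevalidationParams <- list(RNA = list(SelectParams(), TrainParams(), PredictParams()),
                              prevalidated = list(SelectParams(), TrainParams(), PredictParams()))
  # prevalidationParams would then be given as the params argument of runTests,
  # with the MultiAssayExperiment as measurements.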

Value

If the predictor function made a single prediction per sample, then an object of class ClassifyResult. If the predictor function made a set of predictions, then a list of such objects.
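
The returned object (or each element of the returned list) can then be summarised; a minimal sketch, assuming that the performance calculation function calcCVperformance and the "balanced error" performance type of this package are used:

  # result was previously returned by runTests and holds one prediction per sample.
  if(is(result, "ClassifyResult"))
    result <- calcCVperformance(result, performanceType = "balanced error")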

Author(s)

Dario Strbenac

Examples

  #if(require(sparsediscrim))
  #{
    data(asthma)
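    # asthma provides the measurements and classes objects used below.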
    
    resubstituteParams <- ResubstituteParams(nFeatures = seq(5, 25, 5),
                                             performanceType = "balanced error",
                                             better = "lower")
    runTests(measurements, classes, datasetName = "Asthma",
             classificationName = "Different Means", permutations = 5,
             params = list(SelectParams(differentMeansSelection, "t Statistic",
                                        resubstituteParams = resubstituteParams),
                           TrainParams(DLDAtrainInterface),
                           PredictParams(DLDApredictInterface)
                           )
             )
  #}
  
  genesMatrix <- matrix(c(rnorm(90, 9, 1),
                          9.5, 9.4, 5.2, 5.3, 5.4, 9.4, 9.6, 9.9, 9.1, 9.8),
                        ncol = 10, byrow = TRUE)

  colnames(genesMatrix) <- paste("Sample", 1:10)
  rownames(genesMatrix) <- paste("Gene", 1:10)
  genders <- factor(c("Male", "Male", "Female", "Female", "Female",
                      "Female", "Female", "Female", "Female", "Female"))

  # Scenario: Male gender can predict the hard-to-classify Sample 1 and Sample 2.
  clinical <- DataFrame(age = c(31, 34, 32, 39, 33, 38, 34, 37, 35, 36),
                        gender = genders,
                        class = factor(rep(c("Poor", "Good"), each = 5)),
                        row.names = colnames(genesMatrix))
  dataset <- MultiAssayExperiment(ExperimentList(RNA = genesMatrix), clinical)
  selParams <- SelectParams(featureSelection = differentMeansSelection, selectionName = "Difference in Means",
                            resubstituteParams = ResubstituteParams(1:10, "balanced error", "lower"))
  easyHardCV <- runTestsEasyHard(dataset, datasetName = "Test Data", classificationName = "Easy-Hard",
                                 easyClassifierParams = list(minCardinality = 2, minPurity = 0.9),
                                 hardClassifierParams = list(selParams, TrainParams(), PredictParams()),
                                 validation = "leaveOut", leave = 1)  

[Package ClassifyR version 2.14.0 Index]