BioMM {BioMM} R Documentation

BioMM end-to-end prediction

Description

End-to-end prediction with the BioMM framework, using either supervised or unsupervised learning at stage-1, followed by supervised learning at stage-2.

Usage

BioMM(trainData, testData, pathlistDB, featureAnno, restrictUp,
  restrictDown, minPathSize, supervisedStage1 = TRUE, typePCA,
  resample1 = "BS", resample2 = "CV", dataMode = "allTrain",
  repeatA1 = 100, repeatA2 = 1, repeatB1 = 20, repeatB2 = 1,
  nfolds = 10, FSmethod1, FSmethod2, cutP1, cutP2, fdr2,
  FScore = MulticoreParam(), classifier, predMode, paramlist,
  innerCore = MulticoreParam())

Arguments

trainData

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate class membership.

testData

The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate class membership.

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). This is only used for pathway-based stratification (that is, only when stratify is 'pathway').

featureAnno

The annotation data for probe mapping, stored in a data.frame. It must have at least two columns, named 'ID' and 'entrezID'. If it is NULL, the input probes are taken to be from transcriptomic data. (Default: NULL)

restrictUp

The upper bound on the number of probes or genes in each biologically stratified block.

restrictDown

The lower bound on the number of probes or genes in each biologically stratified block.

minPathSize

The minimum pathway size after mapping your own data to the GO database. This is only used for pathway-based stratification (that is, only when stratify is 'pathway').

supervisedStage1

A logical value. If TRUE, then supervised learning models are applied; if FALSE, unsupervised learning.

typePCA

The type of PCA. Available options are c('regular', 'sparse').

resample1

The resampling method at stage-1. Valid options are 'CV' for cross validation and 'BS' for bootstrap resampling. The default is 'BS'.

resample2

The resampling method at stage-2. Valid options are 'CV' for cross validation and 'BS' for bootstrap resampling. The default is 'CV'.

dataMode

The input training data mode for model training. It is used only if 'testData' is present. It can be a subset of the training data or the entire training data: 'subTrain' is given for a subset and 'allTrain' for the entire training dataset.

repeatA1

The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'.

repeatA2

The number of repeats N used during resampling prediction. The default is 1 for 'CV'.

repeatB1

The number of repeats N used for generating stage-2 test data prediction scores. The default is 20.

repeatB2

The number of repeats N used for test data prediction. The default is 1.

nfolds

The number of folds used for cross validation. The default is 10.

FSmethod1

Feature selection methods at stage-1. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox').

FSmethod2

Feature selection methods at stage-2. Features that are positively associated with the outcome will be used.

cutP1

The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc).

cutP2

The cutoff used for p value thresholding at stage-2.

fdr2

Multiple testing correction method at stage-2. Available options are c(NULL, 'fdr', 'BH', 'holm', etc). See also p.adjust. The default is NULL.
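
As a rough illustration of how a p value cutoff and a multiple testing correction interact, the sketch below applies base R's p.adjust to made-up p values. The feature names and values are hypothetical, and the code is not taken from the BioMM source; it only shows the kind of thresholding that cutP2 and fdr2 describe.

```r
## Made-up p values for five stage-2 features (hypothetical values).
pvals <- c(f1 = 0.001, f2 = 0.03, f3 = 0.04, f4 = 0.2, f5 = 0.6)
cutP2 <- 0.05

## Without correction (fdr2 = NULL): threshold the raw p values.
rawSelected <- names(pvals)[pvals < cutP2]

## With Benjamini-Hochberg correction (fdr2 = 'BH'): threshold the
## adjusted p values instead, which is stricter than the raw cutoff.
padj <- p.adjust(pvals, method = "BH")
adjSelected <- names(pvals)[padj < cutP2]

rawSelected
adjSelected
```

With these toy values, fewer features pass the cutoff after correction than before it.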

FScore

The number of cores used for feature selection, supplied as a BiocParallel parameter object. The default is MulticoreParam().

classifier

Machine learning classifiers at both stages.

predMode

The prediction mode at both stages. Available options are c('probability', 'classification', 'regression').

paramlist

A list of model parameters at both stages.

innerCore

The number of cores used for computation, supplied as a BiocParallel parameter object. The default is MulticoreParam().
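
The argument descriptions above imply particular shapes for the main inputs. The following is a minimal sketch built from made-up toy values, not from data shipped with the package: the probe IDs, Entrez IDs and pathway IDs are hypothetical, and the size filter at the end only illustrates the idea behind restrictUp and restrictDown, not the package's internal implementation.

```r
set.seed(1)

## Toy training data: the label is the first column (0/1 for binary classes),
## followed by one column per probe. All names and values are made up.
nSamples <- 6
trainData <- data.frame(label = c(0, 1, 0, 1, 0, 1),
                        matrix(rnorm(nSamples * 4), nrow = nSamples,
                               dimnames = list(NULL, paste0("cg", 1:4))))

## Toy annotation: must contain at least the columns 'ID' and 'entrezID'.
featureAnno <- data.frame(ID = paste0("cg", 1:4),
                          entrezID = c("101", "102", "102", "103"),
                          stringsAsFactors = FALSE)

## Toy pathway list: names are pathway IDs, elements are Entrez gene IDs.
pathlistDB <- list("GO:0000001" = c("101", "102"),
                   "GO:0000002" = c("102", "103", "104", "105"))

## Illustration only: keep blocks whose size lies within the two bounds.
restrictDown <- 2
restrictUp <- 3
blockSize <- lengths(pathlistDB)
keptBlocks <- pathlistDB[blockSize >= restrictDown & blockSize <= restrictUp]
names(keptBlocks)
```

Note that the lower bound must not exceed the upper bound, otherwise no block can satisfy both conditions.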

Details

In the supervised learning setting, stage-2 training data can be learned using either bootstrapping or cross validation resampling. Stage-2 test data is learned via independent test set prediction.
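
To make the two resampling schemes concrete, here is a minimal base R sketch of how bootstrap and cross validation indices can be drawn. It is an illustration under stated assumptions, not BioMM's internal code: the sample size, seed and variable names are hypothetical.

```r
set.seed(42)
n <- 20        # hypothetical number of training samples
nfolds <- 10   # same value as the default of the nfolds argument

## Bootstrapping ('BS'): sample with replacement; the samples never drawn
## (out-of-bag) can serve as the held-out set for that repeat.
inBag <- sample(n, size = n, replace = TRUE)
outOfBag <- setdiff(seq_len(n), inBag)

## Cross validation ('CV'): assign every sample to exactly one fold;
## each fold is held out once while the remaining folds are used for training.
foldId <- sample(rep(seq_len(nfolds), length.out = n))
heldOut <- split(seq_len(n), foldId)
```

Repeating either scheme several times with different random draws corresponds to the repeated resampling controlled by the repeat arguments.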

Value

The CV or BS prediction performance for the training data and, if testData is given, the test set prediction performance.

References

Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.

See Also

reconBySupervised; reconByUnsupervised; BioMMstage2pred

Examples

 
## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
## Annotation file
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')
probeAnno <- readRDS(file=probeAnnoFile)
## Model settings shared by both stages
supervisedStage1 <- TRUE
classifier <- 'randForest'
predMode <- 'classification'
paramlist <- list(ntree=300, nthreads=30)
## Parallel settings for feature selection (param1) and computation (param2)
library(BiocParallel)
library(ranger)
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run
## Note: pathlistDB must be defined before this call is run, and
## restrictUp must not be smaller than restrictDown.
## result <- BioMM(trainData=methylData, testData=NULL,
##                 pathlistDB, featureAnno=probeAnno,
##                 restrictUp=200, restrictDown=10, minPathSize=10,
##                 supervisedStage1, typePCA='regular',
##                 resample1='BS', resample2='CV', dataMode="allTrain",
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
##                 cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1,
##                 classifier, predMode, paramlist, innerCore=param2)

[Package BioMM version 1.4.0 Index]