BioMM {BioMM} | R Documentation |
End-to-end prediction by the BioMM framework, using either supervised or unsupervised learning at stage-1, followed by supervised learning at stage-2.
BioMM(trainData, testData, pathlistDB, featureAnno, restrictUp, restrictDown, minPathSize, supervisedStage1 = TRUE, typePCA, resample1 = "BS", resample2 = "CV", dataMode = "allTrain", repeatA1 = 100, repeatA2 = 1, repeatB1 = 20, repeatB2 = 1, nfolds = 10, FSmethod1, FSmethod2, cutP1, cutP2, fdr2, FScore = MulticoreParam(), classifier, predMode, paramlist, innerCore = MulticoreParam())
trainData |
The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate class membership. |
testData |
The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate class membership. |
pathlistDB |
A list of pathways with pathway IDs and their
corresponding genes ('entrezID' is used). This is only used for
pathway-based stratification (only |
featureAnno |
The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. If it's NULL, then the input probe is from the transcriptomic data. (Default: NULL) |
restrictUp |
The upper-bound of the number of probes or genes in each biological stratified block. |
restrictDown |
The lower-bound of the number of probes or genes in each biological stratified block. |
minPathSize |
The minimal defined pathway size after mapping your
own data to GO database. This is only used for
pathway-based stratification (only |
supervisedStage1 |
A logical value. If TRUE, then supervised learning models are applied; if FALSE, unsupervised learning. |
typePCA |
The type of PCA. Available options are c('regular', 'sparse'). |
resample1 |
The resampling method at stage-1. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping). The default is 'BS'. |
resample2 |
The resampling method at stage-2. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping). The default is 'CV'. |
dataMode |
The mode of the input training data used for model training. It is used only if 'testData' is present. Either a subset of the training data or the entire training data can be used: 'subTrain' selects a subset and 'allTrain' selects the entire training dataset. |
repeatA1 |
The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'. |
repeatA2 |
The number of repeats N used during resampling prediction. The default is 1 for 'CV'. |
repeatB1 |
The number of repeats N used for generating stage-2 test data prediction scores. The default is 20. |
repeatB2 |
The number of repeats N used for test data prediction. The default is 1. |
nfolds |
The number of folds used for cross validation. The default is 10. |
FSmethod1 |
Feature selection methods at stage-1. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox'). |
FSmethod2 |
Feature selection methods at stage-2. Features that are positively associated with the outcome will be used. |
cutP1 |
The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). |
cutP2 |
The cutoff used for p value thresholding at stage-2. |
fdr2 |
Multiple testing correction method at stage-2.
Available options are c(NULL, 'fdr', 'BH', 'holm', etc).
See also |
FScore |
The number of cores used for feature selection, given as a BiocParallel parameter object (default: MulticoreParam()). |
classifier |
Machine learning classifiers at both stages. |
predMode |
The prediction mode at both stages. Available options are c('probability', 'classification', 'regression'). |
paramlist |
A list of model parameters at both stages. |
innerCore |
The number of cores used for computation, given as a BiocParallel parameter object (default: MulticoreParam()). |
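The stage-2 feature selection arguments (FSmethod2, cutP2, fdr2) can be illustrated with a minimal sketch in base R. It assumes that the options listed for fdr2 correspond to methods accepted by stats::p.adjust and that positive association is measured with a one-sided correlation test; neither assumption is confirmed by this page, and the helper name selectPositive is hypothetical.

```r
## Hypothetical helper (not part of BioMM): keep the features that are
## positively associated with the label, thresholding p-values at cutP.
selectPositive <- function(x, y, cutP = 0.1, fdr = NULL) {
  ## One-sided correlation test for each feature column
  pvals <- apply(x, 2, function(col) {
    cor.test(col, y, alternative = "greater")$p.value
  })
  ## Assumption: 'fdr' names a method accepted by stats::p.adjust
  if (!is.null(fdr)) pvals <- p.adjust(pvals, method = fdr)
  names(pvals)[pvals < cutP]
}

set.seed(1)
y <- rep(c(0, 1), each = 20)
x <- cbind(up    = y + rnorm(40, sd = 0.3),   # positively associated
           down  = -y + rnorm(40, sd = 0.3),  # negatively associated
           noise = rnorm(40))                 # unrelated to the label
selected <- selectPositive(x, y, cutP = 0.05, fdr = "BH")
```

In this toy setting the feature named up passes the threshold and the feature named down does not, because only positive associations are tested.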
Stage-2 training data can be learned using either bootstrapping or cross validation resampling in the supervised learning setting. Stage-2 test data is learned via independent test set prediction.
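The role of resampling here can be sketched as follows: a stage-1 score for a training sample serves as stage-2 input only when that sample was held out from the model fit. The example below uses base R with logistic regression as a stand-in learner. The function names cvScores and bsScores are hypothetical, and this is an illustration of the idea, not BioMM's internal implementation.

```r
## Out-of-fold scores from k-fold cross validation ('CV')
cvScores <- function(dat, nfolds = 5) {
  fold <- sample(rep(seq_len(nfolds), length.out = nrow(dat)))
  score <- numeric(nrow(dat))
  for (k in seq_len(nfolds)) {
    held <- fold == k
    fit <- glm(label ~ ., data = dat[!held, ], family = binomial)
    score[held] <- predict(fit, newdata = dat[held, ], type = "response")
  }
  score
}

## Out-of-bag scores averaged over bootstrap repeats ('BS')
bsScores <- function(dat, repeats = 20) {
  n <- nrow(dat)
  total <- numeric(n)
  count <- numeric(n)
  for (r in seq_len(repeats)) {
    inBag <- sample(n, replace = TRUE)
    oob <- setdiff(seq_len(n), inBag)
    fit <- glm(label ~ ., data = dat[inBag, ], family = binomial)
    total[oob] <- total[oob] +
      predict(fit, newdata = dat[oob, ], type = "response")
    count[oob] <- count[oob] + 1
  }
  total / count  # NaN for a sample that was never out of bag
}

set.seed(42)
toy <- data.frame(label = rep(c(0, 1), each = 30))
toy$f1 <- toy$label + rnorm(60)  # informative feature
toy$f2 <- rnorm(60)              # uninformative feature
cv <- cvScores(toy)
bs <- bsScores(toy)
```

Both functions return one score per training sample, and every score comes from a model that did not see that sample during fitting.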
The CV or BS prediction performance for the training data, and the test set prediction performance if testData is given.
Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.
Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.
reconBySupervised; reconByUnsupervised; BioMMstage2pred
## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
## Annotation file
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')
probeAnno <- readRDS(file=probeAnnoFile)
## Model settings
supervisedStage1 <- TRUE
classifier <- 'randForest'
predMode <- 'classification'
paramlist <- list(ntree=300, nthreads=30)
## Parallel back-ends for feature selection (param1) and computation (param2)
library(BiocParallel)
library(ranger)
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run
## pathlistDB is not created in this example and must be supplied by the user.
## restrictUp is the upper bound and restrictDown the lower bound on block size.
## result <- BioMM(trainData=methylData, testData=NULL,
##                 pathlistDB=pathlistDB, featureAnno=probeAnno,
##                 restrictUp=200, restrictDown=10, minPathSize=10,
##                 supervisedStage1=supervisedStage1, typePCA='regular',
##                 resample1='BS', resample2='CV', dataMode="allTrain",
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
##                 cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1,
##                 classifier=classifier, predMode=predMode,
##                 paramlist=paramlist, innerCore=param2)
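The call above passes pathlistDB without defining it. As a rough illustration of the input shapes described in the argument list, the sketch below builds toy objects in base R. All probe IDs, Entrez IDs, and pathway IDs are arbitrary placeholders. The assumption that pathlistDB is a named list of character vectors keyed by pathway ID is inferred from the argument description and is not confirmed by this page.

```r
set.seed(7)
nSamples <- 8
probes <- paste0("cg", sprintf("%04d", 1:6))  # placeholder probe IDs

## Training data: label in the first column (0/1), one column per probe
toyTrain <- data.frame(label = rep(c(0, 1), each = nSamples / 2),
                       matrix(runif(nSamples * length(probes)),
                              nrow = nSamples,
                              dimnames = list(NULL, probes)))

## Probe annotation with the two required columns 'ID' and 'entrezID'
toyAnno <- data.frame(ID = probes,
                      entrezID = c("101", "101", "102", "103", "103", "104"),
                      stringsAsFactors = FALSE)

## Pathway list: pathway ID -> member genes (Entrez IDs)
toyPathlist <- list("GO:0000001" = c("101", "102"),
                    "GO:0000002" = c("103", "104"))

## Probes that fall into each pathway block via the annotation
blocks <- lapply(toyPathlist, function(genes) {
  toyAnno$ID[toyAnno$entrezID %in% genes]
})
```

Each element of blocks holds the probe IDs that map to one pathway, which is the kind of grouping that restrictUp, restrictDown, and minPathSize would then filter by size.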