This vignette explains how to specify non-default machine learning frameworks and their hyperparameters when applying Infinity Flow. It assumes that you have already read the basic usage vignette; if you are not familiar with that material, I suggest you look at it first.
This vignette will cover:
- the regression_functions argument
- the extra_args_regression_params argument

Here is a single R code chunk that recapitulates all of the data preparation covered in the basic usage vignette.
if(!require(devtools)){
install.packages("devtools")
}
if(!require(infinityFlow)){
library(devtools)
install_github("ebecht/infinityFlow")
}
library(infinityFlow)
data(steady_state_lung)
data(steady_state_lung_annotation)
data(steady_state_lung_backbone_specification)
dir <- file.path(tempdir(), "infinity_flow_example")
input_dir <- file.path(dir, "fcs")
write.flowSet(steady_state_lung, outdir = input_dir)
#> [1] "/tmp/RtmpUcsVgQ/infinity_flow_example/fcs"
write.csv(steady_state_lung_backbone_specification, file = file.path(dir, "backbone_selection_file.csv"), row.names = FALSE)
path_to_fcs <- file.path(dir, "fcs")
path_to_output <- file.path(dir, "output")
path_to_intermediary_results <- file.path(dir, "tmp")
backbone_selection_file <- file.path(dir, "backbone_selection_file.csv")
targets <- steady_state_lung_annotation$Infinity_target
names(targets) <- rownames(steady_state_lung_annotation)
isotypes <- steady_state_lung_annotation$Infinity_isotype
names(isotypes) <- rownames(steady_state_lung_annotation)
input_events_downsampling <- 1000
prediction_events_downsampling <- 500
cores = 1L
The infinity_flow() function, which encapsulates the complete Infinity Flow computational pipeline, uses two arguments to select regression models and their hyperparameters, respectively. Both arguments are lists and should have the same length. The first list, regression_functions, is a list of model templates (XGBoost, neural networks, SVMs…) to train, while the second specifies their hyperparameters. The templates are then fit to the data using parallel computing with socketing (using the parallel package through the pbapply package), which is more memory efficient.
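As a rough illustration of what socket-based parallelism looks like in R, here is a minimal sketch using only the parallel package. It is not the code used inside infinity_flow(); the function fit_one_well and the toy data are hypothetical stand-ins. pbapply::pblapply() accepts the same kind of cluster object through its cl argument.

```r
library(parallel)

# Hypothetical stand-in for "fit one model per well": a simple linear fit
fit_one_well <- function(well_data) {
    coef(lm(y ~ x, data = well_data))
}

# Toy per-well data sets, made up for illustration
set.seed(1)
wells <- lapply(1:4, function(i) {
    x <- rnorm(50)
    data.frame(x = x, y = 2 * x + rnorm(50, sd = 0.1))
})

# makeCluster() starts socket (PSOCK) workers; each one is a fresh R process
cl <- makeCluster(2L, type = "PSOCK")
fits <- parLapply(cl, wells, fit_one_well)
stopCluster(cl)

length(fits)  # one fitted coefficient vector per well
```

Because each socket worker is a separate R session, objects have to be serialized to be sent to the workers, which is relevant to the note on neural networks later in this vignette.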
regression_functions argument
This argument is a list of functions that specifies how many models to train per well and which ones. Each type of machine learning model is supported through a wrapper in the infinityFlow package, whose name has the form fitter_*. See below for the complete list:
print(grep("fitter_", ls("package:infinityFlow"), value = TRUE))
#> [1] "fitter_glmnet" "fitter_linear" "fitter_nn" "fitter_svm"
#> [5] "fitter_xgboost"
fitter_* function | Backend | Model type |
---|---|---|
fitter_xgboost | XGBoost | Gradient boosted trees |
fitter_nn | TensorFlow/Keras | Neural networks |
fitter_svm | e1071 | Support vector machines |
fitter_glmnet | glmnet | Generalized linear and polynomial models |
fitter_linear | stats | Linear and polynomial models |
These functions rely on optional package dependencies (so that you do not need to install e.g. Keras if you are not planning to use it). We therefore need to make sure that the dependencies of the models we want to use are installed:
optional_dependencies <- c("glmnetUtils", "e1071")
unmet_dependencies <- setdiff(optional_dependencies, rownames(installed.packages()))
if(length(unmet_dependencies) > 0){
install.packages(unmet_dependencies)
}
for(pkg in optional_dependencies){
library(pkg, character.only = TRUE)
}
In this vignette we will train all of these models except neural networks, whose syntax is shown at the end. Note that if you do this on your own data, it may take quite a bit of memory (remember that the output expression matrix will be a numeric matrix of size (prediction_events_downsampling x number of wells) rows x (number of wells x number of models) columns).
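To get a feel for this size, the arithmetic can be sketched as follows. The values are the ones that apply to this vignette's example (500 predicted events per well, 10 wells, 4 models). The estimate only counts the predicted columns stored as 8-byte doubles, so it is a lower bound rather than an exact figure.

```r
# Values for this vignette's example data
prediction_events_downsampling <- 500
n_wells <- 10
n_models <- 4

n_rows <- prediction_events_downsampling * n_wells  # one row per predicted event
n_cols <- n_wells * n_models                        # one column per target and model
bytes <- 8 * n_rows * n_cols                        # a double takes 8 bytes

c(rows = n_rows, cols = n_cols, megabytes = bytes / 1e6)
```

Both dimensions grow with the number of wells, so memory use grows quadratically with the size of the panel.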
To train multiple models, we create a list of these fitter_* functions and assign it to the regression_functions argument that will be passed to the infinity_flow function. The names of this list will be used to name your models.
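The chunk that builds this list does not appear in this text. A list consistent with the model names that show up in the pipeline output of this vignette (XGBoost, SVM, LASSO2, LM) would look like the following. The pairing of each name with a fitter is inferred from the hyperparameter comments, so treat it as a sketch.

```r
library(infinityFlow)

# The names of the list are used to label the models in the output
regression_functions <- list(
    XGBoost = fitter_xgboost, # gradient boosted trees
    # NN = fitter_nn,         # neural network, left out here (see the end of the vignette)
    SVM = fitter_svm,         # support vector machine
    LASSO2 = fitter_glmnet,   # L1-penalized second-degree polynomial model
    LM = fitter_linear        # linear model
)
names(regression_functions)
```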
extra_args_regression_params argument
This argument is a list of lists (so of the form list(list(...), list(...), etc.)) of length length(regression_functions). Each element of the extra_args_regression_params object is thus a list. This lower-level list is used to pass named arguments to the machine learning fitting function. The elements of extra_args_regression_params are matched with the machine learning models in regression_functions by position (e.g. the first regression model is matched with the first list of arguments, the second model with the second list, etc.).
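The positional matching can be illustrated with a small base R sketch. The two functions below are hypothetical stand-ins, not infinityFlow fitters, and this is not the package's internal code. The sketch only shows how a list of functions and a list of argument lists can be paired element by element.

```r
# Hypothetical stand-ins: each takes data plus its own named arguments
fit_mean <- function(x, trim = 0) mean(x, trim = trim)
fit_quantile <- function(x, probs = 0.5) unname(quantile(x, probs = probs))

toy_functions <- list(Mean = fit_mean, Quantile = fit_quantile)
toy_params <- list(
    list(trim = 0.2),   # passed to the first function
    list(probs = 0.9)   # passed to the second function
)
stopifnot(length(toy_functions) == length(toy_params))

x <- c(1, 2, 3, 4, 100)

# Map() walks both lists in parallel; do.call() splices in the named arguments
results <- Map(
    function(f, args) do.call(f, c(list(x), args)),
    toy_functions, toy_params
)
names(results)  # the names come from the list of functions
```

Swapping the order of one list without the other would silently send the wrong arguments to each function, which is why the two lists must be kept in the same order.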
backbone_size <- table(read.csv(backbone_selection_file)[,"type"])["backbone"]
extra_args_regression_params <- list(
## Passed to the first element of `regression_functions`, e.g. XGBoost. See ?xgboost for which parameters can be passed through this list
list(nrounds = 500, eta = 0.05),
# ## Passed to the second element of `regression_functions`, e.g. neural networks through keras::fit. See https://keras.rstudio.com/articles/tutorial_basic_regression.html
# list(
# object = { ## Specifies the network's architecture, loss function and optimization method
# model = keras_model_sequential()
# model %>%
# layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>%
# layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>%
# layer_dense(units = 1, activation = "linear")
# model %>%
# compile(loss = "mean_squared_error", optimizer = optimizer_sgd(lr = 0.005))
# serialize_model(model)
# },
# epochs = 1000, ## Number of maximum training epochs. The training is however stopped early if the loss on the validation set does not improve for 20 epochs. This early stopping is hardcoded in fitter_nn.
# validation_split = 0.2, ## Fraction of the training data used to monitor validation loss
# verbose = 0,
# batch_size = 128 ## Size of the minibatches for training.
# ),
# Passed to the second element of `regression_functions` here, SVMs (third if the neural network entry above is enabled). See help(svm, "e1071") for possible arguments
list(type = "nu-regression", cost = 8, nu=0.5, kernel="radial"),
# Passed to the third element here, fitter_glmnet. This should contain a mandatory argument `degree` which specifies the degree of the polynomial model (1 for linear, 2 for quadratic etc...). Here we use degree = 2, corresponding to our LASSO2 model. Other arguments are passed to getS3method("cv.glmnet", "formula").
list(alpha = 1, nfolds=10, degree = 2),
# Passed to the fourth element, fitter_linear. This only accepts a degree argument specifying the degree of the polynomial model. Here we use degree = 1 corresponding to a linear model.
list(degree = 1)
)
We can now run the pipeline with these custom arguments to train all the models.
if(length(regression_functions) != length(extra_args_regression_params)){
stop("Number of models and number of lists of hyperparameters mismatch")
}
imputed_data <- infinity_flow(
regression_functions = regression_functions,
extra_args_regression_params = extra_args_regression_params,
path_to_fcs = path_to_fcs,
path_to_output = path_to_output,
path_to_intermediary_results = path_to_intermediary_results,
backbone_selection_file = backbone_selection_file,
annotation = targets,
isotype = isotypes,
input_events_downsampling = input_events_downsampling,
prediction_events_downsampling = prediction_events_downsampling,
verbose = TRUE,
cores = cores
)
#> Using directories...
#> input: /tmp/RtmpUcsVgQ/infinity_flow_example/fcs
#> intermediary: /tmp/RtmpUcsVgQ/infinity_flow_example/tmp
#> subset: /tmp/RtmpUcsVgQ/infinity_flow_example/tmp/subsetted_fcs
#> rds: /tmp/RtmpUcsVgQ/infinity_flow_example/tmp/rds
#> annotation: /tmp/RtmpUcsVgQ/infinity_flow_example/tmp/annotation.csv
#> output: /tmp/RtmpUcsVgQ/infinity_flow_example/output
#> Parsing and subsampling input data
#> Downsampling to 1000 events per input file
#> Concatenating expression matrices
#> Writing to disk
#> Logicle-transforming the data
#> Backbone data
#> Exploratory data
#> Writing to disk
#> Transforming expression matrix
#> Writing to disk
#> Harmonizing backbone data
#> Scaling expression matrices
#> Writing to disk
#> Fitting regression models
#> Randomly selecting 50% of the subsetted input files to fit models
#> Fitting...
#> XGBoost
#>
#> 10.33553 seconds
#> SVM
#>
#> 1.43135 seconds
#> LASSO2
#>
#> 6.880951 seconds
#> LM
#>
#> 0.06642556 seconds
#> Imputing missing measurements
#> Randomly drawing events to predict from the test set
#> Imputing...
#> XGBoost
#>
#> 0.9312365 seconds
#> SVM
#>
#> 1.091354 seconds
#> LASSO2
#>
#> 1.330984 seconds
#> LM
#>
#> 0.04434729 seconds
#> Concatenating predictions
#> Writing to disk
#> Performing dimensionality reduction
#> 17:58:58 UMAP embedding parameters a = 1.262 b = 1.003
#> 17:58:58 Read 5000 rows and found 17 numeric columns
#> 17:58:58 Using Annoy for neighbor search, n_neighbors = 15
#> 17:58:58 Building Annoy index with metric = euclidean, n_trees = 50
#> 0% 10 20 30 40 50 60 70 80 90 100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 17:58:58 Writing NN index file to temp file /tmp/RtmpUcsVgQ/file10428f372a5f0e
#> 17:58:58 Searching Annoy index using 1 thread, search_k = 1500
#> 17:58:59 Annoy recall = 100%
#> 17:59:00 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 15
#> 17:59:01 Initializing from normalized Laplacian + noise (using irlba)
#> 17:59:01 Commencing optimization for 1000 epochs, with 102026 positive edges using 1 thread
#> 17:59:11 Optimization finished
#> Exporting results
#> Transforming predictions back to a linear scale
#> Exporting FCS files (1 per well)
#> Plotting
#> Chopping off the top and bottom 0.005 quantiles
#> Shuffling the order of cells (rows)
#> Producing plot
#> Background correcting
#> Transforming background-corrected predictions. (Use logarithm to visualize)
#> Exporting FCS files (1 per well)
#> Plotting
#> Chopping off the top and bottom 0.005 quantiles
#> Shuffling the order of cells (rows)
#> Producing plot
Our model names are appended to the predicted markers in the output. For more discussion about the outputs (including output files written to disk and plots), see the basic usage vignette.
print(imputed_data$bgc[1:2, ])
#> FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W CD69-CD301b
#> 1 49252.20 -0.01534063 -0.7885858 1780.66 -1.258372 -0.9259407 -0.01680446
#> 2 46171.35 -0.26199480 -0.7080096 1262.42 -1.817041 -1.0551946 1.32612064
#> Zombie MHCII CD4 CD44 CD8 CD11c CD11b
#> 1 -13.60832 -0.6109858 -0.01113398 -1.063708 -0.6791409 0.05259537 -0.03369963
#> 2 -16.52464 1.0220343 0.87153843 -1.973077 -0.1400648 -0.10762596 0.02589508
#> F480 Ly6C Lineage CD45a488 FJComp-PE(yg)-A CD24
#> 1 -0.8785054 0.7070034 0.01516564 0.3745576 0.8664429 -1.7605858
#> 2 -1.2185910 -0.8204737 0.19784643 0.2874405 0.6688780 0.6129032
#> CD103 Time CD137.LASSO2_bgc CD137.LM_bgc CD137.SVM_bgc
#> 1 0.2628921 3417.409 0.1227090 0.13621719 -0.2307604
#> 2 0.4301747 2575.407 0.2452072 -0.04235699 -0.8670138
#> CD137.XGBoost_bgc CD28.LASSO2_bgc CD28.LM_bgc CD28.SVM_bgc CD28.XGBoost_bgc
#> 1 0.48100450 -0.02952459 0.129479144 -0.2682753 -0.009571856
#> 2 -0.08628275 -0.07315344 0.009658425 -0.3916564 -0.146417911
#> CD49b(pan-NK).LASSO2_bgc CD49b(pan-NK).LM_bgc CD49b(pan-NK).SVM_bgc
#> 1 0.5846342 0.1421483 1.8906219
#> 2 -0.4982740 -0.3599298 -0.7831375
#> CD49b(pan-NK).XGBoost_bgc KLRG1.LASSO2_bgc KLRG1.LM_bgc KLRG1.SVM_bgc
#> 1 1.3462507 0.4379121 0.14821175 0.443576
#> 2 -0.6877002 -0.1051760 -0.06103081 -1.125597
#> KLRG1.XGBoost_bgc Ly-49c/F/I/H.LASSO2_bgc Ly-49c/F/I/H.LM_bgc
#> 1 -0.1942122 0.02414474 -0.03945548
#> 2 -0.3004749 -0.23876690 -0.34732020
#> Ly-49c/F/I/H.SVM_bgc Ly-49c/F/I/H.XGBoost_bgc Podoplanin.LASSO2_bgc
#> 1 0.2037049 0.01328134 -0.3406807
#> 2 -0.8222864 -0.22567020 -0.0195263
#> Podoplanin.LM_bgc Podoplanin.SVM_bgc Podoplanin.XGBoost_bgc SHIgG.LASSO2_bgc
#> 1 -0.1823438 -0.4522098 0.4705204 3.976281e-16
#> 2 0.2841068 0.3098248 0.3823015 1.339684e-15
#> SHIgG.LM_bgc SHIgG.SVM_bgc SHIgG.XGBoost_bgc SSEA-3.LASSO2_bgc SSEA-3.LM_bgc
#> 1 2.268219e-16 -8.940419e-17 2.790570e-15 0.04613137 0.04963487
#> 2 3.916539e-15 6.760505e-17 1.691506e-15 0.11885326 -0.20091633
#> SSEA-3.SVM_bgc SSEA-3.XGBoost_bgc TCR Vg3.LASSO2_bgc TCR Vg3.LM_bgc
#> 1 0.2456389 -0.07623579 -0.1712451 -0.14418617
#> 2 -0.6870845 -0.23302383 0.1849280 0.02204801
#> TCR Vg3.SVM_bgc TCR Vg3.XGBoost_bgc rIgM.LASSO2_bgc rIgM.LM_bgc
#> 1 -0.6747146 -0.4160073 5.408404e-16 2.980140e-17
#> 2 -0.3884910 0.2496065 3.995044e-15 -2.057125e-16
#> rIgM.SVM_bgc rIgM.XGBoost_bgc UMAP1 UMAP2 PE_id
#> 1 3.278154e-16 2.028152e-16 710.2420 641.39841 1
#> 2 -1.791809e-15 2.028152e-16 910.6396 46.79035 1
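Since every predicted column is named marker.model_bgc, the predictions of a single model can be pulled out by matching on the column names. The sketch below uses a small toy matrix as a stand-in for imputed_data$bgc, with a few column names copied from the output above, so that it runs on its own.

```r
# Toy stand-in for imputed_data$bgc (values are made up)
bgc <- matrix(
    seq_len(12), nrow = 2,
    dimnames = list(NULL, c(
        "CD69-CD301b", "UMAP1",
        "CD137.LASSO2_bgc", "CD137.XGBoost_bgc",
        "CD28.LASSO2_bgc", "CD28.XGBoost_bgc"
    ))
)

# Columns predicted by one model end in ".<model name>_bgc"
model <- "XGBoost"
suffix <- paste0(".", model, "_bgc")
xgb <- bgc[, endsWith(colnames(bgc), suffix), drop = FALSE]

# Strip the suffix to recover the marker names
colnames(xgb) <- sub(suffix, "", colnames(xgb), fixed = TRUE)
colnames(xgb)
```

Matching on the suffix rather than splitting on "." avoids problems with marker names that contain punctuation themselves.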
Neural networks won’t build in knitr for me but here is an example of the syntax if you want to use them.
Note: there is an issue with serialization of the neural networks and socketing since I updated to R-4.0.1. If you want to use neural networks, please make sure to set cores = 1L, as in the call to infinity_flow() below.
optional_dependencies <- c("keras", "tensorflow")
unmet_dependencies <- setdiff(optional_dependencies, rownames(installed.packages()))
if(length(unmet_dependencies) > 0){
install.packages(unmet_dependencies)
}
for(pkg in optional_dependencies){
library(pkg, character.only = TRUE)
}
invisible(eval(try(keras_model_sequential()))) ## avoids conflicts with flowCore...
if(!is_keras_available()){
install_keras() ## Install keras using the R interface - can take a while
}
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install("infinityFlow")
library(infinityFlow)
data(steady_state_lung)
data(steady_state_lung_annotation)
data(steady_state_lung_backbone_specification)
dir <- file.path(tempdir(), "infinity_flow_example")
input_dir <- file.path(dir, "fcs")
write.flowSet(steady_state_lung, outdir = input_dir)
write.csv(steady_state_lung_backbone_specification, file = file.path(dir, "backbone_selection_file.csv"), row.names = FALSE)
path_to_fcs <- file.path(dir, "fcs")
path_to_output <- file.path(dir, "output")
path_to_intermediary_results <- file.path(dir, "tmp")
backbone_selection_file <- file.path(dir, "backbone_selection_file.csv")
targets <- steady_state_lung_annotation$Infinity_target
names(targets) <- rownames(steady_state_lung_annotation)
isotypes <- steady_state_lung_annotation$Infinity_isotype
names(isotypes) <- rownames(steady_state_lung_annotation)
input_events_downsampling <- 1000
prediction_events_downsampling <- 500
## Passed to fitter_nn, e.g. neural networks through keras::fit. See https://keras.rstudio.com/articles/tutorial_basic_regression.html
regression_functions <- list(NN = fitter_nn)
backbone_size <- table(read.csv(backbone_selection_file)[,"type"])["backbone"]
extra_args_regression_params <- list(
list(
object = { ## Specifies the network's architecture, loss function and optimization method
model = keras_model_sequential()
model %>%
layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>%
layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>%
layer_dense(units = 1, activation = "linear")
model %>%
compile(loss = "mean_squared_error", optimizer = optimizer_sgd(lr = 0.005))
serialize_model(model)
},
epochs = 1000, ## Number of maximum training epochs. The training is however stopped early if the loss on the validation set does not improve for 20 epochs. This early stopping is hardcoded in fitter_nn.
validation_split = 0.2, ## Fraction of the training data used to monitor validation loss
verbose = 0,
batch_size = 128 ## Size of the minibatches for training.
)
)
imputed_data <- infinity_flow(
regression_functions = regression_functions,
extra_args_regression_params = extra_args_regression_params,
path_to_fcs = path_to_fcs,
path_to_output = path_to_output,
path_to_intermediary_results = path_to_intermediary_results,
backbone_selection_file = backbone_selection_file,
annotation = targets,
isotype = isotypes,
input_events_downsampling = input_events_downsampling,
prediction_events_downsampling = prediction_events_downsampling,
verbose = TRUE,
cores = 1L
)
Thank you for following this vignette. I hope you made it to the end without too much headache and that it was informative. General questions about proper usage of the package are best asked on the Bioconductor support site to maximize visibility for future users. If you encounter bugs, feel free to raise an issue on infinityFlow's GitHub.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> Random number generation:
#> RNG: L'Ecuyer-CMRG
#> Normal: Inversion
#> Sample: Rejection
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] e1071_1.7-16 glmnetUtils_1.1.9 infinityFlow_1.17.0
#> [4] flowCore_2.19.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3 class_7.3-22
#> [4] gtools_3.9.5 shape_1.4.6.1 lattice_0.22-6
#> [7] digest_0.6.37 evaluate_1.0.1 grid_4.5.0
#> [10] iterators_1.0.14 fastmap_1.2.0 xgboost_1.7.8.1
#> [13] foreach_1.5.2 jsonlite_1.8.9 Matrix_1.7-1
#> [16] glmnet_4.1-8 survival_3.7-0 pbapply_1.7-2
#> [19] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
#> [22] rlang_1.1.4 RProtoBufLib_2.19.0 Biobase_2.67.0
#> [25] RcppAnnoy_0.0.22 uwot_0.2.2 matlab_1.0.4.1
#> [28] splines_4.5.0 cachem_1.1.0 yaml_2.3.10
#> [31] cytolib_2.19.0 tools_4.5.0 raster_3.6-30
#> [34] parallel_4.5.0 BiocGenerics_0.53.0 R6_2.5.1
#> [37] png_0.1-8 proxy_0.4-27 matrixStats_1.4.1
#> [40] stats4_4.5.0 lifecycle_1.0.4 S4Vectors_0.45.0
#> [43] irlba_2.3.5.1 terra_1.7-83 bslib_0.8.0
#> [46] data.table_1.16.2 Rcpp_1.0.13 xfun_0.48
#> [49] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
#> [52] compiler_4.5.0 sp_2.1-4