This vignette contains a detailed tutorial on how to train a MOFA model using R. A concise template script can be found here. Many more examples on application of MOFA to various multi-omics data sets can be found here.
library(data.table)
library(MOFA2)
MOFA (and factor analysis models in general) are useful to uncover variation in complex data sets that contain multiple sources of heterogeneity. This requires a relatively large sample size (at least ~15 samples). In addition, MOFA needs the multi-modal measurements to be derived from the same samples. It is fine if you have samples that are missing some data modality, but there has to be a significant degree of matched measurements.
Proper normalisation of the data is critical. The model can handle three types of data: continuous (modelled with a gaussian likelihood), small counts (modelled with a Poisson likelihood) and binary measurements (modelled with a bernoulli likelihood). Non-gaussian likelihoods give non-optimal results, we recommend the user to apply data transformations to obtain continuous measurements. For example, for count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation + variance stabilisation (i.e. a log transformation).
It is strongly recommended that you select highly variable features (HVGs) per assay before fitting the model. This ensures a faster training and a more robust inference procedure. Also, for data modalities that have very different dimensionalities we suggest a stronger feature selection fort he bigger views, with the aim of reducing the feature imbalance between data modalities.
To create a MOFA object you need to specify three dimensions: samples, features and view(s). Optionally, a group can also be specified for each sample (no group structure by default). MOFA objects can be created from a wide range of input formats, including:
A list of matrices, where each entry corresponds to one view. Samples are stored in columns and features in rows.
Let’s simulate some data to start with
data <- make_example_data(
n_views = 2,
n_samples = 200,
n_features = 1000,
n_factors = 10
)[[1]]
lapply(data,dim)
## $view_1
## [1] 1000 200
##
## $view_2
## [1] 1000 200
Create the MOFA object:
MOFAobject <- create_mofa(data)
Plot the data overview
plot_data_overview(MOFAobject)
In case you are using the multi-group functionality, the groups can be specified
using the groups
argument as a vector with the group ID for each sample.
Keep in mind that the multi-group functionality is a rather advanced option that we
discourage for beginners. For more details on how the multi-group inference works,
read the FAQ section and
check this vignette.
N = ncol(data[[1]])
groups = c(rep("A",N/2), rep("B",N/2))
MOFAobject <- create_mofa(data, groups=groups)
Plot the data overview
plot_data_overview(MOFAobject)
A long data.frame with columns sample
, feature
, view
, group
(optional), value
might be the best format for complex data sets with multiple omics and potentially multiple groups of data. Also, there is no need to add rows that correspond to missing data:
filepath <- system.file("extdata", "test_data.RData", package = "MOFA2")
load(filepath)
head(dt)
## sample feature view value
## <char> <char> <char> <num>
## 1: sample_0_group_1 feature_0_view_0 view_0 2.08
## 2: sample_1_group_1 feature_0_view_0 view_0 0.01
## 3: sample_2_group_1 feature_0_view_0 view_0 -0.11
## 4: sample_3_group_1 feature_0_view_0 view_0 -0.82
## 5: sample_4_group_1 feature_0_view_0 view_0 -1.13
## 6: sample_5_group_1 feature_0_view_0 view_0 -0.25
Create the MOFA object
MOFAobject <- create_mofa(dt)
## Creating MOFA object from a data.frame...
print(MOFAobject)
## Untrained MOFA model with the following characteristics:
## Number of views: 2
## Views names: view_0 view_1
## Number of features (per view): 1000 1000
## Number of groups: 1
## Groups names: single_group
## Number of samples (per group): 100
##
Plot data overview
plot_data_overview(MOFAobject)
FALSE
FALSE
data_opts <- get_default_data_options(MOFAobject)
head(data_opts)
## $scale_views
## [1] FALSE
##
## $scale_groups
## [1] FALSE
##
## $center_groups
## [1] TRUE
##
## $use_float32
## [1] TRUE
##
## $views
## [1] "view_0" "view_1"
##
## $groups
## [1] "single_group"
FALSE
.TRUE
.TRUE
if using multiple groups.TRUE
if using multiple views.Only change the default model options if you are familiar with the underlying mathematical model.
model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 10
head(model_opts)
## $likelihoods
## view_0 view_1
## "gaussian" "gaussian"
##
## $num_factors
## [1] 10
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] FALSE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
train_opts <- get_default_training_options(MOFAobject)
head(train_opts)
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "fast"
##
## $drop_factor_threshold
## [1] -1
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 5
Prepare the MOFA object
MOFAobject <- prepare_mofa(
object = MOFAobject,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts
)
Train the MOFA model. Remember that in this step the MOFA2
R package connets with the
mofapy2
Python package using reticulate
. This is the source of most problems when
running MOFA. See our FAQ section
if you have issues. The output is saved in the file specified as outfile
. If none is specified, the output is saved in a temporary location.
outfile = file.path(tempdir(),"model.hdf5")
MOFAobject.trained <- run_mofa(MOFAobject, outfile, use_basilisk=TRUE)
## Connecting to the mofapy2 package using basilisk.
## Set 'use_basilisk' to FALSE if you prefer to manually set the python binary using 'reticulate'.
## + /var/cache/basilisk/1.16.0/0/bin/conda create --yes --prefix /var/cache/basilisk/1.16.0/MOFA2/1.14.0/mofa_env 'python=3.10.5' --quiet -c conda-forge
## + /var/cache/basilisk/1.16.0/0/bin/conda install --yes --prefix /var/cache/basilisk/1.16.0/MOFA2/1.14.0/mofa_env 'python=3.10.5' -c conda-forge
## + /var/cache/basilisk/1.16.0/0/bin/conda install --yes --prefix /var/cache/basilisk/1.16.0/MOFA2/1.14.0/mofa_env -c conda-forge 'python=3.10.5' 'python=3.10.5' 'numpy=1.23.1' 'scipy=1.8.1' 'pandas=1.4.3' 'h5py=3.6.0' 'scikit-learn=1.1.1' 'dtw-python=1.2.2'
##
## #########################################################
## ### __ __ ____ ______ ###
## ### | \/ |/ __ \| ____/\ _ ###
## ### | \ / | | | | |__ / \ _| |_ ###
## ### | |\/| | | | | __/ /\ \_ _| ###
## ### | | | | |__| | | / ____ \|_| ###
## ### |_| |_|\____/|_|/_/ \_\ ###
## ### ###
## #########################################################
##
##
##
## use_float32 set to True: replacing float64 arrays by float32 arrays to speed up computations...
##
## Successfully loaded view='view_0' group='single_group' with N=100 samples and D=1000 features...
## Successfully loaded view='view_1' group='single_group' with N=100 samples and D=1000 features...
##
##
## Model options:
## - Automatic Relevance Determination prior on the factors: False
## - Automatic Relevance Determination prior on the weights: True
## - Spike-and-slab prior on the factors: False
## - Spike-and-slab prior on the weights: False
## Likelihoods:
## - View 0 (view_0): gaussian
## - View 1 (view_1): gaussian
##
##
##
##
## ######################################
## ## Training the model with seed 42 ##
## ######################################
##
##
## ELBO before training: -1602213.76
##
## Iteration 1: time=0.02, ELBO=-281808.04, deltaELBO=1320405.719 (82.41133331%), Factors=10
## Iteration 2: time=0.02, Factors=10
## Iteration 3: time=0.02, Factors=10
## Iteration 4: time=0.02, Factors=10
## Iteration 5: time=0.02, Factors=10
## Iteration 6: time=0.02, ELBO=-237882.86, deltaELBO=43925.180 (2.74153059%), Factors=10
## Iteration 7: time=0.02, Factors=10
## Iteration 8: time=0.02, Factors=10
## Iteration 9: time=0.02, Factors=10
## Iteration 10: time=0.02, Factors=10
## Iteration 11: time=0.02, ELBO=-237719.36, deltaELBO=163.500 (0.01020466%), Factors=10
## Iteration 12: time=0.02, Factors=10
## Iteration 13: time=0.02, Factors=10
## Iteration 14: time=0.02, Factors=10
## Iteration 15: time=0.02, Factors=10
## Iteration 16: time=0.02, ELBO=-237643.63, deltaELBO=75.731 (0.00472667%), Factors=10
## Iteration 17: time=0.02, Factors=10
## Iteration 18: time=0.02, Factors=10
## Iteration 19: time=0.02, Factors=10
## Iteration 20: time=0.02, Factors=10
## Iteration 21: time=0.02, ELBO=-237581.46, deltaELBO=62.170 (0.00388025%), Factors=10
## Iteration 22: time=0.02, Factors=10
## Iteration 23: time=0.02, Factors=10
## Iteration 24: time=0.02, Factors=10
## Iteration 25: time=0.02, Factors=10
## Iteration 26: time=0.02, ELBO=-237526.91, deltaELBO=54.543 (0.00340423%), Factors=10
## Iteration 27: time=0.02, Factors=10
## Iteration 28: time=0.02, Factors=10
## Iteration 29: time=0.02, Factors=10
## Iteration 30: time=0.02, Factors=10
## Iteration 31: time=0.02, ELBO=-237478.43, deltaELBO=48.479 (0.00302578%), Factors=10
## Iteration 32: time=0.02, Factors=10
## Iteration 33: time=0.02, Factors=10
## Iteration 34: time=0.02, Factors=10
## Iteration 35: time=0.02, Factors=10
## Iteration 36: time=0.02, ELBO=-237434.89, deltaELBO=43.543 (0.00271765%), Factors=10
## Iteration 37: time=0.02, Factors=10
## Iteration 38: time=0.02, Factors=10
## Iteration 39: time=0.02, Factors=10
## Iteration 40: time=0.02, Factors=10
## Iteration 41: time=0.02, ELBO=-237395.74, deltaELBO=39.148 (0.00244336%), Factors=10
## Iteration 42: time=0.02, Factors=10
## Iteration 43: time=0.02, Factors=10
## Iteration 44: time=0.02, Factors=10
## Iteration 45: time=0.02, Factors=10
## Iteration 46: time=0.02, ELBO=-237360.34, deltaELBO=35.405 (0.00220976%), Factors=10
## Iteration 47: time=0.02, Factors=10
## Iteration 48: time=0.02, Factors=10
## Iteration 49: time=0.02, Factors=10
## Iteration 50: time=0.02, Factors=10
## Iteration 51: time=0.02, ELBO=-237328.27, deltaELBO=32.070 (0.00200164%), Factors=10
## Iteration 52: time=0.02, Factors=10
## Iteration 53: time=0.02, Factors=10
## Iteration 54: time=0.02, Factors=10
## Iteration 55: time=0.02, Factors=10
## Iteration 56: time=0.02, ELBO=-237299.18, deltaELBO=29.085 (0.00181528%), Factors=10
## Iteration 57: time=0.02, Factors=10
## Iteration 58: time=0.02, Factors=10
## Iteration 59: time=0.02, Factors=10
## Iteration 60: time=0.02, Factors=10
## Iteration 61: time=0.03, ELBO=-237272.57, deltaELBO=26.614 (0.00166108%), Factors=10
## Iteration 62: time=0.02, Factors=10
## Iteration 63: time=0.02, Factors=10
## Iteration 64: time=0.02, Factors=10
## Iteration 65: time=0.02, Factors=10
## Iteration 66: time=0.02, ELBO=-237248.15, deltaELBO=24.420 (0.00152416%), Factors=10
## Iteration 67: time=0.02, Factors=10
## Iteration 68: time=0.02, Factors=10
## Iteration 69: time=0.02, Factors=10
## Iteration 70: time=0.02, Factors=10
## Iteration 71: time=0.02, ELBO=-237225.70, deltaELBO=22.449 (0.00140113%), Factors=10
## Iteration 72: time=0.02, Factors=10
## Iteration 73: time=0.02, Factors=10
## Iteration 74: time=0.02, Factors=10
## Iteration 75: time=0.02, Factors=10
## Iteration 76: time=0.02, ELBO=-237204.91, deltaELBO=20.791 (0.00129767%), Factors=10
## Iteration 77: time=0.02, Factors=10
## Iteration 78: time=0.02, Factors=10
## Iteration 79: time=0.02, Factors=10
## Iteration 80: time=0.02, Factors=10
## Iteration 81: time=0.02, ELBO=-237185.61, deltaELBO=19.299 (0.00120451%), Factors=10
## Iteration 82: time=0.02, Factors=10
## Iteration 83: time=0.02, Factors=10
## Iteration 84: time=0.02, Factors=10
## Iteration 85: time=0.02, Factors=10
## Iteration 86: time=0.02, ELBO=-237167.58, deltaELBO=18.032 (0.00112541%), Factors=10
## Iteration 87: time=0.02, Factors=10
## Iteration 88: time=0.02, Factors=10
## Iteration 89: time=0.02, Factors=10
## Iteration 90: time=0.02, Factors=10
## Iteration 91: time=0.02, ELBO=-237150.64, deltaELBO=16.937 (0.00105708%), Factors=10
## Iteration 92: time=0.02, Factors=10
## Iteration 93: time=0.02, Factors=10
## Iteration 94: time=0.02, Factors=10
## Iteration 95: time=0.02, Factors=10
## Iteration 96: time=0.02, ELBO=-237134.73, deltaELBO=15.913 (0.00099316%), Factors=10
## Iteration 97: time=0.02, Factors=10
## Iteration 98: time=0.02, Factors=10
## Iteration 99: time=0.02, Factors=10
## Iteration 100: time=0.02, Factors=10
## Iteration 101: time=0.02, ELBO=-237119.66, deltaELBO=15.065 (0.00094029%), Factors=10
## Iteration 102: time=0.02, Factors=10
## Iteration 103: time=0.02, Factors=10
## Iteration 104: time=0.02, Factors=10
## Iteration 105: time=0.02, Factors=10
## Iteration 106: time=0.02, ELBO=-237105.46, deltaELBO=14.204 (0.00088652%), Factors=10
## Iteration 107: time=0.02, Factors=10
## Iteration 108: time=0.02, Factors=10
## Iteration 109: time=0.02, Factors=10
## Iteration 110: time=0.02, Factors=10
## Iteration 111: time=0.02, ELBO=-237091.91, deltaELBO=13.552 (0.00084585%), Factors=10
## Iteration 112: time=0.02, Factors=10
## Iteration 113: time=0.02, Factors=10
## Iteration 114: time=0.02, Factors=10
## Iteration 115: time=0.02, Factors=10
## Iteration 116: time=0.02, ELBO=-237079.07, deltaELBO=12.837 (0.00080120%), Factors=10
## Iteration 117: time=0.02, Factors=10
## Iteration 118: time=0.02, Factors=10
## Iteration 119: time=0.02, Factors=10
## Iteration 120: time=0.02, Factors=10
## Iteration 121: time=0.02, ELBO=-237066.81, deltaELBO=12.261 (0.00076526%), Factors=10
## Iteration 122: time=0.02, Factors=10
## Iteration 123: time=0.02, Factors=10
## Iteration 124: time=0.02, Factors=10
## Iteration 125: time=0.02, Factors=10
## Iteration 126: time=0.02, ELBO=-237055.06, deltaELBO=11.746 (0.00073311%), Factors=10
## Iteration 127: time=0.02, Factors=10
## Iteration 128: time=0.02, Factors=10
## Iteration 129: time=0.02, Factors=10
## Iteration 130: time=0.02, Factors=10
## Iteration 131: time=0.02, ELBO=-237043.82, deltaELBO=11.238 (0.00070142%), Factors=10
## Iteration 132: time=0.02, Factors=10
## Iteration 133: time=0.02, Factors=10
## Iteration 134: time=0.02, Factors=10
## Iteration 135: time=0.02, Factors=10
## Iteration 136: time=0.02, ELBO=-237033.03, deltaELBO=10.797 (0.00067389%), Factors=10
## Iteration 137: time=0.02, Factors=10
## Iteration 138: time=0.02, Factors=10
## Iteration 139: time=0.02, Factors=10
## Iteration 140: time=0.02, Factors=10
## Iteration 141: time=0.02, ELBO=-237022.63, deltaELBO=10.394 (0.00064876%), Factors=10
## Iteration 142: time=0.02, Factors=10
## Iteration 143: time=0.02, Factors=10
## Iteration 144: time=0.02, Factors=10
## Iteration 145: time=0.02, Factors=10
## Iteration 146: time=0.02, ELBO=-237012.62, deltaELBO=10.011 (0.00062484%), Factors=10
## Iteration 147: time=0.02, Factors=10
## Iteration 148: time=0.02, Factors=10
## Iteration 149: time=0.02, Factors=10
## Iteration 150: time=0.02, Factors=10
## Iteration 151: time=0.02, ELBO=-237002.98, deltaELBO=9.639 (0.00060161%), Factors=10
## Iteration 152: time=0.02, Factors=10
## Iteration 153: time=0.02, Factors=10
## Iteration 154: time=0.02, Factors=10
## Iteration 155: time=0.02, Factors=10
## Iteration 156: time=0.02, ELBO=-236993.68, deltaELBO=9.304 (0.00058067%), Factors=10
## Iteration 157: time=0.02, Factors=10
## Iteration 158: time=0.02, Factors=10
## Iteration 159: time=0.02, Factors=10
## Iteration 160: time=0.02, Factors=10
## Iteration 161: time=0.02, ELBO=-236984.68, deltaELBO=9.001 (0.00056179%), Factors=10
## Iteration 162: time=0.02, Factors=10
## Iteration 163: time=0.02, Factors=10
## Iteration 164: time=0.02, Factors=10
## Iteration 165: time=0.02, Factors=10
## Iteration 166: time=0.02, ELBO=-236975.94, deltaELBO=8.734 (0.00054515%), Factors=10
## Iteration 167: time=0.02, Factors=10
## Iteration 168: time=0.02, Factors=10
## Iteration 169: time=0.02, Factors=10
## Iteration 170: time=0.02, Factors=10
## Iteration 171: time=0.02, ELBO=-236967.59, deltaELBO=8.349 (0.00052108%), Factors=10
## Iteration 172: time=0.02, Factors=10
## Iteration 173: time=0.02, Factors=10
## Iteration 174: time=0.02, Factors=10
## Iteration 175: time=0.02, Factors=10
## Iteration 176: time=0.02, ELBO=-236959.51, deltaELBO=8.087 (0.00050473%), Factors=10
## Iteration 177: time=0.02, Factors=10
## Iteration 178: time=0.02, Factors=10
## Iteration 179: time=0.02, Factors=10
## Iteration 180: time=0.02, Factors=10
## Iteration 181: time=0.02, ELBO=-236951.67, deltaELBO=7.839 (0.00048925%), Factors=10
## Iteration 182: time=0.02, Factors=10
## Iteration 183: time=0.02, Factors=10
## Iteration 184: time=0.02, Factors=10
## Iteration 185: time=0.02, Factors=10
## Iteration 186: time=0.02, ELBO=-236944.18, deltaELBO=7.487 (0.00046726%), Factors=10
##
## Converged!
##
##
##
## #######################
## ## Training finished ##
## #######################
##
##
## Saving model in /tmp/RtmpqDu32I/model.hdf5...
If everything is successful, you should observe an output analogous to the following:
######################################
## Training the model with seed 1 ##
######################################
Iteration 1: time=0.03, ELBO=-52650.68, deltaELBO=837116.802 (94.082647669%), Factors=10
(...)
Iteration 9: time=0.04, ELBO=-50114.43, deltaELBO=23.907 (0.002686924%), Factors=10
#######################
## Training finished ##
#######################
Saving model in `/var/folders/.../model.hdf5.../tmp/RtmpqDu32I/model.hdf5.
This finishes the tutorial on how to train a MOFA object from R. To continue with the downstream analysis, follow this tutorial
Session Info
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.15.4 MOFA2_1.14.0 ggplot2_3.5.1 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.1
## [4] filelock_1.0.3 fastmap_1.1.1 GGally_2.2.1
## [7] digest_0.6.35 lifecycle_1.0.4 magrittr_2.0.3
## [10] compiler_4.4.0 rlang_1.1.3 sass_0.4.9
## [13] tools_4.4.0 utf8_1.2.4 yaml_2.3.8
## [16] corrplot_0.92 knitr_1.46 ggsignif_0.6.4
## [19] S4Arrays_1.4.0 labeling_0.4.3 reticulate_1.36.1
## [22] DelayedArray_0.30.0 plyr_1.8.9 RColorBrewer_1.1-3
## [25] abind_1.4-5 HDF5Array_1.32.0 Rtsne_0.17
## [28] withr_3.0.0 purrr_1.0.2 BiocGenerics_0.50.0
## [31] grid_4.4.0 stats4_4.4.0 fansi_1.0.6
## [34] ggpubr_0.6.0 colorspace_2.1-0 Rhdf5lib_1.26.0
## [37] scales_1.3.0 tinytex_0.50 cli_3.6.2
## [40] rmarkdown_2.26 crayon_1.5.2 generics_0.1.3
## [43] reshape2_1.4.4 cachem_1.0.8 rhdf5_2.48.0
## [46] stringr_1.5.1 splines_4.4.0 zlibbioc_1.50.0
## [49] parallel_4.4.0 BiocManager_1.30.22 XVector_0.44.0
## [52] matrixStats_1.3.0 basilisk_1.16.0 vctrs_0.6.5
## [55] Matrix_1.7-0 carData_3.0-5 jsonlite_1.8.8
## [58] dir.expiry_1.12.0 car_3.1-2 bookdown_0.39
## [61] IRanges_2.38.0 S4Vectors_0.42.0 ggrepel_0.9.5
## [64] rstatix_0.7.2 irlba_2.3.5.1 magick_2.8.3
## [67] tidyr_1.3.1 jquerylib_0.1.4 glue_1.7.0
## [70] codetools_0.2-20 ggstats_0.6.0 cowplot_1.1.3
## [73] uwot_0.2.2 RcppAnnoy_0.0.22 stringi_1.8.3
## [76] gtable_0.3.5 munsell_0.5.1 tibble_3.2.1
## [79] pillar_1.9.0 basilisk.utils_1.16.0 htmltools_0.5.8.1
## [82] rhdf5filters_1.16.0 R6_2.5.1 evaluate_0.23
## [85] lattice_0.22-6 highr_0.10 backports_1.4.1
## [88] png_0.1-8 pheatmap_1.0.12 broom_1.0.5
## [91] bslib_0.7.0 Rcpp_1.0.12 nlme_3.1-164
## [94] SparseArray_1.4.0 mgcv_1.9-1 xfun_0.43
## [97] MatrixGenerics_1.16.0 forcats_1.0.0 pkgconfig_2.0.3