mfa is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling.
mfa can be installed in one of two ways:
source("https://bioconductor.org/biocLite.R")
biocLite("mfa")
library(mfa)
Alternatively, mfa can be installed from GitHub. This requires the devtools package to be installed first:
install.packages("devtools") # If not already installed
devtools::install_github("kieranrcampbell/mfa")
library(mfa)
We first create some synthetic data for 100 cells and 40 genes by calling the mfa function create_synthetic. This returns a list with gene expression, pseudotime, branch allocation, and various parameter estimates:
synth <- create_synthetic(C = 100, G = 40)
print(str(synth))
## List of 7
## $ X : num [1:100, 1:40] 6.028 6.369 0 1.874 0.603 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
## .. ..$ : chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
## $ branch : int [1:100] 0 1 0 0 0 0 0 1 0 1 ...
## $ pst : num [1:100] 0.177 0.196 0.935 0.961 0.408 ...
## $ k : num [1:40, 1:2] -9.12 6.95 7.42 6.4 -9.16 ...
## $ phi : num [1:40, 1:2] 8.19 9.63 7.05 6.13 8.05 ...
## $ delta : num [1:40, 1:2] 0.119 0.18 0.127 0.422 0.327 ...
## $ p_transient: num 0
## NULL
We can then run PCA on the expression matrix and put the first two components into a tidy format:
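library(dplyr)    # provides %>%, mutate and as_data_frame
library(ggplot2)  # used for the plots below
df_synth <- as_data_frame(prcomp(synth$X)$x[, 1:2]) %>%
  mutate(pseudotime = synth$pst,
         branch = factor(synth$branch))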
and have a look at a PCA representation, coloured by both pseudotime and branch allocation:
ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point()
ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()
mfa
The input to mfa is either an ExpressionSet (e.g. from using the package scater) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression.
We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time:
m <- mfa(synth$X)
print(m)
## MFA fit with
## 100 cells and 40 genes
## ( 2000 iterations )
Particular care must be taken over the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.
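One way to decide which component to use is to check which principal component tracks a known marker gene most closely. The following is a minimal sketch on simulated stand-in data, not part of the mfa package: the matrix X, the column name marker and the variable best_pc are hypothetical, and it assumes pc_initialise accepts the index of the chosen component.

```r
# Stand-in data: in practice X would be the cell-by-gene expression matrix
# passed to mfa, and "marker" a gene known to change along the trajectory.
set.seed(1)
n_cells <- 100
time <- runif(n_cells)  # hypothetical latent ordering
X <- cbind(
  marker = 5 * time + rnorm(n_cells, sd = 0.3),
  matrix(rnorm(n_cells * 9), n_cells, 9,
         dimnames = list(NULL, paste0("gene", 1:9)))
)

# Absolute correlation of the marker gene with each of the first three PCs
pcs <- prcomp(X)$x[, 1:3]
marker_cor <- abs(cor(X[, "marker"], pcs))
best_pc <- which.max(marker_cor)
print(round(marker_cor, 2))
print(best_pc)

# The chosen component could then be passed on (assumed usage):
# m <- mfa(X, pc_initialise = best_pc)
```

With several marker genes, the same correlation could be computed per gene and the component that is most consistently correlated chosen.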
As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. Cowles and Carlin (1996) for a full discussion).
As a quick check, mfa provides two functions, plot_mfa_trace and plot_mfa_autocorr, which plot the trace and autocorrelation of the posterior log-likelihood:
plot_mfa_trace(m)
plot_mfa_autocorr(m)
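The same diagnostics can be summarised numerically. The sketch below uses only base R and a simulated autocorrelated series as a stand-in for the log-likelihood trace (m$traces$lp_trace in a real fit); the truncation rule for the effective sample size is one simple choice among several, not something mfa computes itself.

```r
# Stand-in for m$traces$lp_trace: an autocorrelated AR(1) series
set.seed(42)
lp <- as.numeric(arima.sim(model = list(ar = 0.5), n = 1000))

# Autocorrelation at lags 1..50 (lag 0 dropped)
rho <- acf(lp, lag.max = 50, plot = FALSE)$acf[-1]

# Sum autocorrelations up to the first non-positive lag, then use
# ESS = n / (1 + 2 * sum(rho)) as a crude effective sample size
cutoff <- which(rho <= 0)[1]
if (is.na(cutoff)) cutoff <- length(rho) + 1
rho_pos <- rho[seq_len(cutoff - 1)]
ess <- length(lp) / (1 + 2 * sum(rho_pos))

# Difference in means between the first and second halves of the chain;
# a large shift relative to the spread suggests the chain is still drifting
half <- length(lp) %/% 2
mean_shift <- mean(lp[1:half]) - mean(lp[(half + 1):length(lp)])

print(round(c(lag1 = rho[1], ess = ess, mean_shift = mean_shift), 2))
```

An effective sample size far below the number of retained iterations indicates strong autocorrelation, in which case more iterations or more thinning may be needed.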
We can extract posterior point estimates along with credible intervals using the summary function:
ms <- summary(m)
print(head(ms))
## # A tibble: 6 x 5
## pseudotime branch branch_certainty pseudotime_lower pseudotime_upper
## <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 -0.54376885 2 1 -0.8606464 -0.3808977
## 2 -0.62685025 2 1 -0.8779826 -0.4015615
## 3 1.03879308 1 1 0.8334416 1.3256056
## 4 1.03366282 1 1 0.8403877 1.3320335
## 5 -0.08539131 1 1 -0.3965906 0.1127942
## 6 0.85058607 1 1 0.6117504 1.0773458
This has five columns:
pseudotime: the MAP pseudotime estimate
branch: the MAP branch estimate
branch_certainty: the proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch
pseudotime_lower and pseudotime_upper: the lower and upper bounds of the 95% highest-probability-density posterior credible interval
We can compare the inferred pseudotimes to the true values:
qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) +
xlab('True pseudotime') + ylab('Inferred pseudotime') +
scale_color_discrete(name = 'True\nbranch')
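The agreement seen in such a plot can also be put into a single number. The following sketch uses simulated stand-ins for synth$pst and ms$pseudotime. It assumes that an inferred trajectory may run in the opposite direction to the true one, so the absolute correlation is reported.

```r
set.seed(7)
true_pst <- runif(100)                            # stand-in for synth$pst
inferred <- -2 * true_pst + rnorm(100, sd = 0.2)  # stand-in for ms$pseudotime, direction flipped

# Pearson measures linear agreement, Spearman agreement in cell ordering
pearson <- cor(true_pst, inferred)
spearman <- cor(true_pst, inferred, method = "spearman")
print(round(abs(c(pearson = pearson, spearman = spearman)), 2))
```

The rank-based Spearman correlation is often the more relevant of the two, since pseudotime is mainly used to order cells.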
And we can equivalently plot the PCA representation coloured by MAP branch:
mutate(df_synth, inferred_branch = ms[['branch']]) %>%
ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) +
geom_point() +
scale_color_discrete(name = 'Inferred\nbranch')
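To make the branch and branch_certainty columns of the summary concrete, the sketch below computes the same two quantities from a hypothetical iteration-by-cell matrix of sampled branch assignments. The matrix branch_trace is simulated here and is only an illustration of the definition, not a description of how mfa stores its traces.

```r
# Hypothetical branch-assignment trace: 1000 retained iterations x 5 cells,
# each entry is the branch (1 or 2) sampled for that cell at that iteration
set.seed(3)
p_branch1 <- c(0.99, 0.9, 0.6, 0.2, 0.01)
branch_trace <- sapply(p_branch1, function(p)
  sample(1:2, size = 1000, replace = TRUE, prob = c(p, 1 - p)))

# Count how often each branch was sampled per cell (2 x 5 matrix)
branch_counts <- apply(branch_trace, 2, tabulate, nbins = 2)

# MAP branch = most frequently sampled branch;
# certainty = fraction of iterations assigned to that branch
map_branch <- apply(branch_counts, 2, which.max)
certainty <- apply(branch_counts, 2, max) / nrow(branch_trace)

print(data.frame(map_branch, certainty))
```

Cells with a certainty close to 0.5 are those the model cannot confidently place on either branch, which is expected for cells before the bifurcation.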
A unique part of this model is that through an ARD-like prior structure on the loading matrices we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the plot_chi function, where larger values of inverse-chi imply the gene is associated with the bifurcation:
plot_chi(m)
To calculate the MAP values for chi we can call the calculate_chi function, which returns a data_frame with the feature names and values:
posterior_chi_df <- calculate_chi(m)
head(posterior_chi_df)
## # A tibble: 6 x 2
## feature chi_map
## <chr> <dbl>
## 1 feature1 0.2059126
## 2 feature2 0.2916703
## 3 feature3 0.4447259
## 4 feature4 0.7840491
## 5 feature5 0.7549322
## 6 feature6 0.9805756
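Since larger values of inverse-chi indicate association with the bifurcation, the features can be ranked by 1 / chi_map. A minimal sketch using the six values printed above follows; chi_df and inverse_chi are names introduced here for illustration, and in practice the data frame returned by calculate_chi would be used directly.

```r
# chi_map values copied from the head(posterior_chi_df) output above
chi_df <- data.frame(
  feature = paste0("feature", 1:6),
  chi_map = c(0.2059126, 0.2916703, 0.4447259, 0.7840491, 0.7549322, 0.9805756),
  stringsAsFactors = FALSE
)

# Larger inverse-chi means stronger evidence of branch-specific behaviour
chi_df$inverse_chi <- 1 / chi_df$chi_map
ranked <- chi_df[order(chi_df$inverse_chi, decreasing = TRUE), ]
print(ranked)
```

Among these six features, feature1 has the smallest chi_map and therefore ranks first.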
mfa class
A call to mfa(...) returns an mfa object that contains all the information about the dataset and the MCMC inference performed. Note that it does not contain a copy of the original data. We can see the structure by calling str on an mfa object:
str(m, max.level = 1)
## List of 10
## $ traces :List of 10
## $ iter : num 2000
## $ thin : num 1
## $ burn : num 1000
## $ b : num 2
## $ collapse : logi FALSE
## $ N : int 100
## $ G : int 40
## $ feature_names: chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
## $ cell_names : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
## - attr(*, "class")= chr "mfa"
This contains the following slots:
traces - the raw MCMC traces (discussed in the following section)
iter - the number of MCMC iterations
thin - the thinning of the MCMC chain
burn - the number of MCMC iterations thrown away as burn-in
b - the number of branches modelled
collapse - whether collapsed Gibbs sampling was implemented
N - the number of cells
G - the number of features (e.g. genes)
feature_names - the names of the features (e.g. genes)
cell_names - the names of the cells
MCMC traces can be accessed through the traces slot of an mfa object. This gives a list with an element for each variable, along with the log-likelihood:
print(names(m$traces))
## [1] "tau_trace" "gamma_trace" "pst_trace"
## [4] "theta_trace" "lambda_theta_trace" "chi_trace"
## [7] "eta_trace" "k_trace" "c_trace"
## [10] "lp_trace"
For non-branch-specific variables this is simply a matrix. For example, the trace for the variable \(\tau\) is just an iteration-by-gene matrix:
str(m$traces$tau_trace)
## num [1:1000, 1:40] 7.2 8.2 8.8 7.02 9.04 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:40] "tau[1]" "tau[2]" "tau[3]" "tau[4]" ...
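The 1000 rows of this trace are consistent with the iter, burn and thin values stored in the fitted object. The short calculation below makes this explicit; the formula (iter - burn) / thin is an assumption that matches the output shown, not a statement taken from the mfa source code.

```r
# Values as reported by str(m, max.level = 1)
iter <- 2000
burn <- 1000
thin <- 1

# Assumed rule: discard the burn-in, then keep every thin-th iteration
retained <- seq(burn + 1, iter, by = thin)
n_retained <- (iter - burn) / thin
print(c(n_retained = n_retained, first = retained[1], last = retained[length(retained)]))
```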
We can easily get the posterior mean by calling colMeans. More sophisticated posterior summaries can be computed using the MCMCglmm package, such as posterior.mode(...) for MAP estimation (though in practice this is often similar to the posterior mean). We can estimate posterior intervals using the HPDinterval(...) function from the coda package (note that traces must be converted to coda mcmc objects before calling either of these).
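For readers who want to see what such an interval computation involves without installing extra packages, here is a base-R sketch on a simulated iteration-by-gene trace. The function hpd_interval is a hypothetical helper written for this example: it finds the shortest window containing roughly 95% of the sorted samples, which is the usual empirical definition of a highest-posterior-density interval.

```r
set.seed(11)
# Stand-in trace: 1000 iterations x 3 genes with means 7, 8 and 9
trace <- matrix(rnorm(3000, mean = rep(c(7, 8, 9), each = 1000)), ncol = 3,
                dimnames = list(NULL, paste0("tau[", 1:3, "]")))

# Hypothetical helper: shortest interval covering about `prob` of the samples
hpd_interval <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  gap <- max(1, min(n - 1, round(n * prob)))   # index distance spanned by the interval
  starts <- seq_len(n - gap)
  i <- which.min(x[starts + gap] - x[starts])  # start of the shortest such window
  c(lower = x[i], upper = x[i + gap])
}

post_mean <- colMeans(trace)                    # posterior mean per gene
post_hpd <- t(apply(trace, 2, hpd_interval))    # one (lower, upper) row per gene
print(cbind(mean = post_mean, post_hpd))
```

For real analyses the coda and MCMCglmm functions mentioned above are the better-tested option; this sketch only illustrates the idea.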
Some variables are branch dependent, meaning the traces returned are arrays (or tensors in fashionable speak) that have dimension iteration x gene x branch. An example is the \(k\) variable:
str(m$traces$k_trace)
## num [1:1000, 1:40, 1:2] -0.285 -0.388 -0.301 -0.442 -0.494 ...
To get posterior means (or modes, or intervals) we then need to use the apply function to iterate over the branches. To find the posterior means of k, we then call:
pmean_k <- apply(m$traces$k_trace, 3, colMeans)
str(pmean_k)
## num [1:40, 1:2] -0.41 0.706 0.664 1.174 -0.995 ...
This returns a gene-by-branch matrix of posterior estimates.
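The same pattern extends to interval estimates. The sketch below uses a simulated array as a stand-in for m$traces$k_trace and computes equal-tailed 95% intervals with quantile. This is a simpler alternative to highest-posterior-density intervals and is used here only for illustration.

```r
set.seed(5)
# Stand-in for m$traces$k_trace: 1000 iterations x 4 genes x 2 branches
k_trace <- array(rnorm(1000 * 4 * 2), dim = c(1000, 4, 2))

# Margins c(2, 3) keep the gene and branch dimensions and summarise over iterations
k_lower <- apply(k_trace, c(2, 3), quantile, probs = 0.025)
k_upper <- apply(k_trace, c(2, 3), quantile, probs = 0.975)
k_mean  <- apply(k_trace, c(2, 3), mean)

# The mean computed this way matches the branch-wise colMeans call
stopifnot(isTRUE(all.equal(k_mean, apply(k_trace, 3, colMeans))))
print(dim(k_lower))
```

Each result is again a gene-by-branch matrix, so the lower bound, mean and upper bound for a given gene and branch sit at the same position in the three matrices.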
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 dplyr_0.7.4 ggplot2_2.2.1 mfa_1.0.0
## [5] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.2 purrr_0.2.4 lattice_0.20-35
## [4] colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.14
## [7] MCMCpack_1.4-0 rlang_0.1.2 glue_1.2.0
## [10] BiocGenerics_0.24.0 RColorBrewer_1.1-2 plyr_1.8.4
## [13] bindr_0.1 stringr_1.2.0 MatrixModels_0.4-1
## [16] munsell_0.4.3 gtable_0.2.0 codetools_0.2-15
## [19] coda_0.19-1 evaluate_0.10.1 labeling_0.3
## [22] Biobase_2.38.0 knitr_1.17 GGally_1.3.2
## [25] SparseM_1.77 quantreg_5.34 parallel_3.4.2
## [28] Rcpp_0.12.13 corpcor_1.6.9 scales_0.5.0
## [31] backports_1.1.1 ggmcmc_1.1 MCMCglmm_2.25
## [34] mcmc_0.9-5 tensorA_0.36 digest_0.6.12
## [37] stringi_1.1.5 bookdown_0.5 grid_3.4.2
## [40] rprojroot_1.2 tools_3.4.2 magrittr_1.5
## [43] lazyeval_0.2.1 tibble_1.3.4 ape_5.0
## [46] tidyr_0.7.2 pkgconfig_2.0.1 MASS_7.3-47
## [49] Matrix_1.2-11 assertthat_0.2.0 rmarkdown_1.6
## [52] reshape_0.8.7 cubature_1.3-11 R6_2.2.2
## [55] nlme_3.1-131 compiler_3.4.2
Cowles, Mary Kathryn, and Bradley P Carlin. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91 (434). Taylor & Francis: 883–904.