Melissa 1.2.0
## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Melissa")
## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)
Measurements of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Melissa (MEthyLation Inference for Single cell Analysis) [1], is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq).
Melissa addresses the data sparsity issue by leveraging local correlations between neighbouring CpGs and similarity between individual cells (see Fig.1). The starting point is the definition of a set of genomic regions (e.g. genes or enhancers). Within each region, Melissa postulates a latent profile of methylation, a function mapping each CpG within the region to a number in \([0,1]\) which defines the probability of that CpG being methylated. To ensure spatial smoothness of the profile, Melissa uses a generalised linear model of basis function regression along the lines of [2,3] (with modified likelihood to account for single cell data). Local correlations are however often insufficient for regions with extremely sparse coverage, and these are quite common in scBS-seq data. Therefore, we share information across different cells by coupling the local GLM regressions through a shared prior distribution. In order to respect the (generally unknown) population structure that may be present within the cells assayed, we choose a (finite) Dirichlet mixture model prior.
For reading, processing and filtering raw
scBS-seq data consult the vignette “Process and filter scBS-seq data”, which provides a step by step tutorial on obtaining a melissa_data_obj
object, which will be the input to the variational inference machinery for Melissa.
To create a minimal working example, Melissa comes with already generated synthetic data of \(N = 200\) cells and \(M = 100\) genomic regions, consisting of \(K = 4\) cell subpopulations.
suppressPackageStartupMessages(library(Melissa)) # Load package
dt_obj <- melissa_encode_dt # Load synthetic data
The structure of the dt_obj
is a list of three elements, which only the met
element is of direct interest, the remaining ones are mostly for keeping metadata information and information about the data generating / processing procedure.
# Elements of `dt_obj` object
names(dt_obj)
## [1] "met" "anno_region" "opts"
From the \(2^{nd}\) cell we can access the \(50^{th}\) genomic region using the following code, where the 1st column represents the relative CpG location within the region (scaled to [-1, 1] interval) and the 2nd column represents the methylation state: 1 = methylated and 0 = unmethylated.
head(dt_obj$met[[2]][[50]])
## [,1] [,2]
## [1,] -0.998 0
## [2,] -0.985 1
## [3,] -0.982 0
## [4,] -0.757 1
## [5,] -0.697 1
## [6,] -0.668 1
# Number of cells
cat("Number of cells: ", length(dt_obj$met))
## Number of cells: 200
# Number of genomic regions in each cell
cat("Number of genomic regions: ", length(dt_obj$met[[1]]) )
## Number of genomic regions: 100
For each genomic region we infer a methylation profile using a GLM of basis function regression along the lines of [1,2]. Again we use the functionality of BPRMeth to create a basis object, in our case it will be a Radial Basis Function (RBF), however, one can create polynomial and Fourier basis functions which can be created with the create_polynomial_object
and create_fourier_object
functions, respectively.
library(BPRMeth)
# Create RBF basis object with 4 RBFs
basis_obj <- create_rbf_object(M = 4)
The rbf
object contains information such as the centre locations \(\mu_{j}\) and the value of the spatial scale parameter \(\gamma\)
# Show the slots of the 'rbf' object
basis_obj
## $M
## [1] 4
##
## $mus
## [1] -0.6 -0.2 0.2 0.6
##
## $gamma
## [1] 4
##
## $eq_spaced_mus
## [1] TRUE
##
## $whole_region
## [1] TRUE
##
## attr(,"class")
## [1] "rbf"
Now we are ready to perfrom inference using the Melissa model and jointly cluster cells based on their DNA methylation landscape and impute missing CpG methylation states.
To be able to evaluate the imputation performance of Melissa, we need ground truth labels of methylation states. To do so, we can partition the original dataset to training and test set, where a subset of CpGs will be used for training and the remaining sites for testing. For instance in a genomic region with 50 CpGs, 35 will be used for training and the remaining will be used for testing. We should highlight that the partitioning is at the genomic region level and not at the cell level, so we do not have to impute the methylome of a whole cell (which is not practically useful from the first place).
Note that this process is optional and is required only for being able to evaluate how well Melissa performs imputation. The user can skip the partitioning step and directly run Melissa on the whole dataset. For real scBS-seq data the user can use the inferred profiles to impute CpGs with no coverage at all.
set.seed(15)
# Partition to training and test set
dt_obj <- partition_dataset(dt_obj = dt_obj, data_train_prcg = 0.2,
region_train_prcg = 1, cpg_train_prcg = 0.4,
is_synth = TRUE)
This code snippet, will update the met
element of the dt_obj
to retain only the CpGs used as training set and the test set will be now stored in the met_test
element in the same object. Note that during inference, Melissa will only have access to the training data and will ignore totally the test set. For more details on the usage of the additional parameters type ?partition_dataset
.
Whether or not a subset of the data was used as test set, the command for running Melissa remains the same. We need to provide it with the (training) data X
, the number of clusters K
that we expect to find in the cell population, the basis function object and with initial values for the hyperparameters. In this example, we will mostly keep the default values for the (hyper)-parameters.
set.seed(15)
# Run Melissa with K = 4 clusters
melissa_obj <- melissa(X = dt_obj$met, K = 4, basis = basis_obj,
vb_max_iter = 20, vb_init_nstart = 1, is_parallel = FALSE)
##
|
| | 0%
|
|==== | 6%
|
|======= | 11%
|
|=========== | 17%
|
|=============== | 22%
|
|=================== | 28%
|
|====================== | 33%
|
|========================== | 39%
|
|============================== | 44%
|
|================================== | 50%
|
|===================================== | 56%
|
|========================================= | 61%
|
|============================================= | 67%
|
|================================================ | 72%
|
|==================================================== | 78%
|
|======================================================== | 83%
|
|============================================================ | 89%
|
|=============================================================== | 94%
|
|===================================================================| 100%
We can check the mixing proportions for each cluster, to obtain an estimate of the proportion of cells that are assigned to each cell subpopulation.
melissa_obj$pi_k
## cluster1 cluster2 cluster3 cluster4
## 0.1559409 0.4232673 0.2450502 0.1757436
The posterior probabilities of each cell belonging to a certain cluster (responsibilities) are stored in the r_nk
element. Here each column represents a different cluster and each row a different cell.
head(melissa_obj$r_nk)
## cluster1 cluster2 cluster3 cluster4
## [1,] 1.005152e-228 1.000000e+00 2.292296e-169 3.328830e-175
## [2,] 1.923678e-225 1.000000e+00 2.961327e-176 7.823470e-167
## [3,] 4.816203e-255 1.000000e+00 6.470452e-156 2.263771e-174
## [4,] 8.372995e-240 1.000000e+00 3.419202e-135 1.298840e-190
## [5,] 2.813925e-239 1.000000e+00 1.001113e-156 8.391105e-194
## [6,] 4.382205e-183 5.945496e-166 4.085146e-199 1.000000e+00
The posterior mean methylation profile of each genomic region and cell subtype is stored in the W
element. We can access the posterior weights of the 10th genomic region for the 2nd cell subpopulation. Here each element of the vector corresponds to a basis function, except the first element which corresponds to the bias term.
melissa_obj$W[10, , 3]
## [1] -1.786867 0.742057 -1.953177 1.153712 2.835504
We can plot methylation profiles of each cell subtype for specific genomic regions using the plot_melissa_profiles
function.
# Plot profiles from all cell subtypes for genomic region 22
plot_melissa_profiles(melissa_obj = melissa_obj, region = 22,
title = "Methylation profiles for region 22")
# Plot profiles from all cell subtypes for genomic region 77
plot_melissa_profiles(melissa_obj = melissa_obj, region = 77,
title = "Methylation profiles for region 77")
Since these are synthetic generated data, we have the ground truth labels for the assignment of each cell to the corresponding cluster, which is stored in dt_obj$opts$C_true
. Let’s now evaluate the clustering performance both in terms of Adjusted Rand Index (ARI) and clustering assignment error metrics.
# Run clustering performance
melissa_obj <- eval_cluster_performance(melissa_obj, dt_obj$opts$C_true)
# ARI metric
cat("ARI: ", melissa_obj$clustering$ari)
## ARI: 1
# Clustering assignment error metric
cat("Clustering assignment error: ", melissa_obj$clustering$error)
## Clustering assignment error: 0
As we can observe Melissa clustered perfectly the synthetic data, which are also clearly separable from the example methylation profiles shown above.
To evaluate the imputation performance of Melissa, we use the held out test set and compare the true methylation state of each CpG with the predicted methylation value, which is the evaluation of the latent methylation profile at the corresponding location.
imputation_obj <- impute_met_state(obj = melissa_obj, test = dt_obj$met_test)
Now we use different metrics to evaluate the prediction performance, such as AUC, ROC curve and precision-recall curves and store them as elements in melissa_obj
object.
melissa_obj <- eval_imputation_performance(obj = melissa_obj,
imputation_obj = imputation_obj)
# AUC
cat("AUC: ", melissa_obj$imputation$auc)
## AUC: 0.8566162
# F-measure
cat("F-measure: ", melissa_obj$imputation$f_measure)
## F-measure: 0.7415819
This vignette was compiled using:
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Melissa_1.2.0 BPRMeth_1.12.0 GenomicRanges_1.38.0
## [4] GenomeInfoDb_1.22.0 IRanges_2.20.0 S4Vectors_0.24.0
## [7] BiocGenerics_0.32.0 knitr_1.25 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 mvtnorm_1.0-11 lattice_0.20-38
## [4] gtools_3.8.1 png_0.1-7 clues_0.6.1
## [7] assertthat_0.2.1 digest_0.6.22 foreach_1.4.7
## [10] truncnorm_1.0-8 R6_2.4.0 MatrixModels_0.4-1
## [13] evaluate_0.14 coda_0.19-3 ggplot2_3.2.1
## [16] highr_0.8 pillar_1.4.2 gplots_3.0.1.1
## [19] zlibbioc_1.32.0 rlang_0.4.1 lazyeval_0.2.2
## [22] gdata_2.18.0 data.table_1.12.6 SparseM_1.77
## [25] Matrix_1.2-17 rmarkdown_1.16 labeling_0.3
## [28] stringr_1.4.0 RCurl_1.95-4.12 munsell_0.5.0
## [31] compiler_3.6.1 xfun_0.10 pkgconfig_2.0.3
## [34] mcmc_0.9-6 htmltools_0.4.0 tidyselect_0.2.5
## [37] tibble_2.1.3 GenomeInfoDbData_1.2.2 bookdown_0.14
## [40] matrixcalc_1.0-3 codetools_0.2-16 crayon_1.3.4
## [43] dplyr_0.8.3 MASS_7.3-51.4 bitops_1.0-6
## [46] grid_3.6.1 gtable_0.3.0 magrittr_1.5
## [49] scales_1.0.0 KernSmooth_2.23-16 stringi_1.4.3
## [52] XVector_0.26.0 ROCR_1.0-7 cowplot_1.0.0
## [55] RColorBrewer_1.1-2 iterators_1.0.12 tools_3.6.1
## [58] glue_1.3.1 purrr_0.3.3 yaml_2.2.0
## [61] colorspace_1.4-1 BiocManager_1.30.9 caTools_1.17.1.2
## [64] quantreg_5.51 MCMCpack_1.4-4
[1] Kapourani, C. A., & Sanguinetti, G. (2018). Melissa: Bayesian clustering and imputation of single cell methylomes. bioRxiv, 312025, DOI: https://doi.org/10.1101/312025
[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432
[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.
This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.