deconvR 1.0.1
Recent studies associated the differences of cell-type proportions may be
correlated to certain phenotypes, such as cancer. Therefore, the demand for the
development of computational methods to predict cell type proportions increased.
Hereby, we developed deconvR
, a collection of functions designed for analyzing
deconvolution of the bulk sample(s) using an atlas of reference omic signature
profiles and a user-selected model. We wanted to give users an option to extend
their reference atlas. Users can create new reference atlases using
findSignatures
extend their atlas by adding more cell types. Additionally, we
included BSMeth2Probe
to make mapping whole-genome bisulfite sequencing data
to their probe IDs easier. So users can map WGBS methylation data (as in
methylKit or GRanges object format) to probe IDs, and the results of
this mapping can be used as the bulk samples in the deconvolution. We also
included a comprehensive DNA methylation atlas of 25 different cell types to use
in the main function deconvolute
. deconvolute
allows the user to specify
their desired deconvolution model (non-negative least squares regression,
support vector regression, quadratic programming, or robust linear regression),
and returns a dataframe which contains predicted cell-type proportions of bulk
methylation profiles, as well as partial R-squared values for each sample.
As an another option, users can generate a simulated table of a desired number
of samples, with either user-specified or random origin proportions using
simulateCellMix
. simulateCellMix
returns a second data frame called
proportions
, which contains the actual cell-type proportions of the simulated
sample. It can be used for testing the accuracy of the deconvolution by
comparing these actual proportions to the proportions predicted by
deconvolute
.
deconvolute
returns partial R-squares, to check if deconvolution brings
advantages on top of the basic bimodal profiles. The reference matrix usually
follows a bimodal distribution in the case of methylation, and taking the
average of the rows of methylation matrix might give a pretty similar profile
to the bulk methylation profile you are trying to deconvolute. If the
deconvolution is advantageous, partial R-squared is expect to be high.
The deconvR package can be installed from Bioconductor with:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("deconvR")
The comprehensive human methylome reference atlas created by Moss et al. 1 Moss,
J. et al. (2018). Comprehensive human cell-type methylation atlas reveals
origins of circulating cell-free DNA in health and disease. Nature
communications, 9(1), 1-12. https://doi.org/10.1038/s41467-018-07466-6 can be
used as the reference atlas parameter for several functions in this package.
This atlas was modified to remove duplicate CpG loci before being included in
the package as the dataframe. The dataframe is composed of 25 human cell types
and roughly 6000 CpG loci, identified by their Illumina Probe ID. For each cell
type and CpG locus, a methylation value between 0 and 1 is provided. This value
represents the fraction of methylated bases of the CpG locus. The atlas
therefore provides a unique methylation pattern for each cell type and can be
directly used as reference
in deconvolute
and simulateCellMix
, and atlas
in findSignatures
. Below is an example dataframe to illustrate the atlas
format.
library(deconvR)
data("HumanCellTypeMethAtlas")
head(HumanCellTypeMethAtlas[,1:5])
## IDs Monocytes_EPIC B.cells_EPIC CD4T.cells_EPIC NK.cells_EPIC
## 1 cg08169020 0.8866 0.2615 0.0149 0.0777
## 2 cg25913761 0.8363 0.2210 0.2816 0.4705
## 3 cg26955540 0.7658 0.0222 0.1492 0.4005
## 4 cg25170017 0.8861 0.5116 0.1021 0.4363
## 5 cg12827637 0.5212 0.3614 0.0227 0.2120
## 6 cg19442545 0.2013 0.1137 0.0608 0.0410
The GRanges object IlluminaMethEpicB5ProbeIDs
contains the Illumina probe
IDs of 400000 genomic loci (identified using the “seqnames”, “ranges”, and
“strand” values). This object is based off of the Infinium MethylationEPIC v1.0
B5 Manifest data. Unnecessary columns were removed and rows were truncated to
reduce file size before converting the file to a GRanges object. It can be
used directly as probe_id_locations
in BSmeth2Probe
.
data("IlluminaMethEpicB5ProbeIDs")
head(IlluminaMethEpicB5ProbeIDs)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | ID
## <Rle> <IRanges> <Rle> | <character>
## [1] chr19 5236005-5236006 + | cg07881041
## [2] chr20 63216298-63216299 + | cg18478105
## [3] chr1 6781065-6781066 + | cg23229610
## [4] chr2 197438742-197438743 - | cg03513874
## [5] chrX 24054523-24054524 + | cg09835024
## [6] chr14 93114794-93114795 - | cg05451842
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
As mentioned in the introduction section, users can extend their reference atlas
to incorporate new data. Or may combine different reference atlases to construct
a more comprehensive one. This can be done using the findSignatures
function.
In this example, since we don’t have any additional reference atlas, we will add
simulated data as a new cell type to reference atlas for example purposes.
First, ensure that atlas
(the signature matrix to be extended) and samples
(the new data to be added to the signature matrix) are compliant with the
function requirements. Below illustrates the samples
format.
samples <- simulateCellMix(3,reference = HumanCellTypeMethAtlas)$simulated
head(samples)
## IDs Sample 1 Sample 2 Sample 3
## 1 cg08169020 0.0340 0.02721569 0.24143097
## 2 cg25913761 0.2389 0.23493740 0.34955831
## 3 cg26955540 0.0338 0.07167034 0.25261550
## 4 cg25170017 0.2259 0.16381234 0.28276015
## 5 cg12827637 0.0307 0.06486313 0.17420701
## 6 cg19442545 0.1574 0.05850345 0.07537919
sampleMeta
should include all sample names in samples
, and specify the
origins they should be mapped to when added to atlas
.
sampleMeta <- data.table("Experiment_accession" = colnames(samples)[-1],
"Biosample_term_name" = "new cell type")
head(sampleMeta)
## Experiment_accession Biosample_term_name
## 1: Sample 1 new cell type
## 2: Sample 2 new cell type
## 3: Sample 3 new cell type
Use findSignatures
to extend the matrix.
extended_matrix <- findSignatures(samples = samples,
sampleMeta = sampleMeta,
atlas = HumanCellTypeMethAtlas)
## CELL TYPES IN EXTENDED ATLAS:
## new cell type
## Monocytes_EPIC
## B.cells_EPIC
## CD4T.cells_EPIC
## NK.cells_EPIC
## CD8T.cells_EPIC
## Neutrophils_EPIC
## Erythrocyte_progenitors
## Adipocytes
## Cortical_neurons
## Hepatocytes
## Lung_cells
## Pancreatic_beta_cells
## Pancreatic_acinar_cells
## Pancreatic_duct_cells
## Vascular_endothelial_cells
## Colon_epithelial_cells
## Left_atrium
## Bladder
## Breast
## Head_and_neck_larynx
## Kidney
## Prostate
## Thyroid
## Upper_GI
## Uterus_cervix
head(extended_matrix)
## IDs new_cell_type Monocytes_EPIC B.cells_EPIC CD4T.cells_EPIC
## 1 cg08169020 0.10088222 0.8866 0.2615 0.0149
## 2 cg25913761 0.27446524 0.8363 0.2210 0.2816
## 3 cg26955540 0.11936194 0.7658 0.0222 0.1492
## 4 cg25170017 0.22415750 0.8861 0.5116 0.1021
## 5 cg12827637 0.08992338 0.5212 0.3614 0.0227
## 6 cg19442545 0.09709421 0.2013 0.1137 0.0608
## NK.cells_EPIC CD8T.cells_EPIC Neutrophils_EPIC Erythrocyte_progenitors
## 1 0.0777 0.0164 0.8680 0.9509
## 2 0.4705 0.3961 0.8293 0.2385
## 3 0.4005 0.3474 0.7915 0.1374
## 4 0.4363 0.0875 0.7042 0.9447
## 5 0.2120 0.0225 0.5368 0.4667
## 6 0.0410 0.0668 0.1952 0.1601
## Adipocytes Cortical_neurons Hepatocytes Lung_cells Pancreatic_beta_cells
## 1 0.0336 0.0168 0.0340 0.0416 0.038875
## 2 0.3578 0.3104 0.2389 0.2250 0.132000
## 3 0.1965 0.0978 0.0338 0.0768 0.041725
## 4 0.0842 0.2832 0.2259 0.0544 0.111750
## 5 0.0287 0.1368 0.0307 0.1607 0.065975
## 6 0.0364 0.0222 0.1574 0.0122 0.003825
## Pancreatic_acinar_cells Pancreatic_duct_cells Vascular_endothelial_cells
## 1 0.0209 0.0130 0.0323
## 2 0.2249 0.1996 0.3654
## 3 0.0314 0.0139 0.2382
## 4 0.0309 0.0217 0.0972
## 5 0.0370 0.0230 0.0798
## 6 0.0378 0.0347 0.0470
## Colon_epithelial_cells Left_atrium Bladder Breast Head_and_neck_larynx Kidney
## 1 0.0163 0.0386 0.0462 0.0264 0.0470 0.0269
## 2 0.2037 0.2446 0.2054 0.1922 0.2045 0.1596
## 3 0.0193 0.1134 0.1269 0.1651 0.1523 0.1034
## 4 0.0187 0.0674 0.0769 0.0691 0.0704 0.0604
## 5 0.0193 0.0432 0.0459 0.0228 0.0687 0.0234
## 6 0.0193 0.0287 0.0246 0.0081 0.0098 0.0309
## Prostate Thyroid Upper_GI Uterus_cervix
## 1 0.0353 0.0553 0.0701 0.0344
## 2 0.1557 0.1848 0.1680 0.2026
## 3 0.0686 0.0943 0.1298 0.1075
## 4 0.0369 0.0412 0.0924 0.0697
## 5 0.0508 0.0726 0.0759 0.0196
## 6 0.0055 0.0188 0.0090 0.0166
WGBS methylation data first needs to be mapped to probes using BSmeth2Probe
before being deconvoluted. The methylation data WGBS_data
in BSmeth2Probe
may be either a GRanges object or a methylKit object.
Format of WGBS_data
as GRanges object:
load(system.file("extdata", "WGBS_GRanges.rda",
package = "deconvR"))
head(WGBS_GRanges)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | sample1
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 47176 + | 0.833333
## [2] chr1 47177 - | 0.818182
## [3] chr1 47205 + | 0.681818
## [4] chr1 47206 - | 0.583333
## [5] chr1 47362 + | 0.416667
## [6] chr1 49271 + | 0.733333
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
or as methylKit object:
head(methylKit::methRead(system.file("extdata", "test1.myCpG.txt",
package = "methylKit"),
sample.id="test", assembly="hg18",
treatment=1, context="CpG", mincov = 0))
## chr start end strand coverage numCs numTs
## 1 chr21 9764513 9764513 - 12 0 12
## 2 chr21 9764539 9764539 - 12 3 9
## 3 chr21 9820622 9820622 + 13 0 13
## 4 chr21 9837545 9837545 + 11 0 11
## 5 chr21 9849022 9849022 + 124 90 34
## 6 chr21 9853296 9853296 + 17 10 7
probe_id_locations
contains information needed to map cellular loci to probe IDs
data("IlluminaMethEpicB5ProbeIDs")
head(IlluminaMethEpicB5ProbeIDs)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | ID
## <Rle> <IRanges> <Rle> | <character>
## [1] chr19 5236005-5236006 + | cg07881041
## [2] chr20 63216298-63216299 + | cg18478105
## [3] chr1 6781065-6781066 + | cg23229610
## [4] chr2 197438742-197438743 - | cg03513874
## [5] chrX 24054523-24054524 + | cg09835024
## [6] chr14 93114794-93114795 - | cg05451842
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
Use BSmeth2Probe
to map WGBS data to probe IDs.
mapped_WGBS_data <- BSmeth2Probe(probe_id_locations = IlluminaMethEpicB5ProbeIDs,
WGBS_data = WGBS_GRanges,
multipleMapping = TRUE,
cutoff = 10)
head(mapped_WGBS_data)
## IDs sample1
## 1 cg00305774 1.0000000
## 2 cg00546078 0.8181818
## 3 cg00546307 0.5454545
## 4 cg00546971 0.5000000
## 5 cg00774867 0.8461538
## 6 cg00830435 0.9166667
This mapped data can now be used in deconvolute
. Here we will deconvolute it
using the previously extended signature matrix as the reference atlas.
deconvolution <- deconvolute(reference = HumanCellTypeMethAtlas,
bulk = mapped_WGBS_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9963 0.9963 0.9963 0.9963 0.9963 0.9963
deconvolution$proportions
## Monocytes_EPIC B.cells_EPIC CD4T.cells_EPIC NK.cells_EPIC
## sample1 0 0 0 0
## CD8T.cells_EPIC Neutrophils_EPIC Erythrocyte_progenitors Adipocytes
## sample1 0 0 0 0.5011789
## Cortical_neurons Hepatocytes Lung_cells Pancreatic_beta_cells
## sample1 0.2917357 0 0 0
## Pancreatic_acinar_cells Pancreatic_duct_cells
## sample1 0 0
## Vascular_endothelial_cells Colon_epithelial_cells Left_atrium Bladder
## sample1 0 0.001012524 0 0
## Breast Head_and_neck_larynx Kidney Prostate Thyroid Upper_GI
## sample1 0 0.2060729 0 0 0 0
## Uterus_cervix
## sample1 0
Here, we will use RNA-seq example data from the granulator package. We will
only use small sample of data. Notice that we set an IDs column. In order to run
deconvR
functions, you have to set the Gene names
colname as IDs
.
simulateCellMix
to simulate bulk RNA-seq data
and, it is also possible to use findSignatures
to extend RNA-seq reference
atlas. In this example, we won’t extend the reference atlas or use a simulated
data.library(granulator)
#To load the data from granulator package
load_ABIS()
## [1] "ABIS dataset was loaded successfully."
#Read the bulk RNA-seq data
bulk_RNA <- bulkRNAseq_ABIS[1:50,] %>%
as.data.frame() %>%
mutate(IDs = rownames(bulkRNAseq_ABIS[1:50,])) %>%
select("IDs", everything())
head(bulk_RNA[,1:5])
## IDs CYFZ FY2H FLWA 453W
## S1PR3 S1PR3 4.275496 11.544026 10.4491331 13.192052
## RXFP2 RXFP2 2.038530 3.434462 5.4659900 2.877763
## ADAMTS5 ADAMTS5 0.010506 0.000000 0.1700436 0.000000
## CLEC6A CLEC6A 4.786615 9.357342 6.1288175 8.606988
## FXYD6 FXYD6 19.881627 29.860584 20.3102595 26.095918
## NRG1 NRG1 20.896435 5.769681 21.5369025 9.638740
#Read the reference RNAseq data
reference_RNA <- sigMatrix_ABIS_S0 %>%
as.data.frame() %>%
mutate(IDs = rownames(sigMatrix_ABIS_S0))%>%
select("IDs", everything())
head(reference_RNA[1:5])
## IDs Monocytes.C NK T.CD8.Memory T.CD4.Naive
## S1PR3 S1PR3 45.720735 0.2790023 0.1981103 0.3657506
## RXFP2 RXFP2 17.877398 0.0000000 0.0000000 0.0000000
## ADAMTS5 ADAMTS5 2.550237 0.0000000 0.0000000 0.0000000
## CLEC6A CLEC6A 33.695996 0.0000000 0.0000000 0.0000000
## FXYD6 FXYD6 114.167642 0.4707691 0.1832934 0.2908456
## NRG1 NRG1 142.457347 0.1949568 0.1527024 0.2790682
We will use deconvolute
to deconvolute cell type proportions of the bulk
samples to their origin proportions using a quadric programming model.
deconv_RNA <- deconvR::deconvolute(reference = reference_RNA,
bulk = bulk_RNA,model = "qp")
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2190 0.3287 0.5162 0.5047 0.6498 0.7602
Then we can access the deconvoluted proportions of bulk sample profiles.
head(deconv_RNA$proportions)
## Monocytes.C NK T.CD8.Memory T.CD4.Naive T.CD8.Naive
## CYFZ 0.06348309 7.147599e-16 0.000000e+00 0.000000e+00 0.000000e+00
## FY2H 0.12192004 0.000000e+00 0.000000e+00 2.242393e-15 3.335591e-15
## FLWA 0.14104057 7.055411e-16 1.179500e-15 6.386099e-15 0.000000e+00
## 453W 0.15737710 0.000000e+00 4.598167e-15 1.635336e-14 7.598590e-16
## 684C 0.11535642 0.000000e+00 0.000000e+00 2.139016e-15 1.090887e-15
## CZJE 0.12817739 0.000000e+00 6.059986e-15 2.404947e-15 0.000000e+00
## B.Naive T.CD4.Memory MAIT T.gd.Vd2 Neutrophils.LD
## CYFZ 0.000000e+00 0.000000e+00 6.366268e-15 0.000000e+00 0.2687346
## FY2H 2.734682e-17 0.000000e+00 0.000000e+00 0.000000e+00 0.3144954
## FLWA 1.688036e-01 0.000000e+00 0.000000e+00 5.730223e-16 0.3559542
## 453W 3.024458e-01 4.845950e-17 0.000000e+00 0.000000e+00 0.4653581
## 684C 2.060142e-01 1.722048e-15 0.000000e+00 0.000000e+00 0.1898895
## CZJE 4.075307e-02 0.000000e+00 0.000000e+00 1.984425e-15 0.2649981
## T.gd.non.Vd2 Basophils.LD Monocytes.NC.I B.Memory mDCs
## CYFZ 5.104239e-15 2.465226e-16 0.12302750 0.00000e+00 0.012795335
## FY2H 8.589163e-15 7.842739e-16 0.21114858 0.00000e+00 0.000000000
## FLWA 6.538782e-15 7.249774e-16 0.05502392 0.00000e+00 0.000000000
## 453W 0.000000e+00 0.000000e+00 0.03605163 0.00000e+00 0.000000000
## 684C 0.000000e+00 8.896594e-16 0.12257615 7.61391e-16 0.007300725
## CZJE 0.000000e+00 1.569145e-15 0.07846819 0.00000e+00 0.002757467
## pDCs Plasmablasts
## CYFZ 4.803158e-01 0.05164364
## FY2H 1.625745e-01 0.18986147
## FLWA 2.401507e-17 0.27917772
## 453W 0.000000e+00 0.03876731
## 684C 2.078994e-01 0.15096372
## CZJE 2.361915e-01 0.24865428
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] granulator_1.2.0 dplyr_1.0.7 doParallel_1.0.16 iterators_1.0.13
## [5] foreach_1.5.1 deconvR_1.0.1 data.table_1.14.2 knitr_1.36
## [9] BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] minqa_1.2.4 colorspace_2.0-2
## [3] rjson_0.2.20 ellipsis_0.3.2
## [5] class_7.3-19 mclust_5.4.7
## [7] qvalue_2.26.0 XVector_0.34.0
## [9] GenomicRanges_1.46.0 proxy_0.4-26
## [11] Deriv_4.1.3 fansi_0.5.0
## [13] mvtnorm_1.1-3 lubridate_1.8.0
## [15] codetools_0.2-18 splines_4.1.1
## [17] R.methodsS3_1.8.1 jsonlite_1.7.2
## [19] nloptr_1.2.2.2 Rsamtools_2.10.0
## [21] R.oo_1.24.0 pheatmap_1.0.12
## [23] BiocManager_1.30.16 compiler_4.1.1
## [25] assertthat_0.2.1 Matrix_1.3-4
## [27] fastmap_1.1.0 limma_3.50.0
## [29] htmltools_0.5.2 tools_4.1.1
## [31] coda_0.19-4 gtable_0.3.0
## [33] glue_1.4.2 GenomeInfoDbData_1.2.7
## [35] reshape2_1.4.4 Rcpp_1.0.7
## [37] limSolve_1.5.6 bbmle_1.0.24
## [39] Biobase_2.54.0 jquerylib_0.1.4
## [41] vctrs_0.3.8 Biostrings_2.62.0
## [43] nlme_3.1-153 rtracklayer_1.54.0
## [45] rsq_2.2 xfun_0.27
## [47] stringr_1.4.0 fastseg_1.40.0
## [49] lme4_1.1-27.1 lpSolve_5.6.15
## [51] lifecycle_1.0.1 restfulr_0.0.13
## [53] gtools_3.9.2 XML_3.99-0.8
## [55] zlibbioc_1.40.0 MASS_7.3-54
## [57] scales_1.1.1 MatrixGenerics_1.6.0
## [59] SummarizedExperiment_1.24.0 RColorBrewer_1.1-2
## [61] yaml_2.2.1 ggplot2_3.3.5
## [63] pander_0.6.4 yulab.utils_0.0.4
## [65] emdbook_1.3.12 sass_0.4.0
## [67] bdsmatrix_1.3-4 stringi_1.7.5
## [69] S4Vectors_0.32.0 BiocIO_1.4.0
## [71] e1071_1.7-9 BiocGenerics_0.40.0
## [73] boot_1.3-28 methylKit_1.20.0
## [75] BiocParallel_1.28.0 epiR_2.0.38
## [77] GenomeInfoDb_1.30.0 rlang_0.4.12
## [79] pkgconfig_2.0.3 matrixStats_0.61.0
## [81] bitops_1.0-7 evaluate_0.14
## [83] lattice_0.20-45 purrr_0.3.4
## [85] GenomicAlignments_1.30.0 cowplot_1.1.1
## [87] tidyselect_1.1.1 plyr_1.8.6
## [89] magrittr_2.0.1 bookdown_0.24
## [91] R6_2.5.1 IRanges_2.28.0
## [93] generics_0.1.1 nnls_1.4
## [95] DelayedArray_0.20.0 DBI_1.1.1
## [97] pillar_1.6.4 survival_3.2-13
## [99] RCurl_1.98-1.5 tibble_3.1.5
## [101] dtangle_2.0.9 crayon_1.4.1
## [103] utf8_1.2.2 rmarkdown_2.11
## [105] grid_4.1.1 digest_0.6.28
## [107] tidyr_1.1.4 numDeriv_2016.8-1.1
## [109] gridGraphics_0.5-1 R.utils_2.11.0
## [111] stats4_4.1.1 munsell_0.5.0
## [113] ggplotify_0.1.0 BiasedUrn_1.07
## [115] bslib_0.3.1 quadprog_1.5-8
stopCluster(cl)