Applying enrichment analysis to methylation array data is difficult due to the presence of a variable number of probes per gene and the fact that a probe could belong to overlapping genes. There are existing over-representation based approaches to this, however they appear to lack sensitivity. To address this, we have developed a simple approach to aggregating differential methylation data to make it suitable for downstream use by mitch. The process begins with the differential probe methylation results from limma. Here, we summarise the limma t-statistics by gene using the arithmetic mean. The resulting gene level differential methylation scores then undergo mitch as usual.
In addition to mitch v1.15.0 of higher, you will need an annotation set for the array you are using. These are conveniently provided as Bioconductor packages for HM450K and EPIC arrays.
HM450k: IlluminaHumanMethylation450kanno.ilmn12.hg19
EPIC: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
One issue with these is that the annotations are quite old, which means that some of the gene names have changed (~12%), so it is a good idea to screen the list of gene names and update obsolete names using the HGNChelper package.
In this cut down example we will be using a sample of 200 Reactome gene sets.
## $`2-LTR circle formation`
## [1] "Reactome Pathway" "BANF1" "HMGA1" "LIG4"
## [5] "PSIP1" "XRCC4" "XRCC5" "XRCC6"
## [9] "gag" "gag-pol" "rev" "vif"
## [13] "vpr" "vpu"
##
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "Reactome Pathway" "PRPS1" "PRPS1L1" "PRPS2"
##
## $`A tetrasaccharide linker sequence is required for GAG synthesis`
## [1] "Reactome Pathway" "AGRN" "B3GALT6" "B3GAT1"
## [5] "B3GAT2" "B3GAT3" "B4GALT7" "BCAN"
## [9] "BGN" "CSPG4" "CSPG5" "DCN"
## [13] "GPC1" "GPC2" "GPC3" "GPC4"
## [17] "GPC5" "GPC6" "HSPG2" "NCAN"
## [21] "SDC1" "SDC2" "SDC3" "SDC4"
## [25] "VCAN" "XYLT1" "XYLT2"
In order to get mitch working, we need a 2 column table that maps probes to genes. The workflow shown here is for the HM450k array, and an EPIC example is show at the end of the report.
anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
head(myann)
## UCSC_RefGene_Name UCSC_RefGene_Group Islands_Name
## cg00050873 TSPY4;FAM197Y2 Body;TSS1500 chrY:9363680-9363943
## cg00212031 TTTY14 TSS200 chrY:21238448-21240005
## cg00213748 chrY:8147877-8148210
## cg00214611 TMSB4Y;TMSB4Y 1stExon;5'UTR chrY:15815488-15815779
## cg00455876 chrY:9385471-9385777
## cg01707559 TBL1Y;TBL1Y;TBL1Y TSS200;TSS200;TSS200 chrY:6778574-6780028
## Relation_to_Island
## cg00050873 N_Shore
## cg00212031 Island
## cg00213748 S_Shore
## cg00214611 Island
## cg00455876 Island
## cg01707559 Island
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
dim(gt1)
## [1] 407090 2
## gene probe
## 1 TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3 TTTY14 cg00212031
## 4 TMSB4Y cg00214611
## 5 TBL1Y cg01707559
## 6 TMSB4Y cg02004872
Update old gene symbols using HGNChelper (13% of 21k names).
#new.hgnc.table <- getCurrentHumanMap()
new.hgnc.table <- readRDS("../inst/extdata/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
## [1] 2793
## gene probe
## 1 TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3 TTTY14 cg00212031
## 4 TMSB4Y cg00214611
## 5 TBL1Y cg01707559
## 6 TMSB4Y cg02004872
Here we will read in a table of differential probe methylation data generated by limma. We will use the t-statistics for downstream analysis.
## logFC AveExpr t P.Value adj.P.Val B
## cg04905210 -0.2926788 -2.812451 -5.380966 2.372336e-07 0.1002013 5.906039
## cg09338148 -0.2938597 -1.155475 -5.114443 8.261969e-07 0.1744820 4.880538
## cg04247967 -0.2070451 3.180593 -4.986000 1.484616e-06 0.2090211 4.399278
## cg06458106 -0.1906572 1.675741 -4.793141 3.511038e-06 0.2413324 3.693145
## cg26425904 -0.2540917 -4.019515 -4.761482 4.034832e-06 0.2413324 3.579161
## cg19590707 -0.2471850 -1.237702 -4.716409 4.912691e-06 0.2413324 3.417844
Now that the profiling data is loaded, we need to import with mitch package, which establishes the probe-gene relationships and aggregates the data to gene level scores. As you can see, the input was a table of 422k probes and the output is 19,380 gene scores. Many probes not annotated to genes are discarded.
## The input is a single dataframe; one contrast only. Converting
## it to a list for you.
## Note: Mean no. genes in input = 422374
## Note: no. genes in output = 19380
## Warning in mitch_import(x, DEtype = "limma", geneTable = gt1): Warning: less than half of the input genes are also in the
## output
## x
## A1BG -0.3505339
## A1BG-AS1 -0.1904379
## A1CF -0.8443613
## A2M -0.6794687
## A2ML1 0.2157592
## A4GALT -0.4001430
## [1] 19380 1
The mitch_calc
function performs an enrichment test. If you imported multiple data tables in the previous step, mitch will conduct a multivariate enrichment test. The results can be prioritised by significance or effect size. My recommendation is to discard results with FDR>0.05 then prioritise by effect size, which for us is the mitch enrichment score called S distance. In this example I also set the minimum gene set size to 5.
## Note: Enrichments with large effect sizes may not be
## statistically significant.
## set setSize pANOVA
## 113 Apoptotic cleavage of cell adhesion proteins 11 6.207579e-04
## 1 2-LTR circle formation 7 4.362176e-02
## 33 Acetylcholine Neurotransmitter Release Cycle 17 1.672291e-03
## 52 Activation of C3 and C5 6 6.479358e-02
## 155 Bicarbonate transporters 10 2.806954e-02
## 68 Activation of SMO 18 4.483409e-03
## 88 Adenylate cyclase inhibitory pathway 14 1.240799e-02
## 87 Adenylate cyclase activating pathway 10 3.530185e-02
## 80 Acyl chain remodeling of DAG and TAG 5 1.416075e-01
## 134 Autodegradation of the E3 ubiquitin ligase COP1 50 4.346153e-06
## s.dist p.adjustANOVA
## 113 -0.5958772 4.947074e-03
## 1 0.4403846 1.541302e-01
## 33 -0.4402727 1.063577e-02
## 52 0.4353429 2.102485e-01
## 155 -0.4010945 1.117618e-01
## 68 -0.3869091 2.545936e-02
## 88 -0.3859415 5.636774e-02
## 87 -0.3843986 1.277451e-01
## 80 -0.3795716 3.482414e-01
## 134 0.3755137 7.678203e-05
For presentation of the results you could consider making a volcano plot and/or making a barplot of S.dist values for selected gene sets that meet the FDR cutoff. You can also use the built in functions to make a set of charts and html report. It is vitally important to inspect the detailed resports to understand which genes are driving the enrichment in your dataset.
## png
## 2
## Dataset saved as " /tmp/RtmpK4y2Il/methreport.rds ".
## processing file: mitch.Rmd
## output file: /tmp/RtmpbpsniZ/Rbuild2d68ef1fe40c77/mitch/vignettes/mitch.knit.md
## /usr/bin/pandoc +RTS -K512m -RTS /tmp/RtmpbpsniZ/Rbuild2d68ef1fe40c77/mitch/vignettes/mitch.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/RtmpK4y2Il/mitch_report.html --lua-filter /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --template /home/biocbuild/bbs-3.19-bioc/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/RtmpK4y2Il/rmarkdown-str2d6b79345887cc.html
##
## Output created: /tmp/RtmpK4y2Il/mitch_report.html
## [1] TRUE
Here is the code to create a probe-gene table for EPIC array data. Be sure to update the gene symbols with HGNChelper
before conducting enrichment analysis.
anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt <- stack(gp2)
colnames(gt) <- c("gene","probe")
gt$probe <- as.character(gt$probe)
dim(gt)
## [1] 684970 2
## 'data.frame': 684970 obs. of 2 variables:
## $ gene : chr "YTHDF1" "EIF2S3" "PKN3" "CCDC57" ...
## $ probe: chr "cg18478105" "cg09835024" "cg14361672" "cg01763666" ...
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] kableExtra_1.4.0
## [2] pkgload_1.4.0
## [3] GGally_2.2.1
## [4] ggplot2_3.5.1
## [5] reshape2_1.4.4
## [6] beeswarm_0.4.0
## [7] gplots_3.1.3.1
## [8] gtools_3.9.5
## [9] tibble_3.2.1
## [10] dplyr_1.1.4
## [11] echarts4r_0.4.5
## [12] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [13] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [14] minfi_1.50.0
## [15] bumphunter_1.46.0
## [16] locfit_1.5-9.10
## [17] iterators_1.0.14
## [18] foreach_1.5.2
## [19] Biostrings_2.72.1
## [20] XVector_0.44.0
## [21] SummarizedExperiment_1.34.0
## [22] Biobase_2.64.0
## [23] MatrixGenerics_1.16.0
## [24] matrixStats_1.4.1
## [25] GenomicRanges_1.56.1
## [26] GenomeInfoDb_1.40.1
## [27] IRanges_2.38.1
## [28] S4Vectors_0.42.1
## [29] BiocGenerics_0.50.0
## [30] HGNChelper_0.8.14
## [31] mitch_1.16.1
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.1 later_1.3.2
## [3] BiocIO_1.14.0 bitops_1.0-8
## [5] preprocessCore_1.66.0 XML_3.99-0.17
## [7] lifecycle_1.0.4 lattice_0.22-6
## [9] MASS_7.3-61 base64_2.0.1
## [11] scrime_1.3.5 magrittr_2.0.3
## [13] limma_3.60.4 sass_0.4.9
## [15] rmarkdown_2.28 jquerylib_0.1.4
## [17] yaml_2.3.10 httpuv_1.6.15
## [19] doRNG_1.8.6 askpass_1.2.0
## [21] DBI_1.2.3 RColorBrewer_1.1-3
## [23] abind_1.4-5 zlibbioc_1.50.0
## [25] quadprog_1.5-8 purrr_1.0.2
## [27] RCurl_1.98-1.16 GenomeInfoDbData_1.2.12
## [29] genefilter_1.86.0 annotate_1.82.0
## [31] svglite_2.1.3 DelayedMatrixStats_1.26.0
## [33] codetools_0.2-20 DelayedArray_0.30.1
## [35] xml2_1.3.6 tidyselect_1.2.1
## [37] UCSC.utils_1.0.0 beanplot_1.3.1
## [39] illuminaio_0.46.0 GenomicAlignments_1.40.0
## [41] jsonlite_1.8.8 multtest_2.60.0
## [43] survival_3.7-0 systemfonts_1.1.0
## [45] tools_4.4.1 Rcpp_1.0.13
## [47] glue_1.7.0 gridExtra_2.3
## [49] SparseArray_1.4.8 xfun_0.47
## [51] HDF5Array_1.32.1 withr_3.0.1
## [53] fastmap_1.2.0 rhdf5filters_1.16.0
## [55] fansi_1.0.6 openssl_2.2.1
## [57] caTools_1.18.3 digest_0.6.37
## [59] R6_2.5.1 mime_0.12
## [61] colorspace_2.1-1 RSQLite_2.3.7
## [63] utf8_1.2.4 tidyr_1.3.1
## [65] generics_0.1.3 data.table_1.16.0
## [67] rtracklayer_1.64.0 httr_1.4.7
## [69] htmlwidgets_1.6.4 S4Arrays_1.4.1
## [71] ggstats_0.6.0 pkgconfig_2.0.3
## [73] gtable_0.3.5 blob_1.2.4
## [75] siggenes_1.78.0 htmltools_0.5.8.1
## [77] scales_1.3.0 png_0.1-8
## [79] knitr_1.48 rstudioapi_0.16.0
## [81] tzdb_0.4.0 rjson_0.2.22
## [83] nlme_3.1-166 curl_5.2.2
## [85] cachem_1.1.0 rhdf5_2.48.0
## [87] stringr_1.5.1 KernSmooth_2.23-24
## [89] AnnotationDbi_1.66.0 restfulr_0.0.15
## [91] GEOquery_2.72.0 pillar_1.9.0
## [93] grid_4.4.1 reshape_0.8.9
## [95] vctrs_0.6.5 promises_1.3.0
## [97] xtable_1.8-4 evaluate_0.24.0
## [99] readr_2.1.5 GenomicFeatures_1.56.0
## [101] cli_3.6.3 compiler_4.4.1
## [103] Rsamtools_2.20.0 rlang_1.1.4
## [105] crayon_1.5.3 rngtools_1.5.2
## [107] nor1mix_1.3-3 mclust_6.1.1
## [109] plyr_1.8.9 stringi_1.8.4
## [111] viridisLite_0.4.2 BiocParallel_1.38.0
## [113] munsell_0.5.1 Matrix_1.7-0
## [115] hms_1.1.3 sparseMatrixStats_1.16.0
## [117] bit64_4.0.5 Rhdf5lib_1.26.0
## [119] KEGGREST_1.44.1 statmod_1.5.0
## [121] shiny_1.9.1 highr_0.11
## [123] memoise_2.0.1 bslib_0.8.0
## [125] bit_4.0.5 splitstackshape_1.4.8