The GEM package provides a highly efficient R tool suite for performing epigenome wide association studies (EWAS). GEM provides three major functions named GEM_Emodel
, GEM_Gmodel
and GEM_GxEmodel
to study the interplay of Gene, Environment and Methylation (GEM). Within GEM, the existing “Matrix eQTL” package is utilized and extended to study methylation quantitative trait loci (methQTL) and the interaction of genotype and environment (GxE) to determine DNA methylation variation, using matrix based iterative correlation and memory-efficient data analysis. GEM can facilitate reliable genome-wide methQTL and GxE analysis on a standard laptop computer within minutes.
The input data to this package are normal text files presenting methylation profiles, genotype variants and environmental factors including covariates. Each row presents one CpG probe or SNP position or an environment measure, while each column represents one sample.
If you are using Rpackage GEM in a publication, please cite [1]. Rpackage GEM adopted the matrix operation method, which is described in [2]. Some of the sample data were from [3].
GEM package provides a user friendly GUI for runing GEM package easily and quickly. Launch the GUI with following codes:
GUI for selection models:
Run the model analysis:
User can find the demo data with codes below:
## [1] "cov.txt" "env.txt" "gxe.txt" "methylation.txt"
## [5] "snp.txt"
The format of input files for GEM are explained as below:
2.1 cov.txt - Artificial covariate data for GEM sample code.
Artificial data set with 1 covariate, for example, gender (encoded as 1 for male and 2 for female), across 237 samples. Columns of the file must match to those of the methylation and genotype data sets. In practical use, covariate data can contain multiple covariates, with each row representing one covariate and columns must match those in methylation and genotype data sets. Data in this file is the “covt” in GEM_Emodel
lm (M ~ E + covt)
and GEM_Gmodel
lm(M ~ G + covrt)
.
Format:
## S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## Gender 2 2 2 1 2 1 1 2 1 1
2.2 env.txt - Artificial environment factor for GEM sample code.
Artificial data set with 1 environmental factor. Environmental factor can be one of the phenotypes, or maternal conditions or birth outcomes that is studied the association with methylation or genotype variants, for example, gestational age (GA) from 28 to 41 weeks across 237 samples. Columns of the file must match to those of the methylation and genotype data sets. Data in this file is the “E” in GEM_Emodel
lm (M ~ E + covt)
.
Format:
## S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## env 28 29 28 34 25 31 33 30 28 28
2.3 gxe.txt - Artificial covariate and environment data for GEM sample code.
Artificial data set with 1 covariate and 1 environmental factor across 237 samples. Columns of the file must match to those of the methylation and genotype data sets. In practical use, it can contain n covariates and 1 environmental factor, just need to put the environmental factor as the last covariates at the last row. Data in this file combines both environment (E) and covariates (covt) in GEM_GxEmodel
lm (M ~ G x E + covt)
.
Format:
## S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## Gender 2 2 2 1 2 1 1 2 1 1
## env 28 29 28 34 25 31 33 30 28 28
2.4 methylation.txt - A subset of methylation data for GEM sample code.
A subset of DNA methylation profiles ranged in [0,1] for 100 CpGs across 237 real clinical samples. Each row represents one CpG’s profile across all 237 samples. Columns of the file must match to those of the covariate and genotype data sets. Data in this file is “M” used in GEM_Emodel
, GEM_Gmodel
, and GEM_GxEmodel
.
Format:
## ID S1 S2 S3 S4 S5 S6
## 1 CpG1 0.257156 0.276115 0.226727 0.209648 0.424285 0.307873
## 2 CpG2 0.474323 0.262313 0.374242 0.401164 0.652304 0.301403
## 3 CpG3 0.635235 0.769377 0.657936 0.639285 0.669690 0.710200
## 4 CpG4 0.454893 0.439979 0.251926 0.292658 0.365113 0.217411
## 5 CpG5 0.878137 0.843197 0.679204 0.890284 0.686224 0.764305
2.5 snp.txt - A subset of genotype data for GEM sample code.
A subset with genotype data encoded as 1,2,3 for major allele homozygote (AA), heterozygote (AB) and minor allele homozygote (BB) for 100 SNPs across 237 real clinical samples. Each row represents one SNP profile across 237 samples. Columns of the file must match to those of the covariate and methylation data sets. Data in this file is “G” used in GEM_Gmodel
and GEM_GxEmodel
.
Format:
## ID S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
## 1 SNP1 3 3 2 3 1 3 3 3 2 3 2 2 2 2
## 2 SNP2 1 1 2 1 2 1 2 2 1 1 2 2 1 1
## 3 SNP3 1 1 2 3 3 3 2 3 3 2 3 3 2 3
## 4 SNP4 2 3 3 3 3 3 3 3 3 3 2 3 3 3
## 5 SNP5 2 3 2 1 3 3 3 3 3 2 2 2 3 3
3.1 GEM_Emodel:
env_file_name = paste(DATADIR, "env.txt", sep = .Platform$file.sep)
covariate_file_name = paste(DATADIR, "cov.txt", sep = .Platform$file.sep)
methylation_file_name = paste(DATADIR, "methylation.txt", sep = .Platform$file.sep)
Emodel_pv = 1
Emodel_result_file_name = "Result_Emodel.txt"
Emodel_qqplot_file_name = "QQplot_Emodel.jpg"
GEM_Emodel(env_file_name, covariate_file_name, methylation_file_name, Emodel_pv, Emodel_result_file_name, Emodel_qqplot_file_name, savePlot=FALSE)
## 100.00% done, 100 CpGs
## Analysis done in: 0.03 seconds
Results:
## cpg beta stats pvalue FDR
## 1 CpG99 0.002820575 2.779243 0.005891264 0.5891264
## 2 CpG32 0.003551990 2.008946 0.045691601 0.9345452
## 3 CpG21 0.001896850 1.886230 0.060502399 0.9345452
## 4 CpG66 0.001602048 1.866568 0.063212780 0.9345452
## 5 CpG75 -0.001969172 -1.696740 0.091075634 0.9345452
## 6 CpG34 0.003809056 1.688606 0.092627076 0.9345452
3.2 GEM_Gmodel:
snp_file_name = paste(DATADIR, "snp.txt", sep = .Platform$file.sep)
covariate_file_name = paste(DATADIR, "cov.txt", sep = .Platform$file.sep)
methylation_file_name = paste(DATADIR, "methylation.txt", sep = .Platform$file.sep)
Gmodel_pv = 1e-04
Gmodel_result_file_name = "Result_Gmodel.txt"
GEM_Gmodel(snp_file_name, covariate_file_name, methylation_file_name, Gmodel_pv, Gmodel_result_file_name)
## 100.00% done, 144 methQTL
## Analysis done in: 0.088 seconds
Results:
## cpg snp beta stats pvalue FDR
## 1 CpG32 SNP962 0.17808859 42.28204 1.482360e-111 1.482360e-106
## 2 CpG94 SNP700 -0.21534752 -18.43573 1.761554e-47 8.807769e-43
## 3 CpG14 SNP578 -0.15171656 -16.70323 9.169281e-42 3.056427e-37
## 4 CpG81 SNP690 0.10567235 13.47239 5.237893e-31 1.309473e-26
## 5 CpG75 SNP589 0.07781375 13.07099 1.112935e-29 2.225870e-25
## 6 CpG94 SNP703 0.13979006 12.55871 5.390763e-28 8.984606e-24
3.3 GEM_GxEmodel:
snp_file_name = paste(DATADIR, "snp.txt", sep = .Platform$file.sep)
covariate_file_name = paste(DATADIR, "gxe.txt", sep = .Platform$file.sep)
methylation_file_name = paste(DATADIR, "methylation.txt", sep = .Platform$file.sep)
GxEmodel_pv = 1
GxEmodel_result_file_name = "Result_GxEmodel.txt"
GEM_GxEmodel(snp_file_name, covariate_file_name, methylation_file_name, GxEmodel_pv, GxEmodel_result_file_name, topKplot = 1, savePlot=FALSE)
## 100.00% done, 100,000 cpg-snp pairs
## Analysis done in: 0.128 seconds
## `geom_smooth()` using formula 'y ~ x'
Results:
## cpg snp beta stats pvalue FDR
## 1 CpG25 SNP991 0.007363396 4.424023 1.490302e-05 0.8948726
## 2 CpG70 SNP232 -0.007460388 -4.341772 2.112071e-05 0.8948726
## 3 CpG83 SNP77 -0.006540673 -4.130645 5.051351e-05 0.8948726
## 4 CpG66 SNP44 0.007428030 4.043966 7.155883e-05 0.8948726
## 5 CpG59 SNP592 0.025209578 4.042911 7.186018e-05 0.8948726
## 6 CpG30 SNP847 -0.005175795 -4.031095 7.532016e-05 0.8948726
[1] Pan H, Holbrook JD, Karnani N, Kwoh CK (2016). “Gene, Environment and Methylation (GEM): A tool suite to efficiently navigate large scale epigenome wide association studies and integrate genotype and interaction between genotype and environment.” BMC Bioinformatics (submitted).
[2] Shabalin AA. (2012). “Matrix eQTL: ultra fast eQTL analysis via large matrix operations.” Bioinformatics 28(10): 1353-1358.
[3] Teh AL, Pan H, Chen L, Ong ML, Dogra S, Wong J, MacIsaac JL, Mah SM, McEwen LM, Saw SM et al(2014): “The effect of genotype and in utero environment on interindividual variation in neonate DNA methylomes”. Genome research, 24(7):1064-1074.