This notebook describes how to obtain the GLINT
software suite for DNAm analysis, and how to run GLINT
with the EPISTRUCTURE
method for inferring population genetic ancestry/descent from HM450K DNAm array probes. These command line tools are called using a conda virtual environment managed from an R session with the basilisk
Bioconductor package. Code in this notebook should work for Mac and Linux-like environments. Consult Rahmani et al. (2017) for more details about the EPISTRUCTURE
method.
GLINT
softwareFirst, obtain the latest version of GLINT
online by downloading the source from GitHub.
Also ensure the basilisk
Bioconductor/R package is installed. We’ll use this to conveniently manage conda virtual environments to run GLINT
from an R session.
BiocManager::install("basilisk")
library(basilisk)
Next, set up a virtual environment using conda. This step is crucial for reproducible research, as it enables control over which software versions to use in a session. This enables running software that is older or no longer maintained, a fairly common task in computational sciences.
Using basilisk
, set up a Python 2 environment with seven dependencies (numpy
, pandas
, scipy
, scikit-learn
, matplotlib
, statsmodels
, and cvxopt
) for which we specify the version using the form “packagename==version” (e.g. “numpy==1.15”).
env.name <- "glint_env" # name the new virtual environment
pkg.name <- "recountmethylation" # set env properties
pkgv <- c("python==2.7", # python version (v2.7)
"numpy==1.15", # numpy version (v1.15)
"pandas==0.22", # pandas version (v0.22)
"scipy==1.2", # scipy version (v1.2)
"scikit-learn==0.19", # scikit-learn (v0.19)
"matplotlib==2.2", # matplotlib (v2.2)
"statsmodels==0.9", # statsmodels (v0.9)
"cvxopt==1.2") # cvxopt (v1.2)
glint_env <- BasiliskEnvironment(envname = env.name, pkgname = pkg.name,
packages = pkgv)
proc <- basiliskStart(glint_env) # define run process
on.exit(basiliskStop(proc)) # define exit process
This makes a BasiliskEnvironment
object, glint_env
, with a starting process called proc
and a session end process specified with on.exit()
.
This section shows how to run GLINT
on an example dataset, with the EPISTRUCTURE
method enabled. It includes info about the required data formats and shows how to adjust for covariates.
To run GLINT
on DNAm array stored as a SummarizedExperiment
object, first access the test HM450K MethylSet
from the minfiData
package.
library(minfiData)
ms <- get(data("MsetEx")) # load example MethylSet
Now load the explanatory CpG probe vector for EPISTRUCTURE
. This small subset of 4,913 HM450K array probes was found by Rahmani et al. (2017) to be strongly correlated with SNPs informing population ancestry and genetic structure. Access them from recountmethylation
as follows.
dpath <- system.file("extdata", "glint_files",
package = "recountmethylation") # get the dir path
cgv.fpath <- file.path(dpath, "glint_epistructure_explanatory-cpgs.rda")
glint.cgv <- get(load(cgv.fpath)) # load explanatory probes
Subset ms
, a MethylSet
, to include just the 4,913 explanatory probes. This will save considerable disk space and memory for processing very large datasets.
mf <- ms[rownames(ms) %in% glint.cgv,] # filter ms on explanatory probes
dim(mf) # mf dimensions: [1] 4913 6
Next, identify desired model covariates from sample metadata, then convert to numeric/float format (required by GLINT
). Specify the variables “age” and “sex” corresponding to columns in the file covariates_minfidata.txt
.
# get covar -- all vars should be numeric/float
covar <- as.data.frame(colData(mf)[,c("age", "sex")]) # get sample metadata
covar[,"sex"] <- ifelse(covar[,"sex"] == "M", 1, 0) # relabel sex for glint
# write covariates matrix
covar.fpath <- file.path("covariates_minfidata.txt") # specify covariate table path
# write table colnames, values
write.table(covar, file = covar.fpath, sep = "\t", row.names = T,
col.names = T, append = F, quote = F) # write covariates table
Now calculate the DNAm fractoins or “Beta-values”. Impute any missing values with row medians, and write the final file to bval_minfidata.txt
.
bval.fpath <- file.path("bval_minfidata.txt") # specify dnam fractions table name
mbval <- t(apply(as.matrix(getBeta(mf)),1,function(ri){
ri[is.na(ri)] <- median(ri,na.rm=T) # impute na's with row medians
return(round(ri, 4))
})); rownames(mbval) <- rownames(mf) # assign probe ids to row names
write.table(mbval, file = bval.fpath, sep = sepsym,
row.names = T, col.names = T, append = F,
quote = F) # write dnam fractions table
Next, set the system commands to make command line calls from R. Define these manually as strings to be passed to the system()
function, specifying the paths to the new minfiData
example files.
glint.dpath <- "glint-1.0.4" # path to the main glint app dir
glint.pypath <- file.path(glint.dpath, "glint.py") # path to the glint .py script
data.fpath <- file.path("bval_minfidata.txt") # path to the DNAm data table
covar.fpath <- file.path("covariates_minfidata.txt") # path to the metadata table
out.fstr <- file.path("glint_results_minfidata") # base string for ouput results files
covarv <- c("age", "sex") # vector of valid covariates
covar.str <- paste0(covarv, collapse = " ") # format the covariates vector
cmd.str <- paste0(c("python", glint.pypath,
"--datafile", data.fpath,
"--covarfile", covar.fpath,
"--covar", covar.str,
"--epi", "--out", out.fstr),
collapse = " ") # get the final command line call
The commands stored as cmd.str
include the path to the latest GLINT
version, glint.path
, the paths to the datafile.txt
and covariates.txt
tutorial files, the variable names age
and gender
which are our covariates of interest and correspond to column names in covariates.txt
. We also used the --epi
flag to ensure the EPISTRUCTURE
method is run.
Now run GLINT
with basiliskRun()
. This should relay system outputs back to our console, which are included as comments in the below code chunk.
basiliskRun(proc, function(cmd.str){system(cmd.str)}, cmd.str = cmd.str) # run glint
# this returns:
# INFO >>> python glint-1.0.4/glint.py --datafile bval_minfidata.txt --covarfile covariates_minfidata.txt --covar age sex --epi --out glint_results_minfidata
# INFO Starting GLINT...
# INFO Validating arguments...
# INFO Loading file bval_minfidata.txt...
# INFO Checking for missing values in the data file...
# INFO Validating covariates file...
# INFO Loading file covariates_minfidata.txt...
# INFO New covariates were found: age, sex.
# INFO Running EPISTRUCTURE...
# INFO Removing non-informative sites...
# INFO Including sites...
# INFO Include sites: 4913 CpGs from the reference list of 4913 CpGs will be included...
# WARNING Found no sites to exclude.
# INFO Using covariates age, sex.
# INFO Regressing out covariates...
# INFO Running PCA...
# INFO The first 1 PCs were saved to glint_results_minfidata.epistructure.pcs.txt.
# INFO Added covariates epi1.
# Validating all dependencies are installed...
# All dependencies are installed
# [1] 0
Since we declared --out glint_results_minfidata
, results files are saved with the beginning string “glint_results_minfidata” appended. Logs were saved to the file with the *.glint.log
extension, while data were saved to the file with the *.epistructure.pcs.txt
extension.
Now inspect the output results data file glint_results_minfidata.epistructure.pcs.txt
.
out.fpath <- paste0(out.fpath, ".epistructure.pcs.txt")
res2 <- read.table(out.fpath, sep = "\t")
colnames(res2) <- c("sample", "epistructure.pc")
dim(res2)
## [1] 6 2
res2
The first results column reflects sample IDs from the columns in bval_minfidata.txt
. Remaining columns show the EPISTRUCTURE
population components. While just one population component calculated in this example, experiment datasets may generate outputs with more than one population component and thus several component columns.
For more details about GLINT
, see the software documentation and GitHub repo. Additional tutorial files are also available.
Consult Rahmani et al. (2017) for more details about the EPISTRUCTURE
method, including the discovery of explanatory CpG probes associated with population structure SNPs.
For more details about setting up virtual environments from R, consult the basilisk
package documentation.
utils::sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] HDF5Array_1.22.1 DelayedArray_0.20.0
## [3] Matrix_1.4-0 limma_3.50.1
## [5] gridExtra_2.3 ggplot2_3.3.5
## [7] knitr_1.37 recountmethylation_1.4.5
## [9] minfi_1.40.0 bumphunter_1.36.0
## [11] locfit_1.5-9.5 iterators_1.0.14
## [13] foreach_1.5.2 Biostrings_2.62.0
## [15] XVector_0.34.0 SummarizedExperiment_1.24.0
## [17] Biobase_2.54.0 MatrixGenerics_1.6.0
## [19] matrixStats_0.61.0 GenomicRanges_1.46.1
## [21] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [23] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [25] rhdf5_2.38.1 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] BiocFileCache_2.2.1 plyr_1.8.6
## [3] splines_4.1.2 BiocParallel_1.28.3
## [5] digest_0.6.29 htmltools_0.5.2
## [7] magick_2.7.3 fansi_1.0.2
## [9] magrittr_2.0.2 memoise_2.0.1
## [11] tzdb_0.2.0 readr_2.1.2
## [13] annotate_1.72.0 askpass_1.1
## [15] siggenes_1.68.0 prettyunits_1.1.1
## [17] colorspace_2.0-3 blob_1.2.2
## [19] rappdirs_0.3.3 xfun_0.30
## [21] dplyr_1.0.8 crayon_1.5.0
## [23] RCurl_1.98-1.6 jsonlite_1.8.0
## [25] genefilter_1.76.0 GEOquery_2.62.2
## [27] survival_3.3-1 glue_1.6.2
## [29] gtable_0.3.0 zlibbioc_1.40.0
## [31] Rhdf5lib_1.16.0 scales_1.1.1
## [33] DBI_1.1.2 rngtools_1.5.2
## [35] Rcpp_1.0.8.2 xtable_1.8-4
## [37] progress_1.2.2 bit_4.0.4
## [39] mclust_5.4.9 preprocessCore_1.56.0
## [41] httr_1.4.2 RColorBrewer_1.1-2
## [43] ellipsis_0.3.2 farver_2.1.0
## [45] pkgconfig_2.0.3 reshape_0.8.8
## [47] XML_3.99-0.9 sass_0.4.0
## [49] dbplyr_2.1.1 utf8_1.2.2
## [51] labeling_0.4.2 tidyselect_1.1.2
## [53] rlang_1.0.2 AnnotationDbi_1.56.2
## [55] munsell_0.5.0 tools_4.1.2
## [57] cachem_1.0.6 cli_3.2.0
## [59] generics_0.1.2 RSQLite_2.2.10
## [61] evaluate_0.15 stringr_1.4.0
## [63] fastmap_1.1.0 yaml_2.3.5
## [65] bit64_4.0.5 beanplot_1.2
## [67] scrime_1.3.5 purrr_0.3.4
## [69] KEGGREST_1.34.0 nlme_3.1-155
## [71] doRNG_1.8.2 sparseMatrixStats_1.6.0
## [73] nor1mix_1.3-0 xml2_1.3.3
## [75] biomaRt_2.50.3 compiler_4.1.2
## [77] filelock_1.0.2 curl_4.3.2
## [79] png_0.1-7 tibble_3.1.6
## [81] bslib_0.3.1 stringi_1.7.6
## [83] highr_0.9 GenomicFeatures_1.46.5
## [85] lattice_0.20-45 multtest_2.50.0
## [87] vctrs_0.3.8 pillar_1.7.0
## [89] lifecycle_1.0.1 rhdf5filters_1.6.0
## [91] BiocManager_1.30.16 jquerylib_0.1.4
## [93] data.table_1.14.2 bitops_1.0-7
## [95] rtracklayer_1.54.0 R6_2.5.1
## [97] BiocIO_1.4.0 bookdown_0.24
## [99] codetools_0.2-18 MASS_7.3-55
## [101] assertthat_0.2.1 openssl_2.0.0
## [103] rjson_0.2.21 withr_2.5.0
## [105] GenomicAlignments_1.30.0 Rsamtools_2.10.0
## [107] GenomeInfoDbData_1.2.7 mgcv_1.8-39
## [109] hms_1.1.1 quadprog_1.5-8
## [111] grid_4.1.2 tidyr_1.2.0
## [113] base64_2.0 rmarkdown_2.13
## [115] DelayedMatrixStats_1.16.0 illuminaio_0.36.0
## [117] restfulr_0.0.13
Rahmani, Elior, Liat Shenhav, Regev Schweiger, Paul Yousefi, Karen Huen, Brenda Eskenazi, Celeste Eng, et al. 2017. “Genome-Wide Methylation Data Mirror Ancestry Information.” Epigenetics & Chromatin 10 (1): 1. https://doi.org/10.1186/s13072-016-0108-y.