This package integrates colocalization probabilities from colocalization analysis with transcriptome-wide association study (TWAS) scan summary statistics to implicate genes that may be biologically relevant to a complex trait. Given gene set annotations, this package can estimate gene set enrichment using posterior probabilities from the TWAS-colocalization integration step.
INTACT 1.2.0
To install this package, run the following code chunk (in R 4.2 or later):
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("INTACT")
For a comprehensive description of the probabilistic framework behind INTACT please refer to:
Okamoto, Jeffrey, et al. “Probabilistic integration of transcriptome-wide association studies and colocalization analysis identifies key molecular pathways of complex traits.” The American Journal of Human Genetics 110.1 (2023): 44-57.
Integrative genetic association methods have shown great promise in post-GWAS
(genome-wide association study) analyses, in which one of the most challenging
tasks is identifying putative causal genes and uncovering molecular mechanisms
of complex traits. Prevailing computational approaches include
transcriptome-wide association studies (TWASs) and colocalization analysis.
TWASs aim to assess the correlation between predicted gene expression of a
target gene and a GWAS trait. Common TWAS output includes gene-level
z-statistics. Colocalization analysis attempts to determine whether genetic
variants that are causal for a molecular phenotype (such as gene expression)
overlap with variants that are causal for a GWAS trait. Colocalization
analysis output often includes gene-level colocalization probabilities,
which provide evidence regarding whether there exists a colocalized variant
for the expression of a target gene and GWAS trait. Recent studies suggest that
TWASs and colocalization analysis are individually imperfect, but their joint
usage can yield robust and powerful inference results. INTACT is a computational
framework to integrate probabilistic evidence from these distinct types of
analyses and implicate putative causal genes. This procedure is flexible and can
work with a wide range of existing integrative analysis approaches. It has the
unique ability to quantify the uncertainty of implicated genes, enabling
rigorous control of false-positive discoveries. INTACT-GSE is an efficient
algorithm for gene set enrichment analysis based on the integrated probabilistic
evidence. This package is intended for performing integrative genetic
association analyses in tandem with other Bioconductor packages such as
biomaRt or GO.db, which could be used to obtain gene set annotations for
gene set enrichment analysis.
To illustrate the functionality of the INTACT package, we include a
simulated data set simdat. See the methodology reference for an
explanation of the simulation design. The data are organized as a 1197-row by
3-column data frame, where rows correspond to genes, the gene column provides
gene names, the GLCP column provides gene-level colocalization probabilities,
and the TWAS_z column provides TWAS scan z-scores.
Additionally, we include a simulated gene set list gene_set_list,
which contains two gene sets. The first gene set has 503 gene members and is
significantly enriched among the genes included in simdat, based on the
probabilistic INTACT output. The second gene set has 200 gene members and is
not enriched among the simdat genes. We include gene_set_list to
show how to perform gene set enrichment estimation using INTACT-GSE.
The first main functionality of this package is integrating results from a transcriptome-wide association study (TWAS) scan and a colocalization analysis. The TWAS scan results must be in the form of gene-level z-scores, and the colocalization analysis results should be in the form of gene-level colocalization probabilities. Most popular TWAS and colocalization methods provide these as output.
Below, we include an example of how to use INTACT to integrate TWAS scan and
colocalization results for the simulated data set simdat.
library(INTACT)
##
## Attaching package: 'INTACT'
## The following object is masked from 'package:stats':
##
## step
data(simdat)
rst <- INTACT::intact(GLCP_vec = simdat$GLCP,
                      z_vec = simdat$TWAS_z)
The intact function takes a vector of gene-level colocalization
probabilities GLCP_vec and a vector of TWAS scan z-scores z_vec. It outputs
gene-level posterior probabilities of putative causality. The example included
above uses default settings for the prior function and truncation threshold
\(t\) (prior_fun = linear and t = 0.05). Three additional prior functions are
implemented in the INTACT software: expit, step, and hybrid. The expit and
hybrid options have an additional curvature shrinkage parameter D, with a
default value of 0.1. The default truncation threshold for the step prior
function is 0.5, while the default is 0.05 for all other prior functions.
Below are three additional examples of how to integrate the TWAS z-scores and
colocalization probabilities from the simulated data using different prior
function, truncation threshold, and curvature shrinkage settings:
rst1 <- INTACT::intact(GLCP_vec = simdat$GLCP,
                       prior_fun = INTACT::expit,
                       z_vec = simdat$TWAS_z,
                       t = 0.02, D = 0.09)
rst2 <- INTACT::intact(GLCP_vec = simdat$GLCP,
                       prior_fun = INTACT::step,
                       z_vec = simdat$TWAS_z,
                       t = 0.49)
rst3 <- INTACT::intact(GLCP_vec = simdat$GLCP,
                       prior_fun = INTACT::hybrid,
                       z_vec = simdat$TWAS_z,
                       t = 0.49, D = 0.05)
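To make the role of the truncation threshold concrete, here is a minimal sketch of what a truncated linear prior and a truncated step prior could look like. It assumes that truncation maps colocalization probabilities below t to zero. The names linear_sketch and step_sketch are hypothetical helpers written for this illustration; they are not the INTACT implementations, which may differ in detail (see the methodology reference).

```r
# Illustrative only: hypothetical stand-ins, NOT the INTACT::linear / INTACT::step source.
# Assumption: GLCP values below the truncation threshold t get zero prior weight.
linear_sketch <- function(GLCP, t = 0.05) {
  ifelse(GLCP < t, 0, GLCP)   # keep the GLCP itself once it clears the threshold
}

step_sketch <- function(GLCP, t = 0.5) {
  ifelse(GLCP < t, 0, 1)      # all-or-nothing weight at the threshold
}

# Made-up colocalization probabilities for five genes.
toy_glcp <- c(0.01, 0.04, 0.30, 0.60, 0.95)
linear_weights <- linear_sketch(toy_glcp)
step_weights <- step_sketch(toy_glcp)

print(data.frame(GLCP = toy_glcp, linear = linear_weights, step = step_weights))
```

Under these assumptions, a gene with weak colocalization evidence receives zero weight under both priors, while the step prior additionally discards moderate evidence that the linear prior would retain.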
If the user wishes to specify TWAS Bayes factors instead of z-scores, they can
do so through the argument twas_BFs. The Bayes factors should be a numeric
vector with genes in the same order as the colocalization probabilities vector.
If the user wishes to specify gene-specific TWAS priors, they can do so through
the argument twas_priors. If no input is supplied, INTACT computes a
scalar prior using the TWAS data (see the methodology reference for more
details).
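Because the Bayes factors must follow the same gene order as the colocalization probabilities, it can help to align the two vectors by gene name before calling intact. A minimal base R sketch, using made-up gene names and values:

```r
# Toy values for illustration; gene names and numbers are made up.
glcp <- c(geneA = 0.90, geneB = 0.10, geneC = 0.65)
bf   <- c(geneC = 12.0, geneA = 150.0, geneB = 0.8)

# Reorder the Bayes factors so that they follow the gene order of glcp.
idx <- match(names(glcp), names(bf))
stopifnot(!anyNA(idx))   # every gene in glcp needs a Bayes factor
bf_aligned <- bf[idx]

print(bf_aligned)
```

The aligned vector could then be supplied as twas_BFs alongside GLCP_vec = glcp.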
We provide an additional function fdr_rst that is useful if the user wishes
to perform Bayesian FDR control on the INTACT output. An example of how to
apply this function at a target control level of 0.05 is shown below.
fdr_example <- fdr_rst(rst1, alpha = 0.05)
head(fdr_example)
## posterior sig
## 1 1 TRUE
## 2 1 TRUE
## 3 1 TRUE
## 4 1 TRUE
## 5 1 TRUE
## 6 1 TRUE
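For intuition, Bayesian FDR control on posterior probabilities is commonly carried out by ranking genes by local false discovery rate (one minus the posterior) and rejecting the largest set whose average local false discovery rate stays at or below the target level. The sketch below implements that generic procedure in base R. Here bayes_fdr_sketch is a hypothetical helper written for this illustration; fdr_rst may differ in details such as tie handling.

```r
# Hypothetical helper, not part of INTACT: generic Bayesian FDR control.
bayes_fdr_sketch <- function(posterior, alpha = 0.05) {
  lfdr <- 1 - posterior                 # local false discovery rate per gene
  ord <- order(lfdr)                    # most confident genes first
  # Expected FDR if the top k genes are rejected, for k = 1, ..., n.
  running_fdr <- cumsum(lfdr[ord]) / seq_along(ord)
  # running_fdr is non-decreasing, so counting suffices to find the largest k.
  n_reject <- sum(running_fdr <= alpha)
  sig <- rep(FALSE, length(posterior))
  sig[ord[seq_len(n_reject)]] <- TRUE
  data.frame(posterior = posterior, sig = sig)
}

# Made-up posterior probabilities for five genes.
toy_posterior <- c(0.99, 0.97, 0.60, 0.10, 0.95)
toy_fdr <- bayes_fdr_sketch(toy_posterior, alpha = 0.05)
print(toy_fdr)
```

In this toy input, the three genes with posteriors of 0.99, 0.97, and 0.95 are flagged, because their average local false discovery rate is 0.03, whereas adding the gene with posterior 0.60 would raise the average above 0.05.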
The INTACT package provides the intactGSE function to perform gene
set enrichment estimation and inference using integrated TWAS scan z-scores and
colocalization probabilities. This function requires a data frame
gene_data containing gene names and corresponding colocalization
probabilities and TWAS z-scores for each gene. Column names should be “gene”,
“GLCP”, and “TWAS_z”. If the user wishes to specify TWAS Bayes factors instead
of z-scores, use the column name “TWAS_BFs”. If the user wishes to specify
gene-specific TWAS priors, use the column name “TWAS_priors”.
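TWAS and colocalization tools usually write their results to separate tables, so gene_data often has to be assembled by hand. A minimal sketch, assuming two hypothetical result tables keyed by a shared gene identifier (all names and values below are made up):

```r
# Hypothetical per-method result tables; gene IDs and values are made up.
twas_tbl <- data.frame(gene = c("geneA", "geneB", "geneC"),
                       TWAS_z = c(4.2, -0.3, 2.5))
coloc_tbl <- data.frame(gene = c("geneC", "geneA", "geneD"),
                        GLCP = c(0.65, 0.90, 0.05))

# Inner join: keep only genes that have both a z-score and a GLCP.
gene_data <- merge(coloc_tbl, twas_tbl, by = "gene")
gene_data <- gene_data[, c("gene", "GLCP", "TWAS_z")]

# Basic sanity checks before passing the data frame on.
stopifnot(!anyDuplicated(gene_data$gene),
          all(gene_data$GLCP >= 0 & gene_data$GLCP <= 1))

print(gene_data)
```

Genes present in only one of the two tables (geneB and geneD here) are dropped by the inner join, so it is worth checking how many genes survive the merge.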
In addition to gene_data, the user must provide a list of gene sets
gene_sets. The format of gene_sets must match the included example
gene_set_list: it must be a named list of gene sets for which enrichment is to
be estimated. List items should be character vectors of gene IDs. The gene ID
format should match the gene column in gene_data.
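Gene set annotations are often available as a long table with one row per gene-term pair. One way to turn such a table into the named list format described above is base R's split. The annotation table, identifiers, and analysed_genes vector below are hypothetical:

```r
# Hypothetical long-format annotation table (one row per gene-term pair).
annot <- data.frame(
  gene = c("gene1", "gene2", "gene3", "gene2", "gene4"),
  term = c("set_A", "set_A", "set_A", "set_B", "set_B"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of gene IDs per gene set.
gene_sets <- split(annot$gene, annot$term)

# Check the structure: a named list of character vectors.
stopifnot(is.list(gene_sets),
          !is.null(names(gene_sets)),
          all(vapply(gene_sets, is.character, logical(1))))

# Fraction of each set's members that also appear among the analysed genes
# (stand-in for the gene column of gene_data).
analysed_genes <- c("gene1", "gene2", "gene3")
coverage <- vapply(gene_sets,
                   function(ids) mean(ids %in% analysed_genes),
                   numeric(1))
print(coverage)
```

A low coverage value is a hint that the gene ID format of the annotation does not match the gene column, or that the set is poorly represented in the analysed data.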
The user can specify the same prior-related arguments as in the intact
function, including prior_fun, t, and D (only when the
prior function is specified as expit or hybrid).
The user can specify the method by which the standard error of the enrichment
estimate is computed. Options include numerical differentiation of the score
function (NDS, the default), a profile likelihood approach
(profile_likelihood), and bootstrapping (bootstrap). For hypothesis
testing, the user can specify a significance threshold, which is 0.05 by
default.
An example of how to estimate gene set enrichment in the gene sets provided in
gene_set_list (using default settings) is shown below:
data(gene_set_list)
INTACT::intactGSE(gene_data = simdat,gene_sets = gene_set_list)
## Gene_Set Estimate SE z pval CI_Leftlim
## 1 gene_set1 1.01981520 0.1808068 5.64035913 1.696958e-08 0.6654404
## 2 gene_set2 -0.01650172 0.2314519 -0.07129655 9.431617e-01 -0.4701391
## CI_Rightlim CONVERGED
## 1 1.3741900 1
## 2 0.4371357 1
The output of intactGSE includes one row per gene set and eight columns:
the gene set name, the enrichment parameter \(\alpha_1\) estimate, the standard
error of that estimate, the z-score, the p-value, the left and right confidence
interval limits, and the convergence flag (CONVERGED = 1 if the enrichment
estimation algorithm converged, and CONVERGED = 0 otherwise). Some data sets
are not informative for gene set enrichment estimation; in such cases, the
algorithm will fail to converge. We emphasize that failure of the algorithm to
converge does not provide information regarding the enrichment (or lack
thereof) of a given gene set.
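The z, pval, and confidence limit columns in the default-settings output shown earlier are consistent with a standard Wald-type calculation from the Estimate and SE columns. The following base R check reproduces them for gene_set1 from the printed values. It is a sanity check on the reported numbers under that assumption, not the package's internal code:

```r
# Values copied from the printed intactGSE output for gene_set1.
est <- 1.01981520   # Estimate
se  <- 0.1808068    # SE

z_stat <- est / se                              # Wald z-score
p_val  <- 2 * pnorm(-abs(z_stat))               # two-sided normal p-value
ci     <- est + c(-1, 1) * qnorm(0.975) * se    # 95% Wald interval

print(c(z = z_stat, pval = p_val, CI_Leftlim = ci[1], CI_Rightlim = ci[2]))
```

Small discrepancies in the last digits are expected, because the inputs are the rounded values shown in the printed output.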
Finally, we include three additional examples of how to estimate enrichment for the same data sets using non-default prior parameters:
INTACT::intactGSE(gene_data = simdat, prior_fun = INTACT::step,
                  t = 0.45, gene_sets = gene_set_list)
## Gene_Set Estimate SE z pval CI_Leftlim CI_Rightlim
## 1 gene_set1 0.9897163 0.1828463 5.4128322 6.203562e-08 0.6313442 1.3480884
## 2 gene_set2 -0.1320656 0.2404421 -0.5492617 5.828259e-01 -0.6033235 0.3391922
## CONVERGED
## 1 1
## 2 1
INTACT::intactGSE(gene_data = simdat, prior_fun = INTACT::expit,
                  t = 0.08, D = 0.08, gene_sets = gene_set_list)
## Gene_Set Estimate SE z pval CI_Leftlim CI_Rightlim
## 1 gene_set1 1.0198388 0.1818588 5.6078591 2.048446e-08 0.6634020 1.3762756
## 2 gene_set2 -0.0424509 0.2348399 -0.1807652 8.565518e-01 -0.5027287 0.4178269
## CONVERGED
## 1 1
## 2 1
INTACT::intactGSE(gene_data = simdat, prior_fun = INTACT::hybrid,
                  t = 0.08, D = 0.08, gene_sets = gene_set_list)
## Gene_Set Estimate SE z pval CI_Leftlim
## 1 gene_set1 1.02016151 0.1822905 5.5963496 2.189120e-08 0.6628786
## 2 gene_set2 -0.04068458 0.2351970 -0.1729809 8.626665e-01 -0.5016622
## CI_Rightlim CONVERGED
## 1 1.3774444 1
## 2 0.4202931 1
Session information is included below:
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] INTACT_1.2.0 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.33 R6_2.5.1 numDeriv_2016.8-1.1
## [4] bookdown_0.36 fastmap_1.1.1 xfun_0.40
## [7] SQUAREM_2021.1 cachem_1.0.8 knitr_1.44
## [10] htmltools_0.5.6.1 rmarkdown_2.25 cli_3.6.1
## [13] sass_0.4.7 jquerylib_0.1.4 compiler_4.3.1
## [16] tools_4.3.1 bdsmatrix_1.3-6 evaluate_0.22
## [19] bslib_0.5.1 yaml_2.3.7 BiocManager_1.30.22
## [22] jsonlite_1.8.7 rlang_1.1.1