library(methodical)
library(HDF5Array)
methodical
includes a method for identifying genomic regions where DNA methylation is
strongly correlated with the transcriptional activity of nearby transcription start sites. We refer to these regions
as TSS-proximal methylation-controlled regulatory regions - TMRs for short. This vignette walks through
the identification of TMRs for the TUBB6 gene in normal prostate samples. The process is easily
scaled-up to search for TMRs associated with a large number of TSS simultaneously using parallel computing.
We start by loading the objects we need for this analysis: The location of the TUBB6 TSS (based on the ENST00000591909 isoform), a RangedSummarizedExperiment object housing methylation values for all CpGs within +/- 5,000 base pairs of the TSS and normalized counts for the TUBB6 transcript.
# Installing Methodical
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("methodical")
# Load required objects
data("tubb6_tss", package = "methodical")
data("tubb6_meth_rse", package = "methodical"); tubb6_meth_rse <- eval(tubb6_meth_rse)
tubb6_meth_rse <- eval(tubb6_meth_rse)
data("tubb6_transcript_counts", package = "methodical")
The first step in the identification of TMRs is to calculate the correlation values between methylation
of CpG sites close to a TSS and the expression of the associated transcript. We do this using the
calculateMethSiteTranscriptCors
function which takes a RangedSummarizedExperiment with DNA
methylation data, a GRanges object with TSS of interest and a table with counts for the transcripts
associated with each TSS. We can choose the maximum distance of CpGs we want to consider upstream
and downstream of the TSS. Here we use 5,000 bp upstream and downstream.
calculateMethSiteTranscriptCors
returns a list of tables with the correlation results
for each TSS, which is just a single TSS in our case. Each table has an attribute tss_range
which is a GRanges object with the location of the associated TSS.
# Calculate correlation values between methylation values and transcript values for TUBB6
cpg_meth_transcript_cors <- calculateMethSiteTranscriptCors(meth_rse = tubb6_meth_rse,
transcript_expression_table = tubb6_transcript_counts, tss_gr = tubb6_tss,
expand_upstream = 5000, expand_downstream = 5000, cor_method = "spearman")
#> There are 1 genes/transcripts in common between tss_gr and transcript_expression_table
#> Using spearman correlation method
#> Calculating correlations for chunk 1 of 1
# Since cpg_meth_transcript_cors is just a list of with 1 table, we'll extract this table
# from the list
tubb6_cpg_meth_transcript_cors <- cpg_meth_transcript_cors$ENST00000591909
# Take a look at the results
head(tubb6_cpg_meth_transcript_cors)
#> meth_site transcript_name cor p_val distance_to_tss
#> 1 chr18:12303394 ENST00000591909 0.055177462 0.53943969 -4840
#> 2 chr18:12303408 ENST00000591909 0.208891300 0.01890552 -4826
#> 3 chr18:12303428 ENST00000591909 0.043386526 0.62953167 -4806
#> 4 chr18:12303472 ENST00000591909 0.050422445 0.57499756 -4762
#> 5 chr18:12303480 ENST00000591909 0.005252191 0.95345508 -4754
#> 6 chr18:12303484 ENST00000591909 -0.052151327 0.56194303 -4750
# Extract the location of the TSS from the results
attributes(tubb6_cpg_meth_transcript_cors)$tss_range
#> GRanges object with 1 range and 6 metadata columns:
#> seqnames ranges strand | gene_id gene_name
#> <Rle> <IRanges> <Rle> | <character> <character>
#> ENST00000591909 chr18 12308234 + | ENSG00000176014.13 TUBB6
#> hgnc_id gene_type transcript_id transcript_type
#> <character> <character> <character> <character>
#> ENST00000591909 HGNC:20776 protein_coding ENST00000591909 protein_coding
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
We’ll next create a plot of the correlation values using plotMethSiteCorCoefs
so that we can
visually inspect if there are any interesting patterns. On the x-axis we can show the chromosomal
coordinates of the CpG sites or alternatively their distance to the TSS if we provide a
GRanges object with the TSS location to the argument reference_region
.
# Plot methylation-transcription correlation values showing chromosomal coordinates of CpG sites.
tubb6_correlation_plot_chrom_coords <- plotMethSiteCorCoefs(tubb6_cpg_meth_transcript_cors,
ylabel = "Spearman Correlation", value_colours = "set2") +
geom_hline(yintercept = 0, linetype = "dashed")
print(tubb6_correlation_plot_chrom_coords)
# Plot methylation-transcription correlation values showing distance of CpG sites to TSS.
tubb6_correlation_plot_tss_dist <- plotMethSiteCorCoefs(tubb6_cpg_meth_transcript_cors,
ylabel = "Spearman Correlation", value_colours = "set2",
reference_tss = attributes(tubb6_cpg_meth_transcript_cors)$tss_range) +
geom_hline(yintercept = 0, linetype = "dashed")
print(tubb6_correlation_plot_tss_dist)
We’ll now identify TMRs using the correlation values for CpG methylation and the TUBB6 transcript counts and the statistical significance of the correlations. This is achieved by using the correlation values and their associated p-values to calculate a Methodical score which is the the log10 of the p-value multiplied by the opposite if the sign of the correlation (i.e. -1 for a positive correlation and 1 for a negative correlation) Thus, negative correlations will have a negative Methodical score and positive correlations will have a positive Methodical score, with the significance of the correlations determining their magnitude.
We’ll show the Methodical scores for the TUBB6 CpG correlation values using the plotMethodicalScores
function.
# Plot methodical scores for CpGs near TUBB6 TSS
tubb6_methodical_scores_plot <- plotMethodicalScores(
meth_site_values = tubb6_cpg_meth_transcript_cors,
p_value_threshold = 0.005, smooth_scores = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed")
print(tubb6_methodical_scores_plot)
We smooth the Methodical scores using an exponential moving average to avoid single CpGs having too much influence during the identification of TMRs. This smoothing involves taking the mean methodical score of a central CpG and an equal number of CpGs upstream and downstream of it, with weights that decay exponentially moving away from the central CpG.
We set the smooth_scores
parameter to TRUE to add a curve going through the smoothed values to the plot.
The arguments offset_length
and smoothing_factor
control the number of CpGs considered upstream
and downstream of the central CpG and the exponential decay rate, respectively.
# Smooth Methodical scores
tubb6_smoothed_methodical_scores_plot <- plotMethodicalScores(
meth_site_values = tubb6_cpg_meth_transcript_cors, p_value_threshold = 0.005,
smooth_scores = TRUE, smoothed_curve_colour = "hotpink2", offset_length = 10,
smoothing_factor = 0.75) +
geom_hline(yintercept = 0, linetype = "dashed")
print(tubb6_smoothed_methodical_scores_plot)
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).
We use two significance thresholds to identify TMRs: one for TMRs where DNA methylation is negatively associated with transcription (negative TMRs) and another for TMRs where DNA methylation is positively associated with transcription (positive TMRs). These thresholds are defined with a p-value which is converted into negative and positive Methodical scores thresholds. We’ll use a p-value of 0.005 for the thresholds here which results in TMR thresholds of 2.30 and -2.30 for positive and negative TMRs respectively. We can then visualize if the smoothed Methodical scores breach these thresholds anywhere and identify TMRs.
# Add significance thresholds to plot
tubb6_smoothed_methodical_scores_plot <- plotMethodicalScores(meth_site_values =
tubb6_cpg_meth_transcript_cors, p_value_threshold = 0.005, smooth_scores = TRUE,
smoothed_curve_colour = "hotpink2") +
geom_hline(yintercept = 0, linetype = "dashed")
print(tubb6_smoothed_methodical_scores_plot)
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).
We can see two regions where the Methodical scores breach the thresholds: one region around 12,305,000 where the positive threshold is breached and another just before 12,307,500 where the negative threshold is breached. Thus, we should find one negative and one positive TMR for TUBB6.
The calculation of Methodical scores, their smoothing and identification of TMRs is all
done using the findTMRs
function which takes the correlation results from calculateMethSiteTranscriptCors
as input. We can set the p-value threshold as well as the parameters for the smoothing,
offset_length
and smoothing_factor
, which behave the same as they do in plotMethodicalScores
.
We can also specify the minimum number of CpG sites a TMR contain using min_meth_sites
and merge TMRs
with the same direction within a minimum distance using min_gapwidth
. There are no close TMRs
with the same direction for TUBB6 so this argument has no effect, but it can prove useful since it would
generally be desirable to have one large TMR rather than several shorter TMRs occurring in the same region
since the intervening CpGs will likely also display strong correlations.
findTMRs
returns a GRanges with metadata columns giving the TMR direction, the name of the TMRs,
the number of CpG sites they overlap (meth_site_count), their distance to the TSS and the location of
the TSS as a character which can be converted to a GRanges by calling GRanges
on it.
# Identify TMRs for TUBB6
tubb6_tmrs <- findTMRs(correlation_df = tubb6_cpg_meth_transcript_cors, offset_length = 10,
smoothing_factor = 0.75, min_meth_sites = 5, min_gapwidth = 150)
print(tubb6_tmrs)
#> GRanges object with 2 ranges and 11 metadata columns:
#> seqnames ranges strand | direction tmr_name
#> <Rle> <IRanges> <Rle> | <factor> <character>
#> [1] chr18 12304719-12305243 * | Positive ENST00000591909_tmr_1
#> [2] chr18 12306668-12307631 * | Negative ENST00000591909_tmr_2
#> meth_site_count distance_to_tss tss_location gene_id
#> <integer> <numeric> <character> <character>
#> [1] 13 -2991 chr18:12308234:+ ENSG00000176014.13
#> [2] 38 -603 chr18:12308234:+ ENSG00000176014.13
#> gene_name hgnc_id gene_type transcript_id transcript_type
#> <character> <character> <character> <character> <character>
#> [1] TUBB6 HGNC:20776 protein_coding ENST00000591909 protein_coding
#> [2] TUBB6 HGNC:20776 protein_coding ENST00000591909 protein_coding
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can add these TMRs to either the plot of correlation values or Methodical scores using
plotTMRs
to visualize where they are located in relation to these values. We’ll
plot the TMRs on the Methodical scores first and then the correlation values. As with
plotMethSiteCorCoefs
, we can add a reference_positive so that we can see where the TMRs
are located relative to the TSS, provided the plot used the same reference_position. We
can clearly see that the TMRs overlap regions where many CpG sites display strong
correlation values between their methylation and expression of TUBB6.
# Show location of TMRs on Methodical scores plot
plotTMRs(meth_site_plot = tubb6_smoothed_methodical_scores_plot, tmrs_gr = tubb6_tmrs)
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).
# Show location of TMRs on correlation value plot
plotTMRs(meth_site_plot = tubb6_correlation_plot_tss_dist, tmrs_gr = tubb6_tmrs,
reference_tss = GRanges(tubb6_tmrs$tss_location[1]))
We can see that the negative TMR is centred around 1,000 base pairs upstream of the TSS and that the positive TMR is centred around 3,000 base pairs upstream.
sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] HDF5Array_1.32.0 DelayedArray_0.30.0
#> [3] SparseArray_1.4.0 S4Arrays_1.4.0
#> [5] abind_1.4-5 Matrix_1.7-0
#> [7] rtracklayer_1.64.0 AnnotationHub_3.12.0
#> [9] BiocFileCache_2.12.0 dbplyr_2.5.0
#> [11] rhdf5_2.48.0 TumourMethData_1.1.0
#> [13] DESeq2_1.44.0 methodical_1.0.0
#> [15] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [17] MatrixGenerics_1.16.0 matrixStats_1.3.0
#> [19] ggplot2_3.5.1 GenomicRanges_1.56.0
#> [21] GenomeInfoDb_1.40.0 IRanges_2.38.0
#> [23] S4Vectors_0.42.0 BiocGenerics_0.50.0
#> [25] BiocStyle_2.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.2 bitops_1.0-7 rlang_1.1.3
#> [4] magrittr_2.0.3 compiler_4.4.0 RSQLite_2.3.6
#> [7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
#> [10] crayon_1.5.2 fastmap_1.1.1 magick_2.8.3
#> [13] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
#> [16] Rsamtools_2.20.0 rmarkdown_2.26 UCSC.utils_1.0.0
#> [19] tinytex_0.50 purrr_1.0.2 bit_4.0.5
#> [22] xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8
#> [25] jsonlite_1.8.8 blob_1.2.4 rhdf5filters_1.16.0
#> [28] Rhdf5lib_1.26.0 BiocParallel_1.38.0 parallel_4.4.0
#> [31] R6_2.5.1 bslib_0.7.0 jquerylib_0.1.4
#> [34] RcppRoll_0.3.0 iterators_1.0.14 Rcpp_1.0.12
#> [37] bookdown_0.39 knitr_1.46 R.utils_2.12.3
#> [40] tidyselect_1.2.1 yaml_2.3.8 codetools_0.2-20
#> [43] curl_5.2.1 lattice_0.22-6 tibble_3.2.1
#> [46] withr_3.0.0 KEGGREST_1.44.0 evaluate_0.23
#> [49] ExperimentHub_2.12.0 Biostrings_2.72.0 pillar_1.9.0
#> [52] BiocManager_1.30.22 filelock_1.0.3 foreach_1.5.2
#> [55] generics_0.1.3 RCurl_1.98-1.14 BiocVersion_3.19.1
#> [58] munsell_0.5.1 scales_1.3.0 glue_1.7.0
#> [61] tools_4.4.0 BiocIO_1.14.0 data.table_1.15.4
#> [64] GenomicAlignments_1.40.0 locfit_1.5-9.9 XML_3.99-0.16.1
#> [67] grid_4.4.0 AnnotationDbi_1.66.0 colorspace_2.1-0
#> [70] GenomeInfoDbData_1.2.12 restfulr_0.0.15 cli_3.6.2
#> [73] rappdirs_0.3.3 fansi_1.0.6 dplyr_1.1.4
#> [76] gtable_0.3.5 R.methodsS3_1.8.2 sass_0.4.9
#> [79] digest_0.6.35 farver_2.1.1 rjson_0.2.21
#> [82] memoise_2.0.1 htmltools_0.5.8.1 R.oo_1.26.0
#> [85] lifecycle_1.0.4 httr_1.4.7 mime_0.12
#> [88] bit64_4.0.5