1 Introduction

methcp is a differentially methylated region (DMR) detecting method for whole-genome bisulfite sequencing (WGBS) data. It is applicable for a wide range of experimental designs. In this document, we provide examples for two-group comparisons and time-course analysis.

methcp identifies DMRs based on change point detection, which naturally segments the genome and provides region-level differential analysis. We direct the interested reader to our paper here

Load packages.

library(bsseq)
library(MethCP)
## Warning: replacing previous import 'matrixStats::rowMedians' by
## 'Biobase::rowMedians' when loading 'DSS'
## Warning: replacing previous import 'matrixStats::anyMissing' by
## 'Biobase::anyMissing' when loading 'DSS'

2 Two-group comparison

In this section, we use the CpG methylation data from an Arabidopsis dataset available from GEO with accession number GSM954584. We take a subset of chromosome 1 and 2 from each of the samples and perform differential analysis between the wild-type plants and the H2A.Z mutant plants, which we refer to as treatment and control in the rest of the document.

2.1 Read data

We use a well-developed Bioconductor package bsseq to load the store the raw data. Below is an example of how to read raw counts using bsseq.

We provide a helper function createBsseqObject to create a bsseq object when the data for each sample is stored in a separate text file.

For more operations regarding the bsseq object, or to create a bsseq object customized to your file format, please refer to their User’s Guide.

# The dataset is consist of 6 samples. 3 samples are H2A.Z mutant 
# plants, and 3 samples are controls.
sample_names <- c(
    paste0("control", seq_len(3)), 
    paste0("treatment", seq_len(3))
)

# Get the vector of file path and names 
raw_files <- system.file(
    "extdata", paste0(sample_names, ".txt"), package = "MethCP")

# load the data
bs_object <- createBsseqObject(
    files = raw_files, sample_names = sample_names, 
    chr_col = 'Chr', pos_col = 'Pos', m_col = "M", cov_col = 'Cov')

The bsseq object.

bs_object
## An object of type 'BSseq' with
##   82103 methylation loci
##   6 samples
## has not been smoothed
## All assays are in-memory

Header of the raw file for one of the samples.

dt <- read.table(
            raw_files[1], stringsAsFactors = FALSE, header = TRUE)
head(dt)
##    Chr Pos M Cov
## 1 chr1 109 3   3
## 2 chr1 110 4   4
## 3 chr1 115 3   3
## 4 chr1 116 4   4
## 5 chr1 161 0   1
## 6 chr1 162 1   1

2.2 Calculate the statistics

We calculate the per-cytosine statistics using two different test DSS and methylKit using the function calcLociStat. This function returns a MethCP object that is used for the future segmentation step. We allow parallelized computing when there are multiple chromosomes in the dataset.

# the sample names of the two groups to compare. They should be subsets of the 
# sample names provided when creating the `bsseq` objects.
group1 <- paste0("control", seq_len(3))
group2 <- paste0("treatment", seq_len(3))

# Below we calculate the per-cytosine statistics using two different 
# test `DSS` and `methylKit`. The users may pick one of the two for their
#  application.
# obj_DSS <- calcLociStat(bs_object, group1, group2, test = "DSS")
obj_methylKit <- calcLociStat(
    bs_object, group1, group2, test = "methylKit")
obj_methylKit
## MethCP object with 2 chromosomes, 81936 methylation loci
## test: methylKit
## group1: control1 control2 control3
## group2: treatment1 treatment2 treatment3
## has not been segmented

In cases the user wants to use their pre-calculated test statistics for experiments other than two-group comparison and time course data, we use the calculated statistics and create a MethCP object.

data <- data.frame(
    chr = rep("Chr01", 5),
    pos = c(2, 5, 9, 10, 18),
    effect.size = c(1,-1, NA, 9, Inf),
    pvals = c(0, 0.1, 0.9, NA, 0.02))
obj <- MethCPFromStat(
    data, test.name="myTest",
    pvals.field = "pvals",
    effect.size.field="effect.size",
    seqnames.field="chr",
    pos.field="pos"
)
## Filtering out NAs and infinite values. 
## Cytosine counts before filtering: 5. 
## Cytosine counts after filtering: 3.
obj
## MethCP object with 1 chromosomes, 3 methylation loci
## test: myTest
## group1: notApplicable
## group2: notApplicable
## has not been segmented

2.3 Segmentation

segmentMethCP performs segmentation on a MethCP object. We allow parallelized computing when there are multiple chromosomes in the dataset. Different from calcLociStat function in the previous section, we do not put any constraint on the number of cores used. Please see the documentation for adjusting the parameters used in the segmentation.

# obj_DSS <- segmentMethCP(
#     obj_DSS, bs_object, region.test = "weighted-coverage")

obj_methylKit <- segmentMethCP(
    obj_methylKit, bs_object, region.test = "fisher")
obj_methylKit
## MethCP object with 2 chromosomes, 81936 methylation loci
## test: methylKit
## group1: control1 control2 control3
## group2: treatment1 treatment2 treatment3
## has been segmented

2.4 Significant regions

Use function getSigRegion on a MethCP object to get the list of DMRs.

# region_DSS <- getSigRegion(obj_DSS)
# head(region_DSS)
region_methylKit <- getSigRegion(obj_methylKit)
head(region_methylKit)
##      seqnames start   end nC.valid nC mean.diff mean.cov  region.pval
## 1.2      chr1 26853 27083       10 10   -0.3126   6.8500 6.306067e-14
## 1.5      chr1 30691 30768       11 11   -0.1776   8.3788 4.028997e-05
## 1.13     chr1 47817 47966       16 16   -0.2058   7.1354 4.314515e-08
## 1.21     chr1 65562 65735       10 10   -0.5742   5.2167 0.000000e+00
## 1.38     chr1 92672 92891       10 10    0.4067   8.9667 0.000000e+00
## 1.40     chr1 94712 95021       12 12    0.1304   7.9444 1.295822e-10

3 A time-course example

MethCP is flexible for a wide variety of experimental designs. We apply MethCP on an Arabidopsis thaliana seed germination dataset available from the GEO with accession number https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94712 The data is generated from two replicates of dry seed and germinating seeds of wild-type plants (Col-0) and ros1 dml2 dml3 (rdd) triple demethylase mutant plants at 0-4 days after imbibition for 4 days (DAI). For cytosine-based statistics, we fit linear models on the methylation ratios and test the differences between the time coefficient of condition Col-0 and condition rdd.

Read the meta data.

meta_file <- system.file(
    "extdata", "meta_data.txt", package = "MethCP")
meta <- read.table(meta_file, sep = "\t", header = TRUE)

head(meta)
##   Condition TimeName Time Replicate                  SampleName
## 1     Col-0       DS    0         1   GSM2481288_mC_Col-0_DS_r1
## 2     Col-0       DS    0         2   GSM2481289_mC_Col-0_DS_r2
## 3     Col-0     0DAI    1         1 GSM2481290_mC_Col-0_0DAI_r1
## 4     Col-0     0DAI    1         2 GSM2481291_mC_Col-0_0DAI_r2
## 5     Col-0     1DAI    2         1 GSM2481292_mC_Col-0_1DAI_r1
## 6     Col-0     1DAI    2         2 GSM2481293_mC_Col-0_1DAI_r2

Read the counts data.

# Get the vector of file path and names 
raw_files <- system.file(
    "extdata", paste0(meta$SampleName, ".tsv"), package = "MethCP")

# read files
bs_object <- createBsseqObject(
    files = raw_files, sample_names = meta$SampleName, 
    chr_col = 1, pos_col = 2, m_col = 4, cov_col = 5, header = TRUE)

Apply coverage filter to make sure each loci has total coverage (summed across samples) more than 3 for each condition.

groups <- split(seq_len(nrow(meta)), meta$Condition)
coverages <- as.data.frame(getCoverage(bs_object, type = "Cov"))
filter <- rowSums(coverages[, meta$SampleName[groups[[1]]]] != 0) >= 3 &
    rowSums(coverages[, meta$SampleName[groups[[2]]]] != 0) >= 3
bs_object <- bs_object[filter, ]

Calculate the statistics. A dataframe of the meta data will be passed to function calcLociStatTimeCourse. Note that there must be columns named Condition, Time and SampleName in the dataframe.

obj <- calcLociStatTimeCourse(bs_object, meta)
obj
## MethCP object with 1 chromosomes, 26001 methylation loci
## test: TimeCourse
## group1: GSM2481288_mC_Col-0_DS_r1 GSM2481289_mC_Col-0_DS_r2 GSM2481290_mC_Col-0_0DAI_r1 GSM2481291_mC_Col-0_0DAI_r2 GSM2481292_mC_Col-0_1DAI_r1 GSM2481293_mC_Col-0_1DAI_r2 GSM2481294_mC_Col-0_2DAI_r1 GSM2481295_mC_Col-0_2DAI_r2 GSM2481296_mC_Col-0_3DAI_r1 GSM2481297_mC_Col-0_3DAI_r2 GSM2481298_mC_Col-0_4DAI_r1 GSM2481299_mC_Col-0_4DAI_r2
## group2: GSM2481300_mC_rdd_DS_r1 GSM2481301_mC_rdd_DS_r2 GSM2481302_mC_rdd_0DAI_r1 GSM2481303_mC_rdd_0DAI_r2 GSM2481304_mC_rdd_1DAI_r1 GSM2481305_mC_rdd_1DAI_r2 GSM2481306_mC_rdd_2DAI_r1 GSM2481307_mC_rdd_2DAI_r2 GSM2481308_mC_rdd_3DAI_r1 GSM2481309_mC_rdd_3DAI_r2 GSM2481310_mC_rdd_4DAI_r1 GSM2481311_mC_rdd_4DAI_r2
## has not been segmented

Segmentation.

obj <- segmentMethCP(obj, bs_object, region.test = "stouffer")

Get the DMRs.

regions <- getSigRegion(obj)
head(regions)
##     seqnames   start     end nC.valid  nC mean.diff mean.cov  region.pval
## 31         1  238367  239326       69  69   -0.1042   5.7144 0.0012861236
## 66         1  394175  395220       50  50   -0.2490   3.9292 0.0026789957
## 70         1  405983  408116       52  52   -0.1079   6.3197 0.0082183067
## 92         1  531037  532534       36  36   -0.1900   4.3507 0.0007397345
## 187        1  962256  966028      145 145   -0.1049   5.9529 0.0029463070
## 241        1 1258866 1259519       30  30   -0.1454   5.6153 0.0028433185

4 Session info

sessionInfo() 
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] MethCP_1.11.0               bsseq_1.33.0               
##  [3] SummarizedExperiment_1.27.2 Biobase_2.57.1             
##  [5] MatrixGenerics_1.9.1        matrixStats_0.62.0         
##  [7] GenomicRanges_1.49.1        GenomeInfoDb_1.33.5        
##  [9] IRanges_2.31.2              S4Vectors_0.35.1           
## [11] BiocGenerics_0.43.1         BiocStyle_2.25.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              DSS_2.45.0               
##  [3] numDeriv_2016.8-1.1       tools_4.2.1              
##  [5] bslib_0.4.0               utf8_1.2.2               
##  [7] R6_2.5.1                  HDF5Array_1.25.2         
##  [9] DBI_1.1.3                 colorspace_2.0-3         
## [11] permute_0.9-7             rhdf5filters_1.9.0       
## [13] fastseg_1.43.5            DNAcopy_1.71.0           
## [15] tidyselect_1.1.2          compiler_4.2.1           
## [17] cli_3.3.0                 DelayedArray_0.23.1      
## [19] rtracklayer_1.57.0        bookdown_0.28            
## [21] sass_0.4.2                scales_1.2.1             
## [23] mvtnorm_1.1-3             stringr_1.4.1            
## [25] digest_0.6.29             Rsamtools_2.13.4         
## [27] rmarkdown_2.15            R.utils_2.12.0           
## [29] XVector_0.37.0            pkgconfig_2.0.3          
## [31] htmltools_0.5.3           sparseMatrixStats_1.9.0  
## [33] fastmap_1.1.0             bbmle_1.0.25             
## [35] limma_3.53.6              BSgenome_1.65.2          
## [37] rlang_1.0.4               DelayedMatrixStats_1.19.0
## [39] generics_0.1.3            jquerylib_0.1.4          
## [41] BiocIO_1.7.1              jsonlite_1.8.0           
## [43] mclust_5.4.10             BiocParallel_1.31.12     
## [45] gtools_3.9.3              dplyr_1.0.9              
## [47] R.oo_1.25.0               RCurl_1.98-1.8           
## [49] magrittr_2.0.3            GenomeInfoDbData_1.2.8   
## [51] Matrix_1.4-1              fansi_1.0.3              
## [53] Rcpp_1.0.9                munsell_0.5.0            
## [55] Rhdf5lib_1.19.2           lifecycle_1.0.1          
## [57] R.methodsS3_1.8.2         stringi_1.7.8            
## [59] yaml_2.3.5                MASS_7.3-58.1            
## [61] zlibbioc_1.43.0           rhdf5_2.41.1             
## [63] plyr_1.8.7                qvalue_2.29.0            
## [65] grid_4.2.1                parallel_4.2.1           
## [67] bdsmatrix_1.3-6           crayon_1.5.1             
## [69] lattice_0.20-45           Biostrings_2.65.2        
## [71] splines_4.2.1             methylKit_1.23.0         
## [73] locfit_1.5-9.6            knitr_1.40               
## [75] pillar_1.8.1              rjson_0.2.21             
## [77] reshape2_1.4.4            codetools_0.2-18         
## [79] XML_3.99-0.10             glue_1.6.2               
## [81] evaluate_0.16             data.table_1.14.2        
## [83] BiocManager_1.30.18       vctrs_0.4.1              
## [85] purrr_0.3.4               gtable_0.3.0             
## [87] assertthat_0.2.1          cachem_1.0.6             
## [89] ggplot2_3.3.6             emdbook_1.3.12           
## [91] xfun_0.32                 restfulr_0.0.15          
## [93] coda_0.19-4               tibble_3.1.8             
## [95] GenomicAlignments_1.33.1