The epialleleR User’s Guide

21 August, 2023

Abstract

A comprehensive guide to using the epialleleR package for calling the hypermethylated epiallele frequencies from next-generation sequencing data


image/svg+xml epiallele

Introduction

The cytosine DNA methylation is an important epigenetic mechanism for regulation of gene expression. Abnormal methylation is linked to several diseases, being for example the most common molecular lesion in cancer cell.1 Multiple studies suggest that alterations in DNA methylation, despite occurring at a low mosaic level, may confer increased risk of cancer later in life.2

The cytosine methylation levels within relatively small regions of the human genome are thought to be often concordant, resulting in a limited number of distinct methylation patterns of short sequencing reads.3 Due to the cell-to-cell variations in methylation, DNA purified from tissue samples contains a mix of hyper- and hypomethylated alleles with varying ratios that depend on the genomic region and tissue type.

Unsurprisingly, when the frequencies of hypermethylated epialleles are low (e.g. 1e-02 and lower) and cytosine methylation levels are averaged and reported using conventional algorithms, the identification of such hypermethylated epialleles becomes nearly impossible. In order to increase the sensitivity of DNA methylation analysis we have developed epialleleR — an R package for calling hypermethylated variant epiallele frequencies (VEF).

Two edge cases epialleleR was designed to distinguish are presented below. While these are simplified and entirely artificial, they still give an idea of two different methylation patterns that may exist in real life, be characterised by very similar quantitative metrics (beta value, the ratio of methylated cytosines, \(C\), to total number of cytosines, \(C+T\), per genomic position, i.e. \(\beta = \frac{C}{C+T}\)), but have entirely different biological properties: non-functional scattered methylation / technical artefacts on the left, and epigenetic gene inactivation on the right. VEF values, that are essentially the ratio of methylated cytosines in hypermethylated (above threshold) reads, \(C^a\), to total number of cytosines, \(C+T\), per genomic position, i.e. \(VEF = \frac{C^a}{C+T}\), clearly separate these cases and are thought to be more useful in detection and quantification of concordant methylation events.

epialleleR is a very fast and scalable solution for analysis of data obtained by next-generation sequencing of bisulfite-treated DNA samples. The minimum requirement for the input is a Binary Alignment Map (BAM) file containing sequencing reads. These reads can be obtained from either deep or ultra-deep sequencing, using either narrowly targeted gene panels (amplicon sequencing), larger methylation capture panels, or even whole-genome approaches.

Current Features

Processing speed

Currently epialleleR runs in a single-thread mode only. Loading the BAM data is now done by means of HTSlib, therefore it is possible to speed it up using additional decompression threads (nthreads option in preprocessBam). All operations are performed using optimised C++ functions, and usually take reasonable time. Running time for complete task “BAM on disk -> CX report on disk” depends on the size of the BAM file, and the speed is usually within the range of 30-50 MB/s (or 250-400 thousand reads per second) for a single core of a relatively modern CPU (Intel(R) Core(TM) i7-7700).

Major bottlenecks (in BAM loading and preprocessing) were removed in the release v1.2, full multithreading and minor improvements are expected in the future.


Sample data

The epialleleR package includes sample data, which was obtained using targeted sequencing. The description of assays and files is given below. All the genomic coordinates for external data files are according to GRCh38 reference assembly.

Amplicon-based methylation NGS data

The samples of Human HCT116 DKO Non-Methylated (Zymo Research, cat # D5014-1), or Human HCT116 DKO Methylated (Zymo Research, cat # D5014-2) DNA,4 or their mix were bisulfite-converted, and the BRCA1 gene promoter region was amplified using four pairs of primers. Amplicons were mixed, indexed and sequenced at Illumina MiSeq system. The related files are:

Name Type Description
amplicon000meth.bam BAM a subset of reads for non-methylated DNA sample
amplicon010meth.bam BAM a subset of reads for a 1:9 mix of methylated and non-methylated DNA samples
amplicon100meth.bam BAM a subset of reads for fully methylated DNA sample
amplicon.bed BED genomic coordinates of four amplicons covering promoter area of BRCA1 gene
amplicon.vcf.gz VCF a relevant subset of sequence variations
amplicon.vcf.gz.tbi tabix tabix file for the amplicon.vcf.gz

Capture-based methylation NGS data

The tumour DNA was bisulfite-converted, fragmented and hybridized with custom-made probes covering promoter regions of 283 tumour suppressor genes (as described in 5). Libraries were sequenced using Illumina MiSeq system. The related files are:

Name Type Description
capture.bam BAM a subset of reads
capture.bed BED genomic coordinates of capture target regions
capture.vcf.gz VCF a relevant subset of sequence variations
capture.vcf.gz.tbi tabix tabix file for the capture.vcf.gz

Typical workflow

As mentioned earlier, epialleleR uses data stored in Binary Alignment Map (BAM) files as its input. It is a prerequisite that records in the BAM file are sorted by their name (and not a genomic location), contain an XG tag with a genomic strand they map to, and an XM tag with the methylation call string — such files are produced by virtually any software tool for mapping and alignment of bisulfite sequencing reads (such as Bismark, BSMAP or Illumina DRAGEN Bio-IT Platform).

Please use the function help files for a full description of available parameters, as well as explanation of the function’s logic and output values.

Reading the data

All epialleleR methods can load BAM data using the file path. However, if a file is very large and several reports need to be prepared, it is advised to use the preprocessBam convenience function as shown below. This function is also used internally when a BAM file location string is supplied as an input for other epialleleR methods.

preprocessBam currently accepts only BAM files that are derived from paired-end sequencing (create an issue if you need to process single-end BAM files). During preprocessing, paired reads are merged according to their base quality: nucleotide base with the highest value in the QUAL string is taken, unless its quality is less than min.baseq, which results in no information for that particular position (“-”/“N”). These merged reads are then processed as a single entity in all epialleleR methods. Due to merging, overlapping bases in read pairs are counted only once, and the base with the highest quality is taken.

library(epialleleR)

capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
bam.data    <- preprocessBam(capture.bam)
#> Reading BAM file [0.028s]

Making cytosine reports

epialleleR can generate conventional cytosine reports in a format, which is similar to the genome-wide cytosine report produced by the coverage2cytosine Bismark module.6

Please note that generateCytosineReport produces thresholded (VEF) report by default: methylated cytosines from reads that do not pass the threshold (hypomethylated reads) are counted as being unmethylated. In order to make a conventional cytosine report, use threshold.reads=FALSE.

# data.table::data.table object for
# CpG VEF report
cg.vef.report <- generateCytosineReport(bam.data)
#> Already preprocessed BAM supplied as an input. Options 'min.mapq', 'min.baseq', 'skip.duplicates' and 'nthreads' will have no effect.
#> Thresholding reads [0.002s]
#> Preparing cytosine report [0.026s]
head(cg.vef.report[order(meth+unmeth, decreasing=TRUE)])
#>    rname strand      pos context meth unmeth
#> 1: chr17      + 61864475      CG    7      9
#> 2: chr17      + 61864486      CG   10      6
#> 3: chr17      + 61864504      CG    9      7
#> 4: chr20      - 57267455      CG   13      1
#> 5: chr17      - 61863826      CG    0     13
#> 6: chr17      - 61863830      CG    0     13

# CpG cytosine report
cg.report <- generateCytosineReport(bam.data, threshold.reads=FALSE)
#> Already preprocessed BAM supplied as an input. Options 'min.mapq', 'min.baseq', 'skip.duplicates' and 'nthreads' will have no effect.
#> Preparing cytosine report [0.026s]
head(cg.report[order(meth+unmeth, decreasing=TRUE)])
#>    rname strand      pos context meth unmeth
#> 1: chr17      + 61864475      CG    8      8
#> 2: chr17      + 61864486      CG   10      6
#> 3: chr17      + 61864504      CG   10      6
#> 4: chr20      - 57267455      CG   13      1
#> 5: chr17      - 61863826      CG    0     13
#> 6: chr17      - 61863830      CG    0     13

# CX cytosine report
cx.report <- generateCytosineReport(bam.data, threshold.reads=FALSE,
                                    report.context="CX")
#> Already preprocessed BAM supplied as an input. Options 'min.mapq', 'min.baseq', 'skip.duplicates' and 'nthreads' will have no effect.
#> Preparing cytosine report [0.028s]
head(cx.report[order(meth+unmeth, decreasing=TRUE)])
#>    rname strand      pos context meth unmeth
#> 1: chr17      + 61864338     CHG    1     25
#> 2: chr17      + 61864348     CHH    0     24
#> 3: chr17      + 61864364     CHH    0     24
#> 4: chr17      + 61864365     CHH    0     24
#> 5: chr17      + 61864373     CHH    0     24
#> 6: chr17      + 61864324     CHG    0     23

Making VEF reports for a set of genomic regions

epialleleR allows to make reports not only for individual cytosine bases, but also for a set of genomic regions. It is especially useful when the targeted methylation sequencing was used to produce reads (such as amplicon sequencing or hybridization capture using, e.g., Agilent SureSelect Target Enrichment Probes).

The amplicon sequencing principally differs from capture-based assays in that the coordinates of reads are known. Therefore, reads can be assigned to amplicons by their exact positions, while to the capture targets — by the overlap. For this, epialleleR provides generic generateBedReport function as well as two of its aliases, generateAmpliconReport (for amplicon-based NGS) and generateCaptureReport (for capture-based NGS).

# report for amplicon-based data
# matching is done by exact start or end positions plus/minus tolerance
amplicon.report <- generateAmpliconReport(
  bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
  bed=system.file("extdata", "amplicon.bed", package="epialleleR")
)
#> Reading BED file [0.057s]
#> Reading BAM file [0.007s]
#> Thresholding reads [0.000s]
#> Preparing amplicon report [0.055s]
amplicon.report
#>    seqnames    start      end width strand amplicon nreads+ nreads-        VEF
#> 1:    chr17 43125624 43126026   403      * CpG00-13       0     156 0.08333333
#> 2:    chr17 43125270 43125640   371      * CpG14-31       0      61 0.11475410
#> 3:    chr17 43125171 43125550   380      * CpG17-34       0      93 0.05376344
#> 4:    chr17 43124861 43125249   389      * CpG33-49       0      84 0.10714286
#> 5:     <NA>       NA       NA    NA   <NA>     <NA>      60      46 0.13207547

# report for capture-based data
# matching is done by overlap
capture.report <- generateCaptureReport(
  bam=system.file("extdata", "capture.bam", package="epialleleR"),
  bed=system.file("extdata", "capture.bed", package="epialleleR")
)
#> Reading BED file [0.035s]
#> Reading BAM file [0.022s]
#> Thresholding reads [0.001s]
#> Preparing capture report [0.023s]
head(capture.report)
#>    seqnames    start      end width strand     V4 nreads+ nreads-       VEF
#> 1:     chr1  3067647  3069703  2057      * PRDM16       2       1 1.0000000
#> 2:     chr1  3651039  3653096  2058      *   TP73       0       2 0.5000000
#> 3:     chr1  3689153  3691202  2050      *   TP73       0       2 1.0000000
#> 4:     chr1  3696519  3698570  2052      *   TP73       1       2 1.0000000
#> 5:     chr1  6179609  6181670  2062      *   CHD5       0       3 0.6666667
#> 6:     chr1 13698869 13699064   196      *  PRDM2      NA      NA        NA

# generateBedReport is a generic function for BED-guided reports
bed.report <- generateBedReport(
  bam=system.file("extdata", "capture.bam", package="epialleleR"),
  bed=system.file("extdata", "capture.bed", package="epialleleR"),
  bed.type="capture"
)
#> Reading BED file [0.013s]
#> Reading BAM file [0.018s]
#> Thresholding reads [0.001s]
#> Preparing capture report [0.025s]
identical(capture.report, bed.report)
#> [1] TRUE

Exploring DNA methylation patterns

Individual epialleles can be extracted and plotted in order to visualize methylation patters within a genomic region of interest. For this, epialleleR provides method extractPatterns which can be used as follows:

# First, let's extract base methylation information for sequencing reads
# of 1:9 mix of methylated and non-methylated control DNA
patterns <- extractPatterns(
  bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
  bed=as("chr17:43125200-43125600","GRanges")
)
#> Reading BAM file [0.006s]
#> Extracting methylation patterns [0.028s]

# that many read pairs overlap genomic region of interest
nrow(patterns)
#> [1] 238

# these are positions of bases within relevant read pairs
base.positions <- grep("^[0-9]+$", colnames(patterns), value=TRUE)

# let's make a summary table with counts for every pattern
patterns.summary <- patterns[, c(lapply(.SD, unique), .N),
                             by=.(pattern, beta), .SDcols=base.positions]

# that many unique methylation patterns were found
nrow(patterns.summary)
#> [1] 45

# let's melt and plot patterns
plot.data <- data.table::melt.data.table(patterns.summary,
  measure.vars=base.positions, variable.name="pos", value.name="base")

# categorical positions, all patterns sorted by beta, with counts on the right
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
  ggplot(na.omit(plot.data),
         aes(x=pos, y=reorder(pattern,beta),
             group=pattern, color=factor(base))) +
    geom_line(color="grey", position=position_dodgev(height=0.5)) +
    geom_point(position=position_dodgev(height=0.5)) +
    scale_colour_grey(start=0.8, end=0) +
    theme_light() +
    theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),
          axis.text.y=element_blank()) +
    labs(x="position", y="pattern", title="epialleles", color="base") +
    scale_x_discrete(expand=c(0.05,0)) +
    geom_text(inherit.aes=FALSE, data=patterns.summary,
              mapping=aes(x="count", y=pattern, label=N), size=3)
}
#> 
#> Attaching package: 'ggstance'
#> 
#> The following objects are masked from 'package:ggplot2':
#> 
#>     GeomErrorbarh, geom_errorbarh


# continuous positions, nonunique patterns according to their counts
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
  ggplot(na.omit(plot.data)[N>1],
         aes(x=as.numeric(as.character(pos)), y=factor(N),
             group=pattern, color=factor(base))) +
    geom_line(color="grey", position=position_dodgev(height=0.5)) +
    geom_point(position=position_dodgev(height=0.5)) +
    scale_colour_grey(start=0.8, end=0) +
    theme_light() +
    labs(x="position", y="count", title="epialleles", color="base")
}


# upset-like plot of all patterns, continuous positions, sorted by counts
if (require("ggplot2", quietly=TRUE) & require("gridExtra", quietly=TRUE)) {
  grid.arrange(
    ggplot(na.omit(plot.data),
           aes(x=as.numeric(as.character(pos)), y=reorder(pattern,N),
               color=factor(base))) +
      geom_line(color="grey") +
      geom_point() +
      scale_colour_grey(start=0.8, end=0) +
      theme_light() +
      theme(axis.text.y=element_blank(), legend.position="none") +
      labs(x="position", y=NULL, title="epialleles", color="base"),
    
    ggplot(unique(na.omit(plot.data)[, .(pattern, N, beta)]),
           aes(x=N+0.5, y=reorder(pattern,N), alpha=beta, label=N)) +
      geom_col() +
      geom_text(alpha=0.5, nudge_x=0.2, size=3) +
      scale_x_log10() +
      theme_minimal() +
      theme(axis.text.y=element_blank(), legend.position="none") +
      labs(x="count", y=NULL, title=""),
    ncol=2, widths=c(0.75, 0.25)
  )
}


# now let's explore methylation patterns in RAD51C gene promoter using
# methylation capture data
capture.patterns <- extractPatterns(
  bam=system.file("extdata", "capture.bam", package="epialleleR"),
  bed=as("chr17:58691673-58693108", "GRanges"),
  verbose=FALSE
)
capture.positions <- grep("^[0-9]+$", colnames(capture.patterns), value=TRUE)
capture.summary <-
  capture.patterns[, c(lapply(.SD, unique), .N),
                   by=.(pattern, beta), .SDcols=capture.positions]
capture.data <- data.table::melt.data.table(capture.summary,
  measure.vars=capture.positions, variable.name="pos", value.name="base")
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
  ggplot(na.omit(capture.data),
         aes(x=as.numeric(as.character(pos)), y=factor(N),
             group=pattern, color=factor(base))) +
    geom_line(color="grey", position=position_dodgev(height=0.9)) +
    geom_point(position=position_dodgev(height=0.9)) +
    scale_colour_grey(start=0.8, end=0) +
    theme_light() +
    labs(x="position", y="count", title="RAD51C", color="base")
}

Plotting the distribution of per-read beta values

As stated in the introduction, human genomic DNA regions often show bimodal methylation patterns. epialleleR allows to visualize this information by plotting empirical cumulative distribution functions (eCDFs) for within- and out-of-context beta values.

The code below produces plots for the sequencing reads of control DNA samples. Note that within-the-context eCDF(0.5) values are very close to the expected 1-VEF values for the corresponding control DNA samples:

# First, let's visualise eCDFs for within- and out-of-context beta values
# for all four amplicons and unmatched reads. Note that within-the-context eCDF
# of 0.5 is very close to the expected 1-VEF value (0.1) for all amplicons
# produced from this 1:9 mix of methylated and non-methylated control DNA

# let's compute eCDF
amplicon.ecdfs <- generateBedEcdf(
  bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
  bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
  bed.rows=NULL
)
#> Reading BED file [0.012s]
#> Reading BAM file [0.006s]
#> Computing ECDFs for within- and out-of-context per-read beta values [0.016s]

# there are 5 items in amplicon.ecdfs, let's plot all of them
par(mfrow=c(1,length(amplicon.ecdfs)))

# cycle through items
for (x in 1:length(amplicon.ecdfs)) {
  # four of them have names corresponding to genomic regions of amplicon.bed
  # fifth - NA for all the reads that don't match to any of those regions
  main <- if (is.na(names(amplicon.ecdfs[x]))) "unmatched"
          else names(amplicon.ecdfs[x])
  
  # plotting eCDF for within-the-context per-read beta values (in red)
  plot(amplicon.ecdfs[[x]]$context, col="red", verticals=TRUE, do.points=FALSE,
       xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
       main=main)
  
  # adding eCDF for out-of-context per-read beta values (in blue)
  plot(amplicon.ecdfs[[x]]$out.of.context, add=TRUE, col="blue",
       verticals=TRUE, do.points=FALSE)
}



# Second, let's compare eCDFs for within-the-context beta values for only one
# amplicon but all three sequenced samples: pure non-methylated DNA, 1:9 mix of
# methylated and non-methylated DNA, and fully methylated DNA

# our files
bam.files <- c("amplicon000meth.bam", "amplicon010meth.bam",
               "amplicon100meth.bam")

# let's plot all of them
par(mfrow=c(1,length(bam.files)))

# cycle through items
for (f in bam.files) {
  # let's compute eCDF
  amplicon.ecdfs <- generateBedEcdf(
    bam=system.file("extdata", f, package="epialleleR"),
    bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
    # only the second amplicon
    bed.rows=2, verbose=FALSE
  )
  
  # plotting eCDF for within-the-context per-read beta values (in red)
  plot(amplicon.ecdfs[[1]]$context, col="red", verticals=TRUE, do.points=FALSE,
       xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
       main=f)
  
   # adding eCDF for out-of-context per-read beta values (in blue)
  plot(amplicon.ecdfs[[1]]$out.of.context, add=TRUE, col="blue",
       verticals=TRUE, do.points=FALSE)
}

Exploring sequence variants in epialleles

It is known that sequence variants can affect the methylation status of a DNA.7 The generateVcfReport function calculates frequencies of single nucleotide variants (SNVs) within epialleles and tests for the association between SNV and epiallelic status using Fisher Exact test. Base counts and the test’s p-values are included in the returned value.

In addition to BAM file location string or preprocessed BAM object, the function requires a location string for the Variant Call Format (VCF) file or the VCF object that was obtained using VariantAnnotation::readVcf function. As VCF files can be extremely large, it is strongly advised to prefilter the VCF object by the relevant set of genomic regions, or specify such relevant set of regions as a bed parameter when vcf points to a VCF file location.

Please note, that the output report is currently limited to SNVs only. Also, the default (min.baseq=0) output of generateVcfReport is equivalent to the one of samtools mplieup -Q 0 ..., and therefore may result in false SNVs caused by misalignments. Remember to increase min.baseq (samtools mplieup -Q default value is 13) to obtain results of a higher quality.

# VCF report
vcf.report <- generateVcfReport(
  bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
  bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
  vcf=system.file("extdata", "amplicon.vcf.gz", package="epialleleR"),
  # thresholds on alignment and base quality
  min.mapq=30, min.baseq=13,
  # when VCF seqlevels are different from BED and BAM it is possible
  # to convert them internally
  vcf.style="NCBI"
)
#> Reading BED file [0.012s]
#> Reading VCF file [1.010s]
#> Reading BAM file [0.006s]
#> Thresholding reads [0.000s]
#> Extracting base frequences [0.058s]

# NA values are shown for the C->T variants on the "+" and G->A on the "-"
# strands, because bisulfite conversion makes their counting impossible
head(vcf.report)
#>           name seqnames    range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs546660277    chr17 43124874   A   C     0     0     9    74     0     0     0     0     83
#> 2: rs574263814    chr17 43124891   G   A     0     0    NA    NA     0     0    NA    NA      0
#> 3:   rs8176076    chr17 43124935   G   A     0     0    NA    NA     0     0    NA    NA      0
#> 4: rs535977743    chr17 43125016   C   T    NA    NA     9    73    NA    NA     0     0     82
#> 5: rs191784032    chr17 43125050   C   A     0     0     9    73     0     0     0     1     82
#> 6: rs111956204    chr17 43125083   C   A     0     0     9    74     0     0     0     0     83
#>    SumAlt FEp+ FEp-
#> 1:      0    1    1
#> 2:      0    1   NA
#> 3:      0    1   NA
#> 4:      0   NA    1
#> 5:      1    1    1
#> 6:      0    1    1

# let's sort the report by increasing Fisher's exact test's p-values.
# the p-values are given separately for reads that map to the "+"
head(vcf.report[order(`FEp-`, na.last=TRUE)])
#>           name seqnames    range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs546660277    chr17 43124874   A   C     0     0     9    74     0     0     0     0     83
#> 2: rs535977743    chr17 43125016   C   T    NA    NA     9    73    NA    NA     0     0     82
#> 3: rs191784032    chr17 43125050   C   A     0     0     9    73     0     0     0     1     82
#> 4: rs111956204    chr17 43125083   C   A     0     0     9    74     0     0     0     0     83
#> 5:  rs55680227    chr17 43125086   A   C     0     0     8    64     0     0     0     0     72
#> 6: rs539733232    chr17 43125088   C   A     0     0     8    71     0     0     0     0     79
#>    SumAlt FEp+ FEp-
#> 1:      0    1    1
#> 2:      0   NA    1
#> 3:      1    1    1
#> 4:      0    1    1
#> 5:      0    1    1
#> 6:      0    1    1

# and to the "-" strand
head(vcf.report[order(`FEp+`, na.last=TRUE)])
#>           name seqnames    range REF ALT M+Ref U+Ref M-Ref U-Ref M+Alt U+Alt M-Alt U-Alt SumRef
#> 1: rs546660277    chr17 43124874   A   C     0     0     9    74     0     0     0     0     83
#> 2: rs574263814    chr17 43124891   G   A     0     0    NA    NA     0     0    NA    NA      0
#> 3:   rs8176076    chr17 43124935   G   A     0     0    NA    NA     0     0    NA    NA      0
#> 4: rs191784032    chr17 43125050   C   A     0     0     9    73     0     0     0     1     82
#> 5: rs111956204    chr17 43125083   C   A     0     0     9    74     0     0     0     0     83
#> 6:  rs55680227    chr17 43125086   A   C     0     0     8    64     0     0     0     0     72
#>    SumAlt FEp+ FEp-
#> 1:      0    1    1
#> 2:      0    1   NA
#> 3:      0    1   NA
#> 4:      1    1    1
#> 5:      0    1    1
#> 6:      0    1    1

# and finally, let's plot methylation patterns overlapping one of the most
# covered SNPs in the methylation capture test data set - rs573296191
# (chr17:61864584) in BRIP1 gene
brip1.patterns <- extractPatterns(
  bam=system.file("extdata", "capture.bam", package="epialleleR"),
  bed=as("chr17:61864583-61864585", "GRanges"),
  highlight.positions=61864584,
  verbose=FALSE
)
brip1.positions <- grep("^[0-9]+$", colnames(brip1.patterns), value=TRUE)
brip1.summary <-
  brip1.patterns[, c(lapply(.SD, unique), .N),
                 by=.(pattern, beta), .SDcols=brip1.positions]
brip1.data <- data.table::melt.data.table(brip1.summary,
  measure.vars=brip1.positions, variable.name="pos", value.name="base")
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
  ggplot(na.omit(brip1.data),
         aes(x=as.numeric(as.character(pos)), y=factor(N),
             group=pattern, color=factor(base))) +
    geom_line(color="grey", position=position_dodgev(height=0.9)) +
    geom_point(position=position_dodgev(height=0.9)) +
    scale_colour_manual(values=c("blue","red","grey80","black")) +
    theme_light() +
    labs(x="position", y="count", title="BRIP1", color="base")
}


Other information

Citing the epialleleR package

Oleksii Nikolaienko, Per Eystein Lønning, Stian Knappskog, epialleleR: an R/BioC package for sensitive allele-specific methylation analysis in NGS data. bioRxiv 2022.06.30.498213, https://www.biorxiv.org/content/10.1101/2022.06.30.498213v1

The data underlying epialleleR manuscript

Replication Data for: “epialleleR: an R/BioC package for quantifying and analysing low-frequency DNA methylation”, https://doi.org/10.18710/2BQTJP

NCBI GEO dataset GSE201690: “Methylation analysis of promoter regions for selected tumour suppressor genes in DNA from white blood cells”, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201690

The experimental data analysed using the package

Per Eystein Lonning, Oleksii Nikolaienko, Kathy Pan, Allison W. Kurian, Hans Petter Petter Eikesdal, Mary Pettinger, Garnet L Anderson, Ross L Prentice, Rowan T. Chlebowski, and Stian Knappskog. Constitutional BRCA1 methylation and risk of incident triple-negative breast cancer and high-grade serous ovarian cancer. JAMA Oncology 2022. https://doi.org/10.1001/jamaoncol.2022.3846

Other papers will be published soon.

Session Info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB             
#>  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] GenomeInfoDb_1.36.1 IRanges_2.34.1      S4Vectors_0.38.1    BiocGenerics_0.46.0
#> [5] ggstance_0.3.6      epialleleR_1.8.1    gridExtra_2.3       ggplot2_3.4.3      
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.1.3                   bitops_1.0-7                biomaRt_2.56.1             
#>  [4] rlang_1.1.1                 magrittr_2.0.3              matrixStats_1.0.0          
#>  [7] compiler_4.3.1              RSQLite_2.3.1               GenomicFeatures_1.52.1     
#> [10] png_0.1-8                   vctrs_0.6.3                 stringr_1.5.0              
#> [13] pkgconfig_2.0.3             crayon_1.5.2                fastmap_1.1.1              
#> [16] dbplyr_2.3.3                XVector_0.40.0              labeling_0.4.2             
#> [19] utf8_1.2.3                  Rsamtools_2.16.0            rmarkdown_2.24             
#> [22] bit_4.0.5                   xfun_0.40                   zlibbioc_1.46.0            
#> [25] cachem_1.0.8                jsonlite_1.8.7              progress_1.2.2             
#> [28] blob_1.2.4                  highr_0.10                  DelayedArray_0.26.7        
#> [31] BiocParallel_1.34.2         parallel_4.3.1              prettyunits_1.1.1          
#> [34] R6_2.5.1                    VariantAnnotation_1.46.0    bslib_0.5.1                
#> [37] stringi_1.7.12              rtracklayer_1.60.1          GenomicRanges_1.52.0       
#> [40] jquerylib_0.1.4             Rcpp_1.0.11                 SummarizedExperiment_1.30.2
#> [43] knitr_1.43                  Matrix_1.6-1                tidyselect_1.2.0           
#> [46] abind_1.4-5                 yaml_2.3.7                  codetools_0.2-19           
#> [49] curl_5.0.2                  plyr_1.8.8                  lattice_0.21-8             
#> [52] tibble_3.2.1                Biobase_2.60.0              withr_2.5.0                
#> [55] KEGGREST_1.40.0             evaluate_0.21               BiocFileCache_2.8.0        
#> [58] xml2_1.3.5                  Biostrings_2.68.1           pillar_1.9.0               
#> [61] filelock_1.0.2              MatrixGenerics_1.12.3       generics_0.1.3             
#> [64] RCurl_1.98-1.12             hms_1.1.3                   munsell_0.5.0              
#> [67] scales_1.2.1                glue_1.6.2                  tools_4.3.1                
#> [70] BiocIO_1.10.0               data.table_1.14.8           BSgenome_1.68.0            
#> [73] GenomicAlignments_1.36.0    XML_3.99-0.14               grid_4.3.1                 
#> [76] AnnotationDbi_1.62.2        colorspace_2.1-0            GenomeInfoDbData_1.2.10    
#> [79] restfulr_0.0.15             cli_3.6.1                   rappdirs_0.3.3             
#> [82] fansi_1.0.4                 S4Arrays_1.0.5              dplyr_1.1.2                
#> [85] gtable_0.3.4                sass_0.4.7                  digest_0.6.33              
#> [88] rjson_0.2.21                farver_2.1.1                memoise_2.0.1              
#> [91] htmltools_0.5.6             lifecycle_1.0.3             httr_1.4.7                 
#> [94] bit64_4.0.5

References


  1. https://doi.org/10.1146/annurev.pharmtox.45.120403.095832↩︎

  2. https://doi.org/10.1101/2020.12.01.403501↩︎

  3. https://doi.org/10.1093/bib/bbx077↩︎

  4. https://www.zymoresearch.com/products/human-methylated-non-methylated-dna-set-dna-w-primers↩︎

  5. https://dx.doi.org/10.1186%2Fs13148-020-00920-7↩︎

  6. https://www.bioinformatics.babraham.ac.uk/projects/bismark/↩︎

  7. https://doi.org/10.1038/modpathol.2009.130↩︎