Input data

A Multi Assay Experiment object (Ramos et al. 2017,SIG (2017)) is the input for all main functions of ELMER and can be generated by createMAE function.

To perform ELMER analyses, the Multi Assay Experiment needs:

  • a DNA methylation matrix or SummarizedExperiment object from HM450K or EPIC platform for multiple samples;
  • a gene expression matrix or SummarizedExperiment object for the same samples;
  • a matrix mapping DNA methylation samples to gene expression samples
  • a matrix with samples metadata (i.e. clinical data, molecular subtype information).

If TCGA data are used, the the last two matrices will be automatically generated. Based on the genome of reference selected, metadata for the DNA methylation probes, such as genomic coordinates, will be added from Wanding Zhou annotation (Zhou, Laird, and Shen 2016); and metadata for gene annotation will be added from ensemble database (Yates et al. 2015) using biomaRt (Durinck et al. 2009).

DNA methylation data

DNA methylation data feeding to ELMER should be a matrix of DNA methylation beta (\(\beta\)) value for samples (column) and probes (row) processed from row HM450K array data. If TCGA data is used, processed data from GDC website will be downloaded and automatically transformed to the matrix by ELMER. The processed TCGA DNA methylation data were calculated as \(\frac{M}{(M+U)}\), where M represents the methylated allele intensity and U the unmethylated allele intensity. Beta values range from 0 to 1, reflecting the fraction of methylated alleles at each CpG in the each tumor; beta values close to 0 indicates low levels of DNA methylation and beta values close to 1 indicates high levels of DNA methylation.

If user have raw HM450K data, these data can be processed by Methylumi or minfi generating DNA methylation beta (\(\beta\)) value for each CpG site and multiple samples. The getBeta function in minfi can be used to generate a matrix of DNA methylation beta (\(\beta\)) value to feed in ELMER. And we recommend to save this matrix as meth.rda since createMAE can read in files by specifying their path which will help to reduce memory usage.

# Example of DNA methylation data input
datatable(Meth[1:10, 1:10], 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = TRUE)

Gene expression data

Gene expresion data feeding to ELMER should be a matrix of gene expression values for samples (column) and genes (row). Gene expression value can be generated from different platforms: array or RNA-seq. The row data should be processed by other software to get gene or transcript level gene expression calls such as mapping by tophat, calling expression value by cufflink, RSEM or GenomeStudio for expression array. It is recommended to normalize expression data making gene expression comparable across samples such as quantile normalization. User can refer TCGA RNA-seq analysis pipeline to do generate comparable gene expression data. Then transform the gene expression values from each sample to the matrix for feeding into ELMER. If users want to use TCGA data, ELMER has functions to download the RNA-Seq Quantification data (HTSeq-FPKM-UQ) from GDC website and transform the data to the matrix for feeding into ELMER. It is recommended to save this matrix as RNA.rda since createMAE can read in files by specifying the path of files which will help to reduce memory usage.

# Example of Gene expression data input
datatable(GeneExp[1:10, 1:2], 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = TRUE)

Sample information

Sample information should be stored as a data.frame object containing sample ID, group labels (control and experiment). Sample ID and groups labels are required. Other information for each sample can be added to this data.frame object. When TCGA data were used, samples information will be automatically generated by createMAE function by specifying option TCGA=TRUE. A columns name TN will create the groups Tumor and Normal using the following samples to each group:

Tumor samples are:

  • Primary solid Tumor
  • Recurrent Solid Tumor
  • Primary Blood Derived Cancer - Peripheral Blood
  • Recurrent Blood Derived Cancer - Bone Marrow
  • Additional - New Primary
  • Metastatic
  • Additional Metastatic
  • Human Tumor Original Cells
  • Primary Blood Derived Cancer - Bone Marrow

Normal samples:

  • Blood Derived Normal
  • Solid Tissue Normal
  • Buccal Cell Normal
  • EBV Immortalized Normal
  • Bone Marrow Normal
library(MultiAssayExperiment)
data <- createMAE(exp = GeneExp, 
                  met = Meth,
                  met.platform = "450K",
                  genome = "hg19",
                  save = FALSE,
                  TCGA = TRUE)
data
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] DNA methylation: RangedSummarizedExperiment with 1679 rows and 234 columns 
##  [2] Gene expression: RangedSummarizedExperiment with 3808 rows and 234 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
as.data.frame(colData(data)[,c("patient","definition","TN")]) %>% 
  datatable(options = list(scrollX = TRUE,pageLength = 5)) 
# Adding sample information for non TCGA samples
# You should have two objects with one for DNA methylation and 
# one for gene expression. They should have the same number of samples and the names of the 
# sample in the gene expression object and in hte DNA methylation matrix 
# should be the same
not.tcga.exp <- GeneExp # 234 samples
colnames(not.tcga.exp) <- substr(colnames(not.tcga.exp),1,15)
not.tcga.met <- Meth # 268 samples
colnames(not.tcga.met) <- substr(colnames(not.tcga.met),1,15)
# Number of samples in both objects (234)
table(colnames(not.tcga.met) %in% colnames(not.tcga.exp)) 
## 
## FALSE  TRUE 
##    34   234
# Our sample information must have as row names the samples information
phenotype.data <- data.frame(row.names = colnames(not.tcga.exp), 
                             samples = colnames(not.tcga.exp), 
                             group = c(rep("group1", ncol(GeneExp)/2),
                                       rep("group2", ncol(GeneExp)/2)))

data.hg19 <- createMAE(exp = not.tcga.exp, 
                       met =  not.tcga.met, 
                       TCGA = FALSE, 
                       met.platform = "450K",
                       genome = "hg19", 
                       colData = phenotype.data)
data.hg19
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] DNA methylation: RangedSummarizedExperiment with 1679 rows and 234 columns 
##  [2] Gene expression: RangedSummarizedExperiment with 3808 rows and 234 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
# The samples that does not have data for both DNA methylation and Gene exprssion will be removed even for the phenotype data
phenotype.data <- data.frame(row.names = colnames(not.tcga.met), 
                             samples = colnames(not.tcga.met), 
                             group = c(rep("group1", ncol(Meth)/4),
                                       rep("group2", ncol(Meth)/4),
                                       rep("group3", ncol(Meth)/4),
                                       rep("group4", ncol(Meth)/4)))

data.hg38 <- createMAE(exp = not.tcga.exp, 
                       met =  not.tcga.met, 
                       TCGA = FALSE, 
                       save = FALSE,
                       met.platform = "450K",
                       genome = "hg38", 
                       colData = phenotype.data)
data.hg38
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] DNA methylation: RangedSummarizedExperiment with 1679 rows and 234 columns 
##  [2] Gene expression: RangedSummarizedExperiment with 3741 rows and 234 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
as.data.frame(colData(data.hg38)[1:20,]) %>%
  datatable(options = list(scrollX = TRUE,pageLength = 5)) 

Probe information

Probe information is stored as a GRanges object containing the coordinates of each probe on the DNA methylation array and names of each probe. The default probe information is fetching from Wanding Zhou annotation (Zhou, Laird, and Shen 2016)

library(SummarizedExperiment, quietly = TRUE)
rowRanges(getMet(data))[1:3,1:8]
## GRanges object with 3 ranges and 8 metadata columns:
##              seqnames                 ranges strand |  addressA  addressB
##                 <Rle>              <IRanges>  <Rle> | <numeric> <numeric>
##   cg00045114     chr1 [172674159, 172674160]      * |  52756415      <NA>
##   cg00050294     chr1 [  2886818,   2886819]      * |  51670436      <NA>
##   cg00066722     chr1 [ 43634520,  43634521]      * |  67752427      <NA>
##                  channel designType    nextBase nextBaseRef   probeType
##              <character>   <factor> <character> <character> <character>
##   cg00045114        Both         II         G/A           C          cg
##   cg00050294        Both         II         G/A           C          cg
##   cg00066722        Both         II         G/A           C          cg
##              orientation
##                 <factor>
##   cg00045114        down
##   cg00050294        down
##   cg00066722          up
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Gene information

Gene information is stored as a GRanges object containing coordinates of each gene, gene id, gene symbol and gene isoform id. The default gene information is the ensembl gene annotation fetched from biomaRt by ELMER function.

rowRanges(getExp(data))
## GRanges object with 3808 ranges and 2 metadata columns:
##                   seqnames                 ranges strand |
##                      <Rle>              <IRanges>  <Rle> |
##   ENSG00000188984     chr1 [ 12776118,  12788726]      + |
##   ENSG00000204518     chr1 [ 12704566,  12727097]      + |
##   ENSG00000108270    chr17 [ 35306175,  35414171]      + |
##   ENSG00000198691     chr1 [ 94458393,  94586688]      - |
##   ENSG00000135776     chr1 [229652329, 229694442]      - |
##               ...      ...                    ...    ... .
##   ENSG00000132003    chr19   [13906274, 13943044]      + |
##   ENSG00000162415     chr1   [45482071, 45771881]      - |
##   ENSG00000203995     chr1   [53308183, 53360670]      + |
##   ENSG00000162378     chr1   [53192126, 53293014]      + |
##   ENSG00000036549     chr1   [78028101, 78149104]      - |
##                   external_gene_name ensembl_gene_id
##                          <character>     <character>
##   ENSG00000188984            AADACL3 ENSG00000188984
##   ENSG00000204518            AADACL4 ENSG00000204518
##   ENSG00000108270               AATF ENSG00000108270
##   ENSG00000198691              ABCA4 ENSG00000198691
##   ENSG00000135776             ABCB10 ENSG00000135776
##               ...                ...             ...
##   ENSG00000132003             ZSWIM4 ENSG00000132003
##   ENSG00000162415             ZSWIM5 ENSG00000162415
##   ENSG00000203995             ZYG11A ENSG00000203995
##   ENSG00000162378             ZYG11B ENSG00000162378
##   ENSG00000036549               ZZZ3 ENSG00000036549
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Multi Assay Experiment object

A Multi Assay Experiment object from the MultiAssayExperiment package is the input for multiple main functions of ELMER. It contains the above components and making a Multi Assay Experiment object by createMAE function will keep each component consistent with each other. For example, althougth DNA methylation and gene expression matrixes have different rows (probe for DNA methylation and gene id for gene expression), the column (samples) order should be same in the two matrixes. The createMAE function will keep them consistent when it generates the Multi Assay Experiment object.

data <- createMAE(exp = GeneExp, 
                  met = Meth,
                  genome = "hg19",
                  save = FALSE,
                  met.platform = "450K",
                  TCGA = TRUE)

# For TGCA data 1-12 represents the patient and 1-15 represents the sample ID (i.e. primary solid tumor samples )
all(substr(colnames(getExp(data)),1,15) == substr(colnames(getMet(data)),1,15))
## [1] TRUE
# See sample information for data
as.data.frame(colData(data)) %>% datatable(options = list(scrollX = TRUE)) 
# See sample names for each experiment
as.data.frame(sampleMap(data)) %>% datatable(options = list(scrollX = TRUE)) 

You can also use your own data and annotations to create Multi Assay Experiment object.

# NON TCGA example: matrices has diffetrent column names
gene.exp <- DataFrame(sample1 = c("ENSG00000141510"=2.3,"ENSG00000171862"=5.4),
                      sample2 = c("ENSG00000141510"=1.6,"ENSG00000171862"=2.3))
dna.met <- DataFrame(sample1 = c("cg14324200"=0.5,"cg23867494"=0.1),
                     sample2 =  c("cg14324200"=0.3,"cg23867494"=0.9))
sample.info <- DataFrame(sample.type = c("Normal", "Tumor"))
rownames(sample.info) <- colnames(gene.exp)

mae <- createMAE(exp = gene.exp,
                 met = dna.met,
                 save = FALSE,
                 met.platform = "450K",
                 colData = sample.info, 
                 genome = "hg38") 

# NON TCGA example: matrices has diffetrent column names
rownames(sample.info) <- c("sample1","sample2")

sampleMap <- DataFrame(primary = c("sample1","sample1","sample2","sample2"), 
                       colname = c("sample1.exp","sample1.met","sample2.exp","sample2.met"))

mae <- createMAE(exp = gene.exp, 
                 met = dna.met, 
                 sampleMap = sampleMap, 
                 colData = sample.info, 
                 save = FALSE,
                 met.platform = "450K",
                 genome = "hg38") 

Durinck, Steffen, Paul T Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package BiomaRt.” Nature Protocols 4 (8). Nature Publishing Group: 1184–91.

Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez Cabrera, Tiffany Chan, et al. 2017. “Software for the Integration of Multi-Omics Experiments in Bioconductor.” BioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/144774.

SIG, MultiAssay. 2017. MultiAssayExperiment: Software for the Integration of Multi-Omics Experiments in Bioconductor. https://github.com/waldronlab/MultiAssayExperiment/wiki/MultiAssayExperiment-API.

Yates, Andrew, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Konstantinos Billis, Denise Carvalho-Silva, Carla Cummins, et al. 2015. “Ensembl 2016.” Nucleic Acids Research. Oxford Univ Press, gkv1157.

Zhou, Wanding, Peter W Laird, and Hui Shen. 2016. “Comprehensive Characterization, Annotation and Innovative Use of Infinium Dna Methylation Beadchip Probes.” Nucleic Acids Research. Oxford Univ Press, gkw967.