SummarizedBenchmark

The SummarizedBenchmark package provides a framework for organizing benchmark comparisons, making it easier to both reproduce the original benchmark and replicate the comparison with new data. This vignette introduces the general approach and features of the package using two examples.

SummarizedBenchmark package version: 1.0.4
With SummarizedBenchmark, a complete benchmarking workflow comprises three primary components: the data, the methods to be compared, and the performance metrics used to evaluate them. The first two (data and methods) are necessary for carrying out the benchmark experiment, and the last (performance metrics) is essential for evaluating its results. Following this approach, the SummarizedBenchmark package defines two types of objects: BenchDesign objects and SummarizedBenchmark objects. BenchDesign objects contain only the design of the benchmark experiment, namely the data and methods, where a method is defined as the combination of a function or algorithm and its parameter settings. After constructing a BenchDesign, the experiment is executed to create a SummarizedBenchmark containing the results of applying the methods to the data. SummarizedBenchmark objects extend the Bioconductor SummarizedExperiment class, with the additional capability of working with performance metrics.
The basic framework is illustrated in the figure below. A BenchDesign is created with methods and combined with data to create a SummarizedBenchmark, which contains the output of applying the methods to the data. This output is then paired with performance metrics specified by the user. Note that the same BenchDesign can be combined with several data sets to produce several SummarizedBenchmark objects with the corresponding outputs. For convenience, several default performance metrics are implemented in the package, and can be added to SummarizedBenchmark objects using simple commands.
In this vignette, we first illustrate the basic use of both the BenchDesign and SummarizedBenchmark classes with a simple comparison of methods for p-value correction in the context of multiple hypothesis testing. Then, we describe more advanced features of the package with a case study comparing three methods for differential expression analysis.
To illustrate the basic use of the BenchDesign class, we use the tdat data set included with this package. The data set is a data.frame containing the results of 50 two-sample t-tests. The tests were performed using independently simulated sets of 20 observations drawn from a single standard Normal distribution (when H = 0) or two mean-shifted Normal distributions (when H = 1).
## H test_statistic effect_size pval SE
## 1 1 -3.2083437 -1.17151466 4.872282e-03 0.3651463
## 2 0 0.1692236 0.07321912 8.675080e-01 0.4326768
## 3 1 -5.7077940 -1.81715381 2.061521e-05 0.3183636
## 4 1 -1.9805856 -1.09107836 6.313031e-02 0.5508867
## 5 1 -1.0014358 -0.37726058 3.298895e-01 0.3767197
## 6 1 -0.9190433 -0.47583669 3.702252e-01 0.5177522
Several approaches have been proposed and implemented to compute adjusted p-values and q-values with the goal of controlling the total number of false discoveries across a collection of tests. In this example, we compare three such methods:

- Bonferroni correction (p.adjust w/ method = "bonferroni") (Dunn 1961),
- Benjamini-Hochberg (p.adjust w/ method = "BH") (Benjamini and Hochberg 1995), and
- Storey’s FDR q-value (qvalue::qvalue) (Storey 2002).

First, consider how benchmarking the three methods might look without the SummarizedBenchmark framework.
To compare methods, each is applied to tdat, and the results are stored in separate variables.
adj_bonf <- p.adjust(p = tdat$pval, method = "bonferroni")
adj_bh <- p.adjust(p = tdat$pval, method = "BH")
qv <- qvalue::qvalue(p = tdat$pval)
adj_qv <- qv$qvalues
Since the values of interest are available from the output of each method as a vector of length 50 (the number of hypotheses tested), they can be combined into a single data.frame to keep things clean.
## adj_bonf adj_bh adj_qv
## 1 0.24361409 0.02191765 0.0079813228
## 2 1.00000000 0.90365415 0.3290660527
## 3 0.00103076 0.00103076 0.0003753518
## 4 1.00000000 0.12140444 0.0442094809
## 5 1.00000000 0.47127065 0.1716134119
## 6 1.00000000 0.48250104 0.1757029652
The data.frame of adjusted p-values and q-values can be used to compare the methods, either by directly parsing the table or using a framework like iCOBRA. Additionally, the data.frame can be saved as an RDS or Rdata object for future reference, eliminating the need to recompute the results on the original data.
While this approach can work well for smaller comparisons, it can quickly become overwhelming and unwieldy as the number of methods and parameters increases. Furthermore, once each method is applied and the final data.frame (adj) is constructed, there is no way to determine how each value was calculated. While an informative name can be used to “label” each method (as done above), this does not capture the full complexity, e.g. the parameters and context in which the function was evaluated. One solution might involve manually recording function calls and parameters in a separate data.frame with the hope of maintaining synchrony with the output data.frame. However, this is prone to errors, e.g. during fast “copy and paste” operations or additions and deletions of parameter combinations. An alternative (and hopefully better) solution is to use the framework of the SummarizedBenchmark package.
In the SummarizedBenchmark approach, a BenchDesign is first constructed with the data as the sole input. (A BenchDesign can also be constructed without any data input. This approach is described in a later section.)
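For example, the construction using tdat mirrors the BenchDesign(mydat) call used in the case study later in this vignette; the object name b matches the calls below.

library(SummarizedBenchmark)
b <- BenchDesign(tdat)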
Then, each method of interest is added to the BenchDesign using addMethod().
b <- addMethod(bd = b, label = "bonf", func = p.adjust,
params = rlang::quos(p = pval, method = "bonferroni"))
At a minimum, addMethod() requires three parameters:

- bd: the BenchDesign object to modify,
- label: a character name for the method, and
- func: the function to be called.
: the function to be called.After the minimum parameters are specified, any parameters needed by the func
method should be passed as named parameters, e.g. p = pval, method = "bonferroni"
, to params =
as a list of quosures using rlang::quos(..)
. Notice here that the pval
wrapped in rlang::quos(..)
does not need to be specified as tdat$pval
for the function to access the column in the data.frame. For readers familiar with the ggplot2 package, the use of params = rlang::quos(..)
here should be viewed similar to the use of aes = aes(..)
in ggplot2
for mapping between the data and plotting (or benchmarking) parameters.
The process of adding methods can be written more concisely using the pipe operators from the magrittr package.
b <- b %>%
addMethod(label = "BH",
func = p.adjust,
params = rlang::quos(p = pval, method = "BH")) %>%
addMethod(label = "qv",
func = qvalue::qvalue,
params = rlang::quos(p = pval),
post = function(x) { x$qvalues })
For some methods, such as the q-value approach above, it may be necessary to call a “post-processing” function on the primary method to extract the desired output (here, the q-values). This should be specified using the optional post = parameter.
Now, the BenchDesign object contains three methods. This can be verified using the printMethods() function.
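Assuming the object b from above, the summary below is produced by:

printMethods(b)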
## bonf -------------------------------------------------------
## func:
## p.adjust
## post:
## NULL
## meta:
## parameters:
## p : pval
## method : "bonferroni"
## BH ---------------------------------------------------------
## func:
## p.adjust
## post:
## NULL
## meta:
## parameters:
## p : pval
## method : "BH"
## qv ---------------------------------------------------------
## func:
## qvalue::qvalue
## post:
## function(x) {
## x$qvalues
## }
## meta:
## parameters:
## p : pval
While the BenchDesign now includes all the information necessary for performing the benchmarking study, the actual adjusted p-values and q-values have not yet been calculated. To do this, we simply call buildBench(). While buildBench() does not require any inputs other than the BenchDesign object, when the corresponding ground truth is known, the truthCols = parameter should be specified. In this example, the H column of the tdat data.frame contains the true null or alternative status of each simulated hypothesis test. Note that if any of the methods are defined in a separate package, that package must be installed and loaded before running the experiment.
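With the truth column specified, the call takes the following form; the object name sb matches the examples below.

sb <- buildBench(b, truthCols = "H")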
The returned object is of the SummarizedBenchmark class, an extension of the SummarizedExperiment class. The table of adjusted p-values and q-values is contained in a single “assay” of the object, with each method added using addMethod() appearing as a column named by the corresponding label.
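The values shown below can be accessed with the standard SummarizedExperiment accessor:

head(assay(sb))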
## bonf BH qv
## [1,] 0.24361409 0.02191765 0.0079813228
## [2,] 1.00000000 0.90365415 0.3290660527
## [3,] 0.00103076 0.00103076 0.0003753518
## [4,] 1.00000000 0.12140444 0.0442094809
## [5,] 1.00000000 0.47127065 0.1716134119
## [6,] 1.00000000 0.48250104 0.1757029652
Metadata for the methods is contained in the colData() of the same object, with each row corresponding to one method in the comparison.
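The metadata table below is retrieved with:

colData(sb)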
## DataFrame with 3 rows and 9 columns
## func post func_anon vers_src
## <character> <character> <logical> <character>
## bonf p.adjust NA FALSE func
## BH p.adjust NA FALSE func
## qv qvalue::qvalue function(x) {; x$qvalues;} FALSE func
## pkg_name pkg_vers param.p param.method label
## <character> <character> <character> <character> <character>
## bonf stats 3.5.1 pval "bonferroni" bonf
## BH stats 3.5.1 pval "BH" BH
## qv qvalue 2.12.0 pval NA qv
In addition to columns for the functions and parameters specified with addMethod() (func, post, label, param.*), the colData() includes several other columns added during the buildBench() process. Most notable are the columns recording the package name and version of func, when available (pkg_name, pkg_vers).
When available, ground truth data is contained in the rowData() of the SummarizedBenchmark object.
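The truth table below is retrieved with:

rowData(sb)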
## DataFrame with 50 rows and 1 column
## H
## <numeric>
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## ... ...
## 46 0
## 47 1
## 48 0
## 49 1
## 50 1
An important advantage of building on the existing SummarizedExperiment class and Bioconductor infrastructure to save the results is that the metadata is tightly linked to the data. It is therefore possible to subset the data while keeping the link to its respective metadata in a single step. For example, the code below extracts the data for only the first two methods.
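A subsetting call along these lines produces the two-method metadata table below:

colData(sb[, c("bonf", "BH")])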
## DataFrame with 2 rows and 9 columns
## func post func_anon vers_src pkg_name pkg_vers
## <character> <character> <logical> <character> <character> <character>
## bonf p.adjust NA FALSE func stats 3.5.1
## BH p.adjust NA FALSE func stats 3.5.1
## param.p param.method label
## <character> <character> <character>
## bonf pval "bonferroni" bonf
## BH pval "BH" BH
In addition, the SummarizedBenchmark class contains a dedicated slot where users can define performance metrics to evaluate the different methods.
Since different benchmarking experiments may require different metrics to evaluate the performance of the methods, the SummarizedBenchmark class provides a flexible way to define performance metrics. We can define a performance metric using the function addPerformanceMetric() by providing a SummarizedBenchmark object, a name for the metric, an assay name, and the function that defines the metric. Importantly, the function must contain the following two arguments: query (referring to a vector of values being evaluated, i.e. the output of one method) and truth (referring to the vector of ground truths). If further arguments are provided to the performance function, they must have default values.
For our example, we define the performance metric “TPR” (True Positive Rate) that calculates the fraction of true positives recovered given an alpha value. This performance metric uses the H assay of our example SummarizedBenchmark object.
sb <- addPerformanceMetric(
object = sb,
assay = "H",
evalMetric = "TPR",
evalFunction = function(query, truth, alpha = 0.1) {
goodHits <- sum((query < alpha) & truth == 1)
goodHits / sum(truth == 1)
}
)
performanceMetrics(sb)[["H"]]
## $TPR
## function (query, truth, alpha = 0.1)
## {
## goodHits <- sum((query < alpha) & truth == 1)
## goodHits/sum(truth == 1)
## }
Having defined all the desired performance metrics, the function estimatePerformanceMetrics() calculates these for each method. Parameters for the performance functions can be passed here. In the case below, we specify several alpha = values to be used for calculating the performance metrics with each function.
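The table below is produced by a call along the following lines; the variable name resWide is illustrative.

resWide <- estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2))
resWide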
## DataFrame with 3 rows and 12 columns
## func post func_anon vers_src
## <character> <character> <logical> <character>
## bonf p.adjust NA FALSE func
## BH p.adjust NA FALSE func
## qv qvalue::qvalue function(x) {; x$qvalues;} FALSE func
## pkg_name pkg_vers param.p param.method label TPR.1
## <character> <character> <character> <character> <character> <numeric>
## bonf stats 3.5.1 pval "bonferroni" bonf 0.125
## BH stats 3.5.1 pval "BH" BH 0.375
## qv qvalue 2.12.0 pval NA qv 0.6
## TPR.2 TPR.3
## <numeric> <numeric>
## bonf 0.2 0.225
## BH 0.55 0.625
## qv 0.7 0.9
By default, the function above returns a DataFrame, with the parameters of the performance function stored in its elementMetadata().
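Assuming the resWide object from above:

elementMetadata(resWide)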
## DataFrame with 12 rows and 4 columns
## colType assay performanceMetric alpha
## <character> <character> <character> <numeric>
## 1 methodInformation NA NA NA
## 2 methodInformation NA NA NA
## 3 methodInformation NA NA NA
## 4 methodInformation NA NA NA
## 5 methodInformation NA NA NA
## ... ... ... ... ...
## 8 methodInformation NA NA NA
## 9 methodInformation NA NA NA
## 10 performanceMetric H TPR 0.05
## 11 performanceMetric H TPR 0.1
## 12 performanceMetric H TPR 0.2
A second possibility is to set the parameter addColData = TRUE, so that the results are stored in the colData() of the SummarizedBenchmark object.
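Reusing the same alpha values, the two tables below correspond to the updated colData() and its elementMetadata():

sb <- estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2),
                                 addColData = TRUE)
colData(sb)
elementMetadata(colData(sb))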
## DataFrame with 3 rows and 12 columns
## func post func_anon vers_src
## <character> <character> <logical> <character>
## bonf p.adjust NA FALSE func
## BH p.adjust NA FALSE func
## qv qvalue::qvalue function(x) {; x$qvalues;} FALSE func
## pkg_name pkg_vers param.p param.method label TPR.1
## <character> <character> <character> <character> <character> <numeric>
## bonf stats 3.5.1 pval "bonferroni" bonf 0.125
## BH stats 3.5.1 pval "BH" BH 0.375
## qv qvalue 2.12.0 pval NA qv 0.6
## TPR.2 TPR.3
## <numeric> <numeric>
## bonf 0.2 0.225
## BH 0.55 0.625
## qv 0.7 0.9
## DataFrame with 12 rows and 4 columns
## colType assay performanceMetric alpha
## <character> <character> <character> <numeric>
## 1 methodInformation NA NA NA
## 2 methodInformation NA NA NA
## 3 methodInformation NA NA NA
## 4 methodInformation NA NA NA
## 5 methodInformation NA NA NA
## ... ... ... ... ...
## 8 methodInformation NA NA NA
## 9 methodInformation NA NA NA
## 10 performanceMetric H TPR 0.05
## 11 performanceMetric H TPR 0.1
## 12 performanceMetric H TPR 0.2
Finally, if the user prefers tidier formats, setting the parameter tidy = TRUE makes the function return a long-format version of the results.
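For example:

estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2), tidy = TRUE)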
## func post func_anon vers_src pkg_name
## 1 p.adjust <NA> FALSE func stats
## 2 p.adjust <NA> FALSE func stats
## 3 qvalue::qvalue function(x) {; x$qvalues;} FALSE func qvalue
## 4 p.adjust <NA> FALSE func stats
## 5 p.adjust <NA> FALSE func stats
## 6 qvalue::qvalue function(x) {; x$qvalues;} FALSE func qvalue
## 7 p.adjust <NA> FALSE func stats
## 8 p.adjust <NA> FALSE func stats
## 9 qvalue::qvalue function(x) {; x$qvalues;} FALSE func qvalue
## pkg_vers param.p param.method label key value assay performanceMetric
## 1 3.5.1 pval "bonferroni" bonf TPR.1 0.125 H TPR
## 2 3.5.1 pval "BH" BH TPR.1 0.375 H TPR
## 3 2.12.0 pval <NA> qv TPR.1 0.600 H TPR
## 4 3.5.1 pval "bonferroni" bonf TPR.2 0.200 H TPR
## 5 3.5.1 pval "BH" BH TPR.2 0.550 H TPR
## 6 2.12.0 pval <NA> qv TPR.2 0.700 H TPR
## 7 3.5.1 pval "bonferroni" bonf TPR.3 0.225 H TPR
## 8 3.5.1 pval "BH" BH TPR.3 0.625 H TPR
## 9 2.12.0 pval <NA> qv TPR.3 0.900 H TPR
## alpha
## 1 0.05
## 2 0.05
## 3 0.05
## 4 0.10
## 5 0.10
## 6 0.10
## 7 0.20
## 8 0.20
## 9 0.20
As an alternative way to obtain the same data.frame as in the previous chunk, we can call the function tidyUpMetrics() on the saved results of a SummarizedBenchmark object.
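The first rows of the tidied table are shown below:

head(tidyUpMetrics(sb))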
## func post func_anon vers_src pkg_name
## 1 p.adjust <NA> FALSE func stats
## 2 p.adjust <NA> FALSE func stats
## 3 qvalue::qvalue function(x) {; x$qvalues;} FALSE func qvalue
## 4 p.adjust <NA> FALSE func stats
## 5 p.adjust <NA> FALSE func stats
## 6 qvalue::qvalue function(x) {; x$qvalues;} FALSE func qvalue
## pkg_vers param.p param.method label key value assay performanceMetric
## 1 3.5.1 pval "bonferroni" bonf TPR.1 0.125 H TPR
## 2 3.5.1 pval "BH" BH TPR.1 0.375 H TPR
## 3 2.12.0 pval <NA> qv TPR.1 0.600 H TPR
## 4 3.5.1 pval "bonferroni" bonf TPR.2 0.200 H TPR
## 5 3.5.1 pval "BH" BH TPR.2 0.550 H TPR
## 6 2.12.0 pval <NA> qv TPR.2 0.700 H TPR
## alpha
## 1 0.05
## 2 0.05
## 3 0.05
## 4 0.10
## 5 0.10
## 6 0.10
For example, the code below extracts the TPR for an alpha of 0.1 for the Bonferroni method.
tidyUpMetrics(sb) %>%
    dplyr::filter(label == "bonf", alpha == 0.1, performanceMetric == "TPR") %>%
    dplyr::select(value)
## value
## 1 0.2
In this more advanced case study, we use a simulated data set from Soneson and Robinson (2016) to demonstrate how the SummarizedBenchmark package can be used to benchmark methods for differential expression analysis. Namely, we compare the methods implemented in the DESeq2, edgeR, and limma packages. The simulated data set includes six samples, three replicates from each of two conditions. For each sample, transcript-level expression is provided as transcripts per million (TPM) values for 15,677 transcripts from human chromosome 1 (Ensembl GRCh37.71). A more complete description of the data, including the code used to generate it, is available in the Supplementary Materials of Soneson and Robinson (2016). We provide precomputed objects containing the count and ground truth data.
## A1 A2 A3 B1 B2
## ENST00000367770 1.670571 0.8064829 5.472561 58.700418 32.89297
## ENST00000367771 558.834722 458.8887676 662.352695 5.299743 20.73813
## ENST00000367772 155.881534 110.6033685 183.417201 0.000000 0.00000
## ENST00000423670 11.809207 16.4752934 10.426669 20.392491 1.26733
## ENST00000470238 96.489863 34.2755231 32.489730 785.514128 614.71261
## ENST00000286031 12.442872 13.4797855 4.781290 1.152118 20.50770
## B3
## ENST00000367770 16.648100
## ENST00000367771 14.862318
## ENST00000367772 0.000000
## ENST00000423670 7.546371
## ENST00000470238 815.123229
## ENST00000286031 5.760588
## # A tibble: 6 x 13
## transcript status logFC_cat logFC avetpm length eff_length tpm1 tpm2
## <chr> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 ENST00000… 1 [ 3.38,3… 4.17 0.225 2916 2771. 0.0237 0.427
## 2 ENST00000… 1 [ 3.38,3… 5.25 3.20 2921 2776. 6.23 0.164
## 3 ENST00000… 1 Inf Inf 0.817 3477 3332. 1.63 0
## 4 ENST00000… 1 [ 0.00, … 0.322 0.203 2077 1932. 0.181 0.226
## 5 ENST00000… 1 [ 3.38,3… 4.07 4.08 1538 1393. 0.460 7.71
## 6 ENST00000… 0 [ 0.00, … 0 0.154 4355 4210. 0.154 0.154
## # ... with 4 more variables: isopct1 <dbl>, isopct2 <dbl>, gene <chr>,
## # avetpm_cat <chr>
We begin the benchmarking process by creating our BenchDesign object with the data set. The BenchDesign can be initialized with a data.frame (as in the case study above), or more generally, with a list object. In this case study, since methods for differential expression require more than just the expression counts, e.g. the experimental design, we construct a list containing each of these inputs as a named entry.
The scaled TPM values are rounded before passing to the differential expression methods.
Here, we simply use the conditions for each sample to define the experimental design. The design matrix is stored as a data.frame, mycoldat.
mycoldat <- data.frame(condition = factor(rep(c(1, 2), each = 3)))
rownames(mycoldat) <- colnames(mycounts)
The data object for the benchmark experiment is now constructed with both the counts and the design matrix, along with some ground truth information (“status”: the true presence or absence of differential expression between conditions, and “lfc”: the expected log-fold change between conditions).
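A construction consistent with the parameter names used in the addMethod() calls below (cntdat, coldat) and the truth columns (status, lfc) would look like the following; the name truthdat for the truth table shown above is an assumption.

mydat <- list(cntdat = mycounts,
              coldat = mycoldat,
              status = truthdat$status,
              lfc = truthdat$logFC)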
As before, the BenchDesign is constructed with the data as the sole input.
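In code:

bd <- BenchDesign(mydat)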
For simplicity, we focus on comparing only the p-values returned by each method after testing for differential expression between the two conditions. However, later in this vignette, we also show how multiple metrics (p-values and log-fold change) can be compared in a single BenchDesign object.
Since each method requires running multiple steps, we write wrapper functions which return only the vector of p-values for each method.
deseq2_pvals <- function(countData, colData, design, contrast) {
dds <- DESeqDataSetFromMatrix(countData,
colData = colData,
design = design)
dds <- DESeq(dds)
res <- results(dds, contrast = contrast)
res$pvalue
}
edgeR_pvals <- function(countData, group, design) {
y <- DGEList(countData, group = group)
y <- calcNormFactors(y)
des <- model.matrix(design)
y <- estimateDisp(y, des)
fit <- glmFit(y, des)
lrt <- glmLRT(fit, coef=2)
lrt$table$PValue
}
voom_pvals <- function(countData, group, design) {
y <- DGEList(countData, group = group)
y <- calcNormFactors(y)
des <- model.matrix(design)
y <- voom(y, des)
eb <- eBayes(lmFit(y, des))
eb$p.value[, 2]
}
Next, each method is added to the BenchDesign using addMethod(), with the corresponding wrapper function passed as func. (For a review of the basic usage of addMethod(), revisit Section 2.) We again use the pipe notation for compactness.
bd <- bd %>%
addMethod(label = "deseq2",
func = deseq2_pvals,
params = rlang::quos(countData = cntdat,
colData = coldat,
design = ~condition,
contrast = c("condition", "2", "1"))
) %>%
addMethod(label = "edgeR",
func = edgeR_pvals,
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
) %>%
addMethod(label = "voom",
func = voom_pvals,
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
)
So far, none of the methods have been executed. The BenchDesign object simply serves as a container describing how the methods should be executed. The methods are applied by a simple call to buildBench(). Since the ground truth is known and available in mydat$status, it is specified via truthCols =.
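That is:

sb <- buildBench(bd, truthCols = "status")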
We can inspect the results.
## class: SummarizedBenchmark
## dim: 15677 3
## metadata(1): sessionInfo
## assays(1): status
## rownames(15677): ENST00000367770 ENST00000367771 ... ENST00000470694
## ENST00000499986
## rowData names(1): status
## colnames(3): deseq2 edgeR voom
## colData names(12): func post ... param.group label
By running the code above, the results of the three differential expression methods (edgeR, limma-voom, and DESeq2) are stored in a SummarizedBenchmark container. The next step is to define the metrics used to evaluate the performance of these three methods. This can be done with the function addPerformanceMetric(), as described in Section 2. However, the package also provides implementations of several ‘default’ metrics that are commonly used to evaluate methods. The function availableMetrics() returns a data.frame of these metrics.
## functions description
## 1 rejections Number of rejections
## 2 TPR True Positive Rate
## 3 TNR True Negative Rate
## 4 FDR False Discovery Rate (estimated)
## 5 FNR False Negative Rate
## 6 correlation Pearson correlation
## 7 sdad Standard Deviation of the Absolute Difference
## 8 hamming Hamming distance
## 9 LPnorm L_{p} norm
## 10 adjustedRandIndex Adjusted Rand Index
## requiresTruth
## 1 FALSE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
## 8 TRUE
## 9 TRUE
## 10 TRUE
For example, the predefined metrics rejections, TPR, TNR, FDR, and FNR can be added to the status assay of our object using the following code:
sb <- addPerformanceMetric( sb,
evalMetric=c("rejections", "TPR", "TNR", "FDR", "FNR"),
assay="status" )
names(performanceMetrics(sb)[["status"]])
## [1] "rejections" "TPR" "TNR" "FDR" "FNR"
Having defined the desired performance metrics, the function estimatePerformanceMetrics() will calculate these metrics for each of the three methods.
estimatePerformanceMetrics(
    sb,
    alpha = c(0.01, 0.05, 0.1, 0.2),
    tidy = TRUE) %>%
    dplyr::select(label, value, performanceMetric, alpha) %>%
    tail()
## label value performanceMetric alpha
## 55 deseq2 0.10988246 FNR 0.1
## 56 edgeR 0.09381113 FNR 0.1
## 57 voom 0.11381970 FNR 0.1
## 58 deseq2 0.09223211 FNR 0.2
## 59 edgeR 0.07986767 FNR 0.2
## 60 voom 0.08741018 FNR 0.2
Furthermore, the functions plotMethodsOverlap() and plotROC() are helpful for visualizing the performance of the different methods when the methods output q-values. plotMethodsOverlap() is a wrapper for the function upset() from the UpSetR package and helps visualize the overlaps between the hits of the different methods at a given alpha value.
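A call along these lines draws the overlap plot discussed below; the assay and alpha arguments are assumptions based on the assay name of this object.

plotMethodsOverlap(sb, assay = "status", alpha = 0.1)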
From the plot above, it is evident that a large number of transcripts are detected as differentially expressed by all three methods. There are also smaller sets of transcripts that are detected uniquely by a single method or subset of methods. Another typical way to compare the performance of different methods is the Receiver Operating Characteristic (ROC) curve. The function plotROC() takes a SummarizedBenchmark object as input and draws the ROC curves for all methods contained in it.
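A sketch of the corresponding call, with the assay argument assumed as above:

plotROC(sb, assay = "status")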
Here, we describe several additional features implemented in SummarizedBenchmark for building on the standard workflow described in the previous sections. The features are illustrated using the same differential expression example from above.
The differential expression case study described above assumed that we are interested in a single numeric vector for each method, namely a vector of p-values. These p-values are stored as the sole assay in the SummarizedBenchmark object returned by buildBench(). However, in many cases there are multiple values of interest to be compared across methods. For example, looking at the estimated log-fold changes in addition to p-values may be informative when comparing methods for differential expression.
The BenchDesign framework supports multiple assays through the post = parameter of the addMethod() call. When zero or one function is specified to post = for all methods, as in the examples above, the results are stored as a single assay. However, if post = is passed a named list of functions, separate assays are created using the names and functions in the list. Since the assay names are taken from post =, all entries in the list must be named. Furthermore, if more than one assay is desired, the post = parameter must be specified for all methods. This is strictly enforced to avoid ambiguities when combining results across methods.
To track both p-values and log-fold change values for each method, we write new wrapper functions: one per method returning the primary analysis results, plus accessor functions that extract the p-values and log-fold changes from those results.
deseq2_run <- function(countData, colData, design, contrast) {
dds <- DESeqDataSetFromMatrix(countData,
colData = colData,
design = design)
dds <- DESeq(dds)
results(dds, contrast = contrast)
}
deseq2_pv <- function(x) {
x$pvalue
}
deseq2_lfc <- function(x) {
x$log2FoldChange
}
edgeR_run <- function(countData, group, design) {
y <- DGEList(countData, group = group)
y <- calcNormFactors(y)
des <- model.matrix(design)
y <- estimateDisp(y, des)
fit <- glmFit(y, des)
glmLRT(fit, coef=2)
}
edgeR_pv <- function(x) {
x$table$PValue
}
edgeR_lfc <- function(x) {
x$table$logFC
}
voom_run <- function(countData, group, design) {
y <- DGEList(countData, group = group)
y <- calcNormFactors(y)
des <- model.matrix(design)
y <- voom(y, des)
eBayes(lmFit(y, des))
}
voom_pv <- function(x) {
x$p.value[, 2]
}
voom_lfc <- function(x) {
x$coefficients[, 2]
}
The primary wrapper function and a list of accessor functions are then passed to func = and post =, respectively.
bd <- BenchDesign(mydat) %>%
addMethod(label = "deseq2",
func = deseq2_run,
post = list(pv = deseq2_pv,
lfc = deseq2_lfc),
params = rlang::quos(countData = cntdat,
colData = coldat,
design = ~condition,
contrast = c("condition", "2", "1"))
) %>%
addMethod(label = "edgeR",
func = edgeR_run,
post = list(pv = edgeR_pv,
lfc = edgeR_lfc),
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
) %>%
addMethod(label = "voom",
func = voom_run,
post = list(pv = voom_pv,
lfc = voom_lfc),
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
)
When the BenchDesign is evaluated using buildBench(), the resulting SummarizedBenchmark is generated with two assays: "pv" and "lfc". As before, the ground truth can be specified using the truthCols = parameter. When multiple assays are used, truthCols = expects a named vector of assay-name = "column-name" pairs.
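For this example, a call along these lines produces the object summarized below:

sb <- buildBench(bd, truthCols = c(pv = "status", lfc = "lfc"))
sb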
## class: SummarizedBenchmark
## dim: 15677 3
## metadata(1): sessionInfo
## assays(2): pv lfc
## rownames: NULL
## rowData names(2): pv lfc
## colnames(3): deseq2 edgeR voom
## colData names(12): func post ... param.group label
We can verify that the two assays contain the expected values.
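For the p-values:

head(assay(sb, "pv"))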
## deseq2 edgeR voom
## [1,] 2.875541e-04 2.074073e-04 6.103023e-03
## [2,] 2.799371e-23 2.121422e-16 1.239445e-04
## [3,] 8.450676e-14 3.109331e-18 5.221157e-06
## [4,] 6.930834e-01 6.621236e-01 3.357040e-01
## [5,] 2.479616e-11 2.468951e-09 3.978989e-04
## [6,] 9.057130e-01 8.947438e-01 5.199404e-01
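And for the log-fold changes:

head(assay(sb, "lfc"))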
## deseq2 edgeR voom
## [1,] 3.7329019 3.6767779 3.5232524
## [2,] -5.3905346 -5.3773940 -5.5543949
## [3,] -9.7125453 -10.2467488 -8.2533275
## [4,] -0.4706410 -0.4547095 -1.0510456
## [5,] 3.7048724 3.7024426 3.9084940
## [6,] -0.1554938 -0.1504299 -0.7607029
The simple examples considered in this vignette were constructed to be computationally manageable with only one core. However, when working with larger data sets, running each method serially on a single machine is often undesirable. Since constructing a BenchDesign object requires no computation, the bottleneck only appears at the buildBench() step of the process. Parallelization of this step is enabled using the BiocParallel package.
By default, parallel evaluation is disabled, but it can easily be enabled by setting parallel = TRUE and optionally specifying the BPPARAM = parameter. If BPPARAM = is not specified, the default back-end will be used. The default back-end can be checked with bpparam().
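For example:

BiocParallel::bpparam()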
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bptimeout: 2592000; bpprogressbar: FALSE
## bpRNGseed:
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
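A parallel run consistent with the output below would look like the following; the object name sbp is illustrative.

sbp <- buildBench(bd, truthCols = c(pv = "status", lfc = "lfc"),
                  parallel = TRUE)
sbp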
## class: SummarizedBenchmark
## dim: 15677 3
## metadata(1): sessionInfo
## assays(2): pv lfc
## rownames: NULL
## rowData names(2): pv lfc
## colnames(3): deseq2 edgeR voom
## colData names(12): func post ... param.group label
The results, as expected, are the same as when buildBench() was called without parallelization.
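One way to check, assuming the serial and parallel objects sb and sbp from above (na.rm = TRUE handles methods that return NA p-values):

all(assay(sb, "pv") == assay(sbp, "pv"), na.rm = TRUE)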
## [1] TRUE
Details on how to specify the parallelization back-end can be found in the Introduction to BiocParallel vignette of the BiocParallel package. Parallelization of buildBench() is carried out across the set of methods specified with addMethod(). There is no benefit to specifying more cores than the number of methods.
Metadata for methods are stored in the colData() of SummarizedBenchmark objects. As mentioned above, several default metadata columns are populated in the colData() of the SummarizedBenchmark object generated by a call to buildBench(). Sometimes it may be useful to include additional metadata columns beyond the default set. While this can be accomplished manually by modifying the colData() of the SummarizedBenchmark object post hoc, method metadata can also be specified at the addMethod() step using the optional meta = parameter. The meta = parameter accepts a named list of metadata information, and each list entry is added to the colData() as a new column. To avoid collisions between metadata columns specified with meta = and the default set of columns, metadata specified using meta = is added to the colData() with meta. prefixed to the column name.
As an example, we construct a BenchDesign object again using the differential expression example. The BenchDesign is created with two methods, DESeq2 and edgeR, each specified with the optional meta = parameter. We can verify that the manually defined metadata column (meta.reason) is available in the colData() of the generated SummarizedBenchmark.
BenchDesign(mydat) %>%
addMethod(label = "deseq2",
func = deseq2_pvals,
meta = list(reason = "recommended by friend X"),
params = rlang::quos(countData = cntdat,
colData = coldat,
design = ~condition,
contrast = c("condition", "2", "1"))
) %>%
addMethod(label = "edgeR",
func = edgeR_pvals,
meta = list(reason = "recommended by friend Y"),
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
) %>%
buildBench() %>%
colData()
## DataFrame with 2 rows and 13 columns
## func post func_anon vers_src pkg_name
## <character> <character> <logical> <character> <character>
## deseq2 deseq2_pvals NA TRUE func NA
## edgeR edgeR_pvals NA TRUE func NA
## pkg_vers meta.reason param.countData param.colData
## <character> <character> <character> <character>
## deseq2 NA recommended by friend X cntdat coldat
## edgeR NA recommended by friend Y cntdat NA
## param.design param.contrast param.group
## <character> <character> <character>
## deseq2 ~condition c("condition", "2", "1") NA
## edgeR ~coldat$condition NA coldat$condition
## label
## <character>
## deseq2 deseq2
## edgeR edgeR
While all methods in this example had the meta = option specified, this is not necessary. It is completely acceptable to specify the meta = parameter for only a subset of methods.
Arguably, two of the most important pieces of metadata stored in the colData() of the SummarizedBenchmark returned by buildBench() are the relevant package name and version (pkg_name, pkg_vers). Determining the package name and version requires that the primary “workhorse” function of the method be directly specified as func = in the addMethod() call. In some cases this may not be possible, e.g. if the “workhorse” function is a wrapper, as in the differential expression example above. However, there may still be an important function for which we would like to track the package name and version. The meta = parameter can help.
The meta = parameter handles the following named list entries as special values: pkg_name, pkg_vers, and pkg_func. First, if values are specified for pkg_name and pkg_vers in meta =, these overwrite the values determined from func =. To trace the source of the pkg_name and pkg_vers information, the vers_src column of the colData is set to "meta_manual" (the default value is "func"). Alternatively, a function can be passed to meta = as pkg_func. This function is used to determine both pkg_name and pkg_vers, and takes precedence over manually specified pkg_name and pkg_vers values. If pkg_func is specified, it is included in the colData() as a new column with the same name, and the vers_src column is set to "meta_func". Note: the function should be wrapped in rlang::quo() to be properly parsed.
The following example illustrates the behavior when using either pkg_func, or pkg_name and pkg_vers, with the optional meta = parameter.
BenchDesign(mydat) %>%
addMethod(label = "deseq2",
func = deseq2_pvals,
meta = list(pkg_func = rlang::quo(DESeq2::DESeq)),
params = rlang::quos(countData = cntdat,
colData = coldat,
design = ~condition,
contrast = c("condition", "2", "1"))
) %>%
addMethod(label = "edgeR",
func = edgeR_pvals,
meta = list(pkg_name = "edgeR",
pkg_vers = as.character(packageVersion("edgeR"))),
params = rlang::quos(countData = cntdat,
group = coldat$condition,
design = ~coldat$condition)
) %>%
buildBench() %>%
colData()
## DataFrame with 2 rows and 13 columns
## func post func_anon vers_src pkg_name
## <character> <character> <logical> <character> <character>
## deseq2 deseq2_pvals NA TRUE meta_func NA
## edgeR edgeR_pvals NA TRUE meta_manual edgeR
## pkg_vers pkg_func param.countData param.colData
## <character> <character> <character> <character>
## deseq2 NA DESeq2::DESeq cntdat coldat
## edgeR 3.22.3 NA cntdat NA
## param.design param.contrast param.group
## <character> <character> <character>
## deseq2 ~condition c("condition", "2", "1") NA
## edgeR ~coldat$condition NA coldat$condition
## label
## <character>
## deseq2 deseq2
## edgeR edgeR
Modifying the definition of a method after it has been added to a BenchDesign is supported by the modifyMethod() function. The BenchDesign object created in the differential expression example above includes a method labeled deseq2. We can check the definition of this method using printMethod().
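That is:

printMethod(bd, "deseq2")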
## deseq2 -----------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "2", "1")
Suppose we wish to both flip the order of the contrast and add a metadata tag. This can be easily accomplished by passing both new parameters to modifyMethod() as a rlang::quos(..) list, similar to how they would be passed to addMethod(). If the func, post, or meta values should be modified, they should be included in the list using the special keywords bd.func =, bd.post =, or bd.meta =, as shown in the example below.
modifyMethod(bd, label = "deseq2",
params = rlang::quos(contrast = c("condition", "1", "2"),
bd.meta = list(note = "modified post hoc"))
) %>%
printMethod("deseq2")
## deseq2 -----------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## note : "modified post hoc"
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "1", "2")
Sometimes it may be desirable to completely overwrite all function parameters for a method, e.g. countData, colData, design, and contrast in the case of DESeq2. This may occur if some parameters were optional and originally specified, but are no longer necessary. All function parameters can be overwritten by specifying .overwrite = TRUE.
modifyMethod(bd, label = "deseq2",
params = rlang::quos(contrast = c("condition", "1", "2"),
bd.meta = list(note = "modified post hoc")),
.overwrite = TRUE) %>%
printMethod("deseq2")
## deseq2 -----------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## note : "modified post hoc"
## parameters:
## contrast : c("condition", "1", "2")
Notice that all parameters other than contrast = c("condition", "1", "2") have been dropped.
In addition to comparing multiple methods, a benchmark study may also involve comparing a single method across several parameter settings. The expandMethod() function takes a method already defined in the BenchDesign and expands it into multiple methods with differing parameter values. In the following example, expandMethod() is used to duplicate the DESeq2 method with only the "contrast" parameter modified.
bde <- expandMethod(bd, label = "deseq2",
onlyone = "contrast",
params = rlang::quos(deseq2_v1 = c("condition", "1", "2"),
deseq2_v2 = c("condition", "2", "2"))
)
printMethod(bde, "deseq2_v1")
## deseq2_v1 --------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "1", "2")
## deseq2_v2 --------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "2", "2")
Notice that the parameter to be modified is specified with onlyone = (denoting that “only one” parameter will be changed), and that the new method names are taken from the names of the quosure list passed to params = in the expandMethod() call. To modify more than a single parameter in the duplicated methods, the onlyone = parameter can be omitted, and the new parameter values specified to params = as a list of quosure lists. Below, both the "contrast" and meta parameters are modified in the expanded methods.
bde <- expandMethod(bd, label = "deseq2",
params = list(deseq2_v1 = rlang::quos(contrast = c("condition", "1", "2"),
meta = list(note = "flip order")),
deseq2_v2 = rlang::quos(contrast = c("condition", "2", "2"),
meta = list(note = "nonsensical order"))
)
)
printMethod(bde, "deseq2_v1")
## deseq2_v1 --------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "1", "2")
## meta : list(note = "flip order")
## deseq2_v2 --------------------------------------------------
## func:
## deseq2_run
## post:
## list(pv = deseq2_pv, lfc = deseq2_lfc)
## meta:
## parameters:
## countData : cntdat
## colData : coldat
## design : ~condition
## contrast : c("condition", "2", "2")
## meta : list(note = "nonsensical order")
After constructing a BenchDesign, it may become clear that a single method or parameter setting is no longer relevant to the comparison. In this case, the method can be easily dropped from the BenchDesign by passing the BenchDesign and the method label to dropMethod().
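For example, dropping the deseq2 method leaves the two methods shown below:

bd <- dropMethod(bd, label = "deseq2")
printMethods(bd)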
## edgeR ------------------------------------------------------
## func:
## edgeR_run
## post:
## list(pv = edgeR_pv, lfc = edgeR_lfc)
## meta:
## parameters:
## countData : cntdat
## group : coldat$condition
## design : ~coldat$condition
## voom -------------------------------------------------------
## func:
## voom_run
## post:
## list(pv = voom_pv, lfc = voom_lfc)
## meta:
## parameters:
## countData : cntdat
## group : coldat$condition
## design : ~coldat$condition
When benchmarking several methods, it is generally considered good practice to apply the methods to more than just a single data set. Under the SummarizedBenchmark framework, this naturally translates to recycling the same set of methods defined in a single BenchDesign object across multiple data sets. While the BenchDesign objects in the examples above were all initialized with a data set, this is not necessary.
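An empty design is created by calling the constructor without any data:

bdnull <- BenchDesign()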
## BenchDesign object -----------------------------------------
## benchmark data:
## NULL
## benchmark methods:
## none
As before, methods can be added to the BenchDesign with addMethod(), and the benchmark experiment run using buildBench().
bdnull <- bdnull %>%
addMethod(label = "bonf",
func = p.adjust,
params = rlang::quos(p = pval,
method = "bonferroni")) %>%
addMethod(label = "BH",
func = p.adjust,
params = rlang::quos(p = pval,
method = "BH"))
While not mentioned above, buildBench() accepts an optional data = parameter. When specified, this data set is used to run the experiment, taking precedence over any data set specified in (or missing from) the BenchDesign object.
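For example, applying the null design to the tdat data from the first case study:

buildBench(bdnull, data = tdat)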
## class: SummarizedBenchmark
## dim: 50 2
## metadata(1): sessionInfo
## assays(1): bench
## rownames: NULL
## rowData names(1): bench
## colnames(2): bonf BH
## colData names(9): func post ... param.method label
By specifying data during the buildBench() step, the exact same benchmark comparison, as defined in the common BenchDesign object, can be carried out consistently across multiple data sets. While this approach works even if the common BenchDesign object contains a default data set, it is recommended that the BenchDesign be created without any data to avoid errors if the design is reused across data sets.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
Dunn, Olive Jean. 1961. “Multiple Comparisons Among Means.” Journal of the American Statistical Association 56 (293): 52–64.
Soneson, Charlotte, and Mark D Robinson. 2016. “iCOBRA: Open, Reproducible, Standardized and Live Method Benchmarking.” Nature Methods 13 (4): 283–85. https://doi.org/10.1038/nmeth.3805.
Storey, John D. 2002. “A Direct Approach to False Discovery Rates.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (3): 479–98.