An introduction to package goSorensen

1 Introduction

2 Installation

3 Data.

4 Performing the equivalence test

4.1 From a previously built contingency table of joint enrichment
4.2 Directly from the gene lists.
4.3 Using a bootstrap approximation.

5 Accessing specific fields

6 Other statistics related to the Sorensen-Dice dissimilarity

7 All pairwise tests (or other computations)

8 Alternative representations of contingency tables of joint enrichment

Session information

References

Appendix

Jordi Ocaña (1, *) and Pablo Flores (2, **)

(1) Department of Genetics, Microbiology and Statistics, Statistics Section, University of Barcelona
(2) Escuela Superior Politécnica de Chimborazo (ESPOCH), Facultad de Ciencias, Carrera de Estadística

(*) jocana@ub.edu
(**) p_flores@espoch.edu.ec

1 May 2024

Abstract

This vignette provides an introduction to the goSorensen package, which was built to determine equivalence between feature lists. The method is based on the Sorensen–Dice index and the joint frequencies of GO term enrichment. Starting from an introduction to the associated technique and a description of the data used, this vignette explains how to: i) perform the equivalence test from contingency tables of joint enrichment or directly from feature lists (using either a normal asymptotic or a bootstrap approximation), ii) access specific fields of the test results, such as the p-value, the upper limit of the confidence interval or the standard error, and iii) compute other statistics related to the Sorensen-Dice dissimilarity.

Package

goSorensen 1.6.0

library(goSorensen)

1 Introduction

The goal of goSorensen is to implement the equivalence test introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J. (2022) “An equivalence test between features lists, based on the Sorensen-Dice index and the joint frequencies of GO term enrichment”, BMC Bioinformatics 23:207.

Given two gene lists, $L_1$ and $L_2$ (the data), and a given set of $n$ Gene Ontology (GO) terms (the frame of reference for the biological information in these lists), the test aims to answer the following question (stated quite informally for the moment): is the dissimilarity between the biological information in both lists negligible? To measure this dissimilarity we use the Sorensen-Dice index:

$\hat d_{12} = d(L_1,L_2) = \frac{n_{10} + n_{01}}{2n_{11} + n_{10} + n_{01}}$

where $n_{11}$ is the number of GO terms (among the $n$ GO terms under consideration) that are enriched in both gene lists, $n_{10}$ is the number of GO terms enriched in $L_1$ but not in $L_2$, and $n_{01}$ is the reverse: the number enriched in $L_2$ but not in $L_1$. For completeness of notation, $n_{00}$ is the number of GO terms enriched in neither list; it does not enter the Sorensen-Dice index but is needed in some computations. Obviously, $n = n_{11} + n_{10} + n_{01} + n_{00}$.
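The dissimilarity values printed later in this vignette correspond to one minus the Sorensen-Dice similarity coefficient $2n_{11} / (2n_{11} + n_{10} + n_{01})$, that is, to $(n_{10} + n_{01}) / (2n_{11} + n_{10} + n_{01})$. As a minimal base R sketch (the helper name `sorensen_dissim` is hypothetical and is not a goSorensen function), the value can be computed by hand from the three counts, here those of the Vogelstein / sanger table shown in Section 4.1:

```r
# Sorensen-Dice dissimilarity from the joint enrichment counts.
# 'sorensen_dissim' is a hypothetical helper, not part of goSorensen.
sorensen_dissim <- function(n11, n10, n01) {
  (n10 + n01) / (2 * n11 + n10 + n01)
}

# Counts taken from the Vogelstein / sanger table of Section 4.1:
# 33 terms enriched in both lists, 9 only in the first, 1 only in the second.
d_hat <- sorensen_dissim(n11 = 33, n10 = 9, n01 = 1)
d_hat

# Equivalent form: one minus the similarity coefficient
d_alt <- 1 - (2 * 33) / (2 * 33 + 9 + 1)
d_alt
```

Note that $n_{00}$ plays no role here, and that swapping $n_{10}$ and $n_{01}$ leaves the result unchanged.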

More precisely, the above problem can be restated as follows: given a negligibility threshold $d_0$ for the Sorensen-Dice values, declaring the dissimilarity negligible corresponds to rejecting the null hypothesis $H_0: d \ge d_0$ in favor of the alternative $H_1: d < d_0$, where $d$ stands for the “true” value of the Sorensen-Dice dissimilarity. ($L_1$ and $L_2$ are samples, and the very process of declaring a GO term enriched is random, so $\hat d = d(L_1,L_2)$ is an estimate of $d$.) A slightly more precise statement of the problem is therefore: “Is the dissimilarity between the biological information in two gene lists negligible up to a degree $d_0$?”, where this information is expressed by means of the Sorensen-Dice dissimilarity, measured on the degree of coincidence and non-coincidence in GO term enrichment among a given set of GO terms.

For the moment, the reference set of GO terms must consist of all the GO terms at a given level of one GO ontology: BP, CC or MF.

2 Installation

The goSorensen package requires a working R installation (version >= 4.2.0). Installation may take a few minutes on a regular desktop or laptop. The package can be installed from Bioconductor or, using the devtools package, from GitHub; it then needs to be loaded with library(goSorensen).

To install from Bioconductor (recommended):

## Only if BiocManager is not previously installed:
install.packages("BiocManager")

## otherwise, directly:
BiocManager::install("goSorensen")

To install from GitHub:

devtools::install_github("pablof1988/goSorensen", build_vignettes = TRUE)

3 Data.

The dataset used in this vignette, allOncoGeneLists, is based on the gene lists compiled at http://www.bushmanlab.org/links/genelists, a comprehensive set of gene lists related to cancer. The package goSorensen loads this dataset by means of data(allOncoGeneLists):

data("allOncoGeneLists")

It is a “list” object of length 7. Each of its elements is a “character” vector containing the gene identifiers of a gene list related to cancer.

4 Performing the equivalence test

4.1 From a previously built contingency table of joint enrichment

Function equivTestSorensen performs the equivalence test. One possibility is to first build the mutual enrichment contingency table by means of the function buildEnrichTable and then perform the equivalence test:

data("humanEntrezIDs")
length(allOncoGeneLists)

## [1] 7

sapply(allOncoGeneLists, length)

##         atlas      cangenes           cis miscellaneous        sanger 
##           991           189           613           187           450 
##    Vogelstein       waldman 
##           419           426

# First 20 gene identifiers of gene lists Vogelstein and sanger:
allOncoGeneLists[["Vogelstein"]][1:20]

##  [1] "10006"  "25"     "27"     "23305"  "91"     "4299"   "3899"   "27125" 
##  [9] "207"    "238"    "139285" "324"    "367"    "23092"  "23365"  "8289"  
## [17] "57492"  "196528" "405"    "79058"

allOncoGeneLists[["sanger"]][1:20]

##  [1] "25"     "27"     "2181"   "57082"  "10962"  "51517"  "27125"  "10142" 
##  [9] "207"    "208"    "217"    "238"    "57714"  "324"    "23365"  "399"   
## [17] "8289"   "405"    "79058"  "171023"

# Build the enrichment contingency table between gene lists Vogelstein and 
# sanger for the MF ontology at GO level 5:
enrichTab <- buildEnrichTable(allOncoGeneLists[["Vogelstein"]],
                              allOncoGeneLists[["sanger"]],
                              geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db",
                              onto = "MF", GOLevel = 5, listNames = c("Vogelstein", "sanger"))
enrichTab

##                       Enriched in sanger
## Enriched in Vogelstein TRUE FALSE
##                  TRUE    33     9
##                  FALSE    1  2080

# Equivalence test for an equivalence (or negligibility) limit 0.2857
testResult <- equivTestSorensen(enrichTab, d0 = 0.2857)
testResult

## 
##  Normal asymptotic test for 2x2 contingency tables based on the
##  Sorensen-Dice dissimilarity
## 
## data:  enrichTab
## (d - d0) / se = -3.6928, p-value = 0.0001109
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
##  0.0000000 0.2002274
## sample estimates:
## Sorensen dissimilarity 
##              0.1315789 
## attr(,"se")
## standard error 
##      0.0417353

4.2 Directly from the gene lists.

It is also possible to perform the test directly from the gene lists, in which case the contingency table is built internally:

equivTestSorensen(allOncoGeneLists[["Vogelstein"]], allOncoGeneLists[["sanger"]],
                  d0 = 0.2857,
                  geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db",
                  onto = "MF", GOLevel = 5, listNames = c("Vogelstein", "sanger"))

## 
##  Normal asymptotic test for 2x2 contingency tables based on the
##  Sorensen-Dice dissimilarity
## 
## data:  tab
## (d - d0) / se = -3.6928, p-value = 0.0001109
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
##  0.0000000 0.2002274
## sample estimates:
## Sorensen dissimilarity 
##              0.1315789 
## attr(,"se")
## standard error 
##      0.0417353

To save computing time, the first option (building the contingency table separately, beforehand) may be preferable: buildEnrichTable may take some time because it performs many enrichment tests, so it is advantageous to have the contingency table ready for further computations.

The above tests use a standard normal approximation to the sampling distribution of the $(\hat d - d) / \widehat {se}$ statistic, where $\widehat {se}$ stands for the standard error of the sample dissimilarity, $\hat d$.
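As a rough check of this normal approximation, the test statistic, the p-value and the upper confidence limit printed above can be recomputed by hand from the printed estimate and standard error. The sketch below uses only base R; the variable names are illustrative and the inputs are the rounded values shown in the output, so the results agree with the printed ones only up to rounding:

```r
# Values copied from the test output above (rounded as printed)
d_hat <- 0.1315789   # Sorensen dissimilarity
se    <- 0.0417353   # standard error
d0    <- 0.2857      # equivalence (negligibility) limit
conf_level <- 0.95

# One-sided test of H0: d >= d0 against H1: d < d0
z       <- (d_hat - d0) / se                # test statistic
p_value <- pnorm(z)                         # lower-tail normal probability
d_upper <- d_hat + qnorm(conf_level) * se   # upper limit of the one-sided CI

round(c(statistic = z, p.value = p_value, upper = d_upper), 7)
```

Equivalence is declared at the 5% level when the p-value is below 0.05 or, equivalently, when the upper confidence limit lies below $d_0$.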

4.3 Using a bootstrap approximation.

Alternatively, it is possible to estimate this distribution by means of bootstrap:

boot.testResult <- equivTestSorensen(enrichTab, d0 = 0.2857, boot = TRUE)
boot.testResult

## 
##  Bootstrap test for 2x2 contingency tables based on the Sorensen-Dice
##  dissimilarity (10000 bootstrap replicates)
## 
## data:  enrichTab
## (d - d0) / se = -3.6928, p-value = 0.006699
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
##  0.0000000 0.2213802
## sample estimates:
## Sorensen dissimilarity 
##              0.1315789 
## attr(,"se")
## standard error 
##      0.0417353

When the contingency table contains low frequencies, the bootstrap approach is more conservative and is preferable, since it controls the type I error better.
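To convey the idea behind a bootstrap approximation of this kind, the following base R sketch implements a generic studentized ("bootstrap-t") scheme on the Vogelstein / sanger table. It is an illustration under stated assumptions, not the algorithm implemented in goSorensen: the resampling scheme (multinomial resampling of the full table), the standard error formula, the number of replicates and all names are assumptions, so its results will not reproduce the package output exactly.

```r
set.seed(123)                    # for reproducibility of this sketch only
tab <- c(33, 9, 1, 2080)         # n11, n10, n01, n00 (order assumed here)
d0  <- 0.2857
B   <- 2000                      # number of bootstrap replicates (assumed)

# Dissimilarity and a delta-method standard error from a count vector
d_fun  <- function(x) (x[2] + x[3]) / (2 * x[1] + x[2] + x[3])
se_fun <- function(x) {
  n   <- sum(x[1:3])
  p11 <- x[1] / n
  2 * sqrt(p11 * (1 - p11) / (n - 1)) / (1 + p11)^2
}

d_hat  <- d_fun(tab)
se_hat <- se_fun(tab)
t_obs  <- (d_hat - d0) / se_hat

# Resample whole contingency tables from the observed proportions
resamples <- rmultinom(B, size = sum(tab), prob = tab / sum(tab))
t_star <- apply(resamples, 2, function(x) (d_fun(x) - d_hat) / se_fun(x))

# Drop degenerate resamples (zero or undefined standard error)
t_star <- t_star[is.finite(t_star)]
eff_B  <- length(t_star)         # effective number of resamples

# Bootstrap p-value and upper confidence limit (95%)
p_boot  <- (1 + sum(t_star <= t_obs)) / (eff_B + 1)
d_upper <- d_hat - quantile(t_star, probs = 0.05, names = FALSE) * se_hat

c(p.value = p_boot, upper = d_upper, eff.nboot = eff_B)
```

Because the bootstrap distribution of the studentized statistic is typically skewed when frequencies are low, its lower quantile tends to be further from zero than the normal one, which would make the resulting upper limit larger than the asymptotic one.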

5 Accessing specific fields

To access specific fields of the test result:

getDissimilarity(testResult)

## Sorensen dissimilarity 
##              0.1315789 
## attr(,"se")
## standard error 
##      0.0417353

getSE(testResult)

## standard error 
##      0.0417353

getPvalue(testResult)

##      p-value 
## 0.0001108896

getTable(testResult)

##                       Enriched in sanger
## Enriched in Vogelstein TRUE FALSE
##                  TRUE    33     9
##                  FALSE    1  2080

getUpper(testResult)

##    dUpper 
## 0.2002274

# In the bootstrap approach, only these differ:
getPvalue(boot.testResult)

##    p-value 
## 0.00669933

getUpper(boot.testResult)

##    dUpper 
## 0.2213802

# (Only available for bootstrap tests) effective number of bootstrap resamples:
getNboot(boot.testResult)

## [1] 10000

6 Other statistics related to the Sorensen-Dice dissimilarity

Sometimes it is of interest to compute other statistics related to the Sorensen-Dice dissimilarity without performing the full equivalence test:

# The dissimilarity:
dSorensen(enrichTab)

## [1] 0.1315789

# Or from scratch, directly from both gene lists:
dSorensen(allOncoGeneLists[["Vogelstein"]], allOncoGeneLists[["sanger"]],
          geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db",
          onto = "MF", GOLevel = 5, listNames = c("Vogelstein", "sanger"))

## [1] 0.1315789

# The first option is faster: it avoids internally building the enrichment
# contingency table

# Its standard error:
seSorensen(enrichTab)

## [1] 0.0417353

# or:
seSorensen(allOncoGeneLists[["Vogelstein"]], allOncoGeneLists[["sanger"]],
           geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db",
           onto = "MF", GOLevel = 5, listNames = c("Vogelstein", "sanger"))

## [1] 0.0417353

# Upper limit of the confidence interval for the true distance:
duppSorensen(enrichTab)

## [1] 0.2002274

duppSorensen(enrichTab, conf.level = 0.90)

## [1] 0.1850649

duppSorensen(enrichTab, conf.level = 0.90, boot = TRUE)

## [1] 0.1956352
## attr(,"eff.nboot")
## [1] 10000

duppSorensen(allOncoGeneLists[["Vogelstein"]], allOncoGeneLists[["sanger"]],
             geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db",
             onto = "MF", GOLevel = 5, listNames = c("Vogelstein", "sanger"))

## [1] 0.2002274
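The standard errors printed in this vignette can be reproduced by hand with a delta-method formula that depends only on $n_{11}$, $n_{10}$ and $n_{01}$. The base R sketch below is an assumption checked against the values printed here, not a transcription of the package source, and the helper name `se_sketch` is hypothetical. It writes the dissimilarity as $(1 - p_{11}) / (1 + p_{11})$, with $p_{11} = n_{11} / (n_{11} + n_{10} + n_{01})$, and propagates the binomial variability of $p_{11}$:

```r
# Delta-method standard error of the Sorensen-Dice dissimilarity.
# 'se_sketch' is a hypothetical helper, not part of goSorensen.
se_sketch <- function(n11, n10, n01) {
  n   <- n11 + n10 + n01            # GO terms enriched in at least one list
  p11 <- n11 / n                    # proportion of them enriched in both lists
  # d = (1 - p11) / (1 + p11), hence |dd/dp11| = 2 / (1 + p11)^2
  2 * sqrt(p11 * (1 - p11) / (n - 1)) / (1 + p11)^2
}

# Vogelstein / sanger table (33, 9, 1): compare with seSorensen(enrichTab)
se_vs <- se_sketch(n11 = 33, n10 = 9, n01 = 1)
se_vs

# Upper confidence limits at two confidence levels: compare with duppSorensen()
d_vs <- (9 + 1) / (2 * 33 + 9 + 1)
upper_95 <- d_vs + qnorm(0.95) * se_vs
upper_90 <- d_vs + qnorm(0.90) * se_vs
c(upper_95 = upper_95, upper_90 = upper_90)
```

Lowering the confidence level from 0.95 to 0.90 shrinks the normal quantile and therefore the upper limit, as in the duppSorensen output above.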

7 All pairwise tests (or other computations)

When given an object of class list, all these functions (equivTestSorensen, dSorensen, seSorensen, duppSorensen) assume that it is a list of character vectors containing gene identifiers, and perform all pairwise computations. For example, to obtain the matrix of all pairwise Sorensen-Dice dissimilarities:

dSorensen(allOncoGeneLists, onto = "MF", GOLevel = 5, 
          geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db")

##                   atlas cangenes       cis miscellaneous    sanger Vogelstein
## atlas         0.0000000        1 0.7500000     0.5000000 0.3658537  0.3555556
## cangenes      1.0000000        0 1.0000000     1.0000000 1.0000000  1.0000000
## cis           0.7500000        1 0.0000000     0.6875000 0.7142857  0.7200000
## miscellaneous 0.5000000        1 0.6875000     0.0000000 0.4482759  0.5151515
## sanger        0.3658537        1 0.7142857     0.4482759 0.0000000  0.1315789
## Vogelstein    0.3555556        1 0.7200000     0.5151515 0.1315789  0.0000000
## waldman       0.3793103        1 0.7446809     0.3015873 0.4520548  0.4814815
##                 waldman
## atlas         0.3793103
## cangenes      1.0000000
## cis           0.7446809
## miscellaneous 0.3015873
## sanger        0.4520548
## Vogelstein    0.4814815
## waldman       0.0000000
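To make the structure of such a matrix concrete, the toy base R sketch below computes all pairwise dissimilarities from sets of enriched GO terms. Everything in it is hypothetical (the term labels, the list names and the helper `pair_d`), and it skips the enrichment analysis that goSorensen performs internally; it only illustrates how each entry arises from $n_{11}$, $n_{10}$ and $n_{01}$:

```r
# Hypothetical sets of enriched GO terms for three gene lists
enriched <- list(
  A = c("t1", "t2", "t3"),
  B = c("t2", "t3", "t4"),
  C = character(0)          # a list with no enriched term at all
)

# Dissimilarity between two sets of enriched terms
pair_d <- function(a, b) {
  n11 <- length(intersect(a, b))   # enriched in both
  n10 <- length(setdiff(a, b))     # enriched only in the first
  n01 <- length(setdiff(b, a))     # enriched only in the second
  (n10 + n01) / (2 * n11 + n10 + n01)
}

# All pairwise values arranged as a symmetric matrix
dmat <- sapply(enriched, function(a) sapply(enriched, function(b) pair_d(a, b)))
diag(dmat) <- 0                    # a list compared with itself
dmat
```

A dissimilarity of exactly 1, as in the cangenes row and column above, arises whenever $n_{11} = 0$, that is, when no GO term is enriched in both lists.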

Similarly, the following code performs all pairwise tests:

allTests <- equivTestSorensen(allOncoGeneLists, d0 = 0.2857, 
                              onto = "MF", GOLevel = 5, 
                              geneUniverse = humanEntrezIDs, 
                              orgPackg = "org.Hs.eg.db")
getPvalue(allTests)

##           cangenes.atlas.p-value                cis.atlas.p-value 
##                              NaN                     0.9999999990 
##             cis.cangenes.p-value      miscellaneous.atlas.p-value 
##                              NaN                     0.9983684719 
##   miscellaneous.cangenes.p-value        miscellaneous.cis.p-value 
##                              NaN                     0.9998940135 
##             sanger.atlas.p-value          sanger.cangenes.p-value 
##                     0.8993421977                              NaN 
##               sanger.cis.p-value     sanger.miscellaneous.p-value 
##                     0.9999981736                     0.9795225611 
##         Vogelstein.atlas.p-value      Vogelstein.cangenes.p-value 
##                     0.8808625104                              NaN 
##           Vogelstein.cis.p-value Vogelstein.miscellaneous.p-value 
##                     0.9999998726                     0.9986487586 
##        Vogelstein.sanger.p-value            waldman.atlas.p-value 
##                     0.0001108896                     0.9356673786 
##         waldman.cangenes.p-value              waldman.cis.p-value 
##                              NaN                     0.9999999660 
##    waldman.miscellaneous.p-value           waldman.sanger.p-value 
##                     0.5940131945                     0.9905499647 
##       waldman.Vogelstein.p-value 
##                     0.9979674053

getDissimilarity(allTests, simplify = FALSE)

##                   atlas cangenes       cis miscellaneous    sanger Vogelstein
## atlas         0.0000000        1 0.7500000     0.5000000 0.3658537  0.3555556
## cangenes      1.0000000        0 1.0000000     1.0000000 1.0000000  1.0000000
## cis           0.7500000        1 0.0000000     0.6875000 0.7142857  0.7200000
## miscellaneous 0.5000000        1 0.6875000     0.0000000 0.4482759  0.5151515
## sanger        0.3658537        1 0.7142857     0.4482759 0.0000000  0.1315789
## Vogelstein    0.3555556        1 0.7200000     0.5151515 0.1315789  0.0000000
## waldman       0.3793103        1 0.7446809     0.3015873 0.4520548  0.4814815
##                 waldman
## atlas         0.3793103
## cangenes      1.0000000
## cis           0.7446809
## miscellaneous 0.3015873
## sanger        0.4520548
## Vogelstein    0.4814815
## waldman       0.0000000

8 Alternative representations of contingency tables of joint enrichment

Besides objects of class table, character and list, the functions equivTestSorensen, dSorensen, seSorensen and duppSorensen also accept contingency tables represented as a plain matrix or a numeric vector:

enrichMat <- matrix(c(20, 1, 9, 2149), nrow = 2)
enrichMat

##      [,1] [,2]
## [1,]   20    9
## [2,]    1 2149

dSorensen(enrichMat)

## [1] 0.2

enrichVec <- c(20, 1, 9, 2149)
equivTestSorensen(enrichVec)

## 
##  Normal asymptotic test for 2x2 contingency tables based on the
##  Sorensen-Dice dissimilarity
## 
## data:  enrichVec
## (d - d0) / se = -3.8784, p-value = 5.257e-05
## alternative hypothesis: true equivalence limit d0 is less than 0.4444444
## 95 percent confidence interval:
##  0.0000000 0.3036703
## sample estimates:
## Sorensen dissimilarity 
##                    0.2 
## attr(,"se")
## standard error 
##     0.06302709

equivTestSorensen(enrichVec, boot = TRUE)

## 
##  Bootstrap test for 2x2 contingency tables based on the Sorensen-Dice
##  dissimilarity (10000 bootstrap replicates)
## 
## data:  enrichVec
## (d - d0) / se = -3.8784, p-value = 0.0044
## alternative hypothesis: true equivalence limit d0 is less than 0.4444444
## 95 percent confidence interval:
##  0.0000000 0.3255704
## sample estimates:
## Sorensen dissimilarity 
##                    0.2 
## attr(,"se")
## standard error 
##     0.06302709

len3Vec <- c(20, 1, 9)
dSorensen(len3Vec)

## [1] 0.2

seSorensen(len3Vec)

## [1] 0.06302709

duppSorensen(len3Vec)

## [1] 0.3036703

# Error, bootstrapping requires the full (4 values) contingency table:
try(duppSorensen(len3Vec, boot = TRUE), TRUE)
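The relation between these representations can be checked in base R. The sketch below assumes, by analogy with the labelled table of Section 4.1, that rows refer to enrichment in the first list and columns to enrichment in the second; since the statistics are symmetric in $n_{10}$ and $n_{01}$, this assumption does not affect the results:

```r
enrichVec <- c(20, 1, 9, 2149)
enrichMat <- matrix(enrichVec, nrow = 2)   # filled column by column

# Cell by cell (row/column convention assumed as in Section 4.1)
n11 <- enrichMat[1, 1]   # enriched in both lists
n01 <- enrichMat[2, 1]   # enriched only in the second list
n10 <- enrichMat[1, 2]   # enriched only in the first list
n00 <- enrichMat[2, 2]   # enriched in neither list

# The matrix and the length-4 vector hold the same counts in the same order
same_counts <- identical(as.vector(enrichMat), enrichVec)
same_counts

# The dissimilarity needs only the first three counts
d_from_three <- (n10 + n01) / (2 * n11 + n10 + n01)
d_from_three
```

This is consistent with the length-3 vector being enough for dSorensen, seSorensen and the asymptotic duppSorensen, whereas resampling whole contingency tables also needs $n_{00}$.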

Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()

## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] goSorensen_1.6.0 BiocStyle_2.32.0
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2               gson_0.1.0              shadowtext_0.1.3       
##   [4] gridExtra_2.3           rlang_1.1.3             magrittr_2.0.3         
##   [7] DOSE_3.30.0             compiler_4.4.0          RSQLite_2.3.6          
##  [10] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
##  [13] stringr_1.5.1           pkgconfig_2.0.3         crayon_1.5.2           
##  [16] fastmap_1.1.1           XVector_0.44.0          ggraph_2.2.1           
##  [19] utf8_1.2.4              HDO.db_0.99.1           rmarkdown_2.26         
##  [22] enrichplot_1.24.0       UCSC.utils_1.0.0        purrr_1.0.2            
##  [25] bit_4.0.5               xfun_0.43               zlibbioc_1.50.0        
##  [28] cachem_1.0.8            aplot_0.2.2             GenomeInfoDb_1.40.0    
##  [31] jsonlite_1.8.8          blob_1.2.4              BiocParallel_1.38.0    
##  [34] tweenr_2.0.3            parallel_4.4.0          R6_2.5.1               
##  [37] RColorBrewer_1.1-3      bslib_0.7.0             stringi_1.8.3          
##  [40] jquerylib_0.1.4         GOSemSim_2.30.0         Rcpp_1.0.12            
##  [43] bookdown_0.39           knitr_1.46              goProfiles_1.66.0      
##  [46] IRanges_2.38.0          Matrix_1.7-0            splines_4.4.0          
##  [49] igraph_2.0.3            tidyselect_1.2.1        qvalue_2.36.0          
##  [52] yaml_2.3.8              viridis_0.6.5           codetools_0.2-20       
##  [55] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
##  [58] treeio_1.28.0           Biobase_2.64.0          withr_3.0.0            
##  [61] KEGGREST_1.44.0         evaluate_0.23           CompQuadForm_1.4.3     
##  [64] gridGraphics_0.5-1      scatterpie_0.2.2        polyclip_1.10-6        
##  [67] Biostrings_2.72.0       ggtree_3.12.0           pillar_1.9.0           
##  [70] BiocManager_1.30.22     stats4_4.4.0            clusterProfiler_4.12.0 
##  [73] ggfun_0.1.4             generics_0.1.3          S4Vectors_0.42.0       
##  [76] ggplot2_3.5.1           tidytree_0.4.6          munsell_0.5.1          
##  [79] scales_1.3.0            glue_1.7.0              lazyeval_0.2.2         
##  [82] tools_4.4.0             data.table_1.15.4       fgsea_1.30.0           
##  [85] fs_1.6.4                graphlayouts_1.1.1      fastmatch_1.1-4        
##  [88] tidygraph_1.3.1         cowplot_1.1.3           grid_4.4.0             
##  [91] ape_5.8                 tidyr_1.3.1             AnnotationDbi_1.66.0   
##  [94] colorspace_2.1-0        nlme_3.1-164            patchwork_1.2.0        
##  [97] GenomeInfoDbData_1.2.12 ggforce_0.4.2           cli_3.6.2              
## [100] fansi_1.0.6             viridisLite_0.4.2       dplyr_1.1.4            
## [103] gtable_0.3.5            yulab.utils_0.1.4       sass_0.4.9             
## [106] digest_0.6.35           BiocGenerics_0.50.0     ggplotify_0.1.2        
## [109] ggrepel_0.9.5           org.Hs.eg.db_3.19.1     farver_2.1.1           
## [112] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
## [115] httr_1.4.7              GO.db_3.19.1            bit64_4.0.5            
## [118] MASS_7.3-60.2

References

Appendix

Flores, P., Salicrú, M., Sánchez-Pla, A. et al. An equivalence test between features lists, based on the Sorensen–Dice index and the joint frequencies of GO term enrichment. BMC Bioinformatics 23, 207 (2022). https://doi.org/10.1186/s12859-022-04739-2