1 Overview

brendaDb aims to make importing and analyzing data from the BRENDA database easier. The main functions include:

  • Read text file downloaded from BRENDA into an R tibble
  • Retrieve information for specific enzymes
  • Query enzymes using their synonyms, gene symbols, etc.
  • Query enzyme information for specific BioCyc pathways

For bug reports or feature requests, please go to the GitHub repository.

2 Installation

brendaDb is a Bioconductor package and can be installed through BiocManager::install().

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("brendaDb", dependencies=TRUE)

Alternatively, install the development version from GitHub.

if(!requireNamespace("brendaDb")) {
  devtools::install_github("y1zhou/brendaDb")
}

After the package is installed, load it into the R session with:

library(brendaDb)

3 Getting Started

3.1 Downloading the BRENDA Text File

Download the BRENDA database as a text file from the BRENDA download page. Alternatively, download the file from within R (file updated 2019-04-24):

brenda.filepath <- DownloadBrenda()
#> Please read the license agreement in the link below.
#> 
#> https://www.brenda-enzymes.org/download_brenda_without_registration.php
#> 
#> Found zip file in cache.
#> Extracting zip file...

The function downloads the file to a local cache directory. Now the text file can be loaded into R as a tibble:

df <- ReadBrenda(brenda.filepath)
#> Reading BRENDA text file...
#> Converting text into a list. This might take a while...
#> Converting list to tibble and removing duplicated entries...
#> If you're going to use this data again, consider saving this table using data.table::fwrite().

As suggested in the function output, you may save the df object to a text file using data.table::fwrite(), or to an R data file using save() with a file argument, and load the table back using data.table::fread() or load(). (fwrite() and fread() require the R package data.table to be installed.) Both methods should be much faster than reading the raw text file again using ReadBrenda().
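
The caching round trip can be sketched with base R alone. This is a minimal illustration that uses saveRDS() and readRDS() as an alternative to the functions named above; the small data frame and its column names other than ID are hypothetical stand-ins for the table returned by ReadBrenda().

```r
# Hypothetical stand-in for the table returned by ReadBrenda();
# the column layout is an assumption made for illustration only.
df_demo <- data.frame(
  ID = c("1.1.1.1", "1.1.1.1"),
  field = c("RECOMMENDED_NAME", "SYSTEMATIC_NAME"),
  description = c("alcohol dehydrogenase", "alcohol:NAD+ oxidoreductase"),
  stringsAsFactors = FALSE
)

# Write the table to a cache file once ...
cache_file <- file.path(tempdir(), "brenda_table.rds")
saveRDS(df_demo, cache_file)

# ... and reload it in later sessions instead of re-parsing the raw text file.
df_cached <- readRDS(cache_file)
identical(df_demo, df_cached)
```

Unlike save() and load(), readRDS() returns the object instead of restoring it under its original name, so the reloaded table can be assigned to any variable.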

4 Making Queries

Since BRENDA is a database for enzymes, all final queries are based on EC numbers.

4.1 Query for Multiple Enzymes

If you already have a list of EC numbers in mind, you may call QueryBrenda directly:

brenda_txt <- system.file("extdata", "brenda_download_test.txt",
                          package = "brendaDb")
df <- ReadBrenda(brenda_txt)
#> Reading BRENDA text file...
#> Converting text into a list. This might take a while...
#> Converting list to tibble and removing duplicated entries...
#> If you're going to use this data again, consider saving this table using data.table::fwrite().
res <- QueryBrenda(df, EC = c("1.1.1.1", "6.3.5.8"), n.core = 2)

res
#> A list of 2 brenda.entry object(s) with:
#>  - 1 regular brenda.entry object(s)
#>    1.1.1.1 
#> - 1 transferred or deleted object(s)
#>    6.3.5.8

res[["1.1.1.1"]]
#> Entry 1.1.1.1
#> ├── nomenclature
#> |    ├── ec: 1.1.1.1
#> |    ├── systematic.name: alcohol:NAD+ oxidoreductase
#> |    ├── recommended.name: alcohol dehydrogenase
#> |    ├── synonyms: A tibble with 128 rows
#> |    ├── reaction: A tibble with 2 rows
#> |    └── reaction.type: A tibble with 3 rows
#> ├── interactions
#> |    ├── substrate.product: A tibble with 772 rows
#> |    ├── natural.substrate.product: A tibble with 20 rows
#> |    ├── cofactor: A tibble with 7 rows
#> |    ├── metals.ions: A tibble with 20 rows
#> |    ├── inhibitors: A tibble with 207 rows
#> |    └── activating.compound: A tibble with 22 rows
#> ├── parameters
#> |    ├── km.value: A tibble with 878 rows
#> |    ├── turnover.number: A tibble with 495 rows
#> |    ├── ki.value: A tibble with 34 rows
#> |    ├── pi.value: A tibble with 11 rows
#> |    ├── ph.optimum: A tibble with 55 rows
#> |    ├── ph.range: A tibble with 28 rows
#> |    ├── temperature.optimum: A tibble with 29 rows
#> |    ├── temperature.range: A tibble with 20 rows
#> |    ├── specific.activity: A tibble with 88 rows
#> |    └── ic50: A tibble with 2 rows
#> ├── organism
#> |    ├── organism: A tibble with 159 rows
#> |    ├── source.tissue: A tibble with 63 rows
#> |    └── localization: A tibble with 9 rows
#> ├── molecular
#> |    ├── stability
#> |    |    ├── general.stability: A tibble with 15 rows
#> |    |    ├── storage.stability: A tibble with 15 rows
#> |    |    ├── ph.stability: A tibble with 20 rows
#> |    |    ├── organic.solvent.stability: A tibble with 25 rows
#> |    |    ├── oxidation.stability: A tibble with 3 rows
#> |    |    └── temperature.stability: A tibble with 36 rows
#> |    ├── purification: A tibble with 48 rows
#> |    ├── cloned: A tibble with 46 rows
#> |    ├── engineering: A tibble with 60 rows
#> |    ├── renatured: A tibble with 1 rows
#> |    └── application: A tibble with 5 rows
#> ├── structure
#> |    ├── molecular.weight: A tibble with 119 rows
#> |    ├── subunits: A tibble with 11 rows
#> |    ├── posttranslational.modification: A tibble with 2 rows
#> |    └── crystallization: A tibble with 22 rows
#> └── bibliography
#> |    └── reference: A tibble with 285 rows

4.2 Query Specific Fields

You can also query for certain fields to reduce the size of the returned object.

ShowFields(df)
#> # A tibble: 40 × 2
#>    field                     acronym
#>    <chr>                     <chr>  
#>  1 PROTEIN                   PR     
#>  2 RECOMMENDED_NAME          RN     
#>  3 SYSTEMATIC_NAME           SN     
#>  4 SYNONYMS                  SY     
#>  5 REACTION                  RE     
#>  6 REACTION_TYPE             RT     
#>  7 SOURCE_TISSUE             ST     
#>  8 LOCALIZATION              LO     
#>  9 NATURAL_SUBSTRATE_PRODUCT NSP    
#> 10 SUBSTRATE_PRODUCT         SP     
#> # … with 30 more rows

res <- QueryBrenda(df, EC = "1.1.1.1", fields = c("PROTEIN", "SUBSTRATE_PRODUCT"))
res[["1.1.1.1"]][["interactions"]][["substrate.product"]]
#> # A tibble: 772 × 7
#>    proteinID substrate  product commentarySubstr… commentaryProdu… reversibility
#>    <chr>     <chr>      <chr>   <chr>             <chr>            <chr>        
#>  1 10        n-propano… n-prop… <NA>              <NA>             r            
#>  2 10        2-propano… aceton… <NA>              <NA>             <NA>         
#>  3 10        n-hexanol… n-hexa… <NA>              <NA>             r            
#>  4 10        (S)-2-but… 2-buta… <NA>              <NA>             r            
#>  5 10        ethylengl… ? + NA… <NA>              <NA>             r            
#>  6 10        n-butanol… butyra… <NA>              <NA>             <NA>         
#>  7 10        n-decanol… n-deca… <NA>              <NA>             r            
#>  8 10        Tris + NA… ? + NA… <NA>              <NA>             r            
#>  9 10        isopropan… aceton… <NA>              <NA>             <NA>         
#> 10 10        5-hydroxy… (furan… #10# mutant enzy… <NA>             <NA>         
#> # … with 762 more rows, and 1 more variable: refID <chr>

Note that most fields contain a fieldInfo column and a commentary column. The fieldInfo column holds what BRENDA extracted from the literature, and the commentary column usually gives some context from the original paper. In the commentary, numbers enclosed in # symbols correspond to proteinIDs, and numbers enclosed in <> are the corresponding refIDs. For further information, please see the README file from BRENDA.
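
The ID markers in a commentary can be pulled out with base R regular expressions. The sketch below assumes that IDs are digits, optionally comma-separated, inside the # and <> delimiters; the commentary string itself is made up for illustration and is not taken from BRENDA.

```r
# Hypothetical commentary string following the #proteinID# ... <refID> convention.
commentary <- "#10# mutant enzyme, at pH 7.0 <285>; #60,131# wild-type enzyme <12,34>"

# Split comma-separated matches such as "60,131" into individual IDs.
split_ids <- function(x) unlist(strsplit(x, ",", fixed = TRUE))

# Numbers between paired # symbols are protein IDs.
protein_hits <- regmatches(commentary, gregexpr("#[0-9,]+#", commentary))[[1]]
protein_ids <- split_ids(gsub("#", "", protein_hits, fixed = TRUE))

# Numbers between < and > are reference IDs.
ref_hits <- regmatches(commentary, gregexpr("<[0-9,]+>", commentary))[[1]]
ref_ids <- split_ids(gsub("[<>]", "", ref_hits))

protein_ids
ref_ids
```

The extracted reference IDs could then be matched against the refID column of the bibliography table of the same entry.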

4.3 Query Specific Organisms

Compare the row counts in the following example with those in Section 4.1, where all organisms were queried.

res <- QueryBrenda(df, EC = "1.1.1.1", organisms = "Homo sapiens")
res$`1.1.1.1`
#> Entry 1.1.1.1
#> ├── nomenclature
#> |    ├── ec: 1.1.1.1
#> |    ├── systematic.name: alcohol:NAD+ oxidoreductase
#> |    ├── recommended.name: alcohol dehydrogenase
#> |    ├── synonyms: A tibble with 41 rows
#> |    ├── reaction: A tibble with 2 rows
#> |    └── reaction.type: A tibble with 3 rows
#> ├── interactions
#> |    ├── substrate.product: A tibble with 102 rows
#> |    ├── natural.substrate.product: A tibble with 9 rows
#> |    ├── cofactor: A tibble with 2 rows
#> |    ├── metals.ions: A tibble with 2 rows
#> |    └── inhibitors: A tibble with 36 rows
#> ├── parameters
#> |    ├── km.value: A tibble with 163 rows
#> |    ├── turnover.number: A tibble with 64 rows
#> |    ├── ki.value: A tibble with 8 rows
#> |    ├── ph.optimum: A tibble with 15 rows
#> |    ├── ph.range: A tibble with 2 rows
#> |    ├── temperature.optimum: A tibble with 2 rows
#> |    └── specific.activity: A tibble with 5 rows
#> ├── organism
#> |    ├── organism: A tibble with 3 rows
#> |    ├── source.tissue: A tibble with 21 rows
#> |    └── localization: A tibble with 1 rows
#> ├── molecular
#> |    ├── stability
#> |    |    ├── general.stability: A tibble with 1 rows
#> |    |    ├── storage.stability: A tibble with 4 rows
#> |    |    ├── ph.stability: A tibble with 1 rows
#> |    |    ├── organic.solvent.stability: A tibble with 1 rows
#> |    |    └── temperature.stability: A tibble with 2 rows
#> |    ├── purification: A tibble with 7 rows
#> |    ├── cloned: A tibble with 5 rows
#> |    ├── engineering: A tibble with 3 rows
#> |    └── application: A tibble with 1 rows
#> ├── structure
#> |    ├── molecular.weight: A tibble with 12 rows
#> |    ├── subunits: A tibble with 3 rows
#> |    └── crystallization: A tibble with 2 rows
#> └── bibliography
#> |    └── reference: A tibble with 285 rows

4.4 Extract Information in Query Results

To transform the brenda.entries structure into a table, use the helper function ExtractField().

res <- QueryBrenda(df, EC = c("1.1.1.1", "6.3.5.8"), n.core = 2)
ExtractField(res, field = "parameters$ph.optimum")
#> Deprecated entries in the res object will be removed.
#> # A tibble: 158 × 9
#>    ec      organism       proteinID uniprot org.commentary description fieldInfo
#>    <chr>   <chr>          <chr>     <chr>   <chr>          <chr>       <lgl>    
#>  1 1.1.1.1 Acetobacter p… 60        <NA>    <NA>           5.5         NA       
#>  2 1.1.1.1 Acetobacter p… 60        <NA>    <NA>           6           NA       
#>  3 1.1.1.1 Acetobacter p… 60        <NA>    <NA>           8.5         NA       
#>  4 1.1.1.1 Acinetobacter… 28        <NA>    <NA>           5.9         NA       
#>  5 1.1.1.1 Aeropyrum per… 131       Q9Y9P9  <NA>           10.5        NA       
#>  6 1.1.1.1 Aeropyrum per… 131       Q9Y9P9  <NA>           8           NA       
#>  7 1.1.1.1 Arabidopsis t… 20        <NA>    <NA>           10.5        NA       
#>  8 1.1.1.1 Aspergillus n… 14        <NA>    <NA>           8.1         NA       
#>  9 1.1.1.1 Brevibacteriu… 46        <NA>    <NA>           10.4        NA       
#> 10 1.1.1.1 Brevibacteriu… 46        <NA>    <NA>           6           NA       
#> # … with 148 more rows, and 2 more variables: commentary <chr>, refID <chr>

As shown above, the returned table consists of three parts: the EC number, organism-related information (organism, protein ID, UniProt ID, and commentary on the organism), and extracted field information (description, commentary, etc.).
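
Because the description column is stored as text, numeric summaries need a conversion step first. The following sketch works on a hypothetical stand-in for the extracted table (the organism names are placeholders, and it assumes every description is a single number rather than a range).

```r
# Hypothetical stand-in for a table returned by ExtractField().
ph_demo <- data.frame(
  ec = "1.1.1.1",
  organism = c("Organism A", "Organism A", "Organism A", "Organism B"),
  description = c("5.5", "6", "8.5", "5.9"),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric before summarising.
ph_demo$ph <- as.numeric(ph_demo$description)

# Median reported pH optimum per organism.
ph_median <- tapply(ph_demo$ph, ph_demo$organism, median)
ph_median
```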

5 Foreign ID Retrieval

5.1 Querying Synonyms

Often we have a list of gene symbols or enzyme names instead of EC numbers. In this case, a helper function can be used to find the corresponding EC numbers:

ID2Enzyme(brenda = df, ids = c("ADH4", "CD38", "pyruvate dehydrogenase"))
#> # A tibble: 4 × 5
#>   ID                     EC        RECOMMENDED_NAME  SYNONYMS   SYSTEMATIC_NAME 
#>   <chr>                  <chr>     <chr>             <chr>      <chr>           
#> 1 ADH4                   1.1.1.1   <NA>              "aldehyde… <NA>            
#> 2 CD38                   2.4.99.20 <NA>              "#1,3,4,6… <NA>            
#> 3 pyruvate dehydrogenase 1.2.1.51  pyruvate dehydro… "#1,2# py… <NA>            
#> 4 pyruvate dehydrogenase 2.7.11.2  [pyruvate dehydr… "kinase (… ATP:[pyruvate d…

The EC numbers of interest can then be handpicked from the EC column and used in QueryBrenda().
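
One possible way to do the handpicking is sketched below. The data frame reproduces only the ID and EC columns of the output above, and the decision to drop 2.7.11.2 is an arbitrary example of a manual choice, not a recommendation.

```r
# Stand-in for the ID and EC columns of the ID2Enzyme() result shown above.
hits <- data.frame(
  ID = c("ADH4", "CD38", "pyruvate dehydrogenase", "pyruvate dehydrogenase"),
  EC = c("1.1.1.1", "2.4.99.20", "1.2.1.51", "2.7.11.2"),
  stringsAsFactors = FALSE
)

# Manually exclude matches that are not of interest (illustrative choice).
unwanted <- c("2.7.11.2")
ec_picked <- unique(hits$EC[!(hits$EC %in% unwanted)])
ec_picked

# With brendaDb loaded and df created by ReadBrenda(), the selection
# could then be passed on:
# res <- QueryBrenda(df, EC = ec_picked)
```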

5.2 BioCyc Pathways

Often we are interested in the enzymes involved in a specific BioCyc pathway. The functions BiocycPathwayEnzymes() and BiocycPathwayGenes() can be used in this case:

BiocycPathwayEnzymes(org.id = "HUMAN", pathway = "PWY66-400")
#> Found 10 reactions for HUMAN pathway PWY66-400.
#> # A tibble: 11 × 5
#>    RxnID                    EC       ReactionDirection     LHS       RHS        
#>    <chr>                    <chr>    <chr>                 <chr>     <chr>      
#>  1 PGLUCISOM-RXN            5.3.1.9  REVERSIBLE            D-glucop… FRUCTOSE-6P
#>  2 GLUCOKIN-RXN             2.7.1.1  LEFT-TO-RIGHT         Glucopyr… D-glucopyr…
#>  3 GLUCOKIN-RXN             2.7.1.2  LEFT-TO-RIGHT         Glucopyr… D-glucopyr…
#>  4 PEPDEPHOS-RXN            2.7.1.40 PHYSIOL-RIGHT-TO-LEFT PYRUVATE… PROTON + P…
#>  5 2PGADEHYDRAT-RXN         4.2.1.11 REVERSIBLE            2-PG      PHOSPHO-EN…
#>  6 RXN-15513                5.4.2.11 REVERSIBLE            2-PG      G3P        
#>  7 PHOSGLYPHOS-RXN          2.7.2.3  REVERSIBLE            G3P + ATP DPG + ADP  
#>  8 GAPOXNPHOSPHN-RXN        1.2.1.12 REVERSIBLE            GAP + Pi… PROTON + D…
#>  9 TRIOSEPISOMERIZATION-RXN 5.3.1.1  REVERSIBLE            GAP       DIHYDROXY-…
#> 10 F16ALDOLASE-RXN          4.1.2.13 REVERSIBLE            FRUCTOSE… DIHYDROXY-…
#> 11 6PFRUCTPHOS-RXN          2.7.1.11 LEFT-TO-RIGHT         ATP + FR… PROTON + A…
BiocycPathwayGenes(org.id = "HUMAN", pathway = "TRYPTOPHAN-DEGRADATION-1")
#> Found 17 genes in HUMAN pathway TRYPTOPHAN-DEGRADATION-1.
#> # A tibble: 17 × 4
#>    BiocycGene BiocycProtein           Symbol   Ensembl                          
#>    <chr>      <chr>                   <chr>    <chr>                            
#>  1 HS14455    HS14455-MONOMER         ACMSD    ENSG00000153086                  
#>  2 HS04229    ENSG00000118514-MONOMER ALDH8A1  ENSG00000118514                  
#>  3 HS11585    HS11585-MONOMER         DHTKD1   ENSG00000181192                  
#>  4 G66-33844  G66-33844-MONOMER       AFMID    ENSG00000183077,ENST00000409257,…
#>  5 HS04082    HS04082-MONOMER         KMO      ENSG00000117009                  
#>  6 HS03952    HS03952-MONOMER         KYNU     ENSG00000115919                  
#>  7 HS08749    HS08749-MONOMER         HAAO     ENSG00000162882                  
#>  8 HS05502    HS05502-MONOMER         IDO1     ENSG00000131203                  
#>  9 G66-37884  MONOMER66-34407         IDO2     ENSG00000188676                  
#> 10 HS07771    HS07771-MONOMER         TDO2     ENSG00000151790                  
#> 11 HS02769    HS02769-MONOMER         GCDH     ENSG00000105607                  
#> 12 HS01167    HS01167-MONOMER         ACAT1    ENSG00000075239                  
#> 13 HS04399    ENSG00000120437-MONOMER ACAT2    ENSG00000120437                  
#> 14 HS01071    HS01071-MONOMER         HSD17B10 ENSG00000072506                  
#> 15 HS06563    HS06563-MONOMER         HADH     ENSG00000138796                  
#> 16 HS01481    HS01481-MONOMER         HADHA    ENSG00000084754                  
#> 17 HS05132    HS05132-MONOMER         ECHS1    ENSG00000127884

Similarly, the EC numbers returned from BiocycPathwayEnzymes() can be used in QueryBrenda(), and the gene IDs can be used to find corresponding EC numbers with other packages such as biomaRt and clusterProfiler. Note that sometimes there are multiple Ensembl IDs in one entry.
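
Entries with several Ensembl IDs usually need to be split before they are passed to an ID-mapping package. The sketch below assumes the IDs are comma-separated, as in the output above; the second row (GENE_X and its IDs) is a made-up example.

```r
# Stand-in for the Symbol and Ensembl columns of a BiocycPathwayGenes() result.
genes_demo <- data.frame(
  Symbol = c("ACMSD", "GENE_X"),
  Ensembl = c("ENSG00000153086", "ENSG00000000001,ENST00000000002"),
  stringsAsFactors = FALSE
)

# One row per ID: split on commas and repeat the symbol accordingly.
id_list <- strsplit(genes_demo$Ensembl, ",", fixed = TRUE)
genes_long <- data.frame(
  Symbol = rep(genes_demo$Symbol, lengths(id_list)),
  Ensembl = unlist(id_list),
  stringsAsFactors = FALSE
)

# Keep gene-level IDs only (prefix ENSG), dropping transcript IDs (ENST).
gene_level <- genes_long[startsWith(genes_long$Ensembl, "ENSG"), ]
gene_level
```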

Additional Information

By default, QueryBrenda() uses all available cores, but limiting n.core often gives better performance because it reduces the parallelization overhead. The following results were produced on a machine with 40 cores (2 Intel Xeon CPU E5-2640 v4 @ 3.4GHz) and 256 GB of RAM:

EC.numbers <- head(unique(df$ID), 100)
system.time(QueryBrenda(df, EC = EC.numbers, n.core = 0))  # default
#  user  system elapsed
# 4.528   7.856  34.567
system.time(QueryBrenda(df, EC = EC.numbers, n.core = 1))
#  user  system elapsed 
# 22.080   0.360  22.438
system.time(QueryBrenda(df, EC = EC.numbers, n.core = 2))
#  user  system elapsed 
# 0.552   0.400  13.597 
system.time(QueryBrenda(df, EC = EC.numbers, n.core = 4))
#  user  system elapsed 
# 0.688   0.832   9.517
system.time(QueryBrenda(df, EC = EC.numbers, n.core = 8))
#  user  system elapsed 
# 1.112   1.476  10.000
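
The elapsed times reported above can be compared directly by computing the speedup relative to the single-core run. This small sketch only re-uses the numbers printed above; it does not re-run the benchmark.

```r
# Elapsed seconds copied from the timings above, keyed by n.core
# ("0" is the default setting).
elapsed <- c("0" = 34.567, "1" = 22.438, "2" = 13.597, "4" = 9.517, "8" = 10.000)

# Speedup relative to n.core = 1; values below 1 mean a slowdown.
speedup <- elapsed[["1"]] / elapsed
round(speedup, 2)

# Setting with the shortest elapsed time in this benchmark.
fastest <- names(elapsed)[which.min(elapsed)]
fastest
```
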
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] brendaDb_1.8.0   BiocStyle_2.22.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.7          bslib_0.3.1         compiler_4.1.1     
#>  [4] pillar_1.6.4        BiocManager_1.30.16 jquerylib_0.1.4    
#>  [7] dbplyr_2.1.1        tools_4.1.1         digest_0.6.28      
#> [10] bit_4.0.4           tibble_3.1.5        jsonlite_1.7.2     
#> [13] BiocFileCache_2.2.0 RSQLite_2.2.8       evaluate_0.14      
#> [16] memoise_2.0.0       lifecycle_1.0.1     pkgconfig_2.0.3    
#> [19] rlang_0.4.12        cli_3.0.1           DBI_1.1.1          
#> [22] filelock_1.0.2      parallel_4.1.1      curl_4.3.2         
#> [25] yaml_2.2.1          xfun_0.27           fastmap_1.1.0      
#> [28] xml2_1.3.2          httr_1.4.2          stringr_1.4.0      
#> [31] dplyr_1.0.7         knitr_1.36          rappdirs_0.3.3     
#> [34] generics_0.1.1      sass_0.4.0          vctrs_0.3.8        
#> [37] tidyselect_1.1.1    bit64_4.0.5         glue_1.4.2         
#> [40] R6_2.5.1            fansi_0.5.0         BiocParallel_1.28.0
#> [43] rmarkdown_2.11      bookdown_0.24       tidyr_1.1.4        
#> [46] purrr_0.3.4         blob_1.2.2          magrittr_2.0.1     
#> [49] ellipsis_0.3.2      htmltools_0.5.2     assertthat_0.2.1   
#> [52] utf8_1.2.2          stringi_1.7.5       cachem_1.0.6       
#> [55] crayon_1.4.1