1 Introduction

biodbUniprot is a biodb extension package that implements a connector to the UniProt database.

The UniProt Knowledge Base (Consortium 2016) can be searched using its query web service.

This vignette shows how to query this web service using the package.

2 Installation

Install using Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbUniprot')

3 Initialization

The first step in using biodbUniprot is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:

mybiodb <- biodb::newInst()

During this step the configuration is set up, the cache system is initialized and extension packages are loaded.

We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.

4 Creating a connector to the UniProt database

In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbUniprot implements a connector to a remote database. Here is the code to instantiate a connector:

conn <- mybiodb$getFactory()$createConn('uniprot')
## Loading required package: biodbUniprot

5 Getting entries

To download entries, call the getEntry() method, which returns a list of BiodbEntry objects:

entries <- conn$getEntry(c('P01011', 'P09237'))

To print the information contained in the entry objects as a data frame, run the entriesToDataframe() method attached to the BiodbMain instance:

mybiodb$entriesToDataframe(entries)
##   accession               gene.symbol kegg.genes.id molecular.mass
## 1    P01011 SERPINA3;AACT;GIG24;GIG25        hsa:12          47651
## 2    P09237          MMP7;MPSL1;PUMP1      hsa:4316          29677
##                                                                                                                            name
## 1 AACT_HUMAN;Alpha-1-antichymotrypsin;Cell growth-inhibiting gene 24/25 protein;Serpin A3;Alpha-1-antichymotrypsin His-Pro-less
## 2                             MMP7_HUMAN;Matrilysin;Matrin;Matrix metalloproteinase-7;Pump-1 protease;Uterine metalloproteinase
##   ncbi.gene.id
## 1           12
## 2         4316
##                                                                                                                                                                                                                                                                                                                                                                                                                                    aa.seq
## 1 MERMLPLLALGLLAAGFCPAVLCHPNSPLDEENLTQENQDRGTHVDLGLASANVDFAFSLYKQLVLKAPDKNVIFSPLSISTALAFLSLGAHNTTLTEILKGLKFNLTETSEAEIHQSFQHLLRTLNQSSDELQLSMGNAMFVKEQLSLLDRFTEDAKRLYGSEAFATDFQDSAAAKKLINDYVKNGTRGKITDLIKDLDSQTMMVLVNYIFFKAKWEMPFDPQDTHQSRFYLSKKKWVMVPMMSLHHLTIPYFRDEELSCTVVELKYTGNASALFILPDQDKMEEVEAMLLPETLKRWRDSLEFREIGELYLPKFSISRDYNLNDILLQLGIEEAFTSKADLSGITGARNLAVSQVVHKAVLDVFEEGTEASAATAVKITLLSALVETRTIVRFNRPFLMIIVPTDTQNIFFMSKVTNPKQA
## 2                                                                                                                                                             MRLTVLCAVCLLPGSLALPLPQEAGGMSELQWEQAQDYLKRFYLYDSETKNANSLEAKLKEMQKFFGLPITGMLNSRVIEIMQKPRCGVPDVAEYSLFPNSPKWTSKVVTYRIVSYTRDLPHITVDRLVSKALNMWGKEIPLHFRKVVWGTADIMIGFARGAHGDSYPFDGPGNTLAHAFAPGTGLGGDAHFDEDERWTDGSSLGINFLYAATHELGHSLGMGHSSDPNAVMYPTYGNGDPQNFKLSQDDIKGIQKLYGKRSNSRKK
##   aa.seq.length uniprot.id        ec expasy.enzyme.id
## 1           423     P01011      <NA>             <NA>
## 2           267     P09237 3.4.24.23        3.4.24.23
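In the data frame above, fields that hold several values, such as gene.symbol or name, are collapsed into a single string with ; between the values. The following base R sketch splits such a column back into one character vector per entry. The data frame is rebuilt by hand from the output shown above so that the snippet runs offline, and the ; separator is an assumption drawn from that output.

```r
# Two columns copied from the output above (hand-built, no network access).
df <- data.frame(
    accession = c('P01011', 'P09237'),
    gene.symbol = c('SERPINA3;AACT;GIG24;GIG25', 'MMP7;MPSL1;PUMP1'),
    stringsAsFactors = FALSE
)

# Split each collapsed value on ';' (assumed separator) and name the
# resulting list by accession.
symbols <- strsplit(df$gene.symbol, ';', fixed = TRUE)
names(symbols) <- df$accession

# Gene symbols of one entry, as a character vector.
symbols[['P09237']]
```

The last expression returns the three gene symbols listed for P09237.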

6 Using the query web service

The wsQuery() method sends a request to the query web service and parses its output.

To get the raw results returned by the UniProt server, run the following code:

conn$wsQuery('reviewed:yes AND organism:9606', columns=c('id', 'entry name'),
    limit=2, retfmt='plain')
## [1] "Entry\tEntry name\nQ00266\tMETK1_HUMAN\nQ8NB16\tMLKL_HUMAN\n"

The first parameter is the query itself, as required by the web service. To learn how to write a query for UniProt, see a description of the query web service at http://www.uniprot.org/help/api_queries.

The columns parameter lists the fields you want back for each entry returned by the database.

The limit parameter sets the maximum number of entries the server may return.

The retfmt parameter controls the type of output desired. Here "plain" states that we want the raw output from the server.
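The raw string shown above is tab-separated text with a header line. As an illustration only (this is not necessarily how biodb parses it internally), such a string can be read into a data frame with base R. The string below is copied from the output above.

```r
# Raw server output copied from the example above.
raw <- "Entry\tEntry name\nQ00266\tMETK1_HUMAN\nQ8NB16\tMLKL_HUMAN\n"

# read.delim() expects tab-separated text with a header by default;
# check.names = FALSE keeps the column name "Entry name" unchanged.
df <- read.delim(text = raw, check.names = FALSE, stringsAsFactors = FALSE)

# Accession numbers are in the "Entry" column.
df$Entry
```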

To get the output parsed by biodb and get a data frame, run:

conn$wsQuery('reviewed:yes AND organism:9606', columns=c('id', 'entry name'),
    limit=2, retfmt='parsed')
##    Entry  Entry name
## 1 Q00266 METK1_HUMAN
## 2 Q8NB16  MLKL_HUMAN

To get only the list of UniProt identifiers, run:

conn$wsQuery('reviewed:yes AND organism:9606', columns=c('id', 'entry name'),
    limit=2, retfmt='ids')
## [1] "Q00266" "Q8NB16"

And if you are curious to see the URL request that is sent to the server, run:

conn$wsQuery('reviewed:yes AND organism:9606', columns=c('id', 'entry name'),
    limit=2, retfmt='request')
## Biodb request object on https://www.uniprot.org/uniprot/?query=reviewed:yes%20AND%20organism:9606&columns=id,entry%20name&format=tab&limit=2
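To make the mapping between the wsQuery() arguments and the URL explicit, the sketch below reassembles the same URL with base R. It only mirrors the request printed above and is not biodb's own code; the endpoint and parameter names are taken from that output and may differ for other versions of the package or of the UniProt service.

```r
query <- 'reviewed:yes AND organism:9606'
columns <- c('id', 'entry name')
limit <- 2

# Percent-encode characters such as spaces; ':' and ',' are left untouched
# by URLencode() with its default reserved = FALSE.
url <- paste0(
    'https://www.uniprot.org/uniprot/',
    '?query=', utils::URLencode(query),
    '&columns=', utils::URLencode(paste(columns, collapse = ',')),
    '&format=tab',
    '&limit=', limit
)
url
```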

7 Conversion of gene symbols to UniProt IDs

The method geneSymbolToUniprotIds() uses wsQuery() to search for UniProt entries that reference particular gene symbols.

For instance, if you want to get the UniProt entries that have the gene symbol G-CSF, just run:

ids <- conn$geneSymbolToUniprotIds('G-CSF')
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
##    accession gene.symbol
## 1     Q9GJU0       G-CSF
## 2     Q4H432  GCSF;G-CSF
## 3     Q8MKE0       G-CSF
## 4 A0A679AQ73       g-csf

If you also want to match GCSF (without the hyphen), run:

ids <- conn$geneSymbolToUniprotIds('G-CSF', ignore.nonalphanum=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
##    accession        gene.symbol
## 1     P09919 CSF3;C17orf33;GCSF
## 2     P35833          CSF3;GCSF
## 3     B8ZHI7    csf3a;csf3;gcsf
## 4 A0A2Z6I9R9               GCSF
## 5     Q9GJU0              G-CSF
## 6     Q4H432         GCSF;G-CSF
## 7     Q8MKE0              G-CSF
## 8 A0A679AQ73              g-csf

If you want to match G-CSFa2 too, run:

ids <- conn$geneSymbolToUniprotIds('G-CSF', partial.match=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
##     accession gene.symbol
## 1      Q9GJU0       G-CSF
## 2      C0STS3     G-CSF 1
## 3      C0STS2     G-CSF 2
## 4      Q4H432  GCSF;G-CSF
## 5      Q8MKE0       G-CSF
## 6  A0A3G2Y303     G-CSFa2
## 7  A0A679AQ73       g-csf
## 8  A0A3G2Y5J9     G-CSFb1
## 9  A0A3G2Y5T6     G-CSFa1
## 10 A0A3G2Y4F6     G-CSFb2

This method works by running wsQuery() to get a first set of entry identifiers, then downloading each entry and filtering them. Downloading the entries may take a long time: wsQuery() may return thousands of entries, each entry is downloaded with a separate request, and the request frequency is limited to 3 requests per second. Entries already in the cache or in memory are not downloaded again, so running the same request a second time is faster, as is usually the case with biodb.
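The filtering step can be pictured with a small base R function. This is a hypothetical sketch inferred from the three outputs above, not the actual biodb implementation: matchesSymbol() and its internals are illustrative names, and the case-insensitive comparison is an assumption based on the entry g-csf being matched.

```r
# Hypothetical filter: does the ';'-separated gene.symbol field of an entry
# match the requested symbol?
matchesSymbol <- function(field, symbol, ignore.nonalphanum = FALSE,
                          partial.match = FALSE) {
    normalise <- function(x) {
        x <- tolower(x)                       # assumed: case is ignored
        if (ignore.nonalphanum)
            x <- gsub('[^a-z0-9]', '', x)     # drop '-', spaces, etc.
        x
    }
    candidates <- normalise(strsplit(field, ';', fixed = TRUE)[[1]])
    target <- normalise(symbol)
    if (partial.match)
        any(grepl(target, candidates, fixed = TRUE))  # substring match
    else
        any(candidates == target)                     # whole-symbol match
}

matchesSymbol('CSF3;GCSF', 'G-CSF')                             # exact
matchesSymbol('CSF3;GCSF', 'G-CSF', ignore.nonalphanum = TRUE)  # relaxed
matchesSymbol('G-CSFa2', 'G-CSF', partial.match = TRUE)         # partial
```

Under these assumptions, the first call rejects the entry while the other two accept it, which is consistent with the results shown for P35833 and A0A3G2Y303 above.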

8 Closing biodb instance

When done with your biodb instance, you have to terminate it to ensure that resources (file handles, database connections, etc.) are released:

mybiodb$terminate()
## INFO  [16:56:19.480] Closing BiodbMain instance... 
## INFO  [16:56:19.482] Connector "uniprot" deleted.
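In scripts, an error raised between newInst() and terminate() would skip the termination call. One possible way to guard against this is a small wrapper built on on.exit(). The helper withInstance() below is hypothetical and not part of biodb; it is demonstrated with a stand-in object so that the snippet runs without biodb or network access.

```r
# Hypothetical helper (not part of biodb): create an instance, pass it to
# `fun`, and always call terminate() afterwards, even if `fun` fails.
withInstance <- function(fun, create = function() biodb::newInst()) {
    inst <- create()
    on.exit(inst$terminate(), add = TRUE)
    fun(inst)
}

# Offline demonstration with a stand-in object exposing terminate().
terminated <- FALSE
mock <- list(terminate = function() terminated <<- TRUE)
result <- try(
    withInstance(function(inst) stop('failure during use'),
                 create = function() mock),
    silent = TRUE
)

# TRUE: terminate() was called although `fun` raised an error.
terminated
```

With biodb installed, calling withInstance(function(mybiodb) ...) without the create argument would fall back to biodb::newInst().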

9 Session information

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biodbUniprot_1.0.0 BiocStyle_2.22.0  
## 
## loaded via a namespace (and not attached):
##  [1] progress_1.2.2      tidyselect_1.1.1    xfun_0.27          
##  [4] bslib_0.3.1         purrr_0.3.4         vctrs_0.3.8        
##  [7] generics_0.1.1      htmltools_0.5.2     BiocFileCache_2.2.0
## [10] yaml_2.2.1          utf8_1.2.2          blob_1.2.2         
## [13] XML_3.99-0.8        rlang_0.4.12        jquerylib_0.1.4    
## [16] pillar_1.6.4        withr_2.4.2         glue_1.4.2         
## [19] DBI_1.1.1           rappdirs_0.3.3      bit64_4.0.5        
## [22] dbplyr_2.1.1        lifecycle_1.0.1     plyr_1.8.6         
## [25] stringr_1.4.0       memoise_2.0.0       evaluate_0.14      
## [28] knitr_1.36          fastmap_1.1.0       curl_4.3.2         
## [31] fansi_0.5.0         biodb_1.2.0         Rcpp_1.0.7         
## [34] openssl_1.4.5       filelock_1.0.2      BiocManager_1.30.16
## [37] cachem_1.0.6        jsonlite_1.7.2      bit_4.0.4          
## [40] hms_1.1.1           chk_0.7.0           askpass_1.1        
## [43] digest_0.6.28       stringi_1.7.5       bookdown_0.24      
## [46] dplyr_1.0.7         bitops_1.0-7        tools_4.1.1        
## [49] magrittr_2.0.1      sass_0.4.0          RCurl_1.98-1.5     
## [52] RSQLite_2.2.8       tibble_3.1.5        crayon_1.4.1       
## [55] pkgconfig_2.0.3     ellipsis_0.3.2      prettyunits_1.1.1  
## [58] assertthat_0.2.1    rmarkdown_2.11      httr_1.4.2         
## [61] lgr_0.4.3           R6_2.5.1            compiler_4.1.1

References

Consortium, The UniProt. 2016. “UniProt: the universal protein knowledgebase.” Nucleic Acids Research 45 (D1): D158–D169. https://doi.org/10.1093/nar/gkw1099.