The goal of the rpx package is to provide programmatic access to proteomics data from R, in particular to the ProteomeXchange (PX) central repository (see http://www.proteomexchange.org/ and http://central.proteomexchange.org/).
Vizcaino J.A. et al. ProteomeXchange: globally co-ordinated proteomics data submission and dissemination, Nature Biotechnology 2014, 32, 223 – 226, doi:10.1038/nbt.2839.
Additional repositories are likely to be added in the future.
PXDataset objects
The central object that handles data access is the PXDataset class. An instance of this class can be generated by passing a valid PX experiment identifier to the PXDataset constructor.
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
## Object of class "PXDataset"
## Id: PXD000001 with 11 files
## [1] 'F063721.dat' ... [11] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
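Before calling the constructor, it can be convenient to screen strings that are clearly not PX identifiers. The sketch below assumes that identifiers follow the shape of the one used in this document ("PXD" followed by six digits). The helper looks_like_pxd is hypothetical and not part of rpx, and a string that passes the check is not guaranteed to correspond to an existing dataset.

```r
## Hypothetical helper, not part of rpx.
## Assumption: identifiers look like "PXD000001" (PXD + six digits).
looks_like_pxd <- function(x) {
    grepl("^PXD[0-9]{6}$", x)
}

## Only the first candidate has the expected shape.
candidates <- c("PXD000001", "PXD1", "pxd000001")
ok <- looks_like_pxd(candidates)
candidates[ok]
```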
Several attributes can be extracted from a PXDataset instance, as described below.
The experiment identifier that was originally used to create the PXDataset instance can be extracted with the pxid() method:
pxid(px)
## [1] "PXD000001"
The file transfer URL where the data files can be accessed can be queried with the pxurl method:
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
The species from which the data were generated can be obtained by calling the pxtax function:
pxtax(px)
## [1] "Erwinia carotovora"
Relevant bibliographic references can be queried with the pxref method:
strwrap(pxref(px))
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data"
## [2] "analysis. Biochim Biophys Acta. 2013 May 18. doi:pii:"
## [3] "S1570-9639(13)00186-6. 10.1016/j.bbapap.2013.04.032"
All files available for the PX experiment can be obtained with the pxfiles method:
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
The complete or partial data set can be downloaded with the pxget() function. The function takes an instance of class PXDataset as its first, mandatory argument.
The next argument, list, specifies which files to download. If it is missing, a menu is printed and the user can select a file. If it is set to "all", all files of the experiment are downloaded. Alternatively, numerics or logicals can be used to subset the relevant files to be downloaded, based on the pxfiles(.) output.
f <- pxget(px, "PXD000001_mztab.txt")
## Loading PXD000001_mztab.txt from cache.
f
## [1] "~/.cache/rpx/63ac7ca2511d_PXD000001_mztab.txt"
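The numeric and logical selectors mentioned for the list argument can be prepared offline. The sketch below works on a plain character vector copied from the pxfiles(px) output shown earlier, so it runs without network access. The names num_sel and lgl_sel are illustrative only.

```r
## File names copied from the pxfiles(px) output for PXD000001.
files <- c(
    "F063721.dat",
    "F063721.dat-mztab.txt",
    "PRIDE_Exp_Complete_Ac_22134.xml.gz",
    "PRIDE_Exp_mzData_Ac_22134.xml.gz",
    "PXD000001_mztab.txt",
    "README.txt",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw",
    "erwinia_carotovora.fasta"
)

## Numeric selection: the fifth and sixth files.
num_sel <- c(5, 6)
files[num_sel]

## Logical selection: every file whose name ends in ".mzML".
lgl_sel <- grepl("\\.mzML$", files)
files[lgl_sel]

## With a PXDataset object px at hand, the same selectors would be
## passed as the second argument, as described in the text:
## pxget(px, num_sel)
## pxget(px, lgl_sel)
```

Building the selector from the file names, rather than hard-coding positions, keeps the selection valid if the order of the files ever changes.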
The rpx package makes use of the BiocFileCache package to avoid repeatedly downloading files. When downloaded, files are cached, i.e. stored centrally in the package's cache directory. The next time the pxget() function attempts to get that file, it will be retrieved directly from the cache instead of being downloaded again.
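The caching idea can be illustrated with base R alone. The sketch below is a toy stand-in, not the rpx or BiocFileCache implementation: cached_get and fake_fetch are hypothetical helpers, and the "download" simply writes a line of text into a temporary directory.

```r
## Hypothetical helper, not part of rpx: fetch a file into a cache
## directory only if it is not already there.
cached_get <- function(name, fetch, cache_dir) {
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    dest <- file.path(cache_dir, name)
    if (file.exists(dest)) {
        message("Loading ", name, " from cache.")
    } else {
        message("Downloading ", name, ".")
        fetch(dest)  # stands in for the real file transfer
    }
    dest
}

## Stand-in for a download: writes one line of text to the destination.
fake_fetch <- function(dest) writeLines("toy content", dest)

## A fresh, empty cache location for this illustration.
cache <- tempfile("toy_cache_")

f1 <- cached_get("README.txt", fake_fetch, cache)  # not cached yet: fetched
f2 <- cached_get("README.txt", fake_fetch, cache)  # already cached: reused
identical(f1, f2)
```

Both calls return the same path, but only the first one triggers the transfer.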
Finally, a list of recent PX additions and updates can be obtained using the pxannounced() function:
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD017953 2021-03-13 17:40:20 New
## 2 PXD020016 2021-03-13 04:15:09 New
## 3 PXD020399 2021-03-13 03:55:33 New
## 4 PXD020714 2021-03-13 03:35:36 New
## 5 PXD018932 2021-03-12 17:15:05 New
## 6 PXD023139 2021-03-12 15:42:59 New
## 7 PXD024706 2021-03-12 14:11:33 New
## 8 PXD024705 2021-03-12 14:08:23 New
## 9 PXD013791 2021-03-12 13:15:21 New
## 10 PXD022739 2021-03-12 12:56:18 New
## 11 PXD020167 2021-03-12 12:01:44 New
## 12 PXD011568 2021-03-12 11:26:46 New
## 13 PXD020167 2021-03-12 09:52:35 New
## 14 PXD024381 2021-03-12 08:00:54 New
## 15 PXD024674 2021-03-12 07:12:38 New
Below, we download the fasta file from the PXD000001 dataset and load it with the Biostrings package.
fas <- grep("fasta", pxfiles(px), value = TRUE)
fas
## [1] "erwinia_carotovora.fasta"
f <- pxget(px, fas)
## Loading erwinia_carotovora.fasta from cache.
f ## files available in the rpx cache
## [1] "~/.cache/rpx/63ac5832dd30_erwinia_carotovora.fasta"
library("Biostrings")
readAAStringSet(f)
## AAStringSet object of length 4499:
## width seq names
## [1] 147 MADITLISGSTLGSAEYVAEHL...QHQIPEDPAEEWLGSWVNLLK ECA0001 putative ...
## [2] 153 VAEIYQIDNLDRGILSALMENA...EIQSTETLISLQNPIMRTIAP ECA0002 AsnC-fami...
## [3] 330 MKKQYIEKQQQISFVKSFFSSQ...IGQVQCGVWPQPLRESVSGLL ECA0003 putative ...
## [4] 492 MITLESLEMLLSIDENELLDDL...WRFDTGLKSRLMRRWQHGKAY ECA0004 conserved...
## [5] 499 MRQTAALAERISRLSHALEHGL...AKIEASLQQVAEQIQQSEQQD ECA0005 conserved...
## ... ... ...
## [4495] 634 MSDKIIHLTDDSFDTDVLKADG...RRKVDPLRVFASDMARRLELL trx-rv3790 trx-rv...
## [4496] 93 MTKMNNKARRTARELKHLGASI...RELRDEFPMGYLGDYKDDDDK TimBlower TimBlower
## [4497] 309 MFSNLSKRWAQRTLSKSFYSTA...KFKWAGIKTRKFVFNPPKPRK sp|P07143|CY1_YEA...
## [4498] 231 FPTDDDDKIVGGYTCAANSIPY...PGVYTKVCNYVNWIQQTIAAN sp|P00761|TRYP_PI...
## [4499] 269 GVSGSCNIDVVCPEGNGHRDVI...DAAGTGAQFIDGLDSTGTPPV sp|Q7M135|LYSC_LY...
For questions or bug reports, either post on the Bioconductor support forum or open a GitHub issue.
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] rpx_1.26.2 Biostrings_2.58.0 XVector_0.30.0
## [4] IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0
## [7] BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 pillar_1.5.1 compiler_4.0.4
## [4] BiocManager_1.30.10 dbplyr_2.1.0 bitops_1.0-6
## [7] tools_4.0.4 zlibbioc_1.36.0 digest_0.6.27
## [10] bit_4.0.4 debugme_1.1.0 tibble_3.1.0
## [13] lifecycle_1.0.0 BiocFileCache_1.14.0 RSQLite_2.2.4
## [16] evaluate_0.14 memoise_2.0.0 pkgconfig_2.0.3
## [19] rlang_0.4.10 DBI_1.1.1 curl_4.3
## [22] yaml_2.2.1 xfun_0.22 fastmap_1.1.0
## [25] withr_2.4.1 httr_1.4.2 stringr_1.4.0
## [28] dplyr_1.0.5 xml2_1.3.2 knitr_1.31
## [31] rappdirs_0.3.3 generics_0.1.0 vctrs_0.3.6
## [34] tidyselect_1.1.0 bit64_4.0.5 glue_1.4.2
## [37] R6_2.5.0 fansi_0.4.2 rmarkdown_2.7
## [40] purrr_0.3.4 blob_1.2.1 magrittr_2.0.1
## [43] ellipsis_0.3.1 htmltools_0.5.1.1 assertthat_0.2.1
## [46] utf8_1.2.1 stringi_1.5.3 RCurl_1.98-1.2
## [49] cachem_1.0.4 crayon_1.4.1