Introduction

The goal of the rpx package is to provide programmatic access to proteomics data from R, in particular to the ProteomeXchange (PX) central repository (see http://www.proteomexchange.org/ and http://central.proteomexchange.org/).

Vizcaino J.A. et al. ProteomeXchange: globally co-ordinated proteomics data submission and dissemination, Nature Biotechnology 2014, 32, 223 – 226, doi:10.1038/nbt.2839.

Additional repositories are likely to be added in the future.

The rpx package

PXDataset objects

The central object that handles data access is the PXDataset class. An instance is created by passing a valid PX experiment identifier to the PXDataset constructor.

library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
## Object of class "PXDataset"
##  Id: PXD000001 with 12 files
##  [1] 'F063721.dat' ... [12] 'generated'
##  Use 'pxfiles(.)' to see all files.
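
Since the constructor expects a valid identifier, it can be convenient to check the shape of an identifier before using it. The sketch below assumes that dataset identifiers consist of the prefix PXD followed by six digits, as in PXD000001. The helper name is_px_id is hypothetical and not part of rpx, and it only checks the format, not whether the dataset exists.

```r
## Hypothetical helper (not part of rpx): checks only the shape of an
## identifier, assuming the form "PXD" followed by six digits.
is_px_id <- function(id) {
    is.character(id) & !is.na(id) & grepl("^PXD[0-9]{6}$", id)
}

is_px_id(c("PXD000001", "pxd000001", "PXD1"))
## [1]  TRUE FALSE FALSE
```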

Data and meta-data

Several attributes can be extracted from a PXDataset instance, as described below.

The experiment identifier that was originally used to create the PXDataset instance can be extracted with the pxid method:

pxid(px)
## [1] "PXD000001"

The file transfer URL from which the data files can be accessed can be queried with the pxurl method:

pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"

The species from which the data were generated can be obtained by calling the pxtax function:

pxtax(px)
## [1] "Erwinia carotovora"

Relevant bibliographic references can be queried with the pxref method:

strwrap(pxref(px))
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data"
## [2] "analysis. Biochim Biophys Acta. 2013 May 18. doi:pii:"                
## [3] "S1570-9639(13)00186-6. 10.1016/j.bbapap.2013.04.032"

All files available for the PX experiment can be obtained with the pxfiles method:

pxfiles(px)
##  [1] "F063721.dat"                                                         
##  [2] "F063721.dat-mztab.txt"                                               
##  [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"                                  
##  [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"                                    
##  [5] "PXD000001_mztab.txt"                                                 
##  [6] "README.txt"                                                          
##  [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML" 
##  [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
##  [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"         
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"           
## [11] "erwinia_carotovora.fasta"                                            
## [12] "generated"
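
With a dozen or more files per dataset, it can help to group the names by extension before deciding what to download. The following is a minimal sketch using only base R. The file names are copied from the listing above so that it runs without network access, and grouping by tools::file_ext() is an illustration rather than something rpx does itself.

```r
## File names copied from the pxfiles(px) output above.
files <- c(
    "F063721.dat",
    "F063721.dat-mztab.txt",
    "PRIDE_Exp_Complete_Ac_22134.xml.gz",
    "PRIDE_Exp_mzData_Ac_22134.xml.gz",
    "PXD000001_mztab.txt",
    "README.txt",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw",
    "erwinia_carotovora.fasta",
    "generated")

## tools::file_ext() returns the text after the last dot, or "" when
## there is none; label the latter explicitly before grouping.
ext <- tools::file_ext(files)
ext[ext == ""] <- "(none)"
by_ext <- split(files, ext)

lengths(by_ext)   # number of files per extension
by_ext[["txt"]]   # the three text files, including both mzTab files
```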

The complete or partial data set can be downloaded with the pxget function. The function takes an instance of class PXDataset as its first, mandatory argument.

The next argument, list, specifies which files to download. If missing, a menu is printed and the user can select a file. If set to "all", all files of the experiment are downloaded to the working directory. Alternatively, numerics or logicals can be used to subset the relevant files to be downloaded, based on the pxfiles(.) output.

The last argument, force, can be set to TRUE to force the download of files that already exist in the working directory.

pxget(px, "erwinia_carotovora.fasta")
## Downloading 1 file
dir(pattern = "fasta")
## [1] "erwinia_carotovora.fasta"

By default, pxget will not download and overwrite a file that is already available locally, as the message printed by the second call below shows.

(i <- grep("fasta", pxfiles(px)))
## [1] 11
pxget(px, i) ## same as above
## Downloading 1 file
## C:/Users/biocbuild/bbs-3.11-bioc/tmpdir/RtmpKoOoUB/Rbuild2b78541b72e9/rpx/vignettes/erwinia_carotovora.fasta already present.
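
A logical vector can be used in the same way as the numeric index above. The sketch below builds such a vector from the file names (copied from the pxfiles(px) listing so that it runs offline). The final pxget call is left as a comment because it requires network access and would download two large files.

```r
## File names copied from the pxfiles(px) output.
files <- c(
    "F063721.dat",
    "F063721.dat-mztab.txt",
    "PRIDE_Exp_Complete_Ac_22134.xml.gz",
    "PRIDE_Exp_mzData_Ac_22134.xml.gz",
    "PXD000001_mztab.txt",
    "README.txt",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw",
    "erwinia_carotovora.fasta",
    "generated")

## One TRUE/FALSE per file: TRUE for names ending in ".mzXML".
sel <- grepl("\\.mzXML$", files)
which(sel)
## [1] 8 9

## pxget(px, sel)   # would fetch the two mzXML files (needs network)
```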

Finally, a list of recent PX additions and updates can be obtained using the pxannounced() function:

pxannounced()
## 15 new ProteomeXchange annoucements
##     Data.Set    Publication.Data Message
## 1  PXD014969 2020-04-28 00:32:56     New
## 2  PXD015100 2020-04-27 15:15:11     New
## 3  PXD018829 2020-04-27 12:50:31     New
## 4  PXD016399 2020-04-27 09:57:40     New
## 5  PXD018692 2020-04-27 09:56:28     New
## 6  PXD010530 2020-04-27 09:17:38     New
## 7  PXD015622 2020-04-27 07:48:29     New
## 8  PXD016616 2020-04-27 07:45:09     New
## 9  PXD018808 2020-04-27 07:40:23     New
## 10 PXD017620 2020-04-27 07:27:13     New
## 11 PXD018061 2020-04-27 07:13:49     New
## 12 PXD017371 2020-04-27 07:12:07     New
## 13 PXD018589 2020-04-27 07:07:15     New
## 14 PXD016735 2020-04-27 07:05:04     New
## 15 PXD018035 2020-04-27 07:03:51     New
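
The announcements can then be filtered like any other table. The sketch below assumes that the result is a data frame with the column names shown in the printed output. Because pxannounced() needs network access, a small data frame built from four of the rows above stands in for its result. Treating the timestamps as UTC is also an assumption made only for the comparison.

```r
## Stand-in for the pxannounced() result: four rows copied from the
## output above, using the column names shown there.
ann <- data.frame(
    Data.Set = c("PXD014969", "PXD016399", "PXD015622", "PXD018035"),
    Publication.Data = c("2020-04-28 00:32:56", "2020-04-27 09:57:40",
                         "2020-04-27 07:48:29", "2020-04-27 07:03:51"),
    Message = "New",
    stringsAsFactors = FALSE)

## Assumption: timestamps are interpreted as UTC.
when <- as.POSIXct(ann$Publication.Data, tz = "UTC")
cutoff <- as.POSIXct("2020-04-27 09:00:00", tz = "UTC")

## Keep only announcements made at or after the cutoff.
recent <- ann[when >= cutoff, ]
recent$Data.Set
## [1] "PXD014969" "PXD016399"
```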

A simple use-case

Below, we show how to automate the extraction of files of interest (fasta and mzTab files), download them and read them using appropriate Bioconductor infrastructure. (Note that we read version 0.9 of the mzTab format below. For recent data, the version argument can be omitted.)

(mzt <- grep("F0.+mztab", pxfiles(px), value = TRUE))
## [1] "F063721.dat-mztab.txt"
(fas <- grep("fasta", pxfiles(px), value = TRUE))
## [1] "erwinia_carotovora.fasta"
pxget(px, c(mzt, fas))
## Downloading 2 files
## C:/Users/biocbuild/bbs-3.11-bioc/tmpdir/RtmpKoOoUB/Rbuild2b78541b72e9/rpx/vignettes/erwinia_carotovora.fasta already present.
library("Biostrings")
readAAStringSet(fas)
## AAStringSet object of length 4499:
##        width seq                                            names               
##    [1]   148 MADITLISGSTLGSAEYVAEHL...HQIPEDPAEEWLGSWVNLLK ECA0001 putative ...
##    [2]   154 VAEIYQIDNLDRGILSALMENA...IQSTETLISLQNPIMRTIAP ECA0002 AsnC-fami...
##    [3]   331 MKKQYIEKQQQISFVKSFFSSQ...GQVQCGVWPQPLRESVSGLL ECA0003 putative ...
##    [4]   493 MITLESLEMLLSIDENELLDDL...RFDTGLKSRLMRRWQHGKAY ECA0004 conserved...
##    [5]   500 MRQTAALAERISRLSHALEHGL...KIEASLQQVAEQIQQSEQQD ECA0005 conserved...
##    ...   ... ...
## [4495]   645 MSDKIIHLTDDSFDTDVLKADG...RKVDPLRVFASDMARRLELL trx-rv3790 trx-rv...
## [4496]    95 MTKMNNKARRTARELKHLGASI...ELRDEFPMGYLGDYKDDDDK TimBlower TimBlower
## [4497]   315 MFSNLSKRWAQRTLSKSFYSTA...KWAGIKTRKFVFNPPKPRK  sp|P07143|CY1_YEA...
## [4498]   235 FPTDDDDKIVGGYTCAANSIPY...GVYTKVCNYVNWIQQTIAAN sp|P00761|TRYP_PI...
## [4499]   271 GVSGSCNIDVVCPEGNGHRDVI...DAAGTGAQFIDGLDSTGTPPV sp|Q7M135|LYSC_LY...
library("MSnbase")
(x <- readMzTabData(mzt, "PEP", version = "0.9"))
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: sub[1] sub[2] ... sub[6] (6 total)
##   varLabels: abundance
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 1528 (1528 total)
##   fvarLabels: sequence accession ... uri (14 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## mzTab read: Mon Apr 27 23:29:25 2020 
##  MSnbase version: 2.14.0
head(exprs(x))
##     sub[1]   sub[2]   sub[3]   sub[4]   sub[5]   sub[6]
## 1 10630132 11238708 12424917 10997763  9928972 10398534
## 2 11105690 12403253 13160903 12229367 11061660 10131218
## 3  1183431  1322371  1599088  1243715  1306602  1159064
## 4  5384959  5508454  6883086  6136023  5626680  5213771
## 5 18033537 17926487 21052620 19810368 17381162 17268329
## 6  9873585 10299931 11142071 10258214  9664315  9518271
head(fData(x)[, 1:2])
##    sequence accession
## 1   DGVSVAR   ECA0625
## 2    NVVLDK   ECA0625
## 3 VEDALHATR   ECA0625
## 4 LAGGVAVIK   ECA0625
## 5  LIAEAMEK   ECA0625
## 6 SFGAPTITK   ECA0625
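
The file-selection step at the start of this use case can be made a little more defensive, so that a script stops early when a pattern matches no file or more than one. The helper below, pick_files, is hypothetical and not part of rpx. It works on a plain character vector of file names (copied here from the pxfiles(px) output so that the sketch runs offline), and the download step is shown only as a comment.

```r
## Hypothetical helper (not part of rpx): returns exactly one file name
## per named pattern and stops if a pattern matches zero or several files.
pick_files <- function(files, patterns) {
    vapply(patterns, function(p) {
        hit <- grep(p, files, value = TRUE)
        if (length(hit) != 1L)
            stop("pattern '", p, "' matched ", length(hit), " files")
        hit
    }, character(1))
}

## File names copied from the pxfiles(px) output.
files <- c(
    "F063721.dat",
    "F063721.dat-mztab.txt",
    "PRIDE_Exp_Complete_Ac_22134.xml.gz",
    "PRIDE_Exp_mzData_Ac_22134.xml.gz",
    "PXD000001_mztab.txt",
    "README.txt",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw",
    "erwinia_carotovora.fasta",
    "generated")

sel <- pick_files(files, c(mzt = "F0.+mztab", fas = "fasta"))
sel[["mzt"]]
## [1] "F063721.dat-mztab.txt"

## pxget(px, unname(sel))   # download step (needs network access)
```

The same patterns as in the use case are used here. An ambiguous pattern such as "mzXML", which matches two files in this listing, would raise an error instead of silently selecting both.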

Questions and help

Either post questions on the Bioconductor support forum or open a GitHub issue.

Session information

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rpx_1.24.0          MSnbase_2.14.0      ProtGenerics_1.20.0
##  [4] mzR_2.22.0          Rcpp_1.0.4.6        Biobase_2.48.0     
##  [7] Biostrings_2.56.0   XVector_0.28.0      IRanges_2.22.0     
## [10] S4Vectors_0.26.0    BiocGenerics_0.34.0 BiocStyle_2.16.0   
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-41       assertthat_0.2.1      digest_0.6.25        
##  [4] foreach_1.5.0         R6_2.4.1              plyr_1.8.6           
##  [7] mzID_1.26.0           evaluate_0.14         ggplot2_3.3.0        
## [10] pillar_1.4.3          zlibbioc_1.34.0       rlang_0.4.5          
## [13] curl_4.3              preprocessCore_1.50.0 rmarkdown_2.1        
## [16] BiocParallel_1.22.0   stringr_1.4.0         RCurl_1.98-1.2       
## [19] munsell_0.5.0         compiler_4.0.0        xfun_0.13            
## [22] pkgconfig_2.0.3       pcaMethods_1.80.0     htmltools_0.4.0      
## [25] tidyselect_1.0.0      tibble_3.0.1          codetools_0.2-16     
## [28] XML_3.99-0.3          crayon_1.3.4          dplyr_0.8.5          
## [31] MASS_7.3-51.6         bitops_1.0-6          grid_4.0.0           
## [34] gtable_0.3.0          lifecycle_0.2.0       affy_1.66.0          
## [37] magrittr_1.5          scales_1.1.0          ncdf4_1.17           
## [40] stringi_1.4.6         impute_1.62.0         affyio_1.58.0        
## [43] doParallel_1.0.15     limma_3.44.0          xml2_1.3.2           
## [46] ellipsis_0.3.0        vctrs_0.2.4           iterators_1.0.12     
## [49] tools_4.0.0           glue_1.4.0            purrr_0.3.4          
## [52] yaml_2.2.1            colorspace_1.4-1      BiocManager_1.30.10  
## [55] vsn_3.56.0            MALDIquant_1.19.3     knitr_1.28