MSstatsBig Workflow

Devon Kohler (kohler.d@northeastern.edu)

2024-10-29

MSstatsBig Package Description

MSstatsBig is designed to overcome the challenges of analyzing very large mass spectrometry (MS)-based proteomics experiments. These experiments are generally (but not always) acquired with data-independent acquisition (DIA) and include a large number of MS runs. MSstatsBig leverages software that can process datasets without loading them fully into memory. This avoids a major problem: a dataset that is too large to fit into a standard computer’s RAM.

MSstatsBig includes functions designed to replace the converters included in the MSstats package. The goal of these converters is to filter the PSM files of identified and quantified data, removing data that is not required for differential analysis. Once filtered, the data should be small enough to load into your computer’s memory. After the converters are run, the standard MSstats workflow can be followed.
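To make the idea of filtering concrete, the following is a minimal sketch of one such reduction step: keeping only the highest-intensity features of each protein, in the spirit of the max_feature_count argument used later in this vignette. It is an illustration in base R on a made-up toy table, not MSstatsBig's implementation; the helper keep_top_features, the Feature column, and the ranking by mean intensity are assumptions made for this example.

```r
# Toy data: protein P1 has three features, protein P2 has one (made-up values).
toy <- data.frame(
  ProteinName = c(rep("P1", 6), rep("P2", 2)),
  Feature = c(rep(c("f1", "f2", "f3"), each = 2), rep("f4", 2)),
  Run = rep(c("r1", "r2"), times = 4),
  Intensity = c(100, 120, 10, 12, 50, 55, 30, 35)
)

# Hypothetical helper: keep the `max_feature_count` features per protein
# with the highest mean intensity across runs.
keep_top_features <- function(data, max_feature_count) {
  # Mean intensity of every feature within its protein
  feature_means <- aggregate(Intensity ~ ProteinName + Feature,
                             data = data, FUN = mean)
  # Sort by protein, then by decreasing mean intensity
  ranked <- feature_means[order(feature_means$ProteinName,
                                -feature_means$Intensity), ]
  # Rank features within each protein (1 = most intense)
  ranked$Rank <- ave(ranked$Intensity, ranked$ProteinName, FUN = seq_along)
  top <- ranked[ranked$Rank <= max_feature_count,
                c("ProteinName", "Feature")]
  # Keep only the rows that belong to the selected features
  merge(data, top, by = c("ProteinName", "Feature"))
}

filtered <- keep_top_features(toy, max_feature_count = 2)
unique(filtered$Feature)
```

With max_feature_count = 2, the low-intensity feature f2 of protein P1 is dropped, while P2 keeps its only feature.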

MSstatsBig currently includes converters for Spectronaut and FragPipe. Beyond these converters, users can put their data into MSstats format manually and run the underlying MSstatsPreprocessBig function directly. This way, either by using the native export format of a signal processing tool or by converting raw data chunk by chunk (for example with the readr::read_delim_chunked function), MSstatsBig can be used with other popular tools such as DIA-NN.
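As a rough illustration of chunk-by-chunk processing, the sketch below reads a delimited file in small chunks with readr::read_delim_chunked, filters each chunk, and appends the result to an output file, so the full table is never held in memory at once. The toy file, its column names (including QValue), the 0.01 threshold, and the helper filter_and_append are assumptions made for this example; a real export from a tool such as DIA-NN would need its own column mapping into MSstats format.

```r
library(readr)

# Toy tab-separated input standing in for a large export file (made-up values).
input_path <- tempfile(fileext = ".tsv")
output_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  ProteinName = rep(c("P1", "P2"), each = 5),
  Run = rep(paste0("run", 1:5), times = 2),
  Intensity = c(10, 20, NA, 40, 50, 60, NA, 80, 90, 100),
  QValue = c(0.001, 0.2, 0.003, 0.004, 0.5, 0.001, 0.002, 0.3, 0.004, 0.005)
)
write_tsv(toy, input_path)

# Callback applied to each chunk: filter it and append the result to the output.
# `pos` is the row number of the first row in the chunk, so pos == 1 marks the
# first chunk, which is the only one written with a header.
filter_and_append <- function(chunk, pos) {
  kept <- chunk[!is.na(chunk$Intensity) & chunk$QValue < 0.01, ]
  write_csv(kept, output_path, append = pos > 1)
  invisible(TRUE)
}

read_delim_chunked(
  input_path,
  callback = SideEffectChunkCallback$new(filter_and_append),
  delim = "\t",
  chunk_size = 4,  # tiny on purpose; real data would use a much larger value
  col_types = cols(
    ProteinName = col_character(),
    Run = col_character(),
    Intensity = col_double(),
    QValue = col_double()
  ),
  progress = FALSE
)

# The filtered file is small enough to read in the usual way.
filtered <- read.csv(output_path)
nrow(filtered)
```

Column types are given explicitly so that every chunk is parsed the same way, regardless of what the first rows happen to contain.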

library(MSstatsBig)

Dataset description

The dataset included in this package is a small subset of the data from Clark et al. [1], a large DIA dataset that includes over 100 runs. The experimental data was identified and quantified by FragPipe, and the included dataset is the msstats.csv output of FragPipe.

head(read.csv(system.file("extdata", "fgexample.csv", package = "MSstatsBig")))
##   ProteinName            PeptideSequence PrecursorCharge FragmentIon
## 1      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          b4
## 2      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          y6
## 3      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          y7
## 4      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          b5
## 5      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          y8
## 6      Q86U42 (UniMod:1)AAAAAAAAAAGAAGGR               1          y9
##   ProductCharge IsotopeLabelType Condition BioReplicate
## 1             1                L         T         1522
## 2             1                L         T         1522
## 3             1                L         T         1522
## 4             1                L         T         1522
## 5             1                L         T         1522
## 6             1                L         T         1522
##                                            Run  Intensity
## 1 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T 3747562.00
## 2 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T  770585.44
## 3 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T 1379359.12
## 4 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T 3706759.50
## 5 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T   77361.97
## 6 CPTAC_CCRCC_W_JHU_20190112_LUMOS_C3N-01522_T  338863.41

Run MSstatsBig converter

First we run the MSstatsBig converter. The converter saves the converted dataset to a file on your computer and returns an arrow object. Afterwards, you can either read the data from the saved text file or load the arrow object into memory with the dplyr::collect function. The “collected” data can then be treated as a standard R data.frame.

setwd(tempdir())

converted_data = bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend="arrow",
  max_feature_count = 20)

# The returned arrow object needs to be collected for the remaining workflow
converted_data = as.data.frame(dplyr::collect(converted_data))
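The difference between an arrow object and a collected data.frame can be seen on a small stand-alone example. The sketch below assumes the arrow and dplyr packages are installed and uses a made-up toy CSV rather than the converter output: the file is opened lazily, a filter is added to the query, and rows are only loaded into memory when dplyr::collect is called.

```r
library(arrow)
library(dplyr)

# Toy CSV standing in for a converter output file (made-up values).
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    ProteinName = c("P1", "P1", "P2"),
    Run = c("r1", "r2", "r1"),
    Intensity = c(100, 200, 300)
  ),
  toy_path,
  row.names = FALSE
)

# open_dataset() only registers the file; no rows are loaded yet.
lazy_data <- open_dataset(toy_path, format = "csv")

# The filter is recorded as part of the query and evaluated by collect(),
# which returns the matching rows as an in-memory table.
p1_only <- lazy_data |>
  filter(ProteinName == "P1") |>
  collect() |>
  as.data.frame()

nrow(p1_only)
```

Filtering before collecting keeps memory use proportional to the rows that are actually needed, which is the reason for working with the arrow object as long as possible.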

Remaining workflow

Once the converter has run, the standard MSstats workflow can be followed. Details of the MSstats workflow can be found in [2].

library(MSstats)
## 
## Attaching package: 'MSstats'
## The following object is masked from 'package:grDevices':
## 
##     savePlot
# Alternatively, load the converted data from the saved file:
# converted_data = read.csv("output_file.csv")
summarized_data = dataProcess(converted_data,
                              use_log_file = FALSE)
## INFO  [2024-10-29 21:47:11] ** Features with one or two measurements across runs are removed.
## INFO  [2024-10-29 21:47:11] ** Fractionation handled.
## INFO  [2024-10-29 21:47:11] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO  [2024-10-29 21:47:11] ** Log2 intensities under cutoff = 12.59  were considered as censored missing values.
## INFO  [2024-10-29 21:47:11] ** Log2 intensities = NA were considered as censored missing values.
## INFO  [2024-10-29 21:47:11] ** Use all features that the dataset originally has.
## INFO  [2024-10-29 21:47:11] 
##  # proteins: 1
##  # peptides per protein: 1-1
##  # features per peptide: 11-11
## INFO  [2024-10-29 21:47:11] 
##                     NAT  T
##              # runs  56 50
##     # bioreplicates  56 50
##  # tech. replicates   1  1
## INFO  [2024-10-29 21:47:11]  == Start the summarization per subplot...
##   |======================================================================| 100%
## INFO  [2024-10-29 21:47:11]  == Summarization is done.
# Build contrast matrix: one row per comparison, one column per condition
comparison = matrix(c(-1, 1),
    nrow=1, byrow=TRUE)
row.names(comparison) <- c("T-NAT")
colnames(comparison) <- c("NAT", "T")

model_results = groupComparison(contrast.matrix = comparison, 
                                data = summarized_data,
                                use_log_file = FALSE)
## INFO  [2024-10-29 21:47:11]  == Start to test and get inference in whole plot ...
##   |======================================================================| 100%
## INFO  [2024-10-29 21:47:11]  == Comparisons for all proteins are done.
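The contrast matrix used in this workflow has one row per comparison and one column per condition, with 1 for the numerator condition and -1 for the denominator, and its column names match the condition labels in the data. With more conditions, writing these rows by hand becomes error-prone, so a small helper can build them. The function make_pairwise_contrast below is a hypothetical convenience written for this example, not part of MSstats, and the condition name "Ctrl" is made up.

```r
# Hypothetical helper: build a one-row contrast for numerator vs. denominator.
make_pairwise_contrast <- function(conditions, numerator, denominator) {
  stopifnot(numerator %in% conditions, denominator %in% conditions)
  contrast <- matrix(
    0,
    nrow = 1, ncol = length(conditions),
    dimnames = list(paste0(numerator, "-", denominator), conditions)
  )
  contrast[1, numerator] <- 1
  contrast[1, denominator] <- -1
  contrast
}

# Reproduces the T vs. NAT contrast from this vignette.
comparison <- make_pairwise_contrast(c("NAT", "T"), "T", "NAT")

# Several comparisons are stacked row-wise ("Ctrl" is a made-up condition).
three_groups <- c("Ctrl", "NAT", "T")
comparisons <- rbind(
  make_pairwise_contrast(three_groups, "T", "NAT"),
  make_pairwise_contrast(three_groups, "T", "Ctrl")
)
comparisons
```

Each row sums to zero, which is a quick sanity check that a comparison contrasts one condition against another rather than shifting the overall level.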

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MSstats_4.14.0   MSstatsBig_1.4.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6          xfun_0.48             bslib_0.8.0          
##  [4] ggplot2_3.5.1         htmlwidgets_1.6.4     caTools_1.18.3       
##  [7] ggrepel_0.9.6         lattice_0.22-6        vctrs_0.6.5          
## [10] tools_4.4.1           bitops_1.0-9          generics_0.1.3       
## [13] parallel_4.4.1        tibble_3.2.1          fansi_1.0.6          
## [16] pkgconfig_2.0.3       Matrix_1.7-1          KernSmooth_2.23-24   
## [19] data.table_1.16.2     checkmate_2.3.2       assertthat_0.2.1     
## [22] lifecycle_1.0.4       compiler_4.4.1        gplots_3.2.0         
## [25] statmod_1.5.0         munsell_0.5.1         htmltools_0.5.8.1    
## [28] sass_0.4.9            yaml_2.3.10           lazyeval_0.2.2       
## [31] preprocessCore_1.68.0 marray_1.84.0         plotly_4.10.4        
## [34] pillar_1.9.0          nloptr_2.1.1          jquerylib_0.1.4      
## [37] tidyr_1.3.1           MASS_7.3-61           cachem_1.1.0         
## [40] limma_3.62.0          boot_1.3-31           nlme_3.1-166         
## [43] gtools_3.9.5          tidyselect_1.2.1      digest_0.6.37        
## [46] dplyr_1.1.4           purrr_1.0.2           arrow_17.0.0.1       
## [49] splines_4.4.1         fastmap_1.2.0         grid_4.4.1           
## [52] colorspace_2.1-1      cli_3.6.3             magrittr_2.0.3       
## [55] survival_3.7-0        utf8_1.2.4            withr_3.0.2          
## [58] scales_1.3.0          backports_1.5.0       bit64_4.5.2          
## [61] rmarkdown_2.28        httr_1.4.7            bit_4.5.0            
## [64] lme4_1.1-35.5         evaluate_1.0.1        knitr_1.48           
## [67] log4r_0.4.4           MSstatsConvert_1.16.0 viridisLite_0.4.2    
## [70] rlang_1.1.4           Rcpp_1.0.13           glue_1.8.0           
## [73] minqa_1.2.8           jsonlite_1.8.9        R6_2.5.1

References

  1. D. J. Clark, S. M. Dhanasekaran, F. Petralia, P. Wang and H. Zhang, “Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma,” Cell, vol. 179, pp. 964-983, 2019.

  2. D. Kohler et al., “MSstats Version 4.0: Statistical Analyses of Quantitative Mass Spectrometry-Based Proteomic Experiments with Chromatography-Based Quantification at Scale,” J. Proteome Res., vol. 22, no. 5, pp. 1466–1482, 2023.