Abstract
This vignette describes how to apply fragment matching to MS\(^E\)/HDMS\(^E\) data using the ‘synapter’ package. The fragment matching feature allows one to rescue non unique matches and remove falsely assigned unique matches as well.
synapter is free and open-source software. If you use it, please support the project by citing it in publications:
Nicholas James Bond, Pavel Vyacheslavovich Shliaha, Kathryn S. Lilley, and Laurent Gatto. Improving qualitative and quantitative performance for MS\(^E\)-based label free proteomics. J. Proteome Res., 2013, 12 (6), pp 2340–2353
For bugs, typos, suggestions or other questions, please file an issue
in our tracking system (https://github.com/lgatto/synapter/issues) providing as
much information as possible, a reproducible example and the output of
sessionInfo()
.
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
This document assumes familiarity with standard synapter
pipeline described in (Bond et al. 2013)
and in the package synapter
vignette, available online
and with vignette("synapter", package = "synapter")
.
In this vignette we introduce a new fragment matching feature (see
figures 2, 3
and 4) which improves the matching of
identification and the quantitation features. After applying the usual
synergise1
workflow (see ?synergise1
and
?Synapter
for details) a number of multiple matches and
possible false unique matches remain that can be deconvoluted by
comparing common peaks in the identification fragment peaks and the
quantitation spectra.
The example data synobj2
used throughout this document
is available in the synapterdata
package and can be directly load as follows:
library("synapterdata")
synobj2RData()
In the [next section](:synergise] we describe how
synobj2
was generated. The following sections then describe
the new fragment matching functionality.
synergise1
One has to run the synergise1
workflow before fragment
matching can be applied. Please read the general synapter
vignette
for the general use of synergise1
. The additional data
needed for the fragment matching procedure are a
final_fragment.csv
file for the identification run and a
Spectrum.xml
file for the quantitation run.
## Please find the raw data at:
## http://proteome.sysbiol.cam.ac.uk/lgatto/synapter/data/
library("synapter")
inlist <- list(
identpeptide = "fermentor_03_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
identfragments = "fermentor_03_sample_01_HDMSE_01_IA_final_fragment.csv.gz",
quantpeptide = "fermentor_02_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
quantpep3d = "fermentor_02_sample_01_HDMSE_01_Pep3DAMRT.csv.gz",
quantspectra = "fermentor_02_sample_01_HDMSE_01_Pep3D_Spectrum.xml.gz",
fasta = "S.cerevisiae_Uniprot_reference_canonical_18_03_14.fasta")
synobj2 <- Synapter(inlist, master=FALSE)
synobj2 <- synergise1(object=synobj2,
outputdir=tempdir())
This step is optional and allows one to remove low abundance
fragments in the spectra using filterFragments
. Filtering
fragments can remove noise in the spectra and reduce undesired fragment
matches. Prior to filtering, the
plotCumulativeNumberOfFragments
function can be use to
visualise the intensity of all fragments. Both functions have an
argument what
to decide what spectra/fragments to
filter/plot. Choose fragments.ident
for the identification
fragments and spectra.quant
for the quantitation
fragments.
plotCumulativeNumberOfFragments(synobj2,
what = "fragments.ident")
plotCumulativeNumberOfFragments(synobj2,
what = "spectra.quant")
filterFragments(synobj2,
what = "fragments.ident",
minIntensity = 70)
filterFragments(synobj2,
what = "spectra.quant",
minIntensity = 70)
This method, named fragmentMatching
, performs the
matching of the identification fragments vs the quantitation spectra and
counts the number of identical peaks for each combination.
Because the peaks/fragments in the spectra of one run will never be
numerically identical to these in another, a tolerance parameter has to
be set using the setFragmentMatchingPpmTolerance
function.
Peaks/Fragments within this tolerance are treated as identical.
setFragmentMatchingPpmTolerance(synobj2, 25)
fragmentMatching(synobj2)
The plotFragmentMatching
function illustrates the
details of this fragment matching procedure. If it is called without any
additional argument every matched pair (fragment vs spectrum) is
plotted. One can use the key
argument to select a special
value in a column (defined by the column
argument) of the
MatchedEMRTs
data.frame
. E.g. if one wants to
select the fragment matching results with a high number of common peaks,
e.g. 28 common peaks:
plotFragmentMatching(synobj2,
key = 28,
column = "FragmentMatching")
Or, if one is interested in all results for the peptide with the
sequence "TALIDGLAQR"
.
plotFragmentMatching(synobj2,
key = "TALIDGLAQR",
column = "peptide.seq")
Maybe the peptide with a special precursor ID looks interesting.
plotFragmentMatching(synobj2,
key = 12589,
column = "precursor.leID.ident")
The plotFragmentMatchingPerformance
function can be used
to assess the performance of the fragment matching and the result of the
filtering procedure (see below) based on the number of common peaks.
This function invisibly returns a list with matrices containing true
positive, false positive, true negative and false negative matches for
the unique and non unique matches EMRT matches, as illustrated in tables
1 and 2. Both tables could be also
generated by fragmentMatchingPerformance
.
m <- plotFragmentMatchingPerformance(synobj2)
ncommon | tp | fp | tn | fn | all | fdr |
---|---|---|---|---|---|---|
0 | 2176 | 26 | 0 | 0 | 4034 | 0.0118074 |
1 | 1940 | 5 | 21 | 236 | 2765 | 0.0025707 |
2 | 1426 | 4 | 22 | 750 | 1899 | 0.0027972 |
3 | 1039 | 2 | 24 | 1137 | 1370 | 0.0019212 |
4 | 759 | 2 | 24 | 1417 | 1016 | 0.0026281 |
5 | 569 | 0 | 26 | 1607 | 780 | 0.0000000 |
6 | 414 | 0 | 26 | 1762 | 600 | 0.0000000 |
7 | 312 | 0 | 26 | 1864 | 468 | 0.0000000 |
8 | 245 | 0 | 26 | 1931 | 379 | 0.0000000 |
9 | 197 | 0 | 26 | 1979 | 316 | 0.0000000 |
10 | 148 | 0 | 26 | 2028 | 252 | 0.0000000 |
11 | 116 | 0 | 26 | 2060 | 203 | 0.0000000 |
12 | 94 | 0 | 26 | 2082 | 171 | 0.0000000 |
13 | 66 | 0 | 26 | 2110 | 132 | 0.0000000 |
14 | 55 | 0 | 26 | 2121 | 110 | 0.0000000 |
deltacommon | tp | fp | tn | fn | all | fdr |
---|---|---|---|---|---|---|
0 | 334 | 72 | 0 | 0 | 4034 | 0.1773399 |
1 | 334 | 8 | 64 | 0 | 889 | 0.0233918 |
2 | 271 | 4 | 68 | 63 | 698 | 0.0145455 |
3 | 226 | 2 | 70 | 108 | 566 | 0.0087719 |
4 | 198 | 1 | 71 | 136 | 501 | 0.0050251 |
5 | 176 | 1 | 71 | 158 | 453 | 0.0056497 |
6 | 153 | 1 | 71 | 181 | 397 | 0.0064935 |
7 | 129 | 1 | 71 | 205 | 337 | 0.0076923 |
8 | 110 | 1 | 71 | 224 | 294 | 0.0090090 |
9 | 96 | 1 | 71 | 238 | 261 | 0.0103093 |
10 | 84 | 1 | 71 | 250 | 232 | 0.0117647 |
11 | 70 | 1 | 71 | 264 | 201 | 0.0140845 |
12 | 61 | 1 | 71 | 273 | 180 | 0.0161290 |
13 | 50 | 1 | 71 | 284 | 150 | 0.0196078 |
14 | 42 | 1 | 71 | 292 | 130 | 0.0232558 |
From the left panel on figure 7 and table 1 displaying counts for unique matches one can define filtering values for the unique (this section) and multiple matches (next section). In the case of uniquely matching EMRTs, one can easily reduce the number of false matches by requiring that true matches must have at least one peak/fragment in common. Clearly this will also remove some true matches. The question is whether you want to rely on matches that have no (or only a few) peaks/fragments in common?
performance(synobj2)
getEMRTtable(synobj2)
##
## 0 1 2 3 4 5 6 7 12 14
## 241 2865 400 64 21 6 5 1 1 1
filterUniqueMatches(synobj2, minNumber = 1)
performance(synobj2)
getEMRTtable(synobj2)
##
## -1 0 1 2 3 4 5 6 7 12 14
## 625 241 2240 400 64 21 6 5 1 1 1
The largest benefit of fragment matching is for non unique matches. If we assume that true match have a highest number of common peaks/fragments, we can distinguish correct matches among multiple possible matches that could not resolved before (c.f. section [#sec:plotperf]). To do so, we use the difference of common peaks from the highest to the second highest number in the match group. Assuming two cases with multiple matches. In the first case, we have two possible matches: a match with 7 and a match with 2 fragments in common. In the second ambiguous match, there are a matches with 2 and 1 fragments in common respectively. If we decide to accept a difference of at least 2, our first multiple match case be resolved into a unique match as the difference between the best and second matches is 5 and the best match with 7 common fragments will be upgraded to a unique match.
The right panel of figure 7 and table 2 can be used to choose a good threshold for the difference in number of common peaks.
performance(synobj2)
getEMRTtable(synobj2)
##
## -1 0 1 2 3 4 5 6 7 12 14
## 625 241 2240 400 64 21 6 5 1 1 1
filterNonUniqueMatches(synobj2, minDelta = 2)
performance(synobj2)
getEMRTtable(synobj2)
##
## -2 -1 0 1
## 210 625 241 2529
In this example we rescued 289 unique matches out of the non unique ones.
Like in the initial synapter
workflow, it is possible to export the MatchedEMRT
results
using the writeMatchedEMRTs
function. The table has some
new columns that correspond to the fragment matching procedure,
e.g. FragmentMatching
, .
writeMatchedEMRTs(synobj2, file = "MatchedEMRTs.csv")
All software and respective versions used to produce this document are listed below.
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocStyle_2.35.0 synapterdata_1.43.0 synapter_2.31.0
## [4] MSnbase_2.33.0 ProtGenerics_1.39.0 S4Vectors_0.45.0
## [7] mzR_2.41.0 Rcpp_1.0.13 Biobase_2.67.0
## [10] BiocGenerics_0.53.0
##
## loaded via a namespace (and not attached):
## [1] rlang_1.1.4 magrittr_2.0.3
## [3] clue_0.3-65 cleaver_1.45.0
## [5] matrixStats_1.4.1 compiler_4.5.0
## [7] vctrs_0.6.5 reshape2_1.4.4
## [9] stringr_1.5.1 pkgconfig_2.0.3
## [11] crayon_1.5.3 fastmap_1.2.0
## [13] XVector_0.47.0 utf8_1.2.4
## [15] rmarkdown_2.28 tzdb_0.4.0
## [17] UCSC.utils_1.3.0 preprocessCore_1.69.0
## [19] purrr_1.0.2 xfun_0.48
## [21] MultiAssayExperiment_1.33.0 zlibbioc_1.53.0
## [23] cachem_1.1.0 GenomeInfoDb_1.43.0
## [25] jsonlite_1.8.9 highr_0.11
## [27] DelayedArray_0.33.0 BiocParallel_1.41.0
## [29] parallel_4.5.0 cluster_2.1.6
## [31] R6_2.5.1 RColorBrewer_1.1-3
## [33] bslib_0.8.0 stringi_1.8.4
## [35] limma_3.63.0 GenomicRanges_1.59.0
## [37] jquerylib_0.1.4 SummarizedExperiment_1.37.0
## [39] iterators_1.0.14 knitr_1.48
## [41] readr_2.1.5 IRanges_2.41.0
## [43] splines_4.5.0 Matrix_1.7-1
## [45] igraph_2.1.1 tidyselect_1.2.1
## [47] qvalue_2.39.0 abind_1.4-8
## [49] yaml_2.3.10 doParallel_1.0.17
## [51] codetools_0.2-20 affy_1.85.0
## [53] lattice_0.22-6 tibble_3.2.1
## [55] plyr_1.8.9 evaluate_1.0.1
## [57] survival_3.7-0 Biostrings_2.75.0
## [59] pillar_1.9.0 affyio_1.77.0
## [61] BiocManager_1.30.25 MatrixGenerics_1.19.0
## [63] foreach_1.5.2 MALDIquant_1.22.3
## [65] ncdf4_1.23 generics_0.1.3
## [67] hms_1.1.3 ggplot2_3.5.1
## [69] munsell_0.5.1 scales_1.3.0
## [71] glue_1.8.0 lazyeval_0.2.2
## [73] tools_4.5.0 mzID_1.45.0
## [75] QFeatures_1.17.0 vsn_3.75.0
## [77] XML_3.99-0.17 grid_4.5.0
## [79] impute_1.81.0 tidyr_1.3.1
## [81] MsCoreUtils_1.19.0 colorspace_2.1-1
## [83] GenomeInfoDbData_1.2.13 PSMatch_1.11.0
## [85] cli_3.6.3 fansi_1.0.6
## [87] S4Arrays_1.7.0 dplyr_1.1.4
## [89] AnnotationFilter_1.31.0 pcaMethods_1.99.0
## [91] gtable_0.3.6 sass_0.4.9
## [93] digest_0.6.37 SparseArray_1.7.0
## [95] multtest_2.63.0 htmltools_0.5.8.1
## [97] lifecycle_1.0.4 httr_1.4.7
## [99] statmod_1.5.0 MASS_7.3-61