MSstatsPTM 1.0.0
MSstatsPTM provides a set of general statistical methods for characterization of quantitative changes in global post-translational modification (PTM) profiling experiments. Typically, the analysis involves the quantification of PTM sites (i.e., modified residues) and their corresponding proteins, as well as the integration of the quantification results.
Quantitative analyses of PTMs are supported by four main functions of MSstatsPTM:
PTMnormalize()
normalizes the quantified peak intensities to correct
systematic variation across MS runs.
PTMsummarize()
summarizes log2-intensities of spectral features (i.e.,
precursor ions in DDA, fragments in DIA, or transitions in SRM) into one value
per PTM site per run or one value per protein per run.
PTMestimate()
takes as input the summarized log2-intensities for each PTM
site, performs statistical modeling for the log2-abundance of the site, and
returns the estimates of model parameters for all PTM sites in all experimental
conditions.
PTMcompareMeans()
performs statistical testing for detecting changes in PTM
mean abundances between conditions.
The development version of MSstatsPTM can be installed from GitHub:
devtools::install_github("tsunghengtsai/MSstatsPTM")
Once installed, MSstatsPTM can be loaded with library()
:
library(MSstatsPTM)
The abundance of a PTM site depends on two factors: (1) the proportion of proteins carrying the PTM, and (2) the underlying protein abundance. Quantification of PTMs alone cannot provide complete information about the degree of the modification. Therefore, a quantitative PTM experiment often involves analyses of enriched samples (for PTMs) and unenriched samples (for proteins).
The MSstatsPTM analysis workflow takes as input a list of two data frames,
named PTM
and PROTEIN
.
We use PTMsimulateExperiment()
to generate an example dataset. The function
takes in account several parameters in a PTM experiment: number of groups
(nGroup
), number of replicates per group (nRep
), number of proteins
(nProtein
), number of sites per protein (nSite
), number of spectral
features per site/protein (nFeature
), mean log2-abundance of PTM and PROTEIN
(mu
), deviation from the mean log2-abundance in each group (delta
),
standard deviation among replicates (sRep
), and standard deviation among
log2-intensities (sPeak
).
# sim <- PTMsimulateExperiment(
# nGroup=2, nRep=2, nProtein=1, nSite=2, nFeature=5,
# mu=list(PTM=25, PROTEIN=25),
# delta=list(PTM=c(0, 1), PROTEIN=c(0, 1)),
# sRep=list(PTM=0.2, PROTEIN=0.2),
# sPeak=list(PTM=0.05, PROTEIN=0.05)
# )
sim <- PTMsimulateExperiment(
nGroup=2, nRep=2, nProtein=1, nSite=2, nFeature=5,
logAbundance=list(
PTM=list(mu=25, delta=c(0, 1), sRep=0.2, sPeak=0.05),
PROTEIN=list(mu=25, delta=c(0, 1), sRep=0.2, sPeak=0.05)
)
)
A list of two data frames named PTM
and PROTEIN
is returned by
PTMsimulateExperiment()
:
str(sim)
#> List of 2
#> $ PTM : tibble [40 × 6] (S3: tbl_df/tbl/data.frame)
#> ..$ protein : chr [1:40] "Protein_1" "Protein_1" "Protein_1" "Protein_1" ...
#> ..$ site : chr [1:40] "S_1" "S_1" "S_1" "S_1" ...
#> ..$ group : chr [1:40] "G_1" "G_1" "G_1" "G_1" ...
#> ..$ run : chr [1:40] "R_1" "R_1" "R_1" "R_1" ...
#> ..$ feature : chr [1:40] "F_1" "F_2" "F_3" "F_4" ...
#> ..$ log2inty: num [1:40] 24.9 24.9 24.9 24.9 24.8 ...
#> $ PROTEIN: tibble [20 × 5] (S3: tbl_df/tbl/data.frame)
#> ..$ protein : chr [1:20] "Protein_1" "Protein_1" "Protein_1" "Protein_1" ...
#> ..$ group : chr [1:20] "G_1" "G_1" "G_1" "G_1" ...
#> ..$ run : chr [1:20] "R_1" "R_1" "R_1" "R_1" ...
#> ..$ feature : chr [1:20] "F_1" "F_2" "F_3" "F_4" ...
#> ..$ log2inty: num [1:20] 24.5 24.5 24.6 24.4 24.6 ...
The PTM
data frame contains 6 columns representing the quantified log2inty
of each feature
, site
and protein
, in the corresponding group
and
run
:
sim[["PTM"]]
#> # A tibble: 40 x 6
#> protein site group run feature log2inty
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 S_1 G_1 R_1 F_1 24.9
#> 2 Protein_1 S_1 G_1 R_1 F_2 24.9
#> 3 Protein_1 S_1 G_1 R_1 F_3 24.9
#> 4 Protein_1 S_1 G_1 R_1 F_4 24.9
#> 5 Protein_1 S_1 G_1 R_1 F_5 24.8
#> 6 Protein_1 S_1 G_1 R_2 F_1 25.3
#> 7 Protein_1 S_1 G_1 R_2 F_2 25.4
#> 8 Protein_1 S_1 G_1 R_2 F_3 25.5
#> 9 Protein_1 S_1 G_1 R_2 F_4 25.5
#> 10 Protein_1 S_1 G_1 R_2 F_5 25.3
#> # … with 30 more rows
The PROTEIN
data frame includes the same columns except site
:
sim[["PROTEIN"]]
#> # A tibble: 20 x 5
#> protein group run feature log2inty
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 G_1 R_1 F_1 24.5
#> 2 Protein_1 G_1 R_1 F_2 24.5
#> 3 Protein_1 G_1 R_1 F_3 24.6
#> 4 Protein_1 G_1 R_1 F_4 24.4
#> 5 Protein_1 G_1 R_1 F_5 24.6
#> 6 Protein_1 G_1 R_2 F_1 25.2
#> 7 Protein_1 G_1 R_2 F_2 25.3
#> 8 Protein_1 G_1 R_2 F_3 25.2
#> 9 Protein_1 G_1 R_2 F_4 25.2
#> 10 Protein_1 G_1 R_2 F_5 25.2
#> 11 Protein_1 G_2 R_3 F_1 26.2
#> 12 Protein_1 G_2 R_3 F_2 26.3
#> 13 Protein_1 G_2 R_3 F_3 26.2
#> 14 Protein_1 G_2 R_3 F_4 26.2
#> 15 Protein_1 G_2 R_3 F_5 26.3
#> 16 Protein_1 G_2 R_4 F_1 26.0
#> 17 Protein_1 G_2 R_4 F_2 26.1
#> 18 Protein_1 G_2 R_4 F_3 26.1
#> 19 Protein_1 G_2 R_4 F_4 26.1
#> 20 Protein_1 G_2 R_4 F_5 26.0
A main distinction of the quantitative PTM analysis from general proteomics analyses is the focus on the PTM sites, where the basic analysis unit is a PTM site, rather than a protein. A site can be spanned by multiple peptides, and quantified with multiple spectral features. Transformation from peptide-level representation to site-level representation requires locating PTM sites on protein sequence, usually provided in a FASTA format.
MSstatsPTM provides several help functions to facilitate the transformation:
tidyFasta()
reads and returns the sequence information in a tidy format.
PTMlocate()
annotates PTM sites with the associated peptides and protein
sequences.
tidyFasta()
takes as input the path to a FASTA file (either a local path or
an URL). For example, we can access to the information of alpha-synuclein
(P37840) on Uniprot and extract the
sequence information as follows:
fas <- tidyFasta("https://www.uniprot.org/uniprot/P37840.fasta")
Normalization of log2-intensities can be performed by equalizing appropriate
summary statistics (such as the median of the log2-intensities) across MS runs,
or by using a reference derived from either spiked-in internal standard or
other orthogonal evidence. The PTM data and the PROTEIN data are normalized
separately, using PTMnormalize()
.
Normalization often relies on the assumption about the distribution of log2-intensities for each MS run. For example, one common assumption is that the abundances of most PTM sites and proteins do not change across conditions. It is therefore reasonable to equalize measures of distribution location across MS runs.
The data can be normalized with PTMnormalize()
. Different summary statistics
can be defined through the method
option, including the median of
log2-intensities (median
, default), the mean of log2-intensities (mean
),
and the log2 of intensity sum (logsum
).
normalized <- PTMnormalize(sim, method="median")
normalized
#> $PTM
#> # A tibble: 40 x 6
#> protein site group run feature log2inty
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 S_1 G_1 R_1 F_1 25.6
#> 2 Protein_1 S_1 G_1 R_1 F_2 25.6
#> 3 Protein_1 S_1 G_1 R_1 F_3 25.7
#> 4 Protein_1 S_1 G_1 R_1 F_4 25.6
#> 5 Protein_1 S_1 G_1 R_1 F_5 25.6
#> 6 Protein_1 S_1 G_1 R_2 F_1 25.5
#> 7 Protein_1 S_1 G_1 R_2 F_2 25.6
#> 8 Protein_1 S_1 G_1 R_2 F_3 25.7
#> 9 Protein_1 S_1 G_1 R_2 F_4 25.7
#> 10 Protein_1 S_1 G_1 R_2 F_5 25.5
#> # … with 30 more rows
#>
#> $PROTEIN
#> # A tibble: 20 x 5
#> protein group run feature log2inty
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 G_1 R_1 F_1 25.6
#> 2 Protein_1 G_1 R_1 F_2 25.6
#> 3 Protein_1 G_1 R_1 F_3 25.6
#> 4 Protein_1 G_1 R_1 F_4 25.5
#> 5 Protein_1 G_1 R_1 F_5 25.7
#> 6 Protein_1 G_1 R_2 F_1 25.6
#> 7 Protein_1 G_1 R_2 F_2 25.8
#> 8 Protein_1 G_1 R_2 F_3 25.6
#> 9 Protein_1 G_1 R_2 F_4 25.6
#> 10 Protein_1 G_1 R_2 F_5 25.6
#> 11 Protein_1 G_2 R_3 F_1 25.6
#> 12 Protein_1 G_2 R_3 F_2 25.7
#> 13 Protein_1 G_2 R_3 F_3 25.6
#> 14 Protein_1 G_2 R_3 F_4 25.6
#> 15 Protein_1 G_2 R_3 F_5 25.7
#> 16 Protein_1 G_2 R_4 F_1 25.6
#> 17 Protein_1 G_2 R_4 F_2 25.6
#> 18 Protein_1 G_2 R_4 F_3 25.6
#> 19 Protein_1 G_2 R_4 F_4 25.7
#> 20 Protein_1 G_2 R_4 F_5 25.6
The assumption made in the summary-based normalization may not be valid in all
scenarios. When experimental design allows to derive a reference for the
normalization (e.g., using internal standard), it can be useful to perform
normalization using the reference. For example, the hypothetical reference
below defines an adjustment of c(2, -2, 0, 0)
for each run in the PTM data
and c(1.5, -1, 0, 0)
in the PROTEIN data.
refs <- list(
PTM=data.frame(run=paste0("R_", 1:4), adjLog2inty=c(2, -2, 0, 0)),
PROTEIN=data.frame(run=paste0("R_", 1:4), adjLog2inty=c(3, -1, 0, 0))
)
PTMnormalize(sim, method="ref", refs=refs)
#> $PTM
#> # A tibble: 40 x 6
#> protein site group run feature log2inty
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 S_1 G_1 R_1 F_1 26.9
#> 2 Protein_1 S_1 G_1 R_1 F_2 26.9
#> 3 Protein_1 S_1 G_1 R_1 F_3 26.9
#> 4 Protein_1 S_1 G_1 R_1 F_4 26.9
#> 5 Protein_1 S_1 G_1 R_1 F_5 26.8
#> 6 Protein_1 S_1 G_1 R_2 F_1 23.3
#> 7 Protein_1 S_1 G_1 R_2 F_2 23.4
#> 8 Protein_1 S_1 G_1 R_2 F_3 23.5
#> 9 Protein_1 S_1 G_1 R_2 F_4 23.5
#> 10 Protein_1 S_1 G_1 R_2 F_5 23.3
#> # … with 30 more rows
#>
#> $PROTEIN
#> # A tibble: 20 x 5
#> protein group run feature log2inty
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 Protein_1 G_1 R_1 F_1 27.5
#> 2 Protein_1 G_1 R_1 F_2 27.5
#> 3 Protein_1 G_1 R_1 F_3 27.6
#> 4 Protein_1 G_1 R_1 F_4 27.4
#> 5 Protein_1 G_1 R_1 F_5 27.6
#> 6 Protein_1 G_1 R_2 F_1 24.2
#> 7 Protein_1 G_1 R_2 F_2 24.3
#> 8 Protein_1 G_1 R_2 F_3 24.2
#> 9 Protein_1 G_1 R_2 F_4 24.2
#> 10 Protein_1 G_1 R_2 F_5 24.2
#> 11 Protein_1 G_2 R_3 F_1 26.2
#> 12 Protein_1 G_2 R_3 F_2 26.3
#> 13 Protein_1 G_2 R_3 F_3 26.2
#> 14 Protein_1 G_2 R_3 F_4 26.2
#> 15 Protein_1 G_2 R_3 F_5 26.3
#> 16 Protein_1 G_2 R_4 F_1 26.0
#> 17 Protein_1 G_2 R_4 F_2 26.1
#> 18 Protein_1 G_2 R_4 F_3 26.1
#> 19 Protein_1 G_2 R_4 F_4 26.1
#> 20 Protein_1 G_2 R_4 F_5 26.0
MSstatsPTM performs the statistical modeling with a split-plot approach,
which summarizes the log2-intensities into one single value per site per run
(in PTM
) or one value per protein per run (in PROTEIN
), and expresses the
summarized values in consideration of experimental conditions and replicates.
The summarization of log2-intensities can be performed using PTMsummarize()
.
summarized <- PTMsummarize(normalized)
summarized
#> $PTM
#> # A tibble: 8 x 5
#> protein site run log2inty group
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 Protein_1 S_1 R_1 25.6 G_1
#> 2 Protein_1 S_1 R_2 25.6 G_1
#> 3 Protein_1 S_1 R_3 25.6 G_2
#> 4 Protein_1 S_1 R_4 25.9 G_2
#> 5 Protein_1 S_2 R_1 25.7 G_1
#> 6 Protein_1 S_2 R_2 25.7 G_1
#> 7 Protein_1 S_2 R_3 25.7 G_2
#> 8 Protein_1 S_2 R_4 25.4 G_2
#>
#> $PROTEIN
#> # A tibble: 4 x 4
#> protein run log2inty group
#> <chr> <chr> <dbl> <chr>
#> 1 Protein_1 R_1 25.6 G_1
#> 2 Protein_1 R_2 25.6 G_1
#> 3 Protein_1 R_3 25.7 G_2
#> 4 Protein_1 R_4 25.6 G_2
There are 5 summarization options available with PTMsummarize()
:
tmp
(default): Tukey’s median polish procedurelogsum
: log2 of the summation of peak intensitiesmean
: mean of the log2-intensitiesmedian
: median of the log2-intensitiesmax
: max of the log2-intensitiesStatistical inference of the abundance of each site (in PTM
) or protein
(in PROTEIN
) in each group is performed by applying PTMestimate()
, which
expresses the summarized log2-intensities using linear models and returns the
parameters of the fitted models. The log2-abundance estimate and its standard
error of each PTM site and protein in each group are returned by
PTMestimate()
.
estimates <- PTMestimate(summarized)
estimates
#> $PTM
#> $PTM$protein
#> [1] "Protein_1" "Protein_1"
#>
#> $PTM$site
#> [1] "S_1" "S_2"
#>
#> $PTM$param
#> $PTM$param[[1]]
#> # A tibble: 2 x 3
#> estimate std.error group
#> <dbl> <dbl> <chr>
#> 1 25.6 0.130 G_1
#> 2 25.7 0.130 G_2
#>
#> $PTM$param[[2]]
#> # A tibble: 2 x 3
#> estimate std.error group
#> <dbl> <dbl> <chr>
#> 1 25.7 0.0916 G_1
#> 2 25.5 0.0916 G_2
#>
#>
#> $PTM$df
#> [1] 2 2
#>
#>
#> $PROTEIN
#> $PROTEIN$protein
#> [1] "Protein_1"
#>
#> $PROTEIN$param
#> $PROTEIN$param[[1]]
#> # A tibble: 2 x 3
#> estimate std.error group
#> <dbl> <dbl> <chr>
#> 1 25.6 0.0170 G_1
#> 2 25.6 0.0170 G_2
#>
#>
#> $PROTEIN$df
#> [1] 2
With the parameter estimates from the previous step, PTMcompareMeans()
detects systematic changes in mean abundances of PTM sites between groups.
PTMcompareMeans(estimates, controls="G_1", cases="G_2")
#> # A tibble: 2 x 9
#> Protein Site Label log2FC SE Tvalue DF pvalue adj.pvalue
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Protein_1 S_1 G_2 vs G_1 0.116 0.184 0.630 2 0.593 0.593
#> 2 Protein_1 S_2 G_2 vs G_1 -0.165 0.130 -1.27 2 0.331 0.593
The first argument to PTMcompareMeans()
is the list of parameter estimates
returned by PTMestimate()
.
The second argument controls
defines the names of control groups (can be
more than one) in the comparison.
The third argument cases
defines the names of case groups in the comparison.
PTMcompareMeans()
returns a summary of the comparison, with names of
proteins (Protein
) and sites (Site
), contrast in the comparison (Label
),
log2-fold change of the mean abundance (log2FC
) and its standard error
(SE
), \(t\)-statistic (Tvalue
), number of degrees of freedom (DF
), and the
resulting \(p\)-value (pvalue
).
By default, PTMcompareMeans()
performs the comparison without taking into
account of changes in the underlying protein abundance. We can incorporate the
adjustment with respect to protein abundance by setting adjProtein = TRUE
:
PTMcompareMeans(estimates, controls="G_1", cases="G_2", adjProtein=TRUE)
#> Joining, by = c("Protein", "Label")
#> # A tibble: 2 x 9
#> Protein Site Label log2FC SE Tvalue DF pvalue adj.pvalue
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Protein_1 S_1 G_2 vs G_1 0.116 0.185 0.624 2.07 0.594 0.594
#> 2 Protein_1 S_2 G_2 vs G_1 -0.165 0.132 -1.25 2.14 0.330 0.594
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MSstatsPTM_1.0.0 BiocStyle_2.18.0
#>
#> loaded via a namespace (and not attached):
#> [1] knitr_1.30 magrittr_1.5 tidyselect_1.1.0
#> [4] R6_2.4.1 rlang_0.4.8 fansi_0.4.1
#> [7] stringr_1.4.0 dplyr_1.0.2 tools_4.0.3
#> [10] broom_0.7.2 xfun_0.18 utf8_1.1.4
#> [13] cli_2.1.0 htmltools_0.5.0 ellipsis_0.3.1
#> [16] assertthat_0.2.1 yaml_2.2.1 digest_0.6.27
#> [19] tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4
#> [22] bookdown_0.21 tidyr_1.1.2 purrr_0.3.4
#> [25] BiocManager_1.30.10 vctrs_0.3.4 glue_1.4.2
#> [28] evaluate_0.14 rmarkdown_2.5 stringi_1.5.3
#> [31] compiler_4.0.3 pillar_1.4.6 backports_1.1.10
#> [34] generics_0.0.2 pkgconfig_2.0.3