phosphonormalizer: Pairwise normalization of phosphoproteomics data

Sohrab Saraei, Tomi Suomi, Otto Kauko,Laura L. Elo

October 11, 2016

Introduction

Global centering-based normalization is a commonly-used normalization approach in mass spectrometry (MS) -based label-free proteomics. It scales the peptide abundances to have the same median intensities, based on an assumption that the majority of abundances remain the same across the samples. However, especially in phosphoproteomics experiments, this assumption can introduce bias, as the enrichment of phosphopeptides during sample preparation can mask large unidirectional biological changes. Therefore, a novel method called pairwise normalization has been introduced that addresses this possible bias by utilizing phosphopeptides quantified in both enriched and non-enriched samples to calculate factors that mitigate the bias (Kauko et al. 2015). The phosphonormalizer package implements the pairwise normalization (Saraei et al., under review ).

The phosphonormalizer package (Saraei et al. under review) normalizes the enriched samples in label-free MS-based phosphoproteomics using phosphopeptides that are present in both enriched and non-enriched data of the same samples. If there are no common phosphopeptides between the enriched and non-enriched data, then the normalization is not possible and an error is generated.

Input data

In order to use phosphonormalizer package, we assume that the experiment have been conducted on both enriched and non-enriched samples. These datasets must have the sequence, modification and abundance columns. The sequence and modification columns in the dataframe must be in character format and the abundance columns in numeric. The algorithm expects that the abundances are pre-normalized with median normalization (Kauko et al. 2015). This package also supports MSnSet data type from MSnbase package which is used in data preprocessing step of bioconductor mass spectrometry proteomics workflow (see more: https://www.bioconductor.org/help/workflows/proteomics/).

    #Load the library
    library(phosphonormalizer)
    #Enriched data overview
    head(enriched.rd)
##                          Sequence
## 1                       SSSPVNVKK
## 2 TTSQKHRDFVAEPMGEKPVGSLAGIGEVLGK
## 3                      RRSPSPYYSR
## 4                     LRLSPSPTSQR
## 5        MEDLDQSPLVSSSDSPPRPQPAFK
## 6                     LRLSPSPTSQR
##                                                 Modification
## 1                  [N-term] Acetyl; (N-Term)|[3] Phospho (S)
## 2 [N-term] Acetyl; (N-Term)|[2] Phospho; (T)|[3] Phospho (S)
## 3                           [3] Phospho; (S)|[5] Phospho (S)
## 4                           [4] Phospho; (S)|[6] Phospho (S)
## 5                 [N-term] Acetyl; (N-Term)|[15] Phospho (S)
## 6                           [4] Phospho; (S)|[6] Phospho (S)
##   gcNorm.ctrl2.1 gcNorm.ctrl2.2 gcNorm.ctrl2.3 gcNorm.ctrl1.1
## 1      259694658      457040590      587004777      230727898
## 2     1119159195     1705615152     1953963165     1078311545
## 3      336584109      193881363      276640765      148976414
## 4      134915349      167734763      197708821      229647325
## 5      720567253      630721302      568929214      606538462
## 6       54067728       68288843       89662099       95882948
##   gcNorm.ctrl1.2 gcNorm.ctrl1.3 gcNorm.CIP2A.1 gcNorm.CIP2A.2
## 1      322501318      440483177      668689914      674423478
## 2     1773146888     1474056582     1190582658      676441788
## 3      174112138      153753909      287782373      287136785
## 4      191052464      166754233      220674695      194132906
## 5      723389269      682481415      874073517     1017771417
## 6       88779229       81051004       89833726       71876578
##   gcNorm.CIP2A.3 gcNorm.RAS.1 gcNorm.RAS.2 gcNorm.RAS.3 gcNorm.OA.1
## 1      625035056    451234167    363667417    309494581   352078454
## 2      769588835   1372965516   1339147824   2078858839  1179294365
## 3      245112678    234516289    283827574    200462805   180598176
## 4      162583218    227425157    320668737    256416084   150749053
## 5     1007191248    427561343    662159439    681176179   436802093
## 6       68494304    114763943    146028525    111620925    83197600
##   gcNorm.OA.2 gcNorm.OA.3
## 1   177105234   227394429
## 2  1167182710  2117750218
## 3   158187327   209212284
## 4   373474922   306284818
## 5   309956463   307752815
## 6   165557524   152691693
    #Non-enriched data overview
    head(non.enriched.rd)
##              Sequence Modification gcNorm.ctrl2.1 gcNorm.ctrl2.2
## 1           LLLPGELAK         <NA>     1134176418      974814910
## 2           AGLQFPVGR         <NA>      826483607      715965777
## 3     AMGIMNSFVNDIFER         <NA>     2528350640     2105237338
## 4 VTIAQGGVLPNIQAVLLPK         <NA>     1263890433     1174074703
## 5 VTIAQGGVLPNIQAVLLPK         <NA>     1715542498     1657401406
## 6          AGFAGDDAPR         <NA>      410016382      334405218
##   gcNorm.ctrl2.3 gcNorm.ctrl1.1 gcNorm.ctrl1.2 gcNorm.ctrl1.3
## 1     1228718539     1394059839     1527009279     2331374945
## 2      830605008      902406245     1107622809     1508853999
## 3     2495554811     1272528331     2944603758     2443519913
## 4     1306359138     1068498506     1116762968     1431064191
## 5     1626124071     1985173704     2236269438     2453708614
## 6      692078515      939263413      409181048      515946375
##   gcNorm.CIP2A.1 gcNorm.CIP2A.2 gcNorm.CIP2A.3 gcNorm.RAS.1 gcNorm.RAS.2
## 1     3262876356     1475250309     1402379678   2331435624   1359941630
## 2     1051523142     1619993343     1374369052   1702157550   1021184670
## 3     3243241407     3981953390     3293949927   2804134445   3542050573
## 4     1586541519     2171400749     1334195938   1514303277   1828763181
## 5     2788681677     3975512588     2547762644   2568498719   3679284319
## 6      122490064      232386439      354316201    338037562    267232025
##   gcNorm.RAS.3 gcNorm.OA.1 gcNorm.OA.2 gcNorm.OA.3
## 1   1337506496  1776762436  2068739879  1619286624
## 2   1094560285  1230922434  1253221279  1234030083
## 3   4403231723  3307460281  2235874137  2344112126
## 4   1619522966  1197144107  1204365265  1428068227
## 5   3297580183  2495127098  2084809203  2121807104
## 6    414121334   275091595   249007012   416661130

Pairwise normalization

The normalization begins by loading the phosphonormalizer package. Here for demonstration, the data used is from “enriched.rd” and “non.enriched.rd” are available with the package. Boxplot of fold change distribution before and after pairwise normalization can also be generated by setting the plot parameter (look at the example).

Installation

To install this package, start R and enter:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("phosphonormalizer")

Example

    #Load the library
    library(phosphonormalizer)
    #Specify the column numbers of abundances in the original data.frame, 
    #from both enriched and non-enriched runs
    samplesCols <- data.frame(enriched=3:17, non.enriched=3:17)
    #Specify the column numbers of sequence and modification in the original data.frame, 
    #from both enriched and non-enriched runs
    modseqCols <- data.frame(enriched = 1:2, non.enriched = 1:2)
    #The samples and their technical replicates
    techRep <- factor(x = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5))
    #If the paramter plot.fc set, the corresponding plots of Sample fold changes is produced
    #Here, for demonstration, the fold change distributions are shown for samples 3 vs 1
    plot.param <- list(control = c(1), samples = c(3))
    #Call the function
    norm <- normalizePhospho(enriched = enriched.rd, non.enriched = non.enriched.rd, 
    samplesCols = samplesCols, modseqCols = modseqCols, techRep = techRep, 
    plot.fc = plot.param)
## The number of peptides in the intersect is: 54
## 1 plots generated. Browse through them.