readSCP
scp 1.4.0
scp
data frameworkOur data structure is relying on two curated data classes: QFeatures
(Gatto (2020)) and SingleCellExperiment
(Amezquita et al. (2019)).
QFeatures
is dedicated to the manipulation and processing of
MS-based quantitative data. It explicitly records the successive steps
to allow users to navigate up and down the different MS levels.
SingleCellExperiment
is another class designed as an efficient data
container that serves as an interface to state-of-the-art methods and
algorithms for single-cell data. Our framework combines the two
classes to inherit from their respective advantages.
Because mass spectrometry (MS)-based single-cell proteomics (SCP) only
captures the proteome of between one and a few tens of single-cells in
a single run, the data is usually acquired across many MS batches.
Therefore, the data for each run should conceptually be stored in its
own container, that we here call an assay. The expected input for
working with the scp
package is quantification data of peptide to
spectrum matches (PSM). These data can then be processed to reconstruct
peptide and protein data. The links between related features across
different assays are stored to facilitate manipulation and
visualization of of PSM, peptide and protein data. This is
conceptually shown below.
There are two input tables required for starting an analysis with
scp
:
The feature data are generated after the identification and
quantification of the MS spectra by a pre-processing software such as
MaxQuant, ProteomeDiscoverer or MSFragger (the
list
of available software is actually much longer). We will here use as an
example a data table that has been generated by MaxQuant. The table is
available from the scp
package and is called mqScpData
(for
MaxQuant generated SCP data).
library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361 149
In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. The columns can be divided into three categories:
The quantification data can be composed of one (in case of label-free
acquisition) up to 16 columns (in case of TMT-16 multiplexing). The
columns holding the quantification start with Reporter.intensity.
followed by a number.
(quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
value = TRUE))
#> [1] "Reporter.intensity.1" "Reporter.intensity.2" "Reporter.intensity.3"
#> [4] "Reporter.intensity.4" "Reporter.intensity.5" "Reporter.intensity.6"
#> [7] "Reporter.intensity.7" "Reporter.intensity.8" "Reporter.intensity.9"
#> [10] "Reporter.intensity.10" "Reporter.intensity.11" "Reporter.intensity.12"
#> [13] "Reporter.intensity.13" "Reporter.intensity.14" "Reporter.intensity.15"
#> [16] "Reporter.intensity.16"
As you may notice, the example data was acquired using a TMT-16 protocol since we retrieve 16 quantification columns. Actually, some runs were acquired using a TMT-11 protocol (11 labels) but we will come back to this later.
head(mqScpData[, quantCols])
#> Reporter.intensity.1 Reporter.intensity.2 Reporter.intensity.3
#> 1 61251 501.71 3731.3
#> 2 58648 1099.80 2837.7
#> 3 150350 3705.00 9361.0
#> 4 27347 405.90 1525.2
#> 5 84035 583.09 4092.3
#> 6 44895 700.23 2283.0
#> Reporter.intensity.4 Reporter.intensity.5 Reporter.intensity.6
#> 1 1643.30 871.84 981.87
#> 2 494.32 349.26 1030.50
#> 3 0.00 1945.40 1188.60
#> 4 0.00 0.00 318.74
#> 5 530.13 718.13 2204.50
#> 6 1109.60 0.00 675.79
#> Reporter.intensity.7 Reporter.intensity.8 Reporter.intensity.9
#> 1 1200.10 939.06 1457.50
#> 2 0.00 1214.10 800.58
#> 3 1574.00 2302.10 2176.10
#> 4 0.00 519.81 0.00
#> 5 960.51 453.77 1188.40
#> 6 0.00 809.38 668.88
#> Reporter.intensity.10 Reporter.intensity.11 Reporter.intensity.12
#> 1 1329.80 981.83 NA
#> 2 807.79 391.38 NA
#> 3 1399.50 1307.50 2192.4
#> 4 507.23 370.79 NA
#> 5 740.99 0.00 NA
#> 6 1467.50 901.38 NA
#> Reporter.intensity.13 Reporter.intensity.14 Reporter.intensity.15
#> 1 NA NA NA
#> 2 NA NA NA
#> 3 1791.4 1727.5 2157.3
#> 4 NA NA NA
#> 5 NA NA NA
#> 6 NA NA NA
#> Reporter.intensity.16
#> 1 NA
#> 2 NA
#> 3 1398
#> 4 NA
#> 5 NA
#> 6 NA
Most columns in the mqScpData
table contain information used or
generated during the identification of the MS spectra. For instance,
you may find the charge of the parent ion, the score and probability
of a correct match between the MS spectrum and a peptide sequence, the
sequence of the best matching peptide, its length, its modifications,
the retention time of the peptide on the LC, the protein(s) the peptide
originates from and much more.
head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
"Retention.time", "Proteins")])
#> Charge Score PEP Sequence Length Retention.time
#> 1 2 41.029 5.2636e-04 ATNFLAHEK 9 65.781
#> 2 2 44.349 5.8789e-04 ATNFLAHEK 9 63.787
#> 3 2 51.066 4.0315e-24 SHTILLVQPTK 11 71.884
#> 4 2 63.816 4.7622e-06 SHTILLVQPTK 11 68.633
#> 5 2 74.464 6.8709e-09 SHTILLVQPTK 11 71.946
#> 6 2 41.502 5.3705e-02 SLVIPEK 7 76.204
#> Proteins
#> 1 sp|P29692|EF1D_HUMAN
#> 2 sp|P29692|EF1D_HUMAN
#> 3 sp|P84090|ERH_HUMAN
#> 4 sp|P84090|ERH_HUMAN
#> 5 sp|P84090|ERH_HUMAN
#> 6 sp|P62269|RS18_HUMAN
This type of metadata is related to the MS instrument. In MaxQuant, only the file name generated by the MS instrument is stored. There is one file for each MS run, hence the file name can be used as a batch identifier.
unique(mqScpData$Raw.file)
#> [1] "190321S_LCA10_X_FP97AG" "190222S_LCA9_X_FP94BM"
#> [3] "190914S_LCB3_X_16plex_Set_21" "190321S_LCA10_X_FP97_blank_01"
Next to the quantification data and the feature data, sample metadata contains the experimental design generated by the researcher. The rows of sample metadata correspond to a sample in the experiment and the columns correspond to the available information about the sample. We will here use the second example table:
data("sampleAnnotation")
head(sampleAnnotation)
#> Raw.file Channel SampleType lcbatch sortday digest
#> 1 190222S_LCA9_X_FP94BM Reporter.intensity.1 Carrier LCA9 s8 N
#> 2 190222S_LCA9_X_FP94BM Reporter.intensity.2 Reference LCA9 s8 N
#> 3 190222S_LCA9_X_FP94BM Reporter.intensity.3 Unused LCA9 s8 N
#> 4 190222S_LCA9_X_FP94BM Reporter.intensity.4 Monocyte LCA9 s8 N
#> 5 190222S_LCA9_X_FP94BM Reporter.intensity.5 Blank LCA9 s8 N
#> 6 190222S_LCA9_X_FP94BM Reporter.intensity.6 Monocyte LCA9 s8 N
This table may contain any information about the samples. For example,
useful information could be the type of sample that is analysed, a
phenotype known from the experimental design, the MS batch, the
acquisition date, MS settings used to acquire the sample, the LC
batch, the sample preparation batch, etc… However, scp
requires 2 specific fields in the sample metadata:
Raw.file
in this case).
It must have the same name as the name of the column containing the
MS run names in the quantification table. This will allow scp
to
correctly match and split data that were acquired separately. This
is illustrated below by the linked circles.scp
the names of the columns in the feature
data holds the quantification of the corresponding sample. This is
illustrated below by the arrow.readSCP
readSCP
is the function that converts the sample and the feature
data into a QFeatures
object following the data structure described
above, that is storing the data belonging to each MS batch in a
separate SingleCellExperiment
object. We therefore provide the
feature data, the sample data to the function as well as the name of
the column that holds the batch name in both tables and the name of
the column in the sample data that points to the quantification
columns in the feature data.
suffix
readSCP
automatically assigns names that are unique across all
samples in all assays. This is performed by appending the name of the
batch where a given sample is found in with the name of the
quantification column for that sample. Suppose a sample belongs to
batch 190222S_LCA9_X_FP94BM
and the quantification values in the
feature data are found in the column called Reporter.intensity.3
,
then the sample name will become 190222S_LCA9_X_FP94BMReporter.intensity.3
.
Optionally, to improve the readability of sample names, readSCP
can
take a suffix instead of the quantification column name. For instance,
in the example below, we will provide a short suffix with the TMT
index to remind that samples were multiplexed using TMT.
In some rare cases, it can be beneficial to remove empty samples (all
quantifications are NA
) from the assays. Such samples can occur when
samples that were acquired with different multiplexing labels are
merged in a single table. For instance, the SCoPE2 data we provide as
an example contains runs that were acquired with two TMT protocols.
The 3 first assays were acquired using the TMT-11 protocol and the
last assay was acquired using a TMT-16 protocol. When exporting the
table, the authors combined the data in a single table, were missing
channels in the TMT-11 data are filled with NA
. This is essential
when working in table format, but since scp
keeps the runs separated
we can allow for different numbers of channels per run. When setting
removeEmptyCols = TRUE
, readSCP
automatically detects and removes
columns that contain only NA
s,
readSCP
We convert the sample and the feature data into a QFeatures
object
in a single command thanks to readSCP
.
scp <- readSCP(featureData = mqScpData,
colData = sampleAnnotation,
batchCol = "Raw.file",
channelCol = "Channel",
suffix = paste0("_TMT", 1:16),
removeEmptyCols = TRUE)
#> Loading data as a 'SingleCellExperiment' object
#> Splitting data based on 'Raw.file'
#> Formatting sample metadata (colData)
#> Formatting data as a 'QFeatures' object
As indicated by the output on the console, readSCP
proceeds as follows:
If featureData
is the path to a CSV file, it reads the file using
read.csv
. The table is converted to a SingleCellExperiment
object. readSCP
needs to know in which field(s) the quantitative
data is stored. Those field name(s) is/are provided by the
channelCol
field in the metaData
table. So in this example, the
column names holding the quantitative data in mqScpData
are
stored in the column named Channel
in sampleAnnotation
.
The SingleCellExperiment
object is then split according to
batch. The split is performed depending on the batchCol
field in
featureData
, in this case the data will be split according to the
Raw.file
column in mqScpData
. Raw.file
contains the names of the
acquisition runs that was used by MaxQuant to retrieve the raw data
files.
The sample metadata is generated from the supplied colData
. Note
that in order for readSCP
to correctly match the feature data
with the metadata, colData
must contain the same batchCol
field
with batch names.
Finally, the split feature data and the sample metadata are stored
in a single QFeatures
object.
Here is a compact overview of the data:
scp
#> An instance of class QFeatures containing 4 assays:
#> [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 11 columns
#> [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 11 columns
#> [3] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 11 columns
#> [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns
We can see that the object returned by readSCP
is a QFeatures
object containing 4 SingleCellExperiment
assays that have
been named after the 4 MS batches. Each assay contains either 11 or 16
columns (samples) depending on the TMT labelling strategy and a
variable number of rows (quantified PSMs). Each piece of information
can easily be retrieved thanks to the Qfeatures
architectures.
As mentioned in the previous vignette, sample data is retrieved from
the colData
. Note that unique sample names were automatically
generated by combining the batch name and the suffix provided to
readSCP
:
head(colData(scp))
#> DataFrame with 6 rows and 6 columns
#> Raw.file Channel SampleType lcbatch
#> <character> <character> <character> <character>
#> 190222S_LCA9_X_FP94BM_TMT1 190222S_LC... Reporter.i... Carrier LCA9
#> 190222S_LCA9_X_FP94BM_TMT2 190222S_LC... Reporter.i... Reference LCA9
#> 190222S_LCA9_X_FP94BM_TMT3 190222S_LC... Reporter.i... Unused LCA9
#> 190222S_LCA9_X_FP94BM_TMT4 190222S_LC... Reporter.i... Monocyte LCA9
#> 190222S_LCA9_X_FP94BM_TMT5 190222S_LC... Reporter.i... Blank LCA9
#> 190222S_LCA9_X_FP94BM_TMT6 190222S_LC... Reporter.i... Monocyte LCA9
#> sortday digest
#> <character> <character>
#> 190222S_LCA9_X_FP94BM_TMT1 s8 N
#> 190222S_LCA9_X_FP94BM_TMT2 s8 N
#> 190222S_LCA9_X_FP94BM_TMT3 s8 N
#> 190222S_LCA9_X_FP94BM_TMT4 s8 N
#> 190222S_LCA9_X_FP94BM_TMT5 s8 N
#> 190222S_LCA9_X_FP94BM_TMT6 s8 N
Notice that the sample names were suffixed with TMT indexes.
The feature metadata is retrieved from the rowData
. Since the
feature metadata is specific to each assay, we need to tell from which
assay we want to get the rowData
:
head(rowData(scp[["190222S_LCA9_X_FP94BM"]]))[, 1:5]
#> DataFrame with 6 rows and 5 columns
#> uid Sequence Length Modifications Modified.sequence
#> <character> <character> <integer> <character> <character>
#> PSM2 _(Acetyl (... ATNFLAHEK 9 Acetyl (Pr... _(Acetyl (...
#> PSM4 _(Acetyl (... SHTILLVQPT... 11 Acetyl (Pr... _(Acetyl (...
#> PSM6 _(Acetyl (... SLVIPEK 7 Acetyl (Pr... _(Acetyl (...
#> PSM9 _AAGLALK_ ... AAGLALK 7 Unmodified _AAGLALK_
#> PSM12 _AALSAGK_ ... AALSAGK 7 Unmodified _AALSAGK_
#> PSM15 _AAPEASGTP... AAPEASGTPS... 16 Unmodified _AAPEASGTP...
Finally, we can also retrieve the quantification matrix for an assay of interest:
head(assay(scp, "190222S_LCA9_X_FP94BM"))
#> 190222S_LCA9_X_FP94BM_TMT1 190222S_LCA9_X_FP94BM_TMT2
#> PSM2 58648.0 1099.80
#> PSM4 27347.0 405.90
#> PSM6 44895.0 700.23
#> PSM9 122070.0 1153.50
#> PSM12 58605.0 895.25
#> PSM15 5006.5 517.86
#> 190222S_LCA9_X_FP94BM_TMT3 190222S_LCA9_X_FP94BM_TMT4
#> PSM2 2837.70 494.32
#> PSM4 1525.20 0.00
#> PSM6 2283.00 1109.60
#> PSM9 7361.90 1732.30
#> PSM12 2763.80 867.82
#> PSM15 446.19 458.17
#> 190222S_LCA9_X_FP94BM_TMT5 190222S_LCA9_X_FP94BM_TMT6
#> PSM2 349.26 1030.50
#> PSM4 0.00 318.74
#> PSM6 0.00 675.79
#> PSM9 1515.60 2252.00
#> PSM12 1050.30 1268.70
#> PSM15 467.90 649.50
#> 190222S_LCA9_X_FP94BM_TMT7 190222S_LCA9_X_FP94BM_TMT8
#> PSM2 0.00 1214.10
#> PSM4 0.00 519.81
#> PSM6 0.00 809.38
#> PSM9 444.48 2343.80
#> PSM12 532.30 1073.10
#> PSM15 259.84 533.55
#> 190222S_LCA9_X_FP94BM_TMT9 190222S_LCA9_X_FP94BM_TMT10
#> PSM2 800.58 807.79
#> PSM4 0.00 507.23
#> PSM6 668.88 1467.50
#> PSM9 3100.20 1825.20
#> PSM12 911.30 1300.00
#> PSM15 393.53 463.26
#> 190222S_LCA9_X_FP94BM_TMT11
#> PSM2 391.38
#> PSM4 370.79
#> PSM6 901.38
#> PSM9 2372.50
#> PSM12 1185.90
#> PSM15 353.04
The scp
package is meant for both label-free and multiplexed SCP
data. Like in the example above, the label-free data should contain
the batch names in both the feature data and the sample data.
The sample data must also contain a column that points to the columns
of the feature data that contains the quantifications. Since
label-free SCP acquires one single-cell per run, this sample data
column should point the same column for all samples. Moreover, this
means that each PSM assay will contain a single column.
readSCP
should work with any PSM quantification table that is output
by a pre-processing software. For instance, you can easily import the
PSM tables generated by ProteomeDiscoverer. The batch names are
contained in the File ID
column (that should be supplied as the
batchCol
argument to readSCP
). The quantification columns are
contained in the columns starting with Abundance
, eventually
followed by a multiplexing tag name. These columns should be stored in
a dedicated column of the sample data to be supplied as channelCol
to readSCP
.
If your input cannot be loaded using the procedure described in this vignette, you can submit a feature request (see next section).
You can open an issue on the
GitHub repository in
case of troubles when loading your SCP data with readSCP
. Any
suggestion or feature request about the function or the documentation
are also warmly welcome.
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_3.3.5 scp_1.4.0
[3] QFeatures_1.4.0 MultiAssayExperiment_1.20.0
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
[9] IRanges_2.28.0 S4Vectors_0.32.0
[11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.61.0 BiocStyle_2.22.0
loaded via a namespace (and not attached):
[1] lattice_0.20-45 assertthat_0.2.1
[3] digest_0.6.28 SingleCellExperiment_1.16.0
[5] utf8_1.2.2 R6_2.5.1
[7] evaluate_0.14 highr_0.9
[9] pillar_1.6.4 zlibbioc_1.40.0
[11] rlang_0.4.12 lazyeval_0.2.2
[13] jquerylib_0.1.4 Matrix_1.3-4
[15] rmarkdown_2.11 labeling_0.4.2
[17] stringr_1.4.0 ProtGenerics_1.26.0
[19] munsell_0.5.0 igraph_1.2.7
[21] RCurl_1.98-1.5 DelayedArray_0.20.0
[23] compiler_4.1.1 xfun_0.27
[25] pkgconfig_2.0.3 htmltools_0.5.2
[27] tidyselect_1.1.1 tibble_3.1.5
[29] GenomeInfoDbData_1.2.7 bookdown_0.24
[31] fansi_0.5.0 withr_2.4.2
[33] crayon_1.4.1 dplyr_1.0.7
[35] MASS_7.3-54 bitops_1.0-7
[37] grid_4.1.1 gtable_0.3.0
[39] jsonlite_1.7.2 lifecycle_1.0.1
[41] DBI_1.1.1 AnnotationFilter_1.18.0
[43] magrittr_2.0.1 scales_1.1.1
[45] MsCoreUtils_1.6.0 impute_1.68.0
[47] stringi_1.7.5 farver_2.1.0
[49] XVector_0.34.0 bslib_0.3.1
[51] ellipsis_0.3.2 generics_0.1.1
[53] vctrs_0.3.8 tools_4.1.1
[55] glue_1.4.2 purrr_0.3.4
[57] fastmap_1.1.0 yaml_2.2.1
[59] colorspace_2.0-2 clue_0.3-60
[61] cluster_2.1.2 BiocManager_1.30.16
[63] knitr_1.36 sass_0.4.0
Amezquita, Robert A, Aaron T L Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Martini, et al. 2019. “Orchestrating Single-Cell Analysis with Bioconductor.” Nat. Methods, December, 1–9.
Gatto, Laurent. 2020. “QFeatures: Quantitative Features for Mass Spectrometry Data.”