Pre-processing of LC-MS data
Data import
xcms supports analysis of any LC-MS(/MS) data that can be imported with the
Spectra package. Such data will typically be provided in
(AIA/ANDI) NetCDF, mzXML and mzML format but can, through dedicated extensions
to the Spectra package, also be imported from other sources, e.g. also
directly from raw data files in manufacturer’s formats.
For demonstration purpose we will analyze in this document a small subset of the
data from [4] in which the metabolic consequences of the knock-out
of the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw
data files (in NetCDF format) are provided through the faahKO
data package. The data set consists of samples from the spinal cords of 6
knock-out and 6 wild-type mice. Each file contains data in centroid mode
acquired in positive ion polarity from 200-600 m/z and 2500-4500 seconds. To
speed-up processing of this vignette we will restrict the analysis to only 8
files.
Below we load all required packages, locate the raw CDF files within the
faahKO package and build a phenodata data.frame
describing the
experimental setup. Generally, such data frames should contain all relevant
experimental variables and sample descriptions (including also the names of the
raw data files) and will be imported into R using either the read.table
function (if the file is in csv or tabulator delimited text file format) or
also using functions from the readxl R package if it is in Excel file format.
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(pheatmap)
library(MsExperiment)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 4), rep("WT", 4)),
stringsAsFactors = FALSE)
We next load our data using the readMsExperiment
function from the
MsExperiment package.
faahko <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)
faahko
## Object of class MsExperiment
## Spectra: MS1 (10224)
## Experiment data: 8 sample(s)
## Sample data links:
## - spectra: 8 sample(s) to 10224 element(s).
The MS spectra data from our experiment is now available as a Spectra
object
within faahko
. Note that this MsExperiment
container could in addition to
spectra data also contain other types of data or also references to other
files. See the vignette from the MsExperiment for more
details. Also, when loading data from mzML, mzXML or CDF files, by default only
general spectra data is loaded into memory while the actual peaks data,
i.e. the m/z and intensity values are only retrieved on-the-fly from the raw
files when needed (this is similar to the MSnbase on-disk mode described in
[2]). This guarantees a low memory footprint
hence allowing to analyze also large experiments without the need of high
performance computing environments. Note that also different alternative
backends (and hence data representations) could be used for the Spectra
object within faahko
with eventually even lower memory footprint, or higher
performance. See the package vignette from the Spectra package or
the SpectraTutorials tutorial for
more details on Spectra
backends and how to change between them.
Initial data inspection
The MsExperiment
object is a simple and flexible container for MS
experiments. The raw MS data is stored as a Spectra
object that can be
accessed through the spectra
function.
spectra(faahko)
## MSn data (Spectra) with 10224 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 2501.38 1
## 2 1 2502.94 2
## 3 1 2504.51 3
## 4 1 2506.07 4
## 5 1 2507.64 5
## ... ... ... ...
## 10220 1 4493.56 1274
## 10221 1 4495.13 1275
## 10222 1 4496.69 1276
## 10223 1 4498.26 1277
## 10224 1 4499.82 1278
## ... 33 more variables/columns.
##
## file(s):
## ko15.CDF
## ko16.CDF
## ko21.CDF
## ... 5 more files
All spectra are organized sequentially (i.e., not by file) but the fromFile
function can be used to get for each spectrum the information to which of the
data files it belongs. Below we simply count the number of spectra per file.
table(fromFile(faahko))
##
## 1 2 3 4 5 6 7 8
## 1278 1278 1278 1278 1278 1278 1278 1278
Information on samples can be retrieved through the sampleData
function.
sampleData(faahko)
## DataFrame with 8 rows and 3 columns
## sample_name sample_group spectraOrigin
## <character> <character> <character>
## 1 ko15 KO /home/bioc...
## 2 ko16 KO /home/bioc...
## 3 ko21 KO /home/bioc...
## 4 ko22 KO /home/bioc...
## 5 wt15 WT /home/bioc...
## 6 wt16 WT /home/bioc...
## 7 wt21 WT /home/bioc...
## 8 wt22 WT /home/bioc...
Each row in this DataFrame
represents one sample (input file). Using [
it is
possible to subset a MsExperiment
object by sample. Below we subset the
faahko
to the 3rd sample (file) and access its spectra and sample data.
faahko_3 <- faahko[3]
spectra(faahko_3)
## MSn data (Spectra) with 1278 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 2501.38 1
## 2 1 2502.94 2
## 3 1 2504.51 3
## 4 1 2506.07 4
## 5 1 2507.64 5
## ... ... ... ...
## 1274 1 4493.56 1274
## 1275 1 4495.13 1275
## 1276 1 4496.69 1276
## 1277 1 4498.26 1277
## 1278 1 4499.82 1278
## ... 33 more variables/columns.
##
## file(s):
## ko21.CDF
sampleData(faahko_3)
## DataFrame with 1 row and 3 columns
## sample_name sample_group spectraOrigin
## <character> <character> <character>
## 1 ko21 KO /home/bioc...
As a first evaluation of the data we below plot the base peak chromatogram (BPC)
for each file in our experiment. We use the chromatogram
method and set the
aggregationFun
to "max"
to return for each spectrum the maximal intensity
and hence create the BPC from the raw data. To create a total ion chromatogram
we could set aggregationFun
to "sum"
.
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(faahko, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[sampleData(faahko)$sample_group])
The chromatogram
method returned a MChromatograms
object that organizes
individual Chromatogram
objects (which in fact contain the chromatographic
data) in a two-dimensional array: columns represent samples and rows
(optionally) m/z and/or retention time ranges. Below we extract the chromatogram
of the first sample and access its retention time and intensity values.
bpi_1 <- bpis[1, 1]
rtime(bpi_1) |> head()
## [1] 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203
intensity(bpi_1) |> head()
## [1] 43888 43960 43392 42632 42200 42288
From the BPC above it seems that after around 4200 seconds no signal is measured
anymore. Thus, we filter below the full data set to a retention time range from
2550 to 4250 seconds using the filterRt
function. Note that at present this
will only subset the spectra within the MsExperiment
. Subsequently we
re-create also the BPC.
faahko <- filterRt(faahko, rt = c(2550, 4250))
## Filter spectra
## creating the BPC on the subsetted data
bpis <- chromatogram(faahko, aggregationFun = "max")
We next create boxplots representing the distribution of the total ion currents
per data file. Such plots can be very useful to spot potentially problematic MS
runs. To extract this information, we use the tic
function on the Spectra
object within faahko
and split the values by file using fromFile
.
## Get the total ion current by file
tc <- spectra(faahko) |>
tic() |>
split(f = fromFile(faahko))
boxplot(tc, col = group_colors[sampleData(faahko)$sample_group],
ylab = "intensity", main = "Total ion current")
In addition, we can also cluster the samples based on similarity of their base
peak chromatograms. Samples would thus be grouped based on similarity of their
LC runs. For that we need however to bin the data along the retention time
axis, since retention times will generally differ between samples. Below we use
the bin
function on the BPC to bin intensities into 2 second wide retention
time bins. The clustering is then performed using complete linkage hierarchical
clustering on the pairwise correlations of the binned base peak chromatograms.
## Bin the BPC
bpis_bin <- bin(bpis, binSize = 2)
## Calculate correlation on the log2 transformed base peak intensities
cormat <- cor(log2(do.call(cbind, lapply(bpis_bin, intensity))))
colnames(cormat) <- rownames(cormat) <- bpis_bin$sample_name
## Define which phenodata columns should be highlighted in the plot
ann <- data.frame(group = bpis_bin$sample_group)
rownames(ann) <- bpis_bin$sample_name
## Perform the cluster analysis
pheatmap(cormat, annotation = ann,
annotation_color = list(group = group_colors))
The samples cluster in a pairwise manner, with the KO and WT samples for the
same sample index having the most similar BPC.
Chromatographic peak detection
Chromatographic peak detection aims at identifying all signal in each sample
created from ions of the same originating compound species. Chromatographic peak
detection can be performed in xcms with the findChromPeaks
function and a
parameter object which defines and configures the algorithm that should be
used (see ?findChromPeaks
for a list of supported algorithms). Before running
any peak detection it is however strongly suggested to first visually inspect
the extracted ion chromatogram of e.g. internal standards or compounds known to
be present in the samples in order to evaluate and adapt the settings of the
peak detection algorithm since the default settings will not be appropriate for
most LC-MS setups.
Below we extract the EIC for one compound using the chromatogram
function by
specifying in addition the m/z and retention time range where we would expect
the signal for that compound.
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(faahko, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
Note that Chromatogram
objects extracted by the chromatogram
method contain
an NA
value if in a certain scan (i.e. in a spectrum for a specific retention
time) no signal was measured in the respective m/z range. This is reflected by
the lines not being drawn as continuous lines in the plot above.
The peak above has thus a width of about 50 seconds. We can use this information
to define the peakwidth
parameter of the centWave peak detection method
[5] that we will use for chromatographic peak detection on our
data. The peakwidth
parameter allows to define the expected lower and upper
width (in retention time dimension) of chromatographic peaks. For our data we
set it thus to peakwidth = c(20, 80)
. The second important parameter for
centWave is ppm
which defines the expected maximum deviation of m/z values
of the centroids that should be included into one chromatographic peak (note
that this is not the precision of m/z values provided by the MS instrument
manufacturer).
For the ppm
parameter we extract the full MS data (intensity, retention time
and m/z values) corresponding to the above peak. To this end we first filter the
raw object by retention time, then by m/z and finally plot the object with type = "XIC"
to produce the plot below. We use the pipe (|>
) operator to better
illustrate the corresponding workflow.
faahko |>
filterRt(rt = rtr) |>
filterMz(mz = mzr) |>
plot(type = "XIC")
In the present data there is actually no variation in the m/z values. Usually
the m/z values of the individual centroids (lower panel) in the plots above
would scatter around the real m/z value of the compound (in an intensity
dependent manner).
The first step of the centWave algorithm defines regions of interest (ROI)
that are subsequently screened for the presence of chromatographic peaks. These
ROIs are defined based on the difference of m/z values of centroids from
consecutive scans (spectra). In detail, centroids from consecutive scans are
included into a ROI if the difference between their m/z and the mean m/z of the
ROI is smaller than the user defined ppm
parameter. A reasonable choice for
the ppm
could thus be the maximal m/z difference of data points from
neighboring scans/spectra that are part of a chromatographic peak for an
internal standard of known compound. It is suggested to inspect the ranges of
m/z values for several compounds (either internal standards or compounds known
to be present in the sample) and define the ppm
parameter for centWave
according to these. See also this
tutorial for additional
information and examples on choosing and testing peak detection settings.
Chromatographic peak detection can also be performed on extracted ion
chromatograms, which helps testing and tuning peak detection settings. Note
however that peak detection on EICs does not involve the first step of
centWave described above and will thus not consider the ppm
parameter. Also, since less data is available to the algorithms, background
signal estimation is performed differently and different settings for snthresh
will need to be used (generally a lower snthresh
will be used for EICs since
the estimated background signal tends to be higher for data subsets than for the
full data). Below we perform the peak detection with the findChromPeaks
function on the EIC generated above. The submitted parameter object defines
which algorithm will be used and allows to define the settings for this
algorithm. We use a CentWaveParam
parameter object to use and configure the
centWave algorithm with default settings, except for snthresh
.
xchr <- findChromPeaks(chr_raw, param = CentWaveParam(snthresh = 2))
We can access the identified chromatographic peaks with the chromPeaks
function.
chromPeaks(xchr)
## rt rtmin rtmax into intb maxo sn row column
## [1,] 2781.505 2761.160 2809.674 412134.255 355516.374 16856 13 1 1
## [2,] 2786.199 2764.290 2812.803 1496244.206 1391821.332 58736 20 1 2
## [3,] 2734.556 2714.211 2765.855 21579.367 18449.428 899 4 1 3
## [4,] 2797.154 2775.245 2815.933 159058.782 150289.314 6295 12 1 3
## [5,] 2784.635 2761.160 2808.109 54947.545 37923.532 2715 2 1 4
## [6,] 2859.752 2845.668 2878.532 13895.212 13874.868 905 904 1 4
## [7,] 2897.311 2891.051 2898.876 5457.155 5450.895 995 994 1 4
## [8,] 2819.064 2808.109 2834.713 19456.914 19438.134 1347 1576 1 4
## [9,] 2789.329 2762.725 2808.109 174473.311 91114.975 8325 3 1 5
## [10,] 2786.199 2764.290 2812.803 932645.211 569236.246 35856 2 1 6
## [11,] 2792.461 2768.987 2823.760 876585.527 511324.134 27200 2 1 7
## [12,] 2789.329 2773.680 2806.544 89582.569 73871.386 5427 6 1 8
Parallel to the chromPeaks
matrix there is also a chromPeakData
data frame
that allows to add arbitrary annotations to each chromatographic peak, such as
e.g. the MS level in which the peak was detected:
chromPeakData(xchr)
## DataFrame with 12 rows and 4 columns
## ms_level is_filled row column
## <integer> <logical> <integer> <integer>
## 1 1 FALSE 1 1
## 2 1 FALSE 1 2
## 3 1 FALSE 1 3
## 4 1 FALSE 1 3
## 5 1 FALSE 1 4
## ... ... ... ... ...
## 8 1 FALSE 1 4
## 9 1 FALSE 1 5
## 10 1 FALSE 1 6
## 11 1 FALSE 1 7
## 12 1 FALSE 1 8
Below we plot the EIC along with all identified chromatographic peaks using the
plot
function on the result object from above. Additional parameters peakCol
and peakBg
allow to define a foreground and background (fill) color for each
identified chromatographic peak in the provided result object (i.e., we need to
define one color for each row of chromPeaks(xchr)
- column "column"
(or
"sample"
if present) in that peak matrix specifies the sample in which the
peak was identified).
## Define a color for each sample
sample_colors <- group_colors[xchr$sample_group]
## Define the background color for each chromatographic peak
bg <- sample_colors[chromPeaks(xchr)[, "column"]]
## Parameter `col` defines the color of each sample/line, `peakBg` of each
## chromatographic peak.
plot(xchr, col = sample_colors, peakBg = bg)
If we are happy with the settings we can use them for the peak detection on the
full data set (i.e. on the MsExperiment
object with the data for the whole
experiment). Note that below we set the argument prefilter
to c(6, 5000)
and
noise
to 5000
to reduce the run time of this vignette. With this setting we
consider only ROIs with at least 6 centroids with an intensity larger than 5000
for the centWave chromatographic peak detection.
cwp <- CentWaveParam(peakwidth = c(20, 80), noise = 5000,
prefilter = c(6, 5000))
faahko <- findChromPeaks(faahko, param = cwp)
The results of findChromPeaks
on a MsExperiment
object are returned as an
XcmsExperiment
object. This object extends MsExperiment
directly (hence
providing the same access to all raw data) and contains all xcms
pre-processing results. Note also that additional rounds of chromatographic peak
detections could be performed and their results being added to existing peak
detection results by additional calls to findChromPeaks
on the result object
and using parameter add = TRUE
.
The chromPeaks
function can also here be used to access the results from the
chromatographic peak detection. Below we show the first 6 identified
chromatographic peaks.
chromPeaks(faahko) |> head()
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## CP0001 594.0 594.0 594.0 2601.535 2581.191 2637.529 161042.2 146073.3 7850 11
## CP0002 577.0 577.0 577.0 2604.665 2581.191 2626.574 136105.2 128067.9 6215 11
## CP0003 307.0 307.0 307.0 2618.750 2592.145 2645.354 284782.4 264907.0 16872 20
## CP0004 302.0 302.0 302.0 2617.185 2595.275 2640.659 687146.6 669778.1 30552 43
## CP0005 370.1 370.1 370.1 2673.523 2643.789 2700.127 449284.6 417225.3 25672 17
## CP0006 427.0 427.0 427.0 2675.088 2643.789 2684.478 283334.7 263943.2 11025 13
## sample
## CP0001 1
## CP0002 1
## CP0003 1
## CP0004 1
## CP0005 1
## CP0006 1
Columns of this chromPeaks
matrix might differ depending on the used peak
detection algorithm. Columns that all algorithms have to provide are: "mz"
,
"mzmin"
, "mzmax"
, "rt"
, "rtmin"
and "rtmax"
that define the m/z and
retention time range of the chromatographic peak (i.e. all mass peaks within
that area are used to integrate the peak signal) as well as the m/z and
retention time of the peak apex. Other mandatory columns are "into"
and
"maxo"
with the absolute integrated peak signal and the maximum peak
intensity. Finally, "sample"
provides the index of the sample in which the
peak was identified.
Additional annotations for each individual peak can be extracted with the
chromPeakData
function. This data frame could also be used to add/store
arbitrary annotations for each detected peak (that don’t necessarily need to be
numeric).
chromPeakData(faahko)
## DataFrame with 2908 rows and 2 columns
## ms_level is_filled
## <integer> <logical>
## CP0001 1 FALSE
## CP0002 1 FALSE
## CP0003 1 FALSE
## CP0004 1 FALSE
## CP0005 1 FALSE
## ... ... ...
## CP2904 1 FALSE
## CP2905 1 FALSE
## CP2906 1 FALSE
## CP2907 1 FALSE
## CP2908 1 FALSE
Peak detection will not always work perfectly for all types of peak shapes
present in the data set leading to peak detection artifacts, such as (partially
or completely) overlapping peaks or artificially split peaks (common issues
especially for centWave). xcms provides the refineChromPeaks
function that
can be called on peak detection results in order to refine (or clean) peak
detection results by either removing identified peaks not passing a certain
criteria or by merging artificially split or partially or completely overlapping
chromatographic peaks. Different algorithms are available that can again be
configured with their respective parameter objects: CleanPeaksParam
and
FilterIntensityParam
allow to remove peaks with their retention time range or
intensity being below a specified threshold, respectively. With
MergeNeighboringPeaksParam
it is possible to merge chromatographic peaks and
hence remove many of the above mentioned (centWave) peak detection
artifacts. See also ?refineChromPeaks
for more information and help on the
different methods.
Below we post-process the peak detection results merging peaks that overlap in a
4 second window per file and for which the signal between them is lower than 75%
of the smaller peak’s maximal intensity. See the ?MergeNeighboringPeaksParam
help page for a detailed description of the settings and the approach.
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
faahko_pp <- refineChromPeaks(faahko, mpp)
An example for a merged peak is given below.
mzr_1 <- 305.1 + c(-0.01, 0.01)
chr_1 <- chromatogram(faahko[1], mz = mzr_1)
## Processing chromatographic peaks
chr_2 <- chromatogram(faahko_pp[1], mz = mzr_1)
## Processing chromatographic peaks
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
centWave thus detected originally 3 chromatographic peaks in the m/z slice
(left panel in the plot above) and peak refinement with
MergeNeighboringPeaksParam
and the specified settings merged the first two
peaks into a single one (right panel in the figure above). Other close peaks,
with a lower intensity between them, were however not merged (see below).
mzr_1 <- 496.2 + c(-0.01, 0.01)
chr_1 <- chromatogram(faahko[1], mz = mzr_1)
## Processing chromatographic peaks
chr_2 <- chromatogram(faahko_pp[1], mz = mzr_1)
## Processing chromatographic peaks
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
It is also possible to perform the peak refinement on extracted ion
chromatograms. This could again be used to test and fine-tune the settings for
the parameter. To illustrate this we perform below a peak refinement on the
extracted ion chromatogram chr_1
reducing the minProp
parameter to force
joining the two peaks.
res <- refineChromPeaks(chr_1, MergeNeighboringPeaksParam(minProp = 0.05))
chromPeaks(res)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## CPM1 496.2 496.19 496.21 3384.012 3294.809 3412.181 45940118 NA 1128960 177
## sample row column
## CPM1 1 1 1
plot(res)
Before proceeding we next replace the faahko
object with the results from the
peak refinement step.
faahko <- faahko_pp
Below we use the data from the chromPeaks
matrix to calculate per-file
summaries of the peak detection results, such as the number of peaks per file as
well as the distribution of the retention time widths.
summary_fun <- function(z)
c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
T <- chromPeaks(faahko) |>
split.data.frame(f = chromPeaks(faahko)[, "sample"]) |>
lapply(FUN = summary_fun) |>
do.call(what = rbind)
rownames(T) <- basename(fileNames(faahko))
pandoc.table(
T,
caption = paste0("Summary statistics on identified chromatographic",
" peaks. Shown are number of identified peaks per",
" sample and widths/duration of chromatographic ",
"peaks."))
While by default chromPeaks
will return all identified chromatographic peaks
in a result object it is also possible to extract only chromatographic peaks for
a specified m/z and/or rt range:
chromPeaks(faahko, mz = c(334.9, 335.1), rt = c(2700, 2900))
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## CP0038 335 335 335 2781.505 2761.160 2809.674 412134.3 383167.4 16856 23
## CP0494 335 335 335 2786.199 2764.290 2812.803 1496244.2 1461187.3 58736 72
## CP1025 335 335 335 2797.154 2775.245 2815.933 159058.8 149229.6 6295 13
## CP1964 335 335 335 2786.199 2764.290 2812.803 932645.2 915333.8 35856 66
## CP2349 335 335 335 2792.461 2768.987 2823.760 876585.5 848569.1 27200 36
## sample
## CP0038 1
## CP0494 2
## CP1025 3
## CP1964 6
## CP2349 7
We can also plot the location of the identified chromatographic peaks in the
m/z - retention time space for one file using the plotChromPeaks
function. Below we plot this information for the third sample.
plotChromPeaks(faahko, file = 3)
As a general overview of the peak detection results we can in addition visualize
the number of identified chromatographic peaks per file along the retention time
axis. Parameter binSize
allows to define the width of the bins in rt dimension
in which peaks should be counted. This number of chromatographic peaks within
each bin is then shown color-coded in the resulting plot.
plotChromPeakImage(faahko, binSize = 10)
Note that extracting ion chromatorams from an xcms result object will in
addition to the chromatographic data also extract any identified chromatographic
peaks within that region. This can thus also be used to validate and verify that
the used peak detection settings identified e.g. peaks for known compounds or
internal standards properly. Below we extract the ion chromatogram for the m/z -
rt region above and access the detected peaks in that region using the
chromPeaks
function.
chr_ex <- chromatogram(faahko, mz = mzr, rt = rtr)
chromPeaks(chr_ex)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## CP0038 335 335 335 2781.505 2761.160 2809.674 412134.3 383167.4 16856 23
## CP0494 335 335 335 2786.199 2764.290 2812.803 1496244.2 1461187.3 58736 72
## CP1025 335 335 335 2797.154 2775.245 2815.933 159058.8 149229.6 6295 13
## CP1964 335 335 335 2786.199 2764.290 2812.803 932645.2 915333.8 35856 66
## CP2349 335 335 335 2792.461 2768.987 2823.760 876585.5 848569.1 27200 36
## sample row column
## CP0038 1 1 1
## CP0494 2 1 2
## CP1025 3 1 3
## CP1964 6 1 6
## CP2349 7 1 7
We can also plot this extracted ion chromatogram which will also visualize all
identified chromatographic peaks in that region.
sample_colors <- group_colors[chr_ex$sample_group]
plot(chr_ex, col = group_colors[chr_raw$sample_group], lwd = 2,
peakBg = sample_colors[chromPeaks(chr_ex)[, "sample"]])
Chromatographic peaks can be visualized also in other ways: by setting peakType = "rectangle"
the they are indicated with a rectangle instead of the default
highlighting option (peakType = "polygon"
) used above. As a third alternative
it would also possible to just indicate the apex position for each identified
chromatographic peak with a single point (peakType = "point"
). Below we plot
the data again using peakType = "rectangle"
.
plot(chr_ex, col = sample_colors, peakType = "rectangle",
peakCol = sample_colors[chromPeaks(chr_ex)[, "sample"]],
peakBg = NA)
Finally we plot also the distribution of peak intensity per sample. This allows
to investigate whether systematic differences in peak signals between samples
are present.
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(faahko)[, "into"]),
f = chromPeaks(faahko)[, "sample"])
boxplot(ints, varwidth = TRUE, col = sample_colors,
ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
Over and above the signal of the identified chromatographic peaks is comparable
across samples, with the exception of samples 3, 4 and, to some degree, also 7
that show slightly higher average intensities, but also a lower number of
detected peaks (indicated by the smaller width of the boxes).
Note that in addition to the above described identification of chromatographic
peaks, it is also possible to manually define and add chromatographic peaks
with the manualChromPeaks
function (see ?manualChromPeaks
help page for more
information).
Alignment
The time at which analytes elute in the chromatography can vary between samples
(and even compounds). Such differences were already visible in the BPC and even
the extracted ion chromatogram plot in the previous section. The alignment
step, also referred to as retention time correction, aims to adjust these
differences by shifting signals along the retention time axis and aligning them
between different samples within an experiment.
A plethora of alignment algorithms exist (see [6]), with some of
them being also implemented in xcms. Alignment of LC-MS data can be performed
in xcms using the adjustRtime
method and an algorithm-specific parameter
class (see ?adjustRtime
for an overview of available methods in xcms).
In the example below we use the obiwarp method [7] to align the
samples. We use a binSize = 0.6
which creates warping functions in m/z bins of
0.6. Also here it is advisable to modify and adapt the settings for each
experiment.
faahko <- adjustRtime(faahko, param = ObiwarpParam(binSize = 0.6))
Note that adjustRtime
, besides calculating adjusted retention times for each
spectrum, adjusts also the retention times of the identified chromatographic
peaks in the xcms result object. Adjusted retention times of individual
spectra can be extracted from the result object using either the adjustedRtime
function or using rtime
with parameter adjusted = TRUE
(the default):
## Extract adjusted retention times
adjustedRtime(faahko) |> head()
## [1] 2551.457 2553.089 2554.720 2556.352 2557.983 2559.615
## Or simply use the rtime method
rtime(faahko) |> head()
## [1] 2551.457 2553.089 2554.720 2556.352 2557.983 2559.615
## Get raw (unadjusted) retention times
rtime(faahko, adjusted = FALSE) |> head()
## [1] 2551.457 2553.022 2554.586 2556.151 2557.716 2559.281
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In
addition we plot also the differences between the adjusted and the raw retention
times per sample using the plotAdjustedRtime
function. To disable the
automatic extraction of all identified chromatographic peaks by the
chromatogram
function (which would not make much sense for a BPC) we use
chromPeaks = "none"
below.
## Get the base peak chromatograms.
bpis_adj <- chromatogram(faahko, aggregationFun = "max", chromPeaks = "none")
par(mfrow = c(3, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis, col = sample_colors)
grid()
plot(bpis_adj, col = sample_colors)
grid()
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(faahko, col = sample_colors)
grid()
Too large differences between adjusted and raw retention times could indicate
poorly performing samples or alignment.
At last we evaluate also the impact of the alignment on the test peak.
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = sample_colors)
grid()
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(faahko, rt = rtr, mz = mzr)
plot(chr_adj, col = sample_colors, peakType = "none")
grid()
Note: xcms result objects (XcmsExperiment
as well as XCMSnExp
) contain
both the raw and the adjusted retention times for all spectra and subset
operation will in many cases drop adjusted retention times. Thus it might
sometimes be useful to immediately replace the raw retention times in the
data using the applyAdjustedRtime
function.
Subset-based alignment
For some experiments it might be better to perform the alignment based on only a
subset of the available samples, e.g. if pooled QC samples were injected at
regular intervals or if the experiment contains blanks. All alignment methods in
xcms support such a subset-based alignment in which retention time shifts are
estimated on only a specified subset of samples followed by an alignment of the
whole data set based on the aligned subset.
The subset of samples for such an alignment can be specified with the parameter
subset
of the PeakGroupsParam
or ObiwarpParam
object. Parameter
subsetAdjust
allows to specify the method by which the left-out samples
will be adjusted. There are currently two options available:
subsetAdjust = "previous"
: adjust the retention times of a non-subset
sample based on the alignment results of the previous subset sample (e.g. a
QC sample). If samples are e.g. in the order A1, B1, B2, A2, B3,
B4 with A representing QC samples and B study samples, using
subset = c(1, 4)
and subsetAdjust = "previous"
would result in all A
samples to be aligned with each other and non-subset samples B1 and B2
being adjusted based on the alignment result of subset samples A1 and B3
and B4 on those of A2.
subsetAdjust = "average"
: adjust retention times of non-subset samples based
on an interpolation of the alignment results of the previous and subsequent
subset sample. In the example above, B1 would be adjusted based on the
average of adjusted retention times between subset (QC) samples A1 and
A2. Since there is no subset sample after non-subset samples B3 and B4
these will be adjusted based on the alignment results of A2 alone. Note
that a weighted average is used to calculate the adjusted retention time
averages, which uses the inverse of the difference of the index of the
non-subset sample to the subset samples as weights. Thus, if we have a
setup like A1, B1, B2, A2 the adjusted retention times of A1
would get a larger weight than those of A2 in the adjustment of
non-subset sample B1 causing it’s adjusted retention times to be closer
to those of A1 than to A2. See below for examples.
Important: both cases require a meaningful/correct ordering of the samples
within the object (i.e., samples should be ordered by injection index).
The examples below aim to illustrate the effect of these alignment options. We
assume samples 1, 4 and 7 in the faahKO data set to be QC pool samples. We
thus want to perform the alignment based on these samples and subsequently
adjust the retention times of the left-out samples (2, 3, 5, 6 and 8) based on
interpolation of the results from the neighboring subset (QC) samples. After
initial peak grouping we perform the subset-based alignment with the peak
groups method by passing the indices of the QC samples with the subset
parameter to the PeakGroupsParam
function and set subsetAdjust = "average"
to adjust the study samples based on interpolation of the alignment results from
neighboring subset/QC samples.
Note that for any subset-alignment all parameters such as minFraction
are
relative to the subset
, not the full experiment!
Below we first remove any previous alignment results with the
dropAdjustedRtime
function to allow a fresh alignment using the subset-based
option outlined above. In addition to removing adjusted retention times for all
spectra, this function will also restore the original retention times for
identified chromatographic peaks.
faahko <- dropAdjustedRtime(faahko)
## Define the experimental layout
sampleData(faahko)$sample_type <- "study"
sampleData(faahko)$sample_type[c(1, 4, 7)] <- "QC"
For an alignment with the peak groups method an initial peak grouping
(correspondence) analysis is required, because the algorithm estimates retention
times shifts between samples using the retention times of hook peaks
(i.e. chromatographic peaks present in most/all samples). Here we use the
default settings for an peak density method-based correspondence, but it is
strongly advised to adapt the parameters for each data set (details in the next
section). The definition of the sample groups (i.e. assignment of individual
samples to the sample groups in the experiment) is mandatory for the
PeakDensityParam
. If there are no sample groups in the experiment,
sampleGroups
should be set to a single value for each file (e.g. rep(1, length(fileNames(faahko))
).
## Initial peak grouping. Use sample_type as grouping variable
pdp_subs <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_type,
minFraction = 0.9)
faahko <- groupChromPeaks(faahko, param = pdp_subs)
## Define subset-alignment options and perform the alignment
pgp_subs <- PeakGroupsParam(
minFraction = 0.85,
subset = which(sampleData(faahko)$sample_type == "QC"),
subsetAdjust = "average", span = 0.4)
faahko <- adjustRtime(faahko, param = pgp_subs)
Below we plot the results of the alignment highlighting the subset samples in
green. This nicely shows how the interpolation of the subsetAdjust = "average"
works: retention times of sample 2 are adjusted based on those from subset
sample 1 and 4, giving however more weight to the closer subset sample 1 which
results in the adjusted retention times of 2 being more similar to those of
sample 1. Sample 3 on the other hand gets adjusted giving more weight to the
second subset sample (4).
clrs <- rep("#00000040", 8)
clrs[sampleData(faahko)$sample_type == "QC"] <- c("#00ce0080")
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(faahko, aggregationFun = "max", chromPeaks = "none"),
col = clrs)
grid()
plotAdjustedRtime(faahko, col = clrs, peakGroupsPch = 1,
peakGroupsCol = "#00ce0040")
grid()
Option subsetAdjust = "previous"
would adjust the retention times of a
non-subset sample based on a single subset sample (the previous), which results
in most cases in the adjusted retention times of the non-subset sample being
highly similar to those of the subset sample which was used for adjustment.
Correspondence
Correspondence is usually the final step in LC-MS data pre-processing in which
data, presumably representing signal from the same originating ions, is matched
across samples. As a result, chromatographic peaks from different samples with
similar m/z and retention times get grouped into LC-MS features. The function
to perform the correspondence in xcms is called groupChromPeaks
that again
supports different algorithms which can be selected and configured with a
specific parameter object (see ?groupChromPeaks
for an overview). For our
example we will use the peak density method [1] that, within small
slices along the m/z dimension, combines chromatographic peaks depending on the
density of these peaks along the retention time axis. To illustrate this, we
simulate below the peak grouping for an m/z slice containing multiple
chromatoghaphic peaks within each sample using the plotChromPeakDensity
function and a PeakDensityParam
object with parameter minFraction = 0.4
(features are only defined if in at least 40% of samples a chromatographic peak
was present) - parameter sampleGroups
is used to define to which sample group
each sample belongs.
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(faahko, mz = mzr)
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_group,
minFraction = 0.4, bw = 30)
plotChromPeakDensity(chr_mzr, col = sample_colors, param = pdp,
peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakPch = 16)
The upper panel in the plot shows the extracted ion chromatogram for each sample
with the detected peaks highlighted. The retention times for the individual
chromatographic peaks in each sample (y-axis being the index of the sample in
the data set) are shown in the lower panel with the solid black line
representing the density estimate for the distribution of detected peaks along
the retention time. Parameter bw
defines the smoothness of this
estimation. The grey rectangles indicate which chromatographic peaks would be
grouped into a feature (each grey rectangle thus representing one feature). This
type of visualization is thus ideal to test, validate and optimize
correspondence settings on manually defined m/z slices before applying them to
the full data set. For the tested m/z slice the settings seemed to be OK and we
are thus applying them to the full data set below. Especially the parameter bw
will be very data set dependent (or more specifically LC-dependent) and should
be adapted to each data set. See the Metabolomics pre-processing with
xcms
tutorial for examples and
more details.
## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_group,
minFraction = 0.4, bw = 30)
faahko <- groupChromPeaks(faahko, param = pdp)
Results from the correspondence analysis can be accessed with the
featureDefinitions
and featureValues
function. The former returns a data
frame with general information on each of the defined features, with each row
being one feature and columns providing information on the median m/z and
retention time as well as the indices of the chromatographic peaks assigned to
the feature in column "peakidx"
. Below we show the information on the first 6
features.
featureDefinitions(faahko) |> head()
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks KO WT peakidx
## FT001 200.1 200.1 200.1 2902.177 2882.133 2922.222 2 2 0 458, 1161
## FT002 205.0 205.0 205.0 2789.457 2782.376 2795.823 8 4 4 44, 443,....
## FT003 206.0 206.0 206.0 2788.770 2780.807 2793.606 7 3 4 29, 430,....
## FT004 207.1 207.1 207.1 2718.476 2713.314 2726.487 7 4 3 16, 420,....
## FT005 233.0 233.0 233.1 3023.353 3014.971 3043.942 7 3 4 69, 959,....
## FT006 241.1 241.1 241.2 3680.200 3661.067 3695.533 8 3 4 276, 284....
## ms_level
## FT001 1
## FT002 1
## FT003 1
## FT004 1
## FT005 1
## FT006 1
The featureValues
function returns a matrix
with rows being features and
columns samples. The content of this matrix can be defined using the value
argument which can be any column name in the chromPeaks
matrix. With the
default value = "into"
a matrix with the integrated signal of the peaks
corresponding to a feature in a sample are returned. This is then generally used
as the intensity matrix for downstream analysis. Below we extract the
intensities for the first 6 features.
featureValues(faahko, value = "into") |> head()
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF
## FT001 NA 506848.9 NA 169955.6 NA NA NA
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003 213659.3 289500.7 NA 178285.7 253825.6 241844.4 240606.0
## FT004 349011.5 451863.7 343897.8 208002.8 364609.8 360908.9 NA
## FT005 286221.4 NA 164009.0 149097.6 255697.7 311296.8 366441.5
## FT006 1160580.5 NA 380970.3 588986.4 1286883.0 1739516.6 639755.3
## wt22.CDF
## FT001 NA
## FT002 1354004.9
## FT003 185399.5
## FT004 221937.5
## FT005 271128.0
## FT006 508546.4
As we can see we have several missing values in this feature matrix. Missing
values are reported if in one sample no chromatographic peak was detected in the
m/z - rt region of the feature. This does however not necessarily mean that
there is no signal for that specific ion in that sample. The chromatographic
peak detection algorithm could also just have failed to identify any peak in
that region, e.g. because the signal was too noisy or too low. Thus it is
advisable to perform, after correspondence, also a gap-filling (see next
section).
The performance of peak detection, alignment and correspondence should always be
evaluated by inspecting extracted ion chromatograms e.g. of known compounds,
internal standards or identified features in general. The featureChromatograms
function allows to extract chromatograms for each feature present in
featureDefinitions
. The returned MChromatograms
object contains an ion
chromatogram for each feature (each row containing the data for one feature) and
sample (each column representing containing data for one sample). Parameter
features
allows to define specific features for which the EIC should be
returned. These can be specified with their index or their ID (i.e. their row
name in the featureDefinitions
data frame. If features
is not defined, EICs
are returned for all features in a data set, which can take also a
considerable amount of time. Below we extract the chromatograms for the first 4
features.
feature_chroms <- featureChromatograms(faahko, features = 1:4)
feature_chroms
## XChromatograms with 4 rows and 8 columns
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF
## <XChromatogram> <XChromatogram> <XChromatogram> <XChromatogram>
## [1,] peaks: 0 peaks: 1 peaks: 0 peaks: 1
## [2,] peaks: 1 peaks: 1 peaks: 1 peaks: 1
## [3,] peaks: 1 peaks: 1 peaks: 0 peaks: 1
## [4,] peaks: 1 peaks: 1 peaks: 1 peaks: 1
## wt15.CDF wt16.CDF wt21.CDF wt22.CDF
## <XChromatogram> <XChromatogram> <XChromatogram> <XChromatogram>
## [1,] peaks: 0 peaks: 0 peaks: 0 peaks: 0
## [2,] peaks: 1 peaks: 1 peaks: 1 peaks: 1
## [3,] peaks: 1 peaks: 1 peaks: 1 peaks: 1
## [4,] peaks: 1 peaks: 1 peaks: 0 peaks: 1
## phenoData with 4 variables
## featureData with 4 variables
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: centWave
## Correspondence:
## method: chromatographic peak density
## 4 feature(s) identified.
And plot the extracted ion chromatograms. We again use the group color for each
identified peak to fill the area.
plot(feature_chroms, col = sample_colors,
peakBg = sample_colors[chromPeaks(feature_chroms)[, "sample"]])
To access the EICs of the second feature we can simply subset the
feature_chroms
object.
eic_2 <- feature_chroms[2, ]
chromPeaks(eic_2)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn
## CP0048 205 205 205 2791.365 2770.790 2815.117 1924712 1850331 84280 64
## CP0495 205 205 205 2795.185 2773.068 2820.462 1757151 1711473 68384 69
## CP1033 205 205 205 2795.823 2773.753 2821.042 1383417 1334570 47384 54
## CP1255 205 205 205 2788.583 2768.133 2812.172 1180288 1126958 48336 32
## CP1535 205 205 205 2782.376 2761.974 2805.911 2129885 2054677 93312 44
## CP1967 205 205 205 2787.149 2766.797 2812.193 1634342 1566379 67984 53
## CP2350 205 205 205 2790.332 2763.780 2821.562 1623589 1531573 49208 28
## CP2601 205 205 205 2787.209 2766.905 2812.192 1354005 1299188 55712 35
## sample row column
## CP0048 1 1 1
## CP0495 2 1 2
## CP1033 3 1 3
## CP1255 4 1 4
## CP1535 5 1 5
## CP1967 6 1 6
## CP2350 7 1 7
## CP2601 8 1 8
Gap filling
Missing values in LC-MS data are in many cases the result of the chromatographic
peak detection algorithm failing to identify peaks (because of noisy or low
intensity signal). The aim of the gap filling step is to reduce the number of
such missing values by integrating signals from the original data files for
samples in which no chromatographic peak was found from the m/z - rt region
where signal from the ion is expected. Gap filling can be performed in xcms
with the fillChromPeaks
function and a parameter object selecting and
configuring the gap filling algorithm. The method of choice is
ChromPeakAreaParam
that integrates the signal (in samples in which no
chromatographic peak was found for a feature) in the m/z - rt region that is
defined based on the m/z and retention time ranges of all detected
chromatographic peaks of that specific feature. The lower m/z limit of the area
is defined as the lower quartile (25% quantile) of the "mzmin"
values of all
peaks of the feature, the upper m/z value as the upper quartile (75% quantile)
of the "mzmax"
values, the lower rt value as the lower quartile (25% quantile)
of the "rtmin"
and the upper rt value as the upper quartile (75% quantile) of
the "rtmax"
values. Below we perform this gap filling on our test data and
extract the feature values for the first 6 features after gap filling. An NA
is reported if no signal is measured at all for a specific sample.
faahko <- fillChromPeaks(faahko, param = ChromPeakAreaParam())
featureValues(faahko, value = "into") |> head()
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF
## FT001 135422.4 506848.9 111872.0 169955.6 210333.1 141880.3 233271.3
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003 213659.3 289500.7 167516.7 178285.7 253825.6 241844.4 240606.0
## FT004 349011.5 451863.7 343897.8 208002.8 364609.8 360908.9 219142.3
## FT005 286221.4 286594.4 164009.0 149097.6 255697.7 311296.8 366441.5
## FT006 1160580.5 1145753.0 380970.3 588986.4 1286883.0 1739516.6 639755.3
## wt22.CDF
## FT001 142254.8
## FT002 1354004.9
## FT003 185399.5
## FT004 221937.5
## FT005 271128.0
## FT006 508546.4
Final result
While we can continue using the xcms result set for further analysis
(e.g. also for feature grouping with the MsFeatures package; see
the LC-MS feature grouping vignette for details) we could also extract all
results as a SummarizedExperiment
object. This is the standard data
container for Bioconductor defined in the SummarizedExperiment
package and integration with other Bioconductor packages might thus be easier
using that type of object. Below we use the quantify
function to extract the
xcms pre-processing results as such a SummarizedExperiment
object. Internally, the featureValues
function is used to generate the feature
value matrix. We can pass any parameters from that function to the quantify
call. Below we use value = "into"
and method = "sum"
to report the
integrated peak signal as intensity and to sum these values in samples in which
more than one chromatographic peak was assigned to a feature (for that option it
is important to run refineChromPeaks
like described above to merge overlapping
peaks in each sample).
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
## Loading required package: GenomicRanges
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:xcms':
##
## distance
## Loading required package: GenomeInfoDb
res <- quantify(faahko, value = "into", method = "sum")
res
## class: SummarizedExperiment
## dim: 351 8
## metadata(6): '' '' ... '' ''
## assays(1): raw
## rownames(351): FT001 FT002 ... FT350 FT351
## rowData names(10): mzmed mzmin ... WT ms_level
## colnames(8): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
## colData names(4): sample_name sample_group spectraOrigin sample_type
The information from featureDefinitions
is now stored in the rowData
of this
object. The rowData
provides annotations and information for each row in
the SummarizedExperiment
(which in our case are the features).
rowData(res)
## DataFrame with 351 rows and 10 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001 200.1 200.1 200.1 2902.18 2882.13 2922.22 2
## FT002 205.0 205.0 205.0 2789.46 2782.38 2795.82 8
## FT003 206.0 206.0 206.0 2788.77 2780.81 2793.61 7
## FT004 207.1 207.1 207.1 2718.48 2713.31 2726.49 7
## FT005 233.0 233.0 233.1 3023.35 3014.97 3043.94 7
## ... ... ... ... ... ... ... ...
## FT347 595.25 595.2 595.3 3010.39 2992.58 3014.16 6
## FT348 596.20 596.2 596.2 2997.60 2992.58 3002.62 2
## FT349 596.30 596.3 596.3 3818.98 3811.68 3835.78 4
## FT350 597.40 597.4 597.4 3821.10 3817.96 3825.14 3
## FT351 599.30 599.3 599.3 4070.45 4042.10 4123.52 3
## KO WT ms_level
## <numeric> <numeric> <integer>
## FT001 2 0 1
## FT002 4 4 1
## FT003 3 4 1
## FT004 4 3 1
## FT005 3 4 1
## ... ... ... ...
## FT347 2 3 1
## FT348 0 2 1
## FT349 2 2 1
## FT350 1 2 1
## FT351 1 2 1
Annotations for columns (in our case samples) are stored as
colData
. In this data frame each row contains annotations for one sample (and
hence one column in the feature values matrix).
colData(res)
## DataFrame with 8 rows and 4 columns
## sample_name sample_group spectraOrigin sample_type
## <character> <character> <character> <character>
## ko15.CDF ko15 KO /home/bioc... QC
## ko16.CDF ko16 KO /home/bioc... study
## ko21.CDF ko21 KO /home/bioc... study
## ko22.CDF ko22 KO /home/bioc... QC
## wt15.CDF wt15 WT /home/bioc... study
## wt16.CDF wt16 WT /home/bioc... study
## wt21.CDF wt21 WT /home/bioc... QC
## wt22.CDF wt22 WT /home/bioc... study
Finally, the feature matrix is stored as an assay
within the object. Note that
a SummarizedExperiment
can have multiple assays which have to be numeric
matrices with the number of rows and columns matching the number of features and
samples, respectively. Below we list the names of the available assays.
assayNames(res)
## [1] "raw"
And we can access the actual data using the assay
function, optionally also
providing the name of the assay we want to access. Below we show the first 6
lines of that matrix.
assay(res) |> head()
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF
## FT001 135422.4 506848.9 111872.0 169955.6 210333.1 141880.3 233271.3
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003 213659.3 289500.7 167516.7 178285.7 253825.6 241844.4 240606.0
## FT004 349011.5 451863.7 343897.8 208002.8 364609.8 360908.9 219142.3
## FT005 286221.4 286594.4 164009.0 149097.6 255697.7 311296.8 366441.5
## FT006 1923307.8 1145753.0 380970.3 588986.4 1286883.0 1739516.6 639755.3
## wt22.CDF
## FT001 142254.8
## FT002 1354004.9
## FT003 185399.5
## FT004 221937.5
## FT005 271128.0
## FT006 508546.4
Since a SummarizedExperiment
supports multiple assays, we in addition add also
the feature value matrix without filled-in values (i.e. feature intensities
that were added by the gap filling step).
assays(res)$raw_nofill <- featureValues(faahko, filled = FALSE, method = "sum")
With that we have now two assays in our result object.
assayNames(res)
## [1] "raw" "raw_nofill"
And we can extract the feature values without gap-filling:
assay(res, "raw_nofill") |> head()
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF
## FT001 NA 506848.9 NA 169955.6 NA NA NA
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003 213659.3 289500.7 NA 178285.7 253825.6 241844.4 240606.0
## FT004 349011.5 451863.7 343897.8 208002.8 364609.8 360908.9 NA
## FT005 286221.4 NA 164009.0 149097.6 255697.7 311296.8 366441.5
## FT006 1923307.8 NA 380970.3 588986.4 1286883.0 1739516.6 639755.3
## wt22.CDF
## FT001 NA
## FT002 1354004.9
## FT003 185399.5
## FT004 221937.5
## FT005 271128.0
## FT006 508546.4
Finally, a history of the full processing with xcms is available as metadata
in the SummarizedExperiment
.
metadata(res)
## [[1]]
## Object of class "XProcessHistory"
## type: Peak detection
## date: Tue Jan 16 19:50:21 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: CentWaveParam
## MS level(s) 1
##
## [[2]]
## Object of class "XProcessHistory"
## type: Peak refinement
## date: Tue Jan 16 19:50:23 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: MergeNeighboringPeaksParam
## MS level(s) 1
##
## [[3]]
## Object of class "XProcessHistory"
## type: Peak grouping
## date: Tue Jan 16 19:50:39 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: PeakDensityParam
## MS level(s) 1
##
## [[4]]
## Object of class "XProcessHistory"
## type: Retention time correction
## date: Tue Jan 16 19:50:39 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: PeakGroupsParam
## MS level(s) 1
##
## [[5]]
## Object of class "XProcessHistory"
## type: Peak grouping
## date: Tue Jan 16 19:50:47 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: PeakDensityParam
## MS level(s) 1
##
## [[6]]
## Object of class "XProcessHistory"
## type: Missing peak filling
## date: Tue Jan 16 19:50:52 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: ChromPeakAreaParam
## MS level(s) 1
This same information can also be extracted from the xcms result object using
the processHistory
function. Below we extract the information for the first
processing step.
processHistory(faahko)[[1]]
## Object of class "XProcessHistory"
## type: Peak detection
## date: Tue Jan 16 19:50:21 2024
## info:
## fileIndex: 1,2,3,4,5,6,7,8
## Parameter class: CentWaveParam
## MS level(s) 1
These processing steps contain also the individual parameter objects used for
the analysis, hence allowing to exactly reproduce the analysis.
processHistory(faahko)[[1]] |> processParam()
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 25
## - peakwidth: [1] 20 80
## - snthresh: [1] 10
## - prefilter: [1] 6 5000
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 5000
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
At last we perform also a principal component analysis to evaluate the grouping
of the samples in this experiment. Note that we did not perform any data
normalization hence the grouping might (and will) also be influenced by
technical biases.
## Extract the features and log2 transform them
ft_ints <- log2(assay(res, "raw"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
digits = 3), " % variance"),
ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
digits = 3), " % variance"),
col = "darkgrey", bg = sample_colors, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = res$sample_name, col = "darkgrey",
pos = 3, cex = 2)
We can see the expected separation between the KO and WT samples on PC2. On PC1
samples separate based on their ID, samples with an ID <= 18 from samples with
an ID > 18. This separation might be caused by a technical bias
(e.g. measurements performed on different days/weeks) or due to biological
properties of the mice analyzed (sex, age, litter mates etc).