CluMSID — Clustering of MS2 Spectra for Metabolite Identification

Tobias Depke

October 26, 2021

Introduction

This tutorial shows how to use the CluMSID package to help annotate MS2 spectra from untargeted LC-MS/MS data. CluMSID works with MS2 data generated by data-dependent acquisition and requires an mzXML file (like in this example) or any other file that can be parsed by mzR, like mzML, mzTab or netCDF, as input. It can be used both stand-alone and together with the XCMS suite of preprocessing tools.

CluMSID extracts and merges MS2 spectra and generates neutral loss patterns for each feature. Additionally, it can make use of information from the CAMERA package to generate pseudospectra from MS1 level data. The tool uses cosine similarity to generate distance matrices from MS2 spectra, neutral loss patterns and pseudospectra.

These distance matrices are the basis for multivariate statistics methods such as multidimensional scaling, density-based clustering, hierarchical clustering and correlation networks. The CluMSID package provides functions for these methods including (interactive) visualisation but the distance/similarity data can also be analysed with other R functions.

For the demonstrations in this tutorial, we will mainly use data from pooled Pseudomonas aeruginosa cell extracts, measured in ESI-(+) mode with auto-MS/MS on a Bruker maxisHD qTOF after reversed phase separation by UPLC. For details, please refer to the Depke et al. 2017 publication (doi: 10.1016/j.jchromb.2017.06.002.).

To be able to access the example data, we also need the related package CluMSIDdata. The packages can be loaded as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("CluMSIDdata", "CluMSID"))
library(CluMSID)
library(CluMSIDdata)

MS2spectrum and pseudospectrum classes

CluMSID uses a custom S4 class named MS2spectrum to store spectral information in the following slots:

The pseudospectrum class is very similar but it contains no information on precursor m/z and therefore no neutral loss pattern, either. By default, the id slot contains the “pcgroup” number assigned by CAMERA.

The individual slots of MS2spectrum and pseudospectrum objects can be accessed via the standard S4 way using object@slot, e.g. object@annotation or by using an accessor function. These exist for all slots and are called accessFoo(), where Foo is the slot name (not exactly, though, because Bioconductor does not allow to mix snake_case and camelCase in function names):

Extract MS2 spectra from *.mzXML file

The first step in the CluMSID workflow is to extract MS2 spectra from the raw data file (in mzXML format). This is done by the extractMS2spectra function which internally uses several functions from the mzR package. The function offers the possibility to filter spectra that contain less a defined number of peaks and/or do not fall in a defined retention time window. Setting the recalibrate_precursor argument to TRUE activates a correction process for uncalibrated precursor m/z data that existed in older version of Bruker’s Compass Xport (cf. Depke et al. 2017). It is not necessary to use it with files generated by other software but does not corrupt the data, either.

Please be aware that mzR often throws warnings concerning the Rcpp version that can usually be ignored.

ms2list <- extractMS2spectra(system.file("extdata", 
                                            "PoolA_R_SE.mzXML", 
                                            package = "CluMSIDdata"), 
                                min_peaks = 2, 
                                recalibrate_precursor = TRUE, 
                                RTlims = c(0,25))

This operation has now extracted all the MS2 spectra from the raw data file and stored them in a list. Each list entry is an object of class MS2spectrum. The list is quite long because it still contains a lot of spectra that derive from the same chromatographic peak.

length(ms2list)
#> [1] 2290

In our example, the first two spectra in the list derive from the same peak and thus have the same precursor ion and almost the same retention time.

head(ms2list, 4)
#> [[1]]
#> An object of class "MS2spectrum" 
#>  id:  
#>  annotation:  
#>  precursor: 146.1652 
#>  retention time: 56.266 
#>  polarity: positive 
#>  MS2 spectrum with 2 fragment peaks 
#>  neutral loss pattern with 0 neutral losses
#> [[2]]
#> An object of class "MS2spectrum" 
#>  id:  
#>  annotation:  
#>  precursor: 146.1653 
#>  retention time: 57.292 
#>  polarity: positive 
#>  MS2 spectrum with 3 fragment peaks 
#>  neutral loss pattern with 0 neutral losses
#> [[3]]
#> An object of class "MS2spectrum" 
#>  id:  
#>  annotation:  
#>  precursor: 129.1387 
#>  retention time: 57.545 
#>  polarity: positive 
#>  MS2 spectrum with 2 fragment peaks 
#>  neutral loss pattern with 0 neutral losses
#> [[4]]
#> An object of class "MS2spectrum" 
#>  id:  
#>  annotation:  
#>  precursor: 112.1119 
#>  retention time: 57.797 
#>  polarity: positive 
#>  MS2 spectrum with 2 fragment peaks 
#>  neutral loss pattern with 0 neutral losses

From the output above, you also see that the MS2spectrum class has a show() generic that summarises the MS2 spectrum and neutral loss pattern data. To show the default output, use showDefault(). Be aware that neutral loss patterns have not been calculated in this step.

showDefault(ms2list[[2]])
#> An object of class "MS2spectrum"
#> Slot "id":
#> character(0)
#> 
#> Slot "annotation":
#> character(0)
#> 
#> Slot "precursor":
#> [1] 146.1653
#> 
#> Slot "rt":
#> [1] 57.292
#> 
#> Slot "polarity":
#> [1] "positive"
#> 
#> Slot "spectrum":
#>           [,1] [,2]
#> [1,]  72.08064 2448
#> [2,]  84.08077  328
#> [3,] 112.11228  843
#> 
#> Slot "neutral_losses":
#> <0 x 0 matrix>

Merge MS2 spectra that derive from the same peak/feature

To reduce the amount of redundant MS2 spectra, the mergeMS2spectra() function is used to generate consensus spectra from the MS2 spectra that derive from the same precursor. CluMSID offers two possibilities to do so:

Merge spectra without external peaktable

This possibility is the standard method for stand-alone use of CluMSID and is equivalent to what has been described in Depke et al. 2017. It does not need additional input and summarises consecutive spectra that have the same precursor m/z if their retention time fall within a defined threshold (rt_tolerance, defaults to 30s). A retention time difference between consecutive spectra larger than rt_tolerance is interpreted as chromatographic separation and respective spectra will be assigned to a new feature. The mz_tolerance argument should be set according to your instruments m/z precision, the default is 1 * 10-5 (10ppm, equivalent to ±5ppm instrument precision). The peaktable and exclude_unmatched arguments are not used in this method and are to be left at their default.

The total amount of spectra was reduced from 2290 to 518 and as many other, the redundant spectra #1 and #2 in the raw list are now merged to one consensus spectrum (#1 in the merged list).

In this step, neutral loss patterns have been generated that look like this:

Merge spectra with external peaktable, e.g. from XCMS

The second possibility is to supply a peaktable, i.e. a list of picked peaks with their mass-to-charge ratios and retention times. This is particularly useful if you want to annotate a complete metabolomics data set. In our example, we have a metabolomics dataset called “TD035” in which we have measured a range of samples in MS1 mode for relative quantification. Additionally, we have measured a pooled QC sample in MS2 mode for annotation. The MS1 data were analysed using XCMSonline and we want to group the MS2 spectra so that they match the XCMSonline peak picking.

The spectra are extracted as shown above:

The peaklist is imported from the XCMSonline output. The list has to contain at least 3 columns:

Shown below is an easy way of getting from an XCMSonline annotated diffreport to a suitable peaktable using tidyverse functions. Of course, you can achieve the same goal with base R functions or even in Excel. Depending on the retention time format in your *.mzXML file, you might have to convert from minutes to seconds or vice versa. Here, we have minutes in the XCMSonline output but seconds in the MS2 file, so we multiply by 60.

We can now use this peaktable as an argument for mergeMS2spectra(). You can choose whether you want to keep or exclude MS2 spectra that do not match any peak in the peaktable. These can occur in regions of the chromatogramm where there are no clear peaks but the auto-MS/MS still fragments the most abundant ions. These unmatched spectra are merged following the same rules as described above (method without peaktable). In this example, we keep the unmatched spectra. We use the default values for m/z and retention time tolerance and thus do not need to specify them.

Note that the 2nd entry in featlist2 is marked with an ‘x’ which means that it could not be assigned to a feature in the peaktable.

For the sake of simplicity, only the data generated from the stand-alone procedure will be used for the following examples. Be assured that all of them would also work with the data generated with the help of an external peaktable (featlist2).

Add annotations

The next step is to add (external) annotations to the list of features, e.g. from a spectral library that you curate in-house or one that has been supplied by your instrument manufacturer. If you do not (want to) annotate your features at all, this step can be skipped completely, leaving the annotation slot of the MS2spectrum objects empty.

Manual procedure

CluMSID offers several possibilities to add annotations to your feature list. The most basic one first generates a list of features and saves it as *.csv file. For that you use the writeFeaturelist() function and only have to specify your list of spectra and a file name for the output file (here: pre_anno.csv). You can then manually fill in your annotations in a new column in the table, save it (in this example under the name post_anno.csv) and reload it to R:

annotatedSpeclist will then be equivalent to featlist with annotations added to the annotation slot of the list entries.

Alternative procedures

You can add annotations without leaving the R environment, too. addAnnotations() also accepts objects of class data.frame as annolist argument. Be aware that addAnnotations() assigns the annotation based on the position in the feature list. I.e., if the order of the features in your list of features (featlist) and your list of annotations (annolist) is different, you will get nonsense results.

The savest ways to addAnnotations() with a data.frame is to use featureList() to generate a data.frame that is formatted in the same way as the file output from writeFeaturelist() and then match your identifications against this data.frame and use the result as argument for addAnnotations().

Say you have an object called annos that contains feature IDs (the same as in featlist) and annotations in a two-column data.frame with “id” and “annotation” as column names. It could look like this:

addAnnotations(featlist, annos, annotationColumn = 2) will throw an error because featlist and annos are of different length. Instead, you need to do the following:

Now, you can annotate your list of spectra using addAnnotations(featlist, fl_annos, annotationColumn = 4).

An analogous procedure works if you have your annotations stored in a peaktable that you have used for mergeMSspectra(). As the order of spectra in the list will not be same as the order of features in your peaktable, you need to do a matching with the output of featureList() as well.

Generate distance matrices

Once we have a list of MS2spectrum objects containing all the required information with or without annotation, we can generate distance matrices from (product ion) MS2 spectra as well as from neutral loss patterns. These distance matrices serve as the basis for further analysis of the data. Both for MS2 spectra and neutral loss patterns, cosine similarity is used as similarity metric:

\[ cos(\theta) = \frac{\sum_{i}a_i \cdot b_i}{\sqrt{\sum_{i}{a_{i}}^2 \cdot \sum_{i}{b_{i}}^2}} \]

Distance matrix for product ion spectra

For most applications, analysing the similarity of product ion MS2 spectra will be most useful. The generation of the distance matrix is done by just one simple command but it can take some time to calculate.

Distance matrix for neutral loss patterns

Common neutral losses and neutral loss patterns can convey information about structural similarity, as well, e.g. with nucleotides or glykosylated secondary metabolites. CluMSID offers the possibility to study neutral loss patterns independently from product ion spectra. The generation of a distance matrix is analogous, you just need to set the ‘type’ argument to “neutral_losses”:

Visualise distance/similarity data using multidimensional scaling (MDS)

One rather simple possibility to visually analyse the spectral similarity data is multidimensional scaling, a dimension reduction method that simplifies distances in n-dimensional space to those in two-dimensional space (n in this case being the number of consensus spectra or neutral loss patterns that were used to generate the distance matrix in the previous step). CluMSID offers a simple function to produce an MDS plot from the distance matrix with the option to highlight annotated metabolites and the possibility to generate an interactive plot using plotly.

Standard MDS plots are generated as follows:

For MS2 spectra:

MDSplot(distmat, highlight_annotated = TRUE)
Figure 1: Multidimensional scaling plot as a visualisation of MS2 spectra similarities of the example data set. Red dots signify annotated spectra, black dots spectra from unknown metabolites.

Figure 1: Multidimensional scaling plot as a visualisation of MS2 spectra similarities of the example data set. Red dots signify annotated spectra, black dots spectra from unknown metabolites.

For neutral loss patterns:

MDSplot(nlmat, highlight_annotated = TRUE)
Figure 2: Multidimensional scaling plot as a visualisation of neutral loss similarities of the example data set. Red dots signify annotated spectra, black dots spectra from unknown metabolites.

Figure 2: Multidimensional scaling plot as a visualisation of neutral loss similarities of the example data set. Red dots signify annotated spectra, black dots spectra from unknown metabolites.

Interactive plots are zoomable and show feature names upon mouse-over. They are generated like normal MDS plots and can be viewed within RStudio or—after saving as html file using htmlwidgets—displayed in a normal web browser.

my_mds <- MDSplot(distmat, interactive = TRUE, highlight_annotated = TRUE)

htmlwidgets::saveWidget(my_mds, "mds.html")

This is how it looks like if you open the html file in Firefox and mouse over a feature:

**Figure 3:** Screenshot of the interactive version of the Multidimensional scaling plot visualising MS^2^ spectra similarities of the example data set (cf Figure 1). Zoomed image section with tooltip displaying feature information upon mouse-over.

Figure 3: Screenshot of the interactive version of the Multidimensional scaling plot visualising MS2 spectra similarities of the example data set (cf Figure 1). Zoomed image section with tooltip displaying feature information upon mouse-over.

Perform density-based clustering using the OPTICS algorithm

For density-based clustering with CluMSID, the ‘OPTICS’ algorithm and its implementation in the dbscan package is used. Density-based clustering is a useful clustering method that often yields different results than hierarchical clustering and can thus provide additional insight into the data. CluMSID has two functions to perform density-based clustering, one for the reachability plot which is the most useful visualisation of OPTICS results and one that outputs a data.frame containing the cluster assignations for every feature.

Both functions require as arguments a distance matrix as well as three parameters for the underlying functions dbscan::optics and dbscan::extractDBSCAN: eps, minPts and eps_cl. Lowering the eps parameter (default is 10000) limits the size of the epsilon neighbourhood which from experience has very little effect on the results. minPts defaults to 3 in CluMSID. It defines how many points are considered for reachability distance calculation between clusters. The dbscan::optics default for minPts is 5. Users are encourage to experiment with this parameter. eps_cl is the reachability threshold to identify clusters and can be varied based on your data. Lowering eps_cl leads to a larger number of smaller clusters and vice versa for raising the value. In general, it is advisable to chose a higher eps_cl for MS2 spectra than for neutral loss patterns, since the latter tend to show less similarity to each other. For details, please refer to the dbscan help for the dbscan::optics and dbscan::extractDBSCAN functions.

If the default parameters are used, the generation of an OPTICS reachability plots is very simple, shown here for MS2 spectra and neutral loss patterns:

OPTICSplot(distmat)
Figure 4: Reachability distance plot resulting from OPTICS density based clustering of the MS2 spectra similarities of the example data set. Bars represent features in OPTICS order with heights corresponding to the reachability distance to the next feature. The dashed horizontal line marks the reachability threshold that separates clusters. The resulting clusters are colour-coded with black representing noise, i.e. features not assigned to any cluster.

Figure 4: Reachability distance plot resulting from OPTICS density based clustering of the MS2 spectra similarities of the example data set. Bars represent features in OPTICS order with heights corresponding to the reachability distance to the next feature. The dashed horizontal line marks the reachability threshold that separates clusters. The resulting clusters are colour-coded with black representing noise, i.e. features not assigned to any cluster.

OPTICSplot(nlmat, eps_cl = 0.7)
Figure 5: Reachability distance plot resulting from OPTICS density based clustering of the neutral loss similarities of the example data set (cf Figure 4).

Figure 5: Reachability distance plot resulting from OPTICS density based clustering of the neutral loss similarities of the example data set (cf Figure 4).

In the reachability plots, every line represents a feature and the height of the line is the reachability distance to the next feature in the OPTICS order. Thus, valleys represent groups of similar spectra or neutral loss patterns. The order and the cluster assignment can be studied using the OPTICStbl function that outputs a three-column data.frame with feature id, cluster assignment and OPTICS order. The order of features in the data.frame corresponds to the original order in the input distance matrix. Features that were not assigned to a cluster are black in the reachability plot and have the cluster ID 0. OPTICStbl takes the same arguments as OPTICSplot. The two functions have to be run with exactly the same parameters to assure compatibility of results.

OPTICStbl <- OPTICStbl(distmat)

head(OPTICStbl)
#>                                 feature cluster_ID OPTICS_order
#> 1            M146.17T59.35 - spermidine          1            1
#> 2 M129.14T58.57 - spermidine (fragment)          1            3
#> 3  M112.11T57.8 - spermidine (fragment)          1            4
#> 4                         M251.16T60.64          0          185
#> 5                         M212.85T65.02          0          518
#> 6                         M290.85T64.76          0          517

Perform hierarchical clustering

In Depke et al. 2017, hierarchical clustering proved the most useful method to unveil structural similarities between features. analogous to density-based clustering, CluMSID offers two functions, one for plots and one for a data.frame with cluster assignments, both taking a distance matrix as the only compulsory argument. The other two parameters are h (defaults to 0.95), the height where the tree should be cut (see stats::cutree for details) and type that determines the type visualisation:

Create a heatmap

Heatmaps of our example data for MS2 and neutral loss pattern similarity are created as follows (with reduced label font size by changing cexRow and cexCol as well as margins of the underlying heatmap.2 function):

Figure 6: Symmetric heat map of the distance matrix displaying MS2 spectra similarities of the example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

Figure 6: Symmetric heat map of the distance matrix displaying MS2 spectra similarities of the example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

Figure 7: Symmetric heat map of the distance matrix displaying neutral loss similarities of the example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

Figure 7: Symmetric heat map of the distance matrix displaying neutral loss similarities of the example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

Obviously, it makes sense to export the plots to larger pdf or png files (e.g. 2000 \(\times\) 2000 pixels) to examine them closely. If exported to pdf, the feature names remain searchable (Ctrl+F in Windows).

Create a dendrogram

With the dendrogram, too, it is advisable to export is to pdf in a large format, e.g. as follows:

The plot from our example data looks like this:

**Figure 8:** Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on MS^2^ spectra similarities of the example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs, along with feature annotations, if existent. Distance from the central point is indicative of the height of the dendrogram.

Figure 8: Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on MS2 spectra similarities of the example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs, along with feature annotations, if existent. Distance from the central point is indicative of the height of the dendrogram.

The clusters are colour-coded and if exported to pdf, the tip labels containing feature ID and annotation are searchable.The height of the dendrogram’s branching points serves as another piece of information when interpreting the clustered data as it signifies similarity of features.

For a detailed example of how to interpret, please refer to Depke et al. 2017, where CluMSID helped to identify new members of several classes of secondary metabolites in Pseudomonas aeruginosa.

Like with density-based clustering, it is also possible to generate a list of features with respective cluster assignments using HCtbl. As mentioned above for CluMSID_OPTISplot and OPTICStbl, it is crucial to run HCplot and HCtbl using the same parameters.

Generate a correlation network

As a new functionality, CluMSID offers the possibility to analyse the similarity data using weighted correlation networks. These networks offer some advantages with respect to standard clustering methods, most notably that they do not strictly assign every feature to a distinct cluster but also represent similarities between features that would fall into different clusters in hierarchical or density-based clustering. Thus, correlation networks potentially contain more useful information for data interpretation. On the downside, the interpretation is also complicated by this lack of concrete cluster assignments. E.g., we cannot simply look up which features belong to the same cluster in order to examine their spectra closely but we have to go back to the correlation network visualisation and search for connected features manually.

networkplot requires some arguments:

A standard non-interactive correlation network for the MS2 example data can be plotted like this:

networkplot(distmat, highlight_annotated = TRUE, 
                show_labels = TRUE, interactive = FALSE)
Figure 9: Correlation network plot based on MS2 spectra similarities of the example data set. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

Figure 9: Correlation network plot based on MS2 spectra similarities of the example data set. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

As you can guess from this plot, it makes sense to use the interactive visualisation. Just like with MDSplotplot, you can view the interactive plot within RStudio or save it as html and view it in web browser.

my_net <- networkplot(distmat, interactive = TRUE, 
                            highlight_annotated = TRUE)

htmlwidgets::saveWidget(my_net, "net.html")

This is how it looks like if you open the html file in Firefox, zoom in on a cluster and mouse over a feature:

**Figure 10:** Screenshot of the interactive version of the Correlation network plot based on MS^2^ spectra similarities of the example data set (cf Figure 9). Zoomed image section with tooltip displaying feature information upon mouse-over.

Figure 10: Screenshot of the interactive version of the Correlation network plot based on MS2 spectra similarities of the example data set (cf Figure 9). Zoomed image section with tooltip displaying feature information upon mouse-over.

Please be aware that the spatial arrangement of the data points in the plot has a random component, i.e. while the relative position of the points (the distance to each other) is always the same, the absolute position varies and will not be the same even if the same command is executed twice.

The pairwise similarity of spectra or neutral loss patterns of features expressed by the cosine score is signified by the width of the line connecting the two features. All pairwise similarities greater than min_similarity result in a connecting line in the plot. The spatial proximity in which the features are mapped onto the plot is determined by the multivariate method underlying the network generation.

As we have already noticed after inspection of the heatmaps on p.13–14, the neutral loss patterns show much less similarity to each other than the MS2 spectra data. Thus, we expect quite a few neutral loss patterns that do not show any similarity to another neutral loss pattern. This expectation justifies the exclusion of these ‘singletons’ from the correlation network analysis. To do so, just set exclude_singletons to TRUE:

networkplot(nlmat, highlight_annotated = TRUE, 
                show_labels = TRUE, exclude_singletons = TRUE)
Figure 11: Correlation network plot based on neutral loss similarities of the example data set (cf Figure 9).

Figure 11: Correlation network plot based on neutral loss similarities of the example data set (cf Figure 9).

Additional functionalities

Multidimensional scaling, density-based clustering, hierarchical clustering and correlation network analysis are the main CluMSID tools to analyse MS2 spectra or neutral loss pattern similarity data, however, the package contains some additional functionalities that may facilitate data analysis in some cases and can also be used in other contexts with or without the above-mentioned unsupervised methods.

Access individual spectra from a list of spectra by various slot entries

Accessing S4 objects within lists is not trivial. Therefore, CluMSID offers a function to access individual or several MS2spectrum objects by their slot entries. getSpectrum() requires the following arguments:

Some examples will demonstrate the use of getSpectrum():

1. Accessing a spectrum by its ID. For this, the exact feature ID must be known:

2. Accessing a spectrum by its annotation. For this, the exact annotation has to be known as well, other annotations will produce a message:

3. Accessing spectra by their precursor ion m/z. If the list contains more than one spectrum with a precursor ion m/z within the tolerance, the output is again a list of MS2spectrum objects that meet the specified criterion:

4. Accessing spectra by their precursor retention time. Here, too, we can extract several MS2spectrum objects by setting a larger retention time tolerance. If we want to extract the spectra of all compounds that elute from 6min (360s) to 8min (480s), we proceed as follows:

Find spectra that contain a specific fragment or neutral loss

Another pair of accessory functions is findFragment() and findNL() which are used to find spectra that contain a specific fragment ion or neutral loss. Analogous to getSpectrum(), they need as arguments a list of MS2spectrum objects, the m/z of the fragment or neutral loss of interest and the respective m/z tolerance in ppm (default is 10ppm).

The two functions can be useful in many situation, e.g. when working with lipid data where head groups and fatty acids often give characteristic fragments or neutral losses. In the world of P. aeruginosa secondary metabolites, alkylquinolones (AQs) play an important role and most of the AQ MS2 spectra contain a signature fragment with an m/z of 159.068. Based on this fragment m/z, we can create a list of putative AQs:

An example for common neutral losses are nucleoside monophospates that all loose ribose-5’-monophosphate, resulting in a neutral loss of 212.009 in ESI-(+). Using findNL() we find CMP, UMP, AMP and GMP.

Match one spectrum against a set of spectra

If you are mainly interested in one or a few number of spectra or neutral loss patterns, it may be sufficient to match one feature at a time against a larger set of spectra. This set of spectra can be all spectra contained in one mzXML file like in all the examples in this tutorial or they could be a spectral library, as long as its format in R is a list of MS2spectrum objects.

The getSimilarities() function requires several arguments:

In the first example, we want to find all MS2 spectra in our example data set that are similar to the spectrum of pyocyanin, an important secondary metabolite from Pseudomonas aeruginosa and therefore match the pyocyanin spectrum against our annotatedSpeclist. Because we have already identified pyocyanin in the data set, we can use getSpectrum to extract the MS2spectrum object from annotatedSpeclist. We do not want to search all 518 elements of the result vector, so we set hits_only to TRUE to exclude spectra that have 0 similarity to the pyocyanin spectrum.

pyo <- getSpectrum(annotatedSpeclist, "annotation", "pyocyanin")

sim_pyo <- getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE)
sim_pyo
#>  M110.06T100.45  M123.06T103.31  M332.56T107.48  M332.08T113.21  M182.08T125.93 
#>    0.0235166588    0.0071763662    0.0032575891    0.0035153018    0.0414005385 
#>  M166.09T233.22  M120.08T233.48   M103.05T235.3  M174.06T277.59  M220.12T336.88 
#>    0.0394723541    0.0492390806    0.0826780036    0.0391004892    0.0205482303 
#>  M525.18T352.51   M243.08T362.4  M188.07T371.25   M205.1T370.99  M211.09T382.17 
#>    0.0060019991    0.0145904545    0.0176900909    0.0179895663    1.0000000000 
#>  M187.12T391.55  M188.12T399.72  M254.09T400.89  M160.08T433.66  M291.15T444.85 
#>    0.0210280136    0.0105392131    0.2071528536    0.0489638040    0.0106479317 
#>  M120.04T450.56  M138.06T451.33   M176.07T465.7  M491.29T496.41  M255.08T482.73 
#>    0.0287432023    0.0202198052    0.0275059908    0.0610208210    0.6451546287 
#>  M245.59T495.11  M145.08T508.11  M163.09T512.64  M188.11T535.78    M321.1T537.6 
#>    0.2583432230    0.0473127795    0.0034167239    0.0057005179    0.0293635312 
#>  M243.09T558.67  M136.08T584.09  M118.06T585.64  M215.12T626.24  M224.08T640.69 
#>    0.0116275804    0.0132716679    0.0203921366    0.3252546561    0.0325490977 
#>  M213.07T652.92  M216.14T670.01  M227.08T670.26  M264.18T675.46  M233.13T676.23 
#>    0.0083842257    0.0009299928    0.0034818309    0.0172023762    0.0143332573 
#>  M225.07T698.07   M207.06T699.1   M257.06T704.3  M226.18T703.76  M325.07T739.09 
#>    0.0253940205    0.0230298767    0.0028192053    0.0255571995    0.0010572974 
#>  M181.08T724.14   M330.19T724.4  M255.08T740.91  M446.19T745.32   M321.1T746.09 
#>    0.1283755709    0.5019236568    0.2030839014    0.0750793676    0.1017149711 
#>  M891.36T747.39   M231.1T763.28   M185.1T763.53   M288.2T765.88  M258.15T768.47 
#>    0.0714108632    0.0070294838    0.0094225379    0.0018840838    0.0168976240 
#>  M328.14T772.12   M242.15T789.6  M304.19T786.75  M260.16T796.66   M244.17T796.4 
#>    0.0009052790    0.0071606643    0.0066966285    0.0141834395    0.0106077162 
#>  M159.07T795.61  M314.21T825.99  M326.18T833.85  M286.18T864.04  M270.19T873.15 
#>    0.0124577125    0.0006167122    0.0023434969    0.0205024121    0.0089744139 
#>  M268.17T875.49  M178.05T867.16  M198.09T874.45    M170.1T885.9    M288.2T902.1 
#>    0.0011687551    0.0413446455    0.0064665757    0.0062571323    0.0100270721 
#>   M184.08T908.6   M272.2T910.42   M312.2T929.44  M314.21T944.87   M296.2T958.11 
#>    0.0036517997    0.0086043375    0.0085388744    0.0128215188    0.0054678320 
#>  M298.22T984.33  M500.22T974.72  M304.19T977.83  M303.19T979.65 M340.23T1007.62 
#>    0.0065812089    0.0396920510    0.0059045590    0.0049045093    0.0052002486 
#> M324.23T1025.79 M314.21T1034.15 M326.25T1043.73 M679.43T1051.39 
#>    0.0005826366    0.0030626495    0.0005424581    0.0008290325

We get 84 spectra that have a non-zero similarity to the pyocyanin spectrum, including pyocyanin itself with a similarity of 1. Of course, we can further filter the data by subsetting the result vector in order to exclude spectra that have only minimal similarity, e.g. M679.43T1051.39 with a cosine similarity of only 0.0008 (the last element in the vector).

In the second example, we generate a new speclist, e.g. from a spectral library. We look at the unknown feature that has most similarity to pyocyanin. As pyocyanin is contained in annotatedSpeclist itself, we have to look at the second highest similarity. Again, we use getSpectrum() to extract the object from annotatedSpeclist:

We see that the feature is not annotated. We are interested whether this feature also shows similarity to other members of the phenazine family of P. aeruginosa secondary metabolites. Some phenazines are contained in annotatedSpeclist but some are not, so we make a new speclist called phenazines and add the missing spectra manually from an in-house library:

As a result, we get the interesting information that the MS2 spectra similarity of our unknown feature seems to be specific to pyocyanin (both the experimental and the library spectrum).

Convert MSnbase objects to class MS2spectrum

The MSnbase package—which is commonly used for proteomics applications and is also associated with XCMS3—has two classes for (MS2) spectra, Spectrum and Spectrum2 which contain spectra along with metainformation. These metainformation differ from those contained in MS2spectrum objects and are not very well suited for metabolomics applications. Still, it is possible to use CluMSID functions with objects of those two classes by converting them to MS2spectrum objects using as.MS2spectrum():

Split polarities from polarity-switching runs

As polarity-switching and similar methords are gaining importance in LC-MS/MS metabolomics, CluMSID offers the possibility to process LC-MS/MS data containing spectra of different polarities. As spectra from positive and negative ionisation show different fragmentation mechanisms and patterns, it does not appear to be useful to compare spectra of different polarity to each other. Therefore, CluMSID provides a function to separate positive and negative spectra from each other. This has to be done in the very beginning of the analysis to not interfere with spectral merging. Positive and negative spectra can than be processed independently from each other as shown above.

A schematic workflow would like like this:

… and so on as described in this tutorial.

Use MS1 pseudospectra instead of or in addition to MS2 data

MS1 pseudospectra are groups of peaks/ions that derive or are assumed to derive from the same compound. They consist of peaks for in-source fragment, adducts etc. Pseudospectra can contain structural information about analytes, e.g. about moieties that easily fragment even in MS1 mode without CID. Thus, it might sometimes be useful to study similarities between pseudospectra analogously to those between MS2 spectra. CluMSID makes use of the CAMERA package to assign peaks to pseudospectra. A custom S4 class named pseudospectrum is used which is very similar to the MS2spectrum class. For obvious reasons, it does not contain a precursor ion m/z slot and thus no neutral loss pattern, either. The pcgroup defined by CAMERA is used as ID, an annotation can be added if desired.

Extract pseudospectra

To extract pseudospectra, you first have to process your data using the CAMERA package, either in R or via XCMSonline, where this is done automatically. There are two possibilities to use the extractPseudospectra() function in CluMSID: either with an xsAnnotate object which you generate with CAMERA in R or with a data.frame that contains data on m/z, retention time, intensity and pcgroup, e.g. the results table from XCMSonline.

The latter is demonstrated with the XCMSonline results table already used to generate a peak table. If the column names are not changed, the data.frame can be supplied as-is and intensity_columns does not have to be specified. We want to exclude pseudospectra that have only one peak, so we set min_peaks = 2.

As a result, we get a list with 198 pseudospectra that we can now process further.

Create distance matrix for pseudospectra

The creation of a distance matrix is analogous to the procedure for MS2 spectra:

Generate a correlation network for pseudospectra

The distance matrix can now be used for MDS, clustering and correlation networks just like described above. For demonstration, we generate a correlation network:

Figure 12: Correlation network plot based on similarities of pseudospectra of the example data set (cf Figure 9).

Figure 12: Correlation network plot based on similarities of pseudospectra of the example data set (cf Figure 9).

With the exclusion of singletons, we get a much less busy plot than for MS2 data but we still find quite a few connections that may prove informative.

Session Info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] magrittr_2.0.1      metaMSdata_1.29.0   metaMS_1.30.0      
#>  [4] CAMERA_1.50.0       xcms_3.16.0         MSnbase_2.20.0     
#>  [7] ProtGenerics_1.26.0 S4Vectors_0.32.0    mzR_2.28.0         
#> [10] Rcpp_1.0.7          BiocParallel_1.28.0 Biobase_2.54.0     
#> [13] BiocGenerics_0.40.0 CluMSIDdata_1.9.0   CluMSID_1.10.0     
#> 
#> loaded via a namespace (and not attached):
#>   [1] backports_1.2.1             Hmisc_4.6-0                
#>   [3] plyr_1.8.6                  igraph_1.2.7               
#>   [5] lazyeval_0.2.2              splines_4.1.1              
#>   [7] GenomeInfoDb_1.30.0         ggplot2_3.3.5              
#>   [9] digest_0.6.28               foreach_1.5.1              
#>  [11] htmltools_0.5.2             fansi_0.5.0                
#>  [13] checkmate_2.0.0             cluster_2.1.2              
#>  [15] doParallel_1.0.16           tzdb_0.1.2                 
#>  [17] limma_3.50.0                readr_2.0.2                
#>  [19] sna_2.6                     matrixStats_0.61.0         
#>  [21] vroom_1.5.5                 MsFeatures_1.2.0           
#>  [23] jpeg_0.1-9                  colorspace_2.0-2           
#>  [25] xfun_0.27                   dplyr_1.0.7                
#>  [27] crayon_1.4.1                RCurl_1.98-1.5             
#>  [29] jsonlite_1.7.2              graph_1.72.0               
#>  [31] impute_1.68.0               survival_3.2-13            
#>  [33] iterators_1.0.13            ape_5.5                    
#>  [35] glue_1.4.2                  gtable_0.3.0               
#>  [37] zlibbioc_1.40.0             XVector_0.34.0             
#>  [39] DelayedArray_0.20.0         DEoptimR_1.0-9             
#>  [41] scales_1.1.1                vsn_3.62.0                 
#>  [43] DBI_1.1.1                   GGally_2.1.2               
#>  [45] viridisLite_0.4.0           htmlTable_2.3.0            
#>  [47] clue_0.3-60                 bit_4.0.4                  
#>  [49] foreign_0.8-81              preprocessCore_1.56.0      
#>  [51] Formula_1.2-4               MsCoreUtils_1.6.0          
#>  [53] htmlwidgets_1.5.4           httr_1.4.2                 
#>  [55] gplots_3.1.1                RColorBrewer_1.1-2         
#>  [57] ellipsis_0.3.2              farver_2.1.0               
#>  [59] pkgconfig_2.0.3             reshape_0.8.8              
#>  [61] XML_3.99-0.8                nnet_7.3-16                
#>  [63] sass_0.4.0                  utf8_1.2.2                 
#>  [65] labeling_0.4.2              tidyselect_1.1.1           
#>  [67] rlang_0.4.12                munsell_0.5.0              
#>  [69] tools_4.1.1                 cli_3.0.1                  
#>  [71] dbscan_1.1-8                generics_0.1.1             
#>  [73] statnet.common_4.5.0        evaluate_0.14              
#>  [75] stringr_1.4.0               fastmap_1.1.0              
#>  [77] mzID_1.32.0                 yaml_2.2.1                 
#>  [79] bit64_4.0.5                 knitr_1.36                 
#>  [81] robustbase_0.93-9           caTools_1.18.2             
#>  [83] purrr_0.3.4                 RANN_2.6.1                 
#>  [85] ncdf4_1.17                  RBGL_1.70.0                
#>  [87] nlme_3.1-153                compiler_4.1.1             
#>  [89] rstudioapi_0.13             plotly_4.10.0              
#>  [91] png_0.1-7                   affyio_1.64.0              
#>  [93] MassSpecWavelet_1.60.0      tibble_3.1.5               
#>  [95] bslib_0.3.1                 stringi_1.7.5              
#>  [97] highr_0.9                   lattice_0.20-45            
#>  [99] Matrix_1.3-4                vctrs_0.3.8                
#> [101] pillar_1.6.4                lifecycle_1.0.1            
#> [103] BiocManager_1.30.16         jquerylib_0.1.4            
#> [105] MALDIquant_1.20             data.table_1.14.2          
#> [107] bitops_1.0-7                GenomicRanges_1.46.0       
#> [109] R6_2.5.1                    latticeExtra_0.6-29        
#> [111] pcaMethods_1.86.0           affy_1.72.0                
#> [113] network_1.17.1              KernSmooth_2.23-20         
#> [115] gridExtra_2.3               IRanges_2.28.0             
#> [117] codetools_0.2-18            MASS_7.3-54                
#> [119] gtools_3.9.2                assertthat_0.2.1           
#> [121] SummarizedExperiment_1.24.0 GenomeInfoDbData_1.2.7     
#> [123] hms_1.1.1                   parallel_4.1.1             
#> [125] grid_4.1.1                  rpart_4.1-15               
#> [127] tidyr_1.1.4                 coda_0.19-4                
#> [129] rmarkdown_2.11              MatrixGenerics_1.6.0       
#> [131] base64enc_0.1-3