fmcsR: Mismatch Tolerant Maximum Common Substructure Detection for Advanced Compound Similarity Searching

Yan Wang, Tyler Backman, Kevin Horan, Thomas Girke

Last update: October 13, 2014

Introduction

Maximum common substructure (MCS) algorithms rank among the most sensitive and accurate methods for measuring structural similarities among small molecules. This utility is critical for many research areas in drug discovery and chemical genomics. The MCS problem is a graph-based similarity concept that is defined as the largest substructure (sub-graph) shared among two compounds (Wang, Backman, Horan, and Girke, 2013; Cao, Jiang, and Girke, 2008). It fundamentally differs from the structural descriptor-based strategies like fingerprints or structural keys. Another strength of the MCS approach is the identification of the actual MCS that can be mapped back to the source compounds in order to pinpoint the common and unique features in their structures. This output is often more intuitive to interpret and chemically more meaningful than the purely numeric information returned by descriptor-based approaches. Because the MCS problem is NP-complete, an efficient algorithm is essential to minimize the compute time of its extremely complex search process. The package implements an efficient backtracking algorithm that introduces a new flexible MCS (FMCS) matching strategy to identify MCSs among compounds containing atom and/or bond mismatches. In contrast to this, other MCS algorithms find only exact MCSs that are perfectly contained in two molecules. The details about the FMCS algorithm are described in the Supplementary Materials Section of the associated publication (Wang, Backman, Horan, et al., 2013). The package provides several utilities to use the FMCS algorithm for pairwise compound comparisons, structure similarity searching and clustering. To maximize performance, the time consuming computational steps of are implemented in C++. Integration with the package provides visualization functionalities of MCSs and consistent structure and substructure data handling routines (Cao, Charisi, Cheng, Jiang, and Girke, 2008; Backman, Cao, and Girke, 2011). The following gives an overview of the most important functionalities provided by .

Installation

The R software for running and can be downloaded from CRAN (http://cran.at.r-project.org/). The package can be installed from an open R session using the install command.

source("http://bioconductor.org/biocLite.R") 
biocLite("fmcsR") 

Quick Overview

To demo the main functionality of the package, one can load its sample data stored as object. The generic function can be used to visualize the corresponding structures.

library(fmcsR) 
data(fmcstest)
plot(fmcstest[1:3], print=FALSE) 
plot of chunk quicktest1

The function computes the MCS/FMCS shared among two compounds, which can be highlighted in their structure with the function.

test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) 
plotMCS(test,regenCoords=TRUE) 
plot of chunk quicktest2

Documentation

library("fmcsR") # Loads the package 

library(help="fmcsR") # Lists functions/classes provided by fmcsR 
library(help="ChemmineR") # Lists functions/classes from ChemmineR 
vignette("fmcsR") # Opens this PDF manual 
vignette("ChemmineR") # Opens ChemmineR PDF manual 

The help documents for the different functions and container classes can be accessed with the standard R help syntax.

?fmcs 
?"MCS-class" 
?"SDFset-class" 

MCS of Two Compounds

Data Import

The following loads the sample data set provided by the package. It contains the SD file (SDF) of molecules stored in an object.

data(fmcstest) 
sdfset <- fmcstest
sdfset 
## An instance of "SDFset" with 3 molecules

Custom compound data sets can be imported and exported with the and functions, respectively. The following demonstrates this by exporting the object to a file named sdfset.sdf. The latter is then reimported into R with the function.

write.SDF(sdfset, file="sdfset.sdf") 
mysdf <- read.SDFset(file="sdfset.sdf") 

Compute MCS

The function accepts as input two molecules provided as or objects. Its output is an S4 object of class . The default printing behavior summarizes the MCS result by providing the number of MCSs it found, the total number of atoms in the query compound \(a\), the total number of atoms in the target compound \(b\), the number of atoms in their MCS \(c\) and the corresponding Tanimoto Coefficient. The latter is a widely used similarity measure that is defined here as \(c/(a+b-c)\). In addition, the Overlap Coefficient is provided, which is defined as \(c/min(a,b)\). This coefficient is often useful for detecting similarities among compounds with large size differences.

mcsa <- fmcs(sdfset[[1]], sdfset[[2]]) 
mcsa 
## An instance of "MCS" 
##  Number of MCSs: 7 
##  CMP1: 14 atoms 
##  CMP2: 33 atoms 
##  MCS: 8 atoms 
##  Tanimoto Coefficient: 0.20513 
##  Overlap Coefficient: 0.57143
mcsb <- fmcs(sdfset[[1]], sdfset[[3]]) 
mcsb 
## An instance of "MCS" 
##  Number of MCSs: 1 
##  CMP1: 14 atoms 
##  CMP2: 25 atoms 
##  MCS: 14 atoms 
##  Tanimoto Coefficient: 0.56 
##  Overlap Coefficient: 1

If is run with then it returns the numeric summary information in a named .

fmcs(sdfset[1], sdfset[2], fast=TRUE)
##           Query_Size          Target_Size             MCS_Size Tanimoto_Coefficient  Overlap_Coefficient 
##           14.0000000           33.0000000            8.0000000            0.2051282            0.5714286

Class Usage

The class contains three components named , and . The slot stores the numeric summary information, while the structural MCS information for the query and target structures is stored in the and slots, respectively. The latter two slots each contain a with two subcomponents: the original query/target structures as objects as well as one or more numeric index vector(s) specifying the MCS information in form of the row positions in the atom block of the corresponding . A call to will often return several index vectors. In those cases the algorithm has identified alternative MCSs of equal size.

slotNames(mcsa) 
## [1] "stats" "mcs1"  "mcs2"

Accessor methods are provided to return the different data components of the class.

stats(mcsa) # or mcsa[["stats"]] 
##           Query_Size          Target_Size             MCS_Size Tanimoto_Coefficient  Overlap_Coefficient 
##           14.0000000           33.0000000            8.0000000            0.2051282            0.5714286
mcsa1 <- mcs1(mcsa) # or mcsa[["mcs1"]] 
mcsa2 <- mcs2(mcsa) # or mcsa[["mcs2"]] 
mcsa1[1] # returns SDFset component
## $query
## An instance of "SDFset" with 1 molecules
mcsa1[[2]][1:2] # return first two index vectors 
## $CMP1_fmcs_1
## [1]  3  8  7  4  9  5 11  1
## 
## $CMP1_fmcs_2
## [1]  3  8  7  4  9  5  1 13

The function can be used to return the substructures stored in an instance as object. If new atom numbers will be assigned to the subsetted SDF, while will maintain the atom numbers from its source. For details consult the help documents and .

mcstosdfset <- mcs2sdfset(mcsa, type="new")
plot(mcstosdfset[[1]], print=FALSE) 
plot of chunk unnamed-chunk-12

To construct an object manually, one can provide the required data components in a .

mylist <- list(stats=stats(mcsa), mcs1=mcs1(mcsa), mcs2=mcs2(mcsa)) 
as(mylist, "MCS") 
## An instance of "MCS" 
##  Number of MCSs: 7 
##  CMP1: 14 atoms 
##  CMP2: 33 atoms 
##  MCS: 8 atoms 
##  Tanimoto Coefficient: 0.20513 
##  Overlap Coefficient: 0.57143

FMCS of Two Compounds

If is run with its default paramenters then it returns the MCS of two compounds, because the mismatch parameters are all set to zero. To identify FMCSs, one has to raise the number of upper bound atom mismates and/or bond mismatches to interger values above zero.

plotMCS(fmcs(sdfset[1], sdfset[2], au=0, bu=0)) 
plot of chunk au0bu0

plotMCS(fmcs(sdfset[1], sdfset[2], au=1, bu=1)) 
plot of chunk au1bu1

plotMCS(fmcs(sdfset[1], sdfset[2], au=2, bu=2)) 
plot of chunk au2bu2

plotMCS(fmcs(sdfset[1], sdfset[3], au=0, bu=0)) 
plot of chunk au0bu013

FMCS Search Functionality

The function provides FMCS search functionality for compound collections stored in objects.

data(sdfsample) # Loads larger sample data set 
sdf <- sdfsample 
fmcsBatch(sdf[1], sdf[1:30], au=0, bu=0) 
##       Query_Size Target_Size MCS_Size Tanimoto_Coefficient Overlap_Coefficient
## CMP1          33          33       33            1.0000000           1.0000000
## CMP2          33          26       11            0.2291667           0.4230769
## CMP3          33          26       10            0.2040816           0.3846154
## CMP4          33          32        9            0.1607143           0.2812500
## CMP5          33          23       14            0.3333333           0.6086957
## CMP6          33          19       13            0.3333333           0.6842105
## CMP7          33          21        9            0.2000000           0.4285714
## CMP8          33          31        8            0.1428571           0.2580645
## CMP9          33          21        9            0.2000000           0.4285714
## CMP10         33          21        8            0.1739130           0.3809524
## CMP11         33          36       15            0.2777778           0.4545455
## CMP12         33          26       12            0.2553191           0.4615385
## CMP13         33          26       11            0.2291667           0.4230769
## CMP14         33          16       12            0.3243243           0.7500000
## CMP15         33          34       15            0.2884615           0.4545455
## CMP16         33          25        8            0.1600000           0.3200000
## CMP17         33          19        8            0.1818182           0.4210526
## CMP18         33          24       10            0.2127660           0.4166667
## CMP19         33          25       14            0.3181818           0.5600000
## CMP20         33          26       10            0.2040816           0.3846154
## CMP21         33          25       15            0.3488372           0.6000000
## CMP22         33          21       11            0.2558140           0.5238095
## CMP23         33          26       11            0.2291667           0.4230769
## CMP24         33          17        6            0.1363636           0.3529412
## CMP25         33          27        9            0.1764706           0.3333333
## CMP26         33          24       13            0.2954545           0.5416667
## CMP27         33          26       11            0.2291667           0.4230769
## CMP28         33          20       10            0.2325581           0.5000000
## CMP29         33          20        8            0.1777778           0.4000000
## CMP30         33          18        7            0.1590909           0.3888889

Clustering with FMCS

The function can be used to compute a similarity matrix for clustering with various algorithms available in R. The following example uses the FMCS algorithm to compute a similarity matrix that is used for hierarchical clustering with the function and the result is plotted in form of a dendrogram.

sdf <- sdf[1:7] 
d <- sapply(cid(sdf), function(x) fmcsBatch(sdf[x], sdf, au=0, bu=0, matching.mode="aromatic")[,"Overlap_Coefficient"]) 
d 
##           CMP1      CMP2      CMP3      CMP4      CMP5      CMP6      CMP7
## CMP1 1.0000000 0.2307692 0.2307692 0.2812500 0.5217391 0.6842105 0.2857143
## CMP2 0.2307692 1.0000000 0.4230769 0.5384615 0.2173913 0.4736842 0.2857143
## CMP3 0.2307692 0.4230769 1.0000000 0.3076923 0.2173913 0.4736842 0.9047619
## CMP4 0.2812500 0.5384615 0.3076923 1.0000000 0.3043478 0.5263158 0.2857143
## CMP5 0.5217391 0.2173913 0.2173913 0.3043478 1.0000000 0.5789474 0.2380952
## CMP6 0.6842105 0.4736842 0.4736842 0.5263158 0.5789474 1.0000000 0.3157895
## CMP7 0.2857143 0.2857143 0.9047619 0.2857143 0.2380952 0.3157895 1.0000000
hc <- hclust(as.dist(1-d), method="complete")
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) 
plot of chunk tree

The FMCS shared among compound pairs of interest can be visualized with, here for the two most similar compounds from the previous tree:

plotMCS(fmcs(sdf[3], sdf[7], au=0, bu=0, matching.mode="aromatic")) 
plot of chunk au0bu024

Version Information

 sessionInfo()
## R version 3.1.1 Patched (2014-09-25 r66681)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ChemmineOB_1.4.0     knitcitations_1.0-1  fmcsR_1.8.0          ChemmineR_2.18.0     knitrBootstrap_0.9.0
## 
## loaded via a namespace (and not attached):
##  [1] DBI_0.3.1         RCurl_1.95-4.3    RJSONIO_1.3-0     Rcpp_0.11.3       RefManageR_0.8.34 XML_3.98-1.1     
##  [7] bibtex_0.3-6      digest_0.6.4      evaluate_0.5.5    formatR_1.0       httr_0.5          knitr_1.7        
## [13] lubridate_1.3.3   markdown_0.7.4    memoise_0.2.1     parallel_3.1.1    plyr_1.8.1        rjson_0.2.14     
## [19] stringr_0.6.2     tools_3.1.1       zlibbioc_1.12.0

References

[1] T. W. H. Backman, Y. Cao and T. Girke. "ChemMine tools: an online service for analyzing and clustering small molecules". In: Nucleic Acids Research 39.suppl (May. 2011), pp. W486-W491. URL: http://dx.doi.org/10.1093/nar/gkr320.

[2] Y. Cao, A. Charisi, L. Cheng, et al. "ChemmineR: a compound mining framework for R". In: Bioinformatics 24.15 (Jul. 2008), pp. 1733-1734. URL: http://dx.doi.org/10.1093/bioinformatics/btn307.

[3] Y. Cao, T. Jiang and T. Girke. "A maximum common substructure-based algorithm for searching and predicting drug-like compounds". In: Bioinformatics 24.13 (Jun. 2008), pp. i366-i374. URL: http://dx.doi.org/10.1093/bioinformatics/btn186.

[4] Y. Wang, T. W. H. Backman, K. Horan, et al. "fmcsR: mismatch tolerant maximum common substructure searching in R". In: Bioinformatics 29.21 (Aug. 2013), pp. 2792-2794. URL: http://dx.doi.org/10.1093/bioinformatics/btt475.