ChemmineR: Cheminformatics Toolkit for R

Amazing Url Redirector


Authors: Eddie Cao, Tyler BackmanKevin Horan and Thomas Girke, UC Riverside

Contents

  1. 1 Introduction
  2. 2 Recently Added Features
  3. 3 Getting Started
    1. 3.1 Installation
    2. 3.2 Version History
    3. 3.3 Loading the Package and Documentation
    4. 3.4 Five Minute Tutorial
  4. 4 OpenBabel Functions
  5. 5 Overview of Classes and Functions
  6. 6 Import of Compounds
    1. 6.1 SDF Import
    2. 6.2 SMILES Import
  7. 7 Export of Compounds
    1. 7.1 SDF Export
    2. 7.2 SMILES Export
  8. 8 Format Interconversions
  9. 9 Splitting SD Files
  10. 10 Streaming Through Large SD Files
  11. 11 Storing Compounds in SQL Databases 
  12. 12 Working with SDF/SDFset Classes
  13. 13 Molecular Property Functions (Physicochemical Descriptors)
  14. 14 Molecular Properties from OpenBabel
  15. 15 Bond Matrices
  16. 16 Charges and Hydrogens
  17. 17 Ring Perception and Aromaticity Assignment
  18. 18 Rendering Chemical Structure Images
    1. 18.1 R Graphics Device
    2. 18.2 Online with ChemMine Tools
  19. 19 Similarity Comparisons and Searching
    1. 19.1 Maximum Common Substructure (MCS) Searching
    2. 19.2 AP/APset Classes for Storing Atom Pair Descriptors
    3. 19.3 Large SDF and Atom Pair Databases
    4. 19.4 Pairwise Compound Comparisons with Atom Pairs
    5. 19.5 Similarity Searching with Atom Pairs
    6. 19.6 FP/FPset Classes for Storing Fingerprints
    7. 19.7 Atom Pair Fingerprints 
    8. 19.8 Pairwise Compound Comparisons with PubChem Fingerprints
    9. 19.9 Similarity Searching with PubChem Fingerprints
    10. 19.10 Visualize Similarity Search Results
  20. 20 Clustering
    1. 20.1 Clustering Identical or Very Similar Compounds
    2. 20.2 Binning Clustering
    3. 20.3 Jarvis-Patrick Clustering
    4. 20.4 Multi-Dimensional Scaling (MDS)
    5. 20.5 Clustering with Other Algorithms
      1. 20.5.1 Hierarchical Clustering
        1. 20.5.1.1 Using Atom Pairs
        2. 20.5.1.2 Using PubChem Fingerprints
  21. 21 Searching PubChem
    1. 21.1 Get Compounds from PubChem by Id
    2. 21.2 Similarity Search of PubChem with SMILES Query
    3. 21.3 Similarity Search of PubChem with SDF Query
  22. 22 Biological Screen Analysis
  23. 23 Session Information
  24. 24 Exercises
  25. 25 Funding
  26. 26 References

Introduction

 Manual: [ PDF ]   [ R Code ]   Slides: [ PDF ] [ R Code ]

ChemmineR is a cheminformatics package for analyzing drug-like small molecule data in R. Its latest version contains functions for efficient processing of large numbers of molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms. In addition, it offers visualization functions for compound clustering results and chemical structures. The integration of chemoinformatic tools with the R programming environment has many advantages, such as easy access to a wide spectrum of statistical methods, machine learning algorithms and graphic utilities. The first version of this package was published in 2008 in Bioinformatics: 24, 1733-1734 (please cite where appropriate). Since then many additional utilities and add-on packages have been added to the environment and many more are under development for future releases (Backman et al., 2011, Wang et al., 2013).




Figure 1: ChemmineR environment with its add-on packages and selected functionalities


Table of Contents

Recently Added Features



Getting Started

Installation

The R software for running ChemmineR can be downloaded here. The ChemmineR package itself is available at the Bioconductor repository. Due to its heavy development, it is strongly recommended to maintain a recent version of ChemmineR by updating or reinstalling it frequently. To install the package, it is strongly recommended to use the automated bioLite install command in R as shown here:

source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script.
biocLite("ChemmineR") # Installs ChemmineR package.

biocLite("ChemmineOB") # Installs ChemmineOB add-on package.

Alternatively, one can download the package and then install it as shown below. Note: this is not the recommended approach for installing Bioconductor packages.

## OS X
install.packages("ChemmineR_x.x.x.tgz", repos=NULL) # Add 'type="source"' when installing from source.

## Linux
install.packages("ChemmineR_x.x.x.tar.gz", repos=NULL)

## Windows

install.packages("ChemmineR_x.x.x.zip", repos=NULL)

Note: x.x.x needs to be replaced by the current version number of the package.

Table of Contents

Version History

See Version History Page

Table of Contents

Loading the Package and Documentation


## Load the package
library("ChemmineR")

## List all functions and classes available in the package:
library(help="ChemmineR")

## Open the vignette pdf (manual) of the package
vignette("ChemmineR")

Five Minute Tutorial

The following code gives an interactive overview of the most important functionalities provided by ChemmineR. Copy and paste of the commands into the R console will demonstrate these utilities.


## Create Instances of SDFset class
## Sample data set contains first 100 compounds from PubChem SD file "Compound_00650001_00675000.sdf.gz"
## from: ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

data(sdfsample); sdfset <- sdfsample
sdfset; view(sdfset[1:4])
sdfset[[1]]

## The SDFset object is created during the import of an SD file
sdfset <- read.SDFset("
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")

## Miscellaneous accessor methods for SDFset container
header(sdfset[1:4])
atomblock(sdfset[1:4])
atomcount(sdfset[1:4])
bondblock(sdfset[1:4])
datablock(sdfset[1:4])

## Assigning compound IDs and keeping them unique
cid(sdfset); sdfid(sdfset)
unique_ids <- makeUnique(sdfid(sdfset))
cid(sdfset) <- unique_ids

## Converting the data blocks in SDFset to matrix
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix 
numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits to numeric and character matrix
numchar[[1]][1:4,]; numchar[[2]][1:4,]

## Compute atom frequency matrix, molecular weight and formula
propma <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset))
propma[1:4, ]

## Assign matrix data to data block
datablock(sdfset) <- propma
view(sdfset[1:4])

## String Searching in SDFset
grepSDFset("650001", sdfset, field="datablock", mode="subset") # To return index, set mode="index")

## Export SDFset to SD file
# write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE)

## Plot molecule structure of one or many SDFs
plot(sdfset[1:4]) # plots to R graphics device
sdf.visualize(sdfset[1:4]) # viewing in web browser


## Structure similarity searching and clustering using atom pairs
apset <- sdf2ap(sdfset) # Generate atom pair descriptor database for searching
data(apset) # Loads same data set provided by library.
cmp.search(apset, apset[1], type=3, cutoff = 0.3) # Search apset database with single compound.
cmp.cluster(db=apset, cutoff = c(0.65, 0.5))[1:4,] # Binning clustering using variable similarity cutoffs.

## Structure similarity searching and clustering using PubChem fingerprints
data(sdfsample); sdfset <- sdfsample; cid(sdfset) <- sdfid(sdfset) # Refresh sample data set
fpset <- fp2bit(sdfset) #
Convert base 64 encoded fingerprints to binary matrix
fpSim(fpset[1], fpset[2]) #
Pairwise compound structure comparisons
fpSim(fpset, fpset) # Structure similarity searching
simMA <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE)) # Compute similarity matrix
hc <- hclust(as.dist(1-simMA), method="single") # Hierarchical clustering with simMA as input
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
# Plot hierarchical clustering tree

Table of Contents


OpenBabel Functions

ChemmineR integrates now a subset of cheminformatics functionalities implemented in the OpenBabel C++ library. These utilities can be enabled by installing the ChemmineOB package and the OpenBabel software itself. ChemmineR will automatically detect the presence of ChemmineOB and make use of the additinal utilities. The following lists the functions and methods that make use of OpenBabel. References are included to locate the sections in the manual where the utility and usage of these functions is described.

  • Structure format interconversions (see Section 8)
    • smiles2sdf: converts from SMILES to SDF object
    • sdf2smiles: converts from SDF to SMILES object
    • convertFormat: converts strings between two formats
    • convertFormatFile: converts files between two formats
    • propOB: generates several OpenBabel compound properties
    • fingerprintOB: generates fingerprints for compounds


Overview of Classes and Functions

The following list gives an overview of the most important classes, methods and functions available in the ChemmineR package. The help documents of the package provide much more detailed information on each utility. The standard R help documents for these utilities can be accessed with this syntax: ?function_name (e.g. ?cid) and ?class_name-class (e.g. ?"SDFset-class").

  • Molecular Structure Data
    • Classes
      • SDFstr: intermediate string class to facilitate SD file import; not important for end user
      • SDF: container for single molecule imported from an SD file
      • SDFset: container for many SDF objects; most important structure container for end user 
      • SMI: container for a single SMILES string
      • SMIset: container for many SMILES strings
    • Functions/Methods (mainly for SDFset container, SMIset should be coerced with smiles2sdf to SDFset)
      • Accessor methods for SDF/SDFset
        • Object slots: cid, header, atomblock, bondblock, datablock (sdfid, datablocktag)
        • Summary of SDFset: view
        • Matrix conversion of data block: datablock2ma, splitNumChar
        • String search in SDFset: grepSDFset
      • Coerce one class to another
        • See R help with ?"SDFset-class"
      • Utilities
        • Atom frequencies: atomcountMA, atomcount
        • Molecular weight: MW
        • Molecular formula: MF
        • ...
      • Compound structure depictions
        • R graphics device: plot, plotStruc
        • Online: cmp.visualize
  • Structure Descriptor Data
    •  Classes
      • AP: container for atom pair descriptors of a single molecule
      • APset: container for many AP objects; most important structure descriptor container for end user
      • FP: container for fingerprint of a single molecule
      • FPset: container for fingerprints of many molecules, most important structure descriptor container for end user
    • Functions/Methods
      • Create AP/APset instances
        • From SDFset: sdf2ap
        • From SD file: cmp.parse
        • Summary of AP/APset: view, db.explain
      • Accessor methods for AP/APset
        • Object slots: ap, cid
      • Coerce one class to another
        • See R help with ?"APset-class"
      • Structure Similarity comparisons and Searching
        • Compute pairwise similarities : cmp.similarity, fpSim
        • Search APset database: cmp.search, fpSim
            
      • AP-based Structure Similarity Clustering
        • Single-linkage binning clustering: cmp.cluster
        • Visualize clustering result with MDS: cluster.visualize
        • Size distribution of clusters: cluster.sizestat

Import of Compounds

SDF Import

## Import SD file and store as SDFset object
## Sample data set contains first 100 compounds from PubChem SD file "Compound_00650001_00675000.sdf.gz"
## from: ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
data(sdfsample); sdfset <- sdfsample # The sample SDF set is provided by the library
valid <- validSDF(sdfset); which(!valid) # Identifies invalid SDFs in SDFset objects.
sdfset <- sdfset[valid] # Removes invalid SDFs, if there are any.

## Import SD file and store as SDFstr object
sdfstr <- read.SDFstr("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")

## Create SDFset from SDFstr class
read.SDFset(sdfstr)
as(sdfstr, "SDFset")

Table of Contents

SMILES Import

## Create sample SMILES file for import
data(smisample); smiset <- smisample
write.SMI(smiset[1:4], file="sub.smi")

## Import SMILES file
smiset <- read.SMIset("sub.smi")

## Inspect content
smiset
An instance of "SMIset" with 100 molecules

view(smiset[1:2])
$`650001`
An instance of "SMI"
[1] "O=C(NC1CCCC1)CN(c1cc2OCCOc2cc1)C(=O)CCC(=O)Nc1noc(c1)C"

$`650002`
An instance of "SMI"
[1] "O=c1[nH]c(=O)n(c2nc(n(CCCc3ccccc3)c12)NCCCO)C"


## Accessor functions
cid(smiset[1:4])
[1] "650001" "650002" "650003" "650004"

(smi <- as.character(smiset[1:2]))
                                                  650001
"O=C(NC1CCCC1)CN(c1cc2OCCOc2cc1)C(=O)CCC(=O)Nc1noc(c1)C"
                                                  650002
         "O=c1[nH]c(=O)n(c2nc(n(CCCc3ccccc3)c12)NCCCO)C"


## Construct SMIset object from named vector
as(smi, "SMIset")

An instance of "SMIset" with 2 molecules

Export of Compounds

SDF Export

## Write objects of classes SDFset/SDFstr/SDF to file
write.SDF(sdfset[1:4], file="sub.sdf")
    
## Writing customized SDFset to file containing ChemmineR signature, IDs from SDFset and no data block
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL)
    
## Example for injecting a custom matrix/data frame into the data block of an SDFset and then writing it to an SD file
props <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset))
datablock(sdfset) <- props
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE)

## Indirect export via SDFstr object
sdf2str(sdf=sdfset[[1]], sig=TRUE, cid=TRUE) # Uses default components
sdf2str(sdf=sdfset[[1]], head=letters[1:4], db=NULL) # Uses custom components for header and data block

## Write SDF, SDFset or SDFstr Classes to File
write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL)
write.SDF(sdfstr[1:4], file="sub.sdf")
cat(unlist(as(sdfstr[1:4], "list")), file="sub.sdf", sep="\n")

Table of Contents

SMILES Export

## Sample data set
data(smisample); smiset <- smisample
## Write objects of classes SMIset to file with and without compound identifiers
write.SMI(smiset[1:4], file="sub.smi", cid=TRUE)
write.SMI(smiset[1:4], file="sub.smi", cid=FALSE)

Format Interconversions

The sdf2smiles and smiles2sdf functions provide format interconversion between SMILES strings (Simplified Molecular Input Line Entry Specification) and SDFset containers.
## Convert an SDFset container to a SMILES character string
data(sdfsample); sdfset <- sdfsample[1]
smiles <- sdf2smiles(sdfset)
smiles

                                                  650001
"O=C(NC1CCCC1)CN(c1cc2OCCOc2cc1)C(=O)CCC(=O)Nc1noc(c1)C"
    
## Convert a SMILES character string to an SDFset container
sdf <- smiles2sdf("CC(=O)OC1=CC=CC=C1C(=O)O\tname")
view(sdf)


When the ChemineOB package is installed these conversions are performed with OpenBabel. Otherwise these functions will fall back to using the ChemMine Tools web service for conversion with the OpenBabel Open Source Chemistry Toolbox. They will then require internet connectivity and will be limited to only the first compound given. ChemmineOB provides access to the compound format conversion functions of OpenBabel. Currently, over 160 formats are supported by OpenBabel. The functions convertFormat and convertFormatFile can be used to convert files or strings between any two formats supported by OpenBabel. For example, to convert a SMILES string to an SDF string, one can use the the convertFormat function.

sdfStr <- convertFormat("SMI","SDF","CC(=O)OC1=CC=CC=C1C(=O)O\ttest_name")

This will return the given compound as an SDF formatted string. 2D coordinates are also computed and included in the resulting SDF string. To convert a file with compounds encoded in one format to another format, the convertFormatFile function can be used instead.

convertFormatFile("SMI","SDF","test.smiles","test.sdf")

To see the whole list of file formats supported by OpenBabel, one can run from the command-line "obabel -L formats".


Table of Contents


Splitting SD Files

The following write.SDFsplit function allows to split SD Files into any number of smaller SD Files. This can become important when working with very big SD Files. Users should note that this function can output many files, thus one should run it in a dedicated directory!  
## Create sample SD File with 100 molecules
write.SDF(sdfset, "test.sdf")
 
## Read in sample SD File. Note: reading file into SDFstr is much faster than into SDFset!
sdfstr <- read.SDFstr("test.sdf")
   
## Run export on SDFstr object
write.SDFsplit(x=sdfstr, filetag="myfile", nmol=10)
# nmol defines the number of molecules to write to each file
    
## Run export on SDFset object
write.SDFsplit(x=sdfset, filetag="myfile", nmol=10)

Table of Contents


Streaming Through Large SD Files

The sdfStream function allows to stream through SD Files with millions of molecules without consuming much memory. During this process any set of descriptors, supported by ChemmineR, can be computed (e.g. atom pairs, molecular properties, etc.), as long as they can be returned in tabular format. In addition to descriptor values, the function returns a line index that records the start and end positions of each molecule in the source SD File. This line index can be used by the downstream read.SDFindex function to retrieve specific molecules of interest from the source SD File without importing the entire file into R. The following outlines the typical workflow of this streaming functionality in ChemmineR.
## Create sample SD File with 100 molecules
write.SDF(sdfset, "test.sdf")

## Define descriptor set in a simple function
desc <- function(sdfset) {
        cbind(SDFID=sdfid(sdfset),
              # datablock2ma(datablocklist=datablock(sdfset)),
              MW=MW(sdfset),
              groups(sdfset),
              APFP=desc2fp(x=sdf2ap(sdfset), descnames=1024, type="character"),
              AP=sdf2ap(sdfset, type="character"),
              rings(sdfset, type="count", upper=6, arom=TRUE)
        )
}

## Run 'sdfStream' with 'desc' function and write results to a file called 'matrix.xls'
sdfStream(input="test.sdf", output="matrix.xls", fct=desc, Nlines=1000) # 'Nlines': number of lines to read from input SD File at a time

## Same as before but starting to read at a specific line number in the SD file, here 950. This is useful for restarting and debugging the process.  
sdfStream(input="test.sdf", output="matrix2.xls", append=FALSE, fct=desc, Nlines=1000, startline=950) # With append=TRUE the results can be appended to an existing file.


## Select molecules meeting certain property criteria from SD File using line index generated by previous 'sdfStream' step
indexDF <- read.delim("matrix.xls", row.names=1)[,1:4]
indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400
sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset") # Collects results in 'SDFset' container

## Write results directly to SD file without storing larger numbers of molecules in memory
read.SDFindex(file="test.sdf", index=indexDFsub, type="file", outfile="sub.sdf")

## Read AP/APFP strings from file into APset or FP object
apset <- read.AP(x="matrix.xls", type="ap", colid="AP")
apfp <- read.AP(x="matrix.xls", type="fp", colid="APFP")

## Alternatively, one can provide the AP/APFP strings in a named character vector
apset <- read.AP(x=sdf2ap(sdfset[1:20], type="character"), type="ap")
fpchar <- desc2fp(sdf2ap(sdfset[1:20]), descnames=1024, type="character")  
fpset <- as(fpchar, "FPset")

Table of Contents


Storing Compounds in SQL Databases 

As As an alternative to sdfStream, there is now also an option to store data in an SQL database, which then allows for fast queries and compound retrieval. This is still an experimental feature. The default database is SQLite, but any other SQL database should work with some minor modifications to the table definitions, which are stored in schema/compounds.SQLite under the ChemmineR package directory. Compounds are stored in their entirety in the databases so there is no need to keep any original data files. Users can define their own set of compound features to compute and store when loading new compounds. Each of these features will be stored in its own, indexed table. Searches can then be performed using these features to quickly find specific compounds. Compounds can always be retrieved quickly because of the database index, no need to scan a large compound file. In addition to user defined features, descriptors can also be computed and stored for each compound. A new database can be created with the initDb function. This takes either an existing database connection, or a filename. If a filename is given then an SQLite database connection is created. It then ensures that the required tables exist and creates them if not. The connection object is then returned. This function can be called safely on the same connection or database many times and will not delete any data.

Loading Data
The function loadSdf can be used to load SDF data, either from a file or SDFset object. The fct parameter should be a function to extract features from the data. It will be handed an SDFset generated from the data being loaded. This may be done in batches, so there is no guarantee that the given SDFSset will contain the whole dataset. This function should return a data frame with a column for each feature and a row for each compound given. The order of the final data frame should be the same as that of the SDFset. The column names will become the feature names. Each of these features will become a new, indexed, table in the database which can be used later to search for compounds. The descriptors parameter can be a function which computes descriptors. This function will also be given an SDFset object, which may be done in batches. It should return a data frame with the following two columns: “descriptor” and “descriptor type”. The “descriptor” column should contain a string representation of the descriptor, and “descriptor type” is the type of the descriptor. Our convention for atom pair is “ap” and “fp” for finger print. The order should also be maintained. When the data has been loaded, loadSdf will return the compound id numbers of each compound loaded. These compound id numbers are computed by the database and are not extracted from the compound data itself. They can be used to quickly retrieve compounds later. New features can also be added using this function. However, all compounds must have all features so if new features are added to a new set of compounds, all existing features must be computable by the fct function given. If new features are detected, all existing compounds will be run through fct in order to compute the new features for them as well. For example, if dataset X is loaded with features F1 and F2, and then at a later time we load dataset Y with new feature F3, the fct function used to load dataset Y must compute and return features F1, F2, and F3. loadSdf will call fct with both datasets X and Y so that all features are available for all compounds. If any features are missing an error will be raised. If just new features are being added, but no new compounds, use the addNewFeatures function.
In th example, we create a new database called “test.db” and load it with data from and SDFset. We also define fct to compute the molecular weight, “MW”, and the number of rings and aromatic rings. The rings function actually returns a data frame with columns “RINGS” and “AROMATIC”, which will be merged into the data frame being created which will also contain the “MW” column. These will be the names used for these features and must be used when searching with them. Finally, the new compound ids are returned and stored in the “ids” variable.

## Import sample molecule set
data(sdfsample)

## Create and initialize a new SQLite database
conn <- initDb("test.db")

## Load data and compute 3 features: molecular weight, with the MW function,
## and counts for RINGS and AROMATIC, as computed by rings, which
## returns a data frame itself.

ids <- loadSdf(conn,sdfsample,
      function(sdfset)
       data.frame(MW = MW(sdfset), 
                  rings(sdfset,type="count",upper=6, arom=TRUE))
      )

Table of Contents

Searching
Compounds can be searched for using the findCompounds function. This function takes a connection object, a vector of feature names used in the tests, and finally, a vector of tests that must all pass for a compound to be included in the result set. Each test should be a boolean expression. For example: “c(”MW <= 400”,”RINGS > 3”)” would return all compounds with a molecular weight of 400 or less and more than 3 rings, assuming these features exist in the database. The syntax for each test is “<feature name> <SQL operator> <value>”. If you know SQL you can go beyond this basic syntax. These tests will simply be concatenated together with “AND” in-between them and tacked on the end of a WHERE clause of an SQL statement. So any SQL that will work in that context is fine. The function will return a list of compound ids, the actual compounds can be fetched with getCompounds. If just the names are needed, the getCompoundNames function can be used. Compounds can also be fetched by name using the findCompoundsByName function.
In the following example we search for compounds with molecular weight less than 300. We then fetch the matching compounds and show their molecular weight.

lightIds <- findCompounds(conn, "MW", c("MW < 300"))
MW(getCompounds(conn,lightIds))

## Names of matching compounds
getCompoundNames(conn, lightIds)

Table of Contents


Working with SDF/SDFset Classes


## Inspecting content of SDF/SDFset objects
view(sdfset[1:4]) # Summary view of several molecules in SDFset object
length(sdfset) # Returns number of molecules stored in object
sdfset[[1]] # Returns single molecule from SDFset as SDF object
sdfset[[1]][[2]] # Returns atom block from first compound as matrix
sdfset[[1]][[2]][1:4,] # Returns first four rows of atom block from first compound
c(sdfset[1:4], sdfset[5:8]) # Concatenation of several SDFsets; is currently limited to two arguments!

## String searching in SDFsets
grepSDFset("650001", sdfset, field="datablock", mode="subset") # To return index, set mode="index")

## Compound IDs
sdfid(sdfset[1:4]) # Retrieves CMP IDs from Molecule Name field in header block.
cid(sdfset[1:4]) # Retrieves CMP IDs from ID slot in SDFset.
unique_ids <- makeUnique(sdfid(sdfset)) # Creates unique IDs by appending a counter to duplicates.
cid(sdfset) <- unique_ids # Assigns uniquified IDs to ID slot

## Subsetting by character, index and logical vectors
view(sdfset[c("650001", "650012")])
view(sdfset[4:1])
mylog <- cid(sdfset) %in% c("650001", "650012"); view(sdfset[mylog])

## Accessing SDF/SDFset components: header, atom, bond and data blocks
atomblock(sdf); sdf[[2]]; sdf[["atomblock"]] # all three methods return the same component
header(sdfset[1:4]); atomblock(sdfset[1:4]); bondblock(sdfset[1:4]); datablock(sdfset[1:4])
header(sdfset[[1]]); atomblock(sdfset[[1]]); bondblock(sdfset[[1]]); datablock(sdfset[[1]])

## Replacement Methods
sdfset[[1]][[2]][1,1] <- 999
atomblock(sdfset)[1] <- atomblock(sdfset)[2]

datablock(sdfset)[1] <- datablock(sdfset)[2]

## Assign matrix data to data block
datablock(sdfset) <- as.matrix(iris[1:100,])
view(sdfset[1:4])


## Class Coercions
## (a) From SDFstr to list, SDF and SDFset
as(sdfstr[1:2], "list") 
as(sdfstr[[1]], "SDF")
as(sdfstr[1:2], "SDFset")

## (b) From SDF to SDFstr, SDFset, list with SDF sub-components
sdfcomplist <- as(sdf, "list")
sdfcomplist <- as(sdfset[1:4], "list"); as(sdfcomplist[[1]], "SDF")
sdflist <- as(sdfset[1:4], "SDF"); as(sdflist, "SDFset")
as(sdfset[[1]], "SDFstr")
as(sdfset[[1]], "SDFset")

## (c) From SDFset to lists with components consisting of SDF or sub-components
as(sdfset[1:4], "SDF")
as(sdfset[1:4], "list")
as(sdfset[1:4], "SDFstr")

Molecular Property Functions (Physicochemical Descriptors)

Several methods and functions are available to compute basic compound descriptors, such as molecular formula (MF), molecular weight (MW) and atom frequencies. In all these functions, it is important to set addH=TRUE in order to include/add hydrogens that are often not specified in an SD file.

## Create atom frequency matrix
propma <- atomcountMA(sdfset, addH=FALSE)
boxplot(propma, main="Atom Frequency")
boxplot(rowSums(propma), main="All Atom Frequency")
  

## Data frame provided by library containing atom names, atom symbols,
## standard atomic weights and group/period numbers
data(atomprop)
atomprop[1:4,]

  Number      Name Symbol Atomic_weight Group Period
1      1  hydrogen      H      1.007940     1      1
2      2    helium     He      4.002602    18      1
3      3   lithium     Li      6.941000     1      2
4      4 beryllium     Be      9.012182     2      2

## Compute MW and formula
MW(sdfset[1:4], addH=TRUE)

    CMP1     CMP2     CMP3     CMP4
456.4916 357.4069 370.4255 461.5346

MF(sdfset[1:4], addH=TRUE)

         CMP1          CMP2          CMP3          CMP4
 "C23H28N4O6"  "C18H23N5O3" "C18H18N4O3S" "C21H27N5O5S"

## Enumerate functional groups
groups(sdfset[1:4], groups="fctgroup", type="countMA")

     RNH2 R2NH R3N ROPO3 ROH RCHO RCOR RCOOH RCOOR ROR RCCH RCN
CMP1    0    2   1     0   0    0    0     0     0   2    0   0
CMP2    0    2   2     0   1    0    0     0     0   0    0   0
CMP3    0    1   1     0   1    0    1     0     0   0    0   0
CMP4    0    1   3     0   0    0    0     0     0   2    0   0


## Combine MW, MF, charges, atom counts, functional group counts and ring counts in one data frame
propma <- data.frame(MF=MF(sdfset, addH=FALSE), MW=MW(sdfset, addH=FALSE),
                     Ncharges=sapply(bonds(sdfset, type="charge"), length),
                     atomcountMA(sdfset, addH=FALSE),
                     groups(sdfset, type="countMA"),
                     rings(sdfset, upper=6, type="count", arom=TRUE))

propma[1:4,]

              MF       MW Ncharges  C  H N O S F Cl RNH2 R2NH R3N ROPO3 ROH RCHO RCOR RCOOH RCOOR ROR RCCH RCN RINGS AROMATIC
CMP1  C23H28N4O6 456.4916        0 23 28 4 6 0 0  0    0    2   1     0   0    0    0     0     0   2    0   0     4        2
CMP2  C18H23N5O3 357.4069        0 18 23 5 3 0 0  0    0    2   2     0   1    0    0     0     0   0    0   0     3        3
CMP3 C18H18N4O3S 370.4255        0 18 18 4 3 1 0  0    0    1   1     0   1    0    1     0     0   0    0   0     4        2
CMP4 C21H27N5O5S 461.5346        0 21 27 5 5 1 0  0    0    1   3     0   0    0    0     0     0   2    0   0     3        3

## Assign properties to data block
datablock(sdfset) <- propma # Works with all SDF components
test <- apply(propma[1:4,], 1, function(x) data.frame(col=colnames(propma), value=x))
sdf.visualize(sdfset[1:4], extra = test)

## Extracting data block elements by tag name
datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI")
datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES")

## Convert entire data block to matrix
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix 
numchar <- splitNumChar(blockmatrix=blockmatrix)
# Splits matrix to numeric matrix and character matrix
numchar[[1]][1:4,]; numchar[[2]][1:4,] # Splits matrix to numeric matrix and character matrix


Molecular Properties from OpenBabel

Computes logP, HBA, HBD and other useful properties.

## Requires OpenBabel
library(ChemmineOB)
propOB(sdfset[1:4])


     HBA1 HBA2 HBD   logP       MR       MW nF   TPSA
CMP1   37   10   2 3.0288 119.9234 456.4916  0 123.00
CMP2   28    7   3 0.9234 101.3202 357.4069  0 104.94
CMP3   24    8   2 2.9535 101.4132 370.4255  0 125.35
CMP4   35   11   1 1.3127 123.4607 461.5346  0 134.68


Bond Matrices

Bond matrices provide an efficient data structure for many basic computations on small molecules. The function conMA creates this data structure from SDF and SDFset objects. The resulting matrix contains the atom labels in the row/column titles and the bond types in the data part. Their values are defined as follows: 0 is no connection, 1 is a single bond, 2 is a double bond and 3 is a triple bond.

## Create bond matrix for one or many molecules in an SDFset object
conMA(sdfset[1:2], exclude=c("H"))

## Return bond matrix for first molecule and plot its structure with atom numbering
conMA(sdfset[[82]], exclude="H")

     O_1 O_2 N_3 N_4 C_5 C_6 C_7 C_8 C_9 C_10
O_1    0   0   0   0   0   0   0   0   0    1
O_2    0   0   0   0   0   0   0   0   0    2
N_3    0   0   0   0   1   1   1   0   0    0
N_4    0   0   0   0   0   2   0   0   1    0
C_5    0   0   1   0   0   0   0   0   0    1
C_6    0   0   1   2   0   0   0   1   0    0
C_7    0   0   1   0   0   0   0   0   2    0
C_8    0   0   0   0   0   1   0   0   0    0
C_9    0   0   0   1   0   0   2   0   0    0
C_10   1   2   0   0   1   0   0   0   0    0

plot(sdfset[82], atomnum = TRUE, noHbonds=FALSE , no_print_atoms = "")

## Return number of non-H bonds for each atom of a single molecule
rowSums(conMA(sdfset[[82]], exclude=c("H")))


 O_1  O_2  N_3  N_4  C_5  C_6  C_7  C_8  C_9 C_10
   1    2    3    3    2    4    3    1    3    4

##
Return number of non-H bonds for all molecules in sdfset
sapply(cid(sdfset), function(x) rowSums(conMA(sdfset[[x]], exclude=c("H"))))

Table of Contents


Charges and Hydrogens

The function bonds returns information about the number of bonds, charges and missing hydrogens in SDF and SDFset objects. It is used by many other functions (e.g. MW, MF, atomcount, atomcountMA and plot) to correct for missing hydrogens that are often not explicitly specified in SD files.

## Return data frame(s) with bonds and charges
bonds(sdfset[[1]], type="bonds")[1:20,]

   atom Nbondcount Nbondrule charge
1     C          4         4      0
2     C          4         4      0
3     C          4         4      0
4     C          3         4      0
5     C          4         4      0
6     C          3         4      0
7     N          3         3      0
8     C          4         4      0
9     C          4         4      0
10    C          2         4      0
11    C          3         4      0
12    N          2         3      0
13    C          2         4      0
14    N          2         3      0
15    C          4         4      0
16    C          4         4      0
17    C          4         4      0
18    C          3         4      0
19    C          3         4      0
20    O          2         2      0
...


## Returns charged atoms in each molecule
bonds(sdfset[1:2], type="charge")

$CMP1
NULL

$CMP2
NULL

## Returns the number of missing hydrogens in each molecule
bonds(sdfset[1:2], type="addNH")

CMP1 CMP2
   0    0


Table of Contents

Ring Perception and Aromaticity Assignment

The function rings identifies all possible rings in one or many molecules using the exhaustive ring perception algorithm from Hanser et al. (1996). In addition, the function can return all smallest possible rings as well as aromaticity information.

The following example returns all possible rings in a list. The argument upper allows to specify an upper length limit for rings. Choosing smaller length limits will reduce the search space resulting in shortened compute times. Note: each ring is represented by a character vector of atom symbols that are numbered by their position in the atom block of the corresponding SDF/SDFset object.

## Return ring information for one compound
(ringatoms <- rings(sdfset[1], upper=Inf, type="all", arom=FALSE, inner=FALSE))
$ring1
[1] "N_10" "O_6"  "C_32" "C_31" "C_30"
$ring2
[1] "C_12" "C_14" "C_15" "C_13" "C_11"
$ring3
[1] "C_23" "O_2"  "C_27" "C_28" "O_3"  "C_25"
$ring4
[1] "C_23" "C_21" "C_18" "C_22" "C_26" "C_25"
...
## For visual inspection, the corresponding compound structure can be plotted with the ring bonds in color

atomindex <- as.numeric(gsub(".*_", "", unique(unlist(ringatoms))))
plot(sdfset[1], print=FALSE, colbonds=atomindex)



## Alternatively, one can include the atom numbers in the plot
plot(sdfset[1], print=FALSE, atomnum=TRUE, no_print_atoms="H")


## Aromaticity information of rings can be returned in a logical vector by setting arom=TRUE
rings(sdfset[1], upper=Inf, type="all", arom=TRUE, inner=FALSE)

$RINGS
$RINGS$ring1
[1] "N_10" "O_6"  "C_32" "C_31" "C_30"

$RINGS$ring2
[1] "C_12" "C_14" "C_15" "C_13" "C_11"

$RINGS$ring3
[1] "C_23" "O_2"  "C_27" "C_28" "O_3"  "C_25"

$RINGS$ring4
[1] "C_23" "C_21" "C_18" "C_22" "C_26" "C_25"

$RINGS$ring5
 [1] "O_3"  "C_28" "C_27" "O_2"  "C_23" "C_21" "C_18" "C_22" "C_26" "C_25"


$AROMATIC
ring1 ring2 ring3 ring4 ring5
 TRUE FALSE FALSE  TRUE FALSE

## Return rings with no more than 6 atoms that are also aromatic
rings(sdfset[1], upper=6, type="arom", arom=TRUE, inner=FALSE)

$AROMATIC_RINGS
$AROMATIC_RINGS$ring1
[1] "N_10" "O_6"  "C_32" "C_31" "C_30"

$AROMATIC_RINGS$ring4
[1] "C_23" "C_21" "C_18" "C_22" "C_26" "C_25"

## Count shortest possible rings and their aromaticity assignments by setting type=count and inner=TRUE. The inner (smallest possible) rings are identified by first computing all possible rings and then selecting only the inner rings. For more details, consult the help documentation with ?rings.
rings(sdfset[1:4], upper=Inf, type="count", arom=TRUE, inner=TRUE)

     RINGS AROMATIC
CMP1     4        2
CMP2     3        3
CMP3     4        2
CMP4     3        3



Rendering Chemical Structure Images

R Graphics Device

A new plotting function for compound structures has been added to the package recently. This function uses the native R graphics device for generating compound depictions. At this point this function is still in an experimental developmental stage but should become stable soon. 

## Plot CMP Structures with R graphics device
plot(sdfset[1:4]) # Plots one to many molecules
plotStruc(sdfset[[1]]) # Only for single molecule



Figure 2: Structure Rendering with R's Graphics Device

## Customized plots
plot(sdfset[1:4], griddim=c(2,2), print_cid=letters[1:4], print=FALSE, noHbonds=FALSE)
plot(sdfset[1], atomnum = TRUE, noHbonds=F , no_print_atoms = "", atomcex=0.8, sub=paste("MW:", MW(sdfsample["CMP1"])))


Figure 3: Detailed Structure View


## Substructure highlighting by atom numbers
plot(sdfset[1], colbonds=c(22,26,25,3,28,27,2,23,21,18,8,19,20,24))

Figure 4: Substructure Highlighting in Compound Depictions

Online with ChemMine Tools

Alternatively, one can visualize compound structures with a standard web browser using the online ChemMine Tools service. The service allows to display other information next to the structures using the extra argument of the sdf.visualize function. The following examples demonstrate, how one can plot and annotate structures by passing on extra data as vector of character strings, matrices or lists. 

## Plot structures using web service ChemMine Tools
sdf.visualize(sdfset[1:4]

## Add extra annotation as vector
sdf.visualize(sdfset[1:4], extra=month.name[1:4])

## Add extra annotation as matrix
extra <- apply(propma[1:4,], 1, function(x) data.frame(Property=colnames(propma), Value=x))
sdf.visualize(sdfset[1:4], extra=extra)

## Add extra annotation as list
sdf.visualize(sdfset[1:4], extra=bondblock(sdfset[1:4]))





Figure 5: Visualizing Structures and Tabular Data with sdf.visualize

Similarity Comparisons and Searching


Maximum Common Substructure (MCS) Searching

The ChemmineR add-on package fmcsR  provides support for identifying maximum common substructures (MCSs) and flexible MCSs among compounds. The algorithm can be used for pairwise compound comparisons, structure similarity searching and clustering. The manual describing this functionality is available here and the associated publication is available in Wang et al., 2013. The following gives a short preview of some functionalities provided by the fmcsR package.

## fmcsR package overview
library(fmcsR)
data(fmcstest) # Loads test sdfset object
test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) # Searches for MCS with mismatches
plotMCS(test) # Plots both query compounds with FMCS in color


Figure 6: MCS with 2 Atom Mismatches Highlighted in Source Structures

AP/APset Classes for Storing Atom Pair Descriptors

The function sdf2ap computes atom pair descriptors for one or many compounds. It returns a searchable atom pair database stored in a container of class APset, which can be used for structural similarity searching and clustering. An APset object consists of one or many AP entries each storing the atom pairs of a single compound. Note: the deprecated cmp.parse function is still available which also generates atom pair descriptor databases, but directly from an SD file. Since the latter function is less flexible it may be discontinued in the future.

## Samples of SDFset/SDF classes
data(sdfsample)
sdfset <- sdfsample[1:50]
sdf <- sdfsample[[1]]
    
## Compute atom pair library
ap <- sdf2ap(sdf)
# apset <- sdf2ap(sdfset)
data(apset)
view(apset[1:4])
    
## Return main components of APset object
cid(apset[1:4]) # Compound IDs
ap(apset[1:4]) # Atom pair descriptors
    
## Return atom pairs in human readable format
db.explain(apset[1])
    
## Coerce APset to other objects
apset2descdb(apset) # Returns old list-style AP database
tmp <- as(apset, "list") # Returns list
as(tmp, "APset") # Converst list back to APset

## Generate atom pair database directly from SD file

## Note: this approach is less generic and may be discontinued in the future
apold <- cmp.parse("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")


## Construct
APset class from apold
aplist <- apold[[1]]; names(aplist) <- apold[[2]]
apset <- as(aplist, "APset")
view(apset[1:4])

Large SDF and Atom Pair Databases

When working with large data sets it is often desirable to save the SDFset and APset containers as binary R objects to files for later use. This way they can be loaded very quickly into a new R session without recreating them every time from scratch.    

## Save an load of SDFset
save(sdfset, file = "sdfset.rda", compress = TRUE)
load("sdfset.rda")

## Save and load of APset
save(apset, file = "apset.rda", compress = TRUE)
load("apset.rda")



Pairwise Compound Comparisons with Atom Pairs

The cmp.similarity function computes the atom pair similarity between two compounds using the Tanimoto coefficient as similarity measure. This coefficient is defined as c/(a+b+c), which is the proportion of the atom pairs shared among two compounds divided by their union. The variable c is the number of atom pairs common in both compounds, while a and b are the numbers of their unique atom pairs.

## Compute similarities among two compounds
cmp.similarity(apset[1], apset[2])
cmp.similarity(apset[1], apset[1])

## Run cmp.similarity in loop as custom similarity search function
sapply(cid(apset), function(x) cmp.similarity(apset[1], apset[x]))



Similarity Searching with Atom Pairs

The cmp.search function searches an atom pair database for compounds that are similar to a query compound. The following example returns a data frame where the rows are sorted by the Tanimoto similarity score (best to worst). The first column contains the indices of the matching compounds in the database. The argument cutoff can be a similarity cutoff, meaning only compounds with a similarity value larger than this cutoff will be returned; or it can be an integer value restricting how many compounds will be returned. When supplying a cutoff of 0, the function will return the similarity values for every compound in the database.

cmp.search(apset, apset["650065"], type=1, cutoff = 0.3) # Returns index of matches if 'type=1'.

cmp.search(apset, apset["650065"], type=2, cutoff = 0.3)
# Returns matches as named vector with compound IDs if 'type=2'.

cmp.search(apset, apset["650065"], type=3, cutoff = 0.3)
# Returns matches as data frame if 'type=3'.

  index    cid    scores
1    61 650066 1.0000000
2    60 650065 1.0000000
3    67 650072 0.3389831
4    11 650011 0.3190608
5    15 650015 0.3184524
6    86 650092 0.3154270
7    64 650069 0.3010279
Table of Contents


FP/FPset Classes for Storing Fingerprints

The  FPset class stores fingerprints of small molecules in a matrix-like representation where every molecule is encoded as a fingerprint of the same type and length. The FPset container acts as a searchable database that contains the fingerprints of many molecules. The FP container holds only one fingerprint. Several constructor and coerce methods are provided to populate FP/FPset containers with fingerprints, while supporting any type and length of fingerprints. For instance, the function desc2fp generates fingerprints from an atom pair database stored in an APset, and as(matrix, "FPset") and as(character, "FPset") construct an FPset database from objects where the fingerprints are represented as matrix or character objects, respectively.
## Show slots of FPset class
showClass("FPset")
    
## Instance of FPset class
data(apset)
(fpset <- desc2fp(apset))

An instance of a 1024 bit "FPset" with 100 molecules

view(fpset)

$`650001`
An instance of "FP"
<<fingerprint>>
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... length: 1024
...

    
## Class usage
fpset[1:4] # behaves like a list

An instance of a 1024 bit "FPset" with 4 molecules

fpset[[1]] # returns FP object

$`650001`
An instance of "FP"
<<fingerprint>>
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... length: 1024


length(fpset) # number of compounds

[1] 100

cid(fpset) # returns compound ids

[1] "650001" "650002" "650003" "650004" "650005" ...


fpset[10] <- 0 # replacement of 10th fingerprint to all zeros 
cid(fpset) <- 1:length(fpset) # replaces compound ids
c(fpset[1:4], fpset[11:14]) # concatenation of several FPset objects

An instance of a 1024 bit "FPset" with 8 molecules
    
## Construct FPset class form matrix
fpma <- as.matrix(fpset) # coerces FPset to matrix
as(fpma, "FPset")

An instance of a 1024 bit "FPset" with 100 molecules

## Construct FPset class form character vector
fpchar <- as.character(fpset) # coerces FPset to character strings
as(fpchar, "FPset") # construction of FPset class from character vector

An instance of a 1024 bit "FPset" with 100 molecules
    
## Compound similarity searching with FPset
fpSim(fpset[1], fpset, method="Tanimoto", cutoff=0.4, top=4)

   650001    650102    650072    650015
1.0000000 0.4707207 0.4277344 0.4264706



Atom Pair Fingerprints 

Atom pairs can be converted into binary atom pair fingerprints of fixed length. Computations on this compact data structure are more time and memory efficient than on their relatively complex atom pair counterparts. The function desc2fp generates fingerprints from descriptor vectors of variable length such as atom pairs stored in APset  or list containers. The obtained fingerprints can be used for structure similarity comparisons, searching and clustering.

## Sample data set
data(sdfsample)
sdfset <- sdfsample[1:10]
apset <- sdf2ap(sdfset)

## Compute atom pair fingerprint matrix using internal atom pair selection containing
## the 4096 most common atom pairs in DrugBank.
For details see ?apfp. The following example
## uses from this
set the 1024 most frequent atom pairs.
fpset <- desc2fp(apset, descnames=1024, type="FPset")
fpset

An instance of a 1024 bit "FPset" with 10 molecules
    
## Alternatively, one can provide any custom atom pair selection. Here,
## the 1024 most common ones in the current apset object.
fpset1024 <- names(rev(sort(table(unlist(as(apset, "list")))))[1:1024])
fpset <- desc2fp(apset, descnames=fpset1024, type="FPset")
    
## A more compact way of storing fingerprints is as character values
fpchar <- desc2fp(x=apset, descnames=1024, type="character")
fpchar <- as.character(fpset)
   
## Converting a fingerprint database to a matrix and vice versa
fpma <- as.matrix(fpset)
fpset <- as(fpma, "FPset")
    
## Similarity searching and returning Tanimoto similarity coefficients
fpSim(x=fpset[1], y=fpset, method="Tanimoto")

## Under 'method' one can choose from several predefined similarity measures including 'Tanimoto' (default), 'Euclidean', 'Tversky' or 'Dice'. Alternatively, one can pass on custom similarity functions.
fpSim(x=fpset[1], y=fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1)

## Example for using a custom similarity function
myfct <- function(a, b, c, d) c/(a+b+c+d)
fpSim(x=fpset[1], y=fpset, method=myfct)            
    
## Clustering example
simMAap <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE))
hc <- hclust(as.dist(1-simMAap), method="single")
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)


Pairwise Compound Comparisons with PubChem Fingerprints

The fpSim function computes the similarity coefficients (e.g. Tanimoto) for pairwise comparisons of binary fingerprints. For this data type, c is the number of "on-bits" common in both compounds, and a and b are the numbers of their unique "on-bits". Currently, the PubChem fingerprints need to be provided (here PubChem's SD files) and cannot be computed from scratch in ChemmineR. The PubChem fingerprint specifications are available here. They can be loaded into R with: data(pubchemFPencoding).

## Convert base 64 encoded fingerprints to character vector, matrix or FPset object
fpset <- fp2bit(sdfset, type=1)
fpset <- fp2bit(sdfset, type=2)
fpset <- fp2bit(sdfset, type=3)
fpset

An instance of a 881 bit "FPset" with 10 molecules    

## Pairwise compound structure comparisons
fpSim(fpset[1], fpset[2])

[1] 0.5344828



Similarity Searching with PubChem Fingerprints

Similarly, the fpSim function provides search functionality for PubChem fingerprints.

fpSim(fpset["650065"], fpset, method="Tanimoto", cutoff=0.6, top=6)

   650065    650066    650035    650019    650012    650046
1.0000000 0.9944444 0.7422680 0.7420814 0.7216981 0.7129187



Visualize Similarity Search Results

The cmp.search function allows to visualize the chemical structures for the search results. Similar but more flexible chemical structure rendering functions are plot and sdf.visualize described above. By setting the visualize argument in cmp.search to TRUE, the matching compounds and their scores can be visualized with a standard web browser. Depending on the visualize.browse argument, an URL will be printed or a webpage will be opened showing the structures of the matching compounds along with their scores.

## R Graphics device
plot(sdfset[names(cmp.search(apset, apset[6], type=2, cutoff=4))])

## Online on Chemmine Tools

similarities <- cmp.search(apset, apset[1], type=3, cutoff = 10)
sdf.visualize(sdfset[similarities[,1]], extra=similarities[,3])

## Same result in one step
## Note works only with old AP object; this will be fixed asap!!
apold <- cmp.parse("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")
similarities <- cmp.search(apold, apold[[1]][[2]], cutoff = 10, quiet = TRUE, visualize = TRUE)



Figure 7: Structure Similarity Search Result

Clustering

Clustering Identical or Very Similar Compounds

Often it is of interest to identify very similar or identical compounds in a compound set. The cmp.duplicated function can be used to quickly identify very similar compounds in compound sets, which will be frequently, but not necessarily, identical compounds.

## Identify compounds with identical AP sets
cmp.duplicated(apset, type=1) # Returns AP duplicate information in logical vector
cmp.duplicated(apset, type=2) # Returns AP duplicate information as data frame

## Remove AP duplicates from SDFset and APset objects
apdups <- cmp.duplicated(apset, type=1)
sdfset[which(!apdups)]; apset[which(!apdups)]


Alternatively, one can identify duplicates via other descriptor types if they are provided in the data block of an imported SD file. For instance, one can use here fingerprints, InChI, SMILES or other molecular representations.

table(datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI")) # Enumerate identical InChI strings
table(datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES")) # Enumerate identical SMILES strings
table(datablocktag(sdfset, tag="PUBCHEM_MOLECULAR_FORMULA")) # Enumerate identical molecular formulas

Binning Clustering

Compound libraries can be clustered into discrete similarity groups with the binning clustering function cmp.cluster. The function requires as input a descriptor database as well as a similarity threshold. The binning clustering result is returned in form of a data frame. Single linkage is used for cluster joining. The function calculates the required compound-to-compound distance information on the fly, while a memory-intensive distance matrix is only created upon user request via the save.distances argument (see below). Because an optimum similarity threshold is often not known, the cmp.cluster function can calculate cluster results for multiple cutoffs in one step with almost the same speed as for a single cutoff. This can be achieved by providing several cutoffs under the cutoff argument. The clustering results for the different cutoffs will be stored in one data frame.

## Single-linkage binning clustering with multiple cutoffs of Tanimoto coefficient
clusters <- cmp.cluster(apset, cutoff = c(0.7, 0.8, 0.9)) # Cutoffs can be changed under the 'cutoff' argument
clusters[1:12,]

      ids CLSZ_0.7 CLID_0.7 CLSZ_0.8 CLID_0.8 CLSZ_0.9 CLID_0.9
48 650049        2       48        2       48        2       48
49 650050        2       48        2       48        2       48
54 650059        2       54        2       54        2       54
55 650060        2       54        2       54        2       54
56 650061        2       56        2       56        2       56
57 650062        2       56        2       56        2       56
58 650063        2       58        2       58        2       58
59 650064        2       58        2       58        2       58
60 650065        2       60        2       60        2       60
61 650066        2       60        2       60        2       60
1  650001        1        1        1        1        1        1
2  650002        1        2        1        2        1        2


## Clustering of 'FPset' objects with multiple cutoffs. This method allows to call various
## similarity methods provided by the 'fpSim' function. For details consult '?fpSim'.

fpset <- desc2fp(apset)
clusters2 <- cmp.cluster(fpset, cutoff=c(
0.5, 0.7, 0.9), method="Tanimoto") 
clusters2[1:12,]

      ids CLSZ_0.5 CLID_0.5 CLSZ_0.7 CLID_0.7 CLSZ_0.9 CLID_0.9
69 650074       14       11        2       69        1       69
79 650085       14       11        2       69        1       79
11 650011       14       11        1       11        1       11
15 650015       14       11        1       15        1       15
45 650046       14       11        1       45        1       45
47 650048       14       11        1       47        1       47
51 650054       14       11        1       51        1       51
53 650058       14       11        1       53        1       53
64 650069       14       11        1       64        1       64
65 650070       14       11        1       65        1       65
67 650072       14       11        1       67        1       67
86 650092       14       11        1       86        1       86


## Sames as above, but using Tversky similarity measure
clusters3 <- cmp.cluster(fpset, cutoff=c(
0.5, 0.7, 0.9), method="Tversky", alpha=0.3, beta=0.7)

## Return cluster size distributions for each cutoff
cluster.sizestat(clusters, cluster.result=1)
cluster.sizestat(clusters, cluster.result=2)
cluster.sizestat(clusters, cluster.result=3)

## Force calculation of distance matrix
clusters <- cmp.cluster(db=apset, cutoff = c(0.65, 0.5, 0.3), save.distances="distmat.rda") # Saves distance matrix to file "
distmat.rda" in current working directory.
load("distmat.rda") # Loads distance matrix into workspace


One may force the cmp.cluster function to calculate and store the distance matrix by supplying a file name to the save.distances argument. The generated distance matrix can be loaded and passed on to many other clustering methods available in R, such as the hierarchical clustering function hclust (see below). If a distance matrix is available, it may also be supplied to cmp.cluster via the use.distances argument. This is useful when one has a pre-computed distance matrix either from a previous call to cmp.cluster or from other distance calculation subroutines.

Jarvis-Patrick Clustering

The Jarvis-Patrick clustering algorithm is widely used in cheminformatics (Jarvis & Patrick, 1973). It requires a nearest neighbor table, which consists of j nearest neighbors for each item (e.g. compound) in the data set. The nearest neighbor table is then used to join items into clusters that share at least k nearest neighbors. The values for j and k are user-defined parameters. The jarvisPatrick function implemented in ChemmineR can generate the nearest neighbor table for APset and FPset objects and then perform Jarvis-Patrick clustering on that table. It also accepts a precomputed nearest neighbor table in form of a matrix object. The output is a cluster vector with the item labels in the name slot and the cluster IDs in the data slot. Alternatively, the function can return the nearest neighbor matrix. As third parameter the user can set a minimum similarity cutoff value for generating the nearest neighbor table. The latter is an optional setting that is not part of the original Jarvis-Patrick algorithm. It allows to generate more tight clusters and to minimize certain limitations of this method, such as false joins of completely unrelated items when operating on small data sets.

## Load/create sample APset and FPset
data(apset)
fpset <- desc2fp(apset)


## Standard Jarvis-Patrick clustering on APset/FPset objects
jarvisPatrick(apset, j=6, k=5)

650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011
     1      2      3      4      5      6      7      8      9     10     11
...


jarvisPatrick(fpset, j=6, k=5)


650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011
     1      2      3      4      5      6      7      8      9     10     11
...


## Modified Jarvis-Patrick clustering with minimum similarity 'cutoff' value (here Tanimoto coefficient)
jarvisPatrick(fpset, j=6, k=2, cutoff=0.6, method="Tanimoto")

650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011
     1      2      3      4      5      6      7      8      9     10     11
...



## Output nearest neighbor table (matrix)
nnm <- jarvisPatrick(fpset, j=6, k=5, type="matrix")
nnm[1:4,]

       1        2        3        4        5        6      
650001 "650001" "650102" "650072" "650015" "650094" "650069"
650002 "650002" "650072" "650033" "650048" "650074" "650090"
650003 "650003" "650007" "650090" "650104" "650103" "650102"
650004 "650004" "650033" "650082" "650090" "650001" "650104"


## Perform clustering on precomputed nearest neighbor table
jarvisPatrick(nnm, j=6, k=5)


650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011
     1      2      3      4      5      6      7      8      9     10     11
...



Multi-Dimensional Scaling (MDS)

To visualize and compare clustering results, the cluster.visualize function can be used. The function performs Multi-Dimensional Scaling (MDS) and visualizes the results in form of a scatter plot. It requires as input an APset, a clustering result from cmp.cluster, and a cutoff for the minimum cluster size to consider in the plot. To help determining a proper cutoff size, the cluster.sizestat function is provided to generate cluster size statistics.

## MDS clustering and scatter plot
cluster.visualize(apset, clusters, size.cutoff=2, quiet = TRUE) # Color codes clusters with at least two members.

cluster.visualize(apset, clusters, size.cutoff=1, quiet = TRUE) # Plots all items

# Create a 3D scatter plot of MDS result
coord <- cluster.visualize(apset, clusters, size.cutoff=1, dimensions=3, quiet=TRUE)
library(scatterplot3d)
scatterplot3d(coord)

# Interactive 3D scatter plot with Open GL
# The rgl library needs to be installed for this.
library(rgl)
rgl.open(); offset <- 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset))
rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1)
spheres3d(coord[,1], coord[,2], coord[,3], radius=0.03, color=coord[,4], alpha=1, shininess=20); aspect3d(1, 1, 1)
axes3d(col='black'); title3d("", "", "", "", "", col='black'); bg3d("white") #
To save a snapshot of the graph, one can use the command rgl.snapshot("test.png")

The data returned by the cluster.visualize can also be inspected with the fully interactive rggobi data visulalization system. The GGobi software and its dependencies can be obtained from the GGobi project site. The following commands demonstrate the import of the generated MDS data set into rggobi.

library(rggobi)
ggobi(coord)




Figure 8:
Clustering Results Visualized with the R Packages scatterplot3d and rggob

Clustering with Other Algorithms

ChemmineR allows the user to take advantage of the wide spectrum of clustering utilities available in R. An example on how to perform hierarchical clustering with the hclust function is given below.

Hierarchical Clustering

Using Atom Pairs
## Create atom pair distance matrix
dummy <- cmp.cluster(db=apset, cutoff=0, save.distances="distmat.rda")
load("distmat.rda")

# Hierarchical Clustering with hclust
hc <- hclust(as.dist(distmat), method="single")
hc[["labels"]] <- cid(apset) # Assign correct item labels
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=T) # Plots hierarchical clustering tree.
Figure 9: Hierarchical Clustering Tree for Compound Similarities


## Plot dendrogram with heatmap (here similarity matrix)
library(gplots)
heatmap.2(1-distmat, Rowv=as.dendrogram(hc), Colv=as.dendrogram(hc), col=colorpanel(40, "darkblue", "yellow", "white"), density.info="none", trace="none")

Figure 10: Hierarchical Clustering Result with Heatmap



Using PubChem Fingerprints
## Convert base 64 encoded fingerprints to character vector, matrix or FPset object
fpset <- fp2bit(sdfset)
    
## Compute fingerprint based Tanimoto similarity matrix
simMA <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE))
    
## Hierarchical clustering with simMA as input
hc <- hclust(as.dist(1-simMA), method="single")
    
## Plot hierarchical clustering tree
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)


Searching PubChem

These features require internet connectivity as the ChemMine Tools web service is used as an intermediate, to translate queries from plain HTTP POST to a PUG SOAP query.

Get Compounds from PubChem by Id

The function getIds accepts one or more numeric PubChem compound ids and downloads the corresponding compounds from PubChem Power User Gateway (PUG) returning results in an SDFset container.

## Fetch 2 compounds from PubChem
compounds <- getIds(c(111,123))
compounds

Similarity Search of PubChem with SMILES Query

The function searchString accepts one SMILES (Simplified Molecular Input Line Entry Specification) string and performs a >0.95 similarity PubChem fingerprint search, returning the hits in an SDFset container. 

## Search a SMILES string on PubChem
compounds <- searchString("CC(=O)OC1=CC=CC=C1C(=O)O")
compounds

Similarity Search of PubChem with SDF Query

The function searchSim performs a PubChem similarity search just like searchString, but accepts a query in an SDFset container. If the query contains more than one compound, only the first is searched.

## Search an SDFset container on PubChem
data(sdfsample); sdfset <- sdfsample[1]
compounds <- searchSim(sdfset)
compounds


Biological Screen Analysis


This example uses an experimental software package, "bioassayR" that has not yet been publically released. The example currently only works from the UC Riverside biocluster system.

This library is designed for analyzing biological screen data, and comes loaded with all screens that target known proteins from PubChem Bioassay.

Users can access the data directly using SQL commands against the three built in tables (bioassay, proteins, sequences) or with one of the wrapper functions shown.

library(bioassayR)

## Show the first 10 rows in the "bioassay" database table using SQL
query(bioassay, "select * from bioassay limit 10")
# Activity key: 1=inactive, 2=active, 3=inconclusive, 4=unspecified


## Show the first 10 rows, joined with their target amino acid sequences using SQL
query(bioassay, "SELECT * FROM bioassay INNER JOIN (SELECT * FROM proteins INNER JOIN sequences USING (protein_target)) USING (aid) limit 10")

## Get a data frame of bioactivity data from a given assay

assay(bioassay, 348)

## Output the protein target for a given assay
protein(bioassay, 348)

## Get SDF file of all compounds from a given assay (#1000 in this example)
library(ChemmineR)
mySDFset <- getIds(as.numeric(assay(bioassay, 1000)$cid))


## Summarize data in database (VERY slow)
bioassay

Session Information


> sessionInfo()

R version 2.15.0 alpha (2012-03-05 r58604)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] fmcsR_0.99.1     ChemmineR_2.11.3

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 tools_2.15.0

Exercises


## Download and import the following SD file
## Sample data set contains first 100 compounds from PubChem SD file "Compound_00650001_00675000.sdf.gz"
## from: ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

library(ChemmineR)
sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sdfsample.sdf")

## Assign compound IDs from SD file to ID slot in SDFset
cid(sdfset) <- sdfid(sdfset)
view(sdfset[1:4])


## Write the data back to an SD file where the order of the SDF entries is reversed and the data block omitted
write.SDF(sdfset[100:1], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL)


## Write the first 10 compounds to separate SD files using a loop (e.g. for or sapply)
sapply(cid(sdfset[1:10]), function(x)
write.SDF(sdfset[x], file=paste(x, "sdf", sep="")))
for(i in cid(sdfset[1:10])) write.SDF(sdfset[i], file=paste(i, "sdf", sep=""))

##
Extract the numeric and character values from the data block of the SD file
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)
numchar <- splitNumChar(
blockmatrix=blockmatrix)

## Save the two data frames to external files for import into a spreadsheet program.
write.table(numchar[[1]], "numeric_val.xls", sep="\t", col.names = NA)
write.table(numchar[[2]], "character_val.xls", sep="\t", col.names = NA)

## Generate a box plot for columns 4 to 6 of the numeric data block and save it to a file
boxplot(numchar[[1]][,4:6])

## Generate a data frame containing of each compound its molecular formula, MW and atom frequencies
propma <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset))
propma[1:4,]

## Inject the previous data frame into the data block of the SDFset and export it to a SD file
datablock(sdfset) <- propma
write.SDF(sdfset, file="sub.sdf")

## Append to the object propma microtiter plate location information and then repeat the inject/export step
## The plate locations can be created with these commands:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/384_96_48_24conversion.txt")
propma2 <- data.frame(
my_plate_mappings[1:100,], propma)
datablock(sdfset) <- propma2
write.SDF(sdfset, file="sub.sdf")

## Visualize the first four compounds along with their tabular data with ChemMine Tools

extra <- apply(propma2[1:4,], 1, function(x) data.frame(Property=colnames(propma2), Value=x))
sdf.visualize(sdfset[1:4], extra = extra)

## Compute atom pair descriptors for sdfset
apset <- sdf2ap(sdfset)

## Cluster the compounds with cmp.cluster using cutoff values of
0.65, 0.5, 0.3
clusters <- cmp.cluster(db=apset, cutoff = c(0.65, 0.5, 0.3), save.distances="distmat.rda")

## Perform hierarchical clustering using the distance matrix from the previous step
load("distmat.rda")
hc <- hclust(as.dist(distmat), method="single")
hc[["labels"]] <- cid(apset) # Assign correct item labels
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=T)

## First, compare the results in clusters and the hierarchical tree visually; then
## use the cutree function to identify discrete clusters in the tree and
## join them with the clusters from the binning clustering.
mycl <- cutree(hc, h=max(hc$height)/1.5)
cbind(clusters, hclust=mycl[clusters$ids])[1:10,]


Funding

This software was developed with funding from the National Science Foundation: ABI-0957099, 2010-0520325 and IGERT-0504249.



References

  1. Y Wang, TW Backman, K Horan, T Girke (2013) fmcsR: Mismatch Tolerant Maximum Common Substructure Searching in R. Bioinformatics: online. HubMed 
  2. T W Backman, Y Cao, and T Girke (2011) ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic Acids Res, 39 (Web Server issue): 486–491. HubMed
  3. R.E. Carhart, D.H. Smith, and R. Venkataraghavan. Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2):64 73, 1985.
  4. X. Chen and C.H. Reynolds. Performance of Similarity Measures in 2D Fragment-Based Similarity Searching: Comparison of Structural Descriptors and Similarity Coefficients. Journal of Chemical Information and Computer Sciences, 42(6):1407 1414, 2002.
  5. T. Girke, L.C. Cheng, and N. Raikhel. ChemMine. A Compound Mining Database for Chemical Genomics.  Plant Physiol, 138(2):573-577, 2005.
  6.  Cao Y, Charisi A, Cheng LC, Jiang T, Girke T (2008) ChemmineR: A Compound Mining Framework for R.  Bioinformatics, 24(15): 1733-1734, 2008. HubMed.
  7. Th. Hanser, Ph. Jauffret, and G. Kaufmann. A new algorithm for exhaustive ring perception in a molecular graph. Journal of Chemical Information and Computer Sciences, 36(6): 1146–1152. URL




This site was accessed tumblr analytics times (Detailed Access Stats)