What is similar?. Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.

Slides:



Advertisements
Similar presentations
JKlustor clustering chemical libraries presented by … maintained by Miklós Vargyas Last update: 25 March 2010.
Advertisements

Different types of data e.g. Continuous data:height Categorical data ordered (nominal):growth rate very slow, slow, medium, fast, very fast not ordered:fruit.
Clustering II.
Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
BioInformatics (3).
Basic Gene Expression Data Analysis--Clustering
CS 478 – Tools for Machine Learning and Data Mining Clustering: Distance-based Approaches.
Clustering.
Cluster Analysis: Basic Concepts and Algorithms
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Clustering Basic Concepts and Algorithms
Clustering Categorical Data The Case of Quran Verses
PARTITIONAL CLUSTERING
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Data Mining Techniques: Clustering
Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very.
Introduction to Bioinformatics
Chemical Diversity Qualify and/or quantify the extent of variety within a set of compounds. Try to define the extent of chemical space. In combinatorial.
Clustering II.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Distance Measures Tan et al. From Chapter 2.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Cluster Analysis (1).
What is Cluster Analysis?
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Revision (Part II) Ke Chen COMP24111 Machine Learning Revision slides are going to summarise all you have learnt from Part II, which should be helpful.
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Clustering. What is clustering? Grouping similar objects together and keeping dissimilar objects apart. In Information Retrieval, the cluster hypothesis.
Clustering Unsupervised learning Generating “classes”
Similarity Methods C371 Fall 2004.
DATA MINING CLUSTERING K-Means.
Hierarchical Clustering
Faculté de Chimie, ULP, Strasbourg, FRANCE
Lecture 20: Cluster Validation
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
Clustering.
Clustering C.Watters CS6403.
Clustering Algorithms Presented by Michael Smaili CS 157B Spring
Chapter 2: Getting to Know Your Data
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Selecting Diverse Sets of Compounds C371 Fall 2004.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Definition Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to)
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9.
Use of Machine Learning in Chemoinformatics
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Computational Approach for Combinatorial Library Design Journal club-1 Sushil Kumar Singh IBAB, Bangalore.
1 CS 430: Information Discovery Lecture 24 Cluster Analysis.
Data Mining and Text Mining. The Standard Data Mining process.
Semi-Supervised Clustering
Clustering CSC 600: Data Mining Class 21.
IMAGE PROCESSING RECOGNITION AND CLASSIFICATION
K-means and Hierarchical Clustering
K Nearest Neighbor Classification
Revision (Part II) Ke Chen
Clustering and Multidimensional Scaling
Information Organization: Clustering
Revision (Part II) Ke Chen
Data Mining – Chapter 4 Cluster Analysis Part 2
Nearest Neighbors CSC 576: Data Mining.
Text Categorization Berlin Chen 2003 Reference:
Hierarchical Clustering
Presentation transcript:

Similarity and Diversity Alexandre Varnek, University of Strasbourg, France

What is similar?

Different „spaces“, classified by: Shape Colour Size Pattern

16 diverse aldehydes...

...sorted by common scaffold

...sorted by functional groups

The „Similarity Principle“ : Structurally similar molecules are assumed to have similar biological properties Compounds active as opioid receptors

Structural Spectrum of Thrombin Inhibitors structural similarity “fading away” … reference compounds 0.56 0.72 0.53 0.84 0.67 0.52 0.82 0.64 0.39

Key features in similarity/diversity calculations: Properties to describe elements (descriptors, fingerprints) Distance measure („metrics“)

N-Dimensional Descriptor Space descriptorn Each chosen descriptor adds a dimension to the reference space Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule molecule Mi = (descriptor1(i), descriptor2(i), …, descriptorn(i)) descriptor2 descriptor1 descriptor3

Chemical Reference Space Distance in chemical space is used as a measure of molecular “similarity“ and “dissimilarity“ “Molecular similarity“ covers only chemical similarity but also property similarity including biological activity descriptor3 descriptor2 descriptor1 descriptorn DAB B A

Distance Metrics in n-D Space If two molecules have comparable values in all the n descriptors in the space, they are located close to each other in the n-D space. how to define “closeness“ in space as a measure of molecular similarity? distance metrics

Descriptor-based Similarity When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values, respectively. A = (a1,a2,...an) B = (b1,b2,...bn) The similarity between A and B, SAB, is negatively correlated with the distance DAB shorter distance ~ more similar molecules in the case of normalized distance (within value range [0,1]), similarity = 1 – distance descriptor3 descriptor2 descriptor1 descriptorn B DAB DBC C A DAB>DBC  SAB<SBC

Metrics Properties The distance values dAB 0; dAA= dBB= 0 Symmetry properties: dAB= dBA Triangle inequality: dAB dAC+ dBC descriptor3 descriptor2 descriptor1 descriptorn B DAB DBC C A

Euclidean Distance in n-D Space Given two n-dimensional vectors, A and B A = (a1,a2,...an) B = (b1,b2,...bn) Euclidean distance DAB is defined as: Example: A = (3,0,1); B = (5,2,0) DAB = = 3 descriptor3 descriptor2 descriptor1 descriptorn DAB A B

Manhattan Distance in n-D Space Given two n-dimensional vectors, A and B A = (a1,a2,...an) B = (b1,b2,...bn) Manhattan distance DAB is defined as: Example: A = (3,0,1); B = (5,2,0) DAB = = 5 descriptor3 descriptor2 descriptor1 descriptorn DAB A B

Distance Measures („Metrics“): Euclidian distance: [(x11 - x21) 2 + (x12 - x22)2] 1/2 = = (42 + 22)1/2 = 4.472 Manhattan (Hamming) distance: |x11 - x21| + |x12 - x22| = 4 + 2 = 6 Sup distance: Max (|x11 - x21|, |x12 - x22|) = = Max (4, 2) = 4

Binary Fingerprint

Popular Similarity/Distance Coefficients Similarity metrics: Tanimoto coefficient Dice coefficient Cosine coefficient Distance metrics: Euclidean distance Hamming distance Soergel distance

Tanimoto Coefficient (Tc) Definition: value range: [0,1] Tc is also known as Jaccard coefficient Tc is the most popular similarity coefficient A B C

Example Tc Calculation a = 4, b = 4, c = 2 A B binary

Dice Coefficient Definition: value range: [0,1] monotonic with the Tanimoto coefficient

Cosine Coefficient Definition: Properties: value range: [0,1] correlated with the Tanimoto coefficient but not strictly monotonic with it

Hamming Distance Definition: value range: [0,N] (N, length of the fingerprint) also called Manhattan/City Block distance

Soergel Distance Definition: Properties: value range: [0,1] equivalent to (1 – Tc) for binary fingerprints

Similarity coefficients

Properties of Similarlity and Distance Coefficients Metric Properties The distance values dAB 0; dAA= dBB= 0 Symmetry properties: dAB= dBA Triangle inequality: dAB dAC+ dBC Properties of Similarlity and Distance Coefficients The Euclidean and Hamming distances and the Tanimoto coefficients (dichotomous variables) obey all properties. The Tanimoto, Dice and Cosine coefficients do not obey inequality (3). Coefficients are monotonic if they produce the same similarlity ranking

Similarity search Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value is reached. D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Similarity search Molecular similarity at a range of Tanimoto coefficient values D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Similarity search The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Molecular Similarity A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003

Molecular Similarity Similarity = Nbonds(MCS) / Nbonds(query) The maximum common subgraph (MCS) between the two molecules is in bold Similarity = Nbonds(MCS) / Nbonds(query) A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003

Activity landscape

How important is a choice of descriptors ? Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.

R. Guha et al. J.Chem.Inf.Mod., 2008, 48, 646 discontinuous SARs continuous SARs gradual changes in structure result in moderate changes in activity “rolling hills” (G. Maggiora) small changes in structure have dramatic effects on activity “cliffs” in activity landscapes Structure-Activity Landscape Index: SALIij = DAij / DSij DAij (DSij ) is the difference between activities (similarities) of molecules i and j R. Guha et al. J.Chem.Inf.Mod., 2008, 48, 646

discontinuous SARs VEGFR-2 tyrosine kinase inhibitors MACCSTc: 1.00 Analog 6 nM 2390 nM small changes in structure have dramatic effects on activity “cliffs” in activity landscapes lead optimization, QSAR bad news for molecular similarity analysis...

Example of a “Classical” Discontinuous SAR Any similarity method must recognize these compounds as being “similar“ ... (MACCS Tanimoto similarity) Adenosine deaminase inhibitors

Libraries design Goal: to select a representative subset from a large database

Chemical Space Overlapping similarity radii  Redundancy „Void“ regions  Lack of information

Chemical Space „Void“ regions  Lack of information

Chemical Space No redundancy, no „voids“  Optimally diverse compound library

Subset selection from the libraries Clustering Dissimilarity-based methods Cell-based methods Optimisation techniques

Clustering in chemistry

What is clustering? Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group A technique to understand, simplify and interpret large amounts of multidimensional data Classification without labels (“unsupervised learning”)

Where clustering is used ? General: data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval) Chemical: representative sample, subsets selection, classification of new compounds

Overall strategy Select descriptors Generate descriptors for all items Scale descriptors Define similarity measure (« metrics ») Apply appropriate clustering method to group the items on basis of chosen descriptors and similarity measure Analyse results

Data Presentation Pattern matrix Proximity matrix dii = 0; dij = dji descriptors molecules molecules molecules Pattern matrix Proximity matrix Library contains n molecules, each molecule is described by p descriptors dii = 0; dij = dji

Clustering methods Hierarchical Non-hierarchical Agglomerative Single Link Complete Link Agglomerative Divisive Group Average Hierarchical Weighted Gr Av Monothetic Centroid Polythetic Median Single Pass Ward Jarvis-Patrick Nearest Neighbour Non-hierarchical Mixture Model Relocation Topographic Others

Hierarchical Clustering divisive agglomerative A dendrogram representing an hierarchical clustering of 7 compounds

Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods Simple link Group average Complete link In the Single Link method, the intercluster distance is equal to the minimum distance between any two compounds, one from each cluster. In the Complete Link method, the intercluster distance is equal to the furthest distance between any two compounds, one from each cluster. The Group Average method measures intercluster distance as the average of the distances between all compounds in the two clusters.

Hierarchical Clustering: Johnson’s method The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones. Step 1. Group the pair of objects into a cluster Step 2. Update the proximity matrix Single-link Complete-link

Hierarchical Clustering: single link

Hierarchical Clustering: complete link

Hierarchical Clustering: single vs complete link

Non-Hierarchical Clustering: the Jarvis-Patrick method At the first step, all nearest neighbours of each compound are found by calculating of all paiwise similarities and sorting according to this similarlity. Then, two compounds are placed into the same cluster if: They are in each other’s list of m nearest neighbours. They have p (where p< m) nearest neighbours in common. Typical values: m = 14 ; p = 8. Pb: too many singletons.

Non-Hierarchical Clustering: the relocation methods Relocation algorithms involve an initial assignment of compounds to clusters, which is then iteratively refined by moving (relocating) certain compounds from one cluster to another. Example: the K-means method Random choise of c « seed » compounds. Other compounds are assigned to the nearest seed resulting in an initial set of c clusters. The centroides of cluster are calculated. The objects are re-assigned to the nearest cluster centroid. Pb: the method is dependent upon the initial set of cluster centroids.

Efficiency of Clustering Methods Storage space Time Hierarchical (general) O (N2) O (N3) (Ward’s method + RNN) O (N) Non-Hierarchical O (MN) (Jarvis-Patric method) N is the number of compounds and M is the number of clusters

Validity of clustering How many clusters are in the data ? Does partitioning match the categories ? Where should be the dendrogram be cut ? Which of two partitions fit the data better ?

Dissimilarity-Based Compound Selection (DBCS) 4 steps basic algorithm for DBCS: Select a compound and place it in the subset. Calculate the dissimilarity between each compound remaining in the data set and the compounds in the subset. Choose the next compound as that which is most dissimilar to the compounds in subset. If n < n0 (n0 being the desired size number of compounds in the final subset), return to step 2.

Dissimilarity-Based Compound Selection (DBCS) Basic algorithm for DBCS: 1st step – selection of the initial compound Select it at random; Choose the molecules which is « most representative » (e.g., has the largest sum of similarlities to other molecules); Choose the molecules which is « most dissimilar » (e.g., has the smallest sum of similarlities to other molecules).

Dissimilarity-Based Compound Selection (DBCS) Basic algorithm for DBCS: 2nd step – calculation of dissimilarity Dissimilarity is the opposite of similarity (Dissimilarity)i,j = 1 – (Similarity )i,j (where « Similarity » is Tanimoto, or Dice, or Cosine, … coefficients)

Diversity Diversity characterises a set of molecules

Dissimilarity-Based Compound Selection (DBCS) Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound There are several methods to select a diversed subset containing m compounds 1). MaxSum method selects the compound i that has the maximum sum of distances to all molecules in the subset 2). MaxMin method selects the compound i with the maximum distance to its closest neighbour in the subset

Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound 3). The Sphere Exclusion Algorithm Define a threshold dissimilarity, t Select a compound and place it in the subset. Remove all molecules from the data set that have a dissimilarity to the selected molecule of less than t Return to step 2 if there are molecules remaining in the data set. The next compound can be selected randomly; using MinMax-like method

DBCS : Subset selection from the libraries

Cell-based methods Cell-based or Partitioning methods operated within a predefined low-dimentional chemical space. If there are K axes (properties) and each is devided into bi bins, then the number of cells Ncells in the multidimentianal space is

Cell-based methods The construction of 2-dimentional chemical space. LogP bins: <0, 0-0.3, 3-7 and >7 MW bins: 0-250, 250-500, 500-750, > 750.

Cell-based methods A key feature of Cell-based methods is that they do not requere the calculation of paiwise distances Di,j between compounds; the chemical space is defined independently of the molecules that are positioned within it. Advantages of Cell-based methods Empty cells (voids) or cells with low ocupancy can be easily identified. The diversity of different subsets can be easily compared by examining the overlap in the cells occupied by each subset. Main pb: Cell-based methods are restricted to relatively low-dimentional space

Optimisation techniques DBCS methods prepare a diverse subset selecting interatively ONE molecule a time. Optimisation techniques provide an efficient ways of sampling large search spaces and selection of diversed subsets

Optimisation techniques Example: Monte-Carlo search Random selection of an initial subset and calculation of its diversity D. 2. A new subset is generated from the first by replacing some of its compounds with other randomly selected. 3. The diversity of the new subset Di+1 is compared with Di if DD = Di+1 - Di > 0, the new set is accepted if DD < 0, the probability of acceptence depends on the Metropolis condition, exp(- DD / kT).

Scaffolds and Frameworks

Frameworks Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893

Frameworks Dissection of a molecule according to Bemis and Murcko. Diazepam contains three sidechains and one framework with two ring systems and a zero-atom linker. G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171

Graph Frameworks for Compounds in the CMC Database (Numbers Indicate Frequency of Occurrence) Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893

Scaffolds et Frameworks L’algorithme de Bemis et Murcko de génération de framework : les hydrogènes sont supprimés, les atomes avec une seule liaison sont supprimés successivement, le scaffold est obtenu, tous les types d’atomes sont définis en tant que C et tous les types de liaisons sont définis en tant que simples liaisons, ce qui permet d’obtenir le framework. Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893 Contrairement à la méthode de Bemis et Murcko, A. Monge a proposé de distinguer les liaisons aromatiques et non aromatiques (thèse de doctorat, Univ. Orléans, 2007)

Scaffold-Hopping: How Far Can You Jump? G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171

The Scaffold Tree − Visualization of the Scaffold Universe by Hierarchical Scaffold Classification A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann J. Chem. Inf. Model., 2007, 47 (1), 47-58

Scaffold tree for the results of pyruvate kinase assay Scaffold tree for the results of pyruvate kinase assay. Color intensity represents the ratio of active and inactive molecules with these scaffolds. A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann J. Chem. Inf. Model., 2007, 47 (1), 47-58