Chemical Diversity
Qualify and/or quantify the extent of variety within a set of compounds, and try to define the extent of chemical space. In combinatorial chemistry, we are interested in the diversity of a library.

Example 1: here we are looking at compounds that can possess up to 2 functional groups. How do we define libraries that have different numbers of these cells occupied? How do we quantify those that have duplicates within cells?

Chemical Diversity based on properties
Example 2: We can try to define diversity based on properties of the compounds. For example, we could take the naturally occurring amino acids and span the space defined by their pI. This gives a poor spread, so try pI and MW together. We could go to higher dimensions by also looking at the number of H-bonds they make, the number of OH groups, their dipole moment, etc.

Why is Diversity Important?
- Similar property principle: structurally similar compounds will exhibit similar physicochemical and biological properties
  - Test only representative compounds and eliminate redundancies
  - For lead discovery, want a diverse space to locate all possible hits (actives): called a diverse library
  - For refining a lead into a drug (lead optimization), want to survey a range of similar compounds: called a focused library
- Diversity hypothesis: diverse reactants will lead to diverse products
  - Potentially useful for library design
- Quantify whether a library can be supplemented by additions of other compounds or other libraries
Beno, Drug Discovery Today, 2001, 6, 251
Brown, JCICS, 1996, 36, 572
Gillet, JCICS, 1997, 37, 731

Types of Diversity
- A library whose members sample chemical space evenly: an ideal situation for lead discovery.
- A library that covers the same chemical space, but whose compounds cluster and leave large holes.
- A library with even sampling of space but only limited diversity: useful for modification of a lead.
From Rose, Drug Discovery Today, 2002, 7, 133.

Quantifying Diversity
- Need to define how similar (or dissimilar) two compounds are to each other
  - Similarity indices
- Then need to determine the spread of the compounds throughout space
  - Distance-based
  - Cell-based partitioning
  - Clustering
Agrafiotis, Mol. Diversity, 1999, 4, 1

Defining Similarity
- Descriptors
  - Property-based
  - Structure-based
    - 2D: structural keys, fingerprints
    - 3D: pharmacophore
- Similarity/Distance Coefficients
Beno, Drug Discovery Today, 2001, 6, 251
Willett, Curr. Opin. Biotechnology, 2000, 11, 85
Willett, JCICS, 1998, 38, 983
Daylight, http://www.daylight.com/dayhtml/doc/theory/theory.finger.html

Structural Keys
- A Boolean array expressing whether a pattern is present (TRUE) or not (FALSE) within a molecule
- This array is usually represented as a string of 1s (TRUE) and 0s (FALSE): a bitmap
- So create a list of structural features, and then set the corresponding bit to 1 if the feature is present
Martin, J. Med. Chem., 1995, 38, 1431
Flower, JCICS, 1998, 38, 379

Fingerprints
Problems with structural keys:
- Lack of generality
- Choice of structural keys is arbitrary and may not be appropriate for the search or question at hand
- List of structural keys can be very long and unwieldy to generate and test
Solution: the fingerprint
- Also a bitmap, but with NO assigned meaning to any particular bit! Your fingerprint is characteristic of you, but there is no meaning to any particular fragment of it
- Generate patterns from the molecule itself, such as a pattern for:
  - Each atom
  - Each atom with its nearest neighbors
  - Each group of atoms and bonds connected by paths up to 2 bonds long
  - Continuing with paths up to 3, 4, 5, 6, and 7 bonds long (seven seems to be the longest typically employed)
- This list of patterns is exhaustive, meaning all are generated for every molecule

Fingerprints. II.
- Since the number of patterns is huge, it is not possible to assign a particular bit to each pattern
- Instead, each pattern is the input to a hash function that produces a small set of bits (typically 4-5 bits). These bits are then added (with logical OR) to the fingerprint
- Note that the bit sets for different patterns may have some bits in common
  - This overlap is not a problem, since every bit produced by some pattern (substructure) will be set in the molecule's fingerprint
  - Each pattern (substructure) generates its particular set of bits, and it is unlikely that another pattern will set those exact same bits
  - So a search for that substructure simply means checking whether those bits have been set
Fingerprint advantages:
- No predefined set of patterns (structural keys)
- Structural keys are usually quite sparse; fingerprints are much denser
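The hashing and OR steps can be sketched as follows. This is only an illustration under stated assumptions: the patterns are given as plain strings (enumerating paths from a real molecule is not shown), SHA-256 stands in for whatever hash function a real toolkit uses, and the function names and the 1024-bit length are hypothetical.

```python
import hashlib

def pattern_bits(pattern, n_bits=1024, bits_per_pattern=4):
    """Hash one pattern string to a small set of bit positions (assumed scheme)."""
    positions = set()
    for i in range(bits_per_pattern):
        digest = hashlib.sha256(f"{pattern}|{i}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % n_bits)
    return positions

def fingerprint(patterns, n_bits=1024, bits_per_pattern=4):
    """OR the bits of every pattern into one integer used as a bitmap."""
    fp = 0
    for pattern in patterns:
        for pos in pattern_bits(pattern, n_bits, bits_per_pattern):
            fp |= 1 << pos
    return fp

def may_contain(molecule_fp, query_fp):
    """True if every bit of the query is set in the molecule's fingerprint.

    A hit is only a candidate: other patterns may have set the same bits.
    """
    return (molecule_fp & query_fp) == query_fp

# Hypothetical path patterns for an ethanol-like molecule.
molecule = fingerprint(["C", "O", "CC", "CO", "CCO"])
query = fingerprint(["CO"])
print(may_contain(molecule, query))
```

Because a pattern's bits are always ORed into the fingerprint, a molecule that contains the pattern can never fail this screen; the uncertainty runs only in the other direction.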

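Similarity Coefficients
For bit strings, with $x_{jA}$ the value of bit $j$ in molecule A:
- $a = \sum_j x_{jA}$: number of bits on in A
- $b = \sum_j x_{jB}$: number of bits on in B
- $c = \sum_j x_{jA} x_{jB}$: number of bits on in both A and B
$D(A,B)$ is the coefficient for A and B using bits; $S(A,B)$ is the coefficient for A and B using continuous variables.

Euclidean Distance
- $D(A,B) = [a + b - 2c]^{1/2}$, range 0 to $\sqrt{n}$ for n bits
- $S(A,B) = \left[\sum_j (x_{jA} - x_{jB})^2\right]^{1/2}$, range 0 to infinity

Tanimoto Coefficient
- $D(A,B) = c / [a + b - c]$, range 0 to 1
- $S(A,B) = \sum_j x_{jA} x_{jB} / \left[\sum_j x_{jA}^2 + \sum_j x_{jB}^2 - \sum_j x_{jA} x_{jB}\right]$, range -0.333 to 1

Cosine Coefficient
- $D(A,B) = c / [ab]^{1/2}$, range 0 to 1
- $S(A,B) = \sum_j x_{jA} x_{jB} / \left[\sum_j x_{jA}^2 \sum_j x_{jB}^2\right]^{1/2}$, range -1 to 1

Willett, JCICS, 1998, 38, 983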
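A minimal sketch of the continuous-variable forms, assuming descriptor vectors are plain Python lists of equal length and are not all zero; the function names are illustrative. The Tanimoto denominator subtracts the cross term, which is the standard form and the one that reduces to c/(a + b - c) for 0/1 vectors.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((xa - xb) ** 2 for xa, xb in zip(x, y)))

def tanimoto(x, y):
    """Continuous Tanimoto coefficient: dot / (|x|^2 + |y|^2 - dot)."""
    dot = sum(xa * xb for xa, xb in zip(x, y))
    return dot / (sum(xa * xa for xa in x) + sum(xb * xb for xb in y) - dot)

def cosine(x, y):
    """Cosine coefficient: dot / (|x| * |y|)."""
    dot = sum(xa * xb for xa, xb in zip(x, y))
    norm = math.sqrt(sum(xa * xa for xa in x) * sum(xb * xb for xb in y))
    return dot / norm

# Two opposite vectors hit the lower end of each similarity range.
x, y = [1, 0], [-1, 0]
print(euclidean(x, y), tanimoto(x, y), cosine(x, y))
```

For opposite vectors of equal length the Tanimoto value is -1/3 and the cosine value is -1, which matches the lower bounds quoted for the continuous forms.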

Example: bitmaps
- 2,2-dimethylbutane: 1111011000000, a = 6
- Ethylcyclobutane: 1111110011100, b = 9
- c = 5
- Euclidean distance = $(6 + 9 - 10)^{1/2} = 2.24$
- Tanimoto coefficient = $5/(6 + 9 - 5) = 0.5$
- Cosine coefficient = $5/(6 \cdot 9)^{1/2} = 0.68$
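The arithmetic of this example can be checked with a short script. The two bit strings are the ones given on the slide; the helper name is illustrative.

```python
import math

def bit_counts(bits_a, bits_b):
    """Return (a, b, c): bits on in A, bits on in B, bits on in both."""
    a = bits_a.count("1")
    b = bits_b.count("1")
    c = sum(1 for x, y in zip(bits_a, bits_b) if x == "1" and y == "1")
    return a, b, c

dimethylbutane = "1111011000000"    # 2,2-dimethylbutane
ethylcyclobutane = "1111110011100"  # ethylcyclobutane

a, b, c = bit_counts(dimethylbutane, ethylcyclobutane)
euclid = math.sqrt(a + b - 2 * c)
tanimoto = c / (a + b - c)
cosine = c / math.sqrt(a * b)

print(a, b, c)                                            # 6 9 5
print(round(euclid, 2), round(tanimoto, 2), round(cosine, 2))  # 2.24 0.5 0.68
```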

Problems with Tanimoto and related similarity indices Flower, JCICS, 1998, 38, 379

Quantifying Diversity
Rules for a diversity function:
- Adding redundant molecules does not change the value of the diversity
- Adding non-redundant molecules always increases the value of the diversity
- Space-filling behavior should be preferred
- Perfect filling of space gives a finite value of the diversity
- As the dissimilarity of a pair of compounds increases, the diversity should increase asymptotically
Waldman, J. Mol. Graph. Model., 2000, 18, 412

Diversity definition 1
In this definition, SIM(J,K) is some similarity measurement between compounds J and K. It can be used to build a compound selection procedure for creating the sublibrary with maximal diversity:
1. Find the similarities of all compounds in the library
2. Select the compound that is most dissimilar from all others
3. Select the 2nd compound that is most dissimilar from the first
4. Select the 3rd compound that is most dissimilar from the first 2
5. Continue until you have selected as many compounds as you desire
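The selection procedure can be sketched as a greedy loop. Several details are assumptions, not taken from the slide: compounds are represented as sets of on-bit positions, SIM is taken to be the Tanimoto coefficient, the first compound is the one with the lowest total similarity to all others, and "most dissimilar from those already chosen" is read as having the smallest maximum similarity to any selected compound. Other readings (for example, the smallest summed similarity) are equally plausible.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def select_diverse(fps, n_select):
    """Greedy dissimilarity-based selection; returns indices into fps."""
    n = len(fps)
    # Step 1: all pairwise similarities.
    sim = [[tanimoto(fps[i], fps[j]) for j in range(n)] for i in range(n)]
    # Step 2: start from the compound least similar to all the others.
    selected = [min(range(n), key=lambda i: sum(sim[i]))]
    # Steps 3 onward: add the compound least similar to anything selected.
    while len(selected) < min(n_select, n):
        remaining = [i for i in range(n) if i not in selected]
        nxt = min(remaining, key=lambda i: max(sim[i][s] for s in selected))
        selected.append(nxt)
    return selected

# Hypothetical fingerprints given as sets of on-bit positions.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6}, {7, 8, 9}, {7, 8, 10, 1}]
print(select_diverse(fps, 3))
```

Computing the full similarity matrix costs O(n^2) comparisons, which is fine for a sketch but becomes the bottleneck for large libraries.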

Cell-based Partitioning
- Divide each dimension into a number of parts; these divisions are called cells or bins
- Place each compound into the appropriate bin based on the values of its properties and/or descriptors
- Can now create a sublibrary by choosing one compound from each bin, usually the one nearest the center of the bin
Schematic representation of different sampling of diversity space: (a) maximize Euclidean distance to create maximum diversity; (b) cell-based selection, choosing the compound nearest the center of each cell. From Rose, Drug Discovery Today, 2002, 7, 133
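A minimal sketch of cell-based selection, assuming each compound is a tuple of descriptor values, every dimension is split into the same number of equal-width bins, and all values lie within the stated bounds. The function name and the example data are hypothetical.

```python
import math
from collections import defaultdict

def cell_based_selection(points, lows, highs, bins_per_dim):
    """Pick, for each occupied cell, the index of the point nearest its center."""
    widths = [(hi - lo) / bins_per_dim for lo, hi in zip(lows, highs)]
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        # Clamp so a value exactly on the upper bound falls in the last bin.
        cell = tuple(
            min(int((v - lo) / w), bins_per_dim - 1)
            for v, lo, w in zip(point, lows, widths)
        )
        cells[cell].append(idx)
    chosen = {}
    for cell, members in cells.items():
        center = [lo + (k + 0.5) * w for k, lo, w in zip(cell, lows, widths)]
        chosen[cell] = min(members, key=lambda i: math.dist(points[i], center))
    return chosen

# Five hypothetical compounds described by two properties scaled to [0, 1].
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8), (0.7, 0.75), (0.6, 0.1)]
picked = cell_based_selection(points, lows=(0, 0), highs=(1, 1), bins_per_dim=2)
print(picked)
```

Empty cells simply do not appear in the result, which is also how holes in the coverage of the space show up with this approach.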

Diversity definition 2 and 3
Suppose 10 molecules are divided into 2 cells.
- Distribution 1: (5,5), Dχ² = 0
- Distribution 2: (7,3), Dχ² = -8
So the more even distribution is scored as being more diverse. But this may actually go too far: Dχ²(2,2,2) > Dχ²(4,1,1) = Dχ²(3,3,0). This makes the last two equivalent, but (4,1,1) appears intuitively more diverse.
The entropy-like definition ranks the three sets Dentropy(2,2,2) > Dentropy(4,1,1) > Dentropy(3,3,0).
Waldman, J. Mol. Graph. Model., 2000, 18, 412
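The formulas for these two definitions did not survive in the transcript, so the sketch below uses assumed forms. The first score (the transcript's Dc2, read here as a chi-squared-like function) is taken as the negative sum of squared deviations of the cell counts from perfectly even occupancy, which reproduces the values 0 and -8 quoted above. The entropy-like score is taken as the Shannon entropy of the cell occupancy fractions. Both may differ from the cited paper in scaling or normalization.

```python
import math

def chi_squared_diversity(counts):
    """Assumed form: minus the squared deviations from even occupancy."""
    mean = sum(counts) / len(counts)
    return 0.0 - sum((c - mean) ** 2 for c in counts)

def entropy_diversity(counts):
    """Assumed form: Shannon entropy of the cell occupancy fractions."""
    total = sum(counts)
    return 0.0 - sum((c / total) * math.log(c / total) for c in counts if c > 0)

for cells in [(5, 5), (7, 3)]:
    print(cells, chi_squared_diversity(cells))

for cells in [(2, 2, 2), (4, 1, 1), (3, 3, 0)]:
    print(cells, chi_squared_diversity(cells), round(entropy_diversity(cells), 3))
```

With these forms the chi-squared-like score cannot separate (4,1,1) from (3,3,0), while the entropy-like score ranks (4,1,1) higher because it leaves no cell empty.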

Clustering. I.
Hierarchical clusters
- Small clusters within larger clusters
- Typically some relationship between clusters
Two procedures:
- Agglomerative: start with singletons and move upwards
  (a) Calculate all similarities of all pairs
  (b) Merge the two most similar into a cluster
  (c) Continue until only one cluster remains
- Divisive: start with one cluster and break it into smaller clusters
  (a) Calculate all dissimilarities of all pairs
  (b) Take the pair of most dissimilar structures and assign all other structures to the least dissimilar of these initial cluster centers
  (c) Recursively select the cluster with the largest diameter and partition it into two such that the largest resulting cluster has the smallest diameter
  (d) Repeat step (c) for a maximum of n-1 times
Brown, JCICS, 1996, 36, 572
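The agglomerative procedure can be sketched in a few lines. The slide does not say how the similarity between two multi-member clusters is defined, so this sketch assumes single linkage (the similarity of the closest pair of members); the function name and the example matrix are hypothetical.

```python
def agglomerative(sim, n_clusters=1):
    """Merge the two most similar clusters until n_clusters remain.

    sim is a symmetric similarity matrix. Returns the final clusters and
    the merge history as (cluster_x, cluster_y, similarity) tuples.
    """
    clusters = [[i] for i in range(len(sim))]  # start with singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage (assumed): most similar pair of members.
                s = max(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((clusters[x], clusters[y], s))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges

# Hypothetical similarities for four compounds.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
clusters, merges = agglomerative(sim, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Running to a single cluster and keeping the merge history gives the nested hierarchy; stopping early, as here, cuts the tree at a chosen number of clusters.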

Clustering. II.
Nonhierarchical clusters
- No relation between clusters
Jarvis-Patrick method:
- Calculate the similarities of all pairs
- Record the top K most similar structures to each structure (nearest-neighbor list)
- Assign compounds to clusters. A and B are in the same cluster if:
  - A is in the top K nearest-neighbor list of B
  - B is in the top K nearest-neighbor list of A
  - A and B have at least Kmin of their top K nearest neighbors in common
- Tends to produce lots of small clusters (singletons) under strict conditions, or a few very large clusters under less strict conditions
Brown, JCICS, 1996, 36, 572
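A minimal sketch of the Jarvis-Patrick conditions, assuming a precomputed symmetric similarity matrix and nearest-neighbor lists that exclude the compound itself (conventions differ on this point). Pairs that satisfy all three conditions are joined with a small union-find; the function name and the example data are hypothetical.

```python
def jarvis_patrick(sim, k, k_min):
    """Cluster indices 0..n-1 from a symmetric similarity matrix."""
    n = len(sim)
    # Top-k nearest-neighbor list for each compound (self excluded).
    neighbors = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(ranked[:k]))

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbors[j] and j in neighbors[i]
            if mutual and len(neighbors[i] & neighbors[j]) >= k_min:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical similarities: two tight groups of three compounds each.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.1, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9, 0.7],
    [0.1, 0.2, 0.1, 0.9, 1.0, 0.8],
    [0.2, 0.1, 0.1, 0.7, 0.8, 1.0],
]
print(jarvis_patrick(sim, k=2, k_min=1))  # [[0, 1, 2], [3, 4, 5]]
print(jarvis_patrick(sim, k=2, k_min=2))  # six singletons
```

The second call shows the behavior noted above: tightening Kmin relative to K quickly breaks the data into singletons.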

Goals for Diversity Metrics
- Ensure that exploratory libraries are broad enough to locate active molecules
- Ensure that focused (directed) libraries are both broad enough to sample space and compact enough to maintain activity
- Need to keep libraries small enough to manage readily, so want to ensure that sublibraries separate actives from inactives

Other Diversity Comments
- Krchnak, Mol. Diversity, 1996, 1, 193 (http://www.5z.com/moldiv/publish/MD023/md_023.html)
  - General comments on combinatorial methods and diversity
- Good, JCICS, 1997, 40, 3926
  - Use of 3D pharmacophores demands selection of products, not reagents, since they are not additive
- Martin, J. Comb. Chem., 1999, 1, 32
  - Beyond diversity, library construction should include MW, lipophilicity, ease of synthesis, pharmacophore features, reagent cost, solubility, and complementarity to other libraries
  - Distance measures assess redundancy; coverage of space is better assessed with maps or binning procedures
  - Diversity functions often overweight edges
- Oprea, J. Comb. Chem., 2001, 3, 157
  - Big numbers (lots of compounds) and serendipity are not enough
- Martin, J. Comb. Chem., 2001, 3, 231
  - Chemical similarity is not always a good predictor of bioproperties
  - Unlikely that a few thousand compounds can span all of chemical space
  - Just how much diversity is enough?