Published by Elizabeth Curtis. Modified over 9 years ago.
1
Selecting Diverse Sets of Compounds C371 Fall 2004
2
Review
Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.
3
Techniques
High-Throughput Screening (HTS)
Combinatorial Chemistry
Early attempts led to large libraries, but little variability in the molecules created
Need a way to identify subsets of compounds for synthesis, purchase, or testing
4
Chemical Diversity
No unambiguous definition
Need to quantify the degree of diversity of a subset of compounds
Four main approaches:
–Cluster analysis
–Dissimilarity-based methods
–Cell-based methods
–Use of optimization techniques
5
CLUSTER ANALYSIS
Aim is to divide a group into clusters so that objects within a cluster are similar to one another and dissimilar to objects in other clusters
Many algorithms for doing this
–Hierarchical methods seem to be better than non-hierarchical
Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds
6
Key Steps in Cluster Analysis
Generate descriptors for each compound
Calculate the similarity or distance between all compounds
Use a clustering algorithm to group the compounds
Select a representative subset by taking one or more compounds from each cluster
7
“Distance”
1 - S, where S is the similarity coefficient
–When molecules are represented by binary descriptors
Euclidean distance
–When molecules are represented by physicochemical properties
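The two distance measures on this slide can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the similarity coefficient S is the Tanimoto coefficient (the slide does not say which coefficient is meant), and the fingerprints and property values are made-up toy data.

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary fingerprints (assumed choice of S)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0

def tanimoto_distance(a, b):
    """Distance as 1 - S for binary descriptors."""
    return 1.0 - tanimoto(a, b)

def euclidean_distance(p, q):
    """Euclidean distance for vectors of physicochemical properties."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Hypothetical 6-bit fingerprints: 2 bits in common, 4 bits set in either.
fp1 = [1, 1, 0, 1, 0, 0]
fp2 = [1, 0, 0, 1, 1, 0]
print(tanimoto_distance(fp1, fp2))        # 0.5

# Hypothetical two-property vectors (values are illustrative only).
props1 = [300.0, 2.0]
props2 = [303.0, 6.0]
print(euclidean_distance(props1, props2))  # 5.0
```

In practice, properties measured on very different scales are often standardized before a Euclidean distance is taken; the sketch skips that step.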
8
Characteristics of Clustering Methods
Non-overlapping: each object is in one cluster only (most methods use this approach)
–Hierarchical methods
–Non-hierarchical methods
Overlapping: an object can be in more than one cluster
Efficiency and effectiveness issues: some approaches have very intensive computational requirements
9
Hierarchical Clustering
Clusters increase in size, with each compound in a single cluster (a singleton) at one extreme and all compounds in one cluster at the other
–Agglomerative methods start at the bottom and merge similar clusters
Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean)
Others: centroid method and the median method
–Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
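A rough sketch of agglomerative clustering with Ward's criterion, assuming compounds are represented as tuples of numeric properties. The function names and the toy points are illustrative only. At each step the pair of clusters whose merger gives the smallest increase in the within-cluster sum of squares is merged; the brute-force pair search is written for clarity, not speed.

```python
def sum_squares(cluster):
    """Sum of squared deviations of a cluster's points from the cluster mean."""
    dim = len(cluster[0])
    mean = [sum(p[i] for p in cluster) / len(cluster) for i in range(dim)]
    return sum((p[i] - mean[i]) ** 2 for p in cluster for i in range(dim))

def ward_cluster(points, n_clusters):
    """Merge singletons upward until n_clusters remain, using Ward's criterion."""
    clusters = [[p] for p in points]  # every compound starts as a singleton
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Increase in total within-cluster sum of squares if i and j merge.
                increase = (sum_squares(clusters[i] + clusters[j])
                            - sum_squares(clusters[i])
                            - sum_squares(clusters[j]))
                if best is None or increase < best[0]:
                    best = (increase, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical 2D property vectors: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = ward_cluster(points, 3)
print(result)
```

Stopping at a different `n_clusters` corresponds to cutting the hierarchy at a different level.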
10
Selecting the Appropriate Number of Clusters
Need a cutoff value at which you are going to examine the molecules
–Jaccard statistic of two clusters, $C_1$ and $C_2$:
$$J(C_1, C_2) = \frac{a}{a + b + c}$$
where $a$ is the number of compounds found in both clusters, $b$ is the number that cluster in 1 but not 2, and $c$ is the number in 2 but not 1
–Same as the Tanimoto coefficient
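The Jaccard statistic is straightforward to compute when each cluster is held as a set of compound identifiers. The sketch below assumes that representation; the compound names are placeholders.

```python
def jaccard(cluster_1, cluster_2):
    """Jaccard statistic a / (a + b + c) for two clusters of compound IDs."""
    s1, s2 = set(cluster_1), set(cluster_2)
    a = len(s1 & s2)   # compounds found in both clusters
    b = len(s1 - s2)   # in cluster 1 but not cluster 2
    c = len(s2 - s1)   # in cluster 2 but not cluster 1
    total = a + b + c
    # Convention chosen here: two empty clusters score 0.0.
    return a / total if total else 0.0

# Placeholder compound IDs: a = 2, b = 2, c = 1.
c1 = {"cpd1", "cpd2", "cpd3", "cpd4"}
c2 = {"cpd3", "cpd4", "cpd5"}
print(jaccard(c1, c2))  # 0.4
```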
11
Non-Hierarchical Clustering
Compounds are clustered without forming a hierarchical relationship
Methods:
–Single-pass: assigns a compound to a cluster according to a cut-off value
Problem: results are not reproducible, because they depend on the order in which the molecules are processed
–Nearest neighbor: Jarvis-Patrick clustering
–Relocation: K-means
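The order dependence of single-pass clustering is easy to see in a toy example. The sketch below is one simple variant and rests on several assumptions: each compound is reduced to a single number, distance is the absolute difference, and a compound joins the first cluster whose first member lies within the cut-off. Other variants compare against a cluster centroid instead.

```python
def single_pass(values, cutoff):
    """One pass over the data: join the first cluster whose first member is within cutoff."""
    clusters = []
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= cutoff:
                cluster.append(v)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([v])
    return clusters

# The same three values in two different orders give different clusterings.
print(single_pass([0.0, 1.0, 2.0], cutoff=1.0))  # [[0.0, 1.0], [2.0]]
print(single_pass([1.0, 0.0, 2.0], cutoff=1.0))  # [[1.0, 0.0, 2.0]]
```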
12
DISSIMILARITY-BASED SELECTION METHODS
Attempt to identify a diverse set of compounds directly
Based on calculating distances or dissimilarities between compounds
13
Basic Algorithm for Dissimilarity-Based Selection Methods
Decide on a desired size, n, of a final subset
Select a compound and place it in the subset
Calculate the dissimilarity between each of the other compounds and those in the subset
Choose the next compound as the one most dissimilar to those in the subset
If fewer than n in the subset, repeat the calculation of the dissimilarity until n is achieved
Complexity varies as the square of n
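The steps above can be sketched as follows. The slide does not say how "most dissimilar to those in the subset" is scored, so this sketch assumes the MaxMin rule: the next compound is the one whose nearest already-selected compound is farthest away. Euclidean distance on toy 2D points stands in for whatever dissimilarity measure is actually used, and the `start` parameter is a hypothetical way of choosing the first compound.

```python
import math

def max_min_select(points, n, start=0):
    """Pick n point indices, each time taking the point farthest from the current subset."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < n:
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        # Only the newly added point can shrink a nearest-selected distance.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

# Toy 2D coordinates standing in for descriptor vectors.
points = [(0, 0), (1, 0), (10, 0), (10, 10), (0, 9)]
print(max_min_select(points, 3))  # [0, 3, 2]
```

Caching the nearest-selected distances means each round needs only one pass over the compounds rather than a full recomputation against every subset member.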
14
CELL-BASED METHODS
Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined
Compounds are allocated to cells according to their molecular properties
Methods are very fast with a time complexity of O(N), but restricted to low-dimensional space
–Good for very large data sets
–Examples: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
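A minimal sketch of cell-based partitioning, assuming a two-dimensional space of molecular weight and logP. The axis ranges, bin counts, and compound values are hypothetical and chosen only for illustration. Each compound is visited once, which is where the O(N) behaviour comes from.

```python
from collections import defaultdict

# Hypothetical axes: (low, high, number of bins) for MW and logP.
AXES = [(0.0, 600.0, 4), (-2.0, 6.0, 4)]

def cell_index(value, low, high, n_bins):
    """Map a property value to a bin index, clamping values outside the range."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    return int((value - low) / (high - low) * n_bins)

def assign_cells(compounds):
    """Allocate each compound to a cell keyed by its tuple of bin indices."""
    cells = defaultdict(list)
    for name, props in compounds.items():
        key = tuple(cell_index(v, lo, hi, n) for v, (lo, hi, n) in zip(props, AXES))
        cells[key].append(name)
    return cells

# Illustrative (MW, logP) values, not real measurements.
compounds = {
    "A": (120.0, 0.5),
    "B": (140.0, 1.0),
    "C": (450.0, 4.5),
    "D": (310.0, 2.5),
}
cells = assign_cells(compounds)
# One simple way to get a diverse subset: take one compound per occupied cell.
subset = [members[0] for members in cells.values()]
print(dict(cells))
print(subset)  # ['A', 'C', 'D']
```

Because the cell boundaries are fixed in advance, new compounds can be added later without re-partitioning the existing ones.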
15
BCUT Descriptors
Matrix representation of molecules
Atomic properties used for the diagonal
–Atomic charges, polarizabilities, hydrogen bonding
Connectivity used for the off-diagonals
–2D graph or interatomic distances from 3D
16
Partitioning Using Pharmacophore Keys
Each potential 3- or 4-point pharmacophore is considered to constitute a cell
A given molecule could be in more than one cell
Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules
17
OPTIMIZATION METHODS
Techniques for sampling large sets of molecules
May want to spread the compounds evenly in space
Techniques: Monte Carlo, simulated annealing
Selective replacement
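One way these ideas fit together is a simulated-annealing search over subsets, sketched below under stated assumptions. The diversity score (sum of pairwise Euclidean distances), the geometric cooling schedule, and all parameter names and defaults are choices made for this illustration, not something the slides specify. Each step swaps one subset member for an outside compound and accepts the swap by the Metropolis criterion.

```python
import math
import random

def diversity(points, subset):
    """Sum of pairwise Euclidean distances within the subset (one possible score)."""
    return sum(math.dist(points[i], points[j])
               for k, i in enumerate(subset) for j in subset[k + 1:])

def anneal_subset(points, n, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a diverse n-subset by random swaps with Metropolis acceptance.

    Requires n < len(points) so there is always an outside compound to swap in.
    """
    rng = random.Random(seed)
    current = rng.sample(range(len(points)), n)
    current_score = diversity(points, current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start towards t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / steps)
        outside = [i for i in range(len(points)) if i not in current]
        trial = list(current)
        trial[rng.randrange(n)] = rng.choice(outside)
        trial_score = diversity(points, trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = trial, trial_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score

# Toy chemistry space: a 5 x 5 grid of points.
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
best, best_score = anneal_subset(points, 4)
print(best, best_score)
```

Setting the temperature close to zero throughout would reduce this to a plain accept-only-improvements replacement scheme; a different score, such as the minimum pairwise distance, would favour a more even spread.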
18
CONCLUSIONS
Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have between a 30% and 80% chance of sharing the same biological activity
No clear consensus on which screening approach is best
Faster computer techniques (e.g., parallel computing) may help
Descriptors used must be related to biological activity