Published by Geraldine Maxwell. Modified over 9 years ago.
1
4. Molecular Similarity
2
2 Similarity and Searching
Historical Progression
Similarity Measures
Fingerprint Construction
“Pathological” Cases
MinMax and Counts
Pruning Search Space
Aggregate Queries
LSH
3
3 Historical Progression
Maximum Common Subgraph Isomorphism (MCS)
 –maximum common substructure between two molecules
 –“NP-complete”
Structural Keys
 –dictionary of predetermined, domain-specific substructures keyed to particular positions in a bit vector constructed for each molecule
 –similarity computed between bit vectors (fast O(D) scan)
2D Compressed Fingerprints
 –ALL substructures stored in a bit vector using a hashing scheme plus lossy compression (modulo operator)
 –similarity computed between bit vectors or count vectors
Faster Searches
 –database pruning
 –locality-sensitive hashing (LSH): towards O(log n) similarity searching
4
4 Superstructure and Substructure Searches
A is a superstructure of B (ignoring H)
B is a substructure of A
Tversky similarity
[Figure: molecules A and B]
5
5 The Similarity Problem
How similar?
6
6 Spectral Similarity
1. Count substructures
2. Compare the count/bit vectors
7
7 2D Graph Substructures
For chemical compounds
 –atom/node labels: A = {C, N, O, H, …}
 –bond/edge labels: B = {s, d, t, ar, …}
Trace ALL paths: O(N*d^l)
Cycles and trees
Combinatorial space (CsNsCdO)
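The path tracing described on this slide can be sketched as a depth-first search bounded by a maximum path length. This is only an illustration: the molecule representation (a list of atom labels plus a dict of bond labels) and the function name `enumerate_paths` are assumptions, not part of the slides. The toy graph is chosen so that the slide's example string CsNsCdO appears among the paths.

```python
def enumerate_paths(atoms, bonds, max_bonds):
    """Return the set of labeled paths containing at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: dict mapping (i, j) atom-index pairs to bond labels ("s", "d", ...)
    """
    # Undirected adjacency list: atom index -> [(neighbor index, bond label)]
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        adj[i].append((j, label))
        adj[j].append((i, label))

    paths = set()

    def dfs(node, visited, text, depth):
        paths.add(text)
        if depth == max_bonds:
            return
        for nxt, label in adj[node]:
            if nxt not in visited:  # simple paths only: never revisit an atom
                dfs(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    # Start a search from every atom, which gives the N factor in the cost.
    for start in range(len(atoms)):
        dfs(start, {start}, atoms[start], 0)
    return paths


# Hypothetical fragment C-N-C=O (hydrogens ignored)
atoms = ["C", "N", "C", "O"]
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "d"}
paths = enumerate_paths(atoms, bonds, max_bonds=3)
```

Each path is emitted in both directions (CsNsCdO and OdCsNsC). A real implementation would likely canonicalize the two readings into one feature; that step is omitted here.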
8
8 Mapping Structures to Bits
Compact data representation
Hash each path to the bit vector
Feature space → Bit space
Resolve clashes with the OR operator (i.e., 1+1=1)
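A minimal sketch of the mapping from paths to bits, assuming CRC32 as a stand-in hash (the slides do not say which hash is used) and a hypothetical function name `fingerprint`. The modulo step is the lossy compression, and setting a bit that is already set leaves it at 1, which is the OR rule for clashes.

```python
import zlib


def fingerprint(paths, n_bits=64):
    """Fold a collection of path strings into a fixed-length bit vector."""
    bits = [0] * n_bits
    for path in paths:
        h = zlib.crc32(path.encode("utf-8"))  # stand-in hash, an assumption
        bits[h % n_bits] = 1                  # clash: 1 OR 1 stays 1
    return bits


fp = fingerprint(["CsN", "NsC", "CdO", "CsNsCdO"], n_bits=16)
```

Because of the modulo, two different paths can land on the same position, so the number of set bits is at most the number of distinct paths.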
9
9 Similarity Measures
There are many ways of measuring similarity (or distance) between bit/count vectors:
 –Euclidean
 –Cosine
 –Exponentials
 –Tanimoto/Jaccard
 –Tversky
 –MinMax
 –And many more (L1, L2, Lp, Hamming, Manhattan, …)
12
12 Similarity Measures: Tanimoto
Tally features:
 –Unique (a, b)
 –Both on (c)
 –Both off (d)
Similarity formula
 –Tanimoto = c/(a+b+c)
[Figure: overlap of A and B, with regions a, c, b]
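The tally and formula above can be written directly in code. This is a sketch, assuming fingerprints are equal-length lists of 0/1 values; the value returned for two all-zero fingerprints is a convention chosen here, not something the slides specify.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity c / (a + b + c) for two equal-length bit lists."""
    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)  # on only in A
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)  # on only in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)      # on in both
    denom = a + b + c
    # Both-off bits (d) never enter the formula.
    return c / denom if denom else 1.0  # empty-vs-empty value is an assumption


A = [1, 1, 0, 1, 0]
B = [1, 0, 1, 1, 0]
score = tanimoto(A, B)  # a = 1, b = 1, c = 2
```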
13
13 The Fingerprint Approximation
Fingerprint bit similarity approximates chemical feature similarity.
14
14 Similarity Measures: Tversky
Tally features:
 –Unique (a, b)
 –Both on (c)
 –Both off (d)
Similarity formula
 –Tanimoto = c/(a+b+c)
 –Tversky(α,β) = c/(αa+βb+c)
[Figure: overlap of A and B, with regions a, c, b]
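A small sketch of the Tversky formula, again assuming 0/1 lists as fingerprints and a hypothetical function name `tversky`. Setting α = β = 1 recovers Tanimoto, while α = 1, β = 0 gives 1.0 whenever every on-bit of A is also on in B, which is why an asymmetric weighting suits substructure-style queries.

```python
def tversky(fp_a, fp_b, alpha, beta):
    """Tversky similarity c / (alpha*a + beta*b + c) for two bit lists."""
    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)  # on only in A
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)  # on only in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)      # on in both
    denom = alpha * a + beta * b + c
    return c / denom if denom else 1.0  # zero-denominator value is an assumption


A = [1, 1, 0, 0]  # all of A's bits are contained in B
B = [1, 1, 1, 1]
as_tanimoto = tversky(A, B, 1, 1)   # a = 0, b = 2, c = 2
as_substruct = tversky(A, B, 1, 0)  # ignores bits unique to B
```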
15
15 Pathological Cases
Flower DR. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379-386.
16
16 Pathological Cases
Issue of labeling scheme.
17
17 Counts
MinMax similarity is a generalization of Tanimoto that uses the counts.
MinMax can work better than Tanimoto.
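MinMax is commonly defined as the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below assumes that definition and count vectors stored as equal-length lists of non-negative integers. On 0/1 vectors it gives the same value as Tanimoto.

```python
def minmax(counts_a, counts_b):
    """MinMax similarity: sum of minima over sum of maxima."""
    num = sum(min(x, y) for x, y in zip(counts_a, counts_b))
    den = sum(max(x, y) for x, y in zip(counts_a, counts_b))
    return num / den if den else 1.0  # empty-vs-empty value is an assumption


A = [2, 0, 1, 3]
B = [1, 1, 1, 0]
score = minmax(A, B)                       # minima sum to 2, maxima sum to 7
binary = minmax([1, 1, 0], [1, 0, 1])      # same as Tanimoto: c=1, a=1, b=1
```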
18
18 Pruning Search Space Using Bounds
Linear speedup (search C×D) for a fixed threshold, often by an order of magnitude or more.
Sub-linear speedup (search C×D^0.6) for top-K queries.
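The slide does not state which bound is used. One simple bound that holds for any two bit fingerprints is Tanimoto(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), where |·| is the number of on-bits, because the intersection cannot exceed the smaller set and the union cannot be smaller than the larger one. The sketch below assumes that bound, stores fingerprints as Python sets of on-bit positions, and uses hypothetical function names.

```python
def tanimoto(a, b):
    """Tanimoto similarity for fingerprints stored as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def search_with_pruning(query, database, threshold):
    """Return (hits, skipped): hits scoring >= threshold, and how many
    fingerprints were rejected from their bit counts alone."""
    hits, skipped = [], 0
    nq = len(query)
    for idx, fp in enumerate(database):
        nf = len(fp)
        hi = max(nq, nf)
        bound = min(nq, nf) / hi if hi else 1.0  # upper bound on Tanimoto
        if bound < threshold:
            skipped += 1   # cannot reach the threshold, skip the full comparison
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits, skipped


query = {1, 2, 3, 4}
database = [{1, 2, 3, 4, 5}, {1}, set(range(1, 13)), {1, 2, 3}]
hits, skipped = search_with_pruning(query, database, threshold=0.7)
```

If the database is sorted or binned by bit count ahead of time, whole ranges can be skipped without visiting each fingerprint, which is where most of the speedup would come from.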
20
20 Speedup from Pruning
Speedup depends on:
 –Threshold
 –Query
 –Fingerprint length
 –Database size
23
23 Bias in Query Distribution
26
26 Aggregate Queries (“Profiles”)
27
27 Two Basic Strategies
Similar to bioinformatics
1. Aggregate individual pairwise measures
2. Build a fingerprint profile
 –Linear approaches
 –Non-linear approaches (consensus, modal, etc.)
Hybrid (profile + aggregation/“scaling”)
Profile-profile
28
28 Aggregations
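As an illustration of aggregating individual pairwise measures, the sketch below scores one database fingerprint against a set of query fingerprints and combines the pairwise Tanimoto values with either the maximum or the mean. The choice of these two aggregators, the set-based fingerprint representation, and the function names are assumptions for the example.

```python
from statistics import mean


def tanimoto(a, b):
    """Tanimoto similarity for fingerprints stored as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def aggregate_score(queries, candidate, how="max"):
    """Combine the pairwise similarities between each query and the candidate."""
    scores = [tanimoto(q, candidate) for q in queries]
    if how == "max":
        return max(scores)
    if how == "mean":
        return mean(scores)
    raise ValueError(f"unknown aggregation: {how}")


queries = [{1, 2, 3}, {2, 3, 4}]
candidate = {1, 2, 3}
best = aggregate_score(queries, candidate, how="max")      # pairwise: 1.0, 0.5
average = aggregate_score(queries, candidate, how="mean")
```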
29
29 Consensus Fingerprints
Create consensus fingerprint
Search database using the consensus
[Figure: fingerprints combined with & to give the consensus]
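A minimal sketch of the two steps, assuming the consensus is the bitwise AND of the query fingerprints (other rules, such as a majority vote, are possible) and that fingerprints are stored as Python integers used as bit vectors. The function names are hypothetical.

```python
from functools import reduce


def consensus_fingerprint(fingerprints):
    """Bitwise AND of all fingerprints: keeps only bits set in every one."""
    return reduce(lambda x, y: x & y, fingerprints)


def tanimoto_int(a, b):
    """Tanimoto similarity for fingerprints stored as integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


profile = [0b1101, 0b0111, 0b1111]
consensus = consensus_fingerprint(profile)

# Rank a toy database against the consensus, best match first.
database = [0b0101, 0b1010, 0b0111]
ranked = sorted(range(len(database)),
                key=lambda i: tanimoto_int(consensus, database[i]),
                reverse=True)
```

An AND consensus gets sparser as more query molecules are added, so in practice the rule for forming the consensus matters; the AND here is only one option.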
30
30 Locality-Sensitive Hashing
Bin fingerprints based on projections onto randomly directed vectors
log D random vectors → O(log D)
Search for neighbors by returning the bin corresponding to the query’s projection
Has been used for clustering
May be useful for building diverse data sets
Not yet developed for searching
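The binning idea can be sketched with random-projection hashing: each random vector contributes one key bit, given by the sign of the fingerprint's projection onto it, and fingerprints sharing a key fall in the same bin. Using the sign of the projection is one common variant and is an assumption here, as are the Gaussian random vectors, the fixed seed, and all function names.

```python
import random
from collections import defaultdict


def make_random_vectors(n_vectors, n_bits, seed=0):
    """Draw n_vectors random directions in fingerprint space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_bits)]
            for _ in range(n_vectors)]


def bucket_key(fp, vectors):
    """One key bit per random vector: 1 if the projection is non-negative."""
    return tuple(1 if sum(w * x for w, x in zip(v, fp)) >= 0 else 0
                 for v in vectors)


def build_index(fingerprints, vectors):
    """Map each bucket key to the list of fingerprint indices in that bin."""
    index = defaultdict(list)
    for i, fp in enumerate(fingerprints):
        index[bucket_key(fp, vectors)].append(i)
    return index


def query_bin(index, fp, vectors):
    """Return the indices stored in the bin that the query falls into."""
    return index.get(bucket_key(fp, vectors), [])


fps = [[1, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 1, 1, 0, 1, 0, 0]]
vectors = make_random_vectors(n_vectors=3, n_bits=8)
index = build_index(fps, vectors)
neighbors = query_bin(index, fps[0], vectors)
```

A lookup costs one projection per random vector regardless of database size, but near neighbors can land in different bins, so practical schemes usually repeat the construction with several independent sets of vectors.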
31
31 Outline
Historical Progression
Similarity Measures
Fingerprint Construction
Pathological Cases
MinMax and Counts
Pruning Search Space
Aggregate Queries
LSH