Faculty of Computer Science © 2006 CMPUT 605February 04, 2008 Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc E, Cherkasov A., and Sahinalp S.C. Amit Satsangi
© 2006 Department of Computing Science CMPUT 605 Background and Focus Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin) Structural similarity Similar biological and/or physico-chemical properties (Maggiora et al.) Classification of probe compound (unknown bioactivity) Similarity search amongst compounds with known bioactivity
© 2006 Department of Computing Science CMPUT 605 Background and Focus Determining similarity distance measures (SDM) Using SDM for classification of compounds—k-NN classification Efficient data structures for fast similarity search— DMVP trees (an improvement over SCVP trees used previously)
© 2006 Department of Computing Science CMPUT 605 Outline Similarity measures Classification techniques k-NN classifier DMVP tree Results, Observations and Conclusion
© 2006 Department of Computing Science CMPUT 605 Similarity between Molecules Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines) Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools) * Quantitative Structure-Activity Relationship
© 2006 Department of Computing Science CMPUT 605 Similarity Measures Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y: X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases) Range of Tanimoto coefficient: [0, 1]
© 2006 Department of Computing Science CMPUT 605 Similarity measures Tanimoto Dist. Measure: D T (X,Y) = 1 –T(X,Y) Minkowski distance (L P ): Real valued data possible
© 2006 Department of Computing Science CMPUT 605 Classification Techniques Multiple Linear Regression (MLR) Linear Discriminant Analysis (LDA) Artifical Neural Networks (ANN) Support Vector Machines (SVM) k-nearest Neighbor (k-NN) classification not used previously.
© 2006 Department of Computing Science CMPUT 605 Distance-based Classification Compounds—s & r S & R respective descriptor arrays If D(S,R) is small then bioactivity levels of s & r are similar Notion of distance classification of new compounds Distance measure == metric (conditions) e.g. Hamming Distance, Tanimoto distance etc.
© 2006 Department of Computing Science CMPUT 605 k-nn Classification Given Bioactivity To Find Distance measure that separates active and inactive compounds for the training set N- dimensional plane Problem Easy
© 2006 Department of Computing Science CMPUT 605 k-nn Classification GGiven Bioactivity TTo Find Distance measure that separates active and inactive compounds for the training set N- dimensional plane PProblem NP-hard SSolution Use Genetic Algorithms, heuristic linear search to find the plane
© 2006 Department of Computing Science CMPUT 605 QSAR approach Uses a linear combination of descriptors Assigns a weight to each dimension, W [0,1] Weighted Minkowski distance of order 1 Only binary classification considered (A/I) Methods are general
© 2006 Department of Computing Science CMPUT 605 Parameter Optimization
© 2006 Department of Computing Science CMPUT 605 k-NN Classifier Set of data elements: {X1, … Xn} Query element: Y Range query Find Xi such that D(Y,Xi) < R1 (user defined) k-nn query Find k items such that their distance to Y is as small as possible
© 2006 Department of Computing Science CMPUT 605 Data structures: VP-Trees Vantage Point (VP) tree Choose an arbitrary data point (called Vantage Point) Binary tree—recursively partitions the dataset into two equal sized subsets Zero in on the nearest neighbor
© 2006 Department of Computing Science CMPUT 605 Efficient data structures: SCVP Trees Space Covering Vantage Point tree Multiple vantage points chosen at each level No more a binary tree—multiple branches at each internal node Multiple inner partitions—hope is that each data point lies in atleast one inner partition
© 2006 Department of Computing Science CMPUT 605 DMVP Tree Memory requirements of SCVP tree can be large—redundancy of data elements Deterministic selection of Vantage points VP minimization—NP-Hard Minimization == Weighted set cover problem Use of greedy Algorithm: O(log l); l<n Approximates the min number of VP’s
© 2006 Department of Computing Science CMPUT 605 Experiments Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug- like(1202) 62 dimensional descriptor array (30 QSAR & 32 physico- chemical properties) k=1 i.e. one NN Comparison with LDA, MLR, ANN 70% data used for training wL 1 distance is calculated in all cases
© 2006 Department of Computing Science CMPUT 605 Experimental Results Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR ANN beats k-NN on almost all counts Pruning—more than 80% in each kind of bioactivity (over brute-force search) Key point – k-NN classifier is faster More than 100 times faster than ANN
© 2006 Department of Computing Science CMPUT 605 Experimental Results Can calculate the level of bioactivity instead of a YES/NO The value of the weights provides insights into the importance of descriptors for each bioactivity
© 2006 Department of Computing Science CMPUT 605 Observations & Conclusion Bacterial metabolites & antimicrobial drugs overlap (confirmation) Human metabolites display distinctive properties QSAR models for drugs + human metabolites dominated by few descriptors These descriptors favored by drug developers and natural evolution
© 2006 Department of Computing Science CMPUT 605 Observations & Conclusion Classification results from k-NN can help rationalize the design and discovery of drugs DMVP tree improves the space utilization of the program Provides a means for fast similarity search Data structure can be applied to any metric distance like wL p and Tanimoto distance
© 2006 Department of Computing Science CMPUT 605 Thank You For Your Attention!