Faculty of Computer Science, CMPUT 605, February 04, 2008. Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search (Karakoc).


Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search. Karakoc E., Cherkasov A., and Sahinalp S.C. Presented by Amit Satsangi.

Background and Focus  Identification of molecules that play an active role in the regulation of biological processes or disease states (e.g., Aspirin)  Structural similarity implies similar biological and/or physico-chemical properties (Maggiora et al.)  Classification of a probe compound (unknown bioactivity) by similarity search among compounds with known bioactivity

Background and Focus  Determining similarity distance measures (SDMs)  Using SDMs for classification of compounds: k-NN classification  Efficient data structures for fast similarity search: DMVP trees (an improvement over the previously used SCVP trees)

Outline  Similarity measures  Classification techniques  k-NN classifier  DMVP tree  Results, observations and conclusion

Similarity between Molecules  Structural similarity: doubly bonded carbon pairs, presence of aromatic atoms, etc. (used in structural similarity search engines)  Similarity of chemical descriptors: atomic weight, hydrophobicity, charge, density, etc. (used in QSAR* tools) * Quantitative Structure-Activity Relationship

Similarity Measures  Tanimoto coefficient T(X,Y): given two descriptor sets X and Y represented as n-dimensional bit-vectors (the representation used by PubChem and some other databases), T(X,Y) = |X AND Y| / |X OR Y|, i.e., the number of bits set in both vectors divided by the number of bits set in either  Range of the Tanimoto coefficient: [0, 1]
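As an illustration, a minimal Python sketch of the Tanimoto coefficient for bit-vector fingerprints (function name and data are illustrative, not from the paper):

```python
def tanimoto(x, y):
    """Tanimoto coefficient for two equal-length 0/1 bit-vectors.

    T(X, Y) = |X AND Y| / |X OR Y|; defined here as 0.0 when both are all-zero.
    """
    both = sum(1 for a, b in zip(x, y) if a and b)   # bits set in both vectors
    either = sum(1 for a, b in zip(x, y) if a or b)  # bits set in either vector
    return both / either if either else 0.0

x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(tanimoto(x, y))  # 2 shared bits out of 4 set bits -> 0.5
```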

Similarity Measures  Tanimoto distance measure: D_T(X,Y) = 1 − T(X,Y)  Minkowski distance (L_p): D_p(X,Y) = (Σ_i |x_i − y_i|^p)^(1/p)  Applicable to real-valued data
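The Tanimoto distance and the Minkowski distance above can be sketched the same way (names are illustrative):

```python
def minkowski(x, y, p=2):
    """Minkowski distance L_p(X, Y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def tanimoto_distance(t):
    """Tanimoto distance D_T = 1 - T(X, Y), given a Tanimoto coefficient t."""
    return 1.0 - t

print(minkowski([0.0, 0.0], [3.0, 4.0], p=2))  # Euclidean distance: 5.0
print(minkowski([0.0, 0.0], [3.0, 4.0], p=1))  # Manhattan distance: 7.0
```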

Classification Techniques  Multiple Linear Regression (MLR)  Linear Discriminant Analysis (LDA)  Artificial Neural Networks (ANN)  Support Vector Machines (SVM)  k-Nearest Neighbor (k-NN) classification: not used previously for this task

Distance-based Classification  Compounds s and r, with respective descriptor arrays S and R  If D(S,R) is small, the bioactivity levels of s and r are similar  This notion of distance enables classification of new compounds  The distance measure must be a metric (non-negativity, symmetry, triangle inequality), e.g., Hamming distance, Tanimoto distance

k-NN Classification  Given: bioactivity labels for a training set  To find: a distance measure that separates active and inactive compounds in the training set by an N-dimensional plane  Problem: easy

k-NN Classification  Given: bioactivity labels for a training set  To find: a distance measure that separates active and inactive compounds in the training set by an N-dimensional plane  Problem: NP-hard  Solution: use genetic algorithms or heuristic linear search to find the plane

QSAR Approach  Uses a linear combination of descriptors  Assigns a weight w ∈ [0, 1] to each dimension  Weighted Minkowski distance of order 1 (wL_1)  Only binary classification (active/inactive) is considered, but the methods are general
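The weight-optimization step can be illustrated with a simple heuristic search. The sketch below perturbs one weight at a time to maximize leave-one-out 1-NN accuracy under the wL1 distance; this hill-climbing loop is a stand-in for the genetic-algorithm / heuristic linear search mentioned in the slides, and all names and data are hypothetical:

```python
import random

def wL1(w, x, y):
    """Weighted Minkowski distance of order 1: sum_i w_i * |x_i - y_i|."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))

def loo_accuracy(w, data, labels):
    """Leave-one-out 1-NN accuracy under the weighted L1 distance."""
    correct = 0
    for i, x in enumerate(data):
        j = min((j for j in range(len(data)) if j != i),
                key=lambda jj: wL1(w, x, data[jj]))
        correct += labels[j] == labels[i]
    return correct / len(data)

def hill_climb_weights(data, labels, dims, iters=200, seed=0):
    """Perturb one randomly chosen weight at a time, keeping improvements."""
    rng = random.Random(seed)
    w = [1.0] * dims
    best = loo_accuracy(w, data, labels)
    for _ in range(iters):
        cand = list(w)
        i = rng.randrange(dims)
        cand[i] = min(1.0, max(0.0, cand[i] + rng.uniform(-0.3, 0.3)))
        score = loo_accuracy(cand, data, labels)
        if score >= best:  # accept non-worsening moves
            w, best = cand, score
    return w, best

# Toy data: feature 0 separates the classes, feature 1 is noise.
data = [[0, 5], [0, 1], [0, 9], [1, 2], [1, 8], [1, 4]]
labels = [0, 0, 0, 1, 1, 1]
w, acc = hill_climb_weights(data, labels, dims=2, iters=100)
print(acc >= loo_accuracy([1.0, 1.0], data, labels))  # True: never worse than uniform weights
```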

© 2006 Department of Computing Science CMPUT 605 Parameter Optimization

k-NN Classifier  Set of data elements: {X1, …, Xn}  Query element: Y  Range query: find all Xi such that D(Y, Xi) < R1 (a user-defined radius)  k-NN query: find the k items whose distance to Y is smallest
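The two query types can be sketched directly (a brute-force baseline, not the paper's indexed search; names illustrative):

```python
import heapq

def range_query(data, dist, y, r1):
    """Range query: return all X_i with D(Y, X_i) < R1."""
    return [x for x in data if dist(y, x) < r1]

def knn_query(data, dist, y, k):
    """k-NN query: return the k elements of data closest to y."""
    return heapq.nsmallest(k, data, key=lambda x: dist(y, x))

def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

pts = [(0, 0), (1, 1), (5, 5), (2, 2)]
print(range_query(pts, l1, (0, 0), 3))  # [(0, 0), (1, 1)]
print(knn_query(pts, l1, (0, 0), 2))    # [(0, 0), (1, 1)]
```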

Data Structures: VP-Trees  Vantage Point (VP) tree  Choose an arbitrary data point as the vantage point  A binary tree that recursively partitions the dataset into two equal-sized subsets, inside and outside the median distance from the vantage point  Lets the search zero in on the nearest neighbor
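A compact sketch of VP-tree construction and branch-and-bound nearest-neighbor search under any metric; this illustrates the general technique, not the paper's implementation, and all names are hypothetical:

```python
import random
import statistics

class VPNode:
    def __init__(self, vp, mu, inside, outside):
        self.vp, self.mu, self.inside, self.outside = vp, mu, inside, outside

def build_vp_tree(points, dist, rng=None):
    """Recursively partition points by the median distance to a vantage point."""
    if rng is None:
        rng = random.Random(0)
    if not points:
        return None
    i = rng.randrange(len(points))
    vp, rest = points[i], points[:i] + points[i + 1:]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    mu = statistics.median([dist(vp, p) for p in rest])  # median split radius
    inside = [p for p in rest if dist(vp, p) <= mu]
    outside = [p for p in rest if dist(vp, p) > mu]
    return VPNode(vp, mu, build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))

def nn_search(node, dist, q, best=None):
    """Branch-and-bound NN search; the triangle inequality prunes branches."""
    if node is None:
        return best
    d = dist(q, node.vp)
    if best is None or d < best[0]:
        best = (d, node.vp)
    near, far = ((node.inside, node.outside) if d <= node.mu
                 else (node.outside, node.inside))
    best = nn_search(near, dist, q, best)
    if abs(d - node.mu) < best[0]:  # the far side may still hold a closer point
        best = nn_search(far, dist, q, best)
    return best

def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

points = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 0)]
tree = build_vp_tree(points, l1)
print(nn_search(tree, l1, (2, 1))[0])  # distance to the nearest neighbor: 1
```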

Efficient Data Structures: SCVP Trees  Space Covering Vantage Point tree  Multiple vantage points chosen at each level  No longer a binary tree: multiple branches at each internal node  Multiple inner partitions; the hope is that each data point lies in at least one inner partition

DMVP Tree  The memory requirements of the SCVP tree can be large, due to redundancy of data elements  Deterministic selection of vantage points  VP minimization is NP-hard  The minimization is equivalent to the weighted set cover problem  A greedy algorithm achieves an O(log l) approximation factor, l < n  Approximates the minimum number of VPs
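The greedy set-cover step can be illustrated as follows (unweighted for simplicity; names and data are hypothetical, not the paper's VP-selection code):

```python
def greedy_set_cover(universe, subsets):
    """Greedy set cover: repeatedly pick the subset covering the most
    still-uncovered elements; achieves an H(n) ~ ln(n) approximation factor."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= best
    return chosen

U = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(U, S))  # [{1, 2, 3}, {4, 5}]
```

In the DMVP setting, the "universe" would be the data points and each "subset" the points covered by a candidate vantage point's inner partition.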

Experiments  Five types of bioactivity: antibiotic (520), bacterial metabolite (562), human metabolite (1104), drug (958), drug-like (1202)  62-dimensional descriptor array (30 QSAR and 32 physico-chemical properties)  k = 1, i.e., one nearest neighbor  Comparison with LDA, MLR and ANN  70% of the data used for training  The wL1 distance is used in all cases

Experimental Results  Table 1 shows that in almost all cases k-NN outperforms LDA and MLR in accuracy, true/false positives and true/false negatives  ANN beats k-NN on almost all counts  Pruning: more than 80% of the search space (over brute-force search) for each kind of bioactivity  Key point: the k-NN classifier is fast, more than 100 times faster than ANN

Experimental Results  Can predict the level of bioactivity instead of a yes/no answer  The values of the weights provide insight into the importance of individual descriptors for each bioactivity

Observations and Conclusion  Bacterial metabolites and antimicrobial drugs overlap (confirming earlier findings)  Human metabolites display distinctive properties  QSAR models for drugs and human metabolites are dominated by a few descriptors  These descriptors are favored by both drug developers and natural evolution

Observations and Conclusion  Classification results from k-NN can help rationalize the design and discovery of drugs  The DMVP tree improves the space utilization of the program  It provides a means for fast similarity search  The data structure can be applied to any metric distance, such as wLp and the Tanimoto distance

© 2006 Department of Computing Science CMPUT 605 Thank You For Your Attention!