Domain of Applicability: A Cluster-Based Measure of Domain of Applicability of a QSAR Model
Robert Stanforth, 6 September 2005

Presentation transcript:


D_C = D_D + D_M + D_A - c    © IDBS 2005

Overview
- What is QSAR?
- Motivation
- Modelling the Dataset
- Measure of Distance from Domain
- Validation

What is QSAR?
- Quantitative Structure-Activity Relationships:
  BiologicalActivity = f(ChemicalStructure) + Error
- Descriptor-based QSAR
  - Descriptors measure chemical structure, e.g. topological indices of the chemical graph
- Use multivariate linear regression
  - Regress activity onto a high-dimensional descriptor space
- Problem of extrapolation
[Figure: example structures with topological index values, e.g. 3χ = 0, ..., 3χ = 1.802]
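The regression step above can be sketched as follows. This is a minimal illustration with synthetic descriptors and activities, not the presenter's data or code; the point is that a linear fit extrapolates freely outside its training region, which is what a domain-of-applicability measure should flag.

```python
import numpy as np

# Toy descriptor-based QSAR via multivariate linear regression.
# X: rows = compounds, columns = descriptors (e.g. topological indices).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 training compounds, 3 descriptors
true_coeffs = np.array([1.5, -0.7, 0.3])
y = X @ true_coeffs + 0.1 * rng.normal(size=20)    # activity = f(structure) + error

# Regress activity onto descriptor space (with an intercept term).
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# A query point far outside the training descriptor ranges: the model still
# returns a number, but it is pure extrapolation.
far_point = np.array([1.0, 10.0, 10.0, 10.0])
print(coeffs, far_point @ coeffs)
```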

Motivation
- A QSAR model is only valid within the domain of its training set
- Measure membership of this domain of applicability
- Provides assurance for:
  - External test set
  - k-fold cross validation
  - Prediction

Existing Methods
- Bounding Box
- Convex Hull
- Distance to Centroid
- Nearest Neighbour and k-NN Methods
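Two of the simpler checks listed above can be sketched as follows. This is a hypothetical minimal implementation for illustration, not the code behind the comparisons later in the talk.

```python
import numpy as np

def in_bounding_box(X_train, x):
    """Is x inside the axis-aligned box spanned by the training descriptors?"""
    return bool(np.all(x >= X_train.min(axis=0)) and np.all(x <= X_train.max(axis=0)))

def centroid_distance(X_train, x):
    """Euclidean distance from the training-set centroid."""
    return float(np.linalg.norm(x - X_train.mean(axis=0)))

# Four training compounds at the corners of the unit square in descriptor space.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(in_bounding_box(X_train, np.array([0.5, 0.5])))    # True: inside the box
print(in_bounding_box(X_train, np.array([2.0, 0.5])))    # False: outside the box
print(centroid_distance(X_train, np.array([0.5, 0.5])))  # 0.0: at the centroid
```

Note that neither check can describe a non-convex training region, which is the crudeness the cluster-based method addresses.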

K-Means for Clustering
- Use clusters to model the shape of the dataset
- The K-Means algorithm iteratively adjusts the partitioning into clusters to increase the accuracy of the model
- Computationally feasible
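The clustering step can be sketched with a bare-bones Lloyd's-algorithm loop. This is illustrative only; in practice a library implementation such as scikit-learn's KMeans would be used.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Bare-bones K-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step; keep the old centroid if a cluster goes empty.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return centroids, labels

# Two well-separated blobs of descriptor vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),    # blob around (0, 0)
               rng.normal(5.0, 0.1, (10, 2))])   # blob around (5, 5)
centroids, labels = kmeans(X, k=2)
print(sorted(int(round(float(c))) for c in centroids[:, 0]))  # [0, 5]: one centroid per blob
```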

Measure of Distance
- Use the K-Means model
- Based on distances to cluster centroids
- Fuzzy cluster membership
  - Weighted average of distances to cluster centroids, weighted according to cluster membership
- Computationally efficient
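One way such a fuzzy-weighted distance could be computed is sketched below. This is my reading of the slide, using fuzzy c-means style membership weights with fuzzifier `m`; the authors' exact weighting formula is not given in the transcript.

```python
import numpy as np

def domain_distance(x, centroids, m=2.0):
    """Weighted average of distances to cluster centroids, with weights from a
    fuzzy c-means style membership (an assumed form, for illustration)."""
    d = np.linalg.norm(centroids - x, axis=1)
    if np.any(d == 0):                       # exactly on a centroid
        return 0.0
    w = d ** (-2.0 / (m - 1.0))              # closer centroids get larger weights
    w /= w.sum()
    return float(np.sum(w * d))

centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
print(domain_distance(np.array([0.0, 0.0]), centroids))   # 0.0: at a centroid
near = domain_distance(np.array([1.0, 0.0]), centroids)
far = domain_distance(np.array([10.0, 0.0]), centroids)
print(near < far)   # True: points farther from all centroids score larger
```

Because the weights favour the nearest centroid, the resulting measure stays small anywhere near any cluster, so its contours can follow a non-convex dataset in a way a single centroid distance cannot.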

Measure of Distance
- Contour plot of the distance measure
- The first contour defines the boundary of the applicability domain

Internal Validation
- Assess the stability of the distance measure
- Use k-fold cross validation:
  - Leave out one group at a time
  - Retrain the distance measure
  - Record the mean relative change in distance for the compounds left out
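The internal-validation loop can be sketched as follows. For brevity the distance measure being retrained here is a plain distance-to-centroid stand-in rather than the cluster-based measure, and the data are synthetic.

```python
import numpy as np

def relative_deviation(X, n_folds=5, seed=0):
    """k-fold stability check: refit the distance measure with one fold left
    out and record the relative change in the left-out compounds' distances."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    centroid_full = X.mean(axis=0)
    devs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        centroid_part = X[train].mean(axis=0)      # "retrained" measure
        d_full = np.linalg.norm(X[fold] - centroid_full, axis=1)
        d_part = np.linalg.norm(X[fold] - centroid_part, axis=1)
        devs.append(np.abs(d_part - d_full) / np.maximum(d_full, 1e-12))
    return float(np.mean(np.concatenate(devs)))

X = np.random.default_rng(2).normal(size=(100, 3))
score = relative_deviation(X)
print(0.0 <= score < 1.0)   # True: a stable measure changes little under refitting
```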

Internal Validation

Method        | Averaged Relative Deviation
------------- | ---------------------------
Bounding Box  | 53.2%
Leverage      | 80.5%
k-NN          | 83.1%
Cluster-based | 43.2%

External Validation
- Assess the relationship between distance and prediction error
- Analyse mean-square prediction error over:
  - All 50 new compounds
  - Those inside the domain
  - Those outside the domain
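The bookkeeping for this comparison can be sketched as follows, with illustrative error values standing in for the real 50-compound test set.

```python
import numpy as np

def mse_by_domain(errors, inside):
    """Split prediction errors by domain membership and return the
    mean-square error over all compounds, those inside, and those outside."""
    errors, inside = np.asarray(errors, float), np.asarray(inside, bool)
    mse = lambda e: float(np.mean(e ** 2)) if len(e) else float("nan")
    return mse(errors), mse(errors[inside]), mse(errors[~inside])

errors = np.array([0.1, -0.2, 0.1, 1.5, -2.0])   # last two: poorly predicted
inside = np.array([True, True, True, False, False])
all_mse, in_mse, out_mse = mse_by_domain(errors, inside)
print(in_mse < out_mse)   # True: errors outside the domain are larger
```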

External Validation: Mean Square Prediction Error (compound counts in parentheses)

Method        | All (50) | Inside Domain | Outside Domain
------------- | -------- | ------------- | --------------
Bounding Box  |    -     |    - (27)     | 2.40 (23)
Leverage      |    -     |    - (48)     | 1.61 (2)
k-NN          |    -     |    - (45)     | 3.11 (5)
Cluster-based |    -     |    - (46)     | 3.58 (4)

Conclusions
- A quantitative measure is needed of the applicability of a descriptor-based QSAR model to a given structure
- Existing methods are all either too crude or too slow
- The new cluster-based method is computationally efficient and copes well with non-convex domains