A Cluster-Based Measure of Domain of Applicability of a QSAR Model
Robert Stanforth
6 September 2005

D C = D D + D M + D A - c © IDBS 2005

Overview
  - What is QSAR?
  - Motivation
  - Modelling the Dataset
  - Measure of Distance from Domain
  - Validation

What is QSAR?
  - Quantitative Structure-Activity Relationships
      BiologicalActivity = f(ChemicalStructure) + Error
  - Descriptor-based QSAR
      - Descriptors measure chemical structure
      - E.g. topological indices of the chemical graph
  - Use Multivariate Linear Regression
      - Regress activity onto a high-dimensional descriptor space
      - Problem of extrapolation
  [Figure: example structures labelled with descriptor values (0 and 1.802 legible; other labels lost)]
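
The extrapolation problem can be illustrated with a minimal sketch. It assumes a single descriptor and an invented quadratic relationship (`true_activity`); the function names and all numbers are hypothetical and do not come from the talk. A line fitted to training data in [0, 1] predicts reasonably inside that range and poorly far outside it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one descriptor only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def true_activity(x):
    # Hypothetical non-linear structure-activity relationship (assumption).
    return x * x


# Training descriptors cover only the interval [0, 1].
train_x = [i / 10 for i in range(11)]
train_y = [true_activity(x) for x in train_x]
a, b = fit_line(train_x, train_y)


def predict(x):
    return a + b * x


# Absolute prediction error inside and far outside the training range.
error_inside = abs(predict(0.5) - true_activity(0.5))
error_outside = abs(predict(3.0) - true_activity(3.0))
```

Comparing `error_inside` with `error_outside` shows why a prediction needs an accompanying measure of how far the query lies from the training data.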

Motivation
  - QSAR model only valid in the domain of its training set
  - Measure membership of this domain of applicability
  - Provides assurance of:
      - External test set
      - k-fold cross validation
      - Prediction

Existing Methods
  - Bounding Box
  - Convex Hull
  - Distance to Centroid
  - Nearest Neighbour and k-NN Methods
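
Three of the existing methods listed above can be sketched in a few lines. This is an illustrative sketch only: the function names, the toy training set, and the choice of Euclidean distance are assumptions, not details from the talk. The toy data form two separate groups, so the query point in the empty corner passes the bounding-box test and is closer to the centroid than the training points are, while its k-NN distance is large.

```python
import math


def in_bounding_box(train, x):
    """True if every descriptor of x lies within the training min/max."""
    return all(
        min(p[d] for p in train) <= x[d] <= max(p[d] for p in train)
        for d in range(len(x))
    )


def distance_to_centroid(train, x):
    """Euclidean distance from x to the mean of the training points."""
    centroid = [sum(col) / len(train) for col in zip(*train)]
    return math.dist(centroid, x)


def knn_distance(train, x, k=3):
    """Mean Euclidean distance from x to its k nearest training points."""
    nearest = sorted(math.dist(p, x) for p in train)[:k]
    return sum(nearest) / len(nearest)


# Hypothetical two-descriptor training set with two well-separated groups.
train = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (9, 10), (10, 9), (9, 9)]
query = (1, 9)  # lies in a region containing no training compounds

box_says_inside = in_bounding_box(train, query)
centroid_dist_query = distance_to_centroid(train, query)
centroid_dist_train = distance_to_centroid(train, (0, 0))
knn_dist_query = knn_distance(train, query)
knn_dist_typical = knn_distance(train, (0.5, 0.5))
```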

K-Means for Clustering
  - Use clusters to model the shape of the dataset
  - K-Means algorithm iteratively adjusts the partitioning into clusters to increase the accuracy of the model
  - Computationally feasible
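
The iterative adjustment described above can be sketched with the standard Lloyd's form of K-Means. The talk does not say which variant, initialisation, or stopping rule was used, so the random initialisation, fixed iteration count, and all names here are assumptions.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct training points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical two-descriptor dataset with two well-separated groups.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (9, 10), (10, 9), (9, 9)]
centroids, assignment = k_means(points, k=2)
```

Each iteration costs one distance evaluation per point per cluster, which is one reading of why the slide calls the approach computationally feasible.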

Measure of Distance
  - Use the K-Means model
  - Base on distances to cluster centroids
  - Fuzzy cluster membership
  - Weighted average of distances to cluster centroids, weighted according to cluster membership
  - Computationally efficient
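
A membership-weighted distance of the kind described above might look like the following sketch. The slide does not give the membership formula, so this borrows the fuzzy c-means convention (membership proportional to distance raised to the power -2/(m-1), with fuzziness m = 2) as an assumption. The centroid values and function names are hypothetical.

```python
import math


def fuzzy_memberships(x, centroids, fuzziness=2.0):
    """Return (memberships, distances) of x with respect to each centroid.

    Assumption: fuzzy c-means style weights, proportional to
    distance ** (-2 / (fuzziness - 1)), normalised to sum to 1.
    """
    dists = [math.dist(x, c) for c in centroids]
    if any(d == 0.0 for d in dists):
        # x coincides with a centroid: give it full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits], dists
    power = 2.0 / (fuzziness - 1.0)
    weights = [d ** -power for d in dists]
    total = sum(weights)
    return [w / total for w in weights], dists


def cluster_distance(x, centroids, fuzziness=2.0):
    """Average of the centroid distances, weighted by fuzzy membership."""
    memberships, dists = fuzzy_memberships(x, centroids, fuzziness)
    return sum(u * d for u, d in zip(memberships, dists))


# Hypothetical centroids, e.g. as produced by a K-Means run.
centroids = [(0.5, 0.5), (9.5, 9.5)]
near = cluster_distance((1, 1), centroids)  # close to the first cluster
far = cluster_distance((1, 9), centroids)   # far from both clusters
```

Because nearby clusters dominate the weights, the measure stays small close to any one cluster and grows in the gaps between clusters. Thresholding it would give a boundary that follows the shape of the data rather than a single box or sphere.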

Measure of Distance
  - Contour Plot
  - First contour defines the boundary of the applicability domain

Internal Validation
  - Assess stability of the distance measure
  - Use k-fold cross validation
      - Leave out one group at a time
      - Retrain the distance measure
      - Mean relative change in distance of the compounds left out
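
The stability check above can be sketched as follows. The relative change is assumed here to be |d_retrained - d_full| / d_full, averaged over all held-out compounds; the talk does not spell out the formula. Distance to the centroid is used purely as a stand-in distance measure, and all names and data are hypothetical.

```python
import math


def fit_centroid_distance(train):
    """Stand-in distance measure: Euclidean distance to the training centroid."""
    centroid = [sum(col) / len(train) for col in zip(*train)]
    return lambda x: math.dist(x, centroid)


def averaged_relative_deviation(data, fit, folds=4):
    """Mean relative change in distance of held-out points after retraining.

    `fit` takes a training set and returns a distance function.
    """
    full = fit(data)  # distance measure trained on every compound
    changes = []
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th point, starting at f
        kept = [p for i, p in enumerate(data) if i % folds != f]
        retrained = fit(kept)  # distance measure without the held-out group
        for x in held_out:
            d_full = full(x)
            if d_full > 0:  # relative change is undefined at zero distance
                changes.append(abs(retrained(x) - d_full) / d_full)
    return sum(changes) / len(changes)


# Hypothetical two-descriptor dataset.
data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (9, 10), (10, 9), (9, 9)]
deviation = averaged_relative_deviation(data, fit_centroid_distance)
```

A lower value means the measure changes less when part of the training set is withheld, which is the sense in which a method is called stable here.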

Internal Validation
  Method          Averaged Relative Deviation
  Bounding Box    53.2%
  Leverage        80.5%
  k-NN            83.1%
  Cluster-based   43.2%

External Validation
  - Assess relationship between distance and prediction error
  - Analyse mean-square prediction error over:
      - 50 new compounds
      - Those inside domain
      - Those outside domain
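
The analysis above amounts to splitting the new compounds by a distance threshold and computing the mean-square prediction error for each group. A minimal sketch, in which the record layout, the threshold, and the three example records are invented for illustration and are not data from the study:

```python
def mean_square_error(pairs):
    """pairs: sequence of (observed, predicted). Returns None if empty."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs)


def split_errors(records, threshold):
    """records: list of (distance, observed, predicted) per compound.

    Returns (count, mean-square error) for all compounds, for those
    inside the domain (distance <= threshold) and for those outside it.
    """
    everything = [(o, p) for _, o, p in records]
    inside = [(o, p) for d, o, p in records if d <= threshold]
    outside = [(o, p) for d, o, p in records if d > threshold]
    return {
        "all": (len(everything), mean_square_error(everything)),
        "inside": (len(inside), mean_square_error(inside)),
        "outside": (len(outside), mean_square_error(outside)),
    }


# Invented example: (distance from domain, observed, predicted).
records = [(0.5, 1.0, 1.5), (0.8, 2.0, 2.0), (3.0, 1.0, 3.0)]
summary = split_errors(records, threshold=1.0)
```

A useful distance measure should show a clearly higher error for the outside group than for the inside group.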

External Validation
  Mean Square Prediction Error (number of compounds in parentheses)
  Method          All (50)    Inside Domain     Outside Domain
  Bounding Box    [missing]   [missing] (27)    2.40 (23)
  Leverage        [missing]   [missing] (48)    1.61 (2)
  k-NN            [missing]   [missing] (45)    3.11 (5)
  Cluster-based   [missing]   [missing] (46)    3.58 (4)

Conclusions
  - Need a quantitative measure of the applicability of a descriptor-based QSAR model to a structure
  - Existing methods are all either too crude or too slow
  - Our new method is computationally efficient and copes well with non-convex domains