6th lecture Modern Methods in Drug Discovery WS07/08 -- More QSAR Problems: which descriptors to use, and how to test/validate QSAR equations (continued from lecture 5). QSAR equations form a quantitative connection between chemical structure and (biological) activity.

Evaluating QSAR equations (III) (Simple) k-fold cross validation: Partition your data set of N data points into k subsets (k < N). Generate k QSAR equations, each time using one subset as test set and the remaining k-1 subsets as training set. This gives you an average error over the k QSAR equations. In practice k = 10 has proven reasonable (= 10-fold cross validation).
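The k-fold partitioning scheme above can be sketched in a few lines of plain Python (the function names are illustrative, not from the lecture):

```python
import random

def k_fold_indices(n_points, k, seed=0):
    """Split indices 0..n_points-1 into k disjoint folds."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_points, k):
    """Yield (train, test) index lists; each fold serves exactly once as test set."""
    folds = k_fold_indices(n_points, k)
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 10-fold CV over 20 data points: every point appears in exactly one test set
splits = list(cross_validate(20, 10))
```

Fitting one QSAR equation per (train, test) pair and averaging the test errors gives the cross-validated error the slide refers to.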

Evaluating QSAR equations (IV) Leave-one-out cross validation: Partition your data set of N data points into k subsets (k = N). Disadvantages: computationally expensive; the partitioning into training and test set is more or less random, so the resulting average error can be way off in extreme cases. Solution: the (feature) distribution within the training and test sets should be identical or similar.

Evaluating QSAR equations (V) Stratified cross validation: same as k-fold cross validation, but each of the k subsets has a similar (feature) distribution. The resulting average error is thus more robust against errors due to unequal distribution between training and test sets.

Evaluating QSAR equations (VI) Cross-validation and leave-one-out (LOO) schemes: Leaving out one or more data points when deriving the equation results in the cross-validated correlation coefficient q2. This value is of course lower than the original r2. A q2 much lower than r2 indicates problems...
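The slide does not spell out the formula; q2 is conventionally defined as 1 - PRESS/SS, where PRESS sums the squared LOO prediction errors and SS the squared deviations from the mean. A minimal sketch (the activity values below are purely hypothetical):

```python
def q_squared(y_obs, y_pred_loo):
    """Cross-validated q2 = 1 - PRESS/SS (standard definition)."""
    mean_y = sum(y_obs) / len(y_obs)
    press = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred_loo))
    ss = sum((yo - mean_y) ** 2 for yo in y_obs)
    return 1.0 - press / ss

# hypothetical observed activities and their leave-one-out predictions
y = [1.0, 2.0, 3.0, 4.0]
y_loo = [1.2, 1.8, 3.3, 3.9]
q2 = q_squared(y, y_loo)
```

Because LOO predictions are made for points the model never saw, PRESS exceeds the residual sum of squares of the full fit, so q2 comes out below r2, as the slide notes.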

Evaluating QSAR equations (VII) One of the most reliable ways to test the performance of a QSAR equation is applying an external test set. → Partition your complete set of data into a training set (2/3) and a test set (ideally 1/3 of all compounds). The compounds of the test set should be representative (corresponds to a 1-fold stratified cross validation) → cluster analysis.

Interpretation of QSAR equations (I) The kind of variables/descriptors applied should enable us to draw conclusions about the underlying physico-chemical processes and to derive guidelines for the design of new molecules by interpolation. Some descriptors give information about the biological mode of action: A dependence on (log P)2 indicates a transport process of the drug to its receptor. A dependence on E(LUMO) or E(HOMO) indicates a chemical reaction. Example conclusion: higher affinity requires more fluorine, fewer OH groups.

Correlation of descriptors Other approaches to handle correlated descriptors and/or a wealth of descriptors: Transforming the descriptors into uncorrelated variables: principal component analysis (PCA), partial least squares (PLS), comparative molecular field analysis (CoMFA). Methods that intrinsically handle correlated variables: neural networks.

Partial least squares (I) The idea is to construct a small set of latent variables ti (orthogonal to each other and therefore uncorrelated) from the pool of inter-correlated descriptors xi. In this case t1 and t2 result as the normal modes of x1 and x2, where t1 shows the larger variance.

Partial least squares (II) The predicted quantity y is then given by a QSAR equation in the latent variables ti, y = b0 + b1 t1 + b2 t2 + ..., where each ti is a linear combination of the original descriptors xi. The number of latent variables ti is chosen to be (much) smaller than the number of original descriptors xi. But how many latent variables are reasonable?
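For a single response, the first PLS latent variable can be sketched very compactly: its weight vector is proportional to X^T·y, which maximizes the covariance of t1 with y. This is a minimal sketch of one-component PLS1, assuming X and y are already mean-centered; the descriptor values are hypothetical:

```python
def pls_first_component(X, y):
    """First PLS latent variable t1 = X·w1 with w1 proportional to X^T·y
    (PLS1, one component; X given as a list of rows)."""
    n, p = len(X), len(X[0])
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = sum(wj * wj for wj in w) ** 0.5
    w = [wj / norm for wj in w]                       # unit-length weight vector
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, t

# two correlated (hypothetical) descriptors, mean-centered activities
X = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.0], [-2.0, 0.0]]
y = [1.0, -1.0, 2.0, -2.0]
w, t = pls_first_component(X, y)
```

Further components are obtained the same way after deflating X and y by what t1 already explains; regressing y on the retained ti gives the coefficients bi of the QSAR equation.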

Principal Component Analysis PCA (I) Principal component analysis determines the normal modes of a set of descriptors/variables. This is achieved by a coordinate transformation resulting in new axes. The first principal component then shows the largest variance of the data. The second and further principal components are orthogonal to each other. Problem: which are the (decisive) significant descriptors?

Principal Component Analysis PCA (II) The first component (pc1) shows the largest variance, the second component the second largest variance, and so on. Lit: E.C. Pielou: The Interpretation of Ecological Data, Wiley, New York, 1984

Principal Component Analysis PCA (III) The significant principal components usually have an eigenvalue > 1 (Kaiser-Guttman criterion). Frequently there is also a kink that separates the less relevant components (scree test).

Principal Component Analysis PCA (IV) The obtained principal components should account for more than 80% of the total variance.
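The variance fractions come straight from the eigenvalues of the covariance matrix: each component's share is its eigenvalue divided by the trace. For two descriptors the eigenvalues have a closed form, which keeps the sketch self-contained (the covariance values below are hypothetical):

```python
import math

def pca_2d(cov):
    """Eigenvalues of a 2x2 covariance matrix [[a, b], [b, c]] in closed form,
    returned as each component's fraction of the total variance."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    root = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1 = (a + c + root) / 2          # pc1: largest variance
    lam2 = (a + c - root) / 2          # pc2: orthogonal to pc1
    total = lam1 + lam2                # = trace of the covariance matrix
    return lam1 / total, lam2 / total

# two strongly correlated descriptors: pc1 captures most of the variance
frac1, frac2 = pca_2d([[4.0, 1.8], [1.8, 1.0]])
```

With more descriptors the same quantities come from a numerical eigendecomposition; components are kept until the cumulative fraction passes the 80% mark (or, per the Kaiser-Guttman criterion, while the eigenvalues of the standardized data exceed 1).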

Principal Component Analysis (V) Example: What descriptors determine the logP?
Properties analyzed (loadings on pc1/pc2/pc3 given in the slide's table): dipole moment, polarizability, mean of +ESP, mean of -ESP, variance of ESP, minimum ESP, maximum ESP, molecular volume, surface.
Fraction of total variance: pc1 28%, pc2 22%, pc3 10%.
Lit: T. Clark et al. J.Mol.Model. 3 (1997) 142

Comparative Molecular Field Analysis (I) The molecules are placed into a 3D grid and at each grid point the steric and electrostatic interaction with a probe atom is calculated (force field parameters). Lit: R.D. Cramer et al. J.Am.Chem.Soc. 110 (1988) 5959. Problems: the "active conformation" of the molecules is needed; all molecules must be superimposed (aligned according to their similarity). For this purpose the GRID program can be used: P.J. Goodford J.Med.Chem. 28 (1985) 849.

Comparative Molecular Field Analysis (II) The resulting coefficients for the matrix S (N grid points, P probe atoms) have to be determined using a PLS analysis.

compound   log(1/C)   S1  S2  S3 ...  P1  P2  P3 ...
steroid1   4.15
steroid2   5.74
steroid3   8.83
...

Comparative Molecular Field Analysis (III) Application of CoMFA: affinity of steroids to the testosterone binding globulin. Lit: R.D. Cramer et al. J.Am.Chem.Soc. 110 (1988) 5959.

Comparative Molecular Field Analysis (IV) Analogous to QSAR descriptors, the CoMFA variables can be interpreted. Here (color coded) contour maps are helpful: yellow marks regions of unfavorable steric interaction, blue marks regions of favorable steric interaction. Lit: R.D. Cramer et al. J.Am.Chem.Soc. 110 (1988) 5959

Comparative Molecular Similarity Indices Analysis (CoMSIA) A CoMFA variant based on similarity indices at the grid points. Lit: G.Klebe et al. J.Med.Chem. 37 (1994). Comparison of CoMFA and CoMSIA potentials shown along one axis of benzoic acid.

Neural Networks (I) Neural networks can be regarded as a common implementation of artificial intelligence. The name is derived from the network-like connections between the switches (neurons) within the system. Thus they can also handle inter-correlated descriptors. Of the many types of neural networks, backpropagation nets and unsupervised maps are the most frequently used. Application: modeling of a (regression) function.

Neural Networks (II) A typical backpropagation net consists of neurons organized in an input layer, one or more hidden layers, and an output layer. Furthermore, the actual kind of signal transduction between the neurons can differ.
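The layer structure above amounts to a chain of weighted sums passed through a nonlinearity. A minimal forward pass for a 2-input, 2-hidden-neuron, 1-output net with sigmoid activation (the weights are illustrative; finding good weights is what backpropagation training does):

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    """Forward pass input -> hidden layer -> output neuron."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# hypothetical weights: one row of w_hidden per hidden neuron
y = forward([0.5, -1.0],
            [[1.0, -0.5], [0.3, 0.8]],   # input -> hidden weights
            [1.2, -0.7])                  # hidden -> output weights
```

Because every descriptor feeds every hidden neuron, correlated inputs pose no formal problem for the model, which is why the slide lists neural nets among the methods that intrinsically handle correlated variables.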

Recursive Partitioning Instead of quantitative values, often only qualitative information is available, e.g. substrates versus non-substrates. Thus we need classification methods such as: decision trees, support vector machines, (neural networks: partition at what score value?)

Decision Trees Iterative classification. Advantages: interpretation of results, design of new compounds with desired properties. Disadvantage: local minima problem when choosing the descriptors at each branching point. Lit: J.R. Quinlan Machine Learning 1 (1986) 81.
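At each branching point, tree methods in the Quinlan tradition greedily pick the descriptor and threshold with the highest information gain, which is exactly the "local minima" risk the slide mentions. A sketch of that scoring step (the logP values and class labels are hypothetical):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on descriptor <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    split_h = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - split_h

# a perfect split recovers the full entropy of the parent node as gain
logp = [0.5, 1.0, 2.5, 3.0]
cls = ["non-substrate", "non-substrate", "substrate", "substrate"]
gain = information_gain(logp, cls, threshold=2.0)   # -> 1.0 bit
```

Because the choice at each node is made without look-ahead, an early split that scores well locally can still lead to a globally suboptimal tree.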

Support Vector Machines Support vector machines generate a hyperplane in the multi-dimensional space of the descriptors that separates the data points. Advantages: accuracy; only a minimum of data points (= the support vectors) is used. Disadvantage: interpretation of results, design of new compounds with desired properties.

Property prediction: So what? Classical QSAR equations: small data sets, few descriptors that are (hopefully) easy to understand. Partial least squares: small data sets, many descriptors. CoMFA: small data sets, many descriptors. Neural nets: large data sets, some descriptors. Support vector machines: large data sets, many descriptors. For the latter methods the interpretation of results is often difficult (black box methods).

Interpretation of QSAR equations (II) Caution is required when extrapolating beyond the underlying data range. Outside this range no reliable predictions can be made ("Beyond the black stump...").

Interpretation of QSAR equations (III) There should be a reasonable connection between the descriptors used and the predicted quantity. Example: "scientific proof" that babies are delivered by storks. Lit: H. Sies Nature 332 (1988) 495. Corresponding data can be found at /home/stud/mihu004/qsar/storks.spc

Interpretation of QSAR equations (IV) According to statistics, more people die after being hit by a donkey than from the consequences of an airplane crash.