CZ3253: Computer Aided Drug Design
Drug Design Methods I: QSAR
Prof. Chen Yu Zong
Tel: 6874-6877
Room 07-24, Level 7, SOC1, National University of Singapore

2 Terminology
SAR (Structure-Activity Relationships)
– Circa 19th century?
QSAR (Quantitative Structure-Activity Relationships)
– Specific to some biological/pharmaceutical function of a molecule (ADME: Absorption, Distribution, Metabolism, Excretion)
– Brown and Fraser (1868-9): 'constitution' related to biological response
– LogP
QSPR (Quantitative Structure-Property Relationships)
– Relate structure to any physical-chemical property of a molecule

3 Statistical Models
Simple
– Mean, median and variation
– Regression
Advanced
– Validation methods
– Principal components, covariance
– Multiple regression
QSAR, QSPR

4 Modern QSAR
Hansch et al. (1963): Activity ∝ 'travel through body' ∝ partitioning between varied solvents
– C (minimum dosage required)
– π (hydrophobicity)
– σ (electronic)
– E_s (steric)
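These terms combine in the classic Hansch equation; a common linear form (a reconstruction of the standard textbook expression, not copied from the slide) is

$$\log(1/C) = a\,\pi + b\,\sigma + c\,E_s + d$$

where a, b, c and d are regression coefficients fitted to the training data.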

5 Choosing Descriptors
Buffon's Problem
– Needle length?
– Needle color?
– Needle composition?
– Needle sheen?
– Needle orientation?

6 Choosing Descriptors
Constitutional
– MW, number of atoms of each element
Topological
– Connectivity, Wiener index (sums of bond distances)
– 2D fingerprints (bit-strings)
– 3D topographical indices, pharmacophore keys
Electrostatic
– Polarity, polarizability, partial charges
Geometrical
– Length, width, molecular volume

7 Choosing Descriptors
Chemical
– Hydrophobicity (LogP)
– HOMO and LUMO energies
– Vibrational frequencies
– Bond orders
– Total energy
– ΔG, ΔS, ΔH

8 Statistical Methods
1-D analysis
Large-dimension sets require decomposition techniques
– Multiple regression
– PCA
– PLS
Connecting a descriptor with a structural element so as to interpolate and extrapolate data

9 Simple Error Analysis (1-D)
Given N data points:
– Mean
– Variance
– Regression
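The slide's equations were images; the standard definitions behind the first two bullets are

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$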

10 Simple Error Analysis (1-D)
Given N data points:
– Regression
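For the straight-line fit $y = a + bx$, the least-squares estimates (standard formulas, assumed to match the slide's missing equations) are

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$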

11 Simple Error Analysis (1-D)
Given N data points:
– R² ranges from 0 (poor) to 1 (good)
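One standard definition of this statistic (assumed, since the slide's own equation is not reproduced) is

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$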

12 Correlation vs. Dependence
Correlation
– Two or more variables/descriptors may correlate to the same property of a system
Dependence
– When the correlation can be shown to be due to a change in one variable causing a change in the other
Example: an elephant's head and legs
– Correlation exists between the size of the head and the size of the legs
– The size of one does not depend on the size of the other

13 Quantitative Structure Activity/Property Relationships (QSAR, QSPR)
Discern relationships between multiple variables (descriptors)
Identify connections between structural traits (type of subunits, bond angles, local components) and descriptor values (e.g. activity, LogP, % denatured)

14 Pre-Qualifications
Size
– Minimum of FIVE samples per descriptor
Verification
– Variance
– Scaling
– Correlations

15 QSAR/QSPR Pre-Qualifications
Variance
– Coefficient of variation
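The usual definition (an assumption about the slide's missing formula): for a descriptor with sample standard deviation $s$ and mean $\bar{x}$,

$$\mathrm{CV} = \frac{s}{\bar{x}}$$

Descriptors with near-zero variation carry little information for the model.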

16 QSAR/QSPR Pre-Qualifications
Scaling
– Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis

17 QSAR/QSPR Pre-Qualifications
Scaling (see the sketch below)
– Unit variance (auto-scaling): ensures equal statistical weights (initially)
– Mean centering
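A minimal NumPy sketch of these two operations (the function name and structure are illustrative, not from the course):

```python
import numpy as np

def autoscale(X):
    """Mean-center each descriptor column, then scale to unit variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)          # mean centering
    sd = centered.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                      # guard against constant descriptors
    return centered / sd                   # unit variance (auto-scaling)
```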

18 QSAR/QSPR Pre-Qualifications
Correlations
– Remove correlated descriptors, keeping one from each correlated group, so as to reduce the data set size
– Or apply a mathematical operation that removes the correlation (PCR)

19 QSAR/QSPR Pre-Qualifications Correlations

20 QSAR/QSPR Scheme
Goal
– Predict what happens next (extrapolate)!
– Predict what happens between data points (interpolate)!

21 QSAR/QSPR Scheme
Types of variable
– Continuous: concentration, occupied volume, partition coefficient, hydrophobicity
– Discrete: structural (1: methyl group substituted, 0: no methyl group substitution)

22 QSAR/QSPR Principal Components Analysis
Reduces the dimensionality of the descriptors
Principal components are a set of vectors representing the variance in the original data

23 Principal components – reducing the dimensionality of a dataset
(Figure: scatter plot of y against x.) Clearly there is a relationship between x and y - a high correlation. We can define a new variable z = x + y such that most of the variation in the data is expressed by the new variable z. This new variable is a principal component. In general, each principal component is a linear combination of the original variables, $p_i = \sum_{j=1}^{v} c_{i,j}\, x_j$, where $p_i$ is the i-th principal component and $c_{i,j}$ is the coefficient of the variable $x_j$. There are v such variables.

24 QSAR/QSPR-Principal Components Analysis
Geometric analogy (3-D to 2-D PCA). (Figure: data on x, y, z axes projected onto a plane.)

25 Principal components
PCA is the transformation of a set of correlated variables to a set of orthogonal, uncorrelated variables called principal components. These new variables are linear combinations of the original variables, in decreasing order of importance. (Figure: the data matrix decomposed into loadings, a measure of the variation between variables; scores, a measure of the variation between samples; and eigenvalues.)

26 QSAR/QSPR Principal Components Analysis
Formulate the matrix
Diagonalize the matrix
Eigenvectors are the principal components (see the sketch below)
– These principal components (new descriptors) are linear combinations of the original descriptors
Eigenvalues represent variance
– The largest accounts for the greatest % of the data variance
– The next corresponds to the second greatest, and so on
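A minimal NumPy sketch of this recipe (covariance matrix, then diagonalization; all names are illustrative):

```python
import numpy as np

def pca(X):
    """PCA via eigendecomposition of the P x P descriptor covariance matrix.

    X is an (N molecules) x (P descriptors) data matrix.
    Returns eigenvalues (variances), loadings (eigenvectors) and scores.
    """
    Xc = X - X.mean(axis=0)               # mean-center each descriptor
    cov = np.cov(Xc, rowvar=False)        # P x P variance-covariance matrix
    evals, evecs = np.linalg.eigh(cov)    # diagonalize (eigh: symmetric matrix)
    order = np.argsort(evals)[::-1]       # largest variance first
    evals, evecs = evals[order], evecs[:, order]
    scores = Xc @ evecs                   # project the molecules onto the PCs
    return evals, evecs, scores
```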

27 QSAR/QSPR-Principal Components Analysis
Formulate the matrix (several types)
– Correlation or covariance matrix from the N × P data matrix (N is the number of molecules, P the number of descriptors)
– Variance-covariance matrix (N × N)
Diagonalize (rotate) the matrix

28 QSAR/QSPR-Principal Components Analysis
Eigenvectors (loadings)
– Represent the contribution from each original descriptor to a PC (new descriptor)
– # columns = # of descriptors; # rows = # of descriptors OR # of molecules
Eigenvalues
– Indicate which PC is most important (most representative of the original descriptors)
– Benzene has 2 non-zero and 1 zero eigenvalue (planar)

29 QSAR/QSPR-Principal Components Analysis
Scores
– Graph each object/molecule in the space of 2 or more PCs
– # rows = # of objects/molecules; # columns = # of descriptors OR # of molecules
– For benzene this corresponds to a graph in the PC1 (x′) and PC2 (y′) system

30 Principal components
(Figure: data in the x-y plane with the PC1 and PC2 axes overlaid.) The PCs each maximize the variance in the data in orthogonal directions and are ordered by size. Usually only a few components are needed to explain most (>90%) of the variance in the data; if they are not, the chosen properties may not be relevant. The first step is to calculate the variance-covariance matrix from the data.

31 Principal components
(Figure: the same x-y data with the PC1 and PC2 axes.) If there are s observations, each of which contains v values, the data can be represented by a matrix D with v rows and s columns. The variance-covariance matrix is $Z = DD^{T}$. The eigenvectors of Z are the principal components. Z is a square symmetric matrix, so the eigenvectors are orthogonal. Usually the matrix is diagonalized to obtain the eigenvectors (the weightings for the properties) and the eigenvalues (the explained variance).

32 Principal components
The output looks like this: a table of loadings with one row per property and one column per principal component (p1 … p5), plus the eigenvalues, which explain the % variance captured by each component. A molecule's score on a component is obtained by multiplying its property values by that eigenvector's loadings. One can then do regression on the PCs, e.g. V = 0.3 PC1 (0.1) + 0.2 PC2 (0.1) + 0.4 (0.2), so a 5-property problem has been reduced to a two-property problem.

33 QSAR on SYBYL (Tripos Inc.)

34 QSAR on SYBYL (Tripos Inc.)
10-D → 3-D

35 QSAR on SYBYL (Tripos Inc.)
Eigenvalues ↔ explanation of variance in the data

36 QSAR on SYBYL (Tripos Inc.)
Each point corresponds to a column (# points = # descriptors) in the original data
Proximity ↔ correlation

37 QSAR on SYBYL (Tripos Inc.)
Each point corresponds to a row of the original data (i.e. # points = # molecules): a graph of the molecules in PC space. (Figure: scores plot with He, H2O and naphthalene arranged along a molecular-size axis from small to big.)
Proximity ↔ similarity

38 QSAR on SYBYL (Tripos Inc.)
(Figure: plot with an outlier marked.)

39 QSAR on SYBYL (Tripos Inc.)

40 QSAR/QSPR-Regression Types Principal Component Analysis

41 QSAR/QSPR-Regression Types Principal Component Analysis

42 Non-Linear Mappings
Calculate the "distance" between points in N-dimensional descriptor/parameter space
– Euclidean
– City-block distances
Randomly assign compounds in the set to points in a 2-D or 3-D space
Minimize the difference (optimal N-D → 2-D plot), as in the sketch below
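A hedged sketch of such a mapping (this is essentially Sammon's nonlinear mapping; the implementation details are assumptions, not taken from the course):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def nonlinear_map(X, ndim=2, seed=0):
    """Place N compounds in `ndim` dimensions so that low-dimensional
    pairwise distances match the N-dimensional descriptor-space distances."""
    D = pdist(X)                                   # Euclidean distances in N-D
    rng = np.random.default_rng(seed)
    y0 = rng.normal(size=(len(X), ndim)).ravel()   # random initial 2-D/3-D layout

    def stress(y_flat):
        d = pdist(y_flat.reshape(-1, ndim))        # distances in the low-D plot
        return np.sum((D - d) ** 2 / np.maximum(D, 1e-12))  # Sammon-type stress

    res = minimize(stress, y0, method="L-BFGS-B")  # minimize the difference
    return res.x.reshape(-1, ndim)
```

As the next slide notes, the result depends on the initial guess, so seeding y0 with PCA scores instead of random points is a common refinement.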

43 Non-Linear Mappings
Advantages
– Non-linear
– No assumptions!
– Chance groupings unlikely (a 2-D group is likely an N-D group)
Disadvantages
– Dependence on the initial guess (use PCA scores to improve)

44 QSAR/QSPR-Regression Types
Multiple Regression (MR)
PCR
PLS

45 QSAR/QSPR-Regression Types
Linear regression
– Minimize the difference between calculated and observed values (the residuals)
Multiple regression
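A minimal sketch of a multiple-regression fit over a descriptor matrix (illustrative names; nothing here is from the course software):

```python
import numpy as np

def multiple_regression(X, y):
    """Least-squares fit of y ~ b0 + b1*x1 + ... + bP*xP over P descriptors."""
    A = np.column_stack([np.ones(len(X)), X])     # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimize squared residuals
    return coef                                   # [b0, b1, ..., bP]
```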

46 QSAR/QSPR-Regression Types
Principal component regression (PCR)
– Regression, but with principal components substituted for the original descriptors/variables
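Combining the two sketches above gives an equally hedged PCR sketch (it reuses the pca() and multiple_regression() functions defined earlier):

```python
def pcr(X, y, n_components=2):
    """Principal component regression: ordinary regression on the leading PC scores."""
    _, _, scores = pca(X)                  # pca() as sketched above
    return multiple_regression(scores[:, :n_components], y)
```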

47 QSAR/QSPR-Regression Types
Partial least squares (PLS)
– Cross-validation determines the number of descriptors/components to use
– Derive the equation
– Use bootstrapping and t-tests to test the coefficients in the QSAR regression

48 QSAR/QSPR-Regression Types
Partial least squares (a.k.a. Projection to Latent Structures)
– A regression of a regression: y is regressed on latent variables $t_i = \sum_j b_{i,j} x_j$ with coefficients $a_i$
– Provides insight into the variation in the x's (the $b_{i,j}$'s, as in PCA) AND the y's (the $a_i$'s)
– The $t_i$'s are orthogonal
– M = min(# of variables/descriptors, # of observations/molecules), whichever is smaller
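In practice PLS is rarely coded by hand; a sketch using scikit-learn's PLSRegression (assuming scikit-learn is available; the data here are dummies):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(50, 10)    # dummy data: 50 molecules x 10 descriptors
y = np.random.rand(50)        # dummy activities

pls = PLSRegression(n_components=3)
q2 = cross_val_score(pls, X, y, cv=5, scoring="r2").mean()  # cross-validated fit quality
pls.fit(X, y)                 # final model on all data
```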

49 QSAR/QSPR-Regression Types
PLS is NOT MR or PCR in practice
– PLS is MR with cross-validation
– PLS is faster: it couples the target representation (QSAR generation) with component generation, whereas in PCR the PCA and regression steps are separate
PLS is well suited to multivariate problems

50 QSAR/QSPR Post-Qualifications
Confidence in the regression
– TSS: total sum of squares
– ESS: explained sum of squares
– RSS: residual sum of squares
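The standard definitions (the slide's own equations are not reproduced; these are the usual forms) are

$$\mathrm{TSS} = \sum_i (y_i - \bar{y})^2, \quad \mathrm{ESS} = \sum_i (\hat{y}_i - \bar{y})^2, \quad \mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$$

with $R^2 = \mathrm{ESS}/\mathrm{TSS} = 1 - \mathrm{RSS}/\mathrm{TSS}$ when the model includes an intercept.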

51 QSAR/QSPR Post-Qualifications
Confidence in prediction: PRESS (predictive error sum of squares)
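The usual definition (assumed, as the slide's equation is missing): with $\hat{y}_{(i)}$ the prediction for point $i$ from a model fitted without point $i$,

$$\mathrm{PRESS} = \sum_i \left(y_i - \hat{y}_{(i)}\right)^2, \qquad q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}}$$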

52 QSAR/QSPR Post-Qualification
Bias?
– Bootstrapping
Choosing the best model?
– Cross-validation

53 QSAR/QSPR Post-Qualification
Bootstrapping (see the sketch below)
– ASSUME the calculated data are the experimental/observed data
– Randomly choose N data points (allowing multiple picks of the same point)
– Re-generate the parameters/regression
– Repeat M times
– Average over the M bootstraps
– Compare (calculate the residual): if close to zero there is no bias; if large, bias exists
– M is typically …
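A self-contained sketch of this bootstrap loop (all names are illustrative):

```python
import numpy as np

def fit_coefs(X, y):
    """Ordinary least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def bootstrap_bias(X, y, n_boot=200, seed=0):
    """Refit on M=n_boot resamples (with replacement), average, compare to the full fit."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = [fit_coefs(X[idx], y[idx])
             for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.mean(coefs, axis=0) - fit_coefs(X, y)   # near zero => no evident bias
```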

54 QSAR/QSPR Post-Qualification
Cross-validation (used in PLS)
– Remove one or more pieces of the input data
– Re-derive the QSAR equation
– Calculate the omitted data
– Compute the root-mean-square error to evaluate the efficacy of the model
Typically 20% of the data is removed in each iteration. The model with the lowest RMS error has the optimal number of components/descriptors (a sketch follows).
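A hedged sketch of this selection loop for PLS (assuming scikit-learn; 5 folds omits roughly 20% of the data per iteration, as the slide suggests):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def choose_n_components(X, y, max_components=10):
    """Return the component count with the lowest cross-validated RMS error."""
    cv = KFold(n_splits=5, shuffle=True, random_state=0)   # ~20% omitted per fold
    rms = []
    for m in range(1, max_components + 1):
        pred = cross_val_predict(PLSRegression(n_components=m), X, y, cv=cv)
        rms.append(np.sqrt(np.mean((y - pred.ravel()) ** 2)))
    return int(np.argmin(rms)) + 1                         # lowest RMS error wins
```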

55 QSPR Example
Relation between musk odorant properties and benzenoid structure
– Training set of 148 compounds (81 non-musk and 67 musk)
– 47 chemical descriptors initially
– Pre-qualifications: correlations (47 - 12 = 35)
– Post-qualifications: bootstrapping
– Test set: 6/6 musks, 8/9 non-musks
Narvaez, J. N., Lavine, B. K. and Jurs, P. C., Chemical Senses, 11 (1986)

56 Practical Issues
Ten times as many compounds as parameters fitted; 3-5 compounds per descriptor
Traditional QSAR
– Good for activity prediction
– Not good for determining whether activity is due to binding or transport

57 Advanced Methods
Neural networks
Support vector machines
Genetic/evolutionary algorithms
Monte Carlo
Alternate descriptors
– Reduced graphs
– Molecular connectivity indices
– Indicator variables (0 or 1)
Combinatorics (e.g. multiple substituent sites)

58 Tools Available
Sybyl (Tripos Inc.)
Insight II (Accelrys Inc.)
Pole Bio-Informatique Lyonnais – Molecular Biology – english/logiciels.html

59 Summary
QSAR/QSPR
– Statistics connect structure/behavior with observables
– Interpolate/extrapolate
Multivariate analysis
– Pre-qualification
– Regression: PCA, PLS, MR
– Post-qualification