3D QSAR Methodology (PCA & PLS) Computational Systems Biology Young-Mook Kang.


Comparative Molecular Field Analysis (CoMFA): electrostatic fields from the Coulomb potential; steric fields from the Lennard-Jones potential. Ref: Cramer R.D. III; Patterson D.E.; Bunce J.D. J. Am. Chem. Soc. 1988, 110.
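The two field types named on this slide can be sketched as energies felt by a probe placed at each grid point. This is only an illustration under stated assumptions: the probe charge, the Lennard-Jones parameters `epsilon` and `r_min`, the 30 kcal/mol truncation, and the two-atom "molecule" are hypothetical values chosen for the example, not parameters taken from the slides.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), converts e^2/Angstrom to kcal/mol


def coulomb_energy(q_probe, q_atom, r, dielectric=1.0):
    """Coulomb energy between two point charges separated by r (Angstrom)."""
    return COULOMB_CONSTANT * q_probe * q_atom / (dielectric * r)


def lennard_jones_energy(epsilon, r_min, r):
    """12-6 Lennard-Jones energy with well depth epsilon located at r_min."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)


def clamp(value, limit):
    """Truncate an energy to the interval [-limit, +limit]."""
    return max(-limit, min(limit, value))


def probe_fields(point, atoms, q_probe=1.0, epsilon=0.1, r_min=3.5, cutoff=30.0):
    """Return (electrostatic, steric) energies for a probe at `point`.

    `atoms` is a list of (x, y, z, partial_charge) tuples. All parameter
    defaults are illustrative assumptions.
    """
    elec = 0.0
    steric = 0.0
    for x, y, z, charge in atoms:
        r = max(math.dist(point, (x, y, z)), 1e-3)  # avoid division by zero
        elec += coulomb_energy(q_probe, charge, r)
        steric += lennard_jones_energy(epsilon, r_min, r)
    return clamp(elec, cutoff), clamp(steric, cutoff)


# Hypothetical two-atom molecule on a small grid with 2 Angstrom spacing.
atoms = [(0.0, 0.0, 0.0, -0.5), (1.2, 0.0, 0.0, 0.5)]
grid = [(x, y, z)
        for x in range(-4, 5, 2)
        for y in range(-4, 5, 2)
        for z in range(-4, 5, 2)]

# One row of the molecular field table: two field values per grid point.
field_row = [value for p in grid for value in probe_fields(p, atoms)]
```

Each molecule contributes one such row, so the number of columns grows with the number of grid points, which is why the later slides turn to PCA and PLS.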

Molecular Field Data: N is the number of grid points.

Molecular Field handling methods: The number of variables is too large. Option 1: decrease the number of variables. This cannot represent the correlation between molecular fields and activities well. Option 2: Principal Components Analysis or Partial Least Squares. These methods consider all molecular fields together with the activities.

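Factor Analysis: Factor analysis is a mathematical method that summarizes the observed variables into a small number of latent variables that are linearly related to them. (1) Observed variables: variables for which the researcher has collected data empirically, i.e. directly measured variables (molecular descriptors can usually be regarded as observed variables). (2) Latent variables (factors): variables that cannot be observed directly but reveal their existence through several measured variables. A single latent variable appears mixed, a little at a time, into several measured variables, so extracting the part that those measured variables have in common defines one latent variable. (3) Linear relationship: factor analysis assumes that this functional relationship is linear and extracts the hidden theoretical variables under that assumption; these theoretical variables are what is called "factors".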

Philosophical Basis [Figure: a data matrix indexed by row designee and column designee, factored into a row matrix (row designee × factor) and a column matrix (factor × column designee).]

Factor Analysis Example 1 If only two factors, such as Biology and Chemistry, were considered important in the grading, each data point could be broken down into a sum of two factors.

Factor Analysis Example 2: If the absorbance data obey Beer's law, the factors can be interpreted chemically: the number of absorbing components; the concentration of each component in each mixture; the spectrum of each component.

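Principal Component Analysis (PCA): Principal component analysis finds new principal components expressed as linear combinations of the original variables, with the aim of summarizing the data and making it easier to interpret. PCA is not so much an analysis that reaches a conclusion on its own as a step that provides the means for subsequent analysis.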

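Uses of PCA: (1) When the original variables are correlated with one another, PCA finds a minimal set of uncorrelated principal components that can be used for a new analysis. This applies in particular to regression equations in which the original variables are so highly correlated that multicollinearity arises. (2) PCA is used to test the normality of the variables, because if the principal components are not normally distributed, the original variables are not normally distributed either. (3) PCA can be used to find outliers. Inspecting the principal component values of each observation shows whether that observation is unusual: the more of an outlier it is, the more likely its principal component values are larger or smaller than those of the other observations.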

Principal Components Analysis (PCA) in the case of many molecules

X = S L, where S holds the scores and L the loadings. Principal Component 1, Principal Component 2, ..., Principal Component n each contribute one score vector and one loading vector.
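The decomposition X = S L can be sketched with a singular value decomposition. This is a minimal example, assuming that S denotes the scores and L the loadings and that X is column-centered first; the function name and the random data are hypothetical.

```python
import numpy as np


def pca_scores_loadings(X, n_components):
    """Decompose column-centered X into scores S and loadings L (X ~ S L)."""
    Xc = X - X.mean(axis=0)
    U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = U[:, :n_components] * sing[:n_components]  # scores, one column per component
    L = Vt[:n_components]                          # loadings, one row per component
    explained = sing ** 2 / np.sum(sing ** 2)      # fraction of variance per component
    return Xc, S, L, explained[:n_components]


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))  # 20 hypothetical molecules, 6 variables
Xc, S, L, explained = pca_scores_loadings(X, n_components=6)
```

Keeping all components reproduces the centered data exactly; keeping only the first few gives the reduced representation used in the regression slides that follow.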

Eigenvector Analysis

PCA

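PCA example with four variables: height, weight, chest circumference, and sitting height. [Tables: eigenvalues of PRIN1 to PRIN4 (Eigenvalue, Difference, Proportion, Cumulative) and the eigenvector coefficients of the four variables on PRIN1 to PRIN4; the numeric values are missing.]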

Principal Component Regression (PCR): A serious problem arises with MLR when the independent variables that comprise X are not independent but collinear. In such cases X loses full rank, or nearly so, and the model parameters become highly sensitive to noise. PCR and PLS circumvent the collinearity problem because the eigenvectors (called latent variables) derived from the independent block are constrained to be orthogonal. In PCR, X is replaced by X_PCA, its abstract reproduced counterpart. X_PCA is obtained by deleting the error eigenvectors after subjecting X to abstract factor analysis.

Regression Analysis

PCR Solution
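A minimal sketch of a PCR solution, assuming the scores come from an SVD of the centered X and that y is regressed on the retained scores only. The function name `pcr_fit` and the collinear demo data are hypothetical and not taken from the slides.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression: regress y on the leading PCA scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = n_components
    S = U[:, :k] * sing[:k]                # retained scores
    L = Vt[:k]                             # retained loadings
    # Score columns are orthogonal, so each coefficient is found independently.
    gamma = (S.T @ yc) / sing[:k] ** 2
    beta = L.T @ gamma                     # coefficients in the original variables
    intercept = y_mean - x_mean @ beta
    return beta, intercept


# Collinear demo: the second column duplicates the first, so X is rank deficient.
rng = np.random.default_rng(0)
x1, x3 = rng.normal(size=30), rng.normal(size=30)
X_demo = np.column_stack([x1, x1, x3])
y_demo = x1 + x3

beta, intercept = pcr_fit(X_demo, y_demo, n_components=2)
y_hat = X_demo @ beta + intercept
```

Dropping the third component, whose singular value is numerically zero, is what keeps the fit stable here; the weight of the duplicated variable is split evenly between its two copies.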

Factor Analysis: The covariance matrix Z is constructed by premultiplying the data matrix D by its transpose: $Z = D'D$. This matrix is then diagonalized by finding a matrix Q such that $Q^{-1} Z Q = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, that is, $Z q_j = \lambda_j q_j$, where $q_j$ is the j-th column of Q. These columns, called eigenvectors, constitute a mutually orthonormal set. Hence $Q^{-1} = Q'$.

Factor Analysis: The following shows that Q' is identical to C: $Q' Z Q = Q' D' D Q = P'P$, where $P = DQ$. Solving for D yields $D = P Q'$, so Q' plays the role of the column matrix C.
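The diagonalization can be checked numerically. This is a minimal sketch that assumes Z = D'D and P = D Q, with the prime denoting the transpose; the random matrix D is a hypothetical stand-in for a data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(8, 4))            # hypothetical data matrix

Z = D.T @ D                            # premultiply D by its transpose
eigvals, Q = np.linalg.eigh(Z)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]      # reorder so the largest comes first
eigvals, Q = eigvals[order], Q[:, order]

P = D @ Q                              # assumed definition of P
D_rebuilt = P @ Q.T                    # solving P = D Q for D
```

Because the columns of Q are orthonormal, the inverse of Q equals its transpose, which is what allows D to be recovered from P without a matrix inversion.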

Interpreting Factors Principal Component Analysis Principal Axes

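PCA, lessons from experience: A high eigenvalue does not make a component the best variable. A high eigenvalue means that the latent variable shared among the variables is large, not that it is highly correlated with the target variable. Multicollinearity: in most cases of high multicollinearity a single variable used to be selected and used; instead, PCA can be run on only the highly collinear variables, and the significant principal axes are then extracted and used. A PCA on all variables can actually lead to overfitting. Rather than using all variables, it is better to gather sets of mutually highly correlated variables, build the principal axes within each set, and convert each set into a single new variable.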
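The first point can be illustrated with synthetic data. In this sketch, three hypothetical descriptors share one large latent factor that is unrelated to the activity, while a fourth descriptor drives the activity. All data are randomly generated for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Three descriptors sharing one latent factor that is unrelated to y.
latent = rng.normal(size=n)
block = np.column_stack([latent + 0.1 * rng.normal(size=n) for _ in range(3)])

# A fourth descriptor that actually determines the target.
x4 = rng.normal(size=n)
y = x4 + 0.1 * rng.normal(size=n)

X = np.column_stack([block, x4])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale

U, sing, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U * sing
eigenvalues = sing ** 2 / (n - 1)                    # descending order

# Absolute correlation of every principal component with the target.
corr_with_y = np.array(
    [abs(np.corrcoef(scores[:, k], y)[0, 1]) for k in range(X.shape[1])]
)
best = int(np.argmax(corr_with_y))
```

Here the first component has the largest eigenvalue because it collects the shared latent factor, yet a later component is the one that tracks y, which is the reason for ranking components against the target rather than by eigenvalue alone.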

QSPR / QSAR Progress (Selection): Calculation of Descriptors → Descriptor Selection → Generation of Prediction Models → Model Validation → Success. Descriptor selection methods: Genetic Algorithm, Forward Selection, Backward Elimination, Stepwise Selection, Principal Components Analysis, Partial Least Squares.

QSPR / QSAR Progress (Selection): Calculation of Descriptors → Descriptor Selection → Generation of Prediction Models → Model Validation → Success. Removing high correlation among the descriptors [Table: correlation matrix of X1, X2, X3, X4]: Principal Component Analysis generates the new descriptors. Descriptor Set = {X1, X2, X5}.

Partial Least Squares N-Components Repeat PCA

X = S Lx

Y = Sy Ly

Bi-PLS: Suppose X (I × J) and y (I × 1) are column-centered (and scaled if necessary) matrices. The first PLS1 component, which is calculated to predict y from X, solves $\max_{w}\,[\operatorname{cov}(t, y) \mid t = Xw,\ \lVert w \rVert = 1]$ (1)

Bi-PLS By defining

Bi-PLS Process

PLS Process
(1) Autoscale X and Y
(2) s = y / ||y||
Y-block
(3) ly' = s'Y / (s's)
(4) s = Y ly / (ly' ly)
(5) s = s / ||s||
X-block
(6) lx' = s'X / (s's)
(7) s = X lx
(8) s = s / ||s||
(9) Repeat steps (3) to (8) until s in step (5) converges.

PLS Process (continued)
(10) lx' = s'X
(11) Ex = X – s lx'
(12) Ey = Y – s ly'
(13) X = Ex
(14) Y = Ey
(15) SSEy = trace(Ey' Ey)
(16) Go to step (2) and extract the next set of eigenvectors if SSEy is too large.
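Steps (1) to (16) can be sketched as follows. This is one possible reading of the steps, not a definitive implementation: it assumes y in step (2) is the first column of Y, it recomputes ly from the converged s before deflating Y, and it extracts a fixed number of components instead of testing whether SSEy is "too large". The function names and the demo data are hypothetical.

```python
import numpy as np


def autoscale(A):
    """Center each column and scale it to unit standard deviation (step 1)."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def pls_components(X, Y, n_components, tol=1e-10, max_iter=500):
    X, Y = autoscale(X), autoscale(Y)                    # step 1
    scores, x_loadings, y_loadings, sse_y = [], [], [], []
    for _ in range(n_components):
        s = unit(Y[:, 0])                                # step 2
        s_y_old = np.zeros_like(s)
        for _ in range(max_iter):
            ly = Y.T @ s                                 # step 3 (s's = 1)
            s_y = unit(Y @ ly)                           # steps 4 and 5
            lx = X.T @ s_y                               # step 6
            s = unit(X @ lx)                             # steps 7 and 8
            if np.linalg.norm(s_y - s_y_old) < tol:      # step 9
                break
            s_y_old = s_y
        lx = X.T @ s                                     # step 10
        ly = Y.T @ s                                     # assumption: refresh ly
        X = X - np.outer(s, lx)                          # steps 11 and 13
        Y = Y - np.outer(s, ly)                          # steps 12 and 14
        sse_y.append(float(np.trace(Y.T @ Y)))           # step 15
        scores.append(s)
        x_loadings.append(lx)
        y_loadings.append(ly)
    return np.column_stack(scores), np.array(x_loadings), np.array(y_loadings), sse_y


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(25, 5))
Y_demo = X_demo @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(25, 2))
T, Lx, Ly, sse = pls_components(X_demo, Y_demo, n_components=3)
```

Because X is deflated by each score vector, the extracted scores are mutually orthogonal, and SSEy can only decrease as components are added; step (16) would stop once it is small enough.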

PLS Basic concept [Figure: diagram relating X, the weights w, the scores T = (t1, ..., tn), the regression coefficients B, and Y.]

3-way Array & Singular Value Decomposition X

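3-way PLS: Bro generalizes equation (1) to $\max_{w^J, w^K}\,[\operatorname{cov}(t, y) \mid t_i = \sum_{j}\sum_{k} x_{ijk} w^J_j w^K_k,\ \lVert w^J \rVert = \lVert w^K \rVert = 1]$. The vectors $w^J$ and $w^K$ can be solved in a very elegant way by defining Z as the matrix with typical element $z_{jk} = \sum_i y_i x_{ijk}$ and using the SVD of Z. It can be shown that $w^J$ and $w^K$ are equal to the first left and right singular vectors of Z, respectively.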
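The SVD route to the two weight vectors can be sketched as below. The sketch assumes that the typical element of Z is z_jk = Σ_i y_i x_ijk and that the scores are t_i = Σ_j Σ_k x_ijk wJ_j wK_k; the array sizes and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_j, n_k = 12, 4, 3

X = rng.normal(size=(n_samples, n_j, n_k))   # hypothetical 3-way array
y = rng.normal(size=n_samples)
X = X - X.mean(axis=0)                       # center across samples
y = y - y.mean()

# Assumed typical element: z_jk = sum_i y_i * x_ijk
Z = np.einsum("i,ijk->jk", y, X)

U, sing, Vt = np.linalg.svd(Z)
w_J = U[:, 0]                                # first left singular vector
w_K = Vt[0]                                  # first right singular vector

# Scores of the first component: t_i = sum_jk x_ijk * wJ_j * wK_k
t = np.einsum("ijk,j,k->i", X, w_J, w_K)


def random_unit(size):
    """Draw a random direction of unit length for comparison."""
    v = rng.normal(size=size)
    return v / np.linalg.norm(v)


# Inner products t'y obtained with 100 random unit weight vectors.
random_products = [
    np.einsum("ijk,j,k->i", X, random_unit(n_j), random_unit(n_k)) @ y
    for _ in range(100)
]
```

With these weights the inner product t'y equals the largest singular value of Z, and no random pair of unit vectors exceeds it, which is the maximization the slide refers to.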

Proof

3-way PLS Process

3D-QSAR Method by Solvation Free Energy Density: We developed a new Solvation Free Energy Density (SFED)-based 3D-QSAR method using partial least squares (PLS). This method was applied to 96 Protein Tyrosine Phosphatase 1B (PTP 1B) inhibitors for validation. In this model, the hydration free energy density was used as the field. By comparison, the statistics of the PLS model were excellent. [Figures: Test Set (MLR); Test Set (PLS).]

Used Compounds: Protein Tyrosine Phosphatase 1B inhibitors. The in vitro inhibitory activity was used as the dependent variable. Training set: 77 compounds. Test set: 19 compounds. Ref: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267. Alignment Rules

Solvation Free Energy Density (SFED): Two contributions are considered: creating a cavity of the solute size in water, and the interaction of the solute with its environment. q: the MPEOE charges. Pol(a): atomic polarizability (Å³), taken from CDEAP (the Charge Dependence of the Effective Atomic Polarizability). Solvent accessible surface area (SASA): to include the surface area effect, the number of sampled points is used in this work.

Generation of SFED: Deleting low-correlation densities. – Deleting zero density values. – Checking the insoluble problem in Multi-Regression Analysis (MRA). – Simple regression test. – Genetic algorithm.

Result (Multi Linear Regression)

Result (Partial Least Squares)

Problems of SFED-based CoMFA: SFED-based CoMFA does not show a common region that represents the differences in activity. The number of components is still too large: the selected number of factors is 27.

Partial Least Squares

Modified CoMFA Method using SFED

Modifications: (1) Use basis functions of SFED: this represents the common region showing molecular differences. (2) Use the cross-validated R^2 guided region selection method: this decreases the number of variables. (3) Use the 3-way Partial Least Squares method: this decreases the number of components and uses the data in its 3D form. [Diagram: Basis functions → Region Selection → Partial Least Squares → Bioactivity Equation.]

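The Selected Basis Functions: Grid spacing 2 Å; 26 × 19 × 20 = 9880 regions in total (shown as 8860 and 1040). Of the 9880 regions, only values of -3 or above are displayed. X range: -25 to 25 Å, 26 points; Minimum: (missing); Maximum: 7.14. Y range: -19 to 17 Å, 19 points; Minimum: (missing); Maximum: 3.94. Z range: -19 to 19 Å, 20 points; Minimum: (missing); Maximum: 5.65.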

The Results of Region Selection method: Q^2 Guided Region Selection method. Interpretation of the cross-validated r^2_cv (q^2): 1.00 = perfect prediction; above 0.50 = significant statistical results (use results only with care when q^2 > 0.4); 0.00 = no model; negative values = predictions worse than those based on the mean over all compounds. [Figure: number of points selected at each q^2 cutoff level.]
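A cross-validated q^2 of this kind can be computed by leave-one-out prediction. The sketch below assumes ordinary least squares as the model and the definition q^2 = 1 − PRESS / SS, where SS is taken about the mean over all compounds. The function name and the demo data are hypothetical; the slides apply the idea to MLR and PLS models.

```python
import numpy as np


def q2_leave_one_out(X, y):
    """Leave-one-out cross-validated q^2 for a least-squares model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                  # leave compound i out
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2        # squared prediction error
    ss_total = np.sum((y - y.mean()) ** 2)        # spread about the overall mean
    return 1.0 - press / ss_total


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))

y_linear = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.3   # exactly linear
y_noise = rng.normal(size=30)                               # unrelated to X

q2_linear = q2_leave_one_out(X_demo, y_linear)
q2_noise = q2_leave_one_out(X_demo, y_noise)
```

In a q^2-guided region selection, a region would be kept only if the model built from it reaches the chosen cutoff, such as the 0.4 level used on the results slides.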

Results (Cutoff = 0.4, MLR) Cutoff Level of Q^2 = 0.4 Number of Components = 45 Statistical Method = Multiple Linear Regression

Results (Cutoff = 0.4, PLS) Cutoff Level of Q^2 = 0.4 Number of Components = 5 Statistical Method = Partial Least Squares

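Comparison Results
CoMFA models:
  Model            CoMFA    A log P    HOMO    A log P, HOMO
  Components       3        3          5       4
  R^2              (values missing)
  R^2 (test set)   (values missing)
CoMSIA models and this work:
  Model            CoMSIA S, E    CoMSIA S, E, H    CoMSIA S, E, H, D, A    This Work (SFED)
  Components       2              4                 5                       5
  R^2              (values missing)
  R^2 (test set)   (values missing)
Reference: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267.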
