3D QSAR Methodology (PCA & PLS) Computational Systems Biology Young-Mook Kang
Comparative Molecular Field Analysis. Electrostatic fields: Coulomb potential. Steric fields: Lennard-Jones potential. Ref: Cramer R.D. III; Patterson D.E.; Bunce J.D. J. Am. Chem. Soc. 1988, 110,
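As a rough illustration of how the two field types might be sampled on a grid, the sketch below evaluates a Coulomb term and a 12-6 Lennard-Jones term at every grid point. The function name `molecular_fields`, the probe charge, the parameters `epsilon` and `r_min`, the energy cutoff, and the two-atom test molecule are all assumptions made for this example; they are not taken from the slides or from the CoMFA paper.

```python
import numpy as np

COULOMB_K = 332.06  # converts e^2/Å to kcal/mol

def molecular_fields(coords, charges, grid, epsilon=0.1, r_min=3.4,
                     probe_charge=1.0, cutoff=30.0):
    """Return (electrostatic, steric) probe energies at each grid point."""
    # Distances between every grid point and every atom: (n_grid, n_atoms)
    r = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=2)
    r = np.maximum(r, 1e-6)  # guard against a grid point sitting on an atom
    # Coulomb potential felt by the probe charge
    electrostatic = COULOMB_K * probe_charge * (charges[None, :] / r).sum(axis=1)
    # 12-6 Lennard-Jones potential (hypothetical, atom-independent parameters)
    ratio6 = (r_min / r) ** 6
    steric = (epsilon * (ratio6 ** 2 - 2.0 * ratio6)).sum(axis=1)
    # Truncate very large energies near the nuclei (assumed cutoff)
    return (np.clip(electrostatic, -cutoff, cutoff),
            np.clip(steric, -cutoff, cutoff))

# Regular grid with 2 Å spacing around a made-up two-atom molecule
axis = np.arange(-4.0, 4.1, 2.0)
grid = np.array(np.meshgrid(axis, axis, axis, indexing="ij")).reshape(3, -1).T
coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
charges = np.array([-0.4, 0.4])
elec, ster = molecular_fields(coords, charges, grid)
```

Each molecule then contributes one row of field values, with one column per grid point and field type.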
Molecular Field Data N is the number of grid points.
Molecular Field Handling Methods The number of variables is too large. One option is to decrease the number of variables, but this cannot represent the correlation between the molecular fields and the activities well. The other option is Principal Components Analysis or Partial Least Squares: these methods consider all molecular fields together with the activities.
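Factor Analysis Factor analysis is a mathematical method that summarizes observed variables in terms of a small number of latent variables that are linearly related to them. (1) Observed variables: variables for which the researcher has collected data empirically, that is, directly measured variables (here, usually the molecular descriptors). (2) Latent variables (factors): variables that cannot be observed directly but reveal their existence through several measured variables. A single latent variable appears mixed, a little at a time, into many measured variables, so extracting the part that those measured variables have in common defines one latent variable. (3) Linear relationship: factor analysis assumes that this functional relationship is linear and, under that assumption, extracts the hidden theoretical variables; these theoretical variables are what is called "factors".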
Philosophical Basis [Diagram: a data matrix (row designee x column designee) factored into a (row designee x factor) matrix and a (factor x column designee) matrix]
Factor Analysis Example 1 If only two factors, such as Biology and Chemistry, were considered important in the grading, each data point could be broken down into a sum of two factors.
Factor Analysis Example 2 If the absorbance data obey Beer's law, the factors can be interpreted chemically: the number of absorbing components, the concentration of each component in each mixture, and the spectrum of each component.
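The first of these interpretations can be checked numerically. The sketch below builds a noise-free absorbance matrix from made-up concentrations and spectra (unit path length assumed) and counts the significant singular values; all sizes, the random data, and the tolerance are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mixtures, n_wavelengths, n_components = 8, 50, 3

# Hypothetical concentrations (mixtures x components) and spectra (components x wavelengths)
concentrations = rng.uniform(0.1, 1.0, size=(n_mixtures, n_components))
spectra = rng.uniform(0.0, 1.0, size=(n_components, n_wavelengths))

# Beer's law with unit path length: absorbance is a sum over components
absorbance = concentrations @ spectra

# The number of non-negligible singular values equals the number of factors
singular_values = np.linalg.svd(absorbance, compute_uv=False)
n_factors = int((singular_values > 1e-8 * singular_values[0]).sum())
print(n_factors)
```

With real, noisy spectra the cut between signal and error singular values is less sharp, and a statistical criterion would be needed instead of a fixed tolerance.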
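Principal Component Analysis (PCA) Principal component analysis seeks new principal components, expressed as linear combinations of the original variables, in order to summarize the data and make it easier to interpret. Rather than an analysis that reaches a conclusion by itself, PCA is a step that provides the means for subsequent analyses.
Uses of PCA (1) When the original variables are correlated with one another, PCA finds a minimal set of uncorrelated principal components that can be used in a new analysis. This applies in particular to regression equations in which very high correlation among the original variables causes multicollinearity. (2) Testing the normality of the variables: if the principal components are not normally distributed, the original variables are not normally distributed either. (3) Finding outliers: the principal component scores of each observation show whether that observation is unusual. The more of an outlier an observation is, the more likely its principal component scores are to be larger or smaller than those of the other observations.
Principal Components Analysis (PCA) for the case of many molecules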
X = S L, where S holds the scores and L the loadings. The rows of L are Principal Component 1, Principal Component 2, ..., Principal Component n.
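The decomposition X = S L can be sketched with a singular value decomposition of the column-centered data. This is a minimal illustration on random numbers, assuming that S is the score matrix and L the loading matrix; the function name `pca` and the matrix sizes (10 molecules, 200 grid variables) are made up for the example.

```python
import numpy as np

def pca(X, n_components):
    """Return centered data, scores S, loadings L, and explained-variance ratios."""
    Xc = X - X.mean(axis=0)                       # column-center the data
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = U[:, :n_components] * sv[:n_components]   # scores, one column per component
    L = Vt[:n_components]                         # loadings, one row per component
    explained = sv ** 2 / np.sum(sv ** 2)
    return Xc, S, L, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))                    # 10 molecules, 200 field variables
Xc, S, L, explained = pca(X, n_components=9)
```

Because centering removes one degree of freedom, 10 molecules give at most 9 components, and with all 9 the product S L reproduces the centered data.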
Eigenvector Analysis
PCA
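Example with four variables X: height, weight, chest circumference, sitting height. [Table: Eigenvalue, Difference, Proportion, and Cumulative for PRIN1 to PRIN4; values not preserved] [Table: eigenvector coefficients of the four X variables on PRIN1 to PRIN4; values not preserved]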
Principal Component Regression (PCR) A serious problem arises with MLR when the independent variables that comprise X are not independent but collinear. In such cases X loses full rank and the model parameters become more sensitive to noise. PCR and PLS circumvent the collinearity problem because the eigenvectors (called latent variables) derived from the independent block are constrained to be orthogonal. In PCR, X is replaced by X PCA, its abstract reproduced counterpart. X PCA is obtained by deleting the error eigenvectors after subjecting X to abstract factor analysis.
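The idea can be sketched on a deliberately collinear toy problem. The code below is an assumption-laden illustration, not the procedure from the slides: `pcr_fit` is a hypothetical helper, the data are synthetic (the second column is an exact copy of the first), and the number of retained components is chosen by hand.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep the leading components and drop the (error) eigenvectors
    U, sv, Vt = U[:, :n_components], sv[:n_components], Vt[:n_components]
    # Regression on orthogonal scores, mapped back to the original variables
    coef = Vt.T @ ((U.T @ yc) / sv)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

rng = np.random.default_rng(0)
x1, x3 = rng.normal(size=50), rng.normal(size=50)
X = np.column_stack([x1, x1, x3])   # columns 1 and 2 are perfectly collinear
y = 2.0 * x1 + x3
coef, intercept = pcr_fit(X, y, n_components=2)
y_hat = X @ coef + intercept
```

Ordinary MLR has no unique solution here because X'X is singular, whereas the two retained components give a stable fit that splits the weight evenly between the duplicated columns.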
Regression Analysis
PCR Solution
Factor Analysis The covariance matrix $Z$ is constructed by premultiplying the data matrix $D$ by its transpose: $Z = D'D$. This matrix is then diagonalized by finding a matrix $Q$ such that $Q^{-1} Z Q = [\lambda_j \delta_{jk}]$, that is, $Z q_j = \lambda_j q_j$, where $q_j$ is the $j$th column of $Q$. These columns, called eigenvectors, constitute a mutually orthonormal set. Hence $Q^{-1} = Q'$.
Factor Analysis The following shows that $Q'$ is identical to the column matrix $C$: $Q' Z Q = Q' D' D Q = P'P$, with $P = DQ$. Solving for $D$ yields $D = P Q'$.
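These relations can be verified numerically. The sketch below uses a small random data matrix and assumes the notation D for the data matrix, Z = D'D for the covariance-type matrix, Q for the eigenvector matrix, and P = DQ for the resulting row matrix; the sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(6, 4))          # hypothetical data matrix

Z = D.T @ D                          # premultiply D by its transpose
eigenvalues, Q = np.linalg.eigh(Z)   # eigh: Z is symmetric

# Sort eigenvectors from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]

P = D @ Q                            # row matrix; D is recovered as P Q'
```

Here `Q.T @ Q` is the identity, `Q.T @ Z @ Q` is diagonal with the eigenvalues on the diagonal, and `P @ Q.T` reproduces D.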
Interpreting Factors Principal Component Analysis Principal Axes
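PCA: empirical notes A high eigenvalue does not make a component the best variable. A high eigenvalue means that the latent variable shared among the variables is large, not that it is highly correlated with the target variable. Multicollinearity: when multicollinearity is high, usually only one of the variables is selected and used; with PCA, one instead runs PCA on only the highly collinear variables and extracts and uses the significant principal axes. A PCA over all variables can actually lead to overfitting. Rather than using all variables, it is better to collect sets of variables that are highly correlated with one another, build principal axes within each set, and convert each set into a single new variable.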
QSPR / QSAR Progress (Selection) Workflow: Calculation of Descriptors, Descriptor Selection, Generation of Prediction Models, Model Validation, Success. Descriptor selection methods: Genetic Algorithm, Forward Selection, Backward Elimination, Stepwise Selection, Principal Components Analysis, Partial Least Squares.
QSPR / QSAR Progress (Selection) Descriptor Selection step: removing high correlation among the descriptors. [Table: correlation matrix of X1, X2, X3, X4; values not preserved] Principal Component Analysis generates the new descriptors. Descriptor Set = {X1, X2, X5}
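One way to read this step is sketched below: descriptors whose pairwise correlation exceeds a threshold are grouped, and the group is replaced by its first principal component. The synthetic data (X3 and X4 built to be strongly correlated), the 0.9 threshold, and the naming of the new descriptor as X5 are assumptions made to mirror the descriptor set {X1, X2, X5}; the slides do not specify these details.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x1, x2, base = rng.normal(size=(3, n))
x3 = base + 0.05 * rng.normal(size=n)    # X3 and X4 share one latent variable
x4 = -base + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])

# Find descriptor pairs with high absolute correlation
corr = np.corrcoef(X, rowvar=False)
threshold = 0.9
pairs = np.argwhere(np.triu(np.abs(corr) > threshold, k=1))
group = sorted(set(pairs.ravel().tolist()))
keep = [j for j in range(X.shape[1]) if j not in group]

# Autoscale the correlated group and take its first principal component
G = X[:, group]
G = (G - G.mean(axis=0)) / G.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(G, full_matrices=False)
x5 = G @ Vt[0]                           # new descriptor replacing the group

new_X = np.column_stack([X[:, keep], x5])
```

With several distinct correlated groups, each group would need its own PCA; this sketch handles only the single-group case.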
Partial Least Squares N-Components Repeat PCA
X = S Lx
Y = Sy Ly
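Bi-PLS Suppose $X$ ($I \times J$) and $y$ ($I \times 1$) are column-centered (and scaled if necessary) matrices. The first PLS1 component, which is calculated to predict $y$ from $X$, solves $\max_{w} \left[ \operatorname{cov}(t, y) \mid t = Xw,\ \|w\| = 1 \right]$ (1)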
Bi-PLS By defining $z = X'y$, the solution of (1) is $w = z / \|z\|$.
Bi-PLS Process
PLS Process
(1) Autoscale X and Y
(2) s = y / ||y||
Y-block
(3) ly' = s'Y / (s's)
(4) s = Y ly / (ly' ly)
(5) s = s / ||s||
X-block
(6) lx' = s'X / (s's)
(7) s = X lx
(8) s = s / ||s||
(9) Repeat steps (3) to (8) until s in step (5) converges.
PLS Process
(10) lx' = s'X
(11) Ex = X - s lx'
(12) Ey = Y - s ly'
(13) X = Ex
(14) Y = Ey
(15) SSEy = trace(Ey' Ey)
(16) Go to step (2) and extract the next set of eigenvectors if SSEy is too large.
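Steps (1) to (16) can be followed almost literally in code. The sketch below is one possible reading, with several assumptions: y in step (2) is taken to be the first column of Y, ly in step (12) is recomputed from the converged s (at convergence this equals the step (3) value), the loop runs for a fixed number of components instead of testing SSEy, and the function names, tolerance, and synthetic data are invented for the example.

```python
import numpy as np

def autoscale(A):
    """Center each column and scale it to unit standard deviation."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

def pls_common_score(X, Y, n_components, tol=1e-12, max_iter=500):
    X, Y = autoscale(X), autoscale(Y)                # step (1)
    scores, sse_history = [], []
    for _ in range(n_components):
        s = Y[:, 0] / np.linalg.norm(Y[:, 0])        # step (2), assumed column
        s_prev = np.zeros_like(s)
        for _ in range(max_iter):
            ly = Y.T @ s / (s @ s)                   # step (3)
            s = Y @ ly / (ly @ ly)                   # step (4)
            s = s / np.linalg.norm(s)                # step (5)
            converged = np.linalg.norm(s - s_prev) < tol
            s_prev = s
            lx = X.T @ s / (s @ s)                   # step (6)
            s = X @ lx                               # step (7)
            s = s / np.linalg.norm(s)                # step (8)
            if converged:                            # step (9)
                break
        lx = X.T @ s                                 # step (10)
        ly = Y.T @ s                                 # assumed refresh of ly
        X = X - np.outer(s, lx)                      # steps (11), (13)
        Y = Y - np.outer(s, ly)                      # steps (12), (14)
        sse_history.append(float(np.trace(Y.T @ Y))) # step (15)
        scores.append(s)                             # step (16): next component
    return np.column_stack(scores), sse_history

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 6))
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(30, 2))
S, sse = pls_common_score(X, Y, n_components=3)
```

Because each component is removed from both blocks by projection, the score vectors come out mutually orthogonal and SSEy cannot increase from one component to the next.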
PLS Basic concept [Diagram relating X, the weights w, the scores T = (t1, ..., tn), the regression coefficients B, and Y]
3-way Array & Singular Value Decomposition
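3-way PLS Bro generalizes equation (1) to a three-way array $X$ ($I \times J \times K$): $\max_{w^J, w^K} \left[ \operatorname{cov}(t, y) \mid t_i = \sum_j \sum_k x_{ijk} w^J_j w^K_k,\ \|w^J\| = \|w^K\| = 1 \right]$. The vectors $w^J$ and $w^K$ can be solved in a very elegant way by defining $Z$ as the matrix with typical element $z_{jk} = \sum_i y_i x_{ijk}$ and using the SVD of $Z$. It can be shown that $w^J$ and $w^K$ are equal to the first left and right singular vectors of $Z$, respectively.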
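The SVD route to the first 3-way PLS component can be sketched as follows. The array sizes and random data are assumptions, and the names `wJ`, `wK`, and `t` are chosen here for the two weight vectors and the score vector.

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K = 12, 5, 4
X = rng.normal(size=(I, J, K))       # hypothetical 3-way array
y = rng.normal(size=I)

# Center over the sample mode
X = X - X.mean(axis=0)
y = y - y.mean()

# Z has typical element z_jk = sum_i y_i * x_ijk
Z = np.einsum("i,ijk->jk", y, X)

# First left and right singular vectors give the two weight vectors
U, sv, Vt = np.linalg.svd(Z)
wJ, wK = U[:, 0], Vt[0]

# Score vector: t_i = sum_jk x_ijk * wJ_j * wK_k
t = np.einsum("ijk,j,k->i", X, wJ, wK)
```

The inner product of t and y equals the largest singular value of Z, which is the maximum attainable over unit-length weight vectors.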
Proof
3-way PLS Process
3D-QSAR Method by Solvation Free Energy Density We developed a new Solvation Free Energy Density (SFED)-based 3D-QSAR method using partial least squares (PLS). This method was applied to 96 Protein Tyrosine Phosphatase 1B (PTP 1B) inhibitors for validation. In this model, the hydration free energy density was used as the field. Compared with MLR, the statistics of the PLS model were excellent. [Plots: Test Set (MLR); Test Set (PLS)]
Used Compounds Protein Tyrosine Phosphatase 1B inhibitors. The in vitro inhibitory activity was used as the dependent variable. Training set: 77 compounds. Test set: 19 compounds. Ref: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267. Alignment Rules
Solvation Free Energy Density (SFED) Contributions: creating a cavity of the solute size in water, and the interaction of the solute with its environment. q: the MPEOE charges. Pol(a): atomic polarizability (Å^3), from CDEAP (the Charge Dependence of the Effective Atomic Polarizability). Solvent accessible surface area (SASA): to include the surface area effect, the number of sampled points is used in this work.
Generation SFED Deleting low-correlation densities. –Deleting zero density values. –Checking the insoluble problem in Multi-Regression Analysis (MRA). –Simple regression test. –Genetic algorithm.
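The first filtering steps might look like the sketch below, which drops grid columns that are all zero or constant and then keeps only columns whose simple correlation with the activity passes a threshold. This is an assumed interpretation: the function name `filter_density_columns`, the threshold `min_abs_r`, and the toy matrix are invented, and the genetic algorithm step is not covered.

```python
import numpy as np

def filter_density_columns(F, y, min_abs_r=0.3):
    """Return indices of field columns that survive the simple filters."""
    nonzero = np.any(F != 0.0, axis=0)       # delete zero density values
    varying = F.std(axis=0) > 0.0            # constant columns make MRA insoluble
    candidates = np.flatnonzero(nonzero & varying)
    # Simple regression test: correlation of each column with the activity
    Fc = F[:, candidates] - F[:, candidates].mean(axis=0)
    yc = y - y.mean()
    r = (Fc.T @ yc) / (np.linalg.norm(Fc, axis=0) * np.linalg.norm(yc))
    return candidates[np.abs(r) >= min_abs_r]

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
F = np.column_stack([
    np.zeros(6),                                   # all-zero density
    2.0 * y,                                       # perfectly correlated
    np.full(6, 0.7),                               # constant, non-zero
    np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0]),   # weakly correlated
])
selected = filter_density_columns(F, y)
```

Only the second column survives in this toy case; the others are removed as zero, constant, or weakly correlated.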
Result (Multi Linear Regression)
Result (Partial Least Squares)
Problems of SFED-based CoMFA SFED-based CoMFA does not show a common region that represents the differences in activity. The number of components is still too large: the selected number of factors is 27.
Partial Least Squares
Modified CoMFA Method using SFED
Use basis functions of SFED. –Represents the common region showing molecular differences. Use the cross-validated R^2 (Q^2) guided region selection method. –Decreases the number of variables. Use the 3-way Partial Least Squares method. –Decreases the number of components and uses the 3D form of the data. [Diagram: Basis functions, Region Selection, Partial Least Squares, Bioactivity Equation]
The Selected Basis Functions Of the 9880 regions in total, only those with values of -3 or higher are shown (counts on the slide: 8860 and 1040). Grid spacing: 2 Å. X range: -25 to 25 Å, 26 points; Minimum: (value missing); Maximum: 7.14. Y range: -19 to 17 Å, 19 points; Minimum: (value missing); Maximum: 3.94. Z range: -19 to 19 Å, 20 points; Minimum: (value missing); Maximum: 5.65.
The Results of Region Selection method Q^2 Guided Region Selection method. Cross-validated r^2_cv (q^2) scale: 1.00 = perfect prediction; 0.50 = significant statistical results (use results only with care when q^2 > 0.4); 0.00 = no model; negative values = prediction worse than one based on the mean over all compounds. [Plots of the selected region points; one panel labeled 0.2]
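The q^2 scale can be made concrete with a leave-one-out calculation. The sketch below assumes q^2 = 1 - PRESS / SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations from the overall mean, and it uses ordinary least squares as the model; the function name `q2_loo` and the toy data are assumptions.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out cross-validated r^2 for a linear model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])     # add the intercept column
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i            # leave compound i out
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2   # squared prediction error
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Perfectly linear data: q^2 is 1
x = np.arange(10.0).reshape(-1, 1)
q2_perfect = q2_loo(x, 3.0 * x[:, 0] + 1.0)

# Intercept-only model: each compound is predicted by the mean of the others
rng = np.random.default_rng(5)
y_random = rng.normal(size=10)
q2_mean_only = q2_loo(np.empty((10, 0)), y_random)
```

The intercept-only case gives a slightly negative q^2, exactly 1 - (n / (n - 1))^2, which illustrates why negative values mean the model predicts worse than the overall mean.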
Results (Cutoff = 0.4, MLR) Cutoff level of Q^2 = 0.4. Number of components = 45. Statistical method = Multiple Linear Regression.
Results (Cutoff = 0.4, PLS) Cutoff level of Q^2 = 0.4. Number of components = 5. Statistical method = Partial Least Squares.
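Comparison Results
CoMFA models (column labels as extracted: A, log P, HOMO, A log P, HOMO). Components: 3, 3, 5, 4. R^2 and R^2 (test set): values not preserved.
CoMSIA models (S, E; S, E, H; S, E, H, D, A) and This Work (SFED). Components: 2, 4, 5, 5. R^2 and R^2 (test set): values not preserved.
Reference: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267.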