3D QSAR Methodology (PCA & PLS) Computational Systems Biology Young-Mook Kang.


1 3D QSAR Methodology (PCA & PLS) Computational Systems Biology Young-Mook Kang

2 Comparative Molecular Field Analysis (CoMFA)
- Electrostatic fields: Coulomb potential
- Steric fields: Lennard-Jones potential
Reference: Cramer, R.D. III; Patterson, D.E.; Bunce, J.D. J. Am. Chem. Soc. 1988, 110, 5959-5967.
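A minimal sketch of how the two CoMFA fields could be sampled on a lattice, assuming a +1 charged probe, a 12-6 Lennard-Jones form with illustrative parameters (`epsilon`, `r_min`), and a ±30 kcal/mol truncation. The function name `comfa_fields` and all parameter values are hypothetical and are not taken from the slides.

```python
import numpy as np

def comfa_fields(coords, charges, grid, probe_charge=1.0,
                 epsilon=0.1, r_min=3.4, cutoff=30.0):
    """Return (steric, electrostatic) probe energies at each grid point.

    coords  : (n_atoms, 3) atom positions in angstroms
    charges : (n_atoms,) partial charges in units of e
    grid    : (n_points, 3) lattice positions in angstroms
    epsilon, r_min : illustrative Lennard-Jones parameters (assumed values)
    cutoff  : energies are truncated to +/- cutoff (kcal/mol)
    """
    # Distance from every grid point to every atom: shape (n_points, n_atoms)
    d = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=-1)
    d = np.maximum(d, 1e-6)  # avoid division by zero on top of an atom

    # Coulomb potential; 332.06 converts e^2/angstrom to kcal/mol
    electrostatic = 332.06 * probe_charge * (charges[None, :] / d).sum(axis=1)

    # 12-6 Lennard-Jones potential written in its r_min form
    ratio6 = (r_min / d) ** 6
    steric = (epsilon * (ratio6 ** 2 - 2.0 * ratio6)).sum(axis=1)

    return (np.clip(steric, -cutoff, cutoff),
            np.clip(electrostatic, -cutoff, cutoff))

# Toy two-atom "molecule" on a 5 x 5 x 5 lattice with 2 angstrom spacing
coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
charges = np.array([0.4, -0.4])
axis = np.arange(-4.0, 5.0, 2.0)
grid = np.array(np.meshgrid(axis, axis, axis, indexing="ij")).reshape(3, -1).T
steric, electrostatic = comfa_fields(coords, charges, grid)
```

Each molecule then contributes one row of field values per field type, which is why the number of variables grows with the number of grid points.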

3 Molecular Field Data: N is the number of grid points.

4 Molecular Field handling methods
The number of variables is too large. Two ways to handle this:
- Decrease the number of variables. This cannot represent the correlation between molecular fields and activities well.
- Principal Components Analysis or Partial Least Squares. These methods consider all molecular fields together with the activities.

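5 Factor Analysis
Factor analysis is a mathematical method that summarizes observed variables in terms of a small number of latent variables that are linearly related to them.
① Observed variables: variables for which the researcher has collected data empirically, that is, directly measured variables (molecular descriptors can usually be regarded as the observed variables).
② Latent variables (factors): variables that cannot be observed directly but whose existence is revealed through several measured variables. A single latent variable appears mixed, a little at a time, into many measured variables, so extracting the common part contained in several measured variables defines one latent variable.
③ Linear relationship: factor analysis assumes that this functional relationship is linear and uses it to extract the hidden theoretical variables; these theoretical variables are what are specifically called "factors".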

6 Philosophical Basis [Diagram: a data matrix indexed by row designee and column designee, factored into a (row designee x factor) matrix and a (factor x column designee) matrix.]

7 Factor Analysis Example 1 If only two factors, such as Biology and Chemistry, were considered important in the grading, each data point could be broken down into a sum of two factors.

8 Factor Analysis Example 2 If the absorbance data obey Beer's law, the factors can be interpreted chemically:
- The number of absorbing components
- The concentration of each component in each mixture
- The spectrum of each component
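As an illustration of the Beer's law example, the sketch below builds synthetic absorbance data as concentrations times pure-component spectra (unit path length assumed) and recovers the number of absorbing components from the number of non-negligible singular values. The data are random and purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mixtures, n_wavelengths, n_components = 6, 40, 2

# Hypothetical pure-component spectra (rows) and mixture concentrations
spectra = rng.random((n_components, n_wavelengths))
conc = rng.random((n_mixtures, n_components))

# Beer's law with unit path length: absorbance = concentrations @ spectra
absorbance = conc @ spectra

# The number of significant singular values equals the number of factors
singular_values = np.linalg.svd(absorbance, compute_uv=False)
n_factors = int((singular_values > 1e-10 * singular_values[0]).sum())
```

With noise-free data `n_factors` matches the number of components used to build the mixtures; with real spectra the threshold has to be chosen against the noise level.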

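9 Principal Component Analysis (PCA)
Principal component analysis seeks new principal components, expressed as linear combinations of the original variables, in order to summarize the data and make it easier to interpret. PCA is not so much an analysis that reaches a conclusion by itself as a step that provides the means for subsequent analysis.

10 Uses of PCA
(1) When the original variables are correlated with one another, PCA can find the smallest set of uncorrelated principal components, which can then be used for a new analysis. This applies in particular to regression equations in which the original variables are so highly correlated that multicollinearity arises.
(2) It is used to test the normality of the variables, because if the principal components are not normally distributed, the original variables are not normally distributed either.
(3) It can be used to find outliers. Looking at the principal component scores of each observation shows whether that observation is unusual: the more of an outlier it is, the more likely its principal component scores are larger or smaller than those of the other observations.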

11 Principal Components Analysis (PCA) for the case of many molecules

12 X = S L Principal Component 1

13 X = S L: Principal Component 1, Principal Component 2, ..., Principal Component n
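The decomposition X = S L can be sketched with a singular value decomposition, reading S as the score matrix and L as the loading matrix (an interpretation; the slides do not define the symbols). The helper name `pca_scores_loadings` and the random data are hypothetical.

```python
import numpy as np

def pca_scores_loadings(X, n_components):
    """Return scores S, loadings L and the centered data Xc, with Xc ~ S @ L."""
    Xc = X - X.mean(axis=0)                       # column-center the data
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = U[:, :n_components] * sigma[:n_components]  # scores, one column per PC
    L = Vt[:n_components]                           # loadings, one row per PC
    return S, L, Xc

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))

# All components reproduce the centered data exactly
S_full, L_full, Xc = pca_scores_loadings(X, 5)

# Keeping only principal component 1 gives the best rank-1 approximation
S1, L1, _ = pca_scores_loadings(X, 1)
residual = np.linalg.norm(Xc - S1 @ L1)
```

Adding components one at a time, as on the slides, shrinks the residual until the full set reproduces the data.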

14 Eigenvector Analysis

15 PCA

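16 PCA example data (X1 = height, X2 = weight, X3 = chest girth, X4 = sitting height)

No.   X1    X2    X3    X4
 1    148   41    72    78
 2    160   49    77    86
 3    159   45    80    86
 4    153   43    76    83
 5    151   42    77    80
 6    140   29    64    74
 7    158   49    78    83
 8    137   31    66    73
 9    149   47    82    79
10    160   47    74    87
11    151   42    73    82
12    157   39    68    80
13    157   48    80    88
14    144   36    68    76
15    139   32    68    73
16    139   34    71    76
17    149   34    47    79
18    142   31    66    76
19    150   43    77    79
20    139   31    68    74
21    161   47    78    84
22    140   33    67    77
23    152   35    73    79
24    145   35    70    77
25    156   44    78    85
26    147   38    73    78
27    147   30    65    75
28    151   36    74    80
29    141   30    67    76
30    148   38    70    78

Correlation matrix
      X1       X2       X3       X4
X1    1        0.858    0.5496   0.9205   (height)
X2    0.858    1        0.7487   0.8783   (weight)
X3    0.5496   0.7487   1        0.5963   (chest girth)
X4    0.9205   0.8783   0.5963   1        (sitting height)

Eigenvalues of the correlation matrix
        Eigenvalue   Difference   Proportion   Cumulative
PRIN1   3.29362      2.76207      0.823405     0.8234
PRIN2   0.53155      0.43416      0.132888     0.95629
PRIN3   0.09739      0.01996      0.024348     0.98064
PRIN4   0.07743      .            0.019358     1

Eigenvectors
      PRIN1      PRIN2      PRIN3      PRIN4
X1    0.510899   -0.42098   0.36128    0.656687
X2    0.53118    0.045788   -0.84228   0.079482
X3    0.431131   0.842988   0.320427   0.028703
X4    0.520534   -0.33174   0.23952    -0.74941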
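The eigenvalue and eigenvector tables on this slide can be reproduced, to within rounding, from the correlation matrix alone. The sketch below uses NumPy's symmetric eigensolver; the variable names are illustrative, and the sign of each eigenvector is arbitrary, so columns may come out negated relative to the slide.

```python
import numpy as np

# Correlation matrix of height, weight, chest girth and sitting height
R = np.array([
    [1.0000, 0.8580, 0.5496, 0.9205],
    [0.8580, 1.0000, 0.7487, 0.8783],
    [0.5496, 0.7487, 1.0000, 0.5963],
    [0.9205, 0.8783, 0.5963, 1.0000],
])

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance explained by each principal component
proportion = eigvals / eigvals.sum()
cumulative = np.cumsum(proportion)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, `proportion` is simply each eigenvalue divided by 4 here.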

17 Principal Component Regression (PCR)
A serious problem arises with MLR when the independent variables that comprise X are not independent but are collinear.
- In such cases X loses full rank, or nearly so, and the model parameters become very sensitive to noise.
PCR and PLS circumvent the collinearity problem because the eigenvectors (called latent variables) derived from the independent block are constrained to be orthogonal. In PCR, X is replaced by X_PCA, its abstract reproduced counterpart.
- X_PCA is obtained by deleting the error eigenvectors after subjecting X to abstract factor analysis.
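A minimal sketch of PCR under these definitions: project the centered X onto its leading principal components, regress y on the resulting orthogonal scores, and map the coefficients back to the original variables. The function name `pcr_fit` and the nearly collinear toy data are assumptions made for illustration.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T               # retained eigenvectors
    T = Xc @ V                            # orthogonal scores
    # T'T is diagonal (sigma^2), so least squares reduces to a division
    gamma = (T.T @ yc) / (sigma[:n_components] ** 2)
    beta = V @ gamma                      # coefficients for the original X
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Two nearly identical (collinear) predictors
rng = np.random.default_rng(2)
t = rng.normal(size=50)
X = np.column_stack([t, t + 1e-3 * rng.normal(size=50)])
y = 3.0 * t + 0.01 * rng.normal(size=50)

beta, intercept = pcr_fit(X, y, n_components=1)
y_hat = X @ beta + intercept
```

Dropping the second component discards the direction in which the two predictors differ only by noise, so the weight is shared evenly between them instead of blowing up as it can in MLR.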

18 Regression Analysis

19 PCR Solution

20 Factor Analysis The covariance matrix Z is constructed by premultiplying the data matrix D by its transpose: Z = D'D. This matrix is then diagonalized by finding a matrix Q such that Q^-1 Z Q is diagonal, with the eigenvalues λ_j on the diagonal; equivalently, Z q_j = λ_j q_j, where q_j is the j-th column of Q. These columns, called eigenvectors, constitute a mutually orthonormal set. Hence Q^-1 = Q'.

21 Factor Analysis The following shows that Q' is identical to C: Q' Z Q = Q' D' D Q = P' P, where P = D Q. Solving for D yields D = P Q'.
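The eigenvector analysis on these two slides can be checked numerically. The sketch below assumes D is the data matrix and takes P = D Q as the row matrix, which is one reading of the surviving symbols; the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(8, 4))           # hypothetical data matrix

Z = D.T @ D                           # data matrix premultiplied by its transpose
eigvals, Q = np.linalg.eigh(Z)        # columns of Q are the eigenvectors q_j

diagonalized = Q.T @ Z @ Q            # should be diag(eigvals)
P = D @ Q                             # row matrix under this reading
D_rebuilt = P @ Q.T                   # Q' plays the role of the column matrix
```

Because Q is orthonormal, multiplying P by Q' recovers D exactly when all eigenvectors are kept; truncating Q to the leading columns gives the reproduced data used in PCR.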

22 Interpreting Factors Principal Component Analysis Principal Axes

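23 PCA: Lessons from Experience
A high eigenvalue does not make a component the best variable.
- A high eigenvalue means that the latent variable shared among the variables is large, not that it is highly correlated with the target variable.
Multicollinearity
- In most cases of high multicollinearity, only one of the variables was selected and used; alternatively, PCA can be run on just the highly collinear variables, and the significant principal axes extracted and used.
A PCA result that uses all of the variables can actually end up overfitted.
- Rather than using all variables, it is better to gather sets of variables that are highly correlated with one another, build the principal axes within each set, and convert each set into a single new variable.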

24 QSPR / QSAR Progress (Selection)
Workflow: Calculation of Descriptors, Descriptor Selection, Generation of Prediction Models, Model Validation, Success.
Descriptor selection methods: Genetic Algorithm, Forward Selection, Backward Elimination, Stepwise Selection, Principal Components Analysis, Partial Least Squares.

25 QSPR / QSAR Progress (Selection)
Workflow: Calculation of Descriptors, Descriptor Selection, Generation of Prediction Models, Model Validation, Success.
Removing High Correlation among the Descriptors

      X1         X2         X3         X4
X1    1          0.22858    -0.82413   -0.24545
X2    0.22858    1          -0.13924   -0.97295
X3    -0.82413   -0.13924   1          0.02954
X4    -0.24545   -0.97295   0.02954    1

Principal Component Analysis: Generation of the New Descriptors
Descriptor Set = {X1, X2, X5}
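One way to flag the descriptor pairs that need attention is to scan the correlation matrix for large absolute off-diagonal values. The sketch below uses the matrix from this slide; the 0.8 threshold and the helper name `correlated_pairs` are assumptions, not values given in the presentation.

```python
import numpy as np

names = ["X1", "X2", "X3", "X4"]
R = np.array([
    [ 1.00000,  0.22858, -0.82413, -0.24545],
    [ 0.22858,  1.00000, -0.13924, -0.97295],
    [-0.82413, -0.13924,  1.00000,  0.02954],
    [-0.24545, -0.97295,  0.02954,  1.00000],
])

def correlated_pairs(R, names, threshold=0.8):
    """List descriptor pairs whose absolute correlation reaches the threshold."""
    pairs = []
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):       # upper triangle only
            if abs(R[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(R[i, j])))
    return pairs

pairs = correlated_pairs(R, names)
```

Each flagged group can then either be reduced to one representative descriptor or replaced by a principal component computed from that group alone.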

26 Partial Least Squares N-Components Repeat PCA

27 X = S Lx

28 Y = Sy Ly

29 Bi-PLS Suppose X(I x J) and y(I x 1) are column-centered (and scaled if necessary) matrices. The first PLS1 component which is calculated to predict y from X solves

30 Bi-PLS By defining
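The equations on the Bi-PLS slides were not preserved. As a hedged illustration of the standard PLS1 result, the sketch below assumes the first component maximizes the covariance between the score X w and y subject to a unit-length w, whose solution is w proportional to X'y. The data and the random comparison vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=30)

# Column-center X and y, as the slide requires
X = X - X.mean(axis=0)
y = y - y.mean()

# Closed-form first weight vector: w = X'y / ||X'y||
w = X.T @ y
w /= np.linalg.norm(w)
t = X @ w                    # first score vector
best = t @ y                 # (unnormalized) covariance achieved by w

# Compare against many random unit-length weight vectors
trials = rng.normal(size=(1000, 6))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
others = (X @ trials.T).T @ y
```

By the Cauchy-Schwarz inequality no unit vector can beat `best`, which equals the norm of X'y.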

31 Bi-PLS Process [Eight numbered steps; the equations are not preserved in the transcript.]

32 PLS Process
(1) Autoscale X and Y
(2) s = y / ||y||
Y-block
(3) ly' = s'Y / (s's)
(4) s = Y ly / (ly' ly)
(5) s = s / ||s||
X-block
(6) lx' = s'X / (s's)
(7) s = X lx
(8) s = s / ||s||
(9) Repeat steps (3) to (8) until s in step (5) converges.

33 PLS Process
(10) lx' = s'X
(11) Ex = X - s lx'
(12) Ey = Y - s ly'
(13) X = Ex
(14) Y = Ey
(15) SSEy = trace(Ey' Ey)
(16) Go to step (2) and extract the next set of eigenvectors if SSEy is too large.
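Steps (1) to (16) above can be sketched in NumPy as follows. This is one possible reading of the slides, not the author's code: y in step (2) is taken to be the first column of Y, ly is refreshed from the converged score before deflation, and a fixed number of components replaces the "SSEy is too large" test. The function name `nipals_pls` and the toy data are hypothetical.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Iterative PLS following the slide's step numbering."""
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # (1) autoscale
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    scores, x_loadings, y_loadings, sse_y = [], [], [], []
    for _ in range(n_components):
        s = Y[:, 0] / np.linalg.norm(Y[:, 0])             # (2) starting score
        s_prev = np.zeros_like(s)
        for _ in range(max_iter):
            ly = Y.T @ s / (s @ s)                        # (3)
            s = Y @ ly / (ly @ ly)                        # (4)
            s /= np.linalg.norm(s)                        # (5)
            converged = np.linalg.norm(s - s_prev) < tol  # (9) test on step-5 s
            s_prev = s.copy()
            lx = X.T @ s / (s @ s)                        # (6)
            s = X @ lx                                    # (7)
            s /= np.linalg.norm(s)                        # (8)
            if converged:
                break
        lx = X.T @ s                                      # (10)
        ly = Y.T @ s          # assumption: refresh ly from the final score
        X = X - np.outer(s, lx)                           # (11), (13)
        Y = Y - np.outer(s, ly)                           # (12), (14)
        sse_y.append(float(np.trace(Y.T @ Y)))            # (15)
        scores.append(s)
        x_loadings.append(lx)
        y_loadings.append(ly)
    return (np.column_stack(scores), np.array(x_loadings),
            np.array(y_loadings), sse_y)

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 5))
B = rng.normal(size=(5, 2))
Y = X @ B + 0.1 * rng.normal(size=(40, 2))
S, Lx, Ly, sse_y = nipals_pls(X, Y, n_components=3)
```

Deflating X by each unit-length score makes successive scores mutually orthogonal, and the residual sum of squares of Y can only decrease as components are added.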

34 PLS Basic concept [Diagram relating X, the weights w, the scores T = (t1, t2, ..., tn), the regression coefficients B, and Y.]

35 3-way Array & Singular Value Decomposition X

36 3-way PLS Bro generalizes equation (1) to the three-way case. The weight vectors w^J and w^K can be solved for in a very elegant way by defining Z as the matrix with typical element z_jk = Σ_i y_i x_ijk and using the SVD of Z. It can be shown that w^J and w^K are equal to the first left and right singular vectors of Z, respectively.
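A short sketch of this step, assuming the typical element of Z is the sum over samples of y_i times x_ijk (the formula itself did not survive in the transcript, so this follows the usual statement of the tri-linear PLS1 algorithm). Array sizes and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_j, n_k = 20, 5, 4
X = rng.normal(size=(n_samples, n_j, n_k))   # 3-way array: samples x J x K
y = rng.normal(size=n_samples)

# Center across the sample mode
X -= X.mean(axis=0)
y -= y.mean()

# Z[j, k] = sum_i y[i] * X[i, j, k]
Z = np.einsum("i,ijk->jk", y, X)

# First left and right singular vectors give the two weight vectors
U, sigma, Vt = np.linalg.svd(Z)
w_j, w_k = U[:, 0], Vt[0]

# Score of each sample: t[i] = sum_jk X[i, j, k] * w_j[j] * w_k[k]
t = np.einsum("ijk,j,k->i", X, w_j, w_k)
covariance = t @ y
```

The covariance reached by these weights equals the largest singular value of Z, which is the maximum attainable with unit-length weight vectors.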

37 Proof

38 3-way PLS Process

39 3D-QSAR Method by Solvation Free Energy Density We developed a new Solvation Free Energy Density (SFED)-based 3D-QSAR method using partial least squares (PLS). The method was applied to 96 Protein Tyrosine Phosphatase 1B (PTP 1B) inhibitors for validation. In this model, the hydration free energy density was used as the field. By comparison with MLR, the statistics of the PLS model were excellent.
- Test set (MLR), test set (PLS).

40 Compounds Used Protein Tyrosine Phosphatase 1B inhibitors. The in vitro inhibitory activity was used as the dependent variable. Training set: 77 compounds. Test set: 19 compounds. Ref: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267. Alignment Rules

41 Solvation Free Energy Density (SFED)
- Creating a cavity of the solute size in water
- Interaction of the solute with its environment
q: the MPEOE charges
Pol(a): atomic polarizability (Å^3), obtained with CDEAP (the Charge Dependence of the Effective Atomic Polarizability)
Solvent-accessible surface area (SASA): to include the surface-area effect, the number of sampled points is used in this work.

42 Generation of SFED
Deleting low-correlation densities:
- Deleting zero density values.
- Checking the insoluble problem in Multi-Regression Analysis (MRA).
- Simple regression test.
- Genetic algorithm.

43 Result (Multi Linear Regression)

44 Result (Partial Least Squares)

45 Problems of SFED-based CoMFA
SFED-based CoMFA does not show a common region that represents the differences in activity.
The number of components is still too large.
- The selected number of factors is 27.

46 Solvation Free Energy Density (SFED)
- Creating a cavity of the solute size in water
- Interaction of the solute with its environment
q: the MPEOE charges
Pol(a): atomic polarizability (Å^3), obtained with CDEAP (the Charge Dependence of the Effective Atomic Polarizability)
Solvent-accessible surface area (SASA): to include the surface-area effect, the number of sampled points is used in this work.

47 Partial Least Squares

48 Modified CoMFA Method using SFED

49 To use Basis functions; To use Region Selection
- To use basis functions of SFED: represents the common region showing molecular differences.
- To use the cross-validated R^2 guided region selection method: decreases the number of variables.
- To use the 3-way Partial Least Squares method: decreases the number of components and uses the data in 3D form.
Partial Least Squares Bioactivity Equation.

50 The Selected Basis Functions
8860 points; 1040 points. Of the 9880 regions in total, only values of -3 or above are displayed.
Grid spacing: 2 Å
X range: -25 to 25 Å, 26 points; minimum -10.14, maximum 7.14
Y range: -19 to 17 Å, 19 points; minimum -5.63, maximum 3.94
Z range: -19 to 19 Å, 20 points; minimum -5.17, maximum 5.65
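The grid dimensions on this slide are consistent with a regular lattice at 2 Å spacing, which a few lines of NumPy can confirm. The variable names are illustrative.

```python
import numpy as np

spacing = 2.0  # grid spacing in angstroms

# Half a step is added to the upper bound so the end point is included
x = np.arange(-25.0, 25.0 + spacing / 2, spacing)
y = np.arange(-19.0, 17.0 + spacing / 2, spacing)
z = np.arange(-19.0, 19.0 + spacing / 2, spacing)

# Total number of grid regions in the box
n_points = x.size * y.size * z.size
```

The three axes give 26, 19 and 20 points, and their product is the 9880 regions quoted on the slide.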

51 The Results of Region Selection method
Q^2 Guided Region Selection method, points retained by q^2 cutoff: cutoff 0.3, 254 points; cutoff 0.4, 114 points; cutoff 0.2, no point count given.
Interpreting the cross-validated r^2_cv (q^2):
- 1.00 = perfect prediction
- above 0.50 = significant statistical results (use results only with care when q^2 > 0.4)
- 0.00 = no model
- negative values = prediction worse than that based on the mean over all compounds

52 Results (Cutoff = 0.4, MLR) Cutoff Level of Q^2 = 0.4. Number of Components = 45. Statistical Method = Multiple Linear Regression. (Cutoff 0.5: 9 points.)
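For reference, q^2 is commonly computed as 1 - PRESS / SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations of y from its mean. The sketch below uses ordinary least squares as a stand-in model; the function name `loo_q2` and the synthetic data are assumptions, and the presentation's own models are PLS and MLR on SFED grid points.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for a linear model with intercept."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                       # leave sample i out
        A = np.column_stack([np.ones(keep.sum()), X[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        pred = coef[0] + X[i] @ coef[1:]               # predict the held-out sample
        press += (y[i] - pred) ** 2
    ss = ((y - y.mean()) ** 2).sum()
    return 1.0 - press / ss

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))
y_signal = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)
y_noise = rng.normal(size=30)

q2_signal = loo_q2(X, y_signal)   # descriptors explain the activity
q2_noise = loo_q2(X, y_noise)     # descriptors unrelated to the activity
```

A model that predicts held-out compounds no better than the overall mean gives a q^2 near or below zero, which is the "no model" end of the scale above.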

53 Results (Cutoff = 0.4, PLS) Cutoff Level of Q^2 = 0.4 Number of Components = 5 Statistical Method = Partial Least Squares

54 Comparison Results

Method      Added descriptors / fields   Components   R^2      R^2 (test set)
CoMFA       -                            3            0.724    0.513
CoMFA       A log P                      3            0.842    0.752
CoMFA       HOMO                         5            0.745    0.537
CoMFA       A log P, HOMO                4            0.735    0.663
CoMSIA      S, E                         2            0.821    0.412
CoMSIA      S, E, H                      4            0.91     0.648
CoMSIA      S, E, H, D, A                5            0.866    0.562
This Work   SFED                         5            0.9012   0.9334

Reference: Murthy, V.S.; Kulkarni, V.M. Bioorg. Med. Chem. 2002, 10, 2267.

64 Generation SFED

65 Deleting Low Correlation densities. Deleting Zero Density Values. Genetic Algorithm. Regression Test. Checking the insoluble problem in MRA.

66 Used Compounds Protein Tyrosine Phosphatase 1B Inhibitors. The in vitro inhibitory activity was used as dependent variable. Training set : 77 compounds. Test set : 19 compounds. Ref : Murthy, V.S.;Kulkarni V.M. Bioorg. Med. Chem. 2002, 10, 2267.

67 Alignment Rules

