1
4th Iranian Chemometrics Workshop (ICW), Zanjan, 2004
2
The Problem of Factor Selection in PCA-Based Calibration Methods
By: Bahram Hemmateenejad
Medicinal & Natural Products Chemistry Research Center, Shiraz University of Medical Sciences
3
Multivariate Calibration
A regression equation relates measurements on m samples to k different variables:
$y = X b$
$y$ ($m \times 1$): dependent variable or predicted variable
$X$ ($m \times k$): independent variables or predictor variables
$b$ ($k \times 1$): regression coefficients
4
Multicomponent Analysis
y: concentration of the analyte
X: recorded analytical signals at k different channels, e.g. absorbance at different wavelengths
QSAR/QSPR Studies
y: chemical property or biological activity
X: molecular descriptors that represent structural features of molecules numerically
5
Problems associated with MLR
Collinearity between the independent variables (X)
The number of independent variables (k) should be much lower than the number of samples (m)
A reduced number of variables must be used
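The collinearity problem can be illustrated with a minimal NumPy sketch. The data here are synthetic and all variable names are hypothetical; the point is only that two nearly identical predictor columns make $X^T X$ ill-conditioned, so its inverse in the MLR solution amplifies noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                   # number of samples
x1 = rng.normal(size=m)
x2 = x1 + 1e-6 * rng.normal(size=m)      # almost an exact copy of x1
x3 = rng.normal(size=m)                  # an unrelated predictor

X_collinear = np.column_stack([x1, x2])
X_independent = np.column_stack([x1, x3])

# Condition number of X^T X: large values mean (X^T X)^-1 amplifies noise
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)
cond_independent = np.linalg.cond(X_independent.T @ X_independent)

print(f"collinear predictors:   {cond_collinear:.3e}")
print(f"independent predictors: {cond_independent:.3e}")
```

The collinear pair should give a condition number many orders of magnitude larger than the independent pair, which is why MLR needs either fewer or orthogonalized variables.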
6
Feature selection
The variables are selected based on their generalization ability using selection methods such as stepwise variable selection, genetic algorithm, simulated annealing, …
Feature extraction
The variables are transformed into new coordinate axes with lower dimension
Principal Component Analysis (PCA) or Factor Analysis (FA)
7
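PCA or FA or PFA
$X = T P$
$X$ ($m \times k$), $T$ ($m \times k$), $P$ ($k \times k$)
$T = [t_1\ t_2\ t_3\ t_4\ t_5\ \dots\ t_k]$: scores
$P^T = [p_1^T\ p_2^T\ p_3^T\ p_4^T\ p_5^T\ \dots\ p_k^T]$: loadings
$\lambda = [\lambda_1\ \lambda_2\ \lambda_3\ \lambda_4\ \lambda_5\ \dots\ \lambda_k]$: eigenvalues
$\lambda_1 > \lambda_2 > \lambda_3 > \lambda_4 > \lambda_5 > \dots > \lambda_k$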
8
Each vector of T or P is called an eigenvector, PC, or factor.
$\lambda_i$ shows the amount of variance in the X matrix that is explained by the corresponding eigenvectors ($t_i$ or $p_i$).
Only a reduced set of PCs is needed to reproduce the original data matrix without losing significant information:
$X_{(m \times k)} \approx T_{(m \times f)} P_{(f \times k)}$
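The decomposition and the reduced reconstruction can be sketched with the singular value decomposition. This is one common way to compute PCA, not necessarily the one used in the talk. The function name `pca`, the synthetic data, and the choice not to mean-center X are assumptions made for illustration.

```python
import numpy as np

def pca(X):
    """Decompose X (m x k) as X = T P using the SVD.

    Returns scores T (columns t_i), loadings P (orthonormal rows p_i)
    and the eigenvalues of X^T X in decreasing order.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s          # scale each left singular vector by its singular value
    P = Vt             # rows are the loading vectors
    eigvals = s ** 2   # lambda_i
    return T, P, eigvals

rng = np.random.default_rng(1)
# Synthetic rank-2 data (10 samples, 6 channels) plus a little noise
X = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 6))
X += 1e-3 * rng.normal(size=X.shape)

T, P, eigvals = pca(X)

f = 2                                  # number of factors kept
X_hat = T[:, :f] @ P[:f, :]            # (m x f)(f x k) reconstruction
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print("eigenvalues:", np.round(eigvals, 4))
print("relative reconstruction error with f = 2:", rel_error)
```

With all k factors the product `T @ P` reproduces X exactly; with f = 2 the residual is at the level of the added noise.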
9
f is the number of significant factors
f is the rank of the original data matrix
f describes the complexity of the X matrix
Ideally, f is the number of nonzero eigenvalues
f can be determined by the theory of FA: scree plot, indicator function, imbedded error, real error, …
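As a rough sketch of how such criteria can be computed, the block below evaluates the real error (RE), imbedded error (IE) and indicator function (IND) from the eigenvalues. The formulas are the ones commonly attributed to Malinowski's factor analysis and are an assumption about what the slide refers to. The function name and the synthetic rank-2 data are hypothetical.

```python
import numpy as np

def malinowski_indicators(X):
    """Return a list of (f, RE, IE, IND) for f = 1 .. k-1.

    Assumed definitions, with m >= k and lambda_j the eigenvalues of X^T X:
      RE(f)  = sqrt( sum_{j>f} lambda_j / (m * (k - f)) )
      IE(f)  = RE(f) * sqrt(f / k)
      IND(f) = RE(f) / (k - f)**2
    """
    if X.shape[0] < X.shape[1]:
        X = X.T                      # the formulas assume rows >= columns
    m, k = X.shape
    eigvals = np.linalg.svd(X, compute_uv=False) ** 2
    results = []
    for f in range(1, k):
        re = np.sqrt(eigvals[f:].sum() / (m * (k - f)))
        ie = re * np.sqrt(f / k)
        ind = re / (k - f) ** 2
        results.append((f, re, ie, ind))
    return results

rng = np.random.default_rng(2)
# Synthetic data: two real factors plus small random noise
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8))
X += 1e-3 * rng.normal(size=X.shape)

table = malinowski_indicators(X)
best_f = min(table, key=lambda row: row[3])[0]   # f at the minimum of IND
for f, re, ie, ind in table:
    print(f"f={f}  RE={re:.2e}  IE={ie:.2e}  IND={ind:.2e}")
print("IND minimum at f =", best_f)
```

In this sketch the minimum of IND is taken as the estimate of f; on real data the different criteria do not always agree.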
10
PCA-Based regression method
MLR (Classical Least Squares)
$y = X b$
$b = (X^T X)^{-1} X^T y$
$y_{new} = x_{new} b$
Principal Component Regression (PCR)
$X = T P$
$y = T b$
$b = (T^T T)^{-1} T^T y$
$t_{new} = x_{new} P^T$
$y_{new} = t_{new} b$
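A minimal NumPy sketch of both regressions follows. It assumes the convention that the rows of P are orthonormal loading vectors, so scores are obtained as $T = X P^T$; the function names and the synthetic data are hypothetical, and no intercept or mean-centering is used, in line with $y = X b$.

```python
import numpy as np

def mlr_fit(X, y):
    """b = (X^T X)^-1 X^T y, solved without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def pcr_fit(X, y, f):
    """Fit PCR on the first f principal components of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:f, :]                          # (f x k) loadings
    T = X @ P.T                            # (m x f) scores
    b = np.linalg.solve(T.T @ T, T.T @ y)  # b = (T^T T)^-1 T^T y
    return P, b

def pcr_predict(x_new, P, b):
    t_new = x_new @ P.T                    # project new samples onto the PCs
    return t_new @ b

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.05 * rng.normal(size=15)
x_new = rng.normal(size=(3, 4))

y_mlr = x_new @ mlr_fit(X, y)
P_full, b_full = pcr_fit(X, y, f=4)        # all factors: same model as MLR
y_pcr_full = pcr_predict(x_new, P_full, b_full)
P_red, b_red = pcr_fit(X, y, f=2)          # reduced model
y_pcr_red = pcr_predict(x_new, P_red, b_red)
print(y_mlr, y_pcr_full, y_pcr_red)
```

When every factor is kept, PCR is only a rotation of MLR and gives the same predictions; the two differ once factors are dropped.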
11
Some Questions
1. How many PCs must be used in PCR?
2. Which PCs should be considered in PCR modeling?
3. Is the magnitude of an eigenvalue necessarily a measure of its significance for the calibration?
Significance of factor selection
12
Top-down eigenvalue ranking (ER)
Factors are entered into the model one after the other in order of decreasing eigenvalue.
Each time a new factor is entered, the regression model is built and its performance is validated by existing procedures such as cross-validation.
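The ER procedure can be sketched as a loop over the number of leading factors, scored here by leave-one-out cross-validation (PRESS). The choice of leave-one-out, the function names, and the synthetic data are assumptions for illustration; the PCs are recomputed inside each cross-validation fold.

```python
import numpy as np

def pcr_predict_subset(X_train, y_train, X_test, pcs):
    """PCR using only the PCs whose indices are listed in pcs."""
    U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
    P = Vt[list(pcs), :]                   # selected loadings
    T = X_train @ P.T                      # training scores
    b = np.linalg.lstsq(T, y_train, rcond=None)[0]
    return X_test @ P.T @ b

def loo_press(X, y, pcs):
    """Leave-one-out prediction error sum of squares."""
    m = X.shape[0]
    press = 0.0
    for i in range(m):
        keep = np.arange(m) != i
        y_hat = pcr_predict_subset(X[keep], y[keep], X[i:i + 1], pcs)
        press += float((y[i] - y_hat[0]) ** 2)
    return press

def eigenvalue_ranking(X, y, max_factors):
    """PRESS for models with the first 1, 2, ..., max_factors PCs."""
    return [loo_press(X, y, range(f)) for f in range(1, max_factors + 1)]

rng = np.random.default_rng(3)
scores = rng.normal(size=(20, 3))
X = scores @ rng.normal(size=(3, 6)) + 1e-3 * rng.normal(size=(20, 6))
y = scores @ np.array([1.0, 2.0, 3.0]) + 1e-3 * rng.normal(size=20)

press = eigenvalue_ranking(X, y, max_factors=5)
for f, p in enumerate(press, start=1):
    print(f"first {f} PCs: PRESS = {p:.4g}")
```

The number of factors would then be chosen where PRESS reaches its minimum or stops improving noticeably.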
21
Top-down Correlation Ranking (CR)
First, the correlation between each factor and the dependent variable (concentration, y) is determined.
Then, the factors are entered into the model consecutively in order of decreasing correlation.
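The ranking step of CR can be sketched as follows. Ranking by the absolute value of the Pearson correlation is an assumption, as are the function name and the synthetic data, in which y is deliberately tied to a low-variance PC.

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank the PCs of X by |correlation| between their scores and y.

    Returns (order, r): PC indices from most to least correlated, and the
    correlation coefficient of every PC in eigenvalue order.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s                                        # scores, one column per PC
    r = np.array([np.corrcoef(T[:, i], y)[0, 1] for i in range(T.shape[1])])
    order = np.argsort(-np.abs(r))                   # decreasing |r|
    return order, r

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# y depends almost entirely on the 4th PC (index 3), which has a small eigenvalue
y = (U * s)[:, 3] + 0.01 * rng.normal(size=25)

order, r = correlation_ranking(X, y)
print("PCs in order of decreasing |r|:", order)
print("correlations in eigenvalue order:", np.round(r, 3))
```

Here eigenvalue ranking would enter three less relevant factors before reaching the informative one, whereas correlation ranking enters it first. The factors in `order` would then be added one at a time and each model validated, as in the ER procedure.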
24
Other factor selection methods
Stepwise selection procedure
Search algorithms
Simulated annealing
Genetic algorithm
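Of these, a stepwise procedure is the simplest to sketch. The block below is a forward selection over PC scores that adds, at each step, the factor giving the lowest leave-one-out PRESS and stops when no candidate improves it. The stopping rule, the function names, and the simplification of computing the PCs once on the full data set are all assumptions, not the procedure of any cited paper.

```python
import numpy as np

def loo_press_scores(T, y, cols):
    """Leave-one-out PRESS of a regression of y on the score columns cols."""
    m = T.shape[0]
    press = 0.0
    for i in range(m):
        keep = np.arange(m) != i
        b = np.linalg.lstsq(T[keep][:, cols], y[keep], rcond=None)[0]
        press += float((y[i] - T[i, cols] @ b) ** 2)
    return press

def forward_select(T, y, max_factors):
    """Greedy forward selection of score columns by PRESS."""
    selected, best_press = [], np.inf
    candidates = list(range(T.shape[1]))
    while candidates and len(selected) < max_factors:
        trial = {j: loo_press_scores(T, y, selected + [j]) for j in candidates}
        j_best = min(trial, key=trial.get)
        if trial[j_best] >= best_press:
            break                      # no candidate improves the model
        selected.append(j_best)
        candidates.remove(j_best)
        best_press = trial[j_best]
    return selected, best_press

rng = np.random.default_rng(5)
X = rng.normal(size=(25, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s
# y depends on the 2nd and 5th PCs only (indices 1 and 4)
y = 2.0 * T[:, 1] - T[:, 4] + 0.01 * rng.normal(size=25)

selected, best_press = forward_select(T, y, max_factors=4)
print("selected PCs:", selected, "PRESS:", best_press)
```

Simulated annealing and genetic algorithms search the same space of PC subsets but can escape the local optima that a greedy search may stop at.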
25
Some references
1. Xie YL, Kalivas JH. Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta 1997; 348: 19-27.
2. Sutter JM, Kalivas JH. Which principal components to utilize for principal component regression. J. Chemometrics 1992; 6: 217-225.
3. Sun J. A correlation principal component regression analysis of NIR data. J. Chemometrics 1995; 9: 21-29.
4. Depczynski U, Frost VJ, Molt K. Genetic algorithms applied to the selection of factors in principal component regression. Anal. Chim. Acta 2000; 420: 217-227.
5. Barros AS, Rutledge DN. Genetic algorithm applied to the selection of principal components. Chemometrics Intell. Lab. Syst. 1998; 40: 65-81.
6. Verdu-Andres J, Massart DL. Comparison of prediction- and correlation-based methods to select the best subset of principal components for principal component regression and detect outlying objects. Appl. Spect. 1998; 52: 1425-1434.
7. Xie YL, Kalivas JH. Local prediction models by principal component regression. Anal. Chim. Acta 1997; 348: 29-38.
8. Ferre L. Selection of components in principal component analysis: a comparison of methods. Comput. Stat. Data Anal. 1995; 19: 669-682.
26
A QSPR example
Quantitative Structure-Electrochemistry Relationship Study of Some Organic Compounds
Dependent variable: half-wave reduction potential ($E_{1/2}$) of 69 compounds
Independent variables: 1150 theoretical molecular descriptors calculated by DRAGON software
30
Principal Component-Artificial Neural Network (PC-ANN)
ANN is a nonlinear, non-parametric modeling method.
Feature selection is more important for ANN.
Feature selection-based ANN modeling is a complex procedure.
Orthogonalizing the variables before introducing them to the network substantially decreases the computational time and increases the overall performance of the ANN.
PC-ANN is a feature extraction-based algorithm.
31
PC-GA-ANN Algorithm
A genetic algorithm was applied to the selection of factors in PC-ANN modeling. The set of PCs selected by GA modeled the structure-antagonist activity of the calcium channel blockers better than the ER procedure did.
B. Hemmateenejad, M. Akhond, R. Miri, M. Shamsipur, J. Chem. Inf. Comput. Sci. 43 (2003) 1328.
How can the factors be ranked based on their correlation coefficient in PC-ANN?
32
CR-PC-ANN Algorithm
Correlation ranking procedure for factor selection in PC-ANN modeling.
The nonlinear relationship between each PC and the dependent variable (y) was modeled by a separate ANN model.
The subset of PCs selected by CR was found to be nearly the same as that selected by GA, so the results of the two factor selection procedures were similar.
B. Hemmateenejad, Chemometrics and Intelligent Laboratory Systems, 2004, accepted.
33
1. Application of ab initio theory to QSAR study of the 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures, B. Hemmateenejad, M.A. Safarpour, R. Miri, F. Taghavi, Journal of Computational Chemistry 25 (2004) 1495.
2. Highly Correlating Distance-Connectivity-Based Topological Indices. 2: Prediction of 15 Properties of a Large Set of Alkanes Using a Stepwise Factor Selection-Based PCR Analysis, M. Shamsipur, R. Ghavami, B. Hemmateenejad, H. Sharghi, QSAR & Combinatorial Science, 2004, accepted.
3. Quantitative Structure-Electrochemistry Relationship Study of some Organic Compounds using PCR and PC-ANN, B. Hemmateenejad, M. Shamsipur, Internet Electronic Journal of Molecular Design 3 (2004) 316.
4. Toward an Optimal Procedure for PC-ANN Model Building: Prediction of the Carcinogenic Activity of a Large Set of Drugs, B. Hemmateenejad, M.A. Safarpour, R. Miri, N. Nesari, Journal of Chemical Information and Computer Sciences, revised.
5. Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR, B. Hemmateenejad, Journal of Chemometrics, submitted.
34
Future Works
1. Selection of latent variables in PLS
2. Application of other selection algorithms such as the successive projections algorithm
3. Comparison between the importance of factor selection in multicomponent analysis and QSAR/QSPR studies
4. Application of factor selection-based ANN modeling in multicomponent analysis
5. Validation of the different factor selection algorithms by new criteria