4th Iranian Chemometrics Workshop (ICW), Zanjan, 2004



The Problem of Factor Selection in PCA-Based Calibration Methods
By: Bahram Hemmateenejad, Medicinal & Natural Products Chemistry Research Center, Shiraz University of Medical Sciences

Multivariate Calibration
A regression equation relates measurements on m samples to k variables:
y = X b
y (m × 1): dependent (predicted) variable
X (m × k): independent (predictor) variables
b (k × 1): regression coefficients
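
The calibration equation above can be sketched numerically. This is a minimal illustration with hypothetical data (the sizes, seed, and coefficient values are made up); in the noise-free case the least-squares estimate recovers b exactly:

```python
import numpy as np

# Hypothetical data: m = 10 samples, k = 3 predictor variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # X (m x k): independent variables
b_true = np.array([1.0, -2.0, 0.5])   # b (k x 1): regression coefficients
y = X @ b_true                        # y (m x 1): dependent variable

# Least-squares estimate of b from y = X b
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_hat, b_true))     # True for this noise-free example
```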

Multicomponent Analysis
y: concentration of the analyte
X: analytical signals recorded at k different channels, e.g. absorbance at different wavelengths
QSAR/QSPR Studies
y: chemical property or biological activity
X: molecular descriptors representing the structural features of the molecules numerically

Problems associated with MLR
Collinearity between the independent variables (X)
The number of independent variables (k) should be much lower than the number of samples (m)
A reduced number of variables must therefore be used
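
The collinearity problem can be made concrete: when two columns of X are nearly proportional, the matrix XᵀX that MLR must invert becomes ill-conditioned. A small hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=20)
# Second variable is almost an exact multiple of the first (collinear)
X = np.column_stack([x1, 2.0 * x1 + 1e-8 * rng.normal(size=20)])

# The (X^T X)^-1 step of the MLR solution becomes numerically unstable:
cond = np.linalg.cond(X.T @ X)
print(cond > 1e10)   # True: the condition number explodes
```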

Feature selection: variables are selected based on their generalization ability, using selection methods such as stepwise variable selection, genetic algorithms, simulated annealing, …
Feature extraction: the variables are transformed into new coordinate axes of lower dimension, e.g. by Principal Component Analysis (PCA) or Factor Analysis (FA)

PCA, FA, or PFA
X = T Pᵀ
X (m × k), T (m × k), P (k × k)
T = [t1 t2 t3 t4 t5 … tk]   score vectors
Pᵀ = [p1ᵀ p2ᵀ p3ᵀ p4ᵀ p5ᵀ … pkᵀ]   loading vectors
Λ = [λ1 λ2 λ3 λ4 λ5 … λk]   eigenvalues, with λ1 > λ2 > λ3 > λ4 > λ5 > … > λk

Each vector of T (or P) is called an eigenvector, PC, or factor.
λi gives the amount of variance in the X matrix explained by the corresponding eigenvectors (ti or pi).
A reduced set of PCs is sufficient to reproduce the original data matrix without losing significant information:
X (m × k) ≈ T (m × f) Pᵀ (f × k)
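
The truncated decomposition above can be sketched with numpy's SVD (scores T = U·S, loadings P = V; the data here are hypothetical and built to have rank 2, so two factors reproduce X without loss):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data of rank 2: m = 8 samples, k = 5 variables
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 5))

# PCA via SVD: X = U S V^T, scores T = U S, loadings P = V
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U * s, Vt.T
eigvals = s**2                          # proportional to the explained variances

f = 2                                   # number of significant factors
X_rec = T[:, :f] @ P[:, :f].T           # X (m x k) ~ T (m x f) P^T (f x k)
print(np.allclose(X, X_rec))            # True: two factors reproduce X exactly
```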

f is the number of significant factors; it is the rank of the original data matrix and describes the complexity of X.
Ideally, f is the number of nonzero eigenvalues.
f can be determined by the methods of factor analysis theory: scree plot, indicator function, imbedded error, real error, …
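
One simple way to estimate f, sketched here on hypothetical rank-3 data with a small amount of noise (the variance-fraction cutoff 1e-6 is an arbitrary choice for this example, not one of the formal criteria named above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 10))   # true rank 3
X += 1e-6 * rng.normal(size=X.shape)                      # small measurement noise

s = np.linalg.svd(X, compute_uv=False)
eigvals = s**2                        # eigenvalues, already in decreasing order

# Count eigenvalues carrying more than a (hypothetical) fraction of the variance
f = int(np.sum(eigvals / eigvals.sum() > 1e-6))
print(f)   # 3 significant factors
```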

PCA-based regression methods
MLR (classical least squares):
y = X b
b = (Xᵀ X)⁻¹ Xᵀ y
y_new = x_new b
Principal Component Regression (PCR):
X = T Pᵀ
y = T b
b = (Tᵀ T)⁻¹ Tᵀ y
t_new = x_new P
y_new = t_new b
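
The PCR equations above translate directly into code. A minimal sketch on hypothetical rank-f data (so the f-factor model is exact); the new sample is drawn from the same factor structure so its prediction matches the true response:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, f = 20, 6, 3
W = rng.normal(size=(f, k))                # hypothetical hidden factor structure
X = rng.normal(size=(m, f)) @ W            # collinear X of rank f
b_true = rng.normal(size=k)
y = X @ b_true

# PCR: decompose X, then regress y on the first f score vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = (U * s)[:, :f], Vt.T[:, :f]         # scores T (m x f), loadings P (k x f)
b = np.linalg.solve(T.T @ T, T.T @ y)      # b = (T^T T)^-1 T^T y

# Prediction for a new sample: project onto the loadings, then apply b
x_new = rng.normal(size=(1, f)) @ W        # new sample from the same factor space
t_new = x_new @ P                          # t_new = x_new P
y_new = t_new @ b                          # y_new = t_new b
print(np.allclose(y_new, x_new @ b_true))  # True: PCR recovers the response
```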

Some questions on the significance of factor selection:
1. How many PCs must be used in PCR?
2. Which PCs should be considered in PCR modeling?
3. Is the magnitude of an eigenvalue necessarily a measure of its significance for the calibration?

Top-down eigenvalue ranking (ER)
Factors enter the model one after the other, in order of decreasing eigenvalue.
Each time a new factor is entered, the regression model is rebuilt and its performance is validated by an existing procedure such as cross-validation.
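
The ER procedure can be sketched as follows: grow the model by one factor at a time in eigenvalue order and score each model size by leave-one-out PRESS. The data, seed, and sizes are hypothetical; `press_loo` is an illustrative helper, not a named method from the talk:

```python
import numpy as np

def press_loo(T, y):
    """Leave-one-out PRESS (predicted residual sum of squares)."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b = np.linalg.lstsq(T[mask], y[mask], rcond=None)[0]
        press += float((y[i] - T[i] @ b) ** 2)
    return press

rng = np.random.default_rng(5)
m, k, f_true = 25, 8, 3
X = rng.normal(size=(m, f_true)) @ rng.normal(size=(f_true, k))
y = X @ rng.normal(size=k) + 0.01 * rng.normal(size=m)

U, s, _ = np.linalg.svd(X, full_matrices=False)
T = U * s                                  # scores, in decreasing-eigenvalue order

# Top-down ER: enter factors by decreasing eigenvalue, validate each model by LOO-CV
press = [press_loo(T[:, :n], y) for n in range(1, k + 1)]
f_best = int(np.argmin(press)) + 1         # model size with the lowest PRESS
print(f_best)
```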

Top-down correlation ranking (CR)
First, the correlation between each factor and the dependent variable (concentration, y) is determined.
Then the factors enter the model consecutively, in order of decreasing correlation.
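
A minimal sketch of the CR ranking step, with hypothetical scores in which the response is driven by a factor that eigenvalue ranking would enter late:

```python
import numpy as np

rng = np.random.default_rng(6)
m, k = 30, 6
T = rng.normal(size=(m, k))                   # hypothetical score vectors
y = 2.0 * T[:, 4] + 0.1 * rng.normal(size=m)  # y driven by a low-ranked factor

# CR: rank the factors by |correlation with y| instead of by eigenvalue
r = np.array([np.corrcoef(T[:, j], y)[0, 1] for j in range(k)])
order = np.argsort(-np.abs(r))
print(int(order[0]))   # 4: the factor most correlated with y enters the model first
```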

Other factor selection methods
Stepwise selection procedures
Search algorithms: simulated annealing, genetic algorithms

Some references
1. Xie YL, Kalivas JH. Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta 1997; 348.
2. Sutter JM, Kalivas JH. Which principal components to utilize for principal component regression. J. Chemometrics 1992; 6.
3. Sun J. A correlation principal component regression analysis of NIR data. J. Chemometrics 1995; 9.
4. Depczynski U, Frost VJ, Molt K. Genetic algorithms applied to the selection of factors in principal component regression. Anal. Chim. Acta 2000; 420.
5. Barros AS, Rutledge DN. Genetic algorithm applied to the selection of principal components. Chemometrics Intell. Lab. Syst. 1998; 40.
6. Verdu-Andres J, Massart DL. Comparison of prediction- and correlation-based methods to select the best subset of principal components for principal component regression and detect outlying objects. Appl. Spect. 1998; 52.
7. Xie YL, Kalivas JH. Local prediction models by principal component regression. Anal. Chim. Acta 1997; 348.
8. Ferre L. Selection of components in principal component analysis: a comparison of methods. Comput. Stat. Data Anal. 1995; 19.

A QSPR example: Quantitative Structure-Electrochemistry Relationship Study of Some Organic Compounds
Dependent variable: half-wave reduction potentials (E1/2) of 69 compounds
Independent variables: 1150 theoretical molecular descriptors calculated by the DRAGON software

Principal Component-Artificial Neural Network (PC-ANN)
ANN is a nonlinear, non-parametric modeling method.
Feature selection is even more important for ANNs, but feature selection-based ANN modeling is a complex procedure.
Orthogonalizing the variables before introducing them to the network substantially decreases the computational time and increases the overall performance of the ANN.
PC-ANN is a feature extraction-based algorithm.
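
A toy sketch of the PC-ANN idea: collinear descriptors are replaced by a few orthogonal PC scores, which then feed a small neural network. The network below is a deliberately minimal one-hidden-layer model trained by full-batch gradient descent (all data, sizes, and learning settings are hypothetical, not the architecture used in the talk):

```python
import numpy as np

rng = np.random.default_rng(7)
m, k, f = 40, 10, 3
X = rng.normal(size=(m, f)) @ rng.normal(size=(f, k))   # collinear descriptors
y = np.tanh(X @ rng.normal(size=k))                     # nonlinear property

# Feature extraction: a few orthogonal PC scores replace the raw inputs
U, s, _ = np.linalg.svd(X, full_matrices=False)
T = (U * s)[:, :f]
T = (T - T.mean(0)) / T.std(0)                          # standardize network inputs

# Minimal one-hidden-layer network (tanh hidden units, linear output)
W1 = 0.1 * rng.normal(size=(f, 8))
W2 = 0.1 * rng.normal(size=8)
mse = lambda: float(np.mean((np.tanh(T @ W1) @ W2 - y) ** 2))
mse_before = mse()
for _ in range(1000):
    H = np.tanh(T @ W1)                                 # hidden activations
    err = H @ W2 - y                                    # output residuals
    gW2 = H.T @ err / m                                 # gradient w.r.t. W2
    gW1 = T.T @ ((err[:, None] * W2) * (1.0 - H**2)) / m  # backprop to W1
    W2 -= 0.02 * gW2
    W1 -= 0.02 * gW1
mse_after = mse()
print(mse_after < mse_before)   # True: training reduces the fitting error
```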

PC-GA-ANN algorithm
Genetic algorithm applied to the selection of factors in PC-ANN modeling: the set of PCs selected by GA could model the structure-antagonist activity of the calcium channel blockers better than the ER procedure.
B. Hemmateenejad, M. Akhond, R. Miri, M. Shamsipur, J. Chem. Inf. Comput. Sci. 43 (2003).
Open question: how should the factors be ranked based on their correlation coefficients in PC-ANN?

CR-PC-ANN algorithm
Correlation ranking procedure for factor selection in PC-ANN modeling: the nonlinear relationship between each PC and the dependent variable (y) was modeled by a separate ANN.
The subset of PCs selected by CR was essentially the same as that selected by GA, so the results of the two factor selection procedures were similar.
B. Hemmateenejad, Chemometrics Intelligent Laboratory Systems, 2004, accepted.

1. Application of ab initio theory to QSAR study of the 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures, B. Hemmateenejad, M.A. Safarpour, R. Miri, F. Taghavi, Journal of Computational Chemistry 25 (2004).
2. Highly correlating distance-connectivity-based topological indices. 2: Prediction of 15 properties of a large set of alkanes using a stepwise factor selection-based PCR analysis, M. Shamsipur, R. Ghavami, B. Hemmateenejad, H. Sharghi, QSAR Combinatorial Sciences, 2004, accepted.
3. Quantitative structure-electrochemistry relationship study of some organic compounds using PCR and PC-ANN, B. Hemmateenejad, M. Shamsipur, Internet Electronic Journal of Molecular Design 3 (2004).
4. Toward an optimal procedure for PC-ANN model building: prediction of the carcinogenic activity of a large set of drugs, B. Hemmateenejad, M.A. Safarpour, R. Miri, N. Nesari, Journal of Chemical Information and Computer Sciences, revised.
5. Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR, B. Hemmateenejad, Journal of Chemometrics, submitted.

Future Works
1. Selection of latent variables in PLS
2. Application of other selection algorithms, such as the successive projections algorithm
3. Comparison between the importance of factor selection in multicomponent analysis and in QSAR/QSPR studies
4. Application of factor selection-based ANN modeling in multicomponent analysis
5. Validation of the different factor selection algorithms by new criteria