Multivariate Data Analysis Principal Component Analysis
Principal Component Analysis (PCA): singular value decomposition; eigenvector / eigenvalue calculation
Data matrix X (I×K): reduce variables, improve projections, remove noise, find outliers, find classes
PCA example with 2 variables, 6 objects: find the best (most informative) direction in space, describe the direction, make the projection
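[Figure: the 6 objects plotted on axes x1 and x2, with the 1st PC drawn through them; each object has a score along the PC and a residual off it]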
1st PC as a unit vector: loading p1 = cos(θ), loading p2 = sin(θ), where θ is the angle between the 1st PC and the x1 axis
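The two-variable picture can be checked numerically. The sketch below is only an illustration: the six objects are made-up numbers, not the data behind the slide, and the angle between the 1st PC and the x1 axis is called `theta` here. It shows that the two loadings are the cosine and sine of that angle, that the loading vector has length 1, and that the residuals are orthogonal to the PC.

```python
import numpy as np

# Six hypothetical objects measured on two variables (x1, x2); not from the slides.
X = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.8], [5.0, 5.1], [6.0, 5.9]])
Xc = X - X.mean(axis=0)            # mean-center each variable

# The first right singular vector is the direction of largest variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p = Vt[0]
if p[0] < 0:                       # the sign of a PC is arbitrary; fix it for readability
    p = -p

theta = np.arctan2(p[1], p[0])     # angle between the 1st PC and the x1 axis
t = Xc @ p                         # scores: projections of the objects onto the PC
E = Xc - np.outer(t, p)            # residuals: what the projection leaves out

print("loadings p1, p2:", p)
print("cos(theta), sin(theta):", np.cos(theta), np.sin(theta))
print("length of p:", np.linalg.norm(p))
```

Flipping the sign of `p` also flips the sign of every score, so the product of score and loading, and therefore the residual, is unchanged.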
[Figure: X (I×K) with a score vector t (one element per object i) and a loading vector p (one element per variable k)]
$X = t_1 p_1' + t_2 p_2' + \dots + t_A p_A' + E$
$X = TP' + E$
X: properly preprocessed data matrix (I×K)
T: score matrix (I×A)
P: loading matrix (K×A)
E: residual matrix (I×K)
$t_a$: score vector
$p_a$: loading vector
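One way to obtain T, P and E is through the singular value decomposition. The sketch below assumes NumPy, a random stand-in for X, and autoscaling as the preprocessing step; the sizes I, K and A are arbitrary. It builds the model both as a sum of rank-one terms and as the matrix product, and the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, A = 10, 5, 2                      # objects, variables, components (illustrative sizes)
X = rng.normal(size=(I, K))             # synthetic stand-in for a data matrix
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # preprocessing: autoscale

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :A] * s[:A]                    # score matrix, I x A
P = Vt[:A].T                            # loading matrix, K x A
E = X - T @ P.T                         # residual matrix, I x K

# The sum of the rank-one terms t_a p_a' equals the matrix product T P'
model = sum(np.outer(T[:, a], P[:, a]) for a in range(A))

print(T.shape, P.shape, E.shape)
print(np.allclose(model, T @ P.T))
```

The loading vectors come out orthonormal and the score vectors orthogonal, which is what lets each component be interpreted on its own.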
The Wine Example People magazine Wise & Gallagher
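[Table: wine data. Objects (rows): France, Italy, Switz, Austra, Brit, U.S.A., Russia, Czech, Japan, Mexico. Variables (columns): Wine, Beer, Spirit, LifeEx, HeartD]
[Table: mean and standard deviation of each variable: Beer, Wine, Spirit, LifeEx, HeartD]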
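[Figure: singular value vs. component number. Explained: component 1 = 46%, 2 = 32%, 3 = 12%, 4 = 8%, 5 = 2%]
[Figure: score plot, Score 1 (46%) vs. Score 2 (32%); points labelled France, Italy, Switz, Austral, Brit, USA, Russia, Czech, Japan, Mex]
[Figure: loading plot, Loading 1 vs. Loading 2; points labelled Wine, Beer, Spirit, Life exp., Heart dis.]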
Conclusions
Scores = positions of objects in multivariate space
Loadings = importance of original variables for new directions
Try to explain a large enough portion of X (46 + 32 = 78%)
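The 78% figure is simply the running total of the per-component percentages reported for the wine data. A small sketch of that arithmetic follows; the 75% target is a hypothetical threshold chosen for illustration, since the slides do not say what counts as "large enough".

```python
import numpy as np

# Percent of X explained per component, as reported for the wine example
explained = np.array([46, 32, 12, 8, 2])
cumulative = np.cumsum(explained)

target = 75   # hypothetical threshold for "large enough", not from the slides
n_comp = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.tolist())   # [46, 78, 90, 98, 100]
print(n_comp)                # 2
```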
The Apricot Example Manley & Geladi
[Figure: spectra, pseudoabsorbance vs. wavelength (nm); labelled "Appelkoos" (Afrikaans for apricot)]
[Figure: scree plot, singular value vs. component number]
What is rank?
Mathematical rank: at most min(I, K); using that many components gives zero residual
Effective rank = A: separates model from noise
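The difference between the two kinds of rank shows up clearly on simulated data. The sketch below is an illustration under stated assumptions: the data are synthetic, built as a rank-3 structure plus a little noise, and no mean-centering is applied, to keep the example short. With min(I, K) components the residual vanishes, while A = 3 components already capture nearly everything that is not noise.

```python
import numpy as np

rng = np.random.default_rng(1)
I, K, A = 20, 8, 3
# Synthetic data: a rank-A structure plus small noise (assumption for illustration)
structure = rng.normal(size=(I, A)) @ rng.normal(size=(A, K))
X = structure + 0.01 * rng.normal(size=(I, K))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def residual_ss(n_comp):
    """Sum of squares left in E after n_comp components."""
    model = (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp]
    return float(np.sum((X - model) ** 2))

math_rank = np.linalg.matrix_rank(X)    # the noise makes X numerically full rank
total_ss = float(np.sum(X ** 2))

print("mathematical rank:", math_rank)
for n in range(1, min(I, K) + 1):
    print(n, "components, residual fraction:", residual_ss(n) / total_ss)
```

On real data the drop after the effective rank is rarely this sharp, which is why the slides treat choosing A as a judgement based on several diagnostics.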
[Table: ANOVA. Columns: Comp #, SS, SS%, SS% cum; final row: Total]
[Figure: score plot, Score 1 (98%) vs. Score 2 (2%)]
ANOVA
$SS_{tot} = SS_1 + SS_2 + SS_3 + \dots + SS_{(I\ \mathrm{or}\ K)}$
$SS_{tot} = \sum_{a=1}^{(I\ \mathrm{or}\ K)} SS_a$
From largest to smallest!
ANOVA
$X = TP' + E$ (data = model + residual)
$SS_{tot} = SS_{mod} + SS_{res}$
$R^2 = SS_{mod} / SS_{tot} = 1 - SS_{res} / SS_{tot}$
$R^2$: coefficient of determination (often given in %)
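The sum-of-squares bookkeeping can be sketched in a few lines. The example assumes NumPy, random mean-centered data, and a model with A = 2 components; none of these come from the slides. It prints a small table with the component number, SS, SS% and cumulative SS%, then checks the two equivalent expressions for the coefficient of determination.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 4))            # synthetic data, for illustration only
X = X - X.mean(axis=0)                  # mean-center

U, s, Vt = np.linalg.svd(X, full_matrices=False)
ss_a = s ** 2                           # SS per component, largest to smallest
ss_tot = np.sum(X ** 2)

A = 2                                   # number of components kept in the model
T = U[:, :A] * s[:A]
E = X - T @ Vt[:A]
ss_mod = ss_a[:A].sum()
ss_res = np.sum(E ** 2)
r2 = ss_mod / ss_tot

print("Comp#      SS    SS%  SS%cum")
cum = 0.0
for a, ss in enumerate(ss_a, start=1):
    pct = 100 * ss / ss_tot
    cum += pct
    print(f"{a:5d} {ss:7.2f} {pct:6.1f} {cum:7.1f}")
print(f"R2 = {100 * r2:.1f}%")
```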
Examples
Wines: $R^2$ = SSmod = 78%, SSres = 22% (2 comp.)
Apricots 1: $R^2$ = SSmod = 99.93%, SSres = 0.07% (2 comp.)
Apricots 2: $R^2$ = SSmod = 100%, SSres ≈ 0.0% (3 comp.)
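[Figure: spectra, absorbance vs. wavelength (nm), outliers removed]
[Figure: singular values vs. component number, no outliers. Explained: component 1 = 81%, 2 = 16%, 3 = 3%]
[Figure: score plot, Score 2 (16%) vs. Score 3 (3%); groups: whole fruit, no kernel, thin slice]
[Figure: loadings 2 and 3 vs. wavelength (nm)]
[Figure: loading plot, Loading 2 vs. Loading 3]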
More nomenclature
Score = latent variable
Loading vector = eigenvector
Effective rank = pseudorank = model dimensionality = number of components
$SS_a$ = eigenvalue
Singular value = $SS_a^{1/2}$
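These equivalences can be verified directly. The sketch below assumes NumPy and random mean-centered data. It compares the SVD route with an eigendecomposition of X'X: the eigenvalues equal the sums of squares of the score vectors, the singular values are their square roots, and the eigenvectors match the loading vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))                 # synthetic data, for illustration only
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                    # all score vectors as columns
ss = np.sum(T ** 2, axis=0)                  # SS_a = t_a' t_a

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder largest first

print("SS_a:           ", ss)
print("eigenvalues:    ", eigvals)
print("singular values:", s)
```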
An analysis sequence
1. Scale, mean-center data
2. Calculate a few components
3. Check scores, loadings
4. Find outliers, groupings, explain
5. Remove outliers
An analysis sequence (continued)
6. Scale, mean-center data
7. Calculate enough components
8. Try to determine pseudorank
9. Check score plots
10. Check loading plots
11. Check residuals
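The sequence can be sketched in code, with the caveat that several choices here are assumptions rather than part of the slides: the data are synthetic with one planted outlier, the helper names `autoscale` and `pca` are made up for this example, the outlier rule (distance of score 1 from its median, in units of the median absolute deviation) is hypothetical, and the residual standard deviation uses K - A degrees of freedom as one possible convention. In practice steps 3, 4, 9 and 10 are done by looking at plots.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X, n_comp):
    """Return scores T, loadings P and residuals E for n_comp components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_comp] * s[:n_comp]
    P = Vt[:n_comp].T
    return T, P, X - T @ P.T

rng = np.random.default_rng(4)
X_raw = rng.normal(size=(30, 6))   # synthetic data, not from the slides
X_raw[0] += 25.0                   # plant one gross outlier in object 0

# Steps 1-4: scale, fit a few components, inspect the scores for outliers
T, P, E = pca(autoscale(X_raw), n_comp=2)
dev = np.abs(T[:, 0] - np.median(T[:, 0]))
outliers = np.where(dev > 10 * np.median(dev))[0]   # hypothetical cut-off

# Steps 5-7: remove the outliers, rescale, refit
X_clean = np.delete(X_raw, outliers, axis=0)
A = 2
T2, P2, E2 = pca(autoscale(X_clean), n_comp=A)

# Step 11: residual standard deviation per object (K - A degrees of freedom assumed)
K = X_clean.shape[1]
res_sd = np.sqrt(np.sum(E2 ** 2, axis=1) / (K - A))

print("flagged objects:", outliers.tolist())
print("largest residual stdev:", res_sd.max())
```

Rescaling after the outliers are removed matters because the outliers inflate the column means and standard deviations used in the first pass.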
[Figure: residual standard deviation, Wines]