Why Reduce Dimensionality?
Reduces time complexity: less computation.
Reduces space complexity: fewer parameters.
Simpler models are more robust on small datasets.
More interpretable; simpler explanation.
Data visualization (beyond 2 attributes, it gets complicated).
(Based on Lecture Notes for E Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press.)

Feature Selection vs Extraction
Feature selection: choose k < d important features and ignore the remaining d − k (e.g., data snooping, genetic algorithms).
Feature extraction: project the original d attributes onto a new k < d dimensional feature space, e.g., principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA), auto-associative ANN.

Principal Components Analysis (PCA)
Assume that the attributes in the dataset are drawn from a multivariate normal distribution, p(x) = N(μ, Σ), where x and the mean μ are d×1 vectors and Σ is a d×d matrix.
The variance generalizes to the d×d matrix Σ called the "covariance". Its diagonal elements are the variances σ² of the individual attributes; the off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.

Dividing each off-diagonal element of the covariance matrix by the product of the corresponding standard deviations, ρ_ij = σ_ij / (σ_i σ_j), gives the "correlation coefficients". Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
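As a quick illustration (added here, not part of the original slides), a minimal MATLAB sketch that computes the sample covariance and correlation matrices; the data matrix X is a synthetic stand-in with correlated columns:

X = randn(100, 4) * [1 0.8 0 0; 0 1 0 0; 0 0 1 -0.5; 0 0 0 1];   % synthetic attributes, rows = samples
C = cov(X);                  % d-by-d sample covariance matrix
s = sqrt(diag(C));           % standard deviations of the individual attributes
R = C ./ (s * s');           % correlation coefficients rho_ij = sigma_ij / (sigma_i * sigma_j)
disp(R)                      % ones on the diagonal, off-diagonals in [-1, 1]; same result as corrcoef(X)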

Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later). A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier.
2. The total variance of the features is the sum of the diagonal elements.

Diagonalization of the covariance matrix:
The transformation z = Mx that leads to a diagonal feature-space covariance has M = W^T, where the columns of W are the eigenvectors of the covariance matrix Σ.
The collection of eigenvalue equations Σ w_k = λ_k w_k can be written as Σ W = W D, where D = diag(λ_1, ..., λ_d) and W is formed by the column vectors [w_1 ... w_d].
Since Σ is symmetric, W is orthogonal: W^T = W^{-1}, so W^T Σ W = W^{-1} W D = D.
If we arrange the eigenvectors so that the eigenvalues λ_1, ..., λ_d are in decreasing order of magnitude, then z_i = w_i^T x, i = 1, ..., k < d, are the "principal components".
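A minimal MATLAB sketch (added, not from the original slides) that checks the diagonalization numerically on synthetic data:

X = randn(200, 4) * [1 0.6 0 0; 0 1 0.4 0; 0 0 1 0.2; 0 0 0 1];   % synthetic correlated attributes
C = cov(X);                  % symmetric covariance matrix (plays the role of Sigma)
[W, D] = eig(C);             % columns of W are eigenvectors, D is the diagonal matrix of eigenvalues
disp(W' * C * W)             % approximately equal to D: the feature-space covariance is diagonal
disp(norm(W' * W - eye(4)))  % near zero: W is orthogonal, so W' = inv(W)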

How many principal components?
The proportion of variance (PoV) explained by the first k principal components (λ_i sorted in descending order) is
PoV(k) = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_d).
A plot of PoV vs. k shows how many eigenvalues are required to capture a given fraction of the total variance.
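A MATLAB sketch (added, not in the original slides) of how the PoV curve might be computed and used to pick k; the 90% threshold and the synthetic X are assumptions:

X = randn(300, 5) * randn(5, 5);            % synthetic attributes with correlated columns
lambda = sort(eig(cov(X)), 'descend');      % eigenvalues in decreasing order
PoV = cumsum(lambda) / sum(lambda);         % PoV(k) for k = 1..d
k90 = find(PoV >= 0.90, 1);                 % smallest k that captures at least 90% of the variance
plot(1:numel(lambda), PoV, 'o-'); xlabel('k'); ylabel('PoV')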

Proof that if the attributes x are normally distributed with mean μ and covariance Σ, then z = w^T x is normally distributed with mean w^T μ and variance w^T Σ w:
Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2]
= E[(w^T x − w^T μ)(x^T w − μ^T w)]
= E[w^T (x − μ)(x − μ)^T w]
= w^T E[(x − μ)(x − μ)^T] w = w^T Σ w
The objective of PCA is to maximize Var(z) = w^T Σ w, subject to the constraint ||w_1||^2 = w_1^T w_1 = 1.
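A small numerical check (added, not in the original slides) that the variance of the projected feature equals w^T Σ w; the data are synthetic:

X = randn(1000, 3) * [2 1 0; 0 1 0.5; 0 0 1];   % synthetic correlated attributes
w = randn(3, 1);  w = w / norm(w);              % an arbitrary unit-length projection direction
z = X * w;                                      % projected feature z = w' * x for each sample
disp([var(z), w' * cov(X) * w])                 % the two variance estimates agree up to round-off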

Constrained optimization
Review: constrained optimization by Lagrange multipliers.
Find the stationary point of f(x_1, x_2) = 1 − x_1^2 − x_2^2 subject to the constraint g(x_1, x_2) = x_1 + x_2 = 1.

Form the Lagrangian L(x, λ) = f(x_1, x_2) + λ(g(x_1, x_2) − c):
L(x, λ) = 1 − x_1^2 − x_2^2 + λ(x_1 + x_2 − 1)

L(x, λ) = 1 − x_1^2 − x_2^2 + λ(x_1 + x_2 − 1)
Set the partial derivatives of L with respect to x_1, x_2, and λ equal to zero:
−2x_1 + λ = 0
−2x_2 + λ = 0
x_1 + x_2 − 1 = 0
Solve for x_1 and x_2.

In this case it is not necessary to find λ (sometimes called the "undetermined multiplier"). The solution is x_1* = x_2* = ½.
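A small MATLAB check (added, not in the original slides) that solves the stationarity conditions above as a linear system and confirms x_1* = x_2* = ½ (with λ = 1):

% Equations: -2*x1 + lam = 0,  -2*x2 + lam = 0,  x1 + x2 = 1
A = [-2  0  1;
      0 -2  1;
      1  1  0];
b = [0; 0; 1];
sol = A \ b;     % sol = [x1; x2; lam]
disp(sol')       % 0.5  0.5  1.0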

Application of Lagrange multipliers in PCA
Find w_1 such that w_1^T Σ w_1 is maximum subject to the constraint w_1^T w_1 = 1.
Maximize L = w_1^T Σ w_1 + c(w_1^T w_1 − 1).
Setting the gradient of L to zero: 2 Σ w_1 + 2c w_1 = 0, so Σ w_1 = −c w_1.
Thus w_1 is an eigenvector of the covariance matrix; let λ_1 = −c be the eigenvalue associated with w_1.

Prove that λ_1 is the variance of principal component 1:
z_1 = w_1^T x and Σ w_1 = λ_1 w_1, so
var(z_1) = w_1^T Σ w_1 = λ_1 w_1^T w_1 = λ_1.
To maximize var(z_1), choose λ_1 as the largest eigenvalue.
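A MATLAB sketch (added, not from the original slides) illustrating that the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and that no other unit direction does better; the data are synthetic:

X = randn(500, 3) * [1.5 0.7 0; 0 1 0.3; 0 0 0.6];   % synthetic correlated attributes
C = cov(X);
[W, D] = eig(C);
[lam1, i] = max(diag(D));           % largest eigenvalue and its position
w1 = W(:, i);                       % the corresponding unit eigenvector
disp([var(X * w1), lam1])           % the projected variance equals the largest eigenvalue
u = randn(3, 1);  u = u / norm(u);  % any other unit direction...
disp(var(X * u))                    % ...never exceeds lam1, since u'*C*u <= max eigenvalue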

More principal components:
If Σ has 2 distinct eigenvalues, define the 2nd principal component by maximizing Var(z_2), such that ||w_2|| = 1 and w_2 is orthogonal to w_1.
Introduce Lagrange multipliers α and β (one per constraint) and set the gradient of L with respect to w_2 to zero:
2 Σ w_2 − 2α w_2 − β w_1 = 0
With β = 0 and α = λ_2, this gives Σ w_2 = λ_2 w_2.
To maximize Var(z_2), choose λ_2 as the second largest eigenvalue.

Review
For any d×d matrix M, z = M^T x is a linear transformation of the attributes x that defines features z.
If the attributes x are normally distributed with mean μ and covariance Σ, then z is normally distributed with mean M^T μ and covariance M^T Σ M (proof on slide 8).
If M = W, a matrix whose columns are the normalized eigenvectors of Σ, then the covariance of z is diagonal with elements equal to the eigenvalues of Σ (proof on slide 6).
Arrange the eigenvalues in decreasing order of magnitude and find λ_1, ..., λ_k that account for most (e.g., 90%) of the total variance; then z_i = w_i^T x are the "principal components".

More review
MATLAB's [V,D] = eig(A) returns both the eigenvectors (columns of V) and the eigenvalues (diagonal of D), with the eigenvalues of a symmetric matrix in increasing order. Invert the order and construct W with its columns sorted by decreasing eigenvalue. Choose k that captures the desired amount of total variance.
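A minimal MATLAB sketch of these steps (added here; the data matrix X and the 90% threshold are placeholders):

X = randn(200, 6) * randn(6, 6);            % placeholder n-by-d attribute matrix
[V, D] = eig(cov(X));
[lambda, idx] = sort(diag(D), 'descend');   % invert the increasing order returned by eig
W = V(:, idx);                              % eigenvectors reordered to match
k = find(cumsum(lambda)/sum(lambda) >= 0.90, 1);   % k capturing the desired amount of variance
Z = (X - mean(X)) * W(:, 1:k);              % the first k principal components of each sample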

Example: cancer diagnostics
Metabonomics data: 94 samples, with 35 metabolites measured in each sample (d = 35); 60 control samples and 34 diseased samples.

Plot: proportion of variance vs. ranked eigenvalues; the first 3 PCs capture > 95% of the total variance.

Scatter plot of PCs 1 and 2 (samples 1-34: cancer; samples 35 and above: control); the samples from cancer patients cluster together.
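A hedged MATLAB sketch (added, not from the original slides) of how such a scatter plot could be produced; the matrix X below is a random stand-in for the real 94-by-35 metabolite data, and the row ordering (cancer first, then control) is an assumption:

X = randn(94, 35);                    % stand-in for the metabolite matrix (rows = samples)
[V, D] = eig(cov(X));
[~, idx] = sort(diag(D), 'descend');
Z = (X - mean(X)) * V(:, idx(1:2));   % project each sample onto PCs 1 and 2
plot(Z(1:34, 1), Z(1:34, 2), 'ro');  hold on
plot(Z(35:end, 1), Z(35:end, 2), 'b+')
xlabel('PC 1'); ylabel('PC 2'); legend('cancer', 'control')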

Assignment 5
Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression.
Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance?
Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?
Plot the accuracy when you successively extend the linear model by including z_1^2, z_2^2, z_3^2, z_1 z_2, z_1 z_3, and z_2 z_3.

PCA code for glass data
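The code on this slide was shown as an image and is not in the transcript; the following MATLAB sketch suggests what it might look like. The file layout (attribute columns followed by a class-label column) and the use of readmatrix are assumptions:

T = readmatrix('glassdata.csv');   % assumed layout: attribute columns, then a class-label column
X = T(:, 1:end-1);                 % attributes
y = T(:, end);                     % class labels
[V, D] = eig(cov(X));              % eigen-decomposition of the covariance matrix
[lambda, idx] = sort(diag(D), 'descend');
W = V(:, idx);                     % eigenvectors in decreasing order of eigenvalue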

Plot: eigenvalues indexed by decreasing magnitude.

Plot: PoV vs. number of principal components for the glass data.

Extend MLR with PCA features
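A hedged MATLAB sketch (added; it continues the glass-data sketch above and assumes X, y, and W are still in the workspace) of extending the multivariate-linear-regression features with the squared and cross terms named in Assignment 5:

% assumes X, y, W from the glass-data sketch are in the workspace
Z = (X - mean(X)) * W(:, 1:3);     % features z1, z2, z3 from the 3 largest eigenvalues
F = [Z, Z(:,1).^2, Z(:,2).^2, Z(:,3).^2, ...
     Z(:,1).*Z(:,2), Z(:,1).*Z(:,3), Z(:,2).*Z(:,3)];   % extended feature set
classes = unique(y);
Y = double(y == classes');         % indicator target matrix, one column per class
A = [ones(size(F,1), 1), F];       % design matrix with a bias column
B = A \ Y;                         % multivariate linear regression by least squares
[~, pred] = max(A * B, [], 2);     % predicted class = column with the largest output
acc = mean(classes(pred) == y)     % training-set classification accuracy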

Plot: accuracy as the linear model is successively extended by the squared and cross terms z_1^2, z_2^2, z_3^2, z_1 z_2, z_1 z_3, z_2 z_3.