Feature extraction 1. Introduction 2. T-test 3. Signal-to-Noise Ratio (SNR) 4. Linear Correlation Coefficient (LCC) 5. Principal component analysis (PCA) 6. Linear discriminant analysis (LDA)
Feature reduction vs. feature selection Feature reduction – All original features are used – The transformed features are linear combinations of the original features. Feature selection – Only a subset of the original features is used.
Why feature reduction? Most machine learning and data mining techniques may not be effective for high-dimensional data – Curse of Dimensionality – Query accuracy and efficiency degrade rapidly as the dimension increases. The intrinsic dimension may be small. – For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction? Visualization: projection of high-dimensional data onto 2D or 3D. Data compression: efficient storage and retrieval. Noise removal: positive effect on query accuracy.
Applications of feature reduction Face recognition Handwritten digit recognition Text mining Image retrieval Microarray data analysis Protein classification
High-dimensional data in bioinformatics Gene expression Gene expression pattern images
High-dimensional data in computer vision Face images Handwritten digits
Principal component analysis (PCA) Motivation Example: given 53 blood and urine measurements (features) from 65 people, how can we visualize the measurements?
PCA - Data Visualization Matrix format (65×53 table: 65 instances × 53 features). Difficult to see the correlations between the features...
PCA - Data Visualization Spectral format (53 pictures, one for each feature) Difficult to see the correlations between the features...
PCA - Data Visualization Bi-variate and tri-variate plots: how can we visualize the other variables? Difficult to see in 4- or higher-dimensional spaces...
Data Visualization Is there a representation better than the coordinate axes? Is it really necessary to show all 53 dimensions? – What if there are strong correlations between the features? How could we find the smallest subspace of the 53-D space that keeps the most information about the original data? A solution: Principal Component Analysis
Principal Component Analysis Orthogonal projection of the data onto a lower-dimensional linear space that... – maximizes the variance of the projected data (purple line) – minimizes the mean squared distance between each data point and its projection (sum of the blue lines)
Principal Components Analysis Idea: – Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible E.g., find the best planar approximation to 3-D data E.g., find the best 12-D approximation to higher-dimensional data – In particular, choose the projection that minimizes the squared error in reconstructing the original data
The Principal Components Vectors originating from the center of mass: – principal component #1 points in the direction of the largest variance – each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace
2D Gaussian dataset
1st PCA axis
2nd PCA axis
PCA algorithm I (sequential)
Given the centered data {x_1, …, x_m}, compute the principal vectors one at a time:
1st PCA vector – maximize the variance of the projection of x:
w_1 = argmax_{||w||=1} (1/m) Σ_{i=1..m} (w^T x_i)^2
k-th PCA vector – maximize the variance of the projection in the residual subspace:
w_k = argmax_{||w||=1} (1/m) Σ_{i=1..m} [w^T (x_i − Σ_{j=1..k−1} w_j w_j^T x_i)]^2
PCA reconstruction with the first two components: x' = w_1(w_1^T x) + w_2(w_2^T x)
PCA algorithm II (sample covariance matrix) Given data {x_1, …, x_m}, compute the covariance matrix Σ = (1/m) Σ_{i=1..m} (x_i − x̄)(x_i − x̄)^T, where x̄ = (1/m) Σ_{i=1..m} x_i. The PCA basis vectors are the eigenvectors of Σ; the larger the eigenvalue, the more important the eigenvector.
PCA algorithm II
PCA(X, k): returns the top k eigenvalues/eigenvectors
% X = N × m data matrix, each data point x_i = column vector, i = 1..m
1. X ← subtract the mean x̄ from each column vector x_i in X
2. Σ ← X X^T   % covariance matrix of X
3. {λ_i, u_i}_{i=1..N} = eigenvalues/eigenvectors of Σ, ordered so that λ_1 ≥ λ_2 ≥ … ≥ λ_N
4. Return {λ_i, u_i}_{i=1..k}   % top k principal components
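For concreteness, here is a minimal NumPy sketch of the pseudocode above; it is an illustrative implementation, not Neucom's, and the synthetic 53×65 matrix merely stands in for the blood/urine example.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance matrix.

    X : (N, m) data matrix, each column x_i is one data point.
    k : number of principal components to keep.
    Returns the top-k eigenvalues and eigenvectors (columns of the returned matrix).
    """
    # Center the data: subtract the mean from each column vector x_i
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar

    # Sample covariance matrix (N x N)
    m = X.shape[1]
    Sigma = (Xc @ Xc.T) / m

    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]            # sort descending: lambda_1 >= ... >= lambda_N
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Top-k principal components
    return eigvals[:k], eigvecs[:, :k]

# Example: project 53-D measurements of 65 people onto 2 principal axes
X = np.random.randn(53, 65)                      # synthetic stand-in for the blood/urine data
lam, U = pca(X, k=2)
Y = U.T @ (X - X.mean(axis=1, keepdims=True))    # 2 x 65 projected coordinates
```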
PCA algorithm III (SVD of the data matrix) Singular Value Decomposition of the centered data matrix X (features × samples): X = U S V^T. The leading singular values/vectors capture the significant structure of the data; the trailing ones mostly capture noise.
PCA algorithm III Columns of U – the principal vectors {u^(1), …, u^(k)} – orthonormal, so U^T U = I – the data can be reconstructed using linear combinations of {u^(1), …, u^(k)} Matrix S – diagonal – shows the importance of each eigenvector Columns of V^T – the coefficients for reconstructing the samples
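The SVD route can be sketched in a few lines of NumPy; the rank-k reconstruction below illustrates the split into a "significant" part and a residual treated as noise (k = 5 is an arbitrary choice for illustration).

```python
import numpy as np

# Centered data matrix X: features x samples (as on the slide)
X = np.random.randn(53, 65)
Xc = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U S V^T
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Columns of U = principal vectors (orthonormal: U^T U = I)
# s = singular values (importance of each direction)
# Rows of V^T = coefficients for reconstructing each sample
k = 5
X_hat = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction: the "significant" part
err = np.linalg.norm(Xc - X_hat)       # what remains is treated as noise
```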
PCA – Input and Output Input: n × k data matrix, with k original variables x_1, x_2, ..., x_k. Output: a k × k transfer matrix W (row i holds the coefficients a_i1, a_i2, ..., a_ik) and a k × 1 vector of eigenvalues λ_1, λ_2, ..., λ_k.
PCA – usage (k to k transfer)
From the k original variables x_1, x_2, ..., x_k, produce k new variables y_1, y_2, ..., y_k:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k
such that: the y_i's are uncorrelated (orthogonal); y_1 explains as much as possible of the original variance in the data set; y_2 explains as much as possible of the remaining variance; etc. The y_i's are the Principal Components.
PCA - usage Direct transfer: [n × k data] × [k × k W^T] = [n × k transferred data]. Dimension reduction: [n × k data] × [k × t W'^T] = [n × t transferred data].
PCA – usage (dimension reduction) Use the first t rows of the matrix W.
From the k original variables x_1, x_2, ..., x_k, produce t new variables y_1, y_2, ..., y_t:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_t = a_t1 x_1 + a_t2 x_2 + ... + a_tk x_k
such that: the y_i's are uncorrelated (orthogonal); y_1 explains as much as possible of the original variance in the data set; y_2 explains as much as possible of the remaining variance; etc.
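A short sketch of the two usages, assuming W is the k × k transfer matrix whose rows are the eigenvectors sorted by decreasing eigenvalue (as in the Input/Output slide); all variable names are illustrative.

```python
import numpy as np

n, k, t = 150, 4, 2
X = np.random.randn(n, k)                 # n x k data, rows = instances

# Build the k x k transfer matrix W: row i = eigenvector a_i of the covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order].T                   # k x k, rows sorted by decreasing eigenvalue

# Direct transfer (k -> k): all components kept
Y_full = Xc @ W.T                         # n x k transferred data

# Dimension reduction (k -> t): use only the first t rows of W
W_t = W[:t, :]                            # t x k
Y_reduced = Xc @ W_t.T                    # n x t transferred data
```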
PCA summary so far Rotates the multivariate dataset into a new configuration that is easier to interpret Purposes – simplify the data – look at relationships between variables – look at patterns of units
Example of PCA on Iris Data
PCA application on image compression
Original image Divide the original 372×492 image into patches: each patch is an instance that contains 12×12 pixels on a grid. View each patch as a 144-D vector.
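A hedged sketch of the patch-based compression experiment: the 12×12 patches and the 144-D view follow the slide, while the helper name, the synthetic image, and the choice of d = 16 components are illustrative assumptions.

```python
import numpy as np

def compress_patches(img, patch=12, d=16):
    """Compress an image with PCA on non-overlapping patch x patch blocks,
    keeping d principal components per 144-D patch vector."""
    H, W = img.shape
    H, W = H - H % patch, W - W % patch          # crop so the patch grid fits exactly
    blocks = (img[:H, :W]
              .reshape(H // patch, patch, W // patch, patch)
              .transpose(0, 2, 1, 3)
              .reshape(-1, patch * patch))       # one row per patch: a 144-D vector

    mean = blocks.mean(axis=0)
    Xc = blocks - mean
    # Principal directions from the SVD of the centered patch matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:d]                                   # d x 144 projection matrix

    codes = Xc @ P.T                             # only d numbers stored per patch
    recon = codes @ P + mean                     # reconstruct 144-D patches
    return (recon.reshape(H // patch, W // patch, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(H, W))

img = np.random.rand(372, 492)                   # stand-in for the original image
approx = compress_patches(img, d=16)             # e.g. 144-D -> 16-D per patch
l2_error = np.linalg.norm(img[:approx.shape[0], :approx.shape[1]] - approx)
```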
L2 reconstruction error vs. PCA dimension
PCA compression: 144D → 60D
PCA compression: 144D → 16D
16 most important eigenvectors
PCA compression: 144D → 6D
6 most important eigenvectors
PCA compression: 144D → 3D
3 most important eigenvectors
PCA compression: 144D → 1D
60 most important eigenvectors: they look like the discrete cosine bases of JPEG!
2D Discrete Cosine Basis
PCA for image compression Reconstructions using p = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
On Class Practice Data – Iris.txt (Neucom format) and your own data (if applicable) Method: PCA, LDA, SNR Software – Neucom v0.919 – Steps: Visualization->PCA – Steps: Visualization->LDA – Steps: Data Analysis->SNR
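If Neucom is not available, the same three practice steps can be approximated with scikit-learn on the Iris data; note that the SNR ranking below uses the common signal-to-noise criterion |μ₁ − μ₂| / (σ₁ + σ₂) between two classes, which is an assumption about what Neucom's SNR step computes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Visualization -> PCA: project the 4-D Iris data onto 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Visualization -> LDA: project onto 2 discriminant axes (at most k-1 = 2 for 3 classes)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
# X_pca and X_lda can now be scatter-plotted, colored by class

# Data Analysis -> SNR (assumed criterion): |mu_1 - mu_2| / (sigma_1 + sigma_2),
# computed here for two of the three classes as an illustration
c0, c1 = X[y == 0], X[y == 1]
snr = np.abs(c0.mean(0) - c1.mean(0)) / (c0.std(0) + c1.std(0))
print("SNR ranking of the 4 features:", np.argsort(snr)[::-1])
```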
Linear Discriminant Analysis First applied by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) performs: Dimension reduction – finds linear combinations of the features X = X_1, ..., X_d with large ratios of between-group to within-group sums of squares (the discriminant variables); Classification – predicts the class of an observation X as the class whose mean vector is closest to X in terms of the discriminant variables.
Is PCA a good criterion for classification? Data variation determines the projection direction What’s missing? – Class information
What is a good projection? Similarly, what is a good criterion? – Separating different classes (in one projection the two classes overlap; in the other they are separated)
What class information may be useful? Between-class distance – Distance between the centroids of different classes
What class information may be useful? Within-class distance – Accumulated distance of the instances to the centroid of their class Between-class distance – Distance between the centroids of different classes
Linear discriminant analysis Linear discriminant analysis (LDA) finds the most discriminant projection by maximizing the between-class distance and minimizing the within-class distance
Notations Data matrix: the training data come from k different classes, 1, 2, …, k.
Notations
Between-class scatter: S_b = Σ_{j=1..k} n_j (μ_j − μ)(μ_j − μ)^T, where μ_j is the centroid of class j, μ is the global centroid, and n_j is the number of instances in class j.
Within-class scatter: S_w = Σ_{j=1..k} Σ_{x_i in class j} (x_i − μ_j)(x_i − μ_j)^T.
Properties: between-class distance = trace of the between-class scatter (i.e., the sum of its diagonal elements); within-class distance = trace of the within-class scatter.
Discriminant criterion In mathematical formulation, LDA seeks the transformation W that maximizes the ratio of the between-class scatter matrix S_b to the within-class scatter matrix S_w. The optimal transformation is given by solving the generalized eigenvalue problem S_b w = λ S_w w (i.e., the columns of W are the leading eigenvectors of S_w^{-1} S_b).
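A minimal NumPy/SciPy sketch of this computation, under the standard scatter-matrix definitions given above; the function names and the nearest-centroid classifier (cf. the next slide) are illustrative, not Neucom's API.

```python
import numpy as np
from scipy.linalg import eigh

def lda_fit(X, y, t=2):
    """Compute the LDA transform from the between/within-class scatter matrices.

    X : (n, d) data matrix, y : (n,) class labels. Returns the d x t projection W.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))                       # between-class scatter
    Sw = np.zeros((d, d))                       # within-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += Xc.shape[0] * diff @ diff.T
        Sw += (Xc - mu_c).T @ (Xc - mu_c)

    # Generalized eigenvalue problem: Sb w = lambda Sw w
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]           # largest between- to within-class ratio first
    return eigvecs[:, order[:t]]                # d x t discriminant directions

def lda_predict(X_train, y_train, X_test, W):
    """Nearest-centroid classification in the discriminant space."""
    Z_train, Z_test = X_train @ W, X_test @ W
    labels = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]
```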
Graphical view of classification The n × d training data and a 1 × d test data point are both projected into the (K−1)-dimensional discriminant space; the test point is then classified by finding the nearest neighbor or nearest centroid there.
LDA Computation
LDA – Input and Output Input: n × k data matrix, with k original variables x_1, x_2, ..., x_k. Output: a k × k transfer matrix W (row i holds the coefficients a_i1, a_i2, ..., a_ik) and a k × 1 vector of eigenvalues λ_1, λ_2, ..., λ_k.
LDA - usage Direct transfer: [n × k data] × [k × k W^T] = [n × k transferred data]. Dimension reduction: [n × k data] × [k × t W'^T] = [n × t transferred data].
LDA Principle: maximizing between-class distances and minimizing within-class distances
An Example: Fisher's Iris Data Table 1: Linear Discriminant Analysis (APER = ) – confusion matrix of actual group (Setosa, Versicolor, Virginica, with the number of observations per group) vs. predicted group (Setosa, Versicolor, Virginica).
LDA on Iris Data
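As a rough counterpart to Table 1, scikit-learn's LDA can be fit to the Iris data and its resubstitution confusion matrix and apparent error rate (APER) printed; the exact counts depend on the implementation, so treat this as a sketch rather than a reproduction of the table.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

pred = lda.predict(X)                          # resubstitution predictions
cm = confusion_matrix(y, pred)                 # rows: actual group, columns: predicted group
aper = (pred != y).mean()                      # apparent error rate
print(cm)
print(f"APER = {aper:.3f}")
```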
On Class Practice Data – Iris.txt (Neucom format) and your own data (if applicable) Method: PCA, LDA, SNR Software – Neucom v0.919 – Steps: Visualization->PCA – Steps: Visualization->LDA – Steps: Data Analysis->SNR