1
Feature extraction
1. Introduction
2. T-test
3. Signal-to-Noise Ratio (SNR)
4. Linear Correlation Coefficient (LCC)
5. Principal component analysis (PCA)
6. Linear discriminant analysis (LDA)
2
Feature reduction vs. feature selection
Feature reduction
– All original features are used.
– The transformed features are linear combinations of the original features.
Feature selection
– Only a subset of the original features is used.
3
Why feature reduction?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
– Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases.
The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
4
Why feature reduction?
– Visualization: projection of high-dimensional data onto 2D or 3D.
– Data compression: efficient storage and retrieval.
– Noise removal: positive effect on query accuracy.
5
Applications of feature reduction
– Face recognition
– Handwritten digit recognition
– Text mining
– Image retrieval
– Microarray data analysis
– Protein classification
6
High-dimensional data in bioinformatics
– Gene expression
– Gene expression pattern images
7
High-dimensional data in computer vision
– Face images
– Handwritten digits
8
Principal component analysis (PCA)
Motivating example: given 53 blood and urine measurements (features) from 65 people, how can we visualize the measurements?
9
PCA - Data Visualization
Matrix format (65 x 53): rows are instances, columns are features. It is difficult to see the correlations between the features...
10
PCA - Data Visualization
Spectral format (53 pictures, one for each feature): still difficult to see the correlations between the features...
11
PCA - Data Visualization
Bivariate and trivariate plots: how can we visualize the other variables? It is difficult to see in 4- or higher-dimensional spaces...
12
Data Visualization
Is there a representation better than the coordinate axes?
Is it really necessary to show all 53 dimensions?
– What if there are strong correlations between the features?
How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
A solution: Principal Component Analysis.
13
Principal Component Analysis
PCA: an orthogonal projection of the data onto a lower-dimensional linear space that...
– maximizes the variance of the projected data (purple line in the figure), and
– minimizes the mean squared distance between each data point and its projection (sum of the blue lines).
14
Principal Components Analysis
Idea: given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
– E.g., find the best planar approximation to 3D data.
– E.g., find the best 12-D approximation to 10^4-D data.
In particular, choose the projection that minimizes the squared error in reconstructing the original data.
15
The Principal Components
Consider vectors originating from the center of mass of the data.
Principal component #1 points in the direction of the largest variance.
Each subsequent principal component...
– is orthogonal to the previous ones, and
– points in the direction of the largest variance of the residual subspace.
16
2D Gaussian dataset
17
1st PCA axis
18
2nd PCA axis
19
PCA algorithm I (sequential)
Given the centered data {x_1, …, x_m}, compute the principal vectors one at a time.
1st PCA vector: we maximize the variance of the projection of x,
    w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2
k-th PCA vector: we maximize the variance of the projection in the residual subspace,
    w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \Big( w^T \big( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \big) \Big)^2
PCA reconstruction (shown in the figure for w_1 and w_2):
    x' = w_1 (w_1^T x) + w_2 (w_2^T x)
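As a rough illustration of this sequential view (a sketch, not the deck's own code), the snippet below finds each principal vector by power iteration on the covariance of the residual data and then deflates; the function name, iteration count, and data layout are illustrative assumptions.

    import numpy as np

    def sequential_pca(X, k, n_iter=200):
        # X: (d, m) centered data, columns are the data points x_1..x_m.
        # Each principal vector maximizes the variance of the projection in the
        # residual subspace left after removing the previous components.
        d, m = X.shape
        R = X.copy()                          # residual data
        W = np.zeros((d, k))
        for j in range(k):
            C = R @ R.T / m                   # covariance of the residual
            w = np.random.randn(d)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):           # power iteration -> top eigenvector
                w = C @ w
                w /= np.linalg.norm(w)
            W[:, j] = w
            R = R - np.outer(w, w) @ R        # deflation: remove that direction
        return W

    # Reconstruction from the first two components: x' = w1 (w1^T x) + w2 (w2^T x)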
20
PCA algorithm II (sample covariance matrix)
Given data {x_1, …, x_m}, compute the covariance matrix
    \Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T,   where   \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i
The PCA basis vectors are the eigenvectors of \Sigma.
Larger eigenvalues correspond to more important eigenvectors.
21
PCA algorithm II
PCA(X, k): return the top k eigenvalues/eigenvectors
    % X = N x m data matrix; each data point x_i is a column vector, i = 1..m
    X <- subtract the mean \bar{x} from each column vector x_i in X
    \Sigma <- X X^T          % covariance matrix of X
    {\lambda_i, u_i}, i = 1..N <- eigenvalues/eigenvectors of \Sigma, with \lambda_1 >= \lambda_2 >= … >= \lambda_N
    Return {\lambda_i, u_i}, i = 1..k    % top k principal components
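A minimal NumPy sketch of algorithm II (an illustration, not Neucom's implementation), assuming the columns of X are the data points:

    import numpy as np

    def pca_cov(X, k):
        # X: (N, m) data matrix, each column x_i is one data point.
        Xc = X - X.mean(axis=1, keepdims=True)    # subtract the mean from each column
        Sigma = Xc @ Xc.T / X.shape[1]            # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
        return eigvals[order[:k]], eigvecs[:, order[:k]]   # top k principal components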
22
PCA algorithm III (SVD of the data matrix)
Singular Value Decomposition of the centered (features x samples) data matrix X:
    X = U S V^T
Only the leading singular values are significant; the remaining ones correspond to noise.
23
PCA algorithm III
Columns of U
– the principal vectors, {u^(1), …, u^(k)}
– orthogonal and of unit norm, so U^T U = I
– the data can be reconstructed using linear combinations of {u^(1), …, u^(k)}
Matrix S
– diagonal
– shows the importance of each eigenvector
Columns of V^T
– the coefficients for reconstructing the samples
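The same decomposition in NumPy, as a hedged sketch with the features x samples layout assumed above:

    import numpy as np

    def pca_svd(X, k):
        # X: (features, samples) data matrix.
        Xc = X - X.mean(axis=1, keepdims=True)             # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U S V^T
        return U[:, :k], S[:k], Vt[:k, :]                  # principal vectors, importances, coefficients

    # Rank-k reconstruction of the centered data:
    #   X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]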
24
PCA – Input and Output
Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k
Output: a k x k transfer matrix W, whose i-th row is (a_i1, a_i2, ..., a_ik), together with the k x 1 vector of eigenvalues (\lambda_1, \lambda_2, ..., \lambda_k)
25
PCA – usage (k-to-k transfer)
From the k original variables x_1, x_2, ..., x_k, produce k new variables y_1, y_2, ..., y_k:
    y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
    y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
    ...
    y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k
such that:
– the y_i's are uncorrelated (orthogonal)
– y_1 explains as much as possible of the original variance in the data set
– y_2 explains as much as possible of the remaining variance, and so on
The y_i's are the principal components.
26
PCA - usage
Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data
Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data
27
PCA – usage (dimension reduction)
Use the first t rows of the matrix W. From the k original variables x_1, x_2, ..., x_k, produce t new variables y_1, y_2, ..., y_t:
    y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
    y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
    ...
    y_t = a_t1 x_1 + a_t2 x_2 + ... + a_tk x_k
such that:
– the y_i's are uncorrelated (orthogonal)
– y_1 explains as much as possible of the original variance in the data set
– y_2 explains as much as possible of the remaining variance, and so on
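A small sketch of this reduction step (illustrative names; it assumes W stores the principal directions as rows, sorted by decreasing eigenvalue):

    import numpy as np

    def pca_reduce(data, W, t):
        # data: (n, k) matrix, rows are observations; W: (k, k) transfer matrix.
        Wt = W[:t, :]          # use the first t rows of W
        return data @ Wt.T     # (n, k) x (k, t) -> (n, t) transferred data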
28
PCA Summary so far
PCA rotates the multivariate dataset into a new configuration that is easier to interpret.
Purposes:
– simplify the data
– look at relationships between variables
– look at patterns across units
29
Example of PCA on Iris Data
30
PCA application to image compression
31
Original Image
Divide the original 372 x 492 image into patches: each patch is an instance containing a 12 x 12 grid of pixels. View each patch as a 144-D vector.
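One possible way to build those 144-D vectors in NumPy (a sketch that assumes a grayscale image and trims any border that does not fill a whole patch):

    import numpy as np

    def image_to_patches(img, patch=12):
        # Cut the image into non-overlapping patch x patch blocks and flatten each
        # block into one row vector (144-D for 12 x 12 patches).
        h, w = img.shape
        h, w = h - h % patch, w - w % patch
        blocks = (img[:h, :w]
                  .reshape(h // patch, patch, w // patch, patch)
                  .swapaxes(1, 2)               # (rows, cols, patch, patch)
                  .reshape(-1, patch * patch))  # one vector per patch
        return blocks

    # For a 372 x 492 image this gives 31 * 41 = 1271 patches of dimension 144.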
32
L2 error vs. PCA dimension
33
PCA compression: 144D -> 60D
34
PCA compression: 144D -> 16D
35
16 most important eigenvectors
36
PCA compression: 144D -> 6D
37
6 most important eigenvectors
38
PCA compression: 144D -> 3D
39
3 most important eigenvectors
40
PCA compression: 144D -> 1D
41
60 most important eigenvectors
They look like the discrete cosine bases used in JPEG!
42
2D Discrete Cosine Basis http://en.wikipedia.org/wiki/Discrete_cosine_transform
43
PCA for image compression
Reconstructions using p = 1, 2, 4, 8, 16, 32, 64, and 100 components, compared with the original image.
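A hedged sketch of how reconstructions like these can be produced: project the 144-D patch vectors onto the top p principal directions and map them back (function and variable names are illustrative):

    import numpy as np

    def compress_patches(patches, p):
        # patches: (num_patches, 144) matrix of flattened image patches.
        mean = patches.mean(axis=0)
        Xc = patches - mean                               # center the patches
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        top = Vt[:p, :]                                   # top-p principal directions
        codes = Xc @ top.T                                # (num_patches, p) compressed codes
        return codes @ top + mean                         # reconstructed patches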
44
In-Class Practice
Data: Iris.txt (Neucom format) and your own data (if applicable)
Methods: PCA, LDA, SNR
Software: Neucom v0.919
– Steps: Visualization -> PCA
– Steps: Visualization -> LDA
– Steps: Data Analysis -> SNR
45
Linear Discriminant Analysis
Fisher linear discriminant analysis (FLDA) was first applied by M. Barnard at the suggestion of R. A. Fisher (1936). It serves two purposes:
Dimension reduction
– Finds linear combinations of the features X = X_1, ..., X_d with large ratios of between-groups to within-groups sums of squares (the discriminant variables).
Classification
– Predicts the class of an observation X as the class whose mean vector is closest to X in terms of the discriminant variables.
46
Is PCA a good criterion for classification?
The data variation alone determines the projection direction.
What's missing? Class information.
47
What is a good projection? Similarly, what is a good criterion?
– One that separates the different classes.
(Figure: under one projection the two classes overlap; under the other the two classes are separated.)
48
What class information may be useful?
Between-class distance
– Distance between the centroids of different classes.
49
What class information may be useful?
Between-class distance
– Distance between the centroids of different classes.
Within-class distance
– Accumulated distance of each instance to the centroid of its class.
50
Linear discriminant analysis
Linear discriminant analysis (LDA) finds the most discriminant projection by maximizing the between-class distance and minimizing the within-class distance.
52
Notation
Data matrix: the training data, drawn from k different classes labeled 1, 2, …, k.
53
Notation
Between-class scatter matrix S_b and within-class scatter matrix S_w.
Properties:
– Between-class distance = trace of the between-class scatter (i.e., the sum of the diagonal elements of the scatter matrix).
– Within-class distance = trace of the within-class scatter.
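In standard notation (the usual definitions, with \mu_i the centroid of class i, n_i its size, C_i its set of instances, and \mu the overall centroid), the two scatter matrices can be written as

    S_b = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T
    S_w = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T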
54
Discriminant criterion
The discriminant criterion in mathematical formulation uses
– the between-class scatter matrix S_b, and
– the within-class scatter matrix S_w.
The optimal transformation is given by solving a generalized eigenvalue problem.
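In the usual formulation the criterion is J(W) = trace((W^T S_w W)^{-1} (W^T S_b W)), maximized by collecting the top eigenvectors of the generalized problem S_b w = \lambda S_w w. A minimal NumPy/SciPy sketch (illustrative names, with a small ridge added so that S_w stays invertible; not Neucom's implementation):

    import numpy as np
    from scipy.linalg import eigh

    def lda_fit(X, y, t):
        # X: (n, d) data, y: (n,) integer class labels, t: target dimension.
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sb = np.zeros((d, d))
        Sw = np.zeros((d, d))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
            Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        # Generalized symmetric eigenproblem S_b w = lambda S_w w (eigh: ascending order).
        eigvals, eigvecs = eigh(Sb, Sw + 1e-9 * np.eye(d))
        return eigvecs[:, ::-1][:, :t]    # the t most discriminant directions; project with X @ W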
55
Graphical view of classification
Project the n x d training data and a 1 x d test data point into the (K-1)-dimensional discriminant space, then assign the test point by finding the nearest neighbor or the nearest centroid.
56
LDA Computation
57
LDA – Input and Output
Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k
Output: a k x k transfer matrix W, whose i-th row is (a_i1, a_i2, ..., a_ik), together with the k x 1 vector of eigenvalues (\lambda_1, \lambda_2, ..., \lambda_k)
58
LDA - usage
Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data
Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data
59
LDA Principle: maximize between-class distances and minimize within-class distances.
60
An Example: Fisher's Iris Data

Table 1: Linear Discriminant Analysis (APER = 0.0200)

Actual Group   Number of Observations   Predicted: Setosa   Versicolor   Virginica
Setosa         50                       50                  0            0
Versicolor     50                       0                   48           2
Virginica      50                       0                   1            49
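For comparison, a hedged scikit-learn snippet that computes a confusion matrix of this kind on the iris data (resubstitution error, so the counts should be close to, but are not guaranteed to match, Table 1):

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix

    X, y = load_iris(return_X_y=True)
    pred = LinearDiscriminantAnalysis().fit(X, y).predict(X)
    print(confusion_matrix(y, pred))       # rows: actual class, columns: predicted class
    print("APER:", (pred != y).mean())     # apparent error rate (resubstitution)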
61
LDA on Iris Data
62
In-Class Practice
Data: Iris.txt (Neucom format) and your own data (if applicable)
Methods: PCA, LDA, SNR
Software: Neucom v0.919
– Steps: Visualization -> PCA
– Steps: Visualization -> LDA
– Steps: Data Analysis -> SNR