1
Feature extraction
1. Introduction
2. T-test
3. Signal-to-Noise Ratio (SNR)
4. Linear Correlation Coefficient (LCC)
5. Principal component analysis (PCA)
6. Linear discriminant analysis (LDA)
2
Feature reduction vs. feature selection
Feature reduction
– All original features are used.
– The transformed features are linear combinations of the original features.
Feature selection
– Only a subset of the original features is used.
3
Why feature reduction?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
– Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases.
The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
4
Why feature reduction?
– Visualization: projection of high-dimensional data onto 2D or 3D.
– Data compression: efficient storage and retrieval.
– Noise removal: positive effect on query accuracy.
5
Applications of feature reduction
– Face recognition
– Handwritten digit recognition
– Text mining
– Image retrieval
– Microarray data analysis
– Protein classification
6
High-dimensional data in bioinformatics
– Gene expression
– Gene expression pattern images
7
High-dimensional data in computer vision
– Face images
– Handwritten digits
8
Principal component analysis (PCA)
Motivating example: given 53 blood and urine measurements (features) from 65 people, how can we visualize the measurements?
9
PCA - Data Visualization
Matrix format (65 x 53): rows are instances, columns are features. It is difficult to see the correlations between the features...
10
PCA - Data Visualization
Spectral format (53 pictures, one for each feature): still difficult to see the correlations between the features...
11
PCA - Data Visualization
Bivariate and trivariate plots: how can we visualize the other variables? It is difficult to see in 4- or higher-dimensional spaces...
12
Data Visualization
Is there a representation better than the coordinate axes?
Is it really necessary to show all 53 dimensions?
– What if there are strong correlations between the features?
How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
A solution: Principal Component Analysis.
13
Principal Component Analysis
PCA: an orthogonal projection of the data onto a lower-dimensional linear space that...
– maximizes the variance of the projected data (purple line in the figure), and
– minimizes the mean squared distance between each data point and its projection (sum of the blue lines).
14
Principal Components Analysis
Idea: given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
– E.g., find the best planar approximation to 3D data.
– E.g., find the best 12-D approximation to 10^4-D data.
In particular, choose the projection that minimizes the squared error in reconstructing the original data.
15
The Principal Components
Consider vectors originating from the center of mass of the data.
Principal component #1 points in the direction of the largest variance.
Each subsequent principal component...
– is orthogonal to the previous ones, and
– points in the direction of the largest variance of the residual subspace.
16
2D Gaussian dataset
17
1st PCA axis
18
2nd PCA axis
19
PCA algorithm I (sequential)
Given the centered data {x_1, …, x_m}, compute the principal vectors one at a time.
1st PCA vector: we maximize the variance of the projection of x,
    w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2
k-th PCA vector: we maximize the variance of the projection in the residual subspace,
    w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \Big( w^T \big( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \big) \Big)^2
PCA reconstruction (shown in the figure for w_1 and w_2):
    x' = w_1 (w_1^T x) + w_2 (w_2^T x)
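As a rough illustration of this sequential view (a sketch, not the deck's own code), the snippet below finds each principal vector by power iteration on the covariance of the residual data and then deflates; the function name, iteration count, and data layout are illustrative assumptions.

    import numpy as np

    def sequential_pca(X, k, n_iter=200):
        # X: (d, m) centered data, columns are the data points x_1..x_m.
        # Each principal vector maximizes the variance of the projection in the
        # residual subspace left after removing the previous components.
        d, m = X.shape
        R = X.copy()                          # residual data
        W = np.zeros((d, k))
        for j in range(k):
            C = R @ R.T / m                   # covariance of the residual
            w = np.random.randn(d)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):           # power iteration -> top eigenvector
                w = C @ w
                w /= np.linalg.norm(w)
            W[:, j] = w
            R = R - np.outer(w, w) @ R        # deflation: remove that direction
        return W

    # Reconstruction from the first two components: x' = w1 (w1^T x) + w2 (w2^T x)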
20
PCA algorithm II (sample covariance matrix)
Given data {x_1, …, x_m}, compute the covariance matrix
    \Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T,   where   \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i
The PCA basis vectors are the eigenvectors of \Sigma.
Larger eigenvalues correspond to more important eigenvectors.
21
PCA algorithm II
PCA(X, k): return the top k eigenvalues/eigenvectors
    % X = N x m data matrix; each data point x_i is a column vector, i = 1..m
    X <- subtract the mean \bar{x} from each column vector x_i in X
    \Sigma <- X X^T          % covariance matrix of X
    {\lambda_i, u_i}, i = 1..N <- eigenvalues/eigenvectors of \Sigma, with \lambda_1 >= \lambda_2 >= … >= \lambda_N
    Return {\lambda_i, u_i}, i = 1..k    % top k principal components
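A minimal NumPy sketch of algorithm II (an illustration, not Neucom's implementation), assuming the columns of X are the data points:

    import numpy as np

    def pca_cov(X, k):
        # X: (N, m) data matrix, each column x_i is one data point.
        Xc = X - X.mean(axis=1, keepdims=True)    # subtract the mean from each column
        Sigma = Xc @ Xc.T / X.shape[1]            # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
        return eigvals[order[:k]], eigvecs[:, order[:k]]   # top k principal components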
22
PCA algorithm III (SVD of the data matrix)
Singular Value Decomposition of the centered (features x samples) data matrix X:
    X = U S V^T
Only the leading singular values are significant; the remaining ones correspond to noise.
23
PCA algorithm III
Columns of U
– the principal vectors, {u^(1), …, u^(k)}
– orthogonal and of unit norm, so U^T U = I
– the data can be reconstructed using linear combinations of {u^(1), …, u^(k)}
Matrix S
– diagonal
– shows the importance of each eigenvector
Columns of V^T
– the coefficients for reconstructing the samples
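The same decomposition in NumPy, as a hedged sketch with the features x samples layout assumed above:

    import numpy as np

    def pca_svd(X, k):
        # X: (features, samples) data matrix.
        Xc = X - X.mean(axis=1, keepdims=True)             # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U S V^T
        return U[:, :k], S[:k], Vt[:k, :]                  # principal vectors, importances, coefficients

    # Rank-k reconstruction of the centered data:
    #   X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]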
24
PCA – Input and Output
Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k
Output: a k x k transfer matrix W, whose i-th row is (a_i1, a_i2, ..., a_ik), together with the k x 1 vector of eigenvalues (\lambda_1, \lambda_2, ..., \lambda_k)
25
PCA – usage (k-to-k transfer)
From the k original variables x_1, x_2, ..., x_k, produce k new variables y_1, y_2, ..., y_k:
    y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
    y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
    ...
    y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k
such that:
– the y_i's are uncorrelated (orthogonal)
– y_1 explains as much as possible of the original variance in the data set
– y_2 explains as much as possible of the remaining variance, and so on
The y_i's are the principal components.
26
PCA - usage
Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data
Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data
27
PCA – usage (dimension reduction)
Use the first t rows of the matrix W. From the k original variables x_1, x_2, ..., x_k, produce t new variables y_1, y_2, ..., y_t:
    y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
    y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
    ...
    y_t = a_t1 x_1 + a_t2 x_2 + ... + a_tk x_k
such that:
– the y_i's are uncorrelated (orthogonal)
– y_1 explains as much as possible of the original variance in the data set
– y_2 explains as much as possible of the remaining variance, and so on
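A small sketch of this reduction step (illustrative names; it assumes W stores the principal directions as rows, sorted by decreasing eigenvalue):

    import numpy as np

    def pca_reduce(data, W, t):
        # data: (n, k) matrix, rows are observations; W: (k, k) transfer matrix.
        Wt = W[:t, :]          # use the first t rows of W
        return data @ Wt.T     # (n, k) x (k, t) -> (n, t) transferred data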
28
PCA Summary so far
PCA rotates the multivariate dataset into a new configuration that is easier to interpret.
Purposes:
– simplify the data
– look at relationships between variables
– look at patterns across units
29
Example of PCA on Iris Data
30
PCA application to image compression
31
Original Image
Divide the original 372 x 492 image into patches: each patch is an instance containing a 12 x 12 grid of pixels. View each patch as a 144-D vector.
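One possible way to build those 144-D vectors in NumPy (a sketch that assumes a grayscale image and trims any border that does not fill a whole patch):

    import numpy as np

    def image_to_patches(img, patch=12):
        # Cut the image into non-overlapping patch x patch blocks and flatten each
        # block into one row vector (144-D for 12 x 12 patches).
        h, w = img.shape
        h, w = h - h % patch, w - w % patch
        blocks = (img[:h, :w]
                  .reshape(h // patch, patch, w // patch, patch)
                  .swapaxes(1, 2)               # (rows, cols, patch, patch)
                  .reshape(-1, patch * patch))  # one vector per patch
        return blocks

    # For a 372 x 492 image this gives 31 * 41 = 1271 patches of dimension 144.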
32
L2 error vs. PCA dimension
33
PCA compression: 144D -> 60D
34
PCA compression: 144D -> 16D
35
16 most important eigenvectors
36
PCA compression: 144D -> 6D
37
6 most important eigenvectors
38
PCA compression: 144D -> 3D
39
3 most important eigenvectors
40
PCA compression: 144D -> 1D
41
60 most important eigenvectors
They look like the discrete cosine bases used in JPEG!
42
2D Discrete Cosine Basis http://en.wikipedia.org/wiki/Discrete_cosine_transform
43
PCA for image compression
Reconstructions using p = 1, 2, 4, 8, 16, 32, 64, and 100 components, compared with the original image.
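A hedged sketch of how reconstructions like these can be produced: project the 144-D patch vectors onto the top p principal directions and map them back (function and variable names are illustrative):

    import numpy as np

    def compress_patches(patches, p):
        # patches: (num_patches, 144) matrix of flattened image patches.
        mean = patches.mean(axis=0)
        Xc = patches - mean                               # center the patches
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        top = Vt[:p, :]                                   # top-p principal directions
        codes = Xc @ top.T                                # (num_patches, p) compressed codes
        return codes @ top + mean                         # reconstructed patches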
44
In-Class Practice
Data: Iris.txt (Neucom format) and your own data (if applicable)
Methods: PCA, LDA, SNR
Software: Neucom v0.919
– Steps: Visualization -> PCA
– Steps: Visualization -> LDA
– Steps: Data Analysis -> SNR
45
Linear Discriminant Analysis
Fisher linear discriminant analysis (FLDA) was first applied by M. Barnard at the suggestion of R. A. Fisher (1936). It serves two purposes:
Dimension reduction
– Finds linear combinations of the features X = X_1, ..., X_d with large ratios of between-groups to within-groups sums of squares (the discriminant variables).
Classification
– Predicts the class of an observation X as the class whose mean vector is closest to X in terms of the discriminant variables.
46
Is PCA a good criterion for classification?
The data variation alone determines the projection direction.
What's missing? Class information.
47
What is a good projection? Similarly, what is a good criterion?
– One that separates the different classes.
(Figure: under one projection the two classes overlap; under the other the two classes are separated.)
48
What class information may be useful?
Between-class distance
– Distance between the centroids of different classes.
49
What class information may be useful?
Between-class distance
– Distance between the centroids of different classes.
Within-class distance
– Accumulated distance of each instance to the centroid of its class.
50
Linear discriminant analysis
Linear discriminant analysis (LDA) finds the most discriminant projection by maximizing the between-class distance and minimizing the within-class distance.
52
Notation
Data matrix: the training data, drawn from k different classes labeled 1, 2, …, k.
53
Notation
Between-class scatter matrix S_b and within-class scatter matrix S_w.
Properties:
– Between-class distance = trace of the between-class scatter (i.e., the sum of the diagonal elements of the scatter matrix).
– Within-class distance = trace of the within-class scatter.
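In standard notation (the usual definitions, with \mu_i the centroid of class i, n_i its size, C_i its set of instances, and \mu the overall centroid), the two scatter matrices can be written as

    S_b = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T
    S_w = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T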
54
Discriminant criterion
The discriminant criterion in mathematical formulation uses
– the between-class scatter matrix S_b, and
– the within-class scatter matrix S_w.
The optimal transformation is given by solving a generalized eigenvalue problem.
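In the usual formulation the criterion is J(W) = trace((W^T S_w W)^{-1} (W^T S_b W)), maximized by collecting the top eigenvectors of the generalized problem S_b w = \lambda S_w w. A minimal NumPy/SciPy sketch (illustrative names, with a small ridge added so that S_w stays invertible; not Neucom's implementation):

    import numpy as np
    from scipy.linalg import eigh

    def lda_fit(X, y, t):
        # X: (n, d) data, y: (n,) integer class labels, t: target dimension.
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sb = np.zeros((d, d))
        Sw = np.zeros((d, d))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
            Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        # Generalized symmetric eigenproblem S_b w = lambda S_w w (eigh: ascending order).
        eigvals, eigvecs = eigh(Sb, Sw + 1e-9 * np.eye(d))
        return eigvecs[:, ::-1][:, :t]    # the t most discriminant directions; project with X @ W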
55
Graphical view of classification
Project the n x d training data and a 1 x d test data point into the (K-1)-dimensional discriminant space, then assign the test point by finding the nearest neighbor or the nearest centroid.
56
LDA Computation
57
LDA – Input and Output
Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k
Output: a k x k transfer matrix W, whose i-th row is (a_i1, a_i2, ..., a_ik), together with the k x 1 vector of eigenvalues (\lambda_1, \lambda_2, ..., \lambda_k)
58
LDA - usage
Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data
Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data
59
LDA Principle: maximize between-class distances and minimize within-class distances.
60
An Example: Fisher's Iris Data

Table 1: Linear Discriminant Analysis (APER = 0.0200)

Actual Group   Number of Observations   Predicted: Setosa   Versicolor   Virginica
Setosa         50                       50                  0            0
Versicolor     50                       0                   48           2
Virginica      50                       0                   1            49
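For comparison, a hedged scikit-learn snippet that computes a confusion matrix of this kind on the iris data (resubstitution error, so the counts should be close to, but are not guaranteed to match, Table 1):

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix

    X, y = load_iris(return_X_y=True)
    pred = LinearDiscriminantAnalysis().fit(X, y).predict(X)
    print(confusion_matrix(y, pred))       # rows: actual class, columns: predicted class
    print("APER:", (pred != y).mean())     # apparent error rate (resubstitution)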
61
LDA on Iris Data
62
In-Class Practice
Data: Iris.txt (Neucom format) and your own data (if applicable)
Methods: PCA, LDA, SNR
Software: Neucom v0.919
– Steps: Visualization -> PCA
– Steps: Visualization -> LDA
– Steps: Data Analysis -> SNR