Dimensionality reduction


Dimensionality reduction Usman Roshan

Dimensionality reduction
What is dimensionality reduction? Compressing high-dimensional data into a smaller number of dimensions.
How do we achieve this?
PCA (unsupervised): find a unit vector w (length 1) such that the variance of the data projected onto w is maximized.
Binary classification (supervised): find a vector w that maximizes the ratio (Fisher) or the difference (MMC) of the means and variances of the two classes.

PCA
Find the projection that maximizes the variance; equivalently, PCA minimizes the reconstruction error.
How many dimensions should we reduce the data to? Consider the difference between consecutive eigenvalues; if it exceeds a threshold we stop there.
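As a concrete illustration, here is a minimal NumPy sketch of this procedure; the toy data and the gap threshold are made up for illustration:

```python
import numpy as np

def pca_by_eigengap(X, gap_threshold=1.0):
    """Project data onto top principal components, chosen by the first
    gap between consecutive eigenvalues that exceeds a threshold."""
    Xc = X - X.mean(axis=0)                          # center the data
    cov = np.cov(Xc, rowvar=False)                   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    gaps = eigvals[:-1] - eigvals[1:]                # differences between consecutive eigenvalues
    # keep components up to the first gap that exceeds the threshold
    k = int(np.argmax(gaps > gap_threshold)) + 1 if np.any(gaps > gap_threshold) else len(eigvals)
    W = eigvecs[:, :k]                               # top-k eigenvectors (unit length)
    return Xc @ W, eigvals                           # projected data and the eigenvalue spectrum

X = np.random.randn(100, 10)                         # toy data: 100 samples, 10 features
Z, spectrum = pca_by_eigengap(X)
```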

Feature extraction vs selection
PCA and the other dimensionality reduction algorithms to follow allow feature extraction and selection. In extraction we consider a linear combination of all features; in selection we pick specific features from the data.
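A tiny NumPy illustration of the difference (the data matrix and weight vector here are invented):

```python
import numpy as np

X = np.random.randn(100, 5)                  # toy data: 100 samples, 5 features
w = np.array([0.4, 0.1, 0.3, 0.1, 0.1])      # hypothetical combination weights

extracted = X @ w         # feature extraction: a linear combination of all features
selected = X[:, [0, 2]]   # feature selection: keep only specific original features
```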

Kernel PCA
Main idea of the kernel version:
X X^T w = λ w
X^T X X^T w = λ X^T w
(X^T X)(X^T w) = λ (X^T w)
So X^T w, the projection of the data onto the eigenvector w, is itself an eigenvector of X^T X.
This gives another way to compute the projections, using space quadratic in the number of data points, but it yields only the projections (not the eigenvector w itself).

Kernel PCA
In feature space the mean is given by m_Φ = (1/n) Σ_{i=1..n} Φ(x_i).
Suppose for a moment that the data is mean-subtracted in feature space, in other words the mean is 0. Then the scatter matrix in feature space is given by Σ_Φ = Σ_{i=1..n} Φ(x_i) Φ(x_i)^T (up to a 1/n factor, which does not change the eigenvectors).

Kernel PCA
The eigenvectors of Σ_Φ give us the PCA solution. But what if we only know the kernel matrix?
First we center the kernel matrix so that the mean in feature space is 0:
K_c = K − (1/n) j j^T K − (1/n) K j j^T + (1/n²) j j^T K j j^T
where j is a vector of 1's.
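A small NumPy sketch of this centering step (the toy data and the linear kernel are just for illustration):

```python
import numpy as np

X = np.random.randn(50, 4)                   # toy data: rows are samples
K = X @ X.T                                  # linear kernel matrix, n x n

n = K.shape[0]
J = np.ones((n, n)) / n                      # (1/n) * j j^T
K_centered = K - J @ K - K @ J + J @ K @ J   # kernel matrix centered in feature space
```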

Kernel PCA
Recall from earlier:
X X^T w = λ w
X^T X X^T w = λ X^T w
(X^T X)(X^T w) = λ (X^T w)
X^T w is the projection of the data onto the eigenvector w and is also an eigenvector of X^T X, and X^T X is the linear kernel matrix.
The same idea applies to kernel PCA: the projected solution is given by the eigenvectors of the centered kernel matrix.
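Putting the pieces together, a minimal sketch of kernel PCA from a kernel matrix, here with a degree-2 polynomial kernel as in the plots that follow (the toy data, the kernel constant, and the sqrt-eigenvalue scaling of the projections are my assumptions, not taken from the slides):

```python
import numpy as np

def kernel_pca(K, k=2):
    """Project data onto the top-k kernel PCA directions,
    given an (uncentered) n x n kernel matrix K."""
    n = K.shape[0]
    J = np.ones((n, n)) / n
    Kc = K - J @ K - K @ J + J @ K @ J          # center the kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)       # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]         # indices of the top-k eigenvalues
    alphas, lambdas = eigvecs[:, idx], eigvals[idx]
    # projections of the training points onto the top-k components
    return alphas * np.sqrt(np.maximum(lambdas, 0))

X = np.random.randn(60, 5)                      # toy data: rows are samples
K = (1 + X @ X.T) ** 2                          # polynomial degree 2 kernel
Z = kernel_pca(K, k=2)                          # 2-D embedding of the 60 points
```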

Polynomial degree 2 kernel: Breast cancer dataset (plot)

Polynomial degree 2 kernel: Climate dataset (plot)

Polynomial degree 2 kernel: Qsar dataset (plot)

Polynomial degree 2 kernel: Ionosphere dataset (plot)

Supervised dim reduction: Linear discriminant analysis
Fisher linear discriminant: maximize the ratio of the difference between the projected class means to the sum of the projected class variances.

Linear discriminant analysis
Fisher linear discriminant: the difference in the means of the projected data gives us the between-class scatter matrix, and the variance of the projected data gives us the within-class scatter matrix.
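In the usual textbook notation (projected means m1, m2 and projected scatters s1², s2²; stated here for reference, not taken verbatim from the slides):

```latex
J(w) \;=\; \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}
      \;=\; \frac{w^\top S_b\, w}{w^\top S_w\, w},
\qquad
S_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^\top,
\quad
S_w = \sum_{c=1}^{2} \sum_{x \in C_c} (x - \mathbf{m}_c)(x - \mathbf{m}_c)^\top
```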

Linear discriminant analysis
Fisher linear discriminant solution: take the derivative of J(w) with respect to w and set it to 0. This gives us w = c S_w^{-1} (m_1 − m_2) for some constant c.
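A minimal NumPy sketch of the two-class solution (the toy classes are made up):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction for two classes
    (rows of X1 and X2 are samples)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)        # proportional to S_w^{-1}(m1 - m2)
    return w / np.linalg.norm(w)

X1 = np.random.randn(40, 3) + np.array([2.0, 0.0, 0.0])   # toy class 1
X2 = np.random.randn(40, 3)                                # toy class 2
w = fisher_direction(X1, X2)
proj1, proj2 = X1 @ w, X2 @ w               # 1-D projections of each class
```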

Scatter matrices
S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, and S_t = S_b + S_w is the total scatter matrix.

Fisher linear discriminant
The general (multi-class) solution is given by the top eigenvectors of S_w^{-1} S_b.

Fisher linear discriminant
Problems can arise when computing the inverse (S_w may be singular, for example when there are more features than samples). A different approach is the maximum margin criterion.

Maximum margin criterion (MMC)
Define the separation between two classes as
d(C_1, C_2) = d(m_1, m_2) − S(C_1) − S(C_2)
where d(m_1, m_2) is the distance between the class means and S(C) represents the variance of the class. In MMC we use the trace of the class scatter matrix to represent the variance. The scatter matrix of class i is
S_i = (1/n_i) Σ_{x ∈ C_i} (x − m_i)(x − m_i)^T

Maximum margin criterion (MMC)
The scatter matrix is S_i = (1/n_i) Σ_{x ∈ C_i} (x − m_i)(x − m_i)^T, and its trace (the sum of the diagonal entries) is tr(S_i) = (1/n_i) Σ_{x ∈ C_i} ||x − m_i||², i.e. the average squared distance of the class's points from its mean.
As an example, consider a class containing just two vectors x and y: the mean is (x + y)/2, the scatter matrix is (1/4)(x − y)(x − y)^T, and its trace is ||x − y||²/4, so the trace indeed measures how spread out the class is.

Maximum margin criterion (MMC)
Plugging the trace in for S(C) we get
J = d(m_1, m_2) − tr(S_1) − tr(S_2)
The above can be rewritten as
J = tr(S_b) − tr(S_w) = tr(S_b − S_w)
where S_w is the within-class scatter matrix and S_b is the between-class scatter matrix.
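For a projection matrix W this is the criterion usually written as follows (standard form from the MMC literature, stated here for reference):

```latex
J(W) = \operatorname{tr}\!\left( W^\top (S_b - S_w)\, W \right),
\qquad W^\top W = I
```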

Weighted maximum margin criterion (WMMC)
Adding a weight parameter γ to the within-class term gives us tr(S_b − γ S_w). In WMMC dimensionality reduction we want to find the w that maximizes this quantity in the projected space, i.e. w^T (S_b − γ S_w) w. The solution w is given by the eigenvector of S_b − γ S_w with the largest eigenvalue.
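A minimal NumPy sketch, assuming the weighted criterion S_b − γ·S_w described above (the weight value and the toy data are made up):

```python
import numpy as np

def wmmc_directions(X, y, k=2, gamma=1.0):
    """Top-k projection directions maximizing w^T (S_b - gamma * S_w) w
    (assumed weighted MMC form; rows of X are samples, y holds class labels)."""
    d = X.shape[1]
    m = X.mean(axis=0)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_b += len(Xc) * np.outer(mc - m, mc - m)     # between-class scatter
        S_w += (Xc - mc).T @ (Xc - mc)                # within-class scatter
    eigvals, eigvecs = np.linalg.eigh(S_b - gamma * S_w)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # largest eigenvectors

X = np.random.randn(100, 6)                           # toy data
y = np.array([0] * 50 + [1] * 50)                     # toy binary labels
W = wmmc_directions(X, y, k=2)
Z = X @ W                                             # reduced-dimension data
```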

How to use WMMC for classification?
Reduce the dimensionality to fewer features, then run any classification algorithm, such as nearest means or nearest neighbor, on the reduced data.

K-nearest neighbor
Classify a given datapoint with the majority label of its k closest training points. The parameter k is chosen by cross-validation. Simple, yet it can obtain high classification accuracy.
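A minimal NumPy sketch of k-nearest neighbor with Euclidean distance (the toy data and labels are invented):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x with the majority label of its k nearest training points
    under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.random.randn(100, 2)                 # toy training data
y_train = (X_train[:, 0] > 0).astype(int)         # toy labels
print(knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5))
```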

Weighted maximum variance (WMV) Find w that maximizes the weighted variance

Weighted maximum variance (WMV)
WMV reduces to PCA if C_ij = 1/n.
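A sketch under the assumption that the weighted variance has the pairwise form Σ_ij C_ij (w^T x_i − w^T x_j)², which is consistent with the statement that C_ij = 1/n recovers PCA; the exact normalization on the slide may differ:

```python
import numpy as np

def wmv_direction(X, C):
    """Direction w maximizing sum_ij C_ij (w^T x_i - w^T x_j)^2,
    assuming that pairwise form of the weighted variance."""
    n, d = X.shape
    S = np.zeros((d, d))
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            S += C[i, j] * np.outer(diff, diff)   # weighted pairwise scatter
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs[:, -1]                         # eigenvector with largest eigenvalue

X = np.random.randn(30, 4)                        # toy data
C = np.full((30, 30), 1.0 / 30)                   # C_ij = 1/n recovers the PCA direction
w = wmv_direction(X, C)
```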