Dimensionality reduction

Dimensionality reduction Usman Roshan

Supervised dim reduction: Linear discriminant analysis. Fisher linear discriminant: maximize the ratio of the difference between the class means to the sum of the class variances in the projected space
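The criterion itself is not in the transcript; the standard Fisher criterion, which this description matches (an assumption about what the slide displayed), is

J(w) = \frac{(w^T m_1 - w^T m_2)^2}{s_1^2 + s_2^2} = \frac{w^T S_b w}{w^T S_w w},

where m_1, m_2 are the class means and s_1^2, s_2^2 are the variances of the projected classes.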

Linear discriminant analysis Fisher linear discriminant: the difference in the means of the projected data gives us the between-class scatter matrix, and the variance of the projected data gives us the within-class scatter matrix

Linear discriminant analysis Fisher linear discriminant solution: take the derivative with respect to w and set it to 0. This gives us w = c S_w^{-1} (m_1 - m_2) for some scalar c

Scatter matrices S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, and S_t = S_b + S_w is the total scatter matrix
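For reference, the usual two-class definitions (assumed here, since the slide's formulas are not in the transcript) are

S_b = (m_1 - m_2)(m_1 - m_2)^T, \qquad S_w = \sum_{k=1}^{2} \sum_{x \in C_k} (x - m_k)(x - m_k)^T.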

Fisher linear discriminant General solution is given by the eigenvectors of S_w^{-1} S_b
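A minimal numpy sketch of the two-class Fisher direction described above; the function name and the row-per-sample layout are illustrative choices, not the course's code.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction w proportional to Sw^{-1} (m1 - m2).

    X1, X2: arrays of shape (n_samples, n_features), one class each.
    Illustrative sketch only."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of centered samples
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solve Sw w = m1 - m2 rather than forming the inverse explicitly
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Projected 1-D values are then X @ w for any data matrix X.
```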

Fisher linear discriminant Computational problems can arise when computing the inverse, for example when S_w is singular in high dimensions. A different approach is the maximum margin criterion

Maximum margin criterion (MMC) Define the separation between two classes in terms of the distance between their means and their variances, where S(C) represents the variance of a class. In MMC we use the trace of the scatter matrix to represent the variance
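The defining formula is missing from the transcript; in the standard maximum margin criterion formulation (assumed here), the separation between classes C_1 and C_2 is

d(C_1, C_2) = d(m_1, m_2) - S(C_1) - S(C_2),

the distance between the class means minus the variances of the two classes.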

Maximum margin criterion (MMC) The scatter matrix of a class and its trace (the sum of the diagonal entries) are given below; as an example, consider a class containing just two vectors x and y
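The displayed formulas are not in the transcript; with the usual (unnormalized) definitions, assumed here, the scatter matrix of a class C with mean m is

S(C) = \sum_{x \in C} (x - m)(x - m)^T, \qquad tr(S(C)) = \sum_{x \in C} \|x - m\|^2,

and for the two-vector example C = {x, y} with m = (x + y)/2,

S(C) = \tfrac{1}{2}(x - y)(x - y)^T, \qquad tr(S(C)) = \tfrac{1}{2}\|x - y\|^2.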

Maximum margin criterion (MMC) Plugging the trace in for S(C), the criterion can be rewritten in terms of the within-class scatter matrix S_w and the between-class scatter matrix S_b
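The resulting objective is missing from the transcript; in the usual MMC derivation (assumed here), substituting the traces and summing over class pairs gives

J = tr(S_b - S_w) = tr(S_b) - tr(S_w),

so maximizing J rewards large between-class scatter and penalizes large within-class scatter.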

Weighted maximum margin criterion (WMMC) Adding a weight parameter gives the objective below. In WMMC dimensionality reduction we want to find the w that maximizes this quantity in the projected space; the solution w is given by the eigenvector with the largest eigenvalue
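A plausible form of the weighted objective, since the exact formula is not in the transcript: with a weight \lambda > 0,

J(w) = w^T (S_b - \lambda S_w) w,

and the maximizer over unit-norm w is the eigenvector of S_b - \lambda S_w with the largest eigenvalue.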

How to use WMMC for classification? Reduce dimensionality to fewer features, then run any classification algorithm, such as nearest means or nearest neighbor

K-nearest neighbor Assign a given datapoint the majority label of its k closest points. The parameter k is chosen by cross-validation. Simple, yet it can achieve high classification accuracy
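A minimal scikit-learn sketch of this "project, then classify" recipe with a cross-validated k; WMMC is not a library routine, so PCA stands in for the projection step purely to illustrate the workflow, and the dataset is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # placeholder dataset

# Reduce dimensionality, then classify with k-nearest neighbor.
# PCA stands in for WMMC here, since WMMC is not in scikit-learn.
pipe = Pipeline([("proj", PCA(n_components=2)),
                 ("knn", KNeighborsClassifier())])

# Cross-validate k as described on the slide.
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```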

Weighted maximum variance (WMV) Find w that maximizes the weighted variance
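The objective is not in the transcript; a form consistent with the later Laplacian rewrite and the reduction to PCA (an assumption) is

W(w) = \frac{1}{2} \sum_{i,j} C_{ij} (w^T x_i - w^T x_j)^2,

a weighted sum of squared differences between projected points, maximized over unit-norm w.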

Weighted maximum variance (WMV) Reduces to PCA if Cij = 1/n
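A quick check of the reduction (not from the slide): with C_{ij} = 1/n,

\frac{1}{2n} \sum_{i,j} (w^T x_i - w^T x_j)^2 = \sum_i (w^T x_i - w^T \bar{x})^2,

which is n times the variance of the projected data, so the maximizer is the top principal component.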

MMC via WMV Let y_i be the class labels and n_k the size of class k. Let G_ij = 1/n for all i and j, and let L_ij = 1/n_k if i and j are in the same class (and 0 otherwise). Then MMC is given by the expression below
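The stated expression is missing from the transcript; one consistent reconstruction (an assumption, using the Laplacian identities sketched on the next slides, with L here meaning the block weight matrix just defined rather than a graph Laplacian) is

tr(S_b - S_w) = tr\big( X [\, (I - G) - 2(I - L) \,] X^T \big),

i.e. the total scatter induced by the uniform weights G minus twice the within-class scatter induced by the block weights L.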

MMC via WMV (proof sketch)
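A hedged sketch of the argument, since the slide's content is not in the transcript: for any symmetric weight matrix C whose rows sum to 1 (true of both G and L above),

\frac{1}{2} \sum_{i,j} C_{ij} \| x_i - x_j \|^2 = tr\big( X (I - C) X^T \big).

Taking C = G gives tr(S_t), the total scatter; taking C = L gives tr(S_w); and since S_b = S_t - S_w, it follows that tr(S_b - S_w) = tr(S_t) - 2 tr(S_w), matching the expression on the previous slide.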

Graph Laplacians We can rewrite WMV with Laplacian matrices. Recall the WMV objective. Define the Laplacian L = D - C, where D_ii = Σ_j C_ij (more on the next slide). Then WMV is given by the quadratic form below, where X = [x_1, x_2, …, x_n] contains each x_i as a column; w is given by the largest eigenvector of X L X^T
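Written out explicitly (the displayed formula is missing, but the surrounding text determines it):

WMV(w) = w^T X L X^T w, \qquad L = D - C, \quad D_{ii} = \sum_j C_{ij},

maximized over unit-norm w, so w is the top eigenvector of X L X^T.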

Graph Laplacians Widely used in spectral clustering (see the tutorial on the course website). The weights C_ij may be obtained via an epsilon-neighborhood graph, a k-nearest-neighbor graph, or a fully connected graph. Allows semi-supervised analysis (where test data is available but not its labels)
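A small numpy/scikit-learn sketch of building a k-nearest-neighbor weight matrix and its Laplacian and extracting the top WMV direction; the symmetrization step and the row-per-sample layout (hence X^T L X instead of X L X^T) are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def wmv_direction(X, k=10):
    """Top WMV direction for X of shape (n_samples, n_features).

    Illustrative sketch: 0/1 k-NN weights, symmetrized by assumption."""
    C = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    C = np.maximum(C, C.T)             # make the weight matrix symmetric
    L = np.diag(C.sum(axis=1)) - C     # graph Laplacian D - C
    M = X.T @ L @ X                    # samples are rows here, so X^T L X
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return vecs[:, -1]                 # eigenvector of the largest eigenvalue
```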

Back to WMV – a two-parameter approach Recall the WMV objective. Collapse C_ij into two parameters: C_ij = α < 0 if i and j are in the same class, and C_ij = β > 0 if i and j are in different classes. We call this 2-parameter WMV (2PWMV)

Experimental results To evaluate dimensionality reduction for classification, we first extract features and then apply 1-nearest neighbor (1NN) in cross-validation. We use 20 datasets from the UCI machine learning archive and compare 2PWMV+1NN, WMMC+1NN, PCA+1NN, and plain 1NN. Parameters for 2PWMV+1NN and WMMC+1NN are obtained by cross-validation
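A sketch of this comparison protocol; 2PWMV and WMMC are not library routines, so only the PCA+1NN and plain 1NN baselines are shown, and the dataset is a stand-in for one of the UCI sets.

```python
from sklearn.datasets import load_breast_cancer   # stand-in for a UCI dataset
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

plain  = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pca1nn = make_pipeline(StandardScaler(), PCA(n_components=10),
                       KNeighborsClassifier(n_neighbors=1))

for name, model in [("1NN", plain), ("PCA+1NN", pca1nn)]:
    err = 1 - cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {err:.1%} cross-validated error")
```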

Datasets

Results

Results

Results Average error over the 20 datasets: 2PWMV+1NN 9.5% (winner on 9 out of 20), WMMC+1NN 10% (winner on 7 out of 20), PCA+1NN 13.6%, 1NN 13.8%. Parametric dimensionality reduction does help

High dimensional data

High dimensional data

Results Average error on high-dimensional data: 2PWMV+1NN 15.2%, PCA+1NN 17.8%, 1NN 22%