Dimensionality reduction Usman Roshan CS 675

Supervised dim reduction: Linear discriminant analysis Fisher linear discriminant: –Maximize the ratio of the difference in class means to the sum of the class variances

Linear discriminant analysis Fisher linear discriminant: –The difference in the means of the projected data gives us the between-class scatter matrix –The variance of the projected data gives us the within-class scatter matrix
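For reference, a minimal statement of the two-class criterion in standard notation (the slide's own equations are images and do not appear in the transcript):

J(w) = \frac{w^\top S_b\, w}{w^\top S_w\, w}, \qquad S_b = (m_1 - m_2)(m_1 - m_2)^\top, \qquad S_w = \sum_{c \in \{1,2\}} \sum_{x_i \in C_c} (x_i - m_c)(x_i - m_c)^\top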

Linear discriminant analysis Fisher linear discriminant solution: –Take the derivative w.r.t. w and set it to 0 –This gives us w = c S_w^{-1}(m_1 - m_2)
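A minimal numpy sketch of this closed-form two-class solution (the function name is mine; a small ridge term is added in case S_w is singular):

```python
import numpy as np

def fisher_direction(X1, X2, ridge=1e-6):
    """Two-class Fisher direction, proportional to Sw^{-1}(m1 - m2).

    X1, X2: arrays of shape (n1, d) and (n2, d), one row per datapoint.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Small ridge so the solve succeeds even if Sw is singular.
    Sw += ridge * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```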

Scatter matrices S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, and S_t = S_b + S_w is the total scatter matrix

Fisher linear discriminant The general solution is given by the eigenvectors of S_w^{-1} S_b

Fisher linear discriminant Problems can arise when computing the inverse, for example when S_w is singular. A different approach is the maximum margin criterion.

Maximum margin criterion (MMC) Define the separation between two classes as the distance between their means minus their variances (reconstructed below). S(C) represents the variance of class C; in MMC we use the trace of the scatter matrix to represent the variance.
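The slide's equation is an image; in the standard MMC formulation (Li, Jiang and Zhang), with class priors p_i, the overall separation is

J = \frac{1}{2} \sum_{i,j} p_i\, p_j \left( d(m_i, m_j) - S(C_i) - S(C_j) \right), \qquad d(m_i, m_j) = \lVert m_i - m_j \rVert^2 .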

Maximum margin criterion (MMC) The scatter matrix and its trace (the sum of its diagonal entries) are written out below; as an example, consider a class containing just two vectors x and y.
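A reconstruction of the missing formulas, plus the two-vector example (my own working, not copied from the slide): for a class C with mean m,

S(C) = \sum_{x_i \in C} (x_i - m)(x_i - m)^\top, \qquad \operatorname{tr} S(C) = \sum_{x_i \in C} \lVert x_i - m \rVert^2 .

For C = \{x, y\} with m = (x + y)/2, the trace is \lVert x - m \rVert^2 + \lVert y - m \rVert^2 = \tfrac{1}{2}\lVert x - y \rVert^2, i.e. half the squared distance between the two vectors.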

Maximum margin criterion (MMC) Plugging the trace in for S(C), the criterion can be rewritten in terms of scatter matrices (see below), where S_w is the within-class scatter matrix and S_b is the between-class scatter matrix.
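Substituting the traces, the criterion collapses to the standard MMC form (my reconstruction of the missing equations):

J = \operatorname{tr}(S_b) - \operatorname{tr}(S_w) = \operatorname{tr}(S_b - S_w) .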

Weighted maximum margin criterion (WMMC) Adding a weight parameter gives a weighted combination of the between-class and within-class scatter. In WMMC dimensionality reduction we want to find the w that maximizes this quantity in the projected space. The solution w is given by the largest eigenvector of the weighted scatter matrix.
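A minimal numpy sketch of WMMC-style reduction. It assumes the weight gamma multiplies the within-class term, i.e. the top eigenvectors of S_b - gamma * S_w are used; the exact form on the slide is not visible in the transcript.

```python
import numpy as np

def wmmc_projection(X, y, n_components=2, gamma=1.0):
    """Project rows of X (n x d) onto the top eigenvectors of Sb - gamma * Sw."""
    n, d = X.shape
    m = X.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
    # The matrix is symmetric, so eigh applies; eigenvalues come back ascending.
    vals, vecs = np.linalg.eigh(Sb - gamma * Sw)
    W = vecs[:, ::-1][:, :n_components]            # largest eigenvectors first
    return X @ W, W
```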

How to use WMMC for classification? Reduce the data to a small number of features, then run any classification algorithm such as nearest means or nearest neighbor. Experimental results to follow.

K-nearest neighbor Classify a given datapoint with the majority label of its k closest training points. The parameter k is chosen by cross-validation. Simple, yet it can obtain high classification accuracy.
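For instance, with scikit-learn (the toy data and the candidate values of k are placeholders):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class data standing in for a real dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(1, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validate k on the training set, then score the held-out points.
knn = GridSearchCV(KNeighborsClassifier(),
                   param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
knn.fit(X_train, y_train)
print(knn.best_params_, knn.score(X_test, y_test))
```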

Weighted maximum variance (WMV) Find w that maximizes the weighted variance
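The objective on the slide is an image; a weighted variance with pairwise weights C_ij that is consistent with the PCA and Laplacian slides below is (up to a constant factor)

\max_{\lVert w \rVert = 1} \; \sum_{i,j} C_{ij} \left( w^\top x_i - w^\top x_j \right)^2 .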

PCA via WMV Reduces to PCA if C_ij = 1/n

MMC via WMV Let y_i be the class labels and let n_k be the size of class k. Let G_ij be 1/n for all i and j, and let L_ij be 1/n_k if i and j are in the same class. Then MMC can be expressed in the WMV form using G and L.

MMC via WMV (proof sketch)

Graph Laplacians We can rewrite WMV with Laplacian matrices. Recall the WMV objective above. Let L = D - C, where D_ii = Σ_j C_ij. Then WMV amounts to maximizing w^T X L X^T w, where X = [x_1, x_2, …, x_n] contains each x_i as a column, and w is given by the largest eigenvector of X L X^T.
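A short numpy sketch of this Laplacian form, assuming the weight matrix C is symmetric and already computed:

```python
import numpy as np

def wmv_direction(X, C):
    """X: d x n matrix with datapoints as columns; C: n x n symmetric weights.

    Returns the largest eigenvector of X L X^T with L = D - C.
    """
    D = np.diag(C.sum(axis=1))
    L = D - C
    M = X @ L @ X.T                    # d x d, symmetric
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return vecs[:, -1]                 # eigenvector of the largest eigenvalue
```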

Graph Laplacians Widely used in spectral clustering (see the tutorial on the course website) Weights C_ij may be obtained via –An epsilon-neighborhood graph –A k-nearest neighbor graph –A fully connected graph Allows semi-supervised analysis (where test data is available but not its labels)

Graph Laplacians We can perform clustering with the Laplacian Basic algorithm for k clusters: –Compute the first k eigenvectors v_i of the Laplacian matrix –Let V = [v_1, v_2, …, v_k] –Cluster the rows of V (using k-means) Why does this work?
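A minimal sketch of this algorithm using the unnormalized Laplacian (the course tutorial may use a normalized variant):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(C, k):
    """Cluster n points given an n x n symmetric similarity matrix C."""
    L = np.diag(C.sum(axis=1)) - C      # unnormalized Laplacian L = D - C
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    V = vecs[:, :k]                     # first k eigenvectors as columns
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```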

Graph Laplacians We can cluster data by solving the mincut problem; the balanced version is NP-hard. We can rewrite the balanced mincut problem with graph Laplacians. It is still NP-hard because the solution is allowed to take only discrete values. By relaxing the problem to allow real values we obtain spectral clustering.

Back to WMV – a two-parameter approach Recall the WMV objective above. Collapse C_ij into two parameters: –C_ij = α < 0 if i and j are in the same class –C_ij = β > 0 if i and j are in different classes We call this 2-parameter WMV
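For illustration, one way to build such a weight matrix from labels (the particular α and β values are placeholders to be tuned, e.g. by cross-validation):

```python
import numpy as np

def two_param_weights(y, alpha=-1.0, beta=1.0):
    """C[i, j] = alpha if y[i] == y[j] (same class), else beta."""
    y = np.asarray(y)
    same = y[:, None] == y[None, :]
    return np.where(same, alpha, beta)
```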

Experimental results To evaluate dimensionality reduction for classification we first extract features and then apply 1-nearest neighbor under cross-validation. 20 datasets from the UCI machine learning archive. Compare 2PWMV+1NN, WMMC+1NN, PCA+1NN, and plain 1NN. Parameters for 2PWMV+1NN and WMMC+1NN are obtained by cross-validation.

Datasets

Results

Average error: –2PWMV+1NN: 9.5% (winner in 9 out of 20) –WMMC+1NN: 10% (winner in 7 out of 20) –PCA+1NN: 13.6% –1NN: 13.8% Parametric dimensionality reduction does help

High dimensional data

Results Average error on high dimensional data: –2PWMV+1NN: 15.2% –PCA+1NN: 17.8% –1NN: 22%