CHAPTER 6: Dimensionality Reduction
Author: Christoph Eick. The material is mostly based on the Shlens PCA Tutorial and, to a lesser extent, on material in the Alpaydin book.

Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of acquiring the feature
4. Simpler models are more robust
5. Easier to interpret; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) when plotted in 2 or 3 dimensions

Feature Selection / Extraction / Construction
Feature selection: choose k < d important features and ignore the remaining d − k (subset selection algorithms).
Feature extraction: project the original dimensions x_i, i = 1,...,d, onto new k < d dimensions z_j, j = 1,...,k (principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)).
Feature construction: create new features from the old ones, f = f(...), with f usually being a non-linear function (e.g., support vector machines).
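A minimal NumPy sketch contrasting selection and extraction; the dataset, the chosen column indices, and the projection matrix below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # 100 examples, d = 4 original features

# Feature selection: keep k = 2 of the original columns, drop the rest.
selected = [0, 2]                       # hypothetical "important" features
X_sel = X[:, selected]                  # shape (100, 2)

# Feature extraction: project onto k = 2 new directions (PCA, LDA or FA
# would choose these directions in a principled way; here they are fixed
# for illustration).
W = np.array([[0.5, 1.0],
              [0.5, 1.0],
              [-1.0, 0.0],
              [0.5, 1.0]])              # d x k projection matrix
X_ext = X @ W                           # shape (100, 2); each new feature mixes old ones
print(X_sel.shape, X_ext.shape)
```

Selection keeps a subset of the original attributes unchanged, while extraction produces new attributes that are combinations of all of them.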

Key Ideas of Dimensionality Reduction
Given a dataset X, find a low-dimensional linear projection. There are two possible formulations:
- The variance in the low-dimensional space is maximized.
- The average projection cost is minimized.
Both formulations are equivalent.
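A small numerical illustration of this equivalence (random data and random unit directions; interpreting "projection cost" as the mean squared reconstruction error is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated 3-d data
Y = X - X.mean(axis=0)                                    # centered data

for _ in range(3):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)                                # random unit direction
    var_along_w = np.mean((Y @ w) ** 2)                   # variance captured by w
    recon_err = np.mean(np.sum((Y - np.outer(Y @ w, w)) ** 2, axis=1))
    # The sum equals the total variance of the data, independent of w,
    # so maximizing the captured variance minimizes the reconstruction cost.
    print(var_along_w + recon_err)
```

All three printed sums are identical, which is why the two formulations pick the same directions.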

Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected onto it, information loss is minimized.
The projection of x onto the direction of w is z = w^T x. Find w such that Var(z) is maximized:
Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2]
       = E[(w^T x − w^T μ)(w^T x − w^T μ)]
       = E[w^T (x − μ)(x − μ)^T w]
       = w^T E[(x − μ)(x − μ)^T] w
       = w^T Σ w,
where Var(x) = E[(x − μ)(x − μ)^T] = Σ.
Question: Why does PCA maximize and not minimize the variance in z?
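A quick numerical check of the identity Var(w^T x) = w^T Σ w, using random correlated data and a random unit direction (both are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated 3-d data
w = rng.normal(size=3)
w /= np.linalg.norm(w)                   # unit-length direction

z = X @ w                                # projection of every example onto w
Sigma = np.cov(X, rowvar=False)          # sample covariance matrix of x

print(np.var(z, ddof=1))                 # Var(z)
print(w @ Sigma @ w)                     # w^T Sigma w -- agrees up to rounding
```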

Clarifications
Assume the dataset x is d-dimensional with n examples (a d × n matrix) and we want to reduce it to a k-dimensional dataset z. W^T is a k × d matrix whose rows are the selected eigenvectors, and z = W^T x is a k × n matrix (you take scalar products of the examples in x with the rows of W^T, obtaining a k-dimensional dataset).
Remarks: W contains the k eigenvectors of the covariance matrix Σ of x with the largest eigenvalues, Σ w_i = λ_i w_i. k is usually chosen based on the variance captured, i.e., the size of the first k eigenvalues.
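A minimal sketch of this projection in NumPy; the data, and the sizes d = 4, n = 200, k = 2, are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 4, 200, 2                                     # arbitrary sizes
X = rng.normal(size=(d, n)) * np.array([[3.0], [2.0], [0.5], [0.1]])  # d x n data

Sigma = np.cov(X)                        # d x d covariance (rows = variables)
eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: Sigma is symmetric
order = np.argsort(eigvals)[::-1]        # descending eigenvalues
W = eigvecs[:, order[:k]]                # d x k matrix of the top-k eigenvectors

Z = W.T @ X                              # z = W^T x, a k x n dataset
print(Z.shape)                           # (2, 200)
```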

Example
A 4-dimensional dataset x with 5 examples (a 4 × 5 matrix) is reduced with a 2 × 4 matrix W^T to the 2 × 5 dataset z = W^T x. Here W^T encodes the projection (z1, z2) := (0.5*a1 + 0.5*a2 − a3 + 0.5*a4, a1 + a2 + a4), so each example (a1, a2, a3, a4) is mapped to two new features.
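A sketch of this example in NumPy; the five example vectors below are hypothetical, since the slide's actual numbers appear only in a figure, but W^T matches the stated projection:

```python
import numpy as np

# Hypothetical 4 x 5 dataset: each column is one example (a1, a2, a3, a4).
X = np.array([[1.0, 2.0, 0.0, 4.0, 1.0],
              [3.0, 1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0, 5.0],
              [2.0, 2.0, 4.0, 2.0, 1.0]])

# W^T encodes z1 = 0.5*a1 + 0.5*a2 - a3 + 0.5*a4 and z2 = a1 + a2 + a4.
WT = np.array([[0.5, 0.5, -1.0, 0.5],
               [1.0, 1.0,  0.0, 1.0]])

Z = WT @ X                               # 2 x 5 reduced dataset
print(Z)
```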

Shlens' Tutorial on PCA
PCA is among the most valuable results of applied linear algebra (another is PageRank). "The goal of PCA is to compute the most meaningful basis to re-express a noisy dataset. The hope is that the new basis will filter out noise and reveal hidden structure." The goal of PCA is deciphering "garbled" data, where garbling refers to rotation, redundancy, and noise. PCA is a non-parametric method; there is no way to incorporate preferences and other choices.

Computing Principal Components as Eigenvectors of the Covariance Matrix
1. Normalize x by subtracting from each attribute value its mean, obtaining y.
2. Compute Σ = 1/(n−1) * y y^T, the covariance matrix of x.
3. Diagonalize Σ, obtaining a matrix E of eigenvectors with E Σ E^T = D, where D is diagonal and D_ii = λ_i is the eigenvalue of the i-th eigenvector.
4. Select how many and which eigenvectors of E to keep, obtaining W (based on the variance expressed / the size of the eigenvalues and possibly other criteria).
5. Create the transformed dataset z = W^T y (equivalently W^T (x − m), with m the mean).
Remark: symmetric matrices are always orthogonally diagonalizable; see the proof on page 11 of the Shlens paper.
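A step-by-step NumPy sketch of this recipe; the data are random, and the 90% variance threshold in step 4 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 500
X = rng.normal(size=(d, n)) * np.array([[4.0], [2.0], [1.0], [0.2]])  # d x n dataset

# 1. Center: subtract each attribute's mean.
m = X.mean(axis=1, keepdims=True)
Y = X - m

# 2. Covariance matrix Sigma = 1/(n-1) * Y Y^T.
Sigma = (Y @ Y.T) / (n - 1)

# 3. Diagonalize Sigma (symmetric, so eigh applies); sort eigenvalues descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep enough eigenvectors to explain (say) 90% of the variance.
pov = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(pov, 0.9)) + 1
W = eigvecs[:, :k]

# 5. Transformed dataset z = W^T y.
Z = W.T @ Y
print(k, Z.shape)
```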

Textbook's PCA Version
Maximize Var(z) = w_1^T Σ w_1 subject to ||w_1|| = 1. The solution satisfies Σ w_1 = α w_1, that is, w_1 is an eigenvector of Σ; choose the eigenvector with the largest eigenvalue so that Var(z) is maximized.
Second principal component: maximize Var(z_2) subject to ||w_2|| = 1 and w_2 orthogonal to w_1. Again Σ w_2 = α w_2, that is, w_2 is another eigenvector of Σ (the one with the second-largest eigenvalue), and so on.

What PCA Does
z = W^T (x − m), where the columns of W are the eigenvectors of Σ and m is the sample mean. PCA centers the data at the origin and rotates the axes.

How to Choose k?
Proportion of Variance (PoV) explained by the first k components, with the eigenvalues λ_i sorted in descending order:
PoV(k) = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_d)
Typically, stop at PoV > 0.9. A scree graph plots PoV versus k; stop at the "elbow".
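A small sketch of the PoV rule; the eigenvalues below are made-up numbers chosen only to show the computation:

```python
import numpy as np

eigvals = np.array([4.5, 2.1, 0.8, 0.3, 0.2, 0.1])  # hypothetical, sorted descending
pov = np.cumsum(eigvals) / eigvals.sum()            # PoV(k) for k = 1..d
print(pov)                     # [0.5625 0.825  0.925  0.9625 0.9875 1.    ]
k = int(np.argmax(pov > 0.9)) + 1                   # smallest k with PoV > 0.9
print(k)                       # 3
```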


All Roads Lead to Rome!
SVD, maximizing the variance, and computing the eigenvectors of Σ all lead to the same "Rome": the principal components.
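A quick numerical check that two of these routes agree, using random data; the directions can differ in sign, which is why the comparison uses absolute values:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # 200 examples, 3 features
Y = X - X.mean(axis=0)                                    # centered data

# Route 1: eigenvectors of the covariance matrix, sorted by descending eigenvalue.
Sigma = np.cov(Y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Route 2: right singular vectors of the centered data matrix.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)

# Same principal directions (up to sign), hence the absolute values.
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs)))         # True
```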

Visualizing Numbers after Applying PCA (figure).

Multidimensional Scaling (MDS)
Given pairwise distances between N points, d_ij, i,j = 1,...,N, place the points on a low-dimensional map such that the distances are preserved. With the mapping z = g(x | θ), find θ that minimizes the Sammon stress
E(θ | X) = Σ_{r,s} ( ||z^r − z^s|| − ||x^r − x^s|| )^2 / ||x^r − x^s||^2,
where z^r = g(x^r | θ). The distances can be measured with the L1 norm or, more generally, an Lq (Minkowski) norm.
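A minimal sketch that evaluates the Sammon stress of a candidate 2-D embedding; the data are random, the embedding is simply a PCA projection rather than a tuned MDS solution, and SciPy's pdist is assumed to be available for the pairwise Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 5))             # N = 50 points in 5 dimensions
Y = X - X.mean(axis=0)

# Candidate low-dimensional map: projection onto the first two principal components.
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
Z = Y @ Vt[:2].T                         # N x 2 embedding

d_high = pdist(Y)                        # pairwise distances in the original space
d_low = pdist(Z)                         # pairwise distances on the low-dim map
sammon_stress = np.sum((d_low - d_high) ** 2 / d_high ** 2)
print(sammon_stress)
```

An actual MDS algorithm would iteratively adjust the map coordinates to drive this stress down, rather than fixing them with PCA as done here.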

Map of Europe by MDS (map from the CIA – The World Factbook).