Lecture Slides for INTRODUCTION TO Machine Learning, 3rd Edition
ETHEM ALPAYDIN © The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

CHAPTER 6: Dimensionality Reduction

Why Reduce Dimensionality?
Reduces time complexity: less computation.
Reduces space complexity: fewer parameters.
Saves the cost of observing the feature.
Simpler models are more robust on small datasets.
More interpretable; simpler explanation.
Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions.

Feature Selection vs Extraction
Feature selection: choose k < d important features and ignore the remaining d − k (subset selection algorithms).
Feature extraction: project the original x_i, i = 1,...,d dimensions to new k < d dimensions, z_j, j = 1,...,k.

Subset Selection
There are 2^d possible subsets of d features.
Forward search: add the best feature at each step. The set of features F is initially empty. At each iteration, find the best new feature j = argmin_i E(F ∪ x_i) and add x_j to F if E(F ∪ x_j) < E(F). This is a hill-climbing O(d^2) algorithm.
Backward search: start with all features and remove one at a time, if possible.
Floating search: add k, remove l.
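
A minimal sketch of forward subset selection, assuming the cross-validated error of a k-nearest-neighbor classifier as the error measure E (any estimator and error estimate could be substituted):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def error(feature_subset):
    # E(F): cross-validated error using only the selected features
    scores = cross_val_score(KNeighborsClassifier(), X[:, feature_subset], y, cv=5)
    return 1.0 - scores.mean()

selected = []                           # F starts empty
remaining = list(range(X.shape[1]))
best_err = np.inf
while remaining:
    # find the best new feature j = argmin_i E(F ∪ x_i)
    errs = [error(selected + [i]) for i in remaining]
    j = remaining[int(np.argmin(errs))]
    if min(errs) >= best_err:           # stop when adding x_j no longer reduces E
        break
    best_err = min(errs)
    selected.append(j)
    remaining.remove(j)

print("selected features:", selected, "cv error:", best_err)
```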

Iris data: single feature chosen (figure).

Iris data: add one more feature to F4, the chosen feature (figure).

Principal Components Analysis
Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is z = w^T x.
Find w such that Var(z) is maximized:
Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2] = E[(w^T x − w^T μ)(w^T x − w^T μ)] = E[w^T (x − μ)(x − μ)^T w] = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w,
where Var(x) = E[(x − μ)(x − μ)^T] = Σ.

Maximize Var(z) = w1^T Σ w1 subject to ||w1|| = 1. Writing the Lagrangian w1^T Σ w1 − α(w1^T w1 − 1) and setting its derivative to zero gives Σ w1 = α w1, that is, w1 is an eigenvector of Σ. Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.
Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1. This gives Σ w2 = α w2, that is, w2 is another eigenvector of Σ (the one with the second largest eigenvalue), and so on.

What PCA does: z = W^T (x − m), where the columns of W are the eigenvectors of Σ and m is the sample mean. This centers the data at the origin and rotates the axes.

How to choose k? Use the Proportion of Variance (PoV) explained, with the eigenvalues λi sorted in descending order:
PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)
Typically, stop at PoV > 0.9. A scree graph plots PoV vs. k; stop at the “elbow”.
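
A minimal sketch of PCA by eigendecomposition of the sample covariance, including the PoV rule for choosing k (numpy only; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)           # N x d data matrix
m = X.mean(axis=0)                          # sample mean
Xc = X - m                                  # center the data
Sigma = np.cov(Xc, rowvar=False)            # d x d covariance estimate

lam, W = np.linalg.eigh(Sigma)              # eigenvalues come out ascending
order = np.argsort(lam)[::-1]               # sort descending
lam, W = lam[order], W[:, order]

pov = np.cumsum(lam) / lam.sum()            # proportion of variance explained
k = int(np.searchsorted(pov, 0.9)) + 1      # smallest k with PoV >= 0.9

Z = Xc @ W[:, :k]                           # z = W^T (x - m) for all instances
print("k =", k, "PoV =", pov[k - 1])
```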

Feature Embedding
When X is the N×d data matrix, X^T X is the d×d matrix (covariance of the features, if mean-centered) and X X^T is the N×N matrix (pairwise similarities of the instances).
PCA uses the eigenvectors of X^T X, which are d-dimensional and can be used for projection.
Feature embedding uses the eigenvectors of X X^T, which are N-dimensional and directly give the coordinates after projection.
Sometimes we can define pairwise similarities (or distances) between instances; then we can use feature embedding without needing to represent the instances as vectors.
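
A short numerical check of this duality, assuming a centered data matrix: the eigenvectors of X X^T, scaled by the square roots of their eigenvalues, give the same coordinates (up to the sign of each axis) as projecting onto the eigenvectors of X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                            # mean-center
k = 2

# PCA side: eigenvectors of the d x d matrix X^T X, then project
lam_d, W = np.linalg.eigh(X.T @ X)
W = W[:, np.argsort(lam_d)[::-1][:k]]
Z_pca = X @ W

# feature-embedding side: eigenvectors of the N x N matrix X X^T
lam_n, U = np.linalg.eigh(X @ X.T)
order = np.argsort(lam_n)[::-1][:k]
Z_embed = U[:, order] * np.sqrt(lam_n[order])  # coordinates directly

print(np.allclose(np.abs(Z_pca), np.abs(Z_embed)))  # identical up to sign
```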

Factor Analysis
Find a small number of factors z that, when combined, generate x:
x_i − µ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i
where z_j, j = 1,...,k are the latent factors with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j; ε_i are zero-mean noise sources with Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, Cov(ε_i, z_j) = 0; and the v_ij are the factor loadings.
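
A minimal sketch of fitting this model with scikit-learn's FactorAnalysis; components_ holds the loadings v_ij and noise_variance_ the ψ_i (the choice of k = 2 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

fa = FactorAnalysis(n_components=2, random_state=0)
Z = fa.fit_transform(X)        # N x k factor scores z

V = fa.components_.T           # d x k factor loadings v_ij
psi = fa.noise_variance_       # per-feature noise variances psi_i
print(V.shape, psi.shape, Z.shape)
```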

PCA vs FA
PCA goes from x to z: z = W^T (x − µ).
FA goes from z to x: x − µ = Vz + ε.

Factor Analysis In FA, factors zj are stretched, rotated and translated to generate x

Singular Value Decomposition and Matrix Factorization
Singular value decomposition: X = VAW^T, where V is N×N and contains the eigenvectors of X X^T, W is d×d and contains the eigenvectors of X^T X, and A is N×d with the singular values on its first k diagonal entries.
Equivalently, X = a1 v1 w1^T + ... + ak vk wk^T, where k is the rank of X, the vi are columns of V, the wi are columns of W, and the ai are the singular values.
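
A short numpy sketch of the decomposition and the rank-k reconstruction (numpy's svd returns the factors as U, s, V^T, corresponding to V, diag(A), W^T in the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, s, Wt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) W^T

# rank-k approximation: sum of the first k outer products a_i v_i w_i^T
k = 2
Xk = sum(s[i] * np.outer(U[:, i], Wt[i]) for i in range(k))

# with k = rank(X), the sum reproduces X exactly
X_full = sum(s[i] * np.outer(U[:, i], Wt[i]) for i in range(len(s)))
print(np.allclose(X_full, X))
```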

Matrix Factorization
Matrix factorization: X = FG, where F is N×k and G is k×d.
Used, for example, in latent semantic indexing.
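
One common way to obtain such a factorization is a truncated SVD, which scikit-learn's TruncatedSVD implements (and documents as latent semantic analysis when applied to term-document matrices). A small sketch, taking F as the transformed scores and G as components_; the random matrix stands in for a document-term matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random(size=(50, 20))            # e.g. a document-term matrix

k = 5
svd = TruncatedSVD(n_components=k, random_state=0)
F = svd.fit_transform(X)                 # N x k
G = svd.components_                      # k x d

print(F.shape, G.shape)
print("reconstruction error:", np.linalg.norm(X - F @ G))
```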

Multidimensional Scaling
Given pairwise distances between N points, d_ij, i,j = 1,...,N, place them on a low-dimensional map such that the distances are preserved (by feature embedding).
With a parametric mapping z = g(x | θ), find the θ that minimizes the Sammon stress:
E(θ | X) = Σ_{r,s} (||z^r − z^s|| − ||x^r − x^s||)^2 / ||x^r − x^s||^2
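
A minimal sketch of classical (metric) MDS from a distance matrix via feature embedding, which is the eigendecomposition route the slide mentions rather than an iterative minimization of the Sammon stress:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
D = squareform(pdist(X))                 # N x N pairwise Euclidean distances

N, k = D.shape[0], 2
J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances

lam, U = np.linalg.eigh(B)
order = np.argsort(lam)[::-1][:k]        # top-k eigenvalues/eigenvectors
Z = U[:, order] * np.sqrt(lam[order])    # low-dimensional coordinates

print(Z.shape)
```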

Map of Europe by MDS Map from CIA – The World Factbook: http://www.cia.gov/

Linear Discriminant Analysis
Find a low-dimensional space such that when x is projected, the classes are well separated.
For two classes, find the w that maximizes J(w) = (m1 − m2)^2 / (s1^2 + s2^2), where m1, m2 are the means and s1^2, s2^2 the scatters of the two classes after projection.

Between-class scatter: (m1 − m2)^2 = (w^T m_1 − w^T m_2)^2 = w^T (m_1 − m_2)(m_1 − m_2)^T w = w^T S_B w, where m_1, m_2 are the class means in the original space and S_B = (m_1 − m_2)(m_1 − m_2)^T.
Within-class scatter: s1^2 + s2^2 = w^T S_W w, where S_W = S_1 + S_2 and S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T, with r_i^t = 1 if x^t belongs to class i and 0 otherwise.

Fisher’s Linear Discriminant
Find the w that maximizes J(w) = w^T S_B w / w^T S_W w = |w^T (m_1 − m_2)|^2 / w^T S_W w.
LDA solution: w = c · S_W^{-1} (m_1 − m_2) for some constant c.
Parametric solution (Gaussian classes sharing a common covariance matrix Σ): w = Σ^{-1} (μ_1 − μ_2).

K > 2 Classes
Within-class scatter: S_W = Σ_{i=1}^K S_i, with S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T.
Between-class scatter: S_B = Σ_{i=1}^K N_i (m_i − m)(m_i − m)^T, where m is the overall mean and N_i the number of instances in class i.
Find the W that maximizes J(W) = |W^T S_B W| / |W^T S_W W|.
The solution is given by the largest eigenvectors of S_W^{-1} S_B; since S_B has rank at most K − 1, at most K − 1 discriminants are obtained.
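
A minimal numpy sketch of multi-class LDA on Iris (K = 3, so at most K − 1 = 2 discriminants): build S_W and S_B as above and take the leading eigenvectors of S_W^{-1} S_B:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
d = X.shape[1]
m = X.mean(axis=0)                              # overall mean

S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    Xc = X[y == c]
    mc = Xc.mean(axis=0)                        # class mean m_i
    S_W += (Xc - mc).T @ (Xc - mc)              # within-class scatter
    diff = (mc - m).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)            # between-class scatter

# leading eigenvectors of S_W^{-1} S_B; at most K-1 have nonzero eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1][:len(classes) - 1]
W = eigvecs[:, order].real

Z = X @ W                                       # projected 2-D coordinates
print(Z.shape)
```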

PCA vs LDA

Canonical Correlation Analysis
X = {x^t, y^t}_t: two sets of variables, x and y.
We want to find two projections, w and v, such that when x is projected along w and y is projected along v, the correlation is maximized:
ρ = max_{w,v} Corr(w^T x, v^T y) = w^T Σ_xy v / sqrt(w^T Σ_xx w · v^T Σ_yy v)

CCA
x and y may be two different views or modalities, e.g., an image and its word tags; CCA does a joint mapping.
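
A short sketch using scikit-learn's CCA, with the Iris sepal and petal measurements treated as two views purely for illustration:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.datasets import load_iris

X_all, _ = load_iris(return_X_y=True)
X = X_all[:, :2]                 # "view 1": sepal length and width
Y = X_all[:, 2:]                 # "view 2": petal length and width

cca = CCA(n_components=2)
cca.fit(X, Y)
Zx, Zy = cca.transform(X, Y)     # projections w^T x and v^T y

# correlation of the first pair of canonical variates
print(np.corrcoef(Zx[:, 0], Zy[:, 0])[0, 1])
```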

Isomap
Geodesic distance is the distance along the manifold on which the data lie, as opposed to the Euclidean distance in the input space.

Isomap
Instances r and s are connected in the graph if ||x^r − x^s|| < ε or if x^s is one of the k nearest neighbors of x^r; the edge length is ||x^r − x^s||.
For two nodes r and s that are not connected, the distance is the length of the shortest path between them.
Once the N×N distance matrix is thus formed, use MDS to find a lower-dimensional mapping.
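
A minimal sketch with scikit-learn's Isomap, using the digits data as a stand-in for the Optdigits example shown in the figure:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)

# k-nearest-neighbor graph, geodesic (shortest-path) distances, then MDS internally
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)

print(Z.shape)   # (N, 2) coordinates for plotting
```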

Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html

Locally Linear Embedding
Given x^r, find its neighbors x^s_(r).
Find the reconstruction weights W_rs that minimize E(W | X) = Σ_r ||x^r − Σ_s W_rs x^s_(r)||^2, subject to Σ_s W_rs = 1.
Then find the new coordinates z^r that minimize E(z | W) = Σ_r ||z^r − Σ_s W_rs z^s_(r)||^2.
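
A short sketch with scikit-learn's LocallyLinearEmbedding, again on the digits data as a stand-in for Optdigits:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)

# reconstruct each point from its neighbors, then embed preserving the weights
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Z = lle.fit_transform(X)

print(Z.shape)
```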

LLE on Optdigits Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html

Laplacian Eigenmaps
Let r and s be two instances and B_rs their similarity; we want to find z^r and z^s that minimize Σ_{r,s} ||z^r − z^s||^2 B_rs.
B_rs can be defined in terms of similarity in the original space: B_rs = 0 if x^r and x^s are too far apart, and, for example, exp(−||x^r − x^s||^2 / (2σ^2)) otherwise.
This defines a graph Laplacian, and feature embedding returns the z^r.
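
A minimal sketch with scikit-learn's SpectralEmbedding, which implements Laplacian eigenmaps; here the similarities B_rs come from a nearest-neighbor graph (an RBF affinity is the other built-in option):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import SpectralEmbedding

X, y = load_iris(return_X_y=True)

# build the similarity graph, form the graph Laplacian, embed via its eigenvectors
se = SpectralEmbedding(n_components=2, affinity='nearest_neighbors',
                       n_neighbors=10, random_state=0)
Z = se.fit_transform(X)

print(Z.shape)
```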

Laplacian Eigenmaps on Iris (figure). See also spectral clustering (Chapter 7).