1
Advanced Machine Learning & Perception
Instructor: Tony Jebara, Columbia University
2
Topic 13: Manifolds Continued and Spectral Clustering
Convex Invariance Learning (CoIL)
Kernel PCA (KPCA)
Spectral Clustering & N-Cuts
3
Manifolds Continued
PCA: linear manifold
MDS: get inter-point distances, find 2D data with the same distances
LLE: mimic neighborhoods using low-dimensional vectors
GTM: fit a grid of Gaussians to data via a nonlinear warp
CoIL: linear PCA after nonlinear normalization/invariance of the data
KPCA: the manifold is linear PCA in Hilbert space (kernels)
Spectral clustering in Hilbert space
4
Convex Invariance Learning
PCA is appropriate for finding a linear manifold; variation in the data is only modeled linearly.
But many problems are nonlinear. However, the nonlinear variations may be irrelevant:
Images: morph, rotate, translate, zoom…
Audio: pitch changes, ambient acoustics…
Video: motion, camera view, angles…
Genomics: proteins fold, insertions, deletions…
Databases: fields swapped, formats, scaled…
Imagine a "gremlin" is corrupting your data by multiplying each input vector X_t by a type of matrix A_t to give A_t X_t.
Idea: remove the nonlinear irrelevant variations before PCA, but make this part of the PCA optimization, not pre-processing.
5
Convex Invariance Learning
Example of irrelevant variation in our data: permutation in image data… each image X_t is multiplied by a permutation matrix A_t by the gremlin. We must clean it up.
When we convert images to a vector, we are assuming an arbitrary, meaningless ordering (like the gremlin mixing the order).
This arbitrary ordering causes wild nonlinearities (manifold).
We should not trust the ordering; assume the gremlin has permuted it with an arbitrary permutation matrix…
6
Permutation Invariance
Permutation is an irrelevant variation in our data: the gremlin is permuting fields in our input vectors.
So, view a datum as a "bag of vectors" instead of a single vector, i.e. a grayscale image = set of vectors, or bag of pixels: N pixels, each a D=3 XYI (x, y, intensity) tuple.
Treat each input as a permutable "bag of pixels".
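As a concrete illustration of this representation, here is a minimal NumPy sketch (not from the slides) that turns a grayscale image into an N x 3 bag of (x, y, intensity) rows; the variable img is an assumed H x W grayscale array, and the row ordering carries no meaning.

import numpy as np

def bag_of_pixels(img):
    # img: H x W grayscale array; output: N x 3 matrix of (x, y, intensity) rows.
    # The row order is arbitrary -- exactly the ordering the gremlin may scramble.
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    return np.column_stack([xs.ravel(), ys.ravel(), img.ravel()]).astype(float)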
7
Optimal Permutation
Vectorization / rasterization uses the index in the image to sort pixels into one large vector.
If we knew the "optimal" correspondence, we could sort the pixels in the bag into a large vector more appropriately… but we don't know it, so we must learn it…
8
PCA on Permuted Data
In non-permuted vector images, linear changes and eigenvectors are additions and deletions of intensities (bad!). Translating, raising eyebrows, etc. = erasing and redrawing.
In a bag of pixels (vectorized only after knowing the optimal permutation), linear changes and eigenvectors are morphings and warpings: jointly spatial and intensity changes.
9
Permutation as a Manifold
Assume the order is unknown: a "set of vectors" or "bag of pixels".
Get permutational invariance (order doesn't matter).
Can't represent the invariance by a single 'X' vector point in D×N space, since we don't know the ordering.
Get permutation invariance by letting 'X' span all possible reorderings: multiply X by an unknown A matrix (permutation or doubly-stochastic).
10
Invariant Paths as Matrix Ops
Move a vector along the manifold by multiplying it by a matrix.
Restrict A to be a permutation matrix (operator); the resulting manifold of configurations is an "orbit" if the A's form a group.
Or, for a smooth manifold, make A a doubly-stochastic matrix.
Endow each image in the dataset with its own transformation matrix; each image is now a bag or manifold.
11
A Dataset of Invariant Manifolds
E.g. assume the model is PCA; learn a 2D subspace of 3D data.
Permutation lets points move independently along their paths.
Find PCA after moving the points to form a 'tight' 2D subspace.
More generally, move along the manifolds to improve the fit of any model (PCA, SVM, probability density, etc.).
12
Optimizing the Permutations
Optimize a modeling cost under linear constraints on the matrices.
Estimate the transformation parameters and the model parameters (PCA, Gaussian, SVM).
The cost on the matrices A emerges from the modeling criterion.
Typically we get a convex cost over a convex hull of constraints (unique!).
Since the A matrices are soft permutation matrices (doubly-stochastic), we have A_t ≥ 0 elementwise, with each row and each column summing to 1.
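The lecture solves for the A_t matrices with a QP over this constraint set; purely as an illustration of the doubly-stochastic constraints themselves (Sinkhorn-style alternating normalization, not the SMO solver used in the lecture), a positive matrix can be driven toward rows and columns that each sum to one:

import numpy as np

def sinkhorn(M, iters=100):
    # Alternately normalize rows and columns of a positive matrix so that it
    # approaches the doubly-stochastic polytope (A >= 0, rows and columns sum to 1).
    A = np.array(M, dtype=float)
    for _ in range(iters):
        A /= A.sum(axis=1, keepdims=True)   # rows sum to 1
        A /= A.sum(axis=0, keepdims=True)   # columns sum to 1
    return A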
13
Example Cost: Gaussian Mean
Maximum likelihood Gaussian mean model.
Theorem 1: C(A) is convex in A (a convex program).
Can solve via a quadratic program on the A matrices.
Minimizing the trace of a covariance tries to pull the data spherically towards a common mean.
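A small sketch of evaluating the quantity described here: for fixed soft permutations A_t, transform each bag X_t, stack the flattened results, and take the trace of their covariance (total squared deviation from the common mean). The exact regularization in the lecture's QP may differ; this only shows the cost being minimized.

import numpy as np

def gaussian_mean_cost(As, Xs):
    # As: list of N x N soft permutation matrices; Xs: list of N x D pixel bags.
    # Returns the trace of the covariance of the flattened A_t X_t vectors.
    Y = np.stack([(A @ X).ravel() for A, X in zip(As, Xs)])
    mu = Y.mean(axis=0)
    return float(((Y - mu) ** 2).sum() / len(Y))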
14
Example Cost: Gaussian Covariance
Theorem 2: The regularized log-determinant of the covariance is convex in A; equivalently, minimize this regularized log-determinant cost.
Theorem 3: The cost is non-quadratic but upper-boundable by a quadratic; iteratively solve with a QP using the variational bound.
Minimizing the determinant flattens the data into a low-volume pancake.
15
Example Cost: Fisher Discriminant
Find the linear Fisher discriminant model w that maximizes the ratio of between-class to within-class scatter.
For discriminative invariance, the transformation matrices should increase the between-class scatter (numerator) and reduce the within-class scatter (denominator).
Minimizing the above permutes the data to make classification easy.
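For reference, a minimal sketch of the Fisher criterion being maximized: the ratio of between-class to within-class scatter along a direction w (the names w, X1, X2 are hypothetical inputs, not from the slides).

import numpy as np

def fisher_ratio(w, X1, X2):
    # Ratio of between-class scatter to within-class scatter along direction w.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    between = (w @ (m1 - m2)) ** 2
    within = (((X1 - m1) @ w) ** 2).sum() + (((X2 - m2) @ w) ** 2).sum()
    return float(between / within)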
16
Interpreting C(A)
Maximum likelihood mean: permute data towards a common mean.
Maximum likelihood mean & covariance: permute data towards a flat subspace; pushes energy into few eigenvectors; great as pre-processing before PCA.
Fisher discriminant: permute data towards two flat subspaces while repelling them away from each other's means.
17
SMO Optimization of the QP
Quadratic programming is used for all of the C(A) costs, since:
Gaussian mean: quadratic.
Gaussian covariance: upper-boundable by a quadratic.
Fisher discriminant: upper-boundable by a quadratic.
Use Sequential Minimal Optimization: axis-parallel optimization, pick axes to update, and ensure the constraints are not violated.
For a soft permutation matrix, update 4 entries at a time so the 4 affected row/column sum constraints remain satisfied.
18
XY Digits Permuted PCA
Original PCA vs. permuted PCA.
20 images of '3' and '9'; each is 70 (x, y) dots, with no ordering on the 'dots'.
PCA compresses with the same number of eigenvectors.
The convex program first estimates the permutation, giving better reconstruction.
19
Interpolation
Intermediate images are smooth morphs.
Points are nicely corresponded.
Spatial morphing versus 'redrawing': no ghosting.
20
XYI Faces Permuted PCA
Original PCA vs. permuted bag-of-XYI-pixels PCA.
2000 XYI pixels, compressed to 20 dimensions.
Improves the squared error of PCA by almost 3 orders of magnitude.
21
XYI Multi-Faces Permuted PCA
+/- scaling along each of the top 5 eigenvectors.
All are just linear variations in the bag of XYI pixels.
Standard vectorization makes the variation nonlinear and would need a huge number of eigenvectors.
22
XYI Multi-Faces Permuted PCA
+/- scaling along each of the next 5 eigenvectors.
23
Kernel PCA
Replace all dot products in PCA with kernel evaluations.
Recall, we could do PCA on the D×D covariance matrix of the data or on the N×N Gram matrix of the data.
For nonlinearity, do PCA on feature expansions φ(x).
Instead of doing an explicit feature expansion, use a kernel k(x, y) = φ(x)·φ(y), e.g. a d-th order polynomial.
As usual, the kernel must satisfy Mercer's theorem.
Assume, for simplicity, all feature data is zero-mean; then the eigenvalues and eigenvectors satisfy C-bar v = λ v, where C-bar = (1/N) Σ_t φ(x_t) φ(x_t)ᵀ is the covariance of the zero-mean feature vectors.
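A one-line sketch of the d-th order polynomial kernel mentioned above, evaluated on all pairs of rows of a data matrix X (N x D) to build the N x N Gram matrix (the offset-free form (x·y)^d is assumed here; a "+ 1" offset variant is also common):

import numpy as np

def poly_gram(X, d=2):
    # K[s, t] = (x_s . x_t)^d, the d-th order polynomial kernel.
    return (X @ X.T) ** d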
24
Kernel PCA
Efficiently find and use the eigenvectors of C-bar: C-bar v = λ v.
Can dot either side of the above equation with a feature vector: φ(x_s)·(C-bar v) = λ φ(x_s)·v for all s.
The eigenvectors are in the span of the feature vectors: v = Σ_t α_t φ(x_t).
Combining the equations gives K² α = N λ K α.
25
Kernel PCA
From before we had K² α = N λ K α; this is satisfied by the eigenvalue equation K α = N λ α.
Get the eigenvectors α and eigenvalues of K; the eigenvalues of K are N times the λ of C-bar.
For each eigenvector α^k of K there is an eigenvector v_k = Σ_t α^k_t φ(x_t) of C-bar.
Want the eigenvectors v_k to be normalized: 1 = v_k·v_k = (α^k)ᵀ K α^k.
Can now use the alphas only for doing PCA projection and reconstruction!
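A minimal sketch of these steps, assuming K is an already centered N x N Gram matrix: eigendecompose K, keep the top components, and rescale each alpha so the implicit Hilbert-space eigenvector has unit norm ((α^k)ᵀ K α^k = μ_k ||α^k||² = 1).

import numpy as np

def kpca_alphas(K, num_components):
    # Solve K alpha = mu alpha; mu = N * lambda, with lambda an eigenvalue of C-bar.
    mu, A = np.linalg.eigh(K)              # eigenvalues in ascending order
    mu, A = mu[::-1], A[:, ::-1]           # reorder: largest first
    mu, A = mu[:num_components], A[:, :num_components]
    A = A / np.sqrt(mu)                    # now alpha_k' K alpha_k = 1, i.e. ||v_k|| = 1
    return mu, A                           # columns of A are the normalized alpha_k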
26
Kernel PCA
To compute the k-th projection coefficient of a new point x: v_k·φ(x) = Σ_t α^k_t k(x_t, x).
Reconstruction*: *pre-image problem, a linear combination in Hilbert space generally goes outside the image of the feature map.
Can now do nonlinear PCA and do PCA on non-vectors.
Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These eigenvectors:
1) The top q have maximum variance.
2) The top-q reconstruction has minimum mean squared error.
3) They are uncorrelated/orthogonal.
4) The top q have maximum mutual information with the inputs.
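The projection step as a sketch: given the kernel evaluations between new points and the training points, the k-th coefficient is just a weighted sum with α^k (A is the matrix of normalized alphas from the previous sketch).

import numpy as np

def kpca_project(K_new, A):
    # K_new[i, t] = k(x_new_i, x_t); returns the projection coefficients
    # v_k . phi(x_new_i) = sum_t alpha_t^k k(x_new_i, x_t) for each kept component.
    # For centered KPCA, K_new must be centered consistently with the training Gram matrix.
    return K_new @ A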
27
Centering Kernel PCA
So far, we had assumed the feature data was zero-mean: Σ_t φ(x_t) = 0.
We want to work with the centered features φ(x_t) - (1/N) Σ_s φ(x_s).
How to do this without touching feature space? Use kernels…
Can get the alpha eigenvectors from K-tilde by adjusting the old K: K-tilde = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the N×N matrix with every entry equal to 1/N.
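The adjustment of the old K, written out as a short sketch:

import numpy as np

def center_gram(K):
    # K_tilde = K - 1_N K - K 1_N + 1_N K 1_N, where (1_N)_{st} = 1/N.
    # This equals the Gram matrix of the mean-subtracted feature vectors.
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N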
28
KPCA on a 2D Dataset
Left to right: the kernel polynomial order goes from 1 to 3 (order 1 = linear = PCA).
Top to bottom: from the top eigenvector down to weaker eigenvectors.
29
Kernel PCA Results
Use the coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images.
Use various polynomial kernel degrees, where degree 1 = linear, as in regular PCA.
30
Kernel PCA Results
Use the coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images.
Use various polynomial kernel degrees, where degree 1 = linear, as in regular PCA (the worst case in these experiments).
Inferior performance to nonlinear SVMs (why?).
31
Spectral Clustering
Typically, we use EM or k-means to cluster N data points.
But we can imagine clustering the data points using only an N×N matrix capturing their proximity information. This is spectral clustering.
Again compute a Gram matrix using, e.g., an RBF kernel.
Example: we have N pixels from an image, where each x = [xcoord, ycoord, intensity] of a pixel.
The eigenvectors of the K matrix (or a slight variant of it) seem to capture a segmentation or clustering of the data points!
This is a nonparametric form of clustering, since we didn't assume a Gaussian distribution…
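A compact sketch of the pipeline just described: build the RBF Gram matrix over the pixel features and take the top eigenvectors of one common "slight variant" (the symmetric normalization of Ng, Jordan & Weiss is assumed here, not necessarily the exact variant used in the lecture); the rows of the resulting embedding are then clustered, e.g. with k-means.

import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    # X: N x D points (e.g. one [xcoord, ycoord, intensity] row per pixel).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))                 # RBF Gram matrix
    d = K.sum(axis=1)
    L = K / np.sqrt(np.outer(d, d))                      # symmetric normalization D^-1/2 K D^-1/2
    w, V = np.linalg.eigh(L)
    E = V[:, -k:]                                        # top-k eigenvectors, one row per point
    E = E / np.linalg.norm(E, axis=1, keepdims=True)     # normalize rows to the unit sphere
    return E                                             # cluster the rows of E with k-means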
32
Stability in Spectral Clustering
A standard problem when computing and using eigenvectors: small changes in the data can cause the eigenvectors to change wildly.
Ensure the eigenvectors we keep are distinct and stable: look at the eigengap.
Some algorithms ensure the eigenvectors will have a safe eigengap; adjust or process the Gram matrix to ensure the eigenvectors remain stable.
(Figure: keeping 3 eigenvectors is unsafe when the spectrum has no gap there, safe when there is a clear gap.)
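Checking the eigengap is straightforward; a sketch, where K is the Gram (affinity) matrix and k the number of eigenvectors we intend to keep:

import numpy as np

def eigengap(K, k):
    # Difference between the k-th and (k+1)-th largest eigenvalues;
    # a large gap suggests the span of the top-k eigenvectors is stable.
    w = np.sort(np.linalg.eigvalsh(K))[::-1]
    return float(w[k - 1] - w[k])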
33
Stabilized Spectral Clustering
The stabilized spectral clustering algorithm.
34
Stabilized Spectral Clustering
Example results compared to other clustering algorithms (traditional k-means, unstable spectral clustering, connected components).