High Dimensional Probabilistic Modelling through Manifolds

High Dimensional Probabilistic Modelling through Manifolds. Neil Lawrence, Machine Learning Group, Department of Computer Science, University of Sheffield, U.K.

Overview
Motivation: modelling high dimensional data; smooth low dimensional embedded spaces.
Mathematical Foundations: probabilistic PCA; Gaussian processes.
Some Results
Extensions

Motivation

High Dimensional Data Handwritten digit: 3648 dimensions. Space contains more than just this digit.

Handwritten Digit A simple model of the digit: rotate the 'prototype'. [Figure: the prototype digit shown at several rotations.]

Projection onto Principal Components

Discontinuities

Low Dimensional Manifolds Pure rotation of a prototype is too simple. In practice the data may go through several distortions, e.g. digits undergo thinning, translation and rotation. For data with 'structure' we expect fewer distortions than dimensions; we therefore expect the data to live on a lower dimensional manifold. Deal with high dimensional data by looking for a lower dimensional non-linear embedding.

Our Options
Spectral approaches:
Classical Multidimensional Scaling (MDS) – uses eigenvectors of a similarity matrix.
LLE and Isomap – MDS with particular proximity measures.
Kernel PCA – provides an embedding and a mapping from the high dimensional space to the embedding; the mapping is implied through the use of a kernel function as the similarity matrix.
Non-spectral approaches:
Non-metric MDS and Sammon mappings – iterative optimisation of a stress function; a mapping can be forced (e.g. Neuroscale).

Our Options
Probabilistic approaches:
Probabilistic PCA – a linear method.
Density Networks – use importance sampling and a multi-layer perceptron.
GTM – uses a grid based sample and an RBF network.
The difficulty for probabilistic approaches: propagating a distribution through a non-linear mapping.

The New Model PCA has a probabilistic interpretation. It is difficult to ‘non-linearise’. We present a new probabilistic interpretation of PCA. This can be made non-linear. The result is non-linear probabilistic PCA.

Mathematical Foundations

Notation
q – dimension of latent/embedded space.
d – dimension of data space.
N – number of data points.
Centred data, $Y \in \mathbb{R}^{N \times d}$.
Latent variables, $X \in \mathbb{R}^{N \times q}$.
Mapping matrix, $W \in \mathbb{R}^{d \times q}$.
$\mathbf{a}_{(i)}$ is the vector from the i-th row of A; $\mathbf{a}_{i}$ is the vector from the i-th column of A.

Reading Notation: X and Y are design matrices. The covariance is given by $N^{-1}Y^{\top}Y$, and the inner product matrix is given by $YY^{\top}$.
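As a quick illustration of this notation in code (a minimal NumPy sketch; the random data and shapes here are purely illustrative):

```python
import numpy as np

N, d = 100, 10                 # number of data points, data dimension
Y = np.random.randn(N, d)
Y = Y - Y.mean(axis=0)         # centre the data

covariance = Y.T @ Y / N       # d x d matrix, the N^{-1} Y^T Y of the slides
inner_product = Y @ Y.T        # N x N matrix, the Y Y^T of the slides
```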

Linear Embeddings Represent the data, Y, with a lower dimensional embedding X. Assume a linear relationship of the form $\mathbf{y}_{i,:} = W\mathbf{x}_{i,:} + \boldsymbol{\eta}_{i,:}$, where $\boldsymbol{\eta}_{i,:} \sim \mathcal{N}(\mathbf{0}, \sigma^{2}I)$.

Probabilistic PCA [Graphical model: latent variables X and mapping W generate the data Y.]

Maximum Likelihood Solution If $U_{q}$ are the first q eigenvectors of $N^{-1}Y^{\top}Y$ and the corresponding eigenvalues are $\Lambda_{q}$, then $W = U_{q}(\Lambda_{q} - \sigma^{2}I)^{1/2}V^{\top}$, where V is an arbitrary rotation matrix.
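A minimal NumPy sketch of this maximum likelihood solution, assuming (as in Tipping and Bishop's probabilistic PCA) that the noise variance is the mean of the discarded eigenvalues and taking the rotation V as the identity; names are illustrative:

```python
import numpy as np

def ppca_ml(Y, q):
    """Maximum likelihood PPCA. Y is N x d and assumed centred."""
    N, d = Y.shape
    S = Y.T @ Y / N                                # sample covariance N^{-1} Y^T Y
    eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # sort into descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U_q, Lambda_q = eigvecs[:, :q], eigvals[:q]
    sigma2 = eigvals[q:].mean()                    # noise variance from discarded directions
    W = U_q @ np.diag(np.sqrt(Lambda_q - sigma2))  # rotation V taken as identity
    return W, sigma2
```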

PCA – Probabilistic Interpretation [Graphical model: the latent variables X are marginalised and the mapping W is optimised.]

Dual Probabilistic PCA [Graphical model: the mapping W is marginalised and the latent variables X are optimised.]

Maximum Likelihood Solution If Uq are first q eigenvectors of d-1YYT and the corresponding eigenvalues are q. where V is an arbitrary rotation matrix.

Maximum Likelihood Solution If $U_{q}$ are the first q eigenvectors of $N^{-1}Y^{\top}Y$ and the corresponding eigenvalues are $\Lambda_{q}$, then $W = U_{q}(\Lambda_{q} - \sigma^{2}I)^{1/2}V^{\top}$, where V is an arbitrary rotation matrix.

Equivalence of PPCA Formulations Solution for PPCA: $W = U_{q}(\Lambda_{q} - \sigma^{2}I)^{1/2}V^{\top}$. Solution for Dual PPCA: $X = U'_{q}(\Lambda_{q} - \sigma^{2}I)^{1/2}V^{\top}$. The equivalence follows from the relationship between the eigendecompositions of $Y^{\top}Y$ and $YY^{\top}$: the two matrices share the same non-zero eigenvalues, and their eigenvectors are related through $U_{q} \propto Y^{\top}U'_{q}$.
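A quick numerical check of the shared eigen-structure behind the two formulations (illustrative only; it confirms that the non-zero eigenvalues of $Y^{\top}Y$ and $YY^{\top}$ coincide):

```python
import numpy as np

N, d = 50, 20
Y = np.random.randn(N, d)
Y -= Y.mean(axis=0)

eig_cov = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]    # d eigenvalues
eig_inner = np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1]  # N eigenvalues
k = min(N, d)
# The leading k eigenvalues agree; the 1/N and 1/d scalings used on the
# slides only rescale them.
assert np.allclose(eig_cov[:k], eig_inner[:k])
```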

Gaussian Processes

Gaussian Process (GP) A prior over functions. Functions are infinite dimensional; distributions over finite instantiations are finite dimensional objects. One can prove by induction that the GP is 'consistent'. A GP is defined by a mean function and a covariance function. The mean function is often taken to be zero. The covariance function must be positive definite. The class of valid covariances is the same as the class of Mercer kernels.
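A short sketch of what a GP prior means in practice: draw sample functions at a finite set of inputs from a zero-mean Gaussian with an RBF covariance (everything below is illustrative; the kernel parameters are arbitrary):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, gamma=10.0):
    """RBF covariance: alpha * exp(-gamma/2 * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-0.5 * gamma * sq_dists)

x = np.linspace(0, 1, 200)[:, None]              # finite set of input locations
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
# Each row of `samples` is one draw from the GP evaluated at the inputs x.
```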

Gaussian Processes A (zero mean) Gaussian process likelihood is of the form $p(\mathbf{y}\,|\,X) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, K)$, where K is the covariance function or kernel. The linear kernel has the form $K = XX^{\top} + \sigma^{2}I$.

Dual Probabilistic PCA (revisited) [Graphical model: marginalising the mapping W corresponds to placing a Gaussian process prior over the mapping from X to Y.]

Dual Probabilistic PCA is a GP-LVM Log-likelihood: $L = -\frac{dN}{2}\ln 2\pi - \frac{d}{2}\ln|K| - \frac{1}{2}\operatorname{tr}(K^{-1}YY^{\top})$.
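A direct transcription of this log-likelihood for the linear kernel of the previous slides (a NumPy sketch; the default noise variance is an arbitrary illustrative value):

```python
import numpy as np

def gplvm_log_likelihood(X, Y, sigma2=0.01):
    """GP-LVM / dual PPCA log-likelihood with linear kernel K = X X^T + sigma^2 I:
    L = -dN/2 log(2 pi) - d/2 log|K| - 1/2 tr(K^{-1} Y Y^T)."""
    N, d = Y.shape
    K = X @ X.T + sigma2 * np.eye(N)
    _, logdetK = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)               # K^{-1} Y without an explicit inverse
    return (-0.5 * d * N * np.log(2 * np.pi)
            - 0.5 * d * logdetK
            - 0.5 * np.trace(Y.T @ Kinv_Y))
```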

Non-linear Kernel Instead of the linear kernel function, use, for example, an RBF kernel function, $k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \alpha\exp\!\left(-\tfrac{\gamma}{2}\|\mathbf{x}_{i} - \mathbf{x}_{j}\|^{2}\right)$. This leads to non-linear embeddings.

Pros & Cons of GP-LVM
Pros: probabilistic; missing data is straightforward; can sample from the model given X; different noise models can be handled; kernel parameters can be optimised.
Cons: speed of optimisation; the optimisation is non-convex (cf. classical MDS, kernel PCA).

Benchmark Examples

GP-LVM Optimisation Gradient based optimisation of the log-likelihood with respect to X and the kernel parameters, using scaled conjugate gradients (SCG). Example data set: oil flow data. Three phases of flow (stratified, annular, homogeneous); twelve measurement probes; 1000 data points. We sub-sampled to 100 data points.
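A rough sketch of the optimisation loop, using L-BFGS from SciPy with numerical gradients as a stand-in for the scaled conjugate gradients and analytic gradients used in the talk (it reuses the gplvm_log_likelihood sketch above; initialisation and parameters are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gplvm(Y, q, sigma2=0.01, seed=0):
    """Optimise the latent positions X of a linear-kernel GP-LVM."""
    N, d = Y.shape
    rng = np.random.default_rng(seed)
    X0 = 0.1 * rng.standard_normal((N, q))       # random start (PCA would be a better one)

    def neg_log_lik(x_flat):
        X = x_flat.reshape(N, q)
        return -gplvm_log_likelihood(X, Y, sigma2)

    res = minimize(neg_log_lik, X0.ravel(), method="L-BFGS-B")
    return res.x.reshape(N, q)
```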

[Figure: latent space visualisations of the sub-sampled oil flow data for PCA, non-metric MDS, metric MDS, GTM, kernel PCA and GP-LVM.]

Nearest Neighbour in X Number of errors for each method:
PCA: 20, GP-LVM: 4, non-metric MDS: 13.
Metric MDS: 6, GTM*: 7, kernel PCA*: –.
* These models required parameter selection.
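The nearest neighbour error reported in these tables can be computed with a sketch like the following: each point is assigned the label of its closest other point in the latent space and misclassifications are counted (illustrative; labels is assumed to be an integer array of the true phase labels):

```python
import numpy as np

def nearest_neighbour_errors(X, labels):
    """Count points whose nearest neighbour in X carries a different label."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq_dists, np.inf)           # a point may not be its own neighbour
    nn = sq_dists.argmin(axis=1)
    return int((labels[nn] != labels).sum())
```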

Full Oil Data

Nearest Neighbour in X Number of errors for each method: PCA – 162, GTM – 11, GP-LVM – 1.

Applications

Applications Grochow et al.: style based inverse kinematics. Urtasun et al.: a prior for tracking. We've been looking at faces…

Face Animation Data from Electronic Arts; OpenGL code by Manuel Sanchez (now at Electronic Arts).

Extensions

Back Constraints The GP-LVM gives a smooth mapping from X to Y: points close together in X will be close in Y, but this does not imply that points close in Y will be close in X. Kernel PCA gives a smooth mapping from Y to X: points close together in Y will be close in X, but this does not imply that points close in X will be close in Y. (Joint work with Joaquin Quiñonero Candela.)

Back Constraints Maximise the likelihood subject to a constraint: each latent point is given by a mapping from the data space. For example, the mapping could be kernel based: $x_{nj} = \sum_{m} a_{jm}\,k(\mathbf{y}_{n}, \mathbf{y}_{m})$.
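A sketch of this kernel back constraint: instead of treating X as free parameters, each latent coordinate is written as a kernel regression on the data and the weight matrix A is optimised instead (the names and the RBF width below are illustrative assumptions):

```python
import numpy as np

def back_constrained_latents(Y, A, gamma=1e-3):
    """X = K_Y A, with K_Y an RBF kernel computed on the data space.

    Y is N x d and A is N x q, so X comes out N x q and is by construction
    a smooth function of the data: points close in Y map to points close in X.
    """
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K_Y = np.exp(-0.5 * gamma * sq_dists)
    return K_Y @ A
```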

Back Constrained GP-LVM Gives a mapping in both directions: a GP mapping from X to Y, and a reverse constraining mapping from Y to X (X is constrained to be a function of Y).

Motion Capture with Back Constraints MATLAB demo: example in motion capture with RBF back constraints.

Linear Back Constraints $X = YB$. Learn the projection matrix $B \in \mathbb{R}^{d \times q}$. As motivation, consider PCA on a digit data set.
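The linear version is correspondingly simple: the latent points are parameterised through a projection matrix B, and it is B rather than X that gets optimised (a minimal illustrative sketch):

```python
import numpy as np

def linear_back_constrained_latents(Y, B):
    """Linear back constraint X = Y B, with Y of size N x d and B of size d x q."""
    return Y @ B        # N x q latent positions, a linear function of the data
```

Optimising the GP-LVM likelihood over B (for instance by composing this with the fit_gplvm sketch above) would yield a learned linear projection of the kind compared against PCA on the following slides.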

Reconstruction with GP

Linear Projection with GP-LVM [Figure: linear projections of the digit data, PCA versus the linear back constrained GP-LVM.]

Linear constrained GP-LVM Nearest neighbour errors in X:
latent dim: 2, 3, 4
PCA: 131, 115, 47
Linear constrained GP-LVM: 79, 60, 39
(cf. 24 errors for nearest neighbour in Y)

Ongoing Work Improving quality of learning in large data sets.

Conclusions

Conclusion A probabilistic non-linear interpretation of PCA. A probabilistic model for high dimensions. Back constraints can be introduced to improve visualisation and to seek better linear projections.