1
Probabilistic Dimensional Reduction with the Gaussian Process Latent Variable Model
Neil Lawrence Machine Learning Group Department of Computer Science University of Sheffield, U.K.
2
Overview
Motivation: modelling high dimensional data; smooth low dimensional embedded spaces.
Mathematical Foundations: probabilistic PCA; Gaussian processes.
Results and Extensions.
3
Motivation
4
High Dimensional Data Handwritten digit:
3648 dimensions. Space contains more than just this digit.
5
Handwritten Digit: a simple model of the digit – rotate the ‘prototype’.
6
MATLAB Demo
demDigitsManifold([2 3], 'all')
demDigitsManifold([2 3], 'sixnine')
7
Projection onto Principal Components
8
Discontinuities
9
Low Dimensional Manifolds
Pure rotation of a prototype is too simple. In practice the data may go through several distortions, e.g. digits undergo thinning, translation and rotation. For data with ‘structure’ we expect fewer distortions than dimensions; we therefore expect the data to live on a lower dimensional manifold. Deal with high dimensional data by looking for a lower dimensional non-linear embedding.
10
Our Options
Spectral approaches:
Classical multidimensional scaling (MDS) uses eigenvectors of a similarity matrix. LLE and Isomap are MDS with particular proximity measures. Kernel PCA provides an embedding and a mapping from the high dimensional space to the embedding; the mapping is implied through the use of a kernel function as the similarity matrix.
Non-spectral approaches:
Non-metric MDS and Sammon mappings: iterative optimisation of a stress function. A mapping can be forced (e.g. Neuroscale).
11
Our Options: Probabilistic Approaches
Probabilistic PCA: a linear method. Density networks: use importance sampling and a multi-layer perceptron. GTM: uses a grid based sample and an RBF network.
The difficulty for probabilistic approaches: propagating a distribution through a non-linear mapping.
12
The New Model
PCA has a probabilistic interpretation, but it is difficult to ‘non-linearise’.
We present a new probabilistic interpretation of PCA that can be made non-linear.
The result is non-linear probabilistic PCA.
13
Mathematical Foundations
14
Notation q – dimension of latent/embedded space.
d – dimension of data space. N – number of data points. Centred data Y ∈ ℝ^{N×d}, latent variables X ∈ ℝ^{N×q}, mapping matrix W ∈ ℝ^{d×q}. a_{i,:} is the vector formed from the i-th row of A; a_{:,i} is the vector formed from the i-th column of A.
15
Reading Notation X and Y are design matrices.
Covariance given by N^{-1} Y^T Y; inner product matrix given by Y Y^T.
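As a minimal MATLAB sketch of these two quantities (variable names and the dummy data are illustrative, not from any toolbox):

Y = randn(100, 5);                 % dummy N x d design matrix
N = size(Y, 1);
Yc = Y - mean(Y, 1);               % centre the data
C = (1/N) * (Yc' * Yc);            % d x d covariance matrix, N^{-1} Y^T Y
S = Yc * Yc';                      % N x N inner product (Gram) matrix, Y Y^T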
16
Linear Embeddings
Represent data, Y, with a lower dimensional embedding X.
Assume a linear relationship of the form y_{i,:} = W x_{i,:} + η_{i,:}, where the noise is Gaussian, η_{i,:} ~ N(0, σ²I).
17
Probabilistic PCA
(Graphical model: latent variables X and mapping W generate the data Y; X is marginalised, leaving a model relating W and Y.)
18
Maximum Likelihood Solution
If U_q are the first q eigenvectors of N^{-1} Y^T Y and the corresponding eigenvalues are Λ_q, the maximum likelihood solution is W = U_q (Λ_q - σ²I)^{1/2} V^T, where V is an arbitrary rotation matrix.
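A minimal MATLAB sketch of this standard probabilistic PCA solution (dummy data; variable names are illustrative; the noise variance is the mean of the discarded eigenvalues):

N = 100; d = 10; q = 2;
Y = randn(N, d); Y = Y - mean(Y, 1);            % dummy centred data
[U, Lam] = eig((1/N) * (Y' * Y));               % eigendecomposition of N^{-1} Y^T Y
[lambda, order] = sort(diag(Lam), 'descend');
Uq = U(:, order(1:q));                          % first q eigenvectors
sigma2 = mean(lambda(q+1:end));                 % maximum likelihood noise variance
W = Uq * diag(sqrt(lambda(1:q) - sigma2));      % W = U_q (Lambda_q - sigma^2 I)^{1/2}, up to a rotation V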
19
PCA – Probabilistic Interpretation
(Graphical model as before: X and W generate Y, with X marginalised.)
20
Dual Probabilistic PCA
(Graphical model: X and W generate Y; now W is marginalised, leaving a model relating X and Y.)
21
Maximum Likelihood Solution
If U_q are the first q eigenvectors of d^{-1} Y Y^T and the corresponding eigenvalues are Λ_q, the maximum likelihood solution is X = U_q (Λ_q - σ²I)^{1/2} V^T, where V is an arbitrary rotation matrix.
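The dual solution has the same form but uses the N × N inner product matrix; a minimal sketch under the same assumptions (here σ² is simply fixed for illustration rather than estimated):

N = 100; d = 10; q = 2;
Y = randn(N, d); Y = Y - mean(Y, 1);            % dummy centred data
[U, Lam] = eig((1/d) * (Y * Y'));               % eigendecomposition of d^{-1} Y Y^T
[lambda, order] = sort(diag(Lam), 'descend');
Uq = U(:, order(1:q));                          % first q eigenvectors (N x q)
sigma2 = 0.01;                                  % noise variance, fixed for this illustration
X = Uq * diag(sqrt(lambda(1:q) - sigma2));      % X = U_q (Lambda_q - sigma^2 I)^{1/2}, up to a rotation V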
22
Maximum Likelihood Solution
If U_q are the first q eigenvectors of N^{-1} Y^T Y and the corresponding eigenvalues are Λ_q, the maximum likelihood solution is W = U_q (Λ_q - σ²I)^{1/2} V^T, where V is an arbitrary rotation matrix.
23
Equivalence of PPCA Formulations
Solution for PPCA: W = U_q (Λ_q - σ²I)^{1/2} V^T, with U_q the eigenvectors of N^{-1} Y^T Y.
Solution for Dual PPCA: X = U'_q (Λ_q - σ²I)^{1/2} V^T, with U'_q the eigenvectors of d^{-1} Y Y^T.
The equivalence follows from the singular value decomposition of Y, which relates the eigenvectors of Y^T Y to those of Y Y^T.
24
Gaussian Processes
25
Gaussian Process (GP) Prior over functions.
Functions are infinite dimensional; a distribution over finite instantiations gives finite dimensional objects. One can prove by induction that the GP is ‘consistent’. A GP is defined by a mean function and a covariance function. The mean function is often taken to be zero. The covariance function must be positive definite; the class of valid covariances is the same as the class of Mercer kernels.
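To make "a distribution over finite instantiations" concrete, the sketch below draws sample functions from a zero-mean GP prior with an RBF covariance (the kernel form, parameter values and jitter are illustrative assumptions):

x = linspace(-3, 3, 100)';                      % inputs at which the function is instantiated
gamma = 4;                                      % inverse width of the RBF covariance
K = exp(-(gamma/2) * (x - x').^2);              % covariance matrix k(x, x') = exp(-gamma/2 (x - x')^2)
K = K + 1e-6 * eye(numel(x));                   % jitter for numerical positive definiteness
f = chol(K, 'lower') * randn(numel(x), 3);      % three draws from N(0, K)
plot(x, f);                                     % each column is one sample function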
26
Gaussian Processes A (zero mean) Gaussian process likelihood is of the form p(y | X) = N(y | 0, K), where K is the covariance function or kernel evaluated at the inputs X. The linear kernel has the form k(x_i, x_j) = x_i^T x_j, giving K = X X^T (plus σ²I when observation noise is included).
27
GP Covariance Functions
(Figure: examples of covariance functions: RBF (three variants), linear, MLP (two variants), bias, and a combination.)
28
Dual Probabilistic PCA (revisited)
(Graphical model: X and W generate Y; marginalising W gives a Gaussian process mapping from X to Y.)
29
Dual Probabilistic PCA is a GPLVM
Log-likelihood: L = -(dN/2) ln 2π - (d/2) ln|K| - (1/2) tr(K^{-1} Y Y^T), where for the linear kernel K = X X^T + σ²I.
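A minimal MATLAB sketch of evaluating this log-likelihood for a given X with the linear kernel (dummy data; variable names are illustrative, not the toolbox implementation):

N = 100; d = 10; q = 2;
Y = randn(N, d); Y = Y - mean(Y, 1);            % dummy centred data
X = randn(N, q); sigma2 = 0.1;                  % candidate latent positions and noise variance
K = X * X' + sigma2 * eye(N);                   % linear kernel plus noise
Lc = chol(K, 'lower');                          % Cholesky factor for a stable log-determinant and solves
logdetK = 2 * sum(log(diag(Lc)));
KinvY = Lc' \ (Lc \ Y);                         % K^{-1} Y via two triangular solves
loglik = -d*N/2*log(2*pi) - d/2*logdetK - 0.5*sum(sum(Y .* KinvY));   % tr(K^{-1} Y Y^T) = sum(sum(Y .* K^{-1} Y))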
30
Non-linear Kernel
Instead of the linear kernel function, use, for example, an RBF kernel function.
This leads to non-linear embeddings.
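For instance, an RBF kernel matrix over the latent points could be built as follows (parameter names and values are illustrative assumptions):

X = randn(100, 2);                                            % latent points (dummy values)
alpha = 1; gamma = 1; sigma2 = 0.01;                          % variance, inverse width, noise
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');             % squared distances between latent points
K = alpha * exp(-(gamma/2) * D2) + sigma2 * eye(size(X, 1));  % RBF kernel plus noise, replacing K = X X^T + sigma^2 I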
31
Pros & Cons of GP-LVM
Pros: probabilistic; missing data is straightforward; can sample from the model given X; different noise models can be handled; kernel parameters can be optimised.
Cons: speed of optimisation; optimisation is non-convex (cf. classical MDS and kernel PCA).
32
Benchmark Examples
33
GP-LVM Optimisation: gradient based optimisation with respect to X and the kernel parameters, using scaled conjugate gradients (SCG); a gradient sketch follows the example data set below.
Example data set: oil flow data. Three phases of flow (stratified, annular, homogeneous). Twelve measurement probes. 1000 data points, sub-sampled to 100 data points.
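The gradient sketch referred to above, for the linear kernel case: the chain rule gives ∂L/∂X = 2 (∂L/∂K) X, with ∂L/∂K = (1/2)(K^{-1} Y Y^T K^{-1} - d K^{-1}). The fragment below is only an illustration of that computation (dummy data, plain gradient step), not the toolbox implementation:

N = 100; d = 12; q = 2;
Y = randn(N, d); Y = Y - mean(Y, 1);            % dummy centred data
X = randn(N, q); sigma2 = 0.1;                  % current latent estimate and noise variance
K = X * X' + sigma2 * eye(N);                   % linear kernel plus noise
Kinv = K \ eye(N);                              % explicit inverse for clarity; use Cholesky in practice
G = 0.5 * (Kinv * (Y * Y') * Kinv - d * Kinv);  % dL/dK for L = -d/2 log|K| - 1/2 tr(K^{-1} Y Y^T) + const
dLdX = 2 * G * X;                               % chain rule for K = X X^T + sigma^2 I
X = X + 1e-4 * dLdX;                            % one illustrative gradient ascent step (SCG is used in practice)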
34
(Figure: the oil data subset visualised with PCA, non-metric MDS, metric MDS, GTM, kernel PCA and GP-LVM.)
35
Nearest Neighbour in X Number of errors for each method.
PCA: 20; GP-LVM: 4; non-metric MDS: 13; metric MDS: 6; GTM*: 7; kernel PCA*: –.
* These models required parameter selection.
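The error counts above come from nearest-neighbour classification in the latent space; a minimal sketch of that evaluation (dummy positions and labels, illustrative names):

N = 100; X = randn(N, 2); labels = randi(3, N, 1);   % dummy latent positions and class labels
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');    % squared distances between latent points
D2(1:N+1:end) = inf;                                 % exclude each point as its own neighbour
[~, nn] = min(D2, [], 2);                            % nearest neighbour of each point
errors = sum(labels(nn) ~= labels);                  % number of nearest-neighbour classification errors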
36
Full Oil Data
37
Nearest Neighbour in X Number of errors for each method.
PCA: 162; GTM: 11; GP-LVM: 1. (cf. 2 errors for nearest neighbour in Y.)
38
Applications
39
Applications
Grochow et al.: Style Based Inverse Kinematics.
Urtasun et al.: a prior for tracking.
We’ve been looking at faces.
40
Face Animation Data from Electronic Arts
OpenGL Code by Manuel Sanchez (now at Electronic Arts).
41
Extensions
42
Back Constraints
The GP-LVM gives a smooth mapping from X to Y: points close together in X will be close in Y. It does not imply that points close in Y will be close in X.
Kernel PCA gives a smooth mapping from Y to X: points close together in Y will be close in X. It does not imply that points close in X will be close in Y. This characteristic is more common in visualisation methods.
(Joint work with Joaquin Quiñonero Candela.)
43
Back Constraints
Maximise the likelihood subject to a constraint: each latent point is given by a mapping from data space.
For example the mapping could be a kernel mapping, x_{i,j} = Σ_n a_{j,n} k(y_{i,:}, y_{n,:}).
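A minimal sketch of such a kernel back constraint (the back-constraint kernel, its width and the weights A are illustrative assumptions; in the back-constrained model the weights A are optimised in place of X):

N = 100; d = 10; q = 2;
Y = randn(N, d); A = randn(N, q);                   % dummy data and back-constraint weights
gammaBC = 0.1;                                      % inverse width of the back-constraint kernel
D2 = sum(Y.^2, 2) + sum(Y.^2, 2)' - 2 * (Y * Y');   % squared distances in data space
Kbc = exp(-(gammaBC/2) * D2);                       % RBF kernel between data points
X = Kbc * A;                                        % each latent point is a function of the data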
44
Back Constrained GP-LVM
Gives a mapping in both directions: a GP mapping from X to Y, and a reverse constraining mapping from Y to X (X is constrained to be a function of Y).
45
Motion Capture Data Data from Ohio State University
Motion capture data of a man running: 217 data points in 102 dimensions, down-sampled to 55 data points. Modelled with the pure GP-LVM and the back constrained GP-LVM (BC-GP-LVM).
46
MATLAB Demo demStickResults.m
47
Pure GP-LVM
48
BC-GP-LVM
49
Running Angle (figure panels (a)-(d)).
50
Vocal Joystick Data Vowel data from Jon Malkin & Jeff Bilmes.
Aim is to use vowel data in a vocal joystick. Data consists of MFCCs and deltas. Each data point is one frame (d=24). 2,700 data points (300 from each vowel).
51
PCA - Vocal Joystick
Colour key: /a/ red cross, /ae/ green circle, /ao/ blue plus, /e/ cyan asterisk, /i/ magenta square, /ibar/ yellow diamond, /o/ red down triangle, /schwa/ green up triangle, /u/ blue left triangle.
52
Pure GP-LVM (colour key as in the PCA plot).
53
Back Constrained GP-LVM
(Colour key as in the PCA plot.)
54
Isomap (7 neighbours); colour key as in the PCA plot.
55
Nearest Neighbour in X Number of errors for each method.
PCA: 1613; pure GP-LVM: 226; BC-GP-LVM: 155; Isomap: 458. (cf. 24 errors in the data space.)
56
Motion Capture with Back Constraints
MATLAB demo: example in motion capture with RBF back constraints. In the OXFORD toolbox: demStickResults.
57
Linear Back Constraints
X = YB. Learn the projection matrix B ∈ ℝ^{d×q}.
As motivation consider PCA on a digit data set.
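A minimal sketch of the linear back constraint (dummy values; in practice B would be optimised by gradient ascent on the GP-LVM likelihood, with gradients chained through X = YB):

N = 100; d = 10; q = 2;
Y = randn(N, d); Y = Y - mean(Y, 1);      % dummy centred data
B = randn(d, q);                          % projection matrix to be learnt
X = Y * B;                                % latent positions are a linear projection of the data
% Chain rule for the gradient: dL/dB = Y' * dL/dX.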
58
Reconstruction with GP
59
Linear Projection with GP-LVM
(Figure panels: PCA projection; linearly constrained GP-LVM projection.)
60
Linear constrained GP-LVM
Nearest Neighbour in X (number of errors):
latent dim:                   2    3    4
PCA:                        131  115   47
Linear constrained GP-LVM:   79   60   39
(cf. 24 errors for nearest neighbour in Y.)
61
Dynamics
62
Adding Dynamics Data often has a temporal ordering.
Markov-based dynamics are often used. For the GP-LVM, marginalising such dynamics is intractable, but MAP solutions are trivial to implement. Many choices: Kalman filter, Markov chains, etc. Recent work by Wang, Fleet and Hertzmann uses a GP.
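As a sketch of the MAP idea, one simple choice is a first-order Gaussian (random walk) prior over the latent trajectory; this is only an illustration of adding a dynamics term to the objective, not the GP dynamics of Wang et al.:

T = 55; q = 2;
X = randn(T, q); sigmad2 = 0.01;                 % dummy latent trajectory in time order, dynamics noise variance
diffs = X(2:end, :) - X(1:end-1, :);             % first-order differences between consecutive time points
logdyn = -0.5/sigmad2 * sum(sum(diffs.^2)) ...
         - (T-1)*q/2 * log(2*pi*sigmad2);        % log of a Gaussian random-walk prior on X
% MAP training maximises loglik(X) + logdyn with respect to X.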
63
GP-LVM with Dynamics: a Gaussian process mapping in latent space between time points (time point 1 to time point 2).
64
Dynamics with Stick Man
demStickResults.m
65
Stick Man with Dynamics
66
Robot SLAM Joint work with Brian Ferris and Dieter Fox
The data consist of 215 signal strength readings from 30 wireless access points. We expect the data to be inherently 2-D. Ideally, loop closure should occur.
67
Robot Localisation (figure panels: pure GP-LVM, BC-GP-LVM, GP-LVM + dynamics, BC-GP-LVM).
68
Ongoing Work Improving quality of learning in large data sets.
Based on work by Snelson & Ghahramani. Already used in the presented results and available online.
69
Conclusions
70
Conclusion Probabilistic non-linear interpretation of PCA.
A probabilistic model for high dimensional data. Back constraints can be introduced to improve visualisation and to seek better linear projections.