Non-Linear Probabilistic PCA with Gaussian Process Latent Variable Models Neil Lawrence Machine Learning Group Department of Computer Science University of Sheffield, U.K.
Overview Motivation (high dimensional data; smooth low dimensional embedded spaces). Mathematical Foundations (probabilistic PCA). A Sparse Algorithm. Some Results. Inverse Kinematics and Animation.
Motivation
High Dimensional Data Handwritten digit: 3648 dimensions. Space contains more than just this digit.
Handwritten Digit A simple model of the digit – rotate the ‘prototype’. [Figure: the prototype digit shown at several rotations.]
Projection onto Principal Components
Discontinuities
Low Dimensional Manifolds Pure rotation of a prototype is too simple. In practice the data may go through several distortions, e.g. digits undergo thinning, translation and rotation. For data with ‘structure’: we expect fewer distortions than dimensions; we therefore expect the data to live on a lower dimensional manifold. Deal with high dimensional data by looking for a lower dimensional non-linear embedding.
Our Options Spectral approaches: Classical Multidimensional Scaling (MDS) uses eigenvectors of a similarity matrix; LLE and Isomap are MDS with particular proximity measures. Kernel PCA provides an embedding and a mapping from the high dimensional space to the embedding; the mapping is implied through the use of a kernel function as the similarity matrix. Non-spectral approaches: non-metric MDS and Sammon mappings use iterative optimisation of a stress function; a mapping can be forced (e.g. Neuroscale).
Our Options Probabilistic approaches: Probabilistic PCA, a linear method. Density Networks use importance sampling and a multi-layer perceptron. GTM uses a grid based sample and an RBF network. The difficulty for probabilistic approaches: propagating a distribution through a non-linear mapping.
The New Model PCA has a probabilistic interpretation. It is difficult to ‘non-linearise’. We present a new probabilistic interpretation of PCA. This can be made non-linear. The result is non-linear probabilistic PCA.
Mathematical Foundations
Notation q – dimension of latent/embedded space. d – dimension of data space. N – number of data points. Centred data, $\mathbf{Y} \in \Re^{N \times d}$. Latent variables, $\mathbf{X} \in \Re^{N \times q}$. Mapping matrix, $\mathbf{W} \in \Re^{d \times q}$. $\mathbf{a}_{(i)}$ is the vector from the i-th row of $\mathbf{A}$; $\mathbf{a}_{i}$ is the vector from the i-th column of $\mathbf{A}$.
Reading Notation X and Y are design matrices. Covariance given by $N^{-1}\mathbf{Y}^{\top}\mathbf{Y}$. Inner product matrix given by $\mathbf{Y}\mathbf{Y}^{\top}$.
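As a concrete illustration of the two readings of a design matrix, here is a minimal NumPy sketch (the shapes and variable names are illustrative, not from the talk):

```python
import numpy as np

N, d = 100, 12                    # number of data points, data dimension
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d))   # design matrix: one data point per row
Y -= Y.mean(axis=0)               # centre the data

S = Y.T @ Y / N                   # d x d covariance matrix, N^{-1} Y^T Y
G = Y @ Y.T                       # N x N inner product matrix, Y Y^T
print(S.shape, G.shape)           # (12, 12) (100, 100)
```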
Linear Embeddings Represent data, Y, with a lower dimensional embedding X. Assume a linear relationship of the form $\mathbf{y}_{(n)} = \mathbf{W}\mathbf{x}_{(n)} + \boldsymbol{\eta}_{(n)}$, with Gaussian noise $\boldsymbol{\eta}_{(n)} \sim N(\mathbf{0}, \sigma^{2}\mathbf{I})$. Probabilistically we implement this as $p(\mathbf{y}_{(n)}\,|\,\mathbf{x}_{(n)}, \mathbf{W}) = N(\mathbf{y}_{(n)}\,|\,\mathbf{W}\mathbf{x}_{(n)}, \sigma^{2}\mathbf{I})$.
PCA – Probabilistic Interpretation [Graphical model: latent points X map through W to the observed data Y; X is marginalised and W is optimised.]
Maximum Likelihood Solution If $\mathbf{U}_q$ are the first q eigenvectors of $N^{-1}\mathbf{Y}^{\top}\mathbf{Y}$ and the corresponding eigenvalues are $\boldsymbol{\Lambda}_q$, then $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^{\top}$, with $\mathbf{L} = (\boldsymbol{\Lambda}_q - \sigma^{2}\mathbf{I})^{1/2}$, where V is an arbitrary rotation matrix.
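A minimal NumPy sketch of this maximum likelihood solution (with V taken as the identity and σ² set to the mean of the discarded eigenvalues, as in standard probabilistic PCA; the function name is mine):

```python
import numpy as np

def ppca_ml(Y, q):
    """PPCA maximum likelihood: W = U_q (Lambda_q - sigma^2 I)^{1/2} V^T, with V = I."""
    N, d = Y.shape
    S = Y.T @ Y / N                                # sample covariance N^{-1} Y^T Y
    eigval, eigvec = np.linalg.eigh(S)             # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    sigma2 = eigval[q:].mean()                     # noise variance: mean discarded eigenvalue
    W = eigvec[:, :q] * np.sqrt(np.maximum(eigval[:q] - sigma2, 0.0))
    return W, sigma2
```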
Dual Probabilistic PCA [Graphical model: the roles are swapped – the mapping W is marginalised and the latent points X are optimised.]
Maximum Likelihood Solution If $\mathbf{U}_q'$ are the first q eigenvectors of $d^{-1}\mathbf{Y}\mathbf{Y}^{\top}$ and the corresponding eigenvalues are $\boldsymbol{\Lambda}_q$, then $\mathbf{X} = \mathbf{U}_q'\mathbf{L}\mathbf{V}^{\top}$, with $\mathbf{L} = (\boldsymbol{\Lambda}_q - \sigma^{2}\mathbf{I})^{1/2}$, where V is an arbitrary rotation matrix.
Equivalence of PPCA Formulations Solution for PPCA: $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^{\top}$. Solution for Dual PPCA: $\mathbf{X} = \mathbf{U}_q'\mathbf{L}\mathbf{V}^{\top}$. Equivalence follows from the standard relationship between the eigenvectors of $\mathbf{Y}\mathbf{Y}^{\top}$ and $\mathbf{Y}^{\top}\mathbf{Y}$: if $\mathbf{u}'$ is an eigenvector of $\mathbf{Y}\mathbf{Y}^{\top}$ with eigenvalue $\mu$, then $\mu^{-1/2}\mathbf{Y}^{\top}\mathbf{u}'$ is the corresponding unit eigenvector of $\mathbf{Y}^{\top}\mathbf{Y}$.
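A small numerical check of this eigenvector relationship (a sketch on random data; the comparison is only up to sign):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, q = 50, 12, 2
Y = rng.standard_normal((N, d))
Y -= Y.mean(axis=0)

_, U_p = np.linalg.eigh(Y.T @ Y / N)         # PPCA side:      d x d eigenproblem
_, U_d = np.linalg.eigh(Y @ Y.T / d)         # dual PPCA side: N x N eigenproblem
U_p = U_p[:, ::-1][:, :q]                    # top q eigenvectors of N^{-1} Y^T Y
U_d = U_d[:, ::-1][:, :q]                    # top q eigenvectors of d^{-1} Y Y^T

# Map the dual eigenvectors through Y^T and normalise; they should match the
# PPCA eigenvectors up to sign.
U_from_dual = Y.T @ U_d
U_from_dual /= np.linalg.norm(U_from_dual, axis=0)
print(np.allclose(np.abs(U_from_dual.T @ U_p), np.eye(q), atol=1e-6))   # True
```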
Gaussian Processes A Gaussian Process (GP) likelihood is of the form $p(\mathbf{y}\,|\,\mathbf{X}) = \frac{1}{(2\pi)^{N/2}|\mathbf{K}|^{1/2}}\exp\left(-\tfrac{1}{2}\mathbf{y}^{\top}\mathbf{K}^{-1}\mathbf{y}\right)$, where K is the covariance function or kernel. If we select the linear kernel $\mathbf{K} = \mathbf{X}\mathbf{X}^{\top} + \sigma^{2}\mathbf{I}$, we see Dual PPCA is a product of GPs (one for each dimension of the data).
Dual Probabilistic PCA is a GPLVM Log-likelihood: $L = -\frac{dN}{2}\ln 2\pi - \frac{d}{2}\ln|\mathbf{K}| - \frac{1}{2}\mathrm{tr}\left(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^{\top}\right)$.
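A sketch of this log-likelihood in NumPy (the function name and the Cholesky-based evaluation are my choices, not from the talk):

```python
import numpy as np

def gplvm_log_likelihood(K, Y):
    """L = -(d N / 2) ln 2 pi - (d / 2) ln|K| - 0.5 tr(K^{-1} Y Y^T)."""
    N, d = Y.shape
    L = np.linalg.cholesky(K)                     # K = L L^T
    A = np.linalg.solve(L, Y)                     # so tr(K^{-1} Y Y^T) = ||L^{-1} Y||_F^2
    logdet_K = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * d * N * np.log(2 * np.pi) - 0.5 * d * logdet_K - 0.5 * np.sum(A ** 2)
```

With the linear kernel $\mathbf{K} = \mathbf{X}\mathbf{X}^{\top} + \sigma^{2}\mathbf{I}$ this is exactly the dual PPCA likelihood; the kernel is the only thing that changes when we go non-linear.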
Non-linear Kernel Instead of the linear kernel function, use, for example, an RBF kernel function. This leads to non-linear embeddings.
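For example, a sketch of an RBF covariance on the latent points (the parameter names alpha, gamma, beta and their defaults are assumptions, not values from the talk):

```python
import numpy as np

def rbf_kernel(X, alpha=1.0, gamma=1.0, beta=100.0):
    """K_ij = alpha * exp(-gamma/2 * ||x_i - x_j||^2) + delta_ij / beta (white noise term)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * sq_dists) + np.eye(X.shape[0]) / beta
```

Plugging this K into the log-likelihood above gives the non-linear GPLVM objective.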
Pros & Cons of GPLVM Pros: probabilistic; missing data straightforward; can sample from the model given X; different noise models can be handled; kernel parameters can be optimised. Cons: speed of optimisation; optimisation is non-convex (cf. classical MDS, kernel PCA).
GPLVM Optimisation Gradient based optimisation with respect to X and the kernel hyperparameters, using scaled conjugate gradients (SCG). Example data-set: oil flow data. Three phases of flow (stratified, annular, homogeneous). Twelve measurement probes. 1000 data-points; we sub-sampled to 100 data-points.
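A rough sketch of the optimisation (reusing `rbf_kernel` and `gplvm_log_likelihood` from the sketches above; scipy's conjugate gradient routine with numerical gradients stands in for SCG, the hyperparameters are held fixed, and PCA initialises the latent points – all simplifications of what the talk describes):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gplvm_latents(Y, q=2, alpha=1.0, gamma=1.0, beta=100.0):
    """Optimise the latent points X of an RBF-kernel GPLVM (hyperparameters fixed)."""
    N, _ = Y.shape
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    X0 = Y @ Vt[:q].T                                    # PCA initialisation

    def neg_log_likelihood(x_flat):
        X = x_flat.reshape(N, q)
        K = rbf_kernel(X, alpha, gamma, beta)
        return -gplvm_log_likelihood(K, Y)

    res = minimize(neg_log_likelihood, X0.ravel(), method="CG")  # finite-difference gradients
    return res.x.reshape(N, q)
```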
SCG GPLVM Oil Results 2-D Manifold in 12-D space (shading is precision).
A More Efficient Algorithm
Efficient GPLVM Optimisation Optimising the q×N matrix X is slow. There are correlations between data-points.
‘Sparsification’ Let $\mathbf{X}_I$ be a sub-set of X, indexed by an active set I. For a well chosen active set, |I| ≪ N. For $n \notin I$ we can optimise the q-dimensional $\mathbf{x}_{(n)}$ independently.
Algorithm We selected the active set according to the IVM scheme. Select the active set. Optimise the kernel hyperparameters. For all $n \notin I$, optimise $\mathbf{x}_{(n)}$. For small data-sets, optimise $\mathbf{X}_I$. Repeat.
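In outline, a sketch of one sweep of this algorithm (a simplification: the active set is chosen at random here rather than by the IVM's information-based criterion, the kernel hyperparameters are held fixed, and `rbf_cross` is an illustrative helper):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_cross(X1, X2, alpha=1.0, gamma=1.0):
    """Cross-covariance between two sets of latent points (no noise term)."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * sq)

def sparse_gplvm_sweep(X, Y, active_size=20, beta=100.0):
    """One sweep: pick an active set I, then optimise each inactive x_n independently,
       using the GP posterior given (X_I, Y_I) as its likelihood."""
    N, d = Y.shape
    I = np.random.choice(N, active_size, replace=False)       # placeholder for IVM selection
    K_I = rbf_cross(X[I], X[I]) + np.eye(active_size) / beta
    K_I_inv = np.linalg.inv(K_I)
    # (Kernel hyperparameters and X_I would be optimised here; omitted in this sketch.)
    for n in np.setdiff1d(np.arange(N), I):
        def nll(x_n):
            k_n = rbf_cross(X[I], x_n[None, :])[:, 0]
            mean = k_n @ K_I_inv @ Y[I]                        # GP posterior mean at x_n
            k_nn = rbf_cross(x_n[None, :], x_n[None, :])[0, 0]
            var = max(k_nn + 1.0 / beta - k_n @ K_I_inv @ k_n, 1e-10)
            r = Y[n] - mean
            return 0.5 * (d * np.log(2 * np.pi * var) + r @ r / var)
        X[n] = minimize(nll, X[n], method="Nelder-Mead").x
    return X
```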
Some Results
Some Results As well as the RBF kernel we will use an ‘MLP kernel’. Revisit the oil data, with the full training set this time. Used the RBF kernel for the GPLVM. Compare with GTM.
Oil Data GTM GPLVM (RBF)
Different Kernels RBF Kernel MLP Kernel
Classification in Latent Space Classify flow regimes in the latent space.
Model      | Test Error
GPLVM RBF  | 4.3 %
GPLVM MLP  | 3.0 %
GTM        | 2.0 %
PCA        | 14 %
Swiss Roll – an Aside on Initialisation
Local Minima PCA Initialised Isomap Initialised
Digits Data Digits 0 to 4 from the USPS data. 600 of each digit randomly selected. 16x16 greyscale images.
Digits RBF Kernel MLP Kernel 0 – red, 1 – green, 2 – blue, 3 – cyan, 4 – magenta.
PCA and GTM for Digits GTM PCA
Digits Classifications
Model      | Test Error
GPLVM RBF  | 5.9 %
GPLVM MLP  | 5.8 %
GTM        | 3.7 %
PCA        | 29 %
Twos Data So far – Gaussian noise. Can instead use a Bernoulli likelihood, with an ADF approximation (can easily extend to EP). Practical consequences: about d times slower, and needs d times more storage. Twos data – 8x8 binary images, modelled with both the Gaussian noise model and the Bernoulli noise model.
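For intuition only, a sketch of the Bernoulli noise model itself (a logistic sigmoid link is used here purely for illustration; the ADF/EP approximation that makes this tractable inside the GPLVM is not shown):

```python
import numpy as np

def bernoulli_log_likelihood(F, Y_binary):
    """Bernoulli noise model: p(y_ij = 1 | f_ij) = sigmoid(f_ij), with F the latent
       function values and Y_binary a 0/1 matrix of pixels."""
    P = 1.0 / (1.0 + np.exp(-F))          # squash the latent function through a sigmoid
    eps = 1e-12                           # guard against log(0)
    return np.sum(Y_binary * np.log(P + eps) + (1.0 - Y_binary) * np.log(1.0 - P + eps))
```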
Twos Data Cedar CD-ROM digits. 700 examples of 8x8 twos. Binary images.
Twos Results Gaussian Noise Model Bernoulli Noise Model
Reconstruction Experiment
Reconstruction Method           | Pixel Error Rate
GPLVM, Bernoulli noise          | 23.5 %
GPLVM, Gaussian noise           | 35.9 %
Missing pixels set to ‘not ink’ | 51.5 %
Horse Colic Data Ordinal data. Many missing values.
Horse Colic Data Latent spaces for Linear, MLP and RBF kernels (death – green, survival – red, put down – blue).
Classifying Colic Outcome Increasing the dimension of the latent space gives a corresponding decrease in outcome prediction error. Repeating the experiments for different train/test partitions shows a similar trend for each partition.
Inverse Kinematics and Animation
Inverse Kinematics Style-Based Inverse Kinematics, Keith Grochow, Steve L. Martin, Aaron Hertzmann, Zoran Popović. ACM Trans. on Graphics (Proc. SIGGRAPH 2004). Learn a GPLVM on motion capture data. Use the GPLVM as a ‘soft style constraint’ in combination with hard kinematic constraints.
Video styleik.mov
Why GPLVM in IK? The GPLVM is probabilistic (soft constraints). The GPLVM can capture non-linearities in the data. Inverse kinematics can be viewed as a missing value problem, and the GPLVM handles missing values well.
Face Animation Data from EA; OpenGL code by Manuel Sanchez (at Sheffield).
KL Divergence Objective Function
Kernel PCA, MDS and GPLVMs Maximum likelihood ≡ minimum Kullback-Leibler divergence. PCA minimises the KL divergence between two Gaussians, one with covariance $\mathbf{K}_y$ built from the data and one with covariance $\mathbf{K}_x$ built from the latent points: $\mathrm{KL}\left(N(\mathbf{0},\mathbf{K}_y)\,\|\,N(\mathbf{0},\mathbf{K}_x)\right) = \tfrac{1}{2}\ln|\mathbf{K}_x| - \tfrac{1}{2}\ln|\mathbf{K}_y| + \tfrac{1}{2}\mathrm{tr}\left(\mathbf{K}_x^{-1}\mathbf{K}_y\right) - \tfrac{N}{2}$. Making $\mathbf{K}_y$ non-linear gives kernel PCA; making $\mathbf{K}_x$ non-linear gives the GPLVM.
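A numerical illustration of the correspondence (a sketch: $\mathbf{K}_y$ is taken to be the scaled inner product matrix $d^{-1}\mathbf{Y}\mathbf{Y}^{\top}$ with a small jitter; the point is that the negative log-likelihood per data dimension differs from the KL divergence only by a term that does not depend on $\mathbf{K}_x$):

```python
import numpy as np

def kl_zero_mean_gaussians(K_y, K_x):
    """KL( N(0, K_y) || N(0, K_x) ) = 0.5 [ ln|K_x| - ln|K_y| + tr(K_x^{-1} K_y) - N ]."""
    N = K_y.shape[0]
    logdet_x = np.linalg.slogdet(K_x)[1]
    logdet_y = np.linalg.slogdet(K_y)[1]
    return 0.5 * (logdet_x - logdet_y + np.trace(np.linalg.solve(K_x, K_y)) - N)

rng = np.random.default_rng(2)
N, d = 30, 5
Y = rng.standard_normal((N, d)); Y -= Y.mean(axis=0)
K_y = Y @ Y.T / d + 1e-6 * np.eye(N)                 # similarity of the data (with jitter)
for sigma2 in (0.5, 1.0, 2.0):                       # a few different model covariances K_x
    K_x = Y[:, :2] @ Y[:, :2].T + sigma2 * np.eye(N)
    neg_ll_per_dim = 0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(K_x)[1]
                            + np.trace(np.linalg.solve(K_x, Y @ Y.T)) / d)
    print(neg_ll_per_dim - kl_zero_mean_gaussians(K_y, K_x))   # (nearly) the same each time
```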
KL Divergence as an Objective Points towards what it means to use a non-positive definite similarity measure in metric MDS. Should provide an approach to dealing with missing data in Kernel PCA. Usable for kernel parameter optimisation in KPCA? Unifies GPLVM, PCA, Kernel PCA, metric MDS in one framework.
Interpretation Kernels are similarity measures. Express correlations between points. GPLVM and KPCA try to match them. cf. MDS and principal co-ordinate analysis.
Conclusions
Conclusions The GPLVM is a probabilistic non-linear PCA: we can sample from it, evaluate likelihoods, handle non-continuous data, and missing data is no problem. Optimisation of X is the difficult part; we presented a sparse optimisation algorithm. The model has been ‘proven’ in a real application. Put your source code on-line!