1
Non-Linear Probabilistic PCA with Gaussian Process Latent Variable Models
Neil Lawrence Machine Learning Group Department of Computer Science University of Sheffield, U.K.
2
Overview Motivation Mathematical Foundations A Sparse Algorithm
High dimensional data. Smooth low dimensional embedded spaces. Mathematical Foundations Probabilistic PCA A Sparse Algorithm Some Results Inverse Kinematics and Animation
3
Motivation
4
High Dimensional Data Handwritten digit:
3648 dimensions. Space contains more than just this digit.
5
Handwritten Digit A simple model of the digit – rotate the ‘prototype’. [Figure: rotated versions of the prototype digit.]
6
Projection onto Principal Components
7
Discontinuities
8
Low Dimensional Manifolds
Pure rotation of a prototype is too simple. In practice the data may go through several distortions, e.g. digits undergo thinning, translation and rotation. For data with ‘structure’: we expect fewer distortions than dimensions; we therefore expect the data to live in a lower-dimensional manifold. Deal with high-dimensional data by looking for a lower-dimensional non-linear embedding.
9
Our Options Spectral Approaches Non-spectral approaches
Classical Multidimensional Scaling (MDS) Uses eigenvectors of similarity matrix. LLE and Isomap are MDS with particular proximity measures. Kernel PCA Provides an embedding and a mapping from the high dimensional space to the embedding. The mapping is implied through the use of a kernel function as the similarity matrix. Non-spectral approaches Non-metric MDS and Sammon Mappings Iterative optimisation of a stress function. A mapping can be forced (e.g. Neuroscale).
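As a rough illustration of the spectral recipe described above, here is a minimal classical MDS sketch in numpy; it assumes a centred Gram (inner-product) matrix as the similarity measure, and LLE or Isomap would simply substitute their own proximity matrices.

```python
import numpy as np

def classical_mds(G, q=2):
    """Embed points from a centred Gram (similarity) matrix G (N x N)
    into q dimensions using its top eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(G)          # ascending order for symmetric G
    idx = np.argsort(eigvals)[::-1][:q]           # indices of the largest q eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None)        # guard against tiny negative eigenvalues
    return eigvecs[:, idx] * np.sqrt(lam)         # N x q embedding

# toy usage: the Gram matrix of centred data recovers PCA-like coordinates
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 12))
Y -= Y.mean(axis=0)
X = classical_mds(Y @ Y.T, q=2)
```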
10
Our Options Probabilistic Approaches
Probabilistic PCA A linear method. Density Networks Use importance sampling and a multi-layer perceptron. GTM Uses a grid based sample and an RBF network. The difficulty for probabilistic approaches: propagating a distribution through a non-linear mapping.
11
The New Model PCA has a probabilistic interpretation.
It is difficult to ‘non-linearise’. We present a new probabilistic interpretation of PCA. This can be made non-linear. The result is non-linear probabilistic PCA.
12
Mathematical Foundations
13
Notation q – dimension of latent/embedded space.
d – dimension of data space. N – number of data points. Y – centred data, X – latent variables, W – mapping matrix, W ∈ R^{d×q}. a_(i) is the vector formed from the i-th row of A; a_i is the vector formed from the i-th column of A.
14
Reading Notation X and Y are design matrices.
Covariance given by N^{-1} Y^T Y. Inner-product matrix given by Y Y^T.
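A small numpy illustration of the two matrices just named, using random data purely to check shapes:

```python
import numpy as np

N, d = 100, 12                      # number of points, data dimension
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d))
Y -= Y.mean(axis=0)                 # centred design matrix

cov = (Y.T @ Y) / N                 # d x d covariance matrix, N^{-1} Y^T Y
inner = Y @ Y.T                     # N x N inner-product matrix, Y Y^T
print(cov.shape, inner.shape)       # (12, 12) (100, 100)
```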
15
Linear Embeddings Represent the data, Y, with a lower-dimensional embedding X. Assume a linear relationship of the form y_n = W x_n + η_n, with Gaussian noise η_n ~ N(0, σ²I). Probabilistically we implement this as p(y_n | x_n, W) = N(y_n | W x_n, σ²I).
16
PCA – Probabilistic Interpretation
[Graphical model: latent variables X and mapping W generate the data Y; marginalising X leaves a model over W and Y.]
17
Maximum Likelihood Solution
If U_q are the first q eigenvectors of N^{-1} Y^T Y and the corresponding eigenvalues are Λ_q, then W = U_q (Λ_q - σ²I)^{1/2} V, where V is an arbitrary rotation matrix.
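A hedged numpy sketch of this maximum-likelihood solution; the noise variance σ² is taken as the mean of the discarded eigenvalues and the arbitrary rotation V as the identity, which are conventional choices rather than details stated on the slide.

```python
import numpy as np

def ppca_ml(Y, q):
    """Maximum-likelihood PPCA: W from the eigenvectors of the covariance N^{-1} Y^T Y."""
    N, d = Y.shape
    S = (Y.T @ Y) / N                                # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)             # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()                      # noise = mean of discarded eigenvalues
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, sigma2                                 # V taken as the identity rotation
```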
18
PCA – Probabilistic Interpretation
[Graphical model: latent variables X and mapping W generate the data Y; marginalising X leaves a model over W and Y.]
19
Dual Probabilistic PCA
[Graphical model: latent variables X and mapping W generate the data Y; marginalising W instead leaves a model over X and Y.]
20
Maximum Likelihood Solution
If U_q are the first q eigenvectors of d^{-1} Y Y^T and the corresponding eigenvalues are Λ_q, then X = U_q (Λ_q - σ²I)^{1/2} V, where V is an arbitrary rotation matrix.
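The dual solution in the same sketch form, now eigendecomposing the N×N matrix d^{-1} Y Y^T; again σ² and V are chosen conventionally rather than taken from the slide.

```python
import numpy as np

def dual_ppca_ml(Y, q):
    """Maximum-likelihood dual PPCA: X from the eigenvectors of d^{-1} Y Y^T."""
    N, d = Y.shape
    K = (Y @ Y.T) / d                                # N x N inner-product matrix / d
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()
    X = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return X, sigma2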
21
Maximum Likelihood Solution
If U_q are the first q eigenvectors of N^{-1} Y^T Y and the corresponding eigenvalues are Λ_q, then W = U_q (Λ_q - σ²I)^{1/2} V, where V is an arbitrary rotation matrix.
22
Equivalence of PPCA Formulations
Solution for PPCA: W = U_q (Λ_q - σ²I)^{1/2} V. Solution for dual PPCA: X = U'_q (Λ'_q - σ²I)^{1/2} V. The equivalence follows from the relationship between the eigenvectors of N^{-1} Y^T Y and d^{-1} Y Y^T, both of which come from the singular value decomposition of Y.
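A quick numerical check of this equivalence, phrased as a subspace comparison so that the arbitrary rotation V and the differing N and d scalings drop out; it reuses the ppca_ml and dual_ppca_ml sketches above.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((100, 12))
Y -= Y.mean(axis=0)

W, _ = ppca_ml(Y, q=2)               # from the PPCA sketch above
X, _ = dual_ppca_ml(Y, q=2)          # from the dual PPCA sketch above

proj = Y @ W                          # PPCA projections of the data (N x q)
Qa, _ = np.linalg.qr(proj)            # orthonormal basis of the PPCA projection subspace
Qb, _ = np.linalg.qr(X)               # orthonormal basis of the dual PPCA latent subspace
# cosines of the principal angles between the two subspaces should all be ~1
print(np.linalg.svd(Qa.T @ Qb, compute_uv=False))
```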
23
Gaussian Processes A Gaussian Process (GP) likelihood is of the form p(y | X) = N(y | 0, K),
where K is the covariance function or kernel evaluated at the inputs X. If we select the linear kernel, K = X X^T + σ²I, we see that dual PPCA is a product of GPs, one per data dimension.
24
Dual Probabilistic PCA is a GPLVM
Log-likelihood: L = -(dN/2) ln 2π - (d/2) ln|K| - (1/2) tr(K^{-1} Y Y^T), where K is the kernel matrix evaluated at the latent points X.
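A direct numpy transcription of this log-likelihood, written as a sketch that accepts any kernel matrix K built from X and uses a Cholesky factorisation for the determinant and trace terms.

```python
import numpy as np

def gplvm_log_likelihood(K, Y):
    """L = -dN/2 log(2 pi) - d/2 log|K| - 1/2 tr(K^{-1} Y Y^T)."""
    N, d = Y.shape
    L = np.linalg.cholesky(K)                        # K = L L^T
    logdetK = 2.0 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(L, Y)                    # L^{-1} Y
    trace_term = np.sum(alpha ** 2)                  # tr(K^{-1} Y Y^T)
    return -0.5 * d * N * np.log(2 * np.pi) - 0.5 * d * logdetK - 0.5 * trace_term
```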
25
Non-linear Kernel Instead of the linear kernel function,
use, for example, an RBF kernel function. This leads to non-linear embeddings.
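A sketch of one common RBF parameterisation with an additive white-noise term; the parameter names here (alpha, gamma, beta_inv) are illustrative and need not match those used in the original software.

```python
import numpy as np

def rbf_kernel(X, alpha=1.0, gamma=1.0, beta_inv=0.01):
    """RBF covariance on latent points X (N x q) plus additive white noise."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared Euclidean distances
    K = alpha * np.exp(-0.5 * gamma * dist2)
    return K + beta_inv * np.eye(X.shape[0])              # noise keeps K well conditioned
```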
26
Pros & Cons of GPLVM
Pros: Probabilistic. Missing data straightforward. Can sample from the model given X. Different noise models can be handled. Kernel parameters can be optimised.
Cons: Speed of optimisation. Optimisation is non-convex (cf. classical MDS, kernel PCA).
27
GPLVM Optimisation Gradient-based optimisation w.r.t. X and the kernel hyperparameters (SCG).
Example data-set: oil flow data. Three phases of flow (stratified, annular, homogeneous). Twelve measurement probes. 1000 data-points; we sub-sampled to 100 data-points.
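A sketch of the gradient-based optimisation using the helper functions from the earlier sketches; scipy provides no SCG routine, so L-BFGS with numerically approximated gradients stands in for it, and the kernel hyperparameters are held fixed for brevity (the original work optimises them too, with analytic gradients).

```python
import numpy as np
from scipy.optimize import minimize

def fit_gplvm(Y, q=2, n_iter=50):
    """Optimise the latent positions X by maximising the GPLVM log-likelihood."""
    N, d = Y.shape
    X0, _ = dual_ppca_ml(Y, q)                         # PCA-style initialisation

    def neg_log_lik(x_flat):
        X = x_flat.reshape(N, q)
        K = rbf_kernel(X)                              # kernel hyperparameters held fixed
        return -gplvm_log_likelihood(K, Y)

    res = minimize(neg_log_lik, X0.ravel(), method="L-BFGS-B",
                   options={"maxiter": n_iter})
    return res.x.reshape(N, q)
```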
28
SCG GPLVM Oil Results 2-D Manifold in 12-D space (shading is precision).
29
A More Efficient Algorithm
30
Efficient GPLVM Optimisation
Optimising the q×N matrix X is slow. There are correlations between data-points.
31
‘Sparsification’ If X_I is a sub-set of X (the active set),
then for a well-chosen active set I with |I| << N, and for each n ∉ I, we can optimise the q-dimensional x_(n) independently.
32
Algorithm We selected the active set according to the IVM scheme:
Select the active set. Optimise the kernel hyperparameters. For all n ∉ I, optimise x_n. For small data-sets, optimise X_I. Repeat.
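A hedged sketch of one pass of this loop; a random active set stands in for the IVM selection, and each point outside the set is optimised against the GP predictive distribution conditioned on the active set only (it reuses the rbf_kernel helper above).

```python
import numpy as np
from scipy.optimize import minimize

def sparse_gplvm_step(Y, X, active_size=20, rng=None):
    """One pass of the active-set scheme (random selection stands in for IVM)."""
    rng = rng or np.random.default_rng(0)
    N, d = Y.shape
    I = rng.choice(N, size=active_size, replace=False)        # active set indices
    XI, YI = X[I], Y[I]
    KI = rbf_kernel(XI)                                        # active-set covariance
    KI_inv_Y = np.linalg.solve(KI, YI)

    def neg_pred_lik(xn, yn):
        # negative predictive log-likelihood of y_n given the active set, as a function of x_n
        kn = rbf_kernel(np.vstack([XI, xn[None, :]]))[-1, :-1]  # cross-covariances with active set
        mean = kn @ KI_inv_Y
        var = max(rbf_kernel(xn[None, :])[0, 0] - kn @ np.linalg.solve(KI, kn), 1e-6)
        return 0.5 * np.sum((yn - mean) ** 2) / var + 0.5 * d * np.log(var)

    for n in range(N):
        if n in I:
            continue                                           # active points stay fixed
        X[n] = minimize(neg_pred_lik, X[n], args=(Y[n],), method="L-BFGS-B").x
    return X
```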
33
Some Results
34
Some Results As well as the RBF kernel we will use the ‘MLP kernel’.
Revisit the oil data, full training set this time. Used the RBF kernel for the GPLVM; compared with GTM.
35
Oil Data GTM GPLVM (RBF)
36
Different Kernels RBF Kernel MLP Kernel
37
Classification in Latent Space
Classify flow regimes in latent space. Test error by model: GPLVM (RBF) 4.3%; GPLVM (MLP) 3.0%; GTM 2.0%; PCA 14%.
38
Swiss Roll - Initialisation Aside
39
Local Minima PCA Initialised Isomap Initialised
40
Digits Data Digits 0 to 4 from the USPS data.
600 of each digit randomly selected. 16×16 greyscale images.
41
Digits RBF Kernel MLP Kernel
0 – red, 1 – green, 2 – blue, 3 – cyan, 4 – magenta.
42
PCA and GTM for Digits GTM PCA
43
Digits Classifications
Test error by model: GPLVM (RBF) 5.9%; GPLVM (MLP) 5.8%; GTM 3.7%; PCA 29%.
44
Twos Data So far we have used Gaussian noise. We can use a Bernoulli likelihood instead,
with an ADF approximation (easily extended to EP). Practical consequences: about d times slower, and d times more storage needed. Twos data: 8×8 binary images modelled with both a Gaussian noise model and a Bernoulli noise model.
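The full Bernoulli model requires the ADF/EP machinery; purely to illustrate the noise model itself, here is the Bernoulli log-likelihood of binary pixels given latent function values pushed through a sigmoid link (the link choice and variable names are assumptions, not the paper's notation).

```python
import numpy as np

def bernoulli_log_lik(Y_binary, F):
    """Sum of Bernoulli log-probabilities of binary pixels Y given latent values F."""
    P = 1.0 / (1.0 + np.exp(-F))                  # sigmoid link to pixel probabilities
    P = np.clip(P, 1e-10, 1 - 1e-10)              # numerical safety
    return np.sum(Y_binary * np.log(P) + (1 - Y_binary) * np.log(1 - P))
```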
45
Twos Data Twos from the CEDAR CD-ROM digits: 700 examples of 8×8 twos.
Binary images.
46
Twos Results Gaussian Noise Model Bernoulli Noise Model
47
Reconstruction Experiment
Pixel error rate by reconstruction method: GPLVM, Bernoulli noise 23.5%; GPLVM, Gaussian noise 35.9%; baseline (missing pixels set to ‘not ink’) 51.5%.
48
Horse Colic Data Ordinal data with many missing values.
49
Horse Colic Data [Panels: Linear, MLP and RBF kernels.]
Colours: death – green, survival – red, put down – blue.
50
Classifying Colic Outcome
Increasing the dimension of the latent space gives a corresponding decrease in outcome prediction error. Experiments were repeated for different train/test partitions; the trend is similar for each partition.
51
Inverse Kinematics and Animation
52
Inverse Kinematics Style-Based Inverse Kinematics. Keith Grochow, Steve L. Martin, Aaron Hertzmann, Zoran Popović. ACM Trans. on Graphics (Proc. SIGGRAPH 2004). Learn a GPLVM on motion capture data. Use the GPLVM as a ‘soft style constraint’ in combination with hard kinematic constraints.
53
Video styleik.mov
54
Why GPLVM in IK? The GPLVM is probabilistic (soft constraints).
The GPLVM can capture non-linearities in the data. Inverse kinematics can be viewed as a missing-value problem, and the GPLVM handles missing values well.
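A sketch of the missing-value view: given a trained GPLVM (latent points X, data Y, and the rbf_kernel helper above), unknown dimensions of a new observation are filled in from the GP predictive mean at the best-matching latent point. The zero initialisation and squared-error objective are simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def reconstruct_missing(y_obs, observed, X, Y):
    """Fill in missing dimensions of y_obs; `observed` indexes the known dimensions."""
    K = rbf_kernel(X)
    K_inv_Y = np.linalg.solve(K, Y)

    def predict(x):
        # GP predictive mean over all data dimensions at latent point x
        kx = rbf_kernel(np.vstack([X, x[None, :]]))[-1, :-1]
        return kx @ K_inv_Y

    def objective(x):
        # match the prediction to the observed dimensions only
        return np.sum((predict(x)[observed] - y_obs[observed]) ** 2)

    x_best = minimize(objective, np.zeros(X.shape[1]), method="L-BFGS-B").x
    return predict(x_best)                                # full vector, missing dims filled
```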
55
Face Animation Data from EA
OpenGL Code by Manuel Sanchez (at Sheffield).
56
KL Divergence Objective Function
57
Kernel PCA, MDS and GPLVMs
Maximum likelihood ≡ minimum Kullback–Leibler divergence. PCA – minimise the KL divergence between Gaussians with covariances K_y and K_x. K_y non-linear – kernel PCA. K_x non-linear – GPLVM.
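A sketch of the generic KL divergence between two zero-mean Gaussians with covariances S and K; which matrix plays which role in each of the special cases above is not spelled out on the slide, so this is just the formula.

```python
import numpy as np

def kl_zero_mean_gaussians(S, K):
    """KL( N(0, S) || N(0, K) ) = 0.5 * ( tr(K^{-1} S) - N + log|K| - log|S| )."""
    N = S.shape[0]
    K_inv_S = np.linalg.solve(K, S)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (np.trace(K_inv_S) - N + logdet_K - logdet_S)
```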
58
KL Divergence as an Objective
Points towards what it means to use a non-positive definite similarity measure in metric MDS. Should provide an approach to dealing with missing data in Kernel PCA. Usable for kernel parameter optimisation in KPCA? Unifies GPLVM, PCA, Kernel PCA, metric MDS in one framework.
59
Interpretation Kernels are similarity measures.
Express correlations between points. GPLVM and KPCA try to match them. cf. MDS and principal co-ordinate analysis.
60
Conclusions
61
Conclusions GPLVM is a Probabilistic Non-Linear PCA
Can sample from it and evaluate likelihoods. Handles non-continuous data; missing data is no problem. Optimisation of X is the difficult part; we presented a sparse optimisation algorithm. The model has been ‘proven’ in a real application. Put your source code on-line!