Non-Linear Probabilistic PCA with Gaussian Process Latent Variable Models
Neil Lawrence, Machine Learning Group, Department of Computer Science, University of Sheffield, U.K.

Overview
- Motivation: high dimensional data; smooth low dimensional embedded spaces.
- Mathematical foundations: probabilistic PCA.
- A sparse algorithm.
- Some results.
- Inverse kinematics and animation.

Motivation

High Dimensional Data
A handwritten digit: 3648 dimensions. The space contains more than just this digit.

Handwritten Digit
A simple model of the digit – rotate the 'prototype'. [Figure: rotated versions of the prototype digit.]

Projection onto Principal Components

Discontinuities

Low Dimensional Manifolds
Pure rotation of a prototype is too simple. In practice the data may go through several distortions, e.g. digits undergo thinning, translation and rotation. For data with 'structure' we expect fewer distortions than dimensions, and therefore expect the data to live on a lower dimensional manifold. Deal with high dimensional data by looking for a lower dimensional non-linear embedding.

Our Options
Spectral approaches:
- Classical Multidimensional Scaling (MDS): uses eigenvectors of a similarity matrix. LLE and Isomap are MDS with particular proximity measures.
- Kernel PCA: provides an embedding and a mapping from the high dimensional space to the embedding. The mapping is implied through the use of a kernel function as the similarity matrix.
Non-spectral approaches:
- Non-metric MDS and Sammon mappings: iterative optimisation of a stress function. A mapping can be forced (e.g. Neuroscale).

Our Options
Probabilistic approaches:
- Probabilistic PCA: a linear method.
- Density Networks: use importance sampling and a multi-layer perceptron.
- GTM: uses a grid based sample and an RBF network.
The difficulty for probabilistic approaches: propagating a distribution through a non-linear mapping.

The New Model
PCA has a probabilistic interpretation, but it is difficult to 'non-linearise'. We present a new probabilistic interpretation of PCA which can be made non-linear. The result is non-linear probabilistic PCA.

Mathematical Foundations

Notation
- q – dimension of latent/embedded space.
- d – dimension of data space.
- N – number of data points.
- Centred data, $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]^\top \in \mathbb{R}^{N \times d}$.
- Latent variables, $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top \in \mathbb{R}^{N \times q}$.
- Mapping matrix, $\mathbf{W} \in \mathbb{R}^{d \times q}$.
- $\mathbf{a}_{(i)}$ is a vector formed from the $i$-th row of $\mathbf{A}$; $\mathbf{a}_i$ is a vector formed from the $i$-th column of $\mathbf{A}$.

Reading Notation
$\mathbf{X}$ and $\mathbf{Y}$ are design matrices. The covariance is given by $N^{-1}\mathbf{Y}^\top\mathbf{Y}$; the inner product matrix is given by $\mathbf{Y}\mathbf{Y}^\top$.
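As a concrete illustration of these two matrices (a sketch with synthetic data, not code from the talk; the array names are my own):

```python
import numpy as np

# Synthetic stand-in for the centred N x d design matrix Y (not the talk's data).
N, d = 100, 12
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d))
Y = Y - Y.mean(axis=0)          # centre each column

covariance = Y.T @ Y / N        # d x d covariance matrix, N^{-1} Y^T Y
inner_products = Y @ Y.T        # N x N inner product (Gram) matrix, Y Y^T
```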

Linear Embeddings
Represent the data, $\mathbf{Y}$, with a lower dimensional embedding $\mathbf{X}$. Assume a linear relationship of the form $\mathbf{y}_n = \mathbf{W}\mathbf{x}_n + \boldsymbol{\eta}_n$, with $\boldsymbol{\eta}_n \sim \mathcal{N}(\mathbf{0}, \beta^{-1}\mathbf{I})$. Probabilistically we implement this as $p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{W}) = \mathcal{N}(\mathbf{y}_n \mid \mathbf{W}\mathbf{x}_n, \beta^{-1}\mathbf{I})$.

PCA – Probabilistic Interpretation
[Graphical model: latent variables X and mapping W generate the data Y; a prior is placed on X, which is marginalised, leaving a likelihood in terms of W.]

Maximum Likelihood Solution
If $\mathbf{U}_q$ are the first $q$ eigenvectors of $N^{-1}\mathbf{Y}^\top\mathbf{Y}$ and the corresponding eigenvalues are $\boldsymbol{\Lambda}_q$, the maximum likelihood solution is $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^\top$ with $\mathbf{L} = (\boldsymbol{\Lambda}_q - \beta^{-1}\mathbf{I})^{1/2}$, where $\mathbf{V}$ is an arbitrary rotation matrix.
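A minimal NumPy sketch of this solution, assuming the Tipping & Bishop estimate of the noise variance (mean of the discarded eigenvalues) and taking the arbitrary rotation V as the identity; the data is synthetic:

```python
import numpy as np

# Synthetic centred data standing in for Y (N points, d dimensions).
N, d, q = 100, 12, 2
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d)); Y -= Y.mean(axis=0)

S = Y.T @ Y / N                               # N^{-1} Y^T Y
eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
Lambda_q = eigvals[order[:q]]                 # top-q eigenvalues
U_q = eigvecs[:, order[:q]]                   # corresponding eigenvectors

sigma2 = eigvals[order[q:]].mean()            # ML noise variance: mean of discarded eigenvalues
W = U_q @ np.diag(np.sqrt(Lambda_q - sigma2)) # W = U_q L V^T with V = I
```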

PCA – Probabilistic Interpretation
[Graphical model repeated from above: X and W generate Y; X is marginalised and W is optimised.]

Dual Probabilistic PCA
[Graphical model: in the dual formulation the prior is placed on the mapping W, which is marginalised, and the latent variables X are optimised; X generates Y directly.]

Maximum Likelihood Solution If Uq are first q eigenvectors of d-1YYT and the corresponding eigenvalues are q. where V is an arbitrary rotation matrix.

Maximum Likelihood Solution
Recall the PPCA case: if $\mathbf{U}_q$ are the first $q$ eigenvectors of $N^{-1}\mathbf{Y}^\top\mathbf{Y}$ and the corresponding eigenvalues are $\boldsymbol{\Lambda}_q$, the maximum likelihood solution is $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^\top$ with $\mathbf{L} = (\boldsymbol{\Lambda}_q - \beta^{-1}\mathbf{I})^{1/2}$, where $\mathbf{V}$ is an arbitrary rotation matrix.

Equivalence of PPCA Formulations
Solution for PPCA: $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^\top$. Solution for dual PPCA: $\mathbf{X} = \mathbf{U}'_q\mathbf{L}\mathbf{V}^\top$. The equivalence follows from the relationship between the eigendecompositions of $\mathbf{Y}^\top\mathbf{Y}$ and $\mathbf{Y}\mathbf{Y}^\top$: they share their non-zero eigenvalues (up to the $N^{-1}$ versus $d^{-1}$ scaling) and their eigenvectors are related through multiplication by $\mathbf{Y}$ or $\mathbf{Y}^\top$.

Gaussian Processes
A Gaussian process (GP) likelihood is of the form $p(\mathbf{y} \mid \mathbf{X}) = \frac{1}{(2\pi)^{N/2}|\mathbf{K}|^{1/2}}\exp\left(-\tfrac{1}{2}\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}\right)$, where $\mathbf{K}$ is the covariance function or kernel. If we select the linear kernel $\mathbf{K} = \mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I}$, we see that dual PPCA is a product of GPs, one per data dimension.

Dual Probabilistic PCA is a GPLVM
Log-likelihood: $L = -\frac{dN}{2}\ln 2\pi - \frac{d}{2}\ln|\mathbf{K}| - \frac{1}{2}\operatorname{tr}\!\left(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^\top\right)$.
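A sketch of this log-likelihood under the linear kernel (synthetic Y and X; the value of beta is an assumption for illustration):

```python
import numpy as np

# Synthetic data Y and latent positions X, standing in for the real quantities.
N, d, q, beta = 100, 12, 2, 10.0
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d)); Y -= Y.mean(axis=0)
X = rng.standard_normal((N, q))

K = X @ X.T + np.eye(N) / beta                 # linear kernel plus noise term

# L = -dN/2 log(2 pi) - d/2 log|K| - 1/2 tr(K^{-1} Y Y^T):
# each of the d data dimensions is an independent zero-mean GP draw with covariance K.
_, logdetK = np.linalg.slogdet(K)
L = (-0.5 * d * N * np.log(2 * np.pi)
     - 0.5 * d * logdetK
     - 0.5 * np.trace(Y.T @ np.linalg.solve(K, Y)))
```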

Non-linear Kernel
Instead of the linear kernel function, use, for example, an RBF kernel function. This leads to non-linear embeddings.
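A sketch of one such RBF covariance function; the parameter names alpha (variance), gamma (inverse width) and beta (noise precision) are my assumption for this sketch, and the talk's kernel may include additional bias or linear terms:

```python
import numpy as np

def rbf_kernel(X, alpha=1.0, gamma=1.0, beta=100.0):
    """RBF (squared exponential) kernel over latent positions X, plus a white noise term."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(X**2, axis=1)[None, :]
                - 2.0 * X @ X.T)
    return alpha * np.exp(-0.5 * gamma * sq_dists) + np.eye(X.shape[0]) / beta
```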

Pros & Cons of GPLVM
Pros:
- Probabilistic.
- Missing data straightforward.
- Can sample from the model given X.
- Different noise models can be handled.
- Kernel parameters can be optimised.
Cons:
- Speed of optimisation.
- Optimisation is non-convex, cf. classical MDS and kernel PCA.

GPLVM Optimisation
Gradient based optimisation with respect to $\mathbf{X}$ and the kernel parameters $\alpha$, $\beta$, $\gamma$, using scaled conjugate gradients (SCG). Example data-set: oil flow data. Three phases of flow (stratified, annular, homogeneous), twelve measurement probes, 1000 data points. We sub-sampled to 100 data points.
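The sketch below shows the overall shape of this optimisation. The talk uses scaled conjugate gradients with analytic gradients; here SciPy's L-BFGS with numerical gradients stands in, and the data and settings are placeholders:

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data (the real experiment sub-samples 100 points of the 12-D oil flow data).
N, d, q = 100, 12, 2
rng = np.random.default_rng(0)
Y = rng.standard_normal((N, d)); Y -= Y.mean(axis=0)

def neg_log_likelihood(params):
    """Negative GPLVM log-likelihood as a function of X and the log kernel parameters."""
    X = params[:N * q].reshape(N, q)
    alpha, gamma, beta = np.exp(params[N * q:])        # log-space keeps parameters positive
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    K = alpha * np.exp(-0.5 * gamma * sq) + np.eye(N) / beta
    _, logdetK = np.linalg.slogdet(K)
    return 0.5 * d * logdetK + 0.5 * np.trace(Y.T @ np.linalg.solve(K, Y))

# Initialise X with PCA (one of the initialisations mentioned later), then optimise jointly.
U, S, _ = np.linalg.svd(Y, full_matrices=False)
X0 = U[:, :q] * S[:q]
params0 = np.concatenate([X0.ravel(), np.zeros(3)])
result = minimize(neg_log_likelihood, params0, method="L-BFGS-B", options={"maxiter": 50})
X_opt = result.x[:N * q].reshape(N, q)
```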

SCG GPLVM Oil Results 2-D Manifold in 12-D space (shading is precision).

A More Efficient Algorithm

Efficient GPLVM Optimisation
Optimising the $q \times N$ parameters of $\mathbf{X}$ is slow. There are correlations between data points.

'Sparsification'
Let $\mathbf{X}_I$ be a sub-set of $\mathbf{X}$ (the active set). For a well chosen active set $I$, $|I| \ll N$. For $n \notin I$ we can optimise the $q$-dimensional $\mathbf{x}_n$ independently.

Algorithm
We selected the active set according to the IVM scheme:
1. Select the active set.
2. Optimise the kernel parameters $\alpha$, $\beta$ and $\gamma$.
3. For all $n \notin I$, optimise $\mathbf{x}_n$.
4. For small data-sets, optimise $\mathbf{X}_I$.
5. Repeat.
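Purely as a structural sketch of one sweep of this loop: the active set is chosen at random here as a stand-in for the IVM selection criterion, the kernel is a fixed simplified RBF, and all names are illustrative rather than the talk's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def kern(A, B):
    """Simplified RBF kernel (unit amplitude and length-scale) between two sets of latent points."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq)

def sparse_sweep(Y, X, active_size, rng, noise=1e-2):
    """One sweep: pick an active set, then optimise each inactive x_n independently."""
    N, q = X.shape
    d = Y.shape[1]
    I = rng.choice(N, size=active_size, replace=False)     # placeholder for IVM-style selection
    K_II = kern(X[I], X[I]) + noise * np.eye(active_size)

    for n in range(N):
        if n in I:
            continue
        def nll_n(x_n):
            # Negative log of the GP prediction for y_n, conditioned on the active set only.
            k_nI = kern(x_n[None, :], X[I])                 # 1 x |I| cross-covariance
            mean = (k_nI @ np.linalg.solve(K_II, Y[I])).ravel()
            var = 1.0 + noise - (k_nI @ np.linalg.solve(K_II, k_nI.T)).item()
            return 0.5 * d * np.log(var) + 0.5 * np.sum((Y[n] - mean) ** 2) / var
        X[n] = minimize(nll_n, X[n], method="L-BFGS-B").x
    return X
```

With $|I| \ll N$ each inner optimisation only involves an $|I| \times |I|$ linear system rather than the full $N \times N$ kernel matrix, which is where the speed-up comes from.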

Some Results

Some Results
As well as the RBF kernel we will use an 'MLP kernel'. We revisit the oil data, this time with the full training set, using the RBF kernel for the GPLVM and comparing with GTM.

Oil Data
[Figures: GTM latent space and GPLVM (RBF) latent space.]

Different Kernels
[Figures: GPLVM latent space with the RBF kernel and with the MLP kernel.]

Classification in Latent Space
Classify flow regimes in the latent space.

Model         Test Error
GPLVM (RBF)   4.3 %
GPLVM (MLP)   3.0 %
GTM           2.0 %
PCA           14 %
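The slide does not say which classifier produced these numbers; as an illustration of the general idea of classifying in the learned latent space, here is a simple nearest-neighbour version with placeholder coordinates and labels (scikit-learn assumed):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder 2-D latent coordinates and flow-regime labels (3 classes), standing in
# for the coordinates produced by the GPLVM on the oil data.
rng = np.random.default_rng(0)
Z_train, Z_test = rng.standard_normal((800, 2)), rng.standard_normal((200, 2))
y_train, y_test = rng.integers(0, 3, 800), rng.integers(0, 3, 200)

clf = KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train)
test_error = 1.0 - clf.score(Z_test, y_test)
```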

Aside: Swiss Roll – Initialisation

Local Minima
[Figures: Swiss roll embeddings, PCA initialised and Isomap initialised.]

Digits Data
Digits 0 to 4 from the USPS data; 600 of each digit randomly selected; 16x16 greyscale images.

Digits
[Figures: latent spaces with the RBF kernel and the MLP kernel. Colours: 0 – red, 1 – green, 2 – blue, 3 – cyan, 4 – magenta.]

PCA and GTM for Digits
[Figures: GTM and PCA embeddings of the digits.]

Digits Classifications

Model         Test Error
GPLVM (RBF)   5.9 %
GPLVM (MLP)   5.8 %
GTM           3.7 %
PCA           29 %

Twos Data
So far we have used Gaussian noise. We can instead use a Bernoulli likelihood, with an ADF approximation (which can easily be extended to EP). Practical consequences: about d times slower and d times more storage. The twos data (8x8 binary images) is modelled with both a Gaussian noise model and a Bernoulli noise model.
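For a single binary image, the Bernoulli log-likelihood being swapped in for the Gaussian one looks roughly as follows. A sigmoid link is assumed here purely for illustration (the talk only states that a Bernoulli likelihood with an ADF approximation is used), and the ADF/EP machinery needed to use it inside the GPLVM is not shown:

```python
import numpy as np

def bernoulli_log_likelihood(y, f):
    """Log p(y | f) for binary pixels y in {0, 1}, with GP outputs f squashed by a sigmoid."""
    p = 1.0 / (1.0 + np.exp(-f))                      # assumed link function for this sketch
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```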

Twos Data
Cedar CD-ROM digits; 700 examples of 8x8 twos; binary images.

Twos Results
[Figures: latent spaces under the Gaussian noise model and the Bernoulli noise model.]

Reconstruction Experiment

Reconstruction Method      Pixel Error Rate
GPLVM, Bernoulli noise     23.5 %
GPLVM, Gaussian noise      35.9 %
Missing pixels not ink     51.5 %

Horse Colic Data
Ordinal data with many missing values.

Horse Colic Data
[Figures: latent spaces with linear, MLP and RBF kernels. Colours: death – green, survival – red, put down – blue.]

Classifying Colic Outcome
Increasing the dimension of the latent space gives a corresponding decrease in outcome prediction error. Experiments were repeated for different train/test partitions; the trend is similar for each partition.

Inverse Kinematics and Animation

Inverse Kinematics
Style-Based Inverse Kinematics, Keith Grochow, Steve L. Martin, Aaron Hertzmann, Zoran Popović, ACM Trans. on Graphics (Proc. SIGGRAPH 2004). Learn a GPLVM on motion capture data. Use the GPLVM as a 'soft style constraint' in combination with hard kinematic constraints.

[Video: styleik.mov]

Why GPLVM in IK?
The GPLVM is probabilistic (soft constraints) and can capture non-linearities in the data. Inverse kinematics can be viewed as a missing value problem, and the GPLVM handles missing values well.

Face Animation
Data from EA. OpenGL code by Manuel Sanchez (at Sheffield).

KL Divergence Objective Function

Kernel PCA, MDS and GPLVMs
Maximum likelihood ≡ minimum Kullback-Leibler divergence. PCA minimises the KL divergence between two Gaussians, one with covariance $\mathbf{K}_y$ (constructed from the data) and one with covariance $\mathbf{K}_x$ (constructed from the latent representation). Making $\mathbf{K}_y$ non-linear gives kernel PCA; making $\mathbf{K}_x$ non-linear gives the GPLVM.
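Written out, assuming (as the slides set it up here) two zero-mean Gaussians over the N points with covariances $\mathbf{K}_y$ and $\mathbf{K}_x$, the objective is:

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mathbf{0},\mathbf{K}_y)\,\middle\|\,\mathcal{N}(\mathbf{0},\mathbf{K}_x)\right)
  = \frac{1}{2}\ln\frac{|\mathbf{K}_x|}{|\mathbf{K}_y|}
  + \frac{1}{2}\operatorname{tr}\!\left(\mathbf{K}_x^{-1}\mathbf{K}_y\right)
  - \frac{N}{2}
```

Minimising this with respect to $\mathbf{K}_x$ matches the GPLVM log-likelihood given earlier, up to terms that do not depend on $\mathbf{K}_x$.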

KL Divergence as an Objective
- Points towards what it means to use a non-positive definite similarity measure in metric MDS.
- Should provide an approach to dealing with missing data in kernel PCA.
- Usable for kernel parameter optimisation in kernel PCA?
- Unifies GPLVM, PCA, kernel PCA and metric MDS in one framework.

Interpretation
Kernels are similarity measures: they express correlations between data points. The GPLVM and kernel PCA try to match them, cf. MDS and principal co-ordinate analysis.

Conclusions

Conclusions
- The GPLVM is a probabilistic non-linear PCA: we can sample from it, evaluate likelihoods, handle non-continuous data, and missing data is no problem.
- Optimisation of X is the difficult part; we presented a sparse optimisation algorithm.
- The model has been 'proven' in a real application.
- Put your source code on-line!