Continuous Latent Variables --Bishop


1 Continuous Latent Variables --Bishop
Xue Tian

2 Continuous Latent Variables
explore models in which some, or all, of the latent variables are continuous
motivation for such models: in many data sets the dimensionality of the original data space is very high, yet the data points all lie close to a manifold of much lower dimensionality

3 Example data set: 100x100 pixel grey-level images
the dimensionality of the original data space is 100x100 = 10,000
each image contains the digit 3, embedded with its location and orientation varied at random
there are 3 degrees of freedom of variability: vertical translation, horizontal translation and rotation
so the intrinsic dimensionality of the data is 3

4 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA
note: the two commonly used definitions of PCA give rise to the same algorithm; PCA is a well-known technique, and probabilistic PCA is covered only briefly

5 PCA-maximum variance formulation
goal: PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space (the principal subspace) such that the variance of the projected data is maximized

6 PCA-maximum variance formulation
red dots: data points; purple line: principal subspace; green dots: projected points

7 PCA-maximum variance formulation
consider a data set {xn}, n = 1,2,…,N, where each xn has D dimensions
goal: project the data onto a space of dimensionality M < D such that the variance of the projected data is maximized

8 PCA-maximum variance formulation
to begin with, consider projection onto a one-dimensional space (M = 1)
the direction is defined by a D-dimensional unit vector u1 with u1^T u1 = 1 (only the direction matters, so a unit vector is chosen for convenience)
each data point xn is projected onto the scalar value u1^T xn
mean of the projected data: u1^T x_bar, where x_bar = (1/N) Σn xn
variance of the projected data: (1/N) Σn (u1^T xn - u1^T x_bar)^2 = u1^T S u1, where S = (1/N) Σn (xn - x_bar)(xn - x_bar)^T is the data covariance matrix

9 PCA-maximum variance formulation
goal: maximize the projected variance u1^T S u1 with respect to u1
this is a constrained maximization: the constraint u1^T u1 = 1 (from the unit-vector condition) prevents ||u1|| growing without bound
introduce a Lagrange multiplier λ1 and maximize u1^T S u1 + λ1 (1 - u1^T u1)
setting the derivative with respect to u1 equal to zero gives S u1 = λ1 u1, so u1 is an eigenvector of S
the projected variance equals u1^T S u1 = λ1, so the maximum variance is obtained when u1 is the eigenvector with the largest eigenvalue λ1

10 PCA-maximum variance formulation
additional principal components are defined in an incremental fashion: each new direction is chosen to maximize the projected variance among all directions orthogonal to those already considered
general case: the optimal M-dimensional linear projection is defined by the M eigenvectors u1, ..., uM of S corresponding to the M largest eigenvalues λ1, ..., λM
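Taken together, the maximum variance result reduces PCA to an eigendecomposition of the covariance matrix S. Below is a minimal NumPy sketch of that procedure; the function and variable names (pca_max_variance and so on) are illustrative and not taken from the slides.

```python
import numpy as np

def pca_max_variance(X, M):
    """PCA by eigendecomposition of the data covariance matrix S.

    X: (N, D) data matrix, one data point per row.
    M: number of principal components to keep (M < D).
    Returns the M leading eigenvectors (as columns) and their eigenvalues.
    """
    x_bar = X.mean(axis=0)                   # sample mean
    Xc = X - x_bar                           # centred data
    S = Xc.T @ Xc / X.shape[0]               # covariance matrix S (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh since S is symmetric
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    return eigvecs[:, order[:M]], eigvals[order[:M]]

# usage: project the data onto the M-dimensional principal subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
U, lam = pca_max_variance(X, M=2)
Z = (X - X.mean(axis=0)) @ U                 # projected coordinates u_i^T (x_n - x_bar)
```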

11 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA

12 PCA-minimum error formulation
goal: PCA can be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections

13 PCA-minimum error formulation
projection error: the distance between a data point and its projection
red dots: data points; purple line: principal subspace; green dots: projected points; blue lines: projection errors

14 PCA-minimum error formulation
introduce a complete orthonormal set of D-dimensional basis vectors {ui}, i = 1,…,D
each data point can be represented exactly as a linear combination of the basis vectors: xn = Σi αni ui
taking the inner product with uj gives αnj = xn^T uj, so xn = Σi (xn^T ui) ui

15 PCA-minimum error formulation
approximate each data point using an M-dimensional subspace: x~n = Σ(i=1..M) zni ui + Σ(i=M+1..D) bi ui
the zni depend on the particular data point; the bi are constants, the same for all data points
goal: minimize the mean squared distance J = (1/N) Σn ||xn - x~n||^2
setting the derivative with respect to znj to zero gives znj = xn^T uj, for j = 1,…,M

16 PCA-minimum error formulation
setting the derivative with respect to bj to zero gives bj = x_bar^T uj, for j = M+1,…,D
substituting back, xn - x~n = Σ(i=M+1..D) ((xn - x_bar)^T ui) ui, so J = Σ(i=M+1..D) ui^T S ui
remaining task: minimize J with respect to the ui

17 PCA-minimum error formulation
consider first the case D = 2 and M = 1: choose a single direction u2 orthogonal to the principal subspace
this is a constrained minimization of J = u2^T S u2; the constraint u2^T u2 = 1 (u2 is a unit vector) prevents ||u2|| shrinking to 0
introduce a Lagrange multiplier λ2 and minimize u2^T S u2 + λ2 (1 - u2^T u2)
setting the derivative equal to zero gives S u2 = λ2 u2, so u2 is an eigenvector of S and J = λ2
the error is minimized by choosing u2 to be the eigenvector with the smallest eigenvalue

18 PCA-minimum error formulation
general case: J = Σ(i=M+1..D) λi, the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace
the minimum value of J is obtained by selecting these to be the eigenvectors corresponding to the D - M smallest eigenvalues
hence the eigenvectors defining the principal subspace are those corresponding to the M largest eigenvalues
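As a quick numerical check of this result (a sketch using random toy data, not the data from the slides): the mean squared projection error J should equal the sum of the D - M smallest eigenvalues of S.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated toy data, D = 4
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / X.shape[0]                 # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
M = 2
U = eigvecs[:, -M:]                        # the M leading eigenvectors

X_hat = x_bar + (Xc @ U) @ U.T             # project onto the principal subspace and map back
J = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # mean squared projection error

print(np.isclose(J, eigvals[:-M].sum()))   # True: J = sum of the D - M smallest eigenvalues
```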

19 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA

20 PCA-application
PCA is unsupervised and depends only on the values xn
applications: dimensionality reduction, lossy data compression, feature extraction, data visualization
an example follows

21 PCA-example: go through the steps needed to perform PCA on a set of data
Principal Components Analysis by Lindsay Smith

22 PCA-example Step 1: get the data set (D = 2, N = 10)

23 PCA-example Step 2: subtract the mean, giving a zero-mean data set

24 PCA-example Step 3: calculate the covariance matrix S (S is 2x2)

25 PCA-example Step 4: calculate the eigenvectors and eigenvalues of the covariance matrix S; the eigenvector with the highest eigenvalue is the first principal component of the data set

26 PCA-example the two eigenvectors
the eigenvectors go through the middle of the points, like drawing a line of best fit, and so extract lines that characterize the data

27 PCA-example in general, once the eigenvectors are found
the next step is to order them by eigenvalue, highest to lowest; this gives the principal components in order of significance
we can then decide to ignore the less significant components; this is where the notion of data compression and reduced dimensionality comes in

28 PCA-example Step 5: derive the new data set
newData^T = eigenvectors^T x originalDataAdjust^T, where originalDataAdjust is the mean-subtracted data
keeping only the first eigenvector, newData is 10x1

29 PCA-example newData

30 PCA-example newData: 10x2

31 PCA-example Step 6: get back the old data (data compression)
if we took all the eigenvectors in the transformation, we get exactly the original data back; otherwise, we lose some information

32 PCA-example newData^T = eigenvectors^T x originalDataAdjust^T
taking all the eigenvectors, and since they are orthonormal (unit vectors), the inverse of the eigenvector matrix equals its transpose, so:
originalDataAdjust^T = eigenvectors x newData^T
originalData^T = eigenvectors x newData^T + mean
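The six steps of the worked example can be written out compactly in NumPy. This is a sketch: the 10-point data set itself appears only as an image on the slides, so random 2-D data is generated here instead, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Step 1: get a data set (D = 2, N = 10); random correlated data stands in for the slide's table
data = rng.normal(size=(10, 1)) @ np.array([[1.0, 0.8]]) + 0.3 * rng.normal(size=(10, 2))

mean = data.mean(axis=0)                       # Step 2: subtract the mean
adjusted = data - mean
S = np.cov(adjusted, rowvar=False)             # Step 3: the 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)           # Step 4: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]                    # columns ordered by significance

new_data = adjusted @ eigvecs[:, :1]           # Step 5: keep only the first PC -> newData is 10x1
restored = new_data @ eigvecs[:, :1].T + mean  # Step 6: get back (an approximation of) the old data
```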

33 PCA-example newData: 10x1

34 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA

35 PCA-high dimensional data
in some applications the number of data points is smaller than the dimensionality of the data space: N < D
example: a data set of a few hundred images whose dimensionality is several million, corresponding to three colour values for each pixel

36 PCA-high dimensional data
the standard algorithm for finding the eigenvectors of a DxD matrix is O(D^3) (roughly O(MD^2) if only the first M eigenvectors are needed)
if D is really high, a direct PCA is computationally infeasible

37 PCA-high dimensional data
when N < D, a set of N points defines a linear subspace whose dimensionality is at most N - 1
so there is little point in applying PCA for M > N - 1
if M > N - 1, at least D - N + 1 of the eigenvalues are zero, corresponding to eigenvectors along which the data set has zero variance

38 PCA-high dimensional data
solution: define X to be the NxD centred data matrix whose nth row is (xn - x_bar)^T
then S = (1/N) X^T X is DxD, and the eigenvector equation is (1/N) X^T X ui = λi ui

39 PCA-high dimensional data
pre-multiplying by X and defining vi = X ui gives (1/N) X X^T vi = λi vi, an eigenvector equation for the NxN matrix (1/N) X X^T
this matrix has the same N - 1 (potentially nonzero) eigenvalues as the original DxD problem, which additionally has D - N + 1 zero eigenvalues
the cost drops from O(D^3) to O(N^3)
the eigenvectors in the original data space are recovered as ui = X^T vi / (N λi)^(1/2)
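A NumPy sketch of this trick (the function name pca_high_dim is illustrative): the leading eigenvectors of S are recovered from the NxN problem without ever forming the DxD covariance matrix.

```python
import numpy as np

def pca_high_dim(X, M):
    """PCA for N < D via the N x N matrix (1/N) X X^T.

    X: (N, D) data matrix with N << D.
    Returns the M largest eigenvalues and the corresponding unit
    eigenvectors of the D x D covariance matrix S.
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)                    # centred data matrix
    K = Xc @ Xc.T / N                          # N x N matrix (1/N) X X^T
    lam, V = np.linalg.eigh(K)                 # eigenvalues in ascending order
    lam, V = lam[::-1][:M], V[:, ::-1][:, :M]  # keep the M largest
    U = Xc.T @ V / np.sqrt(N * lam)            # map back: u_i = X^T v_i / sqrt(N * lam_i)
    return lam, U                              # columns of U are unit-norm eigenvectors of S

# usage: 20 points in a 500-dimensional space
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
lam, U = pca_high_dim(X, M=3)
```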

40 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA

41 Kernel
φ(x): a mapping of x into a feature space whose dimensionality M is greater than or equal to that of the input space; the mapping is implicit
a kernel (kernel function) is defined as the inner product in feature space: k(x, x') = φ(x)^T φ(x')
the kernel function computes this inner product implicitly, without transforming the data from the input space into the higher-dimensional feature space
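A small illustration of the implicit inner product (a sketch; the quadratic kernel and explicit feature map below are a standard textbook example, not taken from the slides).

```python
import numpy as np

# For 2-D inputs, the kernel k(x, x') = (x^T x')^2 corresponds to the explicit
# feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
explicit = phi(x) @ phi(xp)        # inner product computed in feature space
implicit = (x @ xp) ** 2           # same value, without ever forming phi(x)
print(np.isclose(explicit, implicit))   # True
```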

42 PCA-linear
maximum variance formulation: the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized
minimum error formulation: the linear projection that minimizes the average projection distance
both definitions give a linear projection; to break this limitation, we introduce kernel PCA

43 Kernel PCA
consider a data set {xn}, n = 1,2,…,N, where each xn has D dimensions
to keep the notation simple, assume the mean has already been subtracted from the xn (zero mean)
recall that the principal components are defined by the eigenvectors ui of S, S ui = λi ui, i = 1,…,D, where S = (1/N) Σn xn xn^T

44 Kernel PCA
consider a nonlinear transformation φ(x) into an M-dimensional feature space, so that each data point xn is projected onto φ(xn)
performing standard PCA in the feature space implicitly defines nonlinear principal components in the original data space

45 Kernel PCA
left: original data space; right: feature space, in which standard PCA is performed
green lines: the linear projection onto the first principal component in feature space, corresponding to a nonlinear projection in the original data space

46 Kernel PCA
assume for the moment that the projected data φ(xn) have zero mean
performing standard PCA in the feature space: the MxM covariance matrix is C = (1/N) Σn φ(xn) φ(xn)^T, with eigenvector equations C vi = λi vi, i = 1,…,M
substituting for C shows that each vi is a linear combination of the φ(xn): vi = Σn ain φ(xn)

47 Kernel PCA
to solve the problem without working explicitly in the feature space, express everything in terms of the kernel function k(xn, xm) = φ(xn)^T φ(xm)
in matrix notation this gives K^2 ai = λi N K ai, where K is the NxN kernel (Gram) matrix with elements k(xn, xm), and ai is the column vector of the coefficients ani (the eigenvector we want)
this can be reduced to K ai = λi N ai; the solutions of these two eigenvector equations differ only by eigenvectors of K having zero eigenvalues, which do not affect the principal-component projections

48 Kernel PCA
normalization condition for ai: requiring vi^T vi = 1 gives 1 = Σn Σm ain aim k(xn, xm) = ai^T K ai = λi N ai^T ai

49 Kernel PCA
in feature space, the projection of a point x onto principal component i after PCA is yi(x) = φ(x)^T vi = Σn ain k(x, xn), for i = 1,…,M', where M' is the reduced dimensionality

50 Kernel PCA
original data space: dimensionality D, D eigenvectors, at most D linear principal components
feature space: dimensionality M, with M >> D (possibly even infinite), M eigenvectors, so the number of nonlinear principal components can exceed D
however, the number of nonzero eigenvalues cannot exceed N, the number of data points

51 Kernel PCA
so far we assumed the projected data have zero mean; in general the φ(xn) have nonzero mean
we cannot simply compute the feature-space mean and subtract it off, because we want to avoid working directly in feature space
instead, formulate the algorithm purely in terms of the kernel function, using the centred projections φ~(xn) = φ(xn) - (1/N) Σl φ(xl)

52 Kernel PCA
in matrix notation the centred Gram matrix is K~ = K - 1N K - K 1N + 1N K 1N, where 1N denotes the NxN matrix in which every element equals 1/N
use K~ in place of K to determine the eigenvalues and eigenvectors
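Putting the pieces together, here is a minimal NumPy sketch of kernel PCA with a Gaussian kernel; the function name kernel_pca and the kernel width sigma are illustrative choices, not from the slides.

```python
import numpy as np

def kernel_pca(X, M, sigma=1.0):
    """Kernel PCA with a Gaussian kernel (a sketch, not a library implementation).

    X: (N, D) data. Returns the projections of the training points onto the
    first M nonlinear principal components.
    """
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))                  # N x N Gram matrix

    one_n = np.full((N, N), 1.0 / N)                          # the matrix 1_N (every element 1/N)
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centred Gram matrix

    mu, A = np.linalg.eigh(K_tilde)            # ascending eigenvalues; eigh's mu_i equals N * lambda_i
    mu, A = mu[::-1][:M], A[:, ::-1][:, :M]    # keep the M largest
    A = A / np.sqrt(mu)                        # normalisation so that N * lambda_i * a_i^T a_i = 1

    return K_tilde @ A                         # y_i(x_n) = sum_m a_mi k~(x_n, x_m)

# usage on a small random data set
Z = kernel_pca(np.random.default_rng(0).normal(size=(50, 2)), M=2, sigma=0.5)
```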

53 Kernel PCA
a linear kernel k(x, x') = x^T x' recovers standard PCA
Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2σ^2))
example: kernel PCA with a Gaussian kernel

54 example: synthetic data with D = 2; in the original data space, projections onto the first two kernel PCA eigenvectors separate the three clusters

55 Kernel PCA contours: lines along which the projection onto the corresponding PC is constant

56 Kernel PCA
disadvantage: it requires finding the eigenvectors of the NxN matrix K~, rather than the DxD matrix S
for large data sets, approximations are used

57 Outline
PCA-principal component analysis: maximum variance formulation, minimum-error formulation, application of PCA, PCA for high-dimensional data
Kernel PCA
Probabilistic PCA

58 Probabilistic PCA
standard PCA: a linear projection of the data onto a lower-dimensional subspace
probabilistic PCA: the maximum likelihood solution of a probabilistic latent variable model
probabilistic PCA has several advantages; three of them are shown on the following slides

59 Probabilistic PCA
the combination of a probabilistic model and EM allows us to deal with missing values in the data set
EM (expectation-maximization, introduced in Chapter 9): an algorithm for finding maximum likelihood solutions for models with latent variables

60 Probabilistic PCA
probabilistic PCA forms the basis for a Bayesian treatment of PCA
in Bayesian PCA, the dimensionality M of the principal subspace can be found automatically from the data

61 Probabilistic PCA
the probabilistic PCA model can be run generatively to provide samples from the distribution
it is the simplest continuous latent variable model: it assumes Gaussian distributions for both the latent and observed variables, and makes use of a linear-Gaussian dependence of the observed variables on the state of the latent variables

62 Probabilistic PCA
introduce an explicit latent variable z (M-dimensional, Mx1) corresponding to the principal-component subspace
define a Gaussian prior distribution p(z) over the latent variable: a zero-mean, unit-covariance Gaussian, p(z) = N(z | 0, I)
define a Gaussian conditional distribution for the observed (D-dimensional) variable: p(x | z) = N(x | Wz + μ, σ^2 I)
W: a DxM matrix whose columns span the principal subspace; μ: a D-dimensional vector; σ^2: a scalar noise variance

63 Probabilistic PCA
from a generative viewpoint, we obtain a sample value of the observed variable by first choosing a value for the latent variable and then sampling the observed variable given that latent value
x is defined by a linear transformation of z plus additive Gaussian noise: x = Wz + μ + ε
ε: a D-dimensional zero-mean Gaussian noise variable with covariance σ^2 I

64 Probabilistic PCA
illustration with a 2-dimensional data space and a 1-dimensional latent space: first draw a value for the latent variable z, then draw a value for x from an isotropic Gaussian distribution (red circles) with mean Wz + μ and covariance σ^2 I
green ellipses: density contours of the marginal distribution p(x)
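The generative process can be sketched directly in NumPy; the dimensions and parameter values below are made up purely for illustration.

```python
import numpy as np

# Sample from the probabilistic PCA model x = W z + mu + eps
# (illustrative dimensions: D = 2 data space, M = 1 latent space).
rng = np.random.default_rng(0)
D, M, N = 2, 1, 500
W = rng.normal(size=(D, M))           # columns span the principal subspace
mu = np.array([1.0, -0.5])            # D-dimensional mean (made-up value)
sigma2 = 0.05                         # isotropic noise variance (made-up value)

z = rng.normal(size=(N, M))                           # z ~ N(0, I), the latent prior
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))  # eps ~ N(0, sigma^2 I)
x = z @ W.T + mu + eps                                # x | z ~ N(W z + mu, sigma^2 I)
```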

65 Probabilistic PCA
note that this defines a mapping from latent space to data space, in contrast to standard PCA, which projects from the data space onto the lower-dimensional principal subspace

66 Probabilistic PCA
from the Gaussian prior p(z) and the Gaussian conditional distribution p(x | z), the marginal distribution p(x) is again Gaussian: p(x) = N(x | μ, C), where C = W W^T + σ^2 I is the DxD covariance matrix
maximum likelihood PCA: this expression for p(x) lets us write down the likelihood function and determine the three parameters μ, W and σ^2
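The slides stop at writing down p(x). As a hedged sketch of what determining the three parameters amounts to, the closed-form maximum likelihood solution (due to Tipping and Bishop, and not shown on the slides) can be coded as follows; the function name ppca_ml is illustrative.

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum likelihood estimates for probabilistic PCA (a sketch).

    X: (N, D) data with M < D. Returns mu, W (with the arbitrary rotation R = I)
    and sigma^2.
    """
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N                  # D x D sample covariance
    lam, U = np.linalg.eigh(S)                     # eigenvalues ascending
    lam, U = lam[::-1], U[:, ::-1]                 # largest first
    sigma2 = lam[M:].mean()                        # average of the discarded eigenvalues
    W = U[:, :M] @ np.sqrt(np.diag(lam[:M]) - sigma2 * np.eye(M))
    return mu, W, sigma2
```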

67 Probabilistic PCA
so far we assumed the value of M is given; in practice we must choose a suitable value for M
for visualization, M = 2 or M = 3 is chosen so that the data set can be plotted easily
one option is to plot the eigenvalue spectrum of the data set and look for a significant gap indicating a choice of M; in practice, such a gap is often not seen
alternatives: Bayesian PCA, or cross-validation, which selects the value of M giving the largest log likelihood on a validation data set

68 Probabilistic PCA
example (plot of the variance explained by each principal component): the only clear break is between the 1st and 2nd PCs, but the 1st PC explains less than 40% of the variance, so more components are probably needed
the first 3 PCs explain two thirds of the total variability, so 3 might be a reasonable value for M

69 the basic idea is …

