
1 Ch 12. Continuous Latent Variables. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by S.-J. Kim and J.-K. Rhee; revised by D.-Y. Cho. Biointelligence Lab, Seoul National University, http://bi.snu.ac.kr/

2 Motivation
- 64 x 64 pixel images embedded in a larger image of size 100 x 100.
- Only three degrees of freedom: vertical and horizontal translations and rotation.
- In practice, the data points will not be confined precisely to a smooth low-dimensional manifold.
- We can interpret the departures of the data points from the manifold as 'noise'.

3 12.1 Principal Component Analysis
- PCA is used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization.
- Also known as the Karhunen-Loeve transform.
- PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.
- (Figure: orthogonal projection of the data points onto the principal subspace, which maximizes the variance of the projected points and, equivalently, minimizes the sum-of-squares of the projection errors.)

4 12.1.1 Maximum variance formulation (1/2)
- Consider a data set of observations {x_n}, n = 1,...,N, where x_n is a Euclidean variable with dimensionality D.
- Goal: project the data onto a space of dimensionality M < D while maximizing the variance of the projected data.
- One-dimensional case (M = 1): define the direction of this space by a D-dimensional unit vector u_1, with u_1^T u_1 = 1.
- The mean of the projected data is u_1^T x̄, where x̄ is the sample mean, x̄ = (1/N) Σ_n x_n.
- The variance of the projected data is (1/N) Σ_n { u_1^T x_n − u_1^T x̄ }² = u_1^T S u_1, where S is the data covariance matrix, S = (1/N) Σ_n (x_n − x̄)(x_n − x̄)^T.
- Maximize the projected variance u_1^T S u_1 subject to the constraint u_1^T u_1 = 1.

5 12.1.1 Maximum variance formulation (2/2)
- Lagrange multiplier: make an unconstrained maximization of u_1^T S u_1 + λ_1 (1 − u_1^T u_1). Setting the derivative with respect to u_1 to zero gives S u_1 = λ_1 u_1, so u_1 must be an eigenvector of S, and the projected variance is u_1^T S u_1 = λ_1.
- The variance is therefore maximized when we set u_1 equal to the eigenvector having the largest eigenvalue λ_1; this eigenvector is the first principal component.
- In general, PCA involves evaluating the mean x̄ and the covariance matrix S of the data set and then finding the M eigenvectors of S corresponding to the M largest eigenvalues.
- The cost of computing the full eigenvector decomposition for a matrix of size D x D is O(D³).
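Not part of the original slides: a minimal NumPy sketch of the maximum-variance view above, computing the first principal component as the leading eigenvector of the sample covariance. The synthetic data set, seed, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.0], [1.0, 1.0]], size=500)  # toy data, D = 2

x_bar = X.mean(axis=0)                   # sample mean x_bar
S = np.cov(X, rowvar=False, bias=True)   # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues for symmetric S
u1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue
projected_var = u1 @ S @ u1              # equals the largest eigenvalue lambda_1

print("largest eigenvalue          :", eigvals[-1])
print("variance of projection on u1:", projected_var)
```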

6 12.1.2 Minimum-error formulation (1/4)
- Based on minimizing the projection error.
- Introduce a complete orthonormal set of D-dimensional basis vectors {u_i}, i = 1,...,D, satisfying u_i^T u_j = δ_ij.
- Each data point can be represented exactly by a linear combination of the basis vectors, x_n = Σ_{i=1}^D α_ni u_i, with coefficients α_ni = x_n^T u_i (without loss of generality).
- Goal: approximate each data point using a representation involving a restricted number M < D of variables, corresponding to a projection onto a lower-dimensional subspace.

7 12.1.2 Minimum-error formulation (2/4)
- We approximate each data point x_n by x̃_n = Σ_{i=1}^M z_ni u_i + Σ_{i=M+1}^D b_i u_i, where the {z_ni} depend on the particular data point while the {b_i} are constants that are the same for all data points.
- Distortion measure: the mean squared distance between the original data point x_n and its approximation, J = (1/N) Σ_n ||x_n − x̃_n||². The goal is to choose the {u_i}, the {z_ni}, and the {b_i} so as to minimize J.
- Minimization with respect to the {z_nj} gives z_nj = x_n^T u_j for j = 1,...,M; similarly, minimization with respect to the {b_j} gives b_j = x̄^T u_j for j = M+1,...,D.
- The displacement vector from x_n to x̃_n then lies in the space orthogonal to the principal subspace (Fig. 12.2).

8 12.1.2 Minimum-error formulation (3/4)
- The distortion measure therefore reduces to J = (1/N) Σ_n Σ_{i=M+1}^D (x_n^T u_i − x̄^T u_i)² = Σ_{i=M+1}^D u_i^T S u_i.
- Case of a two-dimensional data space (D = 2) and a one-dimensional principal subspace (M = 1): choose the direction u_2 so as to minimize J = u_2^T S u_2, subject to u_2^T u_2 = 1.
- Using a Lagrange multiplier λ_2 to enforce the constraint, minimize J̃ = u_2^T S u_2 + λ_2 (1 − u_2^T u_2), which gives S u_2 = λ_2 u_2.
- The minimum value of J is obtained by choosing u_2 to be the eigenvector corresponding to the smaller of the two eigenvalues; hence the principal subspace should be aligned with the eigenvector having the larger eigenvalue (in order to minimize the average squared projection distance).

9 12.1.2 Minimum-error formulation (4/4)
- General solution to the minimization of J for arbitrary M < D: choose the {u_i} to be eigenvectors of the covariance matrix, S u_i = λ_i u_i.
- The corresponding value of the distortion measure is then J = Σ_{i=M+1}^D λ_i, the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace.
- Canonical correlation analysis (CCA) is another linear dimensionality reduction technique: whereas PCA works with a single random variable, CCA considers two (or more) variables and tries to find a corresponding pair of linear subspaces that have high cross-correlation.

10 12.1.3 Applications of PCA (1/4)
- Data compression: the first five eigenvectors, along with the corresponding eigenvalues.
- (Figures: the eigenvalue spectrum, and the distortion measure J introduced by projecting the data onto a principal-component subspace of dimensionality M.)

11 12.1.3 Applications of PCA (2/4)
- The PCA approximation to a data vector x_n takes the form x̃_n = x̄ + Σ_{i=1}^M (x_n^T u_i − x̄^T u_i) u_i.
- (Figure: PCA reconstructions of data points using M principal components, for several values of M.)
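Not part of the original slides: a small NumPy sketch of this reconstruction formula on synthetic data, which also checks numerically that the distortion J equals the sum of the discarded eigenvalues (slide 9). Data, seed, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 1000, 5, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))  # correlated toy data

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
eigvals, U = np.linalg.eigh(S)
eigvals, U = eigvals[::-1], U[:, ::-1]         # sort in descending order

U_M = U[:, :M]                                 # basis of the principal subspace
# x_tilde_n = x_bar + sum_i (x_n^T u_i - x_bar^T u_i) u_i
X_tilde = x_bar + (X - x_bar) @ U_M @ U_M.T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print("distortion J                 :", J)
print("sum of discarded eigenvalues :", eigvals[M:].sum())   # should agree with J
```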

12 12.1.3 Applications of PCA (3/4)
- Data pre-processing: the transformation of a data set in order to standardize certain of its properties, to aid subsequent pattern recognition algorithms.
- PCA can perform a more substantial normalization of the data ('whitening') to give it zero mean and unit covariance: y_n = L^{-1/2} U^T (x_n − x̄), where L is the D x D diagonal matrix with elements λ_i and U is the D x D orthogonal matrix with columns given by the u_i.
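Not part of the original slides: a brief NumPy sketch of the whitening transformation just described, verifying on toy data that the result has approximately zero mean and identity covariance. The data set and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
lam, U = np.linalg.eigh(S)                   # S = U L U^T

# y_n = L^{-1/2} U^T (x_n - x_bar): zero mean, unit covariance
Y = (X - x_bar) @ U @ np.diag(1.0 / np.sqrt(lam))

print("mean of y      :", Y.mean(axis=0))                        # ~ 0
print("covariance of y:\n", np.cov(Y, rowvar=False, bias=True))  # ~ identity
```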

13 12.1.3 Applications of PCA (4/4)
- Comparison of PCA with the Fisher linear discriminant for linear dimensionality reduction: PCA chooses the direction of maximum variance, whereas the Fisher linear discriminant takes account of the class-label information.
- Data visualization: the oil flow data set, visualized by projecting the data onto the first two principal components.

14 12.1.4 PCA for high-dimensional data
- When N < D, direct eigendecomposition of the D x D covariance matrix is computationally infeasible: O(D³).
- To resolve this, define X to be the (N x D)-dimensional centered data matrix, whose n-th row is given by (x_n − x̄)^T, so that S = (1/N) X^T X.
- Instead solve the eigenvector problem for the N x N matrix (1/N) X X^T: if (1/N) X X^T v_i = λ_i v_i (with v_i = X u_i), then X^T v_i is an eigenvector of S with the same eigenvalue λ_i; normalizing gives u_i = X^T v_i / (N λ_i)^{1/2}.
- This solves the eigenvector problem in a space of lower dimensionality, with computational cost O(N³) instead of O(D³).
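Not part of the original slides: a NumPy sketch of this N < D trick, recovering an eigenvector of S from the smaller N x N eigenproblem and checking the eigenvector equation. Dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, M = 50, 500, 3                        # N < D: a D x D eigendecomposition is wasteful
X = rng.standard_normal((N, D))
Xc = X - X.mean(axis=0)                     # centered data matrix (N x D)

# Work with the N x N matrix (1/N) X X^T instead of the D x D covariance.
K = Xc @ Xc.T / N
lam, V = np.linalg.eigh(K)
lam, V = lam[::-1][:M], V[:, ::-1][:, :M]   # top-M eigenvalues / eigenvectors v_i

# Recover unit-norm eigenvectors of S: u_i = X^T v_i / sqrt(N * lambda_i)
U = Xc.T @ V / np.sqrt(N * lam)

S = Xc.T @ Xc / N
print("residual of S u_1 - lambda_1 u_1:",
      np.linalg.norm(S @ U[:, 0] - lam[0] * U[:, 0]))   # ~ 0
```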

15 Probabilistic PCA
- PCA can be expressed naturally as the maximum likelihood solution to a particular form of linear-Gaussian latent variable model.
- This probabilistic reformulation brings many advantages:
  - the use of EM for parameter estimation
  - principled extensions to mixtures of probabilistic PCA models
  - Bayesian formulations that allow the number of principal components to be determined automatically from the data
  - etc.

16 Probabilistic PCA is a simple example of the linear-Gaussian framework. It can be formulated by first introducing an explicit latent variable z corresponding to the principal-component subspace.
- Prior distribution over z: p(z) = N(z | 0, I).
- Conditional distribution of the observed variable x, conditioned on the value of the latent variable z: p(x | z) = N(x | Wz + μ, σ² I).

17 The D-dimensional observed variable x is defined by a linear transformation of the M-dimensional latent variable z plus additive Gaussian 'noise': x = Wz + μ + ε, where ε is a D-dimensional zero-mean Gaussian-distributed noise variable with covariance σ² I.
This framework is based on a mapping from latent space to data space, in contrast to the more conventional view of PCA.

18 We determine the values of the parameters W, μ, and σ² using maximum likelihood. The marginal distribution p(x) is again Gaussian, with mean μ and covariance C = W W^T + σ² I, so that p(x) = N(x | μ, C).
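Not part of the original slides: a NumPy sketch that samples from this generative model and checks that the sample covariance approaches C = W W^T + σ² I. The parameters W, μ, and σ² below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, N = 4, 2, 200_000
W = rng.standard_normal((D, M))
mu = np.array([1.0, 0.0, -1.0, 2.0])
sigma2 = 0.25

Z = rng.standard_normal((N, M))                       # z_n ~ N(0, I)
eps = np.sqrt(sigma2) * rng.standard_normal((N, D))   # noise ~ N(0, sigma^2 I)
X = Z @ W.T + mu + eps                                # x_n = W z_n + mu + eps_n

C = W @ W.T + sigma2 * np.eye(D)                      # predicted marginal covariance
print("max |sample cov - C|:", np.abs(np.cov(X, rowvar=False) - C).max())
```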

19 When we evaluate the predictive distribution, we require C^{-1}, which involves the inversion of a D x D matrix. Using the identity C^{-1} = σ^{-2} I − σ^{-2} W M^{-1} W^T, where M = W^T W + σ² I is an M x M matrix, we invert M rather than C directly, so the cost of evaluating C^{-1} is reduced from O(D³) to O(M³).
As well as the predictive distribution p(x), we will also require the posterior distribution p(z | x) = N(z | M^{-1} W^T (x − μ), σ² M^{-1}).
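Not part of the original slides: a short NumPy check of the inversion identity above, comparing the M x M route with a direct D x D inverse. Sizes and the random W are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, sigma2 = 50, 3, 0.1
W = rng.standard_normal((D, M))

C = W @ W.T + sigma2 * np.eye(D)                    # D x D
Mmat = W.T @ W + sigma2 * np.eye(M)                 # M x M

# C^{-1} = sigma^{-2} I - sigma^{-2} W M^{-1} W^T : only an M x M system is solved
C_inv_fast = (np.eye(D) - W @ np.linalg.solve(Mmat, W.T)) / sigma2

print("max error vs direct inverse:",
      np.abs(C_inv_fast - np.linalg.inv(C)).max())
```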

20 Maximum likelihood PCA
- Log-likelihood function: ln p(X | μ, W, σ²) = Σ_n ln p(x_n | W, μ, σ²) = −(ND/2) ln(2π) − (N/2) ln|C| − (1/2) Σ_n (x_n − μ)^T C^{-1} (x_n − μ).
- Setting the derivative with respect to μ to zero gives μ = x̄. Because the log-likelihood function is a quadratic function of μ, this solution represents the unique maximum, as can be confirmed by computing second derivatives.

21 Maximization with respect to W and σ² is more complex but nonetheless has an exact closed-form solution.
- All of the stationary points of the log-likelihood function can be written as W = U_M (L_M − σ² I)^{1/2} R, where U_M is a D x M matrix whose columns are given by any subset (of size M) of the eigenvectors of the data covariance matrix S, L_M is an M x M diagonal matrix whose elements are the corresponding eigenvalues λ_i, and R is an arbitrary M x M orthogonal matrix.
- The maximum of the likelihood is obtained when U_M comprises the M leading eigenvectors, in which case σ²_ML = (1/(D − M)) Σ_{i=M+1}^D λ_i, the average variance associated with the discarded directions.
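Not part of the original slides: a compact NumPy sketch of this closed-form maximum likelihood fit, choosing R = I for simplicity. The synthetic data set and function name are illustrative.

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum likelihood probabilistic PCA fit (sketch; R = I)."""
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                      # descending eigenvalues
    sigma2 = lam[M:].mean()                             # average of discarded eigenvalues
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))   # W = U_M (L_M - sigma^2 I)^{1/2}
    return W, x_bar, sigma2

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 6)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.3, 0.3])
W_ml, mu_ml, sigma2_ml = ppca_ml(X, M=2)
print("sigma^2_ML:", sigma2_ml)
```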

22 Conventional PCA is generally formulated as a projection of points from the D-dimensional data space onto an M-dimensional linear subspace. Probabilistic PCA, however, is most naturally expressed as a mapping from the latent space into the data space. We can reverse this mapping using Bayes' theorem.
- Number of degrees of freedom in the covariance matrix C = W W^T + σ² I: DM + 1 − M(M − 1)/2, which grows only linearly in D.
- Cf. the full covariance matrix of a Gaussian distribution, which has D(D + 1)/2 independent parameters.

23 EM algorithms for PCA
- The EM algorithm can be used to find maximum likelihood estimates of the model parameters.
- This may seem rather pointless, because we have already obtained an exact closed-form solution for the maximum likelihood parameter values.
- In spaces of high dimensionality, however, there may be computational advantages in using an iterative EM procedure rather than working directly with the sample covariance matrix.
- The EM procedure can also be extended to the factor analysis model, for which there is no closed-form solution.
- It allows missing data to be handled in a principled way.

24 E step (using the current W and σ²): E[z_n] = M^{-1} W^T (x_n − x̄) and E[z_n z_n^T] = σ² M^{-1} + E[z_n] E[z_n]^T, where M = W^T W + σ² I.
M step: W_new = [ Σ_n (x_n − x̄) E[z_n]^T ] [ Σ_n E[z_n z_n^T] ]^{-1} and σ²_new = (1/(ND)) Σ_n { ||x_n − x̄||² − 2 E[z_n]^T W_new^T (x_n − x̄) + Tr( E[z_n z_n^T] W_new^T W_new ) }.
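Not part of the original slides: a NumPy sketch of these E and M steps for probabilistic PCA, assuming the update equations as written above; data set, iteration count, and names are illustrative.

```python
import numpy as np

def ppca_em(X, M, n_iter=100, seed=0):
    """EM for probabilistic PCA: a sketch of the E/M steps on the slide above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    W = rng.standard_normal((D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                        # rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez       # sum_n E[z_n z_n^T]
        # M step: re-estimate W and sigma^2
        W_new = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, x_bar, sigma2

rng = np.random.default_rng(7)
X = rng.standard_normal((400, 5)) @ np.diag([3.0, 2.0, 0.5, 0.5, 0.5])
W_em, mu_em, s2_em = ppca_em(X, M=2)
print("sigma^2 from EM:", s2_em)
```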

25 Benefits of the EM algorithm for PCA
- Dealing with missing data.
- Computational efficiency for large-scale applications.

26 Benefits of the EM algorithm for PCA
Another elegant feature of the EM approach is that we can take the limit σ² → 0, corresponding to standard PCA, and still obtain a valid EM-like algorithm:
- E step: Ω = (W^T W)^{-1} W^T X̃^T (orthogonal projection of the centered data onto the current estimate of the principal subspace), where X̃ is the centered data matrix and the n-th column of Ω holds E[z_n].
- M step: W_new = X̃^T Ω^T (Ω Ω^T)^{-1} (re-estimation of the subspace by least squares).


28 Model selection
- In practice, we must choose a value M for the dimensionality of the principal subspace according to the application.
- Because the probabilistic PCA model has a well-defined likelihood function, we could employ cross-validation to determine the dimensionality by selecting the value that gives the largest log likelihood on a validation data set. However, this approach can become computationally costly.
- Given that we have a probabilistic formulation of PCA, it seems natural to seek a Bayesian approach to model selection.

29 Bayesian PCA
- Specifically, we define an independent Gaussian prior over each column of W; the columns w_i are the vectors defining the principal subspace.
- Each Gaussian has an independent variance governed by a precision hyperparameter α_i: p(W | α) = Π_{i=1}^M (α_i / 2π)^{D/2} exp{ −(α_i / 2) w_i^T w_i }.
- The effective dimensionality of the principal subspace is determined by the number of finite α_i values, and the corresponding vectors w_i can be thought of as 'relevant' for modeling the data distribution.

30 Bayesian PCA
- The α_i are re-estimated by maximizing the marginal likelihood, giving α_i^new = D / (w_i^T w_i).
- E step: as for probabilistic PCA.
- M step: the update for W is modified to W_new = [ Σ_n (x_n − x̄) E[z_n]^T ] [ Σ_n E[z_n z_n^T] + σ² A ]^{-1}, where A = diag(α_i); the update for σ² is unchanged.
- The Bayesian approach automatically makes the trade-off between improving the fit to the data, by using a larger number of vectors w_i with their corresponding eigenvalues λ_i each tuned to the data, and reducing the complexity of the model by suppressing some of the w_i vectors.
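Not part of the original slides: a NumPy sketch of one EM-style Bayesian PCA update, assuming the modified M step and the α re-estimation equation as stated above. The data set, initialization, and names are illustrative.

```python
import numpy as np

def bayesian_pca_update(Xc, W, sigma2, alpha):
    """One EM-style update for Bayesian PCA (sketch); alpha holds the ARD precisions."""
    N, D = Xc.shape
    M = W.shape[1]
    # E step (as for probabilistic PCA)
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
    Ez = Xc @ W @ Minv                                  # rows are E[z_n]
    Ezz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
    # M step for W acquires a regularizing term sigma^2 * diag(alpha)
    W_new = Xc.T @ Ez @ np.linalg.inv(Ezz + sigma2 * np.diag(alpha))
    sigma2_new = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
    # Re-estimate the precisions; a large alpha_i effectively switches column w_i off
    alpha_new = D / np.sum(W_new ** 2, axis=0)
    return W_new, sigma2_new, alpha_new

rng = np.random.default_rng(8)
X = rng.standard_normal((300, 6)) @ np.diag([3.0, 2.0, 1.0, 0.2, 0.2, 0.2])
Xc = X - X.mean(axis=0)
W, sigma2, alpha = rng.standard_normal((6, 5)), 1.0, np.ones(5)
for _ in range(200):
    W, sigma2, alpha = bayesian_pca_update(Xc, W, sigma2, alpha)
print("1/alpha per column:", 1.0 / alpha)   # small values indicate suppressed directions
```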

31 (Figure: 300 data points (D = 10) sampled from a Gaussian distribution having standard deviation 1.0 in 3 directions and 0.5 in the remaining 7 directions.)
(Figure: Gibbs sampling for Bayesian PCA, applied to a data set (D = 4, M = 3) generated from a probabilistic PCA model having one direction of high variance, with the remaining directions comprising low-variance noise.)

32 Factor analysis
- Factor analysis is a linear-Gaussian latent variable model that is closely related to probabilistic PCA.
- Its definition differs from that of probabilistic PCA only in that the conditional distribution of the observed variable x given the latent variable z is taken to have a diagonal rather than an isotropic covariance: p(x | z) = N(x | Wz + μ, Ψ), where Ψ is a D x D diagonal matrix.

33 Factor analysis
- E step: E[z_n] = G W^T Ψ^{-1} (x_n − x̄) and E[z_n z_n^T] = G + E[z_n] E[z_n]^T, where G = (I + W^T Ψ^{-1} W)^{-1}.
- M step: W_new = [ Σ_n (x_n − x̄) E[z_n]^T ] [ Σ_n E[z_n z_n^T] ]^{-1} and Ψ_new = diag{ S − W_new (1/N) Σ_n E[z_n] (x_n − x̄)^T }.
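Not part of the original slides: a NumPy sketch of EM for factor analysis, assuming the E/M equations as written above; the small floor on Ψ is a numerical safeguard I added, and the data set and names are illustrative.

```python
import numpy as np

def factor_analysis_em(X, M, n_iter=200, seed=0):
    """EM for factor analysis (sketch): diagonal noise covariance Psi."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = np.cov(X, rowvar=False, bias=True)
    W = rng.standard_normal((D, M))
    psi = np.ones(D)                                            # diagonal of Psi
    for _ in range(n_iter):
        # E step
        G = np.linalg.inv(np.eye(M) + W.T @ (W / psi[:, None]))  # (I + W^T Psi^{-1} W)^{-1}
        Ez = Xc @ (W / psi[:, None]) @ G                         # rows: G W^T Psi^{-1} (x_n - x_bar)
        Ezz = N * G + Ez.T @ Ez                                  # sum_n E[z_n z_n^T]
        # M step
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        psi = np.maximum(np.diag(S - W @ (Ez.T @ Xc) / N), 1e-6)  # keep Psi positive
    return W, x_bar, psi

rng = np.random.default_rng(9)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
W_fa, mu_fa, psi_fa = factor_analysis_em(X, M=2)
print("estimated Psi diagonal:", psi_fa)
```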

34 Differences between FA and PCA
- Unlike probabilistic PCA, there is no longer a closed-form maximum likelihood solution for W, which must therefore be found iteratively.
- Different behavior under transformations of the data set:
  - For PCA and probabilistic PCA, if we rotate the coordinate system in data space, we obtain exactly the same fit to the data but with the W matrix transformed by the corresponding rotation matrix.
  - For factor analysis, the analogous property is that if we make a component-wise re-scaling of the data vectors, this is absorbed into a corresponding re-scaling of the elements of Ψ.

35 Kernel PCA
- Apply the technique of kernel substitution to principal component analysis to obtain a nonlinear generalization.

36 Conventional PCA vs. kernel PCA
- Conventional PCA (assuming the mean of the vectors x_n is zero): S u_i = λ_i u_i, with S = (1/N) Σ_n x_n x_n^T.
- Kernel PCA uses a nonlinear transformation φ(x) into an M-dimensional feature space (assume for now that the projected data set also has zero mean): C v_i = λ_i v_i, with C = (1/N) Σ_n φ(x_n) φ(x_n)^T.

37 Solving the eigenvalue problem
- Since C v_i = λ_i v_i, each eigenvector can be written as a linear combination of the feature vectors, v_i = Σ_n a_in φ(x_n).
- Expressed in terms of the kernel function k(x_n, x_m) = φ(x_n)^T φ(x_m), in matrix notation this becomes K² a_i = λ_i N K a_i, whose solutions can be found by solving K a_i = λ_i N a_i, where K is the N x N Gram matrix with elements K_nm = k(x_n, x_m).

38 The normalization condition for the coefficients a_i is obtained by requiring that the eigenvectors in feature space be normalized: 1 = v_i^T v_i = a_i^T K a_i = λ_i N a_i^T a_i.
Having solved the eigenvector problem, the resulting principal component projections can also be cast in terms of the kernel function, so that the projection of a point x onto eigenvector i is given by y_i(x) = φ(x)^T v_i = Σ_n a_in k(x, x_n).

39 In the original D-dimensional x space:
- D orthogonal eigenvectors, hence at most D linear principal components.
In the M-dimensional feature space:
- We can find a number of nonlinear principal components that can exceed D.
- However, the number of nonzero eigenvalues cannot exceed the number N of data points, because (even if M > N) the covariance matrix in feature space has rank at most equal to N.

40 In the general case the projected data set does not have zero mean, so we work with the centered Gram matrix K̃ = K − 1_N K − K 1_N + 1_N K 1_N, where 1_N denotes the N x N matrix in which every element takes the value 1/N. We then use K̃ in place of K to determine the eigenvalues and eigenvectors.
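Not part of the original slides: a NumPy sketch of kernel PCA using a Gaussian (RBF) kernel, including the Gram-matrix centering just described and the normalization λ_i N a_i^T a_i = 1. The kernel choice, gamma value, and toy data set are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix with entries k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA sketch: centre the Gram matrix, solve K a = lambda N a, project."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centred Gram matrix
    eigvals, A = np.linalg.eigh(K_tilde)                      # eigvals are N * lambda_i
    eigvals, A = eigvals[::-1][:M], A[:, ::-1][:, :M]
    A = A / np.sqrt(eigvals)               # so that lambda_i N a_i^T a_i = 1
    return K_tilde @ A                     # projections y_i(x_n) of the training points

rng = np.random.default_rng(10)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))  # noisy circle
Y = kernel_pca(X, M=2, gamma=2.0)
print("projected shape:", Y.shape)
```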


42 Nonlinear Latent Variable Models
- Some generalizations of the continuous latent variable framework to models that are either nonlinear or non-Gaussian, or both:
  - Independent component analysis (ICA)
  - Autoassociative neural networks
  - etc.

43 Independent component analysis
- The observed variables are related linearly to the latent variables, but the latent distribution is non-Gaussian.
- Example application: blind source separation.

44 Autoassociative neural networks
- Neural networks have been applied to unsupervised learning, where they have been used for dimensionality reduction.
- This is achieved by using a network having the same number of outputs as inputs, and optimizing the weights so as to minimize some measure of the reconstruction error between inputs and outputs with respect to a set of training data.
- The vectors of weights that lead into the hidden units form a basis set that spans the principal subspace. These vectors need not be orthogonal or normalized.
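Not part of the original slides: a minimal NumPy sketch of a two-layer linear autoassociative network trained by batch gradient descent on the squared reconstruction error; with linear units and M < D hidden units, the learned encoder weights come to span the principal subspace (not necessarily orthonormal). Data, learning rate, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
N, D, M, lr = 500, 5, 2, 0.01
X = rng.standard_normal((N, D)) @ np.diag([3.0, 2.0, 0.3, 0.3, 0.3])
Xc = X - X.mean(axis=0)

W1 = 0.1 * rng.standard_normal((D, M))   # encoder weights (inputs -> hidden units)
W2 = 0.1 * rng.standard_normal((M, D))   # decoder weights (hidden units -> outputs)
for _ in range(2000):
    H = Xc @ W1                          # hidden activations
    E = H @ W2 - Xc                      # reconstruction error
    grad_W2 = H.T @ E / N                # gradient of (1/2N)||E||^2 w.r.t. W2
    grad_W1 = Xc.T @ (E @ W2.T) / N      # gradient w.r.t. W1
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print("final mean squared reconstruction error:",
      np.mean((Xc @ W1 @ W2 - Xc) ** 2))
```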

45 Addition of extra hidden layers with nonlinear units allows the network to perform a nonlinear dimensionality reduction.

46 Modeling nonlinear manifolds
- One way to model the nonlinear structure is through a combination of linear models, so that we make a piecewise-linear approximation to the manifold.
- An alternative to a mixture of linear models is to consider a single nonlinear model:
  - Conventional PCA finds a linear subspace that passes close to the data in a least-squares sense.
  - This concept can be extended to one-dimensional nonlinear surfaces in the form of principal curves.
  - Principal curves can be generalized to multidimensional manifolds called principal surfaces.

47 Other approaches:
- Multidimensional scaling (MDS)
- Locally linear embedding (LLE)
- Isometric feature mapping (isomap)
- Models having continuous latent variables together with discrete observed variables: latent trait models
- Density networks

48 Generative topographic mapping (GTM)
- It uses a latent distribution that is defined by a finite regular grid of delta functions over the (typically two-dimensional) latent space.
Self-organizing map (SOM)
- The GTM can be seen as a probabilistic version of the earlier SOM model.
- The SOM represents a two-dimensional nonlinear manifold as a regular array of discrete points.

