
1 Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model: Sparse Coding, Independent Component Analysis, and Minimum Entropy Segmentation
Jason Palmer
Department of Electrical and Computer Engineering, University of California San Diego

2 Introduction
1. Unsupervised learning of structure in continuous sensor data
   1. Data must be analyzed into component parts – reduced to a set of states of the world which are active or not active in various combinations
2. Probabilistic modeling – states
   1. Linear model
      1. Basis sets
      2. Hierarchical linear processes
      3. Also kernel non-linear
   2. Probability model
      1. Distributions of input variables – types of densities
      2. Conditionally independent inputs – Markov connection of states
3. Thesis topics
   1. Types of distributions and representations that lead to efficient and monotonic algorithms using non-Gaussian densities
   2. Calculating probabilities in the linear process model

3 Linear Process Model
General form of the model:
Sparse coding: x(t) = A s(t), A overcomplete
ICA: x(t) = A s(t), A invertible, s(t) = A^{-1} x(t)
Blind deconvolution:
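
One plausible way to write the general linear process model and the blind deconvolution case, in illustrative LaTeX notation; the filter notation A(k), A(z) is an assumption consistent with the convolutive-model slides that follow, not necessarily the thesis's exact formulas:

x(t) \;=\; \sum_{k} A(k)\, s(t-k), \qquad A(z) \;=\; \sum_{k} A(k)\, z^{-k}, \qquad s(t)\ \text{i.i.d., non-Gaussian.}

Sparse coding and ICA are the instantaneous special case A(k) = A\,\delta(k); blind deconvolution keeps the full filter A(z).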

4 Voice and Microphone
Block diagram: an IID process is passed through linear filters H_T(z), H_R(z), H_M(z) to produce the observed process.

5 Sensor Arrays
Diagram: Source 1 and Source 2 impinging on a sensor array.

6 Convolutive Model

7 Different Impulse Response

8 EEG sources

9 Binocular Color Image
- Represented as a 2D field of 6D vectors
- Binocular video can be represented as a 3D field of 6D vectors
- Use block basis or mixture of filters
Diagram: left (L) and right (R) channels, each with r, g, b components.

10 Biological Systems
Diagram: the WORLD produces observations x(t), which the system maps to an internal REPRESENTATION.

11 Linear Process Mixture Model
Plot of speech signal: a woman speaking the word "sixteen", with the phonemes (s-i-k-s-t-e-e-n) annotated over the waveform.
Clearly speech is non-stationary, but it seems to be locally stationary.

12 Source Model
Diagram: the source s(t) is generated by a bank of r linear filters, filter 1 through filter r, each driven by m IID innovation processes.

13 Observation Segment Model
Diagram: the mixing filter A(z) maps the source s(t) to the observation x(t).

14 Generative Model
Diagram: M mixing filters A_1(z), ..., A_M(z) produce segment outputs x_1(t), ..., x_M(t), which together form the observation x(t).

15 Outline
I. Types of probability densities
   A. Sub- and super-Gaussianity
   B. Representation in terms of Gaussians
      1. Convex variational representation – strong super-Gaussians
      2. Gaussian scale mixtures
         a. Multivariate Gaussian scale mixtures (ISA / IVA)
      3. Relationship between representations
II. Sparse coding and dictionary learning
   A. Optimization with given (overcomplete) basis
      1. MAP – Generalized FOCUSS
         a. Global convergence of Iteratively Reweighted Least Squares (IRLS)
         b. Convergence rates
      2. Variational Bayes (VB) – Sparse Bayesian Learning
   B. Dictionary learning
      1. Lagrangian Newton algorithm
      2. Comparison in Monte Carlo experiment
III. Independent Component Analysis
   A. Convexity of the ICA optimization problem – stability of ICA
      1. Fisher Information and Cramer-Rao lower bound on variance
   B. Super-Gaussian mixture source model
      1. Comparison between Gaussian and super-Gaussian updates
IV. Linear Process Mixture Model
   A. Probability of signal segments
   B. Mixture model segmentation

16 Sub- and Super-Gaussianity
The component density determines the shape of the data distribution along the direction of the corresponding basis vector.
Super-Gaussian = concentrated near zero with some large values; more peaked than the Gaussian, with heavier tails.
Sub-Gaussian = roughly uniform around zero with no large values; flatter and more uniform, with shorter tails than the Gaussian.
Generalized Gaussian family, proportional to exp(-|x|^p): Laplacian (p = 1.0), Gaussian (p = 2.0), sub-Gaussian (p = 10.0).
Super-Gaussians represent sparse random variables: most often near zero, occasionally of large magnitude. Sparse random variables model variables with on/off, active/inactive states.
Figure: example densities labeled super-Gaussian, sub-Gaussian, Gaussian, and sub- and super-Gaussian.
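
As a quick numerical illustration (not part of the thesis), the sub/super-Gaussian distinction can be checked with the excess kurtosis of generalized Gaussian samples; the scipy generalized-normal generator and the specific sample size are incidental choices:

import numpy as np
from scipy.stats import gennorm, kurtosis

rng = np.random.default_rng(0)
for p in (1.0, 2.0, 10.0):            # Laplacian, Gaussian, strongly sub-Gaussian
    x = gennorm.rvs(beta=p, size=200_000, random_state=rng)
    # Fisher (excess) kurtosis: > 0 super-Gaussian, = 0 Gaussian, < 0 sub-Gaussian
    print(f"p = {p:4.1f}   excess kurtosis = {kurtosis(x):+.2f}")

The Laplacian (p = 1) should come out strongly positive (around +3), the Gaussian near zero, and the p = 10 case negative (around -1), matching the peaked/heavy-tailed versus flat/short-tailed description above.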

17 Convex Variational Representation
Convex/concave functions are the pointwise supremum/infimum of families of linear functions.
A convex function f(x) may be concave in x^2, i.e. f(x) = g(x^2) with g concave on (0, ∞).
Example: |x|^{3/2} is convex in x, but (x^2)^{3/4} is concave in x^2.
Example: |x|^4 is convex in x, and (x^2)^2 is still convex in x^2.
If f(x) is concave in x^2 and p(x) = exp(-f(x)), then p(x) is a pointwise supremum of scaled Gaussian functions; we say p(x) is strongly super-Gaussian. If f(x) is convex in x^2, p(x) is instead a pointwise infimum of scaled Gaussian functions.
Figure: |x|^{3/2} (concave in x^2) and |x|^4 (convex in x^2) plotted against x^2.
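
A minimal sketch of the representation in LaTeX notation, assuming the concave case f(x) = g(x^2) with g concave on (0, ∞); the symbol ξ and the exact form of the bound are standard convex-duality notation, not necessarily the thesis's formulas:

f(x) \;=\; g(x^2) \;=\; \inf_{\xi \ge 0} \Big[ \tfrac{\xi}{2}\, x^2 - g^{*}\!\big(\tfrac{\xi}{2}\big) \Big]
\qquad\Longrightarrow\qquad
p(x) \;=\; e^{-f(x)} \;=\; \sup_{\xi \ge 0}\, e^{-\xi x^2/2}\, \varphi(\xi), \qquad \varphi(\xi) = e^{\,g^{*}(\xi/2)} .

Here g^{*} is the concave conjugate of g, so p(x) is a pointwise supremum of scaled Gaussian functions, with the bound tight at ξ = f'(x)/x; in the convex case the supremum becomes an infimum.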

18 Scale Mixture Representation
Gaussian Scale Mixtures (GSMs) are mixtures of zero-mean Gaussian densities with different variances.
A random variable with a GSM density can be represented as the product of a standard normal random variable Z and an arbitrary non-negative random variable W: X = Z W^{-1/2}.
Figure: individual Gaussians and the resulting Gaussian scale mixture.
Multivariate densities can be modeled as the product of a non-negative scalar and a Gaussian random vector.
Contribution: a general formula for multivariate GSMs.
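
A small numerical sketch (a standard textbook fact rather than anything specific to the thesis): the unit Laplacian is a GSM obtained by drawing the variance from an exponential distribution. Here the mixing variable is written as a variance, i.e. the reciprocal of the W in X = Z W^{-1/2} above:

import numpy as np

rng = np.random.default_rng(1)
n = 500_000
v = rng.exponential(scale=2.0, size=n)        # mixing variance V ~ Exp(mean 2)
x_gsm = np.sqrt(v) * rng.standard_normal(n)   # X = sqrt(V) * Z, a Gaussian scale mixture
x_lap = rng.laplace(scale=1.0, size=n)        # direct unit-Laplacian samples for comparison

q = [0.05, 0.25, 0.50, 0.75, 0.95]
print(np.round(np.quantile(x_gsm, q), 2))     # the two quantile rows should agree closely
print(np.round(np.quantile(x_lap, q), 2))

Both printouts should be close to (-2.30, -0.69, 0, 0.69, 2.30), the quantiles of the unit Laplacian.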

19 Relationship between Representations
Criterion for p(x) = exp(-f(x)) = exp(-g(x^2)) to have the convex variational representation: g must be concave on (0, ∞).
Criterion for the GSM representation, given by the Bernstein-Widder theorem: p(√u) must be completely monotone (CM) in u.
Relationship between the two (Bochner): if p(√u) is CM, then g(u) = -log p(√u) is concave, and thus the GSM representation implies the convex variational representation.
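
For reference, the two criteria in illustrative LaTeX notation (with p(x) = e^{-f(x)}, f(x) = g(x^2)); this is a sketch of the standard statements, not the thesis's exact wording:

\text{Variational (strongly super-Gaussian):}\quad g \ \text{concave on } (0,\infty);
\qquad
\text{GSM (Bernstein--Widder):}\quad (-1)^n \,\frac{d^n}{du^n}\, p(\sqrt{u}) \;\ge\; 0 \quad \forall\, n \ge 0.

The implication runs one way because a completely monotone function is the Laplace transform of a positive measure, and Laplace transforms are log-convex; hence -log p(√u) is concave, which is exactly the variational criterion.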

20 Sparse Regression – Variational MAP
Bayesian linear model x = As + v: basis A, sources s, noise v.
For strongly super-Gaussian priors, p(s) = exp(-f(s)), the MAP objective can be bounded by a quadratic in s.
Sources are independent, so the cost function is f(s) = Σ_i f(s_i) and the weight matrix Λ(s) is diagonal.
The problem can always be put in the form: min f(s) subject to As = x, with A overcomplete.
Solving the quadratic subproblem: s_old already satisfies As = x, so the right-hand side is non-positive, and hence so is the left-hand side; the cost decreases monotonically.
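
A sketch of the resulting update in generic IRLS notation; the diagonal weight Λ(s) = diag(f'(s_i)/s_i) and the constrained least-squares form are standard and are assumed to match the formulas omitted from this slide:

\Lambda_k \;=\; \mathrm{diag}\!\big( f'(s_i^{k}) / s_i^{k} \big), \qquad
s^{\,k+1} \;=\; \arg\min_{As = x}\ \tfrac{1}{2}\, s^{\top} \Lambda_k\, s
\;=\; \Lambda_k^{-1} A^{\top} \big( A \Lambda_k^{-1} A^{\top} \big)^{-1} x .

Each step solves a weighted least-squares problem on the constraint set, and the weights are then recomputed from the new s; this is the iteratively reweighted least squares structure referred to on the following slides.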

21 Sparse Regression – MAP – GSM
For a Gaussian scale mixture prior p(s), we have s = z ξ^{-1/2}, and s is conditionally Gaussian given ξ, which suggests an EM algorithm.
The complete-data log likelihood is quadratic in s (since s is conditionally Gaussian) and linear in ξ.
For EM we need the expected value of ξ given the data; since ξ → s → x is a Markov chain, this reduces to E[ξ | s].
The GSM EM algorithm is thus the same as the strong super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS).
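
The key E-step quantity can be written down for any GSM; this is a standard identity (in illustrative notation with mixing variable ξ), included here as a sketch of why the two algorithms coincide:

p(s) = \int \mathcal{N}(s;\,0,\,\xi^{-1})\, d\mu(\xi)
\;\Longrightarrow\;
p'(s) = -\,s \int \xi\, \mathcal{N}(s;\,0,\,\xi^{-1})\, d\mu(\xi)
\;\Longrightarrow\;
\mathbb{E}[\xi \mid s] \;=\; -\frac{p'(s)}{s\, p(s)} \;=\; \frac{f'(s)}{s}, \qquad f = -\log p .

This posterior mean is exactly the IRLS weight Λ_ii = f'(s_i)/s_i from the previous slide, so the EM iteration and the strong super-Gaussian bound optimization perform the same reweighted least-squares step.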

22 Generalized FOCUSS
The FOCUSS algorithm is a particular MAP algorithm for sparse regression with f(s) = |s|^p or f(s) = log s. It was derived by Gorodnitsky and Rao (1997), and Rao and Kreutz-Delgado (1998).
With an arbitrary strongly super-Gaussian source prior we obtain Generalized FOCUSS (a sketch of the iteration is given below).
Convergence is proved using Zangwill's Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of the iterates, and (3) closure (continuity) of the algorithm mapping.
We prove a general theorem on boundedness of IRLS iterations with a diagonal weight matrix: the least squares solution always lies in the bounded part of the orthant-constraint intersection.
We also derive the convergence rate of Generalized FOCUSS when f(s) is convex. The convergence rate for concave f(s) was proved by Gorodnitsky and Rao; we give an alternative proof.
Figure: bounded versus unbounded orthant-constraint intersections and the least squares solution.
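
A hedged numerical sketch of a FOCUSS-style IRLS iteration for the constrained problem min Σ_i f(s_i) subject to As = x, using f(s) = |s|^p; the epsilon regularization, the choice p = 0.8, and the toy problem are illustrative choices, not the thesis's implementation:

import numpy as np

def generalized_focuss(A, x, p=0.8, iters=50, eps=1e-9):
    """IRLS for min sum_i |s_i|^p subject to A s = x, with A overcomplete (m < n)."""
    s = A.T @ np.linalg.solve(A @ A.T, x)            # start from the minimum-norm solution
    for _ in range(iters):
        pi = (np.abs(s) + eps) ** (2.0 - p)          # diag of Lambda^{-1} for f(s) = |s|^p (up to a constant)
        s = pi * (A.T @ np.linalg.solve((A * pi) @ A.T, x))
    return s

# Toy check: an underdetermined 4 x 8 system with a 2-sparse source.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 8))
s_true = np.zeros(8); s_true[[1, 5]] = [2.0, -1.5]
print(np.round(generalized_focuss(A, A @ s_true), 2))

Each pass is one reweighted least-squares solve; the epsilon term only keeps the weights finite as components shrink toward zero.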

23 Variational Bayes
General form of Sparse Bayesian Learning / Type II ML:
– Find the normal density (mean and covariance) that minimizes an upper bound on the KL divergence from the true posterior density;
– OR: MAP estimate of the hyperparameters ξ in the GSM (instead of s);
– OR: a Variational Bayes algorithm which finds the separable posterior q(s|x) q(ξ|x) that minimizes the KL divergence from the true posterior p(s, ξ|x).
The bound is derived using a modified Jensen's inequality. Then minimize the bound by coordinate descent as before.
This is again IRLS with the same functional form, but now with different diagonal weights (a standard SBL-style update is sketched below).
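
A hedged sketch of one common form of the Type II ML / SBL update (an EM-style gamma update in the spirit of Tipping's relevance vector machine, with a fixed, assumed noise variance sigma2); this illustrates the "MAP estimate of the hyperparameters" view rather than the thesis's specific derivation:

import numpy as np

def sbl_em(A, x, sigma2=1e-3, iters=100):
    """Type II ML / SBL: x = A s + v, s_i ~ N(0, gamma_i); EM updates of gamma."""
    m, n = A.shape
    gamma = np.ones(n)
    for _ in range(iters):
        Sx = sigma2 * np.eye(m) + (A * gamma) @ A.T          # marginal covariance of x
        K = np.linalg.inv(Sx)
        mu = gamma * (A.T @ (K @ x))                         # posterior mean of s
        var = gamma - gamma**2 * np.einsum('ij,ji->i', A.T @ K, A)  # posterior variances
        gamma = mu**2 + var                                  # EM update of the hyperparameters
    return mu, gamma

The posterior mean mu again has a reweighted least-squares form, with the diagonal weights now coming from the hyperparameters gamma rather than directly from f'(s)/s, matching the slide's remark that only the diagonal weights change.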

24 Sparse Regression Example
An example of sparse regression with an overcomplete basis.
The line is the one-dimensional solution space (a translated null space).
Below, the posterior density p(s|x) in the null space is plotted for generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2.

25 Dictionary Learning
Problem: given data x_1, ..., x_N, find an (overcomplete) basis A for which As = x and the sources are sparse.
Three algorithms: (1) Lewicki-Sejnowski ICA, (2) Kreutz-Delgado FOCUSS-based, (3) Girolami VB-based.
We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado's algorithm.
These algorithms all share a common general form: (1) Lewicki-Sejnowski, (2) Kreutz-Delgado, (3) Girolami VB, (4) Lagrangian Newton.
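
For orientation, a generic alternating-minimization sketch of this general form: an l1 sparse-coding step (ISTA) followed by a least-squares dictionary step with column renormalization. This is a deliberately simplified stand-in, not an implementation of the Lewicki-Sejnowski, Kreutz-Delgado, Girolami, or Lagrangian Newton algorithms; lam, the iteration counts, and the normalization are illustrative choices:

import numpy as np

def ista(A, x, lam=0.1, iters=200):
    """Sparse-coding step: minimize 0.5*||x - A s||^2 + lam*||s||_1 by proximal gradient."""
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of the smooth part
    s = np.zeros(A.shape[1])
    for _ in range(iters):
        g = s + A.T @ (x - A @ s) / L
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # soft threshold
    return s

def learn_dictionary(X, n_atoms, outer=30, rng=None):
    """Alternate sparse coding of each column of X with a least-squares dictionary update."""
    rng = rng or np.random.default_rng(0)
    A = rng.standard_normal((X.shape[0], n_atoms))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(outer):
        S = np.column_stack([ista(A, X[:, t]) for t in range(X.shape[1])])
        A = X @ S.T @ np.linalg.pinv(S @ S.T)          # dictionary step (least squares)
        A /= np.linalg.norm(A, axis=0) + 1e-12         # keep columns unit norm
    return A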

26 Dictionary Learning
Monte Carlo experiment: generate random A matrices, sparse sources s, and data x = As, with N = 100m samples.
Algorithms tested:
– Girolami, p = 1.0, Jeffreys prior
– Lagrangian Newton, p = 1.0, p = 1.1
– Kreutz-Delgado, (non-)normalized
– Lewicki-Sejnowski, p = 1.1, logistic
Cases: A 2 x 3 with sparsity 1; A 4 x 8 with sparsity 2; A 10 x 20 with sparsity 1-5.

27 Sparse Coding of EEG
Goal: find synchronous "events" in multiple interesting components.
Learn a basis for segments of length 100 across 5 channels.
Events are rare, so the prior density is sparse.
Figure: EEG scalp maps.

28 EEG Segment Basis: Subspace 1
Experimental task: the subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise clicks the right button.
Each column is a basis vector: a segment of length 100 x 5 channels.
Only the second channel is active in this subspace – related to incorrect responses by the subject – the subject hears a buzzer when a wrong response is given.
Dictionary learning with time series must learn phase shifts.

29 EEG Segment Basis: Subspace 2
In this subspace, channels 1 and 3 are active.
The channel 3 crest slightly precedes the channel 1 crest.
This subspace is associated with correct responses.

30 EEG Segment Basis: Subspace 3
In this subspace, channels 1 and 2 have phase-shifted 20 Hz bursts.
This is not obviously associated with any recorded event.

31 ICA
ICA model: x = As, with A invertible, s = Wx.
Maximum likelihood estimate of W = A^{-1}, with independent sources.
Source densities are unknown and must be adapted – quasi-ML (Pham 92).
Since ML minimizes the KL divergence over a parametric family, ML with the ICA model is equivalent to minimizing mutual information.
If the sources are Gaussian, A cannot be identified, only the covariance.
If the sources are non-Gaussian, A can be identified (Cheng, Rosenblatt).
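
A hedged sketch of a maximum-likelihood ICA update in natural/relative gradient form, with a fixed tanh score function, i.e. a particular assumed super-Gaussian source density, whereas the thesis adapts the source densities; the step size and iteration count are arbitrary:

import numpy as np

def ml_ica(X, eta=0.05, iters=2000):
    """Natural-gradient ML ICA: X is (n_sources, n_samples), zero-mean; returns W ~ A^{-1}."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(iters):
        Y = W @ X
        W += eta * (np.eye(n) - np.tanh(Y) @ Y.T / N) @ W   # relative-gradient ascent step
    return W

# Toy check: unmix two Laplacian (super-Gaussian) sources.
rng = np.random.default_rng(3)
S = rng.laplace(size=(2, 5000))
A = rng.standard_normal((2, 2))
W = ml_ica(A @ S)
print(np.round(W @ A, 2))   # should approach a scaled permutation matrix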

32 ICA Hessian
Remarkably, the expected value of the Hessian of the ICA ML cost function can be calculated.
Work with the "global system" C = WA, whose optimum is always the identity, C* = I.
Using independence of the sources at the optimum, we can block diagonalize the Hessian linear operator H(B) = D in the global space into 2 x 2 blocks.
This gives the main condition for positive definiteness and convexity of the ML problem at the optimum.
The expected Hessian is the Fisher Information matrix; its inverse is the Cramer-Rao lower bound on the variance of unbiased estimators.
The plot shows the bound for an off-diagonal element with a generalized Gaussian prior.
The Hessian also allows a Newton method; for EM stability, a modified form is used in its place.

33 Super-Gaussian Mixture Model
The variational formulation also allows derivation of a generalization of the Gaussian mixture model to strongly super-Gaussian mixtures.
The update rules are similar to those of the Gaussian mixture model, but include the variational parameters ξ.
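
A sketch of how the variational bound extends to a mixture, in illustrative notation (zero-mean, unit-scale components for simplicity; the full model also carries component means, scales, and mixture-weight updates):

p(s) \;=\; \sum_{j} \alpha_j\, e^{-g_j(s^2)}, \quad g_j \ \text{concave}
\;\;\Longrightarrow\;\;
e^{-g_j(s^2)} \;\ge\; \exp\!\Big( -g_j(\xi_j) - g_j'(\xi_j)\,\big(s^2 - \xi_j\big) \Big).

Each component is lower-bounded by a Gaussian-shaped function that is tight at s^2 = ξ_j, so the bound on the mixture has the form of a Gaussian mixture with component-wise variational parameters ξ_j, which is why the update rules resemble the Gaussian mixture EM updates with ξ added.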

34 Source Mixture Model Examples

35 ICA Mixture Model – Images
Goal: find an efficient basis for representing image patches. Data vectors are 12 x 12 blocks.

36 Covariance Square Root Sphere Basis

37 ICA: Single Basis

38 ICA Mixture Model: Model 1

39 ICA Mixture Model: Model 2

40 Image Segmentation 1
Using the learned models, we classify each image block as from Model 1 or Model 2.
Lower left shows the raw probability for Model 1; lower right shows the binary segmentation.
Blue captures the high-frequency ground.
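
A hedged sketch of the classification step: score each block under each learned ICA model by its log-likelihood, log|det W_m| + sum_i log p_m(s_i) with s = W_m x. A unit Laplacian source density is assumed here purely for illustration (the learned models use adapted densities):

import numpy as np

def ica_loglik(W, x):
    """Log-likelihood of one data block x under an ICA model with unmixing matrix W."""
    s = W @ x
    return np.linalg.slogdet(W)[1] + np.sum(-np.abs(s) - np.log(2.0))   # Laplacian source term

def segment_blocks(blocks, W1, W2):
    """blocks: (n_blocks, d) array of vectorized patches -> True where Model 1 is more likely."""
    ll1 = np.array([ica_loglik(W1, b) for b in blocks])
    ll2 = np.array([ica_loglik(W2, b) for b in blocks])
    return ll1 > ll2

The slide's "raw probability" for Model 1 would correspond to the softmax of these two log-likelihoods (plus log mixture weights), and the binary segmentation to the comparison above.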

41 Image Segmentation 2
Again we classify each image block as from Model 1 or Model 2.
Lower left shows the raw probability for Model 1; lower right shows the binary segmentation.
Blue captures the high-frequency tree bark.

42 Image Model 1 basis densities
Low-frequency components are not sparse, and may be multimodal.
Edge filters in Model 1 are not as sparse as the higher-frequency components of Model 2.

43 Image Model 2 densities
Densities are very sparse.
Higher-frequency components occur less often in the data.
Convergence is less smooth.

44 Generalized Gaussian shape parameter histograms
Image bases: more sparse (edge filters, etc.). EEG bases: less sparse (biological signals).
Axis ticks on both histograms: 1.2 and 2.0.

45 Rate Distortion Theory
A theorem of Gray shows that, for a finite autoregressive process and difference distortion measures, the optimal rate transform is the inverse of the mixing filter.
The proof seems to extend to general linear systems, and potentially to mixture models.
To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate.
Diagram: Z(t) → H(z) → X(t) → H^{-1}(z) → Z(t), with R_Z(D) ≤ R_X(D).

46 Time Series Segment Likelihood
Multichannel convolution is a linear operation whose matrix is block Toeplitz.
To calculate the likelihood of a segment, we need its determinant, which is given by an extension of the Szegö limit theorem.
This can be extended to multi-dimensional fields, e.g. image convolution.
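
For reference, the scalar version of the Szegö limit theorem, in illustrative notation (the thesis needs the block Toeplitz / multichannel extension, which is not reproduced here): if T_N(f) is the N x N Toeplitz covariance matrix generated by a spectral density f(ω) > 0, then

\lim_{N \to \infty} \frac{1}{N} \log \det T_N(f)
\;=\; \frac{1}{2\pi} \int_{-\pi}^{\pi} \log f(\omega)\, d\omega .

So for long segments the log-determinant needed in the segment likelihood is approximately N times the average log spectral density of the filtered process.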

47 Segmented EEG source time series
Linear Process Mixture Model run on several sources – 2 models.
Coherent activity is identified and segmented blindly.
Spectral density resolution is greatly enhanced by eliminating noise.

48 Spectral Density Enhancement
Spectra before segmentation/rejection (left) and after (right).
Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum.
Panels: all-series spectrum versus segmented spectrum, for a Source A channel and a Source B channel.

49 Future Work
Fully implement the hierarchical linear process model.
Implement a Hidden Markov Model to learn relationships among the various model states.
Test new multivariate dependent density models.
Implement the multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates.
Implement the Linear Process Mixture Model in VLSI circuits.

50 Publications
"A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian Priors," Proceedings of the 36th Asilomar Conference on Signals and Systems, 2002.
"A General Framework for Component Estimation," Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
"Variational EM Algorithms for Non-Gaussian Latent Variable Models," Advances in Neural Information Processing Systems, 2005.
"Super-Gaussian Mixture Source Model for ICA," Proceedings of the 6th International Symposium on Independent Component Analysis, 2006.
"Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution," submitted to ICASSP 2007.

51 Summary of Contributions
Convergence proof for the Generalized FOCUSS algorithm.
Proposal of the notion of strong super-Gaussianity and clarification of its relationship to Gaussian scale mixtures and kurtosis.
Extension of Gaussian scale mixtures to general multivariate dependency models using derivatives of the univariate density.
Derivation of the Super-Gaussian Mixture Model, which generalizes the monotonic Gaussian mixture algorithm with the same complexity.
Derivation of a Lagrangian Newton algorithm for overcomplete dictionary learning – best performance in Monte Carlo simulations.
Analysis of the convexity of EM-based ICA.
Proposal of the Linear Process Mixture Model, and derivation of the segment probability, enabling probabilistic modeling of non-stationary time series.

