
1 Bayesian Machine Learning for Signal Processing. Hagai T. Attias, Golden Metallic, Inc., San Francisco, CA. Tutorial at the 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006.

2 ICA / BSS is 15 Years Old
First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991
First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997
First conference on ICA/BSS: Helsinki, 2000
Lesson drawn by many: ICA is a cool problem. Let's find many approaches to it and many places where it's useful.
Lesson drawn by some: statistical machine learning is a cool framework. Let's use it to transform adaptive signal processing. ICA is a good start.

3 Noise Cancellation (diagram: microphone, TV interference, background source, and an ANC block)

4 From Noise Cancellation to ICA (diagram: microphone, TV, background source, and an ICA block)

5 Noise Cancellation: Derivation
y = sensor, x = sources, n = time point
y_1(n) = x_1(n) + w(n) * x_2(n)
y_2(n) = x_2(n)
Joint probability distribution of observed sensor data: p(y) = p_x(x_1 = y_1 − w*y_2, x_2 = y_2)
Assume the sources are independent, identically distributed Gaussians with mean 0 and precisions v_1, v_2
Observed data likelihood: L = log p(y) = −0.5 v_1 (y_1 − w*y_2)^2 + const.
dL/dw = 0 → linear equation for w
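A minimal numerical sketch of this step, assuming for simplicity a scalar coupling w rather than a full convolution filter; all signals and names are synthetic illustrations. Setting dL/dw = 0 gives the familiar least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y2 is the reference (TV) signal, y1 = background + w_true * y2
n = 10_000
w_true = 0.7
background = rng.normal(0.0, 1.0, n)   # source of interest, x1
y2 = rng.normal(0.0, 1.0, n)           # interference source, x2 = y2
y1 = background + w_true * y2

# dL/dw = 0  =>  w = <y1, y2> / <y2, y2>  (ordinary least squares)
w_hat = np.dot(y1, y2) / np.dot(y2, y2)
cleaned = y1 - w_hat * y2              # estimate of the background source

print(f"estimated w = {w_hat:.3f} (true {w_true})")
```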

6 Noise Cancellation → ICA: Derivation
y = sensor, x = sources, n = time point
y(n) = A x(n), A = square mixing matrix
x(n) = G y(n), G = square unmixing matrix
Probability distribution of observed sensor data: p(y) = |G| p_x(G y)
Assume the sources are i.i.d. non-Gaussians
Observed data likelihood: L = log p(y) = log |G| + log p(x_1) + log p(x_2)
dL/dG = 0 → non-linear equation for G
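A minimal sketch of this likelihood-gradient approach, assuming a super-Gaussian source prior p(x) ∝ 1/cosh(x) so that d log p(x)/dx = −tanh(x), and using plain batch gradient ascent (not the natural gradient mentioned later); the mixing matrix and sources are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent super-Gaussian (Laplacian) sources, mixed by a square matrix A
n = 20_000
x = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
y = A @ x

G = np.eye(2)                          # unmixing matrix, initialized at identity
lr = 0.05
for _ in range(1000):
    xh = G @ y                         # current source estimates
    # dL/dG = G^{-T} + psi(xh) y^T / n, with psi(x) = d log p(x)/dx = -tanh(x)
    grad = np.linalg.inv(G).T - np.tanh(xh) @ y.T / n
    G += lr * grad                     # gradient ascent on the likelihood

print(np.round(G @ A, 2))              # should be close to a scaled permutation matrix
```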

7 Sensor Noise and Hidden Variables
y = sensor, x = sources, u = noise, n = time point
y(n) = A x(n) + u(n)
x are now hidden variables: even if A is known, one cannot obtain x exactly from y
However, one can compute the posterior probability of x conditioned on y: p(x|y) = p(y|x) p(x) / p(y), where p(y|x) = p_u(y − A x)
To learn A from data, one must use an expectation maximization (EM) algorithm (and often approximate it)

8 Probabilistic Graphical Models
Model the distribution of observed data
Graph structure determines the probabilistic dependence between variables
We focus on DAGs = directed acyclic graphs
Node = variable; arrow = probabilistic dependence
(diagrams: a single node x with p(x); a pair of nodes x → y with p(y,x) = p(y|x) p(x))

9 Linear Classification
c = class label; discrete, multinomial
y = data; continuous, Gaussian
p(c) = π_c, p(y|c) = N(y | μ_c, ν_c)
Training set: pairs {y,c}. Learn parameters by maximum likelihood: L = log p(y,c) = log p(y|c) + log p(c)
Test set: {y}, classify using p(c|y) = p(y,c) / p(y)
(diagram: node c with arrow to node y; p(y,c) = p(y|c) p(c))
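A minimal sketch of this generative classifier for 1-D data (class priors and per-class Gaussian parameters fitted by maximum likelihood, test points classified via Bayes' rule); the data, class parameters, and variable names are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Training set: two classes with different means and variances
y0 = rng.normal(-1.0, 0.5, 300)
y1 = rng.normal(+1.5, 1.0, 200)
y_train = np.concatenate([y0, y1])
c_train = np.concatenate([np.zeros(300, int), np.ones(200, int)])

# Maximum-likelihood estimates of p(c) = pi_c and p(y|c) = N(y | mu_c, var_c)
pi = np.array([np.mean(c_train == c) for c in (0, 1)])
mu = np.array([y_train[c_train == c].mean() for c in (0, 1)])
sd = np.array([y_train[c_train == c].std() for c in (0, 1)])

# Classification: p(c|y) is proportional to p(y|c) p(c)
y_test = np.array([-0.8, 0.2, 2.0])
joint = np.stack([pi[c] * norm.pdf(y_test, mu[c], sd[c]) for c in (0, 1)])
post = joint / joint.sum(axis=0)
print(post.argmax(axis=0), np.round(post, 3))
```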

10 Linear Regression
x = predictor; continuous, Gaussian
y = dependent; continuous, Gaussian
p(x) = N(x | μ, ν), p(y|x) = N(y | Ax, Λ)
Training set: pairs {y,x}. Learn parameters by maximum likelihood: L = log p(y,x) = log p(y|x) + log p(x)
Test set: {x}, predict using p(y|x)
(diagram: node x with arrow to node y; p(y,x) = p(y|x) p(x))
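A minimal 1-D sketch of the maximum-likelihood fit: A comes from ordinary least squares and the noise precision Λ from the residual variance. The data and names are synthetic illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic training pairs {x, y} with y = A x + Gaussian noise
n = 1_000
A_true, noise_sd = 2.5, 0.3
x = rng.normal(1.0, 1.0, n)
y = A_true * x + rng.normal(0.0, noise_sd, n)

# Maximum likelihood: A maximizes log p(y|x), i.e. ordinary least squares;
# the noise precision Lambda is the inverse of the residual variance
A_hat = np.dot(x, y) / np.dot(x, x)
Lambda_hat = 1.0 / np.mean((y - A_hat * x) ** 2)

# Prediction on a test set {x}: the mean of p(y|x) is A_hat * x
x_test = np.array([0.0, 1.0, 2.0])
print(np.round(A_hat * x_test, 3), round(A_hat, 3), round(Lambda_hat, 2))
```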

11 Clustering
c = class label; discrete, multinomial
y = data; continuous, Gaussian
p(c) = π_c, p(y|c) = N(y | μ_c, ν_c), but now c is hidden
Training set: {y}. p(y) is a mixture of Gaussians (MoG). Learn parameters by expectation maximization (EM)
Test set: {y}, cluster using p(c|y) = p(y,c) / p(y)
Limit of zero variance: vector quantization (VQ)
(diagram: node c with arrow to node y; p(y,c) = p(y|c) p(c))
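A minimal sketch of EM for a two-component 1-D mixture of Gaussians; the data and initial values are illustrative, and the responsibilities in the E-step are exactly the p(c|y) used for clustering:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Training data drawn from two Gaussian clusters; the labels c are hidden
y = np.concatenate([rng.normal(-2.0, 0.7, 400), rng.normal(3.0, 1.2, 600)])

# Initial guesses for pi_c, mu_c, sd_c
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities p(c|y) = p(y|c) p(c) / p(y)
    joint = pi[:, None] * norm.pdf(y[None, :], mu[:, None], sd[:, None])
    r = joint / joint.sum(axis=0, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    Nc = r.sum(axis=1)
    pi = Nc / len(y)
    mu = (r * y).sum(axis=1) / Nc
    sd = np.sqrt((r * (y - mu[:, None]) ** 2).sum(axis=1) / Nc)

print(np.round(pi, 2), np.round(mu, 2), np.round(sd, 2))
```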

12 Factor Analysis
x = factors; continuous, Gaussian
y = data; continuous, Gaussian
p(x) = N(x | 0, I), p(y|x) = N(y | Ax, Λ), but now x is hidden
Training set: {y}. p(y) is Gaussian with covariance AA' + Λ^-1. Learn parameters by expectation maximization (EM)
Test set: {y}, obtain factors via the posterior p(x|y) = p(y,x) / p(y)
Limit of zero noise: principal component analysis (PCA)
(diagram: node x with arrow to node y; p(y,x) = p(y|x) p(x))
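A minimal numpy sketch of the EM updates for factor analysis, written here in terms of the diagonal noise covariance Psi = Λ^-1; the dimensions, data, and initialization are synthetic illustrations:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: 5-D observations generated from 2 hidden factors plus noise
d, k, n = 5, 2, 2_000
A_true = rng.normal(size=(d, k))
Y = A_true @ rng.normal(size=(k, n)) + 0.3 * rng.normal(size=(d, n))

# EM for factor analysis; psi is the diagonal noise covariance (Lambda^{-1} in the slides)
A = rng.normal(size=(d, k))
psi = np.ones(d)
for _ in range(200):
    # E-step: posterior p(x|y) is Gaussian with shared covariance Sx and per-sample mean M
    Sx = np.linalg.inv(np.eye(k) + A.T @ (A / psi[:, None]))
    M = Sx @ A.T @ (Y / psi[:, None])
    Exx = n * Sx + M @ M.T
    # M-step: closed-form updates for the mixing matrix and the noise variances
    A = (Y @ M.T) @ np.linalg.inv(Exx)
    psi = np.mean(Y * (Y - A @ M), axis=1)

print(np.round(psi, 3))  # should approach the true noise variance (~0.09)
```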

13 Dynamical Models (diagrams of three temporal graphical models)
Hidden Markov Model: discrete states s(n) emitting y(n); inference by Baum-Welch
State Space Model: continuous states x(n) emitting y(n); inference by Kalman smoothing
Switching State Space Model: both s(n) and x(n); exact inference is intractable

14 Probabilistic Inference
Nodes inside the frame: variables, vary in time. Nodes outside the frame: parameters, constant in time
Parameters have prior distributions p(A), p(Λ)
Bayesian inference: compute the full posterior distribution p(x,A,Λ|y) over all hidden nodes conditioned on observed nodes
Bayes' rule: p(x,A,Λ|y) = p(y|x,A,Λ) p(x) p(A) p(Λ) / p(y)
In hidden variable models, the joint posterior can generally not be computed exactly: the normalization factor p(y) is intractable
(diagram: factor analysis model with nodes x → y and parameter nodes A, Λ; p(x) = N(x|0,I), p(y|x,A,Λ) = N(y|Ax,Λ))

15 MAP and Maximum Likelihood
MAP = maximum a posteriori; consider only the parameter values that maximize the posterior p(x,A,Λ|y)
This is the maximum likelihood method: compute the A,Λ that maximize L = log p(y|A,Λ)
However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient based techniques, which are slow
Solution: the EM algorithm. It is iterative; each iteration has an E-step and an M-step
E-step: compute the posterior over hidden variables, p(x|y)
M-step: maximize the complete data likelihood E log p(y,x,A,Λ) w.r.t. the parameters A,Λ, where E = posterior average over x

16 Derivation of the EM Algorithm
Instead of the likelihood L = log p(y), consider F(q) = E log p(y,x) − E log q(x|y), where q(x|y) is a 'trial' posterior and E averages over x w.r.t. q
Can show: F(q) = L − KL[ q(x|y) || p(x|y) ] ≤ L
Hence F is upper bounded by L, and F = L when q = true posterior
EM performs an alternating maximization of F
The E-step maximizes F w.r.t. the posterior q
The M-step maximizes F w.r.t. the parameters A,Λ
Hence EM performs maximum likelihood
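A tiny numerical check of this bound, assuming a two-component Gaussian mixture with a single observed point and a hidden component label; the parameters are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Two-component Gaussian mixture: hidden label s, observed scalar y
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
y = 0.5

p_joint = pi * norm.pdf(y, mu, 1.0)        # p(y, s) for s = 0, 1
L = np.log(p_joint.sum())                  # log-likelihood log p(y)
p_post = p_joint / p_joint.sum()           # true posterior p(s|y)

def free_energy(q):
    # F(q) = E_q log p(y, s) - E_q log q(s)
    return np.sum(q * (np.log(p_joint) - np.log(q)))

q_trial = np.array([0.5, 0.5])             # an arbitrary 'trial' posterior
print(free_energy(q_trial) <= L)           # True: F is bounded above by L
print(np.isclose(free_energy(p_post), L))  # True: the bound is tight at the true posterior
```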

17 ICA by EM: MoG Sources
Each source distribution p(x) is a 1-dim mixture of Gaussians
The Gaussian labels s are hidden variables
The data y = A x, hence x = G y, are not hidden
Likelihood: L = log |G| + log p(x)
F(q) = log |G| + E log p(x,s) − E log q(s|y)
E-step: q(s|y) = p(x,s) / z
M-step: G ← G + ε(I − Φ(x)x')G (natural gradient), where Φ(x) is linear in x and q
Can also learn the source parameters MoG_1, MoG_2 at the M-step
(diagram: sources x_1, x_2 with MoG priors and labels s_1, s_2, mixed by A into sensors y_1, y_2)
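A sketch of these updates under an assumed two-component MoG source prior (parameters chosen for illustration, and kept fixed rather than re-learned at the M-step). The score Φ(x) is built from the component responsibilities q(s|x), which is what makes it linear in x and q:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

# MoG prior per source: two components (illustrative parameters)
pis, mus, sds = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([0.5, 0.5])

def mog_score(x):
    """Phi(x) = -d/dx log p(x) = sum_s q(s|x) (x - mu_s) / sd_s^2: linear in x and q."""
    joint = pis * norm.pdf(x[..., None], mus, sds)       # p(x, s)
    q = joint / joint.sum(axis=-1, keepdims=True)        # E-step: q(s|x)
    return (q * (x[..., None] - mus) / sds**2).sum(axis=-1)

# Mix two bimodal sources drawn from the MoG prior, then unmix by natural gradient
n = 20_000
labels = rng.integers(0, 2, (2, n))
x_true = mus[labels] + sds[labels] * rng.normal(size=(2, n))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
y = A @ x_true

G = np.eye(2)
for _ in range(1000):
    x = G @ y
    G += 0.02 * (np.eye(2) - mog_score(x) @ x.T / n) @ G   # G <- G + eps (I - Phi(x) x') G
print(np.round(G @ A, 2))                                   # approximately a scaled permutation
```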

18 Noisy, Non-Square ICA: Independent Factor Analysis
The Gaussian labels s are hidden variables
The data y = A x + u, hence x are also hidden; p(y|x) = N(y | Ax, Λ)
Likelihood: L = log p(y); must marginalize over x,s
F(q) = E log p(y,x,s) − E log q(x,s|y)
E-step: q(x,s|y) = q(x|s,y) q(s|y)
M-step: linear equations for A, Λ
Can also learn the source parameters MoG_1, MoG_2 at the M-step
Convergence problem in low noise
(diagram: sources x_1, x_2 with MoG priors and labels s_1, s_2, mixed with noise via A,Λ into sensors y_1, y_2)

19 Intractability of Inference
In many models of interest the E-step is computationally intractable
Switching state space model: the exact posterior over the discrete states p(s|y) has a cost exponential in the number of time points
Independent factor analysis: the posterior over the Gaussian labels is exponential in the number of sources
Approximations must be made
MAP approximation: consider only the most likely state configuration(s)
Markov Chain Monte Carlo: convergence may be quite slow and hard to determine
(diagram: switching state space model)

20 Variational EM
Idea: use an approximate posterior which has a factorized form
Example: switching state space model; factorize the continuous states from the discrete states, p(x,s|y) ≈ q(x,s|y) = q(x|y) q(s|y), and make no other assumptions (e.g., about functional forms)
To derive, consider F(q) from the derivation of EM: F(q) = E log p(y,x,s) − E log q(x|y) − E log q(s|y), where E performs posterior averaging w.r.t. q
Maximize F alternately w.r.t. q(x|y) and q(s|y):
q(x|y) ∝ exp( E_s log p(y,x,s) )
q(s|y) ∝ exp( E_x log p(y,x,s) )
This adds an internal loop in the E-step; the M-step is unchanged
Convergence is guaranteed since F(q) is upper bounded by L

21 Switching Model: Two Variational Approximations
(diagrams: the full switching state space model and two factorized approximations of its posterior)
Approximation I: Baum-Welch over s, Kalman smoothing over x; the posterior over x is Gaussian and smoothed
Approximation II: Baum-Welch over s, MoG over x; the posterior over x is multimodal but not smoothed

22 IFA: Two Variational Approximations
(diagrams: the full IFA model and two factorized approximations of its posterior)
Approximation I: the source posterior is Gaussian and correlated
Approximation II: the source posterior is multimodal and independent

23 Model Order Selection
How does one determine the optimal number of factors in FA?
Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit
The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood L = log p(y) = E log p(y,A) − E log p(A|y). Compute L for each number of factors and choose the number that maximizes L
An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A, its precision α. Optimize the precisions by maximizing L; unnecessary columns will have α → infinity
Both approaches require computing the parameter posterior p(A|y), which is usually intractable

24 Variational Bayesian EM
Idea: use an approximate posterior which factorizes the parameters from the hidden variables
Example: factor analysis; p(x,A|y) ≈ q(x,A|y) = q(x|y) q(A|y), with no other assumptions (e.g., about functional forms)
To derive, consider F(q) from the derivation of EM: F(q) = E log p(y,x,A) − E log q(x|y) − E log q(A|y), where E performs posterior averaging w.r.t. q
Maximize F alternately w.r.t. q(x|y) and q(A|y):
E-step: q(x|y) ∝ exp( E_A log p(y,x,A) )
M-step: q(A|y) ∝ exp( E_x log p(y,x,A) )
In addition, maximize F w.r.t. the noise precision Λ and the hyperparameters α (MAP approximation)

25 VB Approximation for IFA
(diagrams: the full IFA model with mixing matrix node A, and the VB approximation in which the posterior over A factorizes from the posterior over the sources x_1, x_2 and labels s_1, s_2)

26 Conjugate Priors
Which form should one choose for prior distributions?
Conjugate prior idea: choose a prior such that the resulting posterior distribution has the same functional form as the prior
Single Gaussian, posterior over the mean: p(μ|y) = p(y|μ) p(μ) / p(y); the conjugate prior is Gaussian
Single Gaussian, posterior over the mean + precision: p(μ,ν|y) = p(y|μ,ν) p(μ,ν) / p(y); the conjugate prior is Normal-Wishart
Factor analysis, VB posterior over the mixing matrix: q(A|y) ∝ exp( E_x log p(y,x|A) ) p(A); the conjugate prior is Gaussian
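A minimal numerical illustration of the first case (known precision ν, Gaussian prior on the mean μ): the posterior is again Gaussian, with precisions adding and means combining precision-weighted. The prior parameters and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Data: N observations from a Gaussian with unknown mean and known precision nu
nu_known, mu_true = 4.0, 1.3
y = rng.normal(mu_true, 1.0 / np.sqrt(nu_known), 50)

# Conjugate Gaussian prior on the mean: p(mu) = N(mu | m0, 1/v0)
m0, v0 = 0.0, 1.0

# Posterior p(mu|y) is again Gaussian: precisions add, means combine precision-weighted
v_post = v0 + len(y) * nu_known
m_post = (v0 * m0 + nu_known * y.sum()) / v_post

print(f"posterior mean {m_post:.3f}, posterior precision {v_post:.1f}")
```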

27 Separation of Convolutive Mixtures of Speech Sources
Blind separation methods use extremely simple models for source distributions
Speech signals have a rich structure. Models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
One such model: work in the windowed FFT domain, x(n,k) = G(k) y(n,k), where n = frame index and k = frequency
Train a MoG model on the x(n,k) such that different components capture different speech spectra
Plug this model into IFA and use EM to obtain separation of convolutive mixtures
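A skeleton of the frequency-domain setup, assuming scipy's STFT routines and random stand-ins for the microphone signals; the per-frequency unmixing matrices G(k) are left as identity placeholders, since learning them (e.g. with the IFA machinery above) is the actual separation step:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(8)
fs, nperseg = 16_000, 512

# Two observed mixture channels (random stand-ins for real microphone signals)
y1, y2 = rng.normal(size=fs * 2), rng.normal(size=fs * 2)

# Windowed FFT: y(n, k) with n = frame index, k = frequency bin
f, t, Y1 = stft(y1, fs=fs, nperseg=nperseg)
_, _, Y2 = stft(y2, fs=fs, nperseg=nperseg)
Y = np.stack([Y1, Y2])                      # shape (2 channels, K freqs, N frames)

# One unmixing matrix G(k) per frequency bin; identity here, learned in practice
G = np.tile(np.eye(2, dtype=complex), (len(f), 1, 1))
X = np.einsum('kij,jkn->ikn', G, Y)         # x(n, k) = G(k) y(n, k)

# Back to the time domain
_, x1 = istft(X[0], fs=fs, nperseg=nperseg)
print(x1.shape)
```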

28 Noise Suppression in Speech Signals
Existing methods, based on, e.g., spectral subtraction and array processing, often produce unsatisfactory noise suppression
Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just silent segments), and (3) work with one or more microphones
Use the speech model in the windowed FFT domain
Λ(k) = noise precision per frequency (inverse spectrum)
(diagram: IFA model with MoG speech sources x_1, x_2, labels s_1, s_2, mixing A and noise precision Λ, and sensors y_1, y_2)

29 Interference Suppression and Evoked Source Separation in MEG Data
y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise
Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
Pre-stimulus: y(n) = B u(n) + v(n); post-stimulus: y(n) = A x(n) + B u(n) + v(n)
SEIFA is an extension of IFA to this case: model x by a MoG, model u by Gaussians N(0,I), model v by a Gaussian N(0,Λ)
Use the pre-stimulus data to learn the interference mixing matrix B and noise precision Λ; use the post-stimulus data to learn the evoked mixing matrix A
Use VB-EM to infer from data the optimal number of interference factors u and of evoked factors x; this also protects from overfitting
Cleaned data: y = A x; contribution of factor j to sensor i: y_i = A_ij x_j
Advantages over ICA: no need to discard information by dimensionality reduction; can exploit stimulus onset information; superior noise suppression

30 Stimulus Evoked Independent Factor Analysis
(diagrams: pre-stimulus model with interference factors u mixed by B into sensors y; post-stimulus model adding MoG evoked factors x with labels s, mixed by A, alongside the interference u mixed by B)

31 Brain Source Localization using MEG
Problem: localize brain sources that respond to a stimulus
The response model is simple: y(n) = F s(n) + v(n), with F = lead field (known) and s = brain voxel activity
However, the number of voxels (~1000-3000) is much larger than the number of sensors (~100-300)
One approach: fit multiple dipole sources; the cost is exponential in the number of sources
Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model y(n) = F' z(n) + A x(n) + v(n), where F' = lead field for that voxel, z = voxel activity, and x = response from all other active voxels
Obtain a localization map by plotting the inferred voxel activity per voxel
Superior results to existing (beamforming based) methods; can handle correlated sources

32 MEG Localization Model
(diagrams: pre-stimulus model with interference factors u driving sensors y through B; post-stimulus model with voxel activity z driving y through F', other active voxels x through A, and interference u through B)

33 Conclusion
Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
Process: (1) design a probabilistic model that corresponds to the problem; (2) use machinery for exact and approximate inference to learn the model from data, including the model order; (3) extend the model, e.g. by incorporating rich signal models, to improve performance
Problems treated here: noise suppression, source separation, source localization
Domains: speech, audio, biomedical data
Domains outside this tutorial: image, video, text, coding, …
Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing

