Bayesian Models in Machine Learning


Bayesian Models in Machine Learning. Lukáš Burget. Escuela de Ciencias Informáticas 2017, Buenos Aires, July 24-29, 2017.

Bayesian Networks The graph corresponds to a particular factorization of a joint probability distribution over a set of random variables. Nodes are random variables, but the graph does not say what the distributions of the variables are. The graph represents the set of distributions that conform to the factorization. It is a recipe for building more complex models out of simpler probability distributions. It describes the generative process. Generally, there are no closed-form solutions for inference in such models.
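As a minimal illustrative example (not from the slides), a three-node chain graph $a \rightarrow b \rightarrow c$ corresponds to the factorization:

```latex
p(a, b, c) = p(a)\, p(b \mid a)\, p(c \mid b)
```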

Conditional independence Bayesian Networks allow us to see conditional independence properties. Blue nodes correspond to observed random variables and empty nodes to latent (or hidden) random variables. (The slide contrasts example graphs in which a conditional independence holds with graphs for which the opposite is true.)

Gaussian Mixture Model (GMM) $p(x \mid \boldsymbol{\eta}) = \sum_c \pi_c\, \mathcal{N}(x; \mu_c, \sigma_c^2)$, where $\boldsymbol{\eta} = \{\pi_c, \mu_c, \sigma_c^2\}$ and $\sum_c \pi_c = 1$. (The slide plots the resulting density $p(x)$ as a function of $x$.) We can see the sum above just as a function defining the shape of the probability density function, or ...
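A minimal sketch (not from the slides) of evaluating this density in Python, with made-up parameters pi, mu, sigma for a hypothetical 3-component model:

```python
# Evaluate p(x | eta) = sum_c pi_c * N(x; mu_c, sigma_c^2)
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])      # mixture weights, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 1.5])   # component standard deviations

def gmm_pdf(x):
    # sum over components of pi_c * N(x; mu_c, sigma_c^2)
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(gmm_pdf(0.7))
```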

Multivariate GMM $p(\mathbf{x} \mid \boldsymbol{\eta}) = \sum_c \pi_c\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$, where $\boldsymbol{\eta} = \{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}$ and $\sum_c \pi_c = 1$. Again, we can see the sum above just as a function defining the shape of the probability density function, or ...

Gaussian Mixture Model $p(x) = \sum_z p(x \mid z)\, P(z) = \sum_c \mathcal{N}(x; \mu_c, \sigma_c^2)\, \mathrm{Cat}(z = c \mid \boldsymbol{\pi})$ ... or we can see it as a generative probabilistic model described by a Bayesian network with a categorical latent random variable $z$ identifying the Gaussian distribution generating the observation $x$ (the slide shows the network $z \rightarrow x$ with joint distribution $p(x, z) = p(x \mid z)\, P(z)$). Observations are assumed to be generated as follows: randomly select a Gaussian component according to the probabilities $P(z)$, then generate the observation $x$ from the selected Gaussian distribution. To evaluate $p(x)$, we have to marginalize out $z$. There is no closed-form solution for training.
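A minimal sketch (not from the slides) of this generative process, reusing the hypothetical parameters from the previous sketch:

```python
# Draw a component z ~ Cat(pi), then draw x ~ N(mu_z, sigma_z^2).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 1.5])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)        # select Gaussian components
    x = rng.normal(loc=mu[z], scale=sigma[z])    # sample from the selected Gaussians
    return x, z

x, z = sample_gmm(1000)
```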

Bayesian Networks for GMM Multiple observations: the slide shows the unrolled network over $z_1, z_2, \dots, z_{N-1}, z_N$ and $x_1, x_2, \dots, x_{N-1}, x_N$, or equivalently the plate notation over $z_i, x_i$ with $i = 1..N$. $p(x_1, x_2, \dots, x_N, z_1, z_2, \dots, z_N) = \prod_{i=1}^{N} p(x_i \mid z_i)\, P(z_i)$

Training GMM – Viterbi training An intuitive and approximate iterative algorithm for training GMM parameters (see the code sketch after this list):
1. Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though all the data correspond to only one class modeled by the GMM).
2. Re-estimate the parameters of the Gaussians using the data assigned to them in the previous step. The new weights will be proportional to the number of data points assigned to each Gaussian.
3. Repeat the previous two steps until the algorithm converges.
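A minimal sketch of these steps for a 1-D GMM (an illustration under the stated procedure, not the lecturer's code; viterbi_train and its arguments are hypothetical names):

```python
# Viterbi training: hard-assign each point to its most likely component,
# then re-estimate each Gaussian from the points assigned to it.
import numpy as np
from scipy.stats import norm

def viterbi_train(x, pi, mu, sigma, n_iter=20):
    for _ in range(n_iter):
        # classify: log p(x_i, z=c) for every point i and component c
        log_joint = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
        z = np.argmax(log_joint, axis=1)          # hard assignment
        for c in range(len(pi)):
            xc = x[z == c]
            if len(xc) > 0:
                mu[c] = xc.mean()                 # re-estimate mean
                sigma[c] = xc.std() + 1e-6        # re-estimate std. deviation
                pi[c] = len(xc) / len(x)          # weight ~ number of assigned points
    return pi, mu, sigma
```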

Training GMM – EM algorithm Expectation Maximization is a general tool applicable to different generative models with latent (hidden) variables. Here, we only see the result of its application to the problem of re-estimating GMM parameters. It is guaranteed to increase the likelihood of the training data in every iteration; however, it does not guarantee finding the global optimum. The algorithm is very similar to the Viterbi training presented above. However, instead of hard alignments of frames to Gaussian components, the posterior probabilities $P(c \mid x_i)$ (calculated given the old model) are used as soft weights. The parameters $\mu_c, \sigma_c^2$ are then calculated using a weighted average (see the code sketch below).

$\gamma_{ci} = \frac{\pi_c^{(old)}\, \mathcal{N}(x_i \mid \mu_c^{(old)}, \sigma_c^{2\,(old)})}{\sum_k \pi_k^{(old)}\, \mathcal{N}(x_i \mid \mu_k^{(old)}, \sigma_k^{2\,(old)})} = \frac{p(x_i \mid z_i = c)\, P(z_i = c)}{\sum_k p(x_i \mid z_i = k)\, P(z_i = k)} = P(z_i = c \mid x_i)$

$\mu_c^{new} = \frac{1}{\sum_i \gamma_{ci}} \sum_i \gamma_{ci}\, x_i \qquad \sigma_c^{2\,new} = \frac{1}{\sum_i \gamma_{ci}} \sum_i \gamma_{ci}\, (x_i - \mu_c)^2 \qquad \pi_c^{new} = \frac{\sum_i \gamma_{ci}}{\sum_k \sum_i \gamma_{ki}} = \frac{\sum_i \gamma_{ci}}{N}$
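A minimal sketch (again not the lecturer's code) of one EM iteration implementing the responsibilities and weighted updates above:

```python
# One EM pass for a 1-D GMM: the E-step computes responsibilities gamma,
# the M-step re-estimates parameters as responsibility-weighted averages.
import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    # E-step: gamma[i, c] = P(z_i = c | x_i) under the old parameters
    joint = pi * norm.pdf(x[:, None], mu, sigma)        # shape (N, C)
    gamma = joint / joint.sum(axis=1, keepdims=True)

    # M-step: weighted re-estimation
    nk = gamma.sum(axis=0)                              # soft counts per component
    mu_new = (gamma * x[:, None]).sum(axis=0) / nk
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    pi_new = nk / len(x)
    return pi_new, mu_new, np.sqrt(var_new)
```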

(Figure slides: "GMM to be learned", followed by a sequence of "EM algorithm" slides illustrating the fitted mixture over successive EM iterations.)

Expectation maximization algorithm Here, $q(\mathbf{Z})$ is any distribution over the latent variables. The Kullback-Leibler divergence $D_{KL}(q \,\|\, p)$ measures the dissimilarity between two distributions $q, p$: $D_{KL}(q \,\|\, p) \ge 0$ and $D_{KL}(q \,\|\, p) = 0 \Leftrightarrow q = p$. This gives the evidence lower bound (ELBO): $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}) \le \ln p(\mathbf{X} \mid \boldsymbol{\eta})$. $H(q(\mathbf{Z}))$ is the (non-negative) entropy of the distribution $q(\mathbf{Z})$, and $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$ is called the auxiliary function.
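The decomposition this slide refers to is not in the transcript; reconstructed from the surrounding definitions of $q$, $D_{KL}$, $H$ and $\mathcal{Q}$, it reads:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\eta})
  = \underbrace{\mathcal{L}\big(q(\mathbf{Z}), \boldsymbol{\eta}\big)}_{\text{ELBO}}
  + D_{\mathrm{KL}}\!\big(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\eta})\big),
\qquad
\mathcal{L}\big(q(\mathbf{Z}), \boldsymbol{\eta}\big)
  = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\eta})}_{\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})}
  \; \underbrace{-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln q(\mathbf{Z})}_{H(q(\mathbf{Z}))}
```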

Expectation maximization algorithm We aim to find parameters $\boldsymbol{\eta}$ that maximize $\ln p(\mathbf{X} \mid \boldsymbol{\eta})$.
E-step: $q(\mathbf{Z}) := P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\eta}^{old})$
- makes the $D_{KL}(q \,\|\, p)$ term 0
- makes $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}) = \ln p(\mathbf{X} \mid \boldsymbol{\eta})$
M-step: $\boldsymbol{\eta}^{new} = \arg\max_{\boldsymbol{\eta}} \mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
- $D_{KL}(q \,\|\, p)$ increases as the posterior $P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\eta})$ deviates from $q(\mathbf{Z})$
- $H(q(\mathbf{Z}))$ does not change for fixed $q(\mathbf{Z})$
- $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta})$ increases like $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
- $\ln p(\mathbf{X} \mid \boldsymbol{\eta})$ increases more than $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$

Expectation maximization algorithm The algorithm alternates between the two steps. E-step: $q(\mathbf{Z}) := P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\eta}^{old})$. M-step: $\boldsymbol{\eta}^{new} = \arg\max_{\boldsymbol{\eta}} \mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$. (The slide illustrates the alternation graphically.)

Expectation maximization algorithm $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$ and $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta})$ will be easy to optimize (e.g. a quadratic function) compared to $\ln p(\mathbf{X} \mid \boldsymbol{\eta})$. (The slide plots the bound around $\boldsymbol{\eta}^{old}$, whose maximum gives $\boldsymbol{\eta}^{new}$.)

EM for GMM Now, we aim to train the parameters $\boldsymbol{\eta} = \{\mu_z, \sigma_z^2, \pi_z\}$ of a Gaussian Mixture Model. Given training observations $\mathbf{x} = [x_1, x_2, \dots, x_N]$, we search for the ML estimate of $\boldsymbol{\eta}$ that maximizes the log likelihood of the training data. Direct maximization of this objective function w.r.t. $\boldsymbol{\eta}$ is intractable. We will use the EM algorithm, where we maximize the auxiliary function, which is (for simplicity) a sum of per-observation auxiliary functions. Again, in the M-step the log likelihood has to increase more than the auxiliary function does.
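The objective and the per-observation auxiliary functions referred to above are not shown in the transcript; written out in the notation of the following slides (a reconstruction), they are:

```latex
\ln p(\mathbf{x} \mid \boldsymbol{\eta}) = \sum_{n=1}^{N} \ln \sum_{c} \pi_c\, \mathcal{N}(x_n; \mu_c, \sigma_c^2),
\qquad
\mathcal{Q}(q, \boldsymbol{\eta}) = \sum_{n=1}^{N} \sum_{c} \gamma_{nc} \ln \big[ \pi_c\, \mathcal{N}(x_n; \mu_c, \sigma_c^2) \big]
```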

EM for GMM – E-step $\gamma_{nc}$ is the so-called responsibility of Gaussian component $c$ for observation $n$. It is the posterior probability of observation $n$ being generated from component $c$.

EM for GMM – M-step In the M-step, the auxiliary function is maximized w.r.t. all GMM parameters.

EM for GMM – update of means Update for the component means: $\mu_c^{new} = \frac{\sum_n \gamma_{nc}\, x_n}{\sum_n \gamma_{nc}}$. The update for the variances can be derived similarly.
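The derivation behind this update is not in the transcript; a standard reconstruction, consistent with the update formula above, sets the derivative of the auxiliary function to zero:

```latex
\frac{\partial \mathcal{Q}}{\partial \mu_c}
  = \sum_{n} \gamma_{nc}\, \frac{x_n - \mu_c}{\sigma_c^2} = 0
\quad\Rightarrow\quad
\mu_c^{new} = \frac{\sum_n \gamma_{nc}\, x_n}{\sum_n \gamma_{nc}}
```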

Flashback: ML estimate for Gaussian
$\arg\max_{\mu, \sigma^2} p(\mathbf{x} \mid \mu, \sigma^2) = \arg\max_{\mu, \sigma^2} \log p(\mathbf{x} \mid \mu, \sigma^2)$
$\log p(\mathbf{x} \mid \mu, \sigma^2) = \sum_i \log \mathcal{N}(x_i; \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_i x_i^2 + \frac{\mu}{\sigma^2} \sum_i x_i - \frac{N \mu^2}{2\sigma^2} - \frac{N}{2} \log 2\pi\sigma^2$
$\frac{\partial}{\partial \mu} \log p(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{\sigma^2} \left( \sum_i x_i - N\mu \right) = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N} \sum_i x_i$
and similarly: $\sigma^2_{ML} = \frac{1}{N} \sum_i (x_i - \mu)^2$
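As a quick numerical sanity check of these closed-form estimates (a sketch with synthetic data, not from the slides):

```python
# ML estimates of a Gaussian reduce to the sample mean and the
# (biased, 1/N) sample variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic data

mu_ml = x.mean()                       # (1/N) * sum_i x_i
var_ml = ((x - mu_ml) ** 2).mean()     # (1/N) * sum_i (x_i - mu)^2
print(mu_ml, var_ml)                   # close to 2.0 and 9.0
```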

EM for GMM – update of weights The weights $\pi_c$ need to sum to one. When updating the weights, a Lagrange multiplier $\lambda$ is used to enforce this constraint.
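The constrained maximization itself is not shown in the transcript; a standard reconstruction, consistent with the weight update given with the EM equations earlier, is:

```latex
\frac{\partial}{\partial \pi_c}
\left[ \sum_{n}\sum_{c'} \gamma_{nc'} \ln \pi_{c'}
       + \lambda \Big( \sum_{c'} \pi_{c'} - 1 \Big) \right]
  = \frac{\sum_n \gamma_{nc}}{\pi_c} + \lambda = 0
\quad\Rightarrow\quad
\pi_c^{new} = \frac{\sum_n \gamma_{nc}}{N}
```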

Factorization of the auxiliary function, more formally Before, we introduced the per-observation auxiliary functions. We can show that such a factorization comes naturally even if we directly write the auxiliary function as defined for the EM algorithm: $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\eta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \sum_n \ln p(x_n, z_n \mid \boldsymbol{\eta}) = \sum_n \sum_{c} q(z_n = c) \ln p(x_n, z_n = c \mid \boldsymbol{\eta})$. See the next slide for proof.

Factorization over components Example with only 3 frames (i.e. $\mathbf{z} = [z_1, z_2, z_3]$).
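The worked expansion is not in the transcript; a reconstruction of the 3-frame case, consistent with the previous slide, is:

```latex
\sum_{z_1, z_2, z_3} q(z_1, z_2, z_3)
  \big[ \ln p(x_1, z_1 \mid \boldsymbol{\eta}) + \ln p(x_2, z_2 \mid \boldsymbol{\eta}) + \ln p(x_3, z_3 \mid \boldsymbol{\eta}) \big]
  = \sum_{n=1}^{3} \sum_{c} q(z_n = c) \ln p(x_n, z_n = c \mid \boldsymbol{\eta})
```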

EM for continuous latent variables Same equations, but with the sums over the latent variable $\mathbf{Z}$ simply replaced by integrals.

PLDA model for speaker verification The model combines a distribution of speaker means with within-class (channel) variability (these are the labels in the slide's graphical model). Same-speaker hypothesis likelihood: $p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_s) = \int p(\mathbf{i}_1 \mid \mathbf{r})\, p(\mathbf{i}_2 \mid \mathbf{r})\, p(\mathbf{r})\, \mathrm{d}\mathbf{r}$. Different-speaker hypothesis likelihood: $p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_d) = p(\mathbf{i}_1)\, p(\mathbf{i}_2)$, where $p(\mathbf{i}) = \int p(\mathbf{i} \mid \mathbf{r})\, p(\mathbf{r})\, \mathrm{d}\mathbf{r}$. Verification score based on Bayesian model comparison: $s = \log \frac{p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_s)}{p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_d)}$
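A minimal scoring sketch (not from the slides), assuming the simplest two-covariance Gaussian PLDA: the speaker variable $\mathbf{r}$ has prior $\mathcal{N}(\mathbf{m}, \mathbf{B})$ and an i-vector given $\mathbf{r}$ is $\mathcal{N}(\mathbf{r}, \mathbf{W})$. Under these Gaussian assumptions both integrals above have closed forms as joint Gaussians over the stacked pair $[\mathbf{i}_1; \mathbf{i}_2]$:

```python
# s = log p(i1, i2 | Hs) - log p(i1, i2 | Hd) for a hypothetical
# two-covariance Gaussian PLDA: r ~ N(m, B) (distribution of speaker means),
# i | r ~ N(r, W) (within-class / channel variability).
import numpy as np
from scipy.stats import multivariate_normal

def plda_score(i1, i2, m, B, W):
    d = len(m)
    mean = np.concatenate([m, m])
    # Same-speaker: the shared r induces cross-covariance B between i1 and i2.
    cov_s = np.block([[B + W, B], [B, B + W]])
    # Different-speaker: i1 and i2 are independent.
    cov_d = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])
    x = np.concatenate([i1, i2])
    return (multivariate_normal.logpdf(x, mean, cov_s)
            - multivariate_normal.logpdf(x, mean, cov_d))
```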