Bayesian Models in Machine Learning
Lukáš Burget
Escuela de Ciencias Informáticas 2017
Buenos Aires, July 24-29 2017
Bayesian Networks
The graph corresponds to a particular factorization of a joint probability distribution over a set of random variables.
Nodes are random variables, but the graph does not say what the distributions of the variables are.
The graph represents the set of distributions that conform to the factorization.
It is a recipe for building more complex models out of simpler probability distributions.
It describes the generative process.
Generally, there are no closed-form solutions for inference in such models.
Conditional independence
Bayesian Networks allow us to read off conditional independence properties.
Shaded (blue) nodes correspond to observed random variables and empty nodes to latent (or hidden) random variables.
In some of the structures shown, conditioning on the observed variable makes the remaining variables independent; but the opposite is true for others (e.g. observing a common child of two variables makes them dependent).
Gaussian Mixture Model (GMM)
$p(x|\boldsymbol{\eta}) = \sum_c \pi_c \, \mathcal{N}(x; \mu_c, \sigma_c^2)$, where $\boldsymbol{\eta} = \{\pi_c, \mu_c, \sigma_c^2\}$ and $\sum_c \pi_c = 1$
(Figure: the resulting density $p(x)$ plotted as a function of $x$.)
We can see the sum above just as a function defining the shape of the probability density function or ...
Multivariate GMM
$p(\mathbf{x}|\boldsymbol{\eta}) = \sum_c \pi_c \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$, where $\boldsymbol{\eta} = \{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}$ and $\sum_c \pi_c = 1$
We can see the sum above just as a function defining the shape of the probability density function or ...
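Before moving on to the generative view, here is a minimal Python sketch of evaluating such a multivariate GMM density; the number of components and all parameter values are made-up illustrative choices, not taken from the lecture.

```python
# Minimal sketch: evaluating a 2-D GMM density p(x|eta) = sum_c pi_c N(x; mu_c, Sigma_c).
import numpy as np
from scipy.stats import multivariate_normal

pis    = np.array([0.5, 0.3, 0.2])                          # mixture weights, sum to 1
mus    = [np.array([0.0, 0.0]),
          np.array([3.0, 1.0]),
          np.array([-2.0, 2.0])]                            # component means (made up)
Sigmas = [np.eye(2), np.diag([0.5, 2.0]), 0.3 * np.eye(2)]  # component covariances (made up)

def gmm_pdf(x):
    """p(x|eta) as a weighted sum of Gaussian densities."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

print(gmm_pdf(np.array([1.0, 0.5])))
```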
Gaussian Mixture Model
$p(x) = \sum_z p(x|z) \, P(z) = \sum_c \mathcal{N}(x; \mu_c, \sigma_c^2) \, \mathrm{Cat}(z{=}c \mid \boldsymbol{\pi})$
... or we can see it as a generative probabilistic model described by a Bayesian network with a categorical latent random variable $z$ identifying the Gaussian distribution generating the observation $x$.
Observations are assumed to be generated as follows:
- randomly select a Gaussian component according to the probabilities $P(z)$
- generate observation $x$ from the selected Gaussian distribution
To evaluate $p(x)$, we have to marginalize out $z$.
There is no closed-form solution for training.
(Figure: Bayesian network $z \rightarrow x$ with joint distribution $p(x,z) = p(x|z)\,P(z)$.)
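The two-step generative recipe above can be sketched directly in code; the 1-D parameter values below are again made up purely for illustration.

```python
# Minimal sketch of the generative process: z ~ Cat(pi), then x ~ N(mu_z, sigma_z^2).
import numpy as np

rng    = np.random.default_rng(0)
pis    = np.array([0.6, 0.4])          # P(z = c), made-up weights
mus    = np.array([-1.0, 2.0])         # component means
sigmas = np.array([0.5, 1.0])          # component standard deviations

def sample_gmm(n):
    z = rng.choice(len(pis), size=n, p=pis)   # select a component per observation
    x = rng.normal(mus[z], sigmas[z])         # generate x from the selected Gaussian
    return x, z

x, z = sample_gmm(5)
print(x, z)
```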
Bayesian Networks for GMM
Multiple observations:
(Figure: the unrolled network $z_1 \rightarrow x_1,\; z_2 \rightarrow x_2,\; \ldots,\; z_N \rightarrow x_N$, or equivalently the plate notation $z_i \rightarrow x_i$ with $i = 1..N$.)
$p(x_1, x_2, \ldots, x_N, z_1, z_2, \ldots, z_N) = \prod_{i=1}^{N} p(x_i \mid z_i) \, P(z_i)$
Training GMM – Viterbi training
An intuitive, approximate iterative algorithm for training GMM parameters:
1. Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though all the data correspond to only one class modeled by the GMM).
2. Re-estimate the parameters of the Gaussians using the data assigned to them in the previous step. The new weights will be proportional to the number of data points assigned to each Gaussian.
3. Repeat the previous two steps until the algorithm converges.
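A minimal sketch of one such iteration for a 1-D GMM; the function and variable names are mine, and the case of a component receiving no data points is not handled here.

```python
# One Viterbi-training iteration: hard-assign each point to its most likely
# Gaussian, then re-estimate each Gaussian from "its" points.
import numpy as np
from scipy.stats import norm

def viterbi_iteration(x, pis, mus, sigmas):
    # classify: most likely component for each data point (N x C matrix of log p(x, z=c))
    log_p = np.log(pis) + norm.logpdf(x[:, None], mus, sigmas)
    assign = np.argmax(log_p, axis=1)
    # re-estimate each Gaussian from the points assigned to it
    new_pis, new_mus, new_sigmas = [], [], []
    for c in range(len(pis)):
        xc = x[assign == c]                 # caveat: may be empty in this sketch
        new_pis.append(len(xc) / len(x))    # weight proportional to number of assigned points
        new_mus.append(xc.mean())
        new_sigmas.append(xc.std())
    return np.array(new_pis), np.array(new_mus), np.array(new_sigmas)
```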
Training GMM – EM algorithm
Expectation Maximization is a general tool applicable to different generative models with latent (hidden) variables. Here, we only see the result of its application to the problem of re-estimating GMM parameters.
It is guaranteed to increase the likelihood of the training data in every iteration; however, it does not guarantee finding the global optimum.
The algorithm is very similar to the Viterbi training presented above. However, instead of hard alignments of frames to Gaussian components, the posterior probabilities $P(c \mid x_i)$ (calculated given the old model) are used as soft weights. The parameters $\mu_c, \sigma_c^2$ are then calculated using a weighted average.
$\gamma_{ki} = \dfrac{\pi_k^{(old)} \, \mathcal{N}(x_i \mid \mu_k^{(old)}, \sigma_k^{2\,(old)})}{\sum_{k'} \pi_{k'}^{(old)} \, \mathcal{N}(x_i \mid \mu_{k'}^{(old)}, \sigma_{k'}^{2\,(old)})} = \dfrac{p(x_i \mid z_i{=}k)\, P(z_i{=}k)}{\sum_{k'} p(x_i \mid z_i{=}k')\, P(z_i{=}k')} = P(z_i{=}k \mid x_i)$
$\mu_k^{new} = \dfrac{1}{\sum_i \gamma_{ki}} \sum_i \gamma_{ki} \, x_i \qquad \sigma_k^{2\,new} = \dfrac{1}{\sum_i \gamma_{ki}} \sum_i \gamma_{ki} \, (x_i - \mu_k)^2 \qquad \pi_k^{new} = \dfrac{\sum_i \gamma_{ki}}{\sum_k \sum_i \gamma_{ki}} = \dfrac{\sum_i \gamma_{ki}}{N}$
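For comparison with the Viterbi sketch above, here is a minimal sketch of one EM iteration for a 1-D GMM implementing the soft-weighted updates above; the names are mine, and the variance update here uses the newly estimated mean.

```python
# One EM iteration for a 1-D GMM: soft responsibilities instead of hard assignments.
import numpy as np
from scipy.stats import norm

def em_iteration(x, pis, mus, sigmas):
    # E-step: responsibilities gamma[i, c] = P(z_i = c | x_i, old parameters)
    p = pis * norm.pdf(x[:, None], mus, sigmas)      # N x C, pi_c * N(x_i; mu_c, sigma_c^2)
    gamma = p / p.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimation of all parameters
    Nc = gamma.sum(axis=0)                           # effective counts per component
    new_pis    = Nc / len(x)
    new_mus    = (gamma * x[:, None]).sum(axis=0) / Nc
    new_sigmas = np.sqrt((gamma * (x[:, None] - new_mus) ** 2).sum(axis=0) / Nc)
    return new_pis, new_mus, new_sigmas
```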
GMM to be learned
(Figure: the GMM to be learned.)
EM algorithm
(Figure slides: successive EM iterations, with the Gaussian components gradually fitting the data.)
Expectation maximization algorithm
$\ln p(\mathbf{X}|\boldsymbol{\eta}) = \mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}) + D_{\mathrm{KL}}\big(q(\mathbf{Z}) \,\|\, P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta})\big)$, where $q(\mathbf{Z})$ is any distribution over the latent variable.
The Kullback-Leibler divergence $D_{\mathrm{KL}}(q\|p)$ measures the "dissimilarity" between two distributions $q, p$:
$D_{\mathrm{KL}}(q\|p) \ge 0$ and $D_{\mathrm{KL}}(q\|p) = 0 \Leftrightarrow q = p$
$\Rightarrow$ Evidence lower bound (ELBO): $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}) \le \ln p(\mathbf{X}|\boldsymbol{\eta})$
$\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}) = \mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta}) + H(q(\mathbf{Z}))$, where $H(q(\mathbf{Z}))$ is the (non-negative) entropy of the distribution $q(\mathbf{Z})$ and $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$ is called the auxiliary function.
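For reference, the decomposition behind these statements can be written out explicitly, using $p(\mathbf{X},\mathbf{Z}|\boldsymbol{\eta}) = P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta})\, p(\mathbf{X}|\boldsymbol{\eta})$:

```latex
\ln p(\mathbf{X}|\boldsymbol{\eta})
  = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln
      \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\eta})}{q(\mathbf{Z})}}_{\mathcal{L}(q(\mathbf{Z}),\boldsymbol{\eta})}
  + \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln
      \frac{q(\mathbf{Z})}{P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta})}}_{D_{\mathrm{KL}}\left(q(\mathbf{Z})\,\|\,P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta})\right)},
\qquad
\mathcal{L}(q(\mathbf{Z}),\boldsymbol{\eta})
  = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X},\mathbf{Z}|\boldsymbol{\eta})}_{\mathcal{Q}(q(\mathbf{Z}),\boldsymbol{\eta})}
  \;+\; \underbrace{\Big(-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln q(\mathbf{Z})\Big)}_{H(q(\mathbf{Z}))}.
```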
Expectation maximization algorithm
We aim to find parameters $\boldsymbol{\eta}$ that maximize $\ln p(\mathbf{X}|\boldsymbol{\eta})$.
E-step: $q(\mathbf{Z}) := P(\mathbf{Z}|\mathbf{X}, \boldsymbol{\eta}^{old})$
- makes the $D_{\mathrm{KL}}(q\|p)$ term 0
- makes $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta}^{old}) = \ln p(\mathbf{X}|\boldsymbol{\eta}^{old})$
M-step: $\boldsymbol{\eta}^{new} = \arg\max_{\boldsymbol{\eta}} \mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
- $D_{\mathrm{KL}}(q\|p)$ increases as $P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta})$ deviates from $q(\mathbf{Z})$
- $H(q(\mathbf{Z}))$ does not change for fixed $q(\mathbf{Z})$
- $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta})$ increases by the same amount as $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
- $\ln p(\mathbf{X}|\boldsymbol{\eta})$ increases at least as much as $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
Expectation maximization algorithm
(Figure: the decomposition of $\ln p(\mathbf{X}|\boldsymbol{\eta})$ before and after each step.)
E-step: $q(\mathbf{Z}) := P(\mathbf{Z}|\mathbf{X}, \boldsymbol{\eta}^{old})$
M-step: $\boldsymbol{\eta}^{new} = \arg\max_{\boldsymbol{\eta}} \mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$
Expectation maximization algorithm
$\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$ and $\mathcal{L}(q(\mathbf{Z}), \boldsymbol{\eta})$ will be easy to optimize (e.g. a quadratic function) compared to $\ln p(\mathbf{X}|\boldsymbol{\eta})$.
(Figure: the bound and the objective plotted as functions of $\boldsymbol{\eta}$, with $\boldsymbol{\eta}^{old}$ and $\boldsymbol{\eta}^{new}$ marked.)
EM for GMM
Now we aim to train the parameters $\boldsymbol{\eta} = \{\mu_z, \sigma_z^2, \pi_z\}$ of a Gaussian Mixture Model.
Given training observations $\mathbf{x} = [x_1, x_2, \ldots, x_N]$, we search for the ML estimate of $\boldsymbol{\eta}$ that maximizes the log-likelihood of the training data.
Direct maximization of this objective function w.r.t. $\boldsymbol{\eta}$ is intractable.
We will use the EM algorithm, where we maximize the auxiliary function, which is (for simplicity) a sum of per-observation auxiliary functions, as written out below.
Again, in the M-step, $\ln p(\mathbf{x}|\boldsymbol{\eta})$ is guaranteed to increase at least as much as the auxiliary function $\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta})$.
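Written out with the E-step responsibilities $\gamma_{nc} = q(z_n{=}c)$ held fixed, the auxiliary function takes this per-observation form (a restatement of the definitions above, not an extra assumption):

```latex
\mathcal{Q}(q(\mathbf{Z}),\boldsymbol{\eta})
  = \sum_{n=1}^{N} \sum_{c} \gamma_{nc}
      \Big( \ln \pi_c + \ln \mathcal{N}(x_n;\mu_c,\sigma_c^2) \Big)
  = \sum_{n=1}^{N} \mathcal{Q}_n\big(q(z_n),\boldsymbol{\eta}\big)
```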
EM for GMM – E-step
$\gamma_{nc}$ is the so-called responsibility of Gaussian component $c$ for observation $n$: the probability that observation $n$ was generated by component $c$, i.e. $\gamma_{nc} = P(z_n{=}c \mid x_n, \boldsymbol{\eta}^{old})$.
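A minimal sketch of computing the responsibilities on their own; working in the log domain with the log-sum-exp trick is my addition for numerical stability and is not discussed on the slide.

```python
# Responsibilities gamma[n, c] = P(z_n = c | x_n) computed in the log domain.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def responsibilities(x, pis, mus, sigmas):
    log_joint = np.log(pis) + norm.logpdf(x[:, None], mus, sigmas)   # log pi_c N(x_n; mu_c, sigma_c^2)
    log_gamma = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    return np.exp(log_gamma)
```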
EM for GMM – M-step
In the M-step, the auxiliary function is maximized w.r.t. all GMM parameters.
EM for GMM – update of means
Update for the component means: $\mu_k^{new} = \dfrac{\sum_n \gamma_{nk} \, x_n}{\sum_n \gamma_{nk}}$ (a derivation sketch is given below).
The update for the variances can be derived similarly.
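A sketch of the derivation: only the terms of the auxiliary function that contain $\mu_k$ survive the derivative, so

```latex
\frac{\partial \mathcal{Q}}{\partial \mu_k}
  = \sum_{n} \gamma_{nk} \, \frac{\partial}{\partial \mu_k} \ln \mathcal{N}(x_n;\mu_k,\sigma_k^2)
  = \sum_{n} \gamma_{nk} \, \frac{x_n - \mu_k}{\sigma_k^2} = 0
\quad\Rightarrow\quad
\mu_k^{new} = \frac{\sum_n \gamma_{nk}\, x_n}{\sum_n \gamma_{nk}}.
```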
Flashback: ML estimate for Gaussian
$\arg\max_{\mu, \sigma^2} p(\mathbf{x} \mid \mu, \sigma^2) = \arg\max_{\mu, \sigma^2} \log p(\mathbf{x} \mid \mu, \sigma^2)$
$\log p(\mathbf{x} \mid \mu, \sigma^2) = \sum_i \log \mathcal{N}(x_i; \mu, \sigma^2) = -\dfrac{1}{2\sigma^2} \sum_i x_i^2 + \dfrac{\mu}{\sigma^2} \sum_i x_i - \dfrac{N\mu^2}{2\sigma^2} - \dfrac{N}{2}\log 2\pi\sigma^2$
$\dfrac{\partial}{\partial \mu} \log p(\mathbf{x} \mid \mu, \sigma^2) = \dfrac{1}{\sigma^2}\Big(\sum_i x_i - N\mu\Big) = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_i x_i$
and similarly: $\sigma^2_{ML} = \dfrac{1}{N}\sum_i (x_i - \mu)^2$
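A quick numerical sanity check of these closed forms on synthetic data; the generating mean and standard deviation are arbitrary choices.

```python
# The sample mean and the (biased, 1/N) sample variance maximize the Gaussian log-likelihood.
import numpy as np

x = np.random.default_rng(1).normal(loc=2.0, scale=1.5, size=10_000)
mu_ml     = x.mean()                    # (1/N) sum_i x_i
sigma2_ml = ((x - mu_ml) ** 2).mean()   # (1/N) sum_i (x_i - mu_ml)^2, i.e. np.var(x, ddof=0)
print(mu_ml, sigma2_ml)
```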
EM for GMM – update of weights
The weights $\pi_c$ need to sum up to one. When updating the weights, a Lagrange multiplier $\lambda$ is used to enforce this constraint, as sketched below.
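A sketch of the constrained maximization, keeping only the terms of the auxiliary function that contain the weights:

```latex
\frac{\partial}{\partial \pi_k}
  \left( \sum_{n}\sum_{c} \gamma_{nc} \ln \pi_c
         + \lambda \Big( \sum_{c} \pi_c - 1 \Big) \right)
  = \frac{\sum_n \gamma_{nk}}{\pi_k} + \lambda = 0
\;\Rightarrow\;
\pi_k = -\frac{\sum_n \gamma_{nk}}{\lambda};
\qquad
\sum_k \pi_k = 1 \;\Rightarrow\; \lambda = -N
\;\Rightarrow\;
\pi_k^{new} = \frac{\sum_n \gamma_{nk}}{N}.
```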
Factorization of the auxiliary function more formally
Before, we introduced the per-observation auxiliary functions.
We can show that such a factorization comes naturally even if we directly write the auxiliary function as defined for the EM algorithm:
$\mathcal{Q}(q(\mathbf{Z}), \boldsymbol{\eta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\eta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \prod_n p(x_n, z_n|\boldsymbol{\eta}) = \sum_n \sum_{z_n{=}c} q(z_n{=}c) \ln p(x_n, z_n{=}c|\boldsymbol{\eta})$
See the next slide for the proof.
Factorization over components
Example with only 3 frames (i.e. $\mathbf{z} = [z_1, z_2, z_3]$); a sketch of the corresponding derivation is given below.
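A sketch of the 3-frame case, assuming the E-step distribution factorizes as $q(\mathbf{z}) = q(z_1)\,q(z_2)\,q(z_3)$ (which holds for $q(\mathbf{Z}) = P(\mathbf{Z}|\mathbf{X},\boldsymbol{\eta}^{old})$ in this model):

```latex
\sum_{\mathbf{z}} q(\mathbf{z}) \ln p(\mathbf{x},\mathbf{z}|\boldsymbol{\eta})
 = \sum_{z_1}\sum_{z_2}\sum_{z_3} q(z_1)\,q(z_2)\,q(z_3)
   \Big( \ln p(x_1,z_1|\boldsymbol{\eta}) + \ln p(x_2,z_2|\boldsymbol{\eta}) + \ln p(x_3,z_3|\boldsymbol{\eta}) \Big)
 = \sum_{n=1}^{3} \sum_{z_n} q(z_n) \ln p(x_n,z_n|\boldsymbol{\eta}),
```

because for each of the three terms the sums over the remaining two frames contribute only factors $\sum_{z_m} q(z_m) = 1$.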
EM for continuous latent variable
The same equations apply, where sums over the latent variable $\mathbf{Z}$ are simply replaced by integrals.
PLDA model for speaker verification
(Figure: $\mathbf{r}$ drawn from the distribution of speaker means; within-class (channel) variability around $\mathbf{r}$.)
Same-speaker hypothesis likelihood: $p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_s) = \int p(\mathbf{i}_1|\mathbf{r})\, p(\mathbf{i}_2|\mathbf{r})\, p(\mathbf{r})\, \mathrm{d}\mathbf{r}$
Different-speaker hypothesis likelihood: $p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_d) = p(\mathbf{i}_1)\, p(\mathbf{i}_2)$, where $p(\mathbf{i}) = \int p(\mathbf{i}|\mathbf{r})\, p(\mathbf{r})\, \mathrm{d}\mathbf{r}$
Verification score based on Bayesian model comparison: $s = \log \dfrac{p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_s)}{p(\mathbf{i}_1, \mathbf{i}_2 \mid \mathcal{H}_d)}$
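As a concrete illustration, here is a minimal numerical sketch of this score for the simplest zero-mean, two-covariance Gaussian PLDA, assuming $p(\mathbf{r}) = \mathcal{N}(\mathbf{r}; \mathbf{0}, \mathbf{B})$ (distribution of speaker means) and $p(\mathbf{i}|\mathbf{r}) = \mathcal{N}(\mathbf{i}; \mathbf{r}, \mathbf{W})$ (within-class variability); the matrices B and W, the i-vectors and all names are made-up assumptions, and the closed-form Gaussians below come from marginalizing $\mathbf{r}$ in the integrals above.

```python
# Verification score s = log p(i1,i2|Hs) - log p(i1,i2|Hd) for zero-mean Gaussian PLDA:
# marginalizing r gives p(i) = N(i; 0, B+W) and, under the same-speaker hypothesis,
# a joint Gaussian over [i1; i2] with cross-covariance B.
import numpy as np
from scipy.stats import multivariate_normal

def plda_score(i1, i2, B, W):
    T = B + W                                   # marginal covariance of a single i-vector
    S = np.block([[T, B],                       # joint covariance of [i1; i2] under Hs
                  [B, T]])
    log_same = multivariate_normal.logpdf(np.concatenate([i1, i2]),
                                          mean=np.zeros(2 * len(i1)), cov=S)
    log_diff = (multivariate_normal.logpdf(i1, mean=np.zeros(len(i1)), cov=T)
                + multivariate_normal.logpdf(i2, mean=np.zeros(len(i2)), cov=T))
    return log_same - log_diff

D  = 3
B  = 2.0 * np.eye(D)                            # across-speaker covariance (made up)
W  = 0.5 * np.eye(D)                            # within-speaker covariance (made up)
i1 = np.array([1.0, -0.5, 0.3])
i2 = np.array([0.9, -0.4, 0.2])
print(plda_score(i1, i2, B, W))
```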