Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016
Guillaume Dehaene

1 Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016, Guillaume Dehaene
I'd like to thank the organizers for inviting me. I'm going to present some new work on expectation propagation, which shines a new light on this algorithm by showing that it performs a smoothed gradient descent.

2 Computational troubles in Bayesian inference
The problem: Bayesian methods are susceptible to computational trouble. If we want to approximate $p(\theta)$, the Gaussian approximations are:
- Laplace approximation + Gradient Descent
- Variational Bayes (and a variant)
- Expectation Propagation
You should think of the first choice as conservative and a bit boring, sort of like Jeb Bush, while the next two are the opposite of that, so maybe like 2007 Senator Barack Obama. What I will show in this talk is that these three methods are closely linked.

3 Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode, computed using Gradient Descent on $\psi = -\log p$.
[Figure: an example posterior density, probability as a function of $\theta$.]
So here is one example of a posterior distribution. The Laplace approximation consists in finding its maximum and fitting a purely local Gaussian approximation there.

4 Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode, computed using Gradient Descent on $\psi = -\log p$.
The mathematically conservative choice:
- Gradient Descent is well-understood
- Laplace is exact in the large-data limit
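
To make this concrete, here is a minimal Python sketch of the Laplace approximation, under my own illustrative assumptions: the quartic target below is a hypothetical running example, not one used in the talk.

```python
# A minimal sketch (not from the talk) of the Laplace approximation for a
# hypothetical 1-D posterior. psi = -log p is the quantity that gradient
# descent / Newton's method minimizes.
import numpy as np

def psi(theta):        # -log p(theta) for an illustrative quartic target
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def psi_prime(theta):  # psi'
    return theta**3 + (theta - 1.0)

def psi_second(theta): # psi''
    return 3.0 * theta**2 + 1.0

# Newton's method on psi locates the mode of p ...
theta = 0.0
for _ in range(50):
    theta -= psi_prime(theta) / psi_second(theta)

# ... and the Laplace approximation is N(mode, 1 / psi''(mode)).
laplace_mean = theta
laplace_var = 1.0 / psi_second(theta)
print(laplace_mean, laplace_var)
```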

5 Physical intuitions
Gradient Descent ≈ dynamics of a sliding object
[Figure: two panels showing the negative log probability landscape with an object sliding down the slope.]
Final great feature: Gradient Descent appeals to our physical intuitions because it matches the dynamics of an object sliding down a slope.

6 Linking GD, VB and EP
VB and EP iterate Gaussian approximations.
We can define an algorithm that:
- iterates Gaussian approximations,
- computes the Laplace approximation,
- and does Gradient Descent.

7 Algorithm 1: disguised gradient descent
Initialize with any Gaussian $q_0$. Loop:
- $\mu_n = E_{q_n}[\theta]$
- $r = \psi'(\mu_n)$
- $\beta = \psi''(\mu_n)$
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
The resulting mean update is $\mu_{n+1} = \mu_n - \frac{\psi'(\mu_n)}{\psi''(\mu_n)}$: this is Newton's method!
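
Below is a sketch of Algorithm 1 on the same hypothetical quartic target, making the equivalence with Newton's method explicit; the target and its derivatives are my illustrative choices, not part of the talk.

```python
# A sketch of Algorithm 1: the Gaussian q_{n+1} built from the gradient and
# curvature of psi at the current mean has mean mu_n - psi'(mu_n)/psi''(mu_n),
# so the iteration is Newton's method.
import numpy as np

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

def psi_second(theta):
    return 3.0 * theta**2 + 1.0

mu, var = 0.0, 1.0                 # parameters of the initial Gaussian q_0
for n in range(50):
    r = psi_prime(mu)              # gradient of psi at the current mean
    beta = psi_second(mu)          # curvature of psi at the current mean
    # q_{n+1}(theta) ∝ exp(-r (theta - mu) - beta/2 (theta - mu)^2);
    # completing the square gives its mean and variance:
    mu, var = mu - r / beta, 1.0 / beta

print(mu, var)                     # converges to the Laplace approximation
```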

8 Algorithm 1: disguised gradient descent
Newton's method: $\psi \approx$ quadratic.
DGD: $p \approx \exp(-\text{quadratic})$, i.e. $p \approx$ Gaussian.

9 Variational Bayes Gaussian approximation
The Variational Bayes approach: minimize $\mathrm{KL}(q, p) = E_q\left[\log\frac{q}{p}\right]$ for $q$ a Gaussian.
Local minima satisfy (Opper, Archambeau, 2007):
- $E_{q^*}[\psi'] = 0$
- $E_{q^*}[\psi''] = \mathrm{var}_{q^*}^{-1}$
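
As a sanity check of these conditions (my own addition, not part of the talk), one can minimize $\mathrm{KL}(q, p)$ directly over the Gaussian parameters and verify both stationarity conditions numerically on the hypothetical quartic example.

```python
# A quick numerical check: minimize KL(q, p) over Gaussian q = N(mu, s^2)
# directly, then verify the two stationarity conditions. psi is the same
# hypothetical quartic running example.
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import quad

def psi(theta):
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def expect(f, mu, s):
    """E_{N(mu, s^2)}[f(theta)] by numerical integration."""
    density = lambda t: np.exp(-0.5 * ((t - mu) / s)**2) / (s * np.sqrt(2 * np.pi))
    return quad(lambda t: f(t) * density(t), mu - 10 * s, mu + 10 * s)[0]

def kl(params):
    mu, log_s = params
    s = np.exp(log_s)
    # KL(q, p) = E_q[log q] + E_q[psi] + const = -log s + E_q[psi] + const
    return -log_s + expect(psi, mu, s)

res = minimize(kl, x0=[0.0, 0.0])
mu_star, s_star = res.x[0], np.exp(res.x[1])

psi_prime = lambda t: t**3 + (t - 1.0)
psi_second = lambda t: 3.0 * t**2 + 1.0
print(expect(psi_prime, mu_star, s_star))                    # should be ≈ 0
print(expect(psi_second, mu_star, s_star), 1.0 / s_star**2)  # should match
```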

10 Algorithm 2: smoothed gradient descent
Initialize with any Gaussian $q_0$. Loop:
- $\mu_n = E_{q_n}[\theta]$
- $r = E_{q_n}[\psi'(\theta)] \approx \psi'(\mu_n)$
- $\beta = E_{q_n}[\psi''(\theta)] \approx \psi''(\mu_n)$
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
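
A sketch of Algorithm 2 on the same hypothetical target follows; using Gauss-Hermite quadrature for the Gaussian expectations is my implementation choice, not something the talk prescribes.

```python
# A sketch of Algorithm 2: identical update to Algorithm 1, but the gradient
# and curvature of psi are averaged under the current Gaussian q_n.
import numpy as np

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

def psi_second(theta):
    return 3.0 * theta**2 + 1.0

nodes, weights = np.polynomial.hermite.hermgauss(40)

def gauss_expect(f, mu, var):
    """E_{N(mu, var)}[f(theta)] by Gauss-Hermite quadrature."""
    theta = mu + np.sqrt(2.0 * var) * nodes
    return np.sum(weights * f(theta)) / np.sqrt(np.pi)

mu, var = 0.0, 1.0                            # initial Gaussian q_0
for n in range(100):
    r = gauss_expect(psi_prime, mu, var)      # smoothed version of psi'(mu_n)
    beta = gauss_expect(psi_second, mu, var)  # smoothed version of psi''(mu_n)
    mu, var = mu - r / beta, 1.0 / beta

# At the fixed point, E_q[psi'] = 0 and E_q[psi''] = 1/var(q): exactly the
# stationarity conditions of the Gaussian Variational Bayes problem.
print(mu, var)
```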

11 Algorithm 2: smoothed gradient descent

12 α-Divergence minimization
If instead of KL, we minimize: $D_\alpha(p, q) = \int p^{1-\alpha} q^{\alpha}$
then local minima $q^*$ are such that, with $h^* \propto p^{1-\alpha} (q^*)^{\alpha}$:
- $E_{h^*}[\psi'] = 0$
- $E_{h^*}[(\theta - \mu_h)\,\psi'] = 1$

13 Algorithm 3: hybrid smoothing GD
Initialize with any Gaussian $q_0$. Loop:
- $h_n \propto p^{1-\alpha} q_n^{\alpha}$
- $\mu_n = E_{h_n}[\theta]$
- $r = E_{h_n}[\psi'(\theta)] \approx \psi'(\mu_n)$
- $\beta = \mathrm{var}_{h_n}^{-1}\, E_{h_n}[(\theta - \mu_h)\,\psi'(\theta)] \approx \psi''(\mu_n)$ (for a Gaussian $q_n$, $E_{q_n}[\psi''] = \mathrm{var}_{q_n}^{-1}\, E_{q_n}[(\theta - \mu_n)\,\psi']$, so this generalizes the $\beta$ of Algorithm 2)
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
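
Here is a sketch of Algorithm 3 on the same hypothetical target, with alpha = 0.5 chosen arbitrarily; computing the hybrid's moments on a grid is a toy-setting shortcut, not a general recipe.

```python
# A sketch of Algorithm 3: the smoothing distribution is the hybrid
# h_n ∝ p^(1-alpha) q_n^alpha; its moments are computed on a 1-D grid.
import numpy as np

def psi(theta):
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

grid = np.linspace(-6.0, 6.0, 4001)
log_p = -psi(grid)                        # unnormalized log target
alpha = 0.5

mu, var = 0.0, 1.0                        # initial Gaussian q_0
for n in range(100):
    log_q = -0.5 * (grid - mu)**2 / var   # unnormalized log q_n
    log_h = (1.0 - alpha) * log_p + alpha * log_q
    h = np.exp(log_h - log_h.max())
    h /= np.trapz(h, grid)                # normalized hybrid density h_n

    mu_h = np.trapz(grid * h, grid)                 # E_{h_n}[theta]
    var_h = np.trapz((grid - mu_h)**2 * h, grid)    # var_{h_n}
    r = np.trapz(psi_prime(grid) * h, grid)         # E_{h_n}[psi']
    beta = np.trapz((grid - mu_h) * psi_prime(grid) * h, grid) / var_h

    mu, var = mu_h - r / beta, 1.0 / beta

print(mu, var)
```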

14 Interpreting algorithm 3
The only difference (not obvious for the $\beta$-term): replacing $q_n$, a poor approximation to $p$, by a superior hybrid approximation: $h_n \propto p^{1-\alpha} q_n^{\alpha} \approx p$.

15 Expectation Propagation
Assume that the target can be factorized: $p(\theta) \propto \prod_i f_i(\theta)$. Then EP seeks a Gaussian approximation for each $f_i$: $g_i(\theta) \approx f_i(\theta)$. They are improved iteratively.

16 Algorithm 4: classic Expectation Propagation
Loop over sites $i$:
- Compute the $i$th hybrid: $h_i \propto f_i(\theta) \prod_{j \neq i} g_j(\theta) \approx p$, and its mean and variance: $\mu_i = E_{h_i}[\theta]$, $v_i = \mathrm{var}_{h_i}$
- New $i$th approximation: $g_i(\theta) = \exp\left(-\frac{(\theta - \mu_i)^2}{2\, v_i}\right) \Big/ \prod_{j \neq i} g_j(\theta)$ (the moment-matched Gaussian in the numerator $\approx p$, so $g_i \approx f_i$)
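
The following sketch of classic EP is built on illustrative assumptions: the three factors (a Gaussian prior and two Cauchy-like likelihood terms) are toy choices of mine, and hybrid moments are computed on a grid rather than analytically.

```python
# A sketch of classic EP in 1-D. Sites g_i are stored as natural parameters
# (precision tau_i and precision-times-mean nu_i).
import numpy as np

grid = np.linspace(-8.0, 8.0, 4001)
log_f = [
    -0.5 * grid**2,                 # f_0: Gaussian prior factor
    -np.log1p((grid - 1.0)**2),     # f_1: Cauchy-like factor centred at 1
    -np.log1p((grid + 0.5)**2),     # f_2: Cauchy-like factor centred at -0.5
]
n_sites = len(log_f)

tau = np.full(n_sites, 1e-3)        # site precisions (start nearly flat)
nu = np.zeros(n_sites)              # site precision * mean

for sweep in range(20):
    for i in range(n_sites):
        # Cavity: the product of all sites except i.
        tau_cav = tau.sum() - tau[i]
        nu_cav = nu.sum() - nu[i]
        # Hybrid h_i ∝ f_i(theta) * cavity Gaussian, normalized on the grid.
        log_h = log_f[i] + nu_cav * grid - 0.5 * tau_cav * grid**2
        h = np.exp(log_h - log_h.max())
        h /= np.trapz(h, grid)
        mu_i = np.trapz(grid * h, grid)               # E_{h_i}[theta]
        v_i = np.trapz((grid - mu_i)**2 * h, grid)    # var_{h_i}
        # Moment-match: new site = matched Gaussian divided by the cavity.
        tau[i] = 1.0 / v_i - tau_cav
        nu[i] = mu_i / v_i - nu_cav

post_var = 1.0 / tau.sum()
post_mean = nu.sum() * post_var
print(post_mean, post_var)          # the EP Gaussian approximation of p
```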

17 Algorithm 5: smooth EP
Factorizing $p$ has split the energy landscape: $\psi(\theta) = \sum_i \phi_i(\theta)$.
For each component $\phi_i(\theta)$, use a different smoothing: $h_i \propto f_i \prod_{j \neq i} g_j \approx p$.
Then, update $g_i \approx f_i = \exp(-\phi_i)$.

18 πœ‡ 𝑖 = 𝐸 β„Ž 𝑖 πœƒ π‘Ÿ= 𝐸 β„Ž 𝑖 πœ™ 𝑖 β€² πœƒ 𝛽= π‘£π‘Ž π‘Ÿ β„Ž 𝑖 βˆ’1 𝐸 β„Ž 𝑖 πœƒβˆ’ πœ‡ 𝑖 πœ™ 𝑖 β€² πœƒ
Algorithm 5: smooth EP Initialize with any Gaussians 𝑔 1 , 𝑔 2 … 𝑔 𝑛 Loop: β„Ž 𝑖 ∝ 𝑓 𝑖 𝑗≠𝑖 𝑔 𝑗 πœ‡ 𝑖 = 𝐸 β„Ž 𝑖 πœƒ π‘Ÿ= 𝐸 β„Ž 𝑖 πœ™ 𝑖 β€² πœƒ 𝛽= π‘£π‘Ž π‘Ÿ β„Ž 𝑖 βˆ’1 𝐸 β„Ž 𝑖 πœƒβˆ’ πœ‡ 𝑖 πœ™ 𝑖 β€² πœƒ 𝑔 𝑖 πœƒ ∝ exp βˆ’π‘Ÿ πœƒβˆ’ πœ‡ 𝑖 βˆ’ 𝛽 2 πœƒβˆ’ πœ‡ 𝑖 2
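
A matching sketch of smooth EP on the same toy factorization follows; the only change from the classic EP sketch is how each new site is formed, and computing phi_i' by numerical differentiation is a simplification of mine.

```python
# A sketch of smooth EP: each new site g_i comes from E_{h_i}[phi_i'] and the
# covariance-style beta term, rather than from raw moment matching.
# phi_i' = (-log f_i)' is obtained by numerical differentiation on the grid.
import numpy as np

grid = np.linspace(-8.0, 8.0, 4001)
log_f = [
    -0.5 * grid**2,
    -np.log1p((grid - 1.0)**2),
    -np.log1p((grid + 0.5)**2),
]
phi_prime = [np.gradient(-lf, grid) for lf in log_f]   # phi_i' on the grid
n_sites = len(log_f)

tau = np.full(n_sites, 1e-3)        # site precisions
nu = np.zeros(n_sites)              # site precision * mean

for sweep in range(30):
    for i in range(n_sites):
        tau_cav = tau.sum() - tau[i]
        nu_cav = nu.sum() - nu[i]
        # Hybrid h_i ∝ f_i * cavity, normalized on the grid.
        log_h = log_f[i] + nu_cav * grid - 0.5 * tau_cav * grid**2
        h = np.exp(log_h - log_h.max())
        h /= np.trapz(h, grid)
        mu_i = np.trapz(grid * h, grid)
        var_i = np.trapz((grid - mu_i)**2 * h, grid)
        r = np.trapz(phi_prime[i] * h, grid)            # E_{h_i}[phi_i']
        beta = np.trapz((grid - mu_i) * phi_prime[i] * h, grid) / var_i
        # g_i(theta) ∝ exp(-r (theta - mu_i) - beta/2 (theta - mu_i)^2):
        # a Gaussian site with precision beta and mean mu_i - r/beta.
        tau[i] = beta
        nu[i] = beta * mu_i - r

post_var = 1.0 / tau.sum()
post_mean = nu.sum() * post_var
print(post_mean, post_var)
```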

19 Classic vs Smooth EP
Algorithm 4: computationally efficient, but completely unintuitive.
Algorithm 5: intuitive (linked to Newton's method) and tractable for analysis.
Which should we choose?

20 Conclusion
- Algorithm 1: iterating on Gaussians to perform GD
- Algorithm 2: smoothed GD computes the VB approximation
- Algorithm 3: hybrid smoothing computes the $D_\alpha$ approximation
- Algorithm 5: complicated hybrid smoothing which computes the EP approximation
We can re-use our understanding of Newton's method when we think of EP. A possible path towards improved EP algorithms?

21 Conclusion
This might prove a path towards theoretical results on EP.
It intuitively establishes the link between EP and VB:
- The only difference between Algorithms 2 and 5: smoothing with $q_n$ or with $h_i$
- In the limit where all $h_i \approx q_n$, EP ≈ VB
- This corresponds to a large number of weak factors
I hope that this can help you understand EP better and apply it to your own problems. Thank you very much for your attention.

