Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016
Guillaume Dehaene

1 Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016, Guillaume Dehaene
I'd like to thank the organizers for inviting me. I'm going to present some new work on expectation propagation, which shines a new light on this algorithm by showing that it performs a smoothed gradient descent.

2 Computational troubles in Bayesian inference
The problem: Bayesian methods are susceptible to computational trouble. If we want to approximate $p(\theta)$, the Gaussian approximations are:
- Laplace approximation + Gradient Descent
- Variational Bayes (and a variant)
- Expectation Propagation
You should think of the first choice as conservative and a bit boring, sort of like Jeb Bush, while the next two are the opposite of that, so maybe like 2007 Senator Barack Obama. What I will show in this talk is that these three methods are closely linked.

3 Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode, computed using Gradient Descent on $\psi = -\log p$.
[Figure: an example posterior density, probability as a function of $\theta$.]
So here is one example of a posterior distribution. The Laplace approximation consists in finding its maximum and fitting a purely local Gaussian approximation there.

4 Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode, computed using Gradient Descent on $\psi = -\log p$.
The mathematically conservative choice:
- Gradient Descent is well-understood
- Laplace is exact in the large-data limit
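
To make this concrete, here is a minimal Python sketch of the Laplace approximation, under my own illustrative assumptions: the quartic target below is a hypothetical running example, not one used in the talk.

```python
# A minimal sketch (not from the talk) of the Laplace approximation for a
# hypothetical 1-D posterior. psi = -log p is the quantity that gradient
# descent / Newton's method minimizes.
import numpy as np

def psi(theta):        # -log p(theta) for an illustrative quartic target
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def psi_prime(theta):  # psi'
    return theta**3 + (theta - 1.0)

def psi_second(theta): # psi''
    return 3.0 * theta**2 + 1.0

# Newton's method on psi locates the mode of p ...
theta = 0.0
for _ in range(50):
    theta -= psi_prime(theta) / psi_second(theta)

# ... and the Laplace approximation is N(mode, 1 / psi''(mode)).
laplace_mean = theta
laplace_var = 1.0 / psi_second(theta)
print(laplace_mean, laplace_var)
```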

5 Physical intuitions
Gradient Descent ≈ dynamics of a sliding object
[Figure: two panels showing the negative log probability landscape with an object sliding down the slope.]
Final great feature: Gradient Descent appeals to our physical intuitions because it matches the dynamics of an object sliding down a slope.

6 Linking GD, VB and EP
VB and EP iterate Gaussian approximations.
We can define an algorithm that:
- iterates Gaussian approximations,
- computes the Laplace approximation,
- and does Gradient Descent.

7 Algorithm 1: disguised gradient descent
Initialize with any Gaussian $q_0$. Loop:
- $\mu_n = E_{q_n}[\theta]$
- $r = \psi'(\mu_n)$
- $\beta = \psi''(\mu_n)$
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
The resulting mean update is $\mu_{n+1} = \mu_n - \frac{\psi'(\mu_n)}{\psi''(\mu_n)}$: this is Newton's method!
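
Below is a sketch of Algorithm 1 on the same hypothetical quartic target, making the equivalence with Newton's method explicit; the target and its derivatives are my illustrative choices, not part of the talk.

```python
# A sketch of Algorithm 1: the Gaussian q_{n+1} built from the gradient and
# curvature of psi at the current mean has mean mu_n - psi'(mu_n)/psi''(mu_n),
# so the iteration is Newton's method.
import numpy as np

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

def psi_second(theta):
    return 3.0 * theta**2 + 1.0

mu, var = 0.0, 1.0                 # parameters of the initial Gaussian q_0
for n in range(50):
    r = psi_prime(mu)              # gradient of psi at the current mean
    beta = psi_second(mu)          # curvature of psi at the current mean
    # q_{n+1}(theta) ∝ exp(-r (theta - mu) - beta/2 (theta - mu)^2);
    # completing the square gives its mean and variance:
    mu, var = mu - r / beta, 1.0 / beta

print(mu, var)                     # converges to the Laplace approximation
```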

8 Algorithm 1: disguised gradient descent
Newton's method: $\psi \approx$ quadratic.
DGD: $p \approx \exp(-\text{quadratic})$, i.e. $p \approx$ Gaussian.

9 Variational Bayes Gaussian approximation
The Variational Bayes approach: minimize $\mathrm{KL}(q, p) = E_q\left[\log\frac{q}{p}\right]$ for $q$ a Gaussian.
Local minima satisfy (Opper, Archambeau, 2007):
- $E_{q^*}[\psi'] = 0$
- $E_{q^*}[\psi''] = \mathrm{var}_{q^*}^{-1}$
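
As a sanity check of these conditions (my own addition, not part of the talk), one can minimize $\mathrm{KL}(q, p)$ directly over the Gaussian parameters and verify both stationarity conditions numerically on the hypothetical quartic example.

```python
# A quick numerical check: minimize KL(q, p) over Gaussian q = N(mu, s^2)
# directly, then verify the two stationarity conditions. psi is the same
# hypothetical quartic running example.
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import quad

def psi(theta):
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def expect(f, mu, s):
    """E_{N(mu, s^2)}[f(theta)] by numerical integration."""
    density = lambda t: np.exp(-0.5 * ((t - mu) / s)**2) / (s * np.sqrt(2 * np.pi))
    return quad(lambda t: f(t) * density(t), mu - 10 * s, mu + 10 * s)[0]

def kl(params):
    mu, log_s = params
    s = np.exp(log_s)
    # KL(q, p) = E_q[log q] + E_q[psi] + const = -log s + E_q[psi] + const
    return -log_s + expect(psi, mu, s)

res = minimize(kl, x0=[0.0, 0.0])
mu_star, s_star = res.x[0], np.exp(res.x[1])

psi_prime = lambda t: t**3 + (t - 1.0)
psi_second = lambda t: 3.0 * t**2 + 1.0
print(expect(psi_prime, mu_star, s_star))                    # should be ≈ 0
print(expect(psi_second, mu_star, s_star), 1.0 / s_star**2)  # should match
```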

10 Algorithm 2: smoothed gradient descent
Initialize with any Gaussian $q_0$. Loop:
- $\mu_n = E_{q_n}[\theta]$
- $r = E_{q_n}[\psi'(\theta)] \approx \psi'(\mu_n)$
- $\beta = E_{q_n}[\psi''(\theta)] \approx \psi''(\mu_n)$
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
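
A sketch of Algorithm 2 on the same hypothetical target follows; using Gauss-Hermite quadrature for the Gaussian expectations is my implementation choice, not something the talk prescribes.

```python
# A sketch of Algorithm 2: identical update to Algorithm 1, but the gradient
# and curvature of psi are averaged under the current Gaussian q_n.
import numpy as np

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

def psi_second(theta):
    return 3.0 * theta**2 + 1.0

nodes, weights = np.polynomial.hermite.hermgauss(40)

def gauss_expect(f, mu, var):
    """E_{N(mu, var)}[f(theta)] by Gauss-Hermite quadrature."""
    theta = mu + np.sqrt(2.0 * var) * nodes
    return np.sum(weights * f(theta)) / np.sqrt(np.pi)

mu, var = 0.0, 1.0                            # initial Gaussian q_0
for n in range(100):
    r = gauss_expect(psi_prime, mu, var)      # smoothed version of psi'(mu_n)
    beta = gauss_expect(psi_second, mu, var)  # smoothed version of psi''(mu_n)
    mu, var = mu - r / beta, 1.0 / beta

# At the fixed point, E_q[psi'] = 0 and E_q[psi''] = 1/var(q): exactly the
# stationarity conditions of the Gaussian Variational Bayes problem.
print(mu, var)
```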

11 Algorithm 2: smoothed gradient descent

12 α-Divergence minimization
If instead of KL, we minimize: $D_\alpha(p, q) = \int p^{1-\alpha} q^{\alpha}$
then local minima $q^*$ are such that, with $h^* \propto p^{1-\alpha} (q^*)^{\alpha}$:
- $E_{h^*}[\psi'] = 0$
- $E_{h^*}[(\theta - \mu_h)\,\psi'] = 1$

13 Algorithm 3: hybrid smoothing GD
Initialize with any Gaussian $q_0$. Loop:
- $h_n \propto p^{1-\alpha} q_n^{\alpha}$
- $\mu_n = E_{h_n}[\theta]$
- $r = E_{h_n}[\psi'(\theta)] \approx \psi'(\mu_n)$
- $\beta = \mathrm{var}_{h_n}^{-1}\, E_{h_n}[(\theta - \mu_h)\,\psi'(\theta)] \approx \psi''(\mu_n)$ (for a Gaussian $q_n$, $E_{q_n}[\psi''] = \mathrm{var}_{q_n}^{-1}\, E_{q_n}[(\theta - \mu_n)\,\psi']$, so this generalizes the $\beta$ of Algorithm 2)
- $q_{n+1}(\theta) \propto \exp\left(-r(\theta - \mu_n) - \frac{\beta}{2}(\theta - \mu_n)^2\right)$
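
Here is a sketch of Algorithm 3 on the same hypothetical target, with alpha = 0.5 chosen arbitrarily; computing the hybrid's moments on a grid is a toy-setting shortcut, not a general recipe.

```python
# A sketch of Algorithm 3: the smoothing distribution is the hybrid
# h_n ∝ p^(1-alpha) q_n^alpha; its moments are computed on a 1-D grid.
import numpy as np

def psi(theta):
    return 0.25 * theta**4 + 0.5 * (theta - 1.0)**2

def psi_prime(theta):
    return theta**3 + (theta - 1.0)

grid = np.linspace(-6.0, 6.0, 4001)
log_p = -psi(grid)                        # unnormalized log target
alpha = 0.5

mu, var = 0.0, 1.0                        # initial Gaussian q_0
for n in range(100):
    log_q = -0.5 * (grid - mu)**2 / var   # unnormalized log q_n
    log_h = (1.0 - alpha) * log_p + alpha * log_q
    h = np.exp(log_h - log_h.max())
    h /= np.trapz(h, grid)                # normalized hybrid density h_n

    mu_h = np.trapz(grid * h, grid)                 # E_{h_n}[theta]
    var_h = np.trapz((grid - mu_h)**2 * h, grid)    # var_{h_n}
    r = np.trapz(psi_prime(grid) * h, grid)         # E_{h_n}[psi']
    beta = np.trapz((grid - mu_h) * psi_prime(grid) * h, grid) / var_h

    mu, var = mu_h - r / beta, 1.0 / beta

print(mu, var)
```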

14 Interpreting algorithm 3
The only difference (not obvious for the $\beta$-term): replacing $q_n$, a poor approximation to $p$, by a superior hybrid approximation: $h_n \propto p^{1-\alpha} q_n^{\alpha} \approx p$.

15 Expectation Propagation
Assume that the target can be factorized: $p(\theta) \propto \prod_i f_i(\theta)$. Then EP seeks a Gaussian approximation for each $f_i$: $g_i(\theta) \approx f_i(\theta)$. They are improved iteratively.

16 Algorithm 4: classic Expectation Propagation
Loop over sites $i$:
- Compute the $i$th hybrid: $h_i \propto f_i(\theta) \prod_{j \neq i} g_j(\theta) \approx p$, and its mean and variance: $\mu_i = E_{h_i}[\theta]$, $v_i = \mathrm{var}_{h_i}$
- New $i$th approximation: $g_i(\theta) = \exp\left(-\frac{(\theta - \mu_i)^2}{2\, v_i}\right) \Big/ \prod_{j \neq i} g_j(\theta)$ (the moment-matched Gaussian in the numerator $\approx p$, so $g_i \approx f_i$)
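
The following sketch of classic EP is built on illustrative assumptions: the three factors (a Gaussian prior and two Cauchy-like likelihood terms) are toy choices of mine, and hybrid moments are computed on a grid rather than analytically.

```python
# A sketch of classic EP in 1-D. Sites g_i are stored as natural parameters
# (precision tau_i and precision-times-mean nu_i).
import numpy as np

grid = np.linspace(-8.0, 8.0, 4001)
log_f = [
    -0.5 * grid**2,                 # f_0: Gaussian prior factor
    -np.log1p((grid - 1.0)**2),     # f_1: Cauchy-like factor centred at 1
    -np.log1p((grid + 0.5)**2),     # f_2: Cauchy-like factor centred at -0.5
]
n_sites = len(log_f)

tau = np.full(n_sites, 1e-3)        # site precisions (start nearly flat)
nu = np.zeros(n_sites)              # site precision * mean

for sweep in range(20):
    for i in range(n_sites):
        # Cavity: the product of all sites except i.
        tau_cav = tau.sum() - tau[i]
        nu_cav = nu.sum() - nu[i]
        # Hybrid h_i ∝ f_i(theta) * cavity Gaussian, normalized on the grid.
        log_h = log_f[i] + nu_cav * grid - 0.5 * tau_cav * grid**2
        h = np.exp(log_h - log_h.max())
        h /= np.trapz(h, grid)
        mu_i = np.trapz(grid * h, grid)               # E_{h_i}[theta]
        v_i = np.trapz((grid - mu_i)**2 * h, grid)    # var_{h_i}
        # Moment-match: new site = matched Gaussian divided by the cavity.
        tau[i] = 1.0 / v_i - tau_cav
        nu[i] = mu_i / v_i - nu_cav

post_var = 1.0 / tau.sum()
post_mean = nu.sum() * post_var
print(post_mean, post_var)          # the EP Gaussian approximation of p
```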

17 Algorithm 5: smooth EP
Factorizing $p$ has split the energy landscape: $\psi(\theta) = \sum_i \phi_i(\theta)$.
For each component $\phi_i(\theta)$, use a different smoothing: $h_i \propto f_i \prod_{j \neq i} g_j \approx p$.
Then, update $g_i \approx f_i = \exp(-\phi_i)$.

18 πœ‡ 𝑖 = 𝐸 β„Ž 𝑖 πœƒ π‘Ÿ= 𝐸 β„Ž 𝑖 πœ™ 𝑖 β€² πœƒ 𝛽= π‘£π‘Ž π‘Ÿ β„Ž 𝑖 βˆ’1 𝐸 β„Ž 𝑖 πœƒβˆ’ πœ‡ 𝑖 πœ™ 𝑖 β€² πœƒ
Algorithm 5: smooth EP Initialize with any Gaussians 𝑔 1 , 𝑔 2 … 𝑔 𝑛 Loop: β„Ž 𝑖 ∝ 𝑓 𝑖 𝑗≠𝑖 𝑔 𝑗 πœ‡ 𝑖 = 𝐸 β„Ž 𝑖 πœƒ π‘Ÿ= 𝐸 β„Ž 𝑖 πœ™ 𝑖 β€² πœƒ 𝛽= π‘£π‘Ž π‘Ÿ β„Ž 𝑖 βˆ’1 𝐸 β„Ž 𝑖 πœƒβˆ’ πœ‡ 𝑖 πœ™ 𝑖 β€² πœƒ 𝑔 𝑖 πœƒ ∝ exp βˆ’π‘Ÿ πœƒβˆ’ πœ‡ 𝑖 βˆ’ 𝛽 2 πœƒβˆ’ πœ‡ 𝑖 2
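
A matching sketch of smooth EP on the same toy factorization follows; the only change from the classic EP sketch is how each new site is formed, and computing phi_i' by numerical differentiation is a simplification of mine.

```python
# A sketch of smooth EP: each new site g_i comes from E_{h_i}[phi_i'] and the
# covariance-style beta term, rather than from raw moment matching.
# phi_i' = (-log f_i)' is obtained by numerical differentiation on the grid.
import numpy as np

grid = np.linspace(-8.0, 8.0, 4001)
log_f = [
    -0.5 * grid**2,
    -np.log1p((grid - 1.0)**2),
    -np.log1p((grid + 0.5)**2),
]
phi_prime = [np.gradient(-lf, grid) for lf in log_f]   # phi_i' on the grid
n_sites = len(log_f)

tau = np.full(n_sites, 1e-3)        # site precisions
nu = np.zeros(n_sites)              # site precision * mean

for sweep in range(30):
    for i in range(n_sites):
        tau_cav = tau.sum() - tau[i]
        nu_cav = nu.sum() - nu[i]
        # Hybrid h_i ∝ f_i * cavity, normalized on the grid.
        log_h = log_f[i] + nu_cav * grid - 0.5 * tau_cav * grid**2
        h = np.exp(log_h - log_h.max())
        h /= np.trapz(h, grid)
        mu_i = np.trapz(grid * h, grid)
        var_i = np.trapz((grid - mu_i)**2 * h, grid)
        r = np.trapz(phi_prime[i] * h, grid)            # E_{h_i}[phi_i']
        beta = np.trapz((grid - mu_i) * phi_prime[i] * h, grid) / var_i
        # g_i(theta) ∝ exp(-r (theta - mu_i) - beta/2 (theta - mu_i)^2):
        # a Gaussian site with precision beta and mean mu_i - r/beta.
        tau[i] = beta
        nu[i] = beta * mu_i - r

post_var = 1.0 / tau.sum()
post_mean = nu.sum() * post_var
print(post_mean, post_var)
```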

19 Classic vs Smooth EP
Algorithm 4: computationally efficient, but completely unintuitive.
Algorithm 5: intuitive (linked to Newton's method) and tractable for analysis.
Which should we choose?

20 Conclusion
- Algorithm 1: iterating on Gaussians to perform GD
- Algorithm 2: smoothed GD computes the VB approximation
- Algorithm 3: hybrid smoothing computes the $D_\alpha$ approximation
- Algorithm 5: complicated hybrid smoothing which computes the EP approximation
We can re-use our understanding of Newton's method when we think of EP. A possible path towards improved EP algorithms?

21 Conclusion
This might prove a path towards theoretical results on EP.
It intuitively establishes the link between EP and VB:
- The only difference between Algorithms 2 and 5: smoothing with $q_n$ or with $h_i$
- In the limit where all $h_i \approx q_n$, EP ≈ VB
- This corresponds to a large number of weak factors
I hope that this can help you understand EP better and apply it to your own problems. Thank you very much for your attention.

