1 Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient
Aviv Rosenberg, 10/01/18, Seminar on Experts and Bandits

2 Online Convex Optimization Problem
Convex set $S$. In every iteration we choose $x_t \in S$ and then receive a convex cost function $c_t : S \to [-C, C]$ for some $C > 0$. We want to minimize the regret $\sum_{t=1}^{T} c_t(x_t) - \min_{x \in S} \sum_{t=1}^{T} c_t(x)$.

3 Bandit Setting
The Gradient Descent approach: $x_{t+1} = x_t - \eta \nabla c_t(x_t)$. Last week we saw an $O(\sqrt{T})$ regret bound. But now, instead of $c_t$, we only observe $c_t(x_t)$, so we can't compute $\nabla c_t(x_t)$! We still want to use Gradient Descent. Solution: estimate the gradient using a single point. We will show an $O(T^{3/4})$ regret bound.

4 Notation and Assumptions
$\mathbb{B} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$; $\mathbb{S} = \{x \in \mathbb{R}^d : \|x\| = 1\}$
Expected regret: $\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{T} c_t(x)$
Projection of a point $x$ onto a convex set $S$: $P_S(x) = \arg\min_{z \in S} \|x - z\|$
Assume $S$ is a convex set such that $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$
$(1-\alpha)S = \{(1-\alpha)x : x \in S\} \subseteq S$ is also convex and $0 \in (1-\alpha)S \subseteq R\mathbb{B}$: if $y \in (1-\alpha)S$, then $y = (1-\alpha)x = \alpha \cdot 0 + (1-\alpha)x \in S$, since $0 \in S$ and $S$ is convex.
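For intuition, here is a minimal sketch of the projection $P_S$ in the special case $S = R\mathbb{B}$, where it has a closed form; a general convex body would need a convex solver (this example is an illustration, not part of the slides).

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection P_S(x) = argmin_{z in S} ||x - z|| for S = radius * B.

    Points already inside the ball are unchanged; outside points are rescaled
    onto the boundary. For a general convex S the projection needs a solver."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)
```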

5 Part 1 Gradient Estimation

6 Gradient Estimation
For a function $c_t$ and $\delta > 0$ define the smoothed function $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$.
Lemma: $\nabla \hat{c}_t(y) = \frac{d}{\delta}\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u]$
To get an unbiased estimator for $\nabla \hat{c}_t(y)$ we can sample a unit vector $u$ uniformly and compute $\frac{d}{\delta} c_t(y + \delta u)\, u$.
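Below is a minimal sketch of this one-point estimator, together with an illustrative Monte-Carlo check against a quadratic test function (the test function and sample sizes are arbitrary choices, not from the slides). For $c(x) = \|x\|^2$ the smoothed gradient equals $2y$ exactly, so averaged estimates should be close to it.

```python
import numpy as np

def one_point_gradient_estimate(c, y, delta, rng):
    """Single-evaluation estimate of the gradient of the smoothed function
    c_hat(y) = E_{v in B}[c(y + delta * v)]: sample u uniformly on the unit
    sphere and return (d / delta) * c(y + delta * u) * u."""
    d = y.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                       # uniform unit direction
    return (d / delta) * c(y + delta * u) * u

rng = np.random.default_rng(0)
c = lambda x: x @ x                               # c(x) = ||x||^2
y = np.array([0.5, -0.2, 0.1])
avg = np.mean([one_point_gradient_estimate(c, y, 0.1, rng)
               for _ in range(100_000)], axis=0)
print(avg, 2 * y)                                 # the two should roughly agree
```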

7 Proof
$\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$
For $d = 1$: $\hat{c}_t(y) = \mathbb{E}_{v \in [-1,1]}[c_t(y + \delta v)] = \mathbb{E}_{v \in [-\delta,\delta]}[c_t(y + v)] = \frac{1}{2\delta}\int_{-\delta}^{\delta} c_t(y + v)\, dv$
Differentiating using the fundamental theorem of calculus:
$\nabla \hat{c}_t(y) = \hat{c}_t'(y) = \frac{c_t(y + \delta) - c_t(y - \delta)}{2\delta} = \frac{1}{\delta}\, \mathbb{E}_{u \in \{-1,1\}}[c_t(y + \delta u)\, u]$

8 Proof Cont.
For $d > 1$, Stokes' theorem gives
$\nabla \int_{\delta\mathbb{B}} c_t(y + v)\, dv = \int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du$
Multiplying and dividing each side by the corresponding volume:
$\mathrm{Vol}_d(\delta\mathbb{B})\, \nabla \frac{\int_{\delta\mathbb{B}} c_t(y + v)\, dv}{\mathrm{Vol}_d(\delta\mathbb{B})} = \mathrm{Vol}_{d-1}(\delta\mathbb{S})\, \frac{\int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du}{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}$
$\mathrm{Vol}_d(\delta\mathbb{B})\, \nabla\, \mathbb{E}_{v \in \delta\mathbb{B}}[c_t(y + v)] = \mathrm{Vol}_{d-1}(\delta\mathbb{S})\, \mathbb{E}_{u \in \delta\mathbb{S}}\!\left[c_t(y + u)\, \frac{u}{\|u\|}\right]$
$\mathrm{Vol}_d(\delta\mathbb{B})\, \nabla\, \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)] = \mathrm{Vol}_{d-1}(\delta\mathbb{S})\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u]$

9 Proof Cont.
Since $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$, the last equation becomes
$\mathrm{Vol}_d(\delta\mathbb{B})\, \nabla \hat{c}_t(y) = \mathrm{Vol}_{d-1}(\delta\mathbb{S})\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u]$
$\nabla \hat{c}_t(y) = \frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})}\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u]$
The following fact concludes the proof: $\frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})} = \frac{d}{\delta}$. For example, in $\mathbb{R}^2$: $\frac{2\pi\delta}{\pi\delta^2} = \frac{2}{\delta}$.

10 Part 2 Regret Bound for Estimated Gradients

11 Zinkevich's Theorem
Let $h_1, \ldots, h_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions. Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta \nabla h_t(y_t))$. Let $G = \max_t \|\nabla h_t(y_t)\|$. Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$
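As a reference point, here is a minimal sketch of the projected online gradient descent update in the theorem; the quadratic losses and the unit-ball feasible set in the usage example are arbitrary illustrations, not from the slides.

```python
import numpy as np

def projected_ogd(grads, project, eta, d):
    """Online gradient descent with projection: y_{t+1} = P(y_t - eta * grad_t(y_t)).

    grads is a list of functions returning the gradient of h_t at a point;
    project maps a point back into the feasible set."""
    y = np.zeros(d)
    plays = []
    for grad in grads:
        plays.append(y)
        y = project(y - eta * grad(y))
    return plays

# Illustrative usage: h_t(y) = ||y - z_t||^2 over the unit ball.
rng = np.random.default_rng(0)
T, d = 100, 3
targets = rng.normal(size=(T, d))
grads = [lambda y, z=z: 2 * (y - z) for z in targets]
project = lambda x: x if np.linalg.norm(x) <= 1 else x / np.linalg.norm(x)
plays = projected_ogd(grads, project, eta=0.1, d=d)
```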

12 Expected Zinkevich’s Theorem
Let $c_1, \ldots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions. Let $g_1, \ldots, g_T$ be random vectors such that $\mathbb{E}[g_t \mid y_t] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$). Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$. Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \sum_{t=1}^{T} c_t(y) \le RG\sqrt{T}$

13 Proof
$y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$, $y_1 = 0$. Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$. Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly regular (projected) gradient descent on the $h_t$. From Zinkevich's Theorem:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T} \quad (1)$

14 Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}[g_t \mid y_t] = \nabla c_t(y_t)$.
Notice that $\mathbb{E}[\xi_t \mid y_t] = \mathbb{E}[g_t - \nabla c_t(y_t) \mid y_t] = \mathbb{E}[g_t \mid y_t] - \nabla c_t(y_t) = 0$. Therefore
$\mathbb{E}[y_t^T \xi_t] = \mathbb{E}\left[\mathbb{E}[y_t^T \xi_t \mid y_t]\right] = \mathbb{E}\left[y_t^T\, \mathbb{E}[\xi_t \mid y_t]\right] = 0$ and $\mathbb{E}[y^T \xi_t] = y^T \mathbb{E}[\xi_t] = y^T \mathbb{E}\left[\mathbb{E}[\xi_t \mid y_t]\right] = 0$
We get the following connections:
$\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)] + \mathbb{E}[y_t^T \xi_t] = \mathbb{E}[c_t(y_t)] \quad (3)$
$\mathbb{E}[h_t(y)] = \mathbb{E}[c_t(y)] + \mathbb{E}[y^T \xi_t] = c_t(y) \quad (2)$

15 Proof Cont.
Recall (1) $\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$, (2) $\mathbb{E}[h_t(y)] = c_t(y)$, and (3) $\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)]$. Then
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \sum_{t=1}^{T} c_t(y) \overset{(3)}{=} \sum_{t=1}^{T} \mathbb{E}[h_t(y_t)] - \sum_{t=1}^{T} c_t(y) \overset{(2)}{=} \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t)\right] - \sum_{t=1}^{T} \mathbb{E}[h_t(y)] = \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y)\right] \overset{(1)}{\le} RG\sqrt{T}$

16 Part 3 BGD Algorithm

17 Ideal World Algorithm
Recall $g_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$, so that $\mathbb{E}[g_t \mid y_t] = \nabla \hat{c}_t(y_t)$.
$y_1 \leftarrow 0$
For $t \in \{1, \ldots, T\}$:
  Select a unit vector $u_t$ uniformly at random
  Play $y_t$ and observe the cost $c_t(y_t)$
  Compute $g_t$ using $u_t$
  $y_{t+1} \leftarrow P_S(y_t - \eta g_t)$
To compute $g_t$ we need $c_t(y_t + \delta u_t)$, so we need to play $x_t = y_t + \delta u_t$ instead. But now we have problems: is $x_t \in S$? And the regret is measured on $c_t(x_t)$, although we are running estimated gradient descent on $\hat{c}_t(y_t)$.

18 Bandit Gradient Descent Algorithm (BGD)
Parameters: $\eta > 0$, $\delta > 0$, $0 < \alpha < 1$
$y_1 \leftarrow 0$
For $t \in \{1, \ldots, T\}$:
  Select a unit vector $u_t$ uniformly at random
  $x_t \leftarrow y_t + \delta u_t$
  Play $x_t$ and observe the cost $c_t(x_t) = c_t(y_t + \delta u_t)$
  $g_t \leftarrow \frac{d}{\delta} c_t(x_t)\, u_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$
  $y_{t+1} \leftarrow P_{(1-\alpha)S}(y_t - \eta g_t)$
Two things remain to show: that $x_t \in S$, and that the low regret we get for $\hat{c}_t(y_t)$ over $(1-\alpha)S$ can be converted to low regret for $c_t(x_t)$ over $S$.
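A minimal sketch of the BGD loop above, assuming for concreteness that $S$ is the ball $R\mathbb{B}$ (so projection onto $(1-\alpha)S$ is just a rescaling); the loss functions passed in are placeholders, not from the slides.

```python
import numpy as np

def bgd(costs, d, eta, delta, alpha, R=1.0, seed=0):
    """Bandit Gradient Descent sketch: play x_t = y_t + delta * u_t, observe only
    c_t(x_t), form the one-point gradient estimate, and take a projected step
    inside the shrunk set (1 - alpha) * S (here S = R * B)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(d)
    total = 0.0
    for c_t in costs:
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                   # uniform unit direction
        x = y + delta * u                        # the point actually played
        loss = c_t(x)                            # bandit feedback only
        total += loss
        g = (d / delta) * loss * u               # one-point gradient estimate
        z = y - eta * g
        shrunk_radius = (1 - alpha) * R
        if np.linalg.norm(z) > shrunk_radius:    # project onto (1 - alpha) * S
            z *= shrunk_radius / np.linalg.norm(z)
        y = z
    return total
```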

19 Observation 1
For any $x \in S$: $\sum_{t=1}^{T} c_t((1-\alpha)x) \le \sum_{t=1}^{T} c_t(x) + 2\alpha C T$
Proof. From convexity,
$c_t((1-\alpha)x) = c_t(\alpha \cdot 0 + (1-\alpha)x) \le \alpha c_t(0) + (1-\alpha) c_t(x) = c_t(x) + \alpha(c_t(0) - c_t(x)) \le c_t(x) + 2\alpha C$

20 Observation 2
For any $y \in (1-\alpha)S$ and any $x \in S$:
$|c_t(x) - c_t(y)| \le \frac{2C}{\alpha r}\|y - x\|$
Proof. Denote $\Delta = x - y$. If $\|\Delta\| \ge \alpha r$, the observation follows from $2C \le \frac{2C}{\alpha r}\|y - x\|$. Otherwise $\|\Delta\| < \alpha r$; let $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$. Then $z \in S$, since $y \in (1-\alpha)S$ and $\alpha r \frac{\Delta}{\|\Delta\|} \in \alpha r\mathbb{B} \subseteq \alpha S$, so $z \in (1-\alpha)S + \alpha S \subseteq S$.

21 Proof Cont.
Recall $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$ and $\Delta = x - y$. Notice that $x = \frac{\|\Delta\|}{\alpha r} z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y$, so from convexity
$c_t(x) = c_t\!\left(\frac{\|\Delta\|}{\alpha r} z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y\right) \le \frac{\|\Delta\|}{\alpha r} c_t(z) + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) c_t(y) = c_t(y) + \frac{c_t(z) - c_t(y)}{\alpha r}\|\Delta\| \le c_t(y) + \frac{2C}{\alpha r}\|\Delta\|$
The other direction is proved the same way.

22 BGD Regret Theorem
For any $T \ge \left(\frac{3Rd}{2r}\right)^2$ and the following parameters
$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = \sqrt[3]{\frac{rR^2d^2}{12T}}, \qquad \alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}$
for every $x \in S$, BGD achieves regret
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 3C\, T^{5/6}\, \sqrt[3]{\frac{dR}{r}}$
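For concreteness, a tiny helper that evaluates these parameter choices (as reconstructed above); the numeric values plugged in are arbitrary and only need to satisfy $T \ge (3Rd/2r)^2$.

```python
import numpy as np

def bgd_parameters(R, r, d, C, T):
    """Parameter choices from the BGD regret theorem (non-Lipschitz case)."""
    delta = (r * R**2 * d**2 / (12 * T)) ** (1 / 3)
    alpha = (3 * R * d / (2 * r * np.sqrt(T))) ** (1 / 3)
    eta = delta * R / (d * C * np.sqrt(T))
    return eta, delta, alpha

# Arbitrary illustrative values with T >= (3 * R * d / (2 * r))**2 = 900.
print(bgd_parameters(R=1.0, r=0.5, d=10, C=1.0, T=10**6))
```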

23 Proof
Recall $g_t = \frac{d}{\delta} c_t(x_t)\, u_t$, $x_t = y_t + \delta u_t$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
First we need to show that $x_t \in S$. Notice that $(1-\alpha)S + \alpha r\mathbb{B} \subseteq (1-\alpha)S + \alpha S \subseteq S$. Since $y_t \in (1-\alpha)S$, we just need to show that $\delta \le \alpha r$:
$\delta = \sqrt[3]{\frac{rR^2d^2}{12T}}, \qquad \alpha r = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}\, r = \sqrt[3]{\frac{3Rr^2d}{2\sqrt{T}}}$
and $\delta \le \alpha r$ holds because $T \ge \left(\frac{3Rd}{2r}\right)^2$.

24 Proof Cont.
Recall $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y_t + \delta v)]$ and $\nabla \hat{c}_t(y_t) = \frac{d}{\delta}\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y_t + \delta u)\, u]$.
Now we want to bound the regret. We have $\mathbb{E}[g_t \mid y_t] = \nabla \hat{c}_t(y_t)$ and $\|g_t\| = \left\|\frac{d}{\delta} c_t(x_t)\, u_t\right\| \le \frac{dC}{\delta} =: G$.
The Expected Zinkevich Theorem, applied to the $\hat{c}_t$ with $\eta = \frac{R}{G\sqrt{T}} = \frac{\delta R}{dC\sqrt{T}}$, gives for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) \le RG\sqrt{T} = \frac{RdC\sqrt{T}}{\delta} \quad (1)$

25 Proof Cont.
Recall Observation 2, $|c_t(y) - c_t(x)| \le \frac{2C}{\alpha r}\|y - x\|$ for $y \in (1-\alpha)S$ and $x \in S$, and $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y_t + \delta v)]$.
From Observation 2 we get
$|\hat{c}_t(y_t) - c_t(x_t)| \le |\hat{c}_t(y_t) - c_t(y_t)| + |c_t(y_t) - c_t(x_t)| \le 2\cdot\frac{2C}{\alpha r}\delta, \qquad |\hat{c}_t(y) - c_t(y)| \le \frac{2C}{\alpha r}\delta$
Now, for $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) \le \mathbb{E}\left[\sum_{t=1}^{T}\left(\hat{c}_t(y_t) + 2\cdot\frac{2C}{\alpha r}\delta\right)\right] - \sum_{t=1}^{T}\left(\hat{c}_t(y) - \frac{2C}{\alpha r}\delta\right) = \ldots$ (continued on the next slide)

26 Proof Cont.
Using (1), $\mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) \le \frac{RdC\sqrt{T}}{\delta}$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) \le \mathbb{E}\left[\sum_{t=1}^{T}\left(\hat{c}_t(y_t) + 2\cdot\frac{2C}{\alpha r}\delta\right)\right] - \sum_{t=1}^{T}\left(\hat{c}_t(y) - \frac{2C}{\alpha r}\delta\right) = \mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) + 3T\cdot\frac{2C}{\alpha r}\delta \le \frac{RdC\sqrt{T}}{\delta} + 3T\cdot\frac{2C}{\alpha r}\delta \quad (2)$

27 Proof Cont.
Take $y = (1-\alpha)x$ for some $x \in S$, so we can use Observation 1, $\sum_{t=1}^{T} c_t((1-\alpha)x) \le \sum_{t=1}^{T} c_t(x) + 2\alpha C T$, together with (2):
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t((1-\alpha)x) + 2\alpha C T = \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) + 2\alpha C T \le \frac{RdC\sqrt{T}}{\delta} + T\cdot\frac{6C}{\alpha r}\delta + 2\alpha C T$
Substituting the parameters $\delta = \sqrt[3]{\frac{rR^2d^2}{12T}}$ and $\alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}$ finishes the proof.

28 BGD with Lipschitz Regret Theorem
If all $c_t$ are $L$-Lipschitz, then for $T$ sufficiently large and the parameters
$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = T^{-1/4}\sqrt{\frac{RdCr}{3(Lr + C)}}, \qquad \alpha = \frac{\delta}{r}$
BGD achieves, for every $x \in S$, regret
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 2T^{3/4}\sqrt{3RdC\left(L + \frac{C}{r}\right)}$

29 Part 4 Reshaping

30 Removing the Dependence on $1/r$
There are algorithms that, for a convex set $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$, find an affine transformation $T$ that puts $S$ in near-isotropic position and run in time $O\!\left(d^4\, \mathrm{polylog}\!\left(d, \frac{R}{r}\right)\right)$. A set $T(S) \subseteq \mathbb{R}^d$ is in isotropic position if the covariance matrix of a uniform random sample from $T(S)$ is the identity matrix. This gives us $\mathbb{B} \subseteq T(S) \subseteq d\mathbb{B}$, so the new parameters are $R = d$ and $r = 1$. Also, if $c_t$ is $L$-Lipschitz, then $c_t \circ T^{-1}$ is $LR$-Lipschitz.
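A rough sketch of the whitening idea behind near-isotropic position, assuming we already have (approximately) uniform samples from $S$; producing such samples efficiently is exactly what the referenced $O(d^4\,\mathrm{polylog})$ algorithms do, and is not shown here.

```python
import numpy as np

def isotropic_map(samples):
    """Given (approximately) uniform samples from S (rows of a 2D array),
    return an affine map x -> A @ (x - mu) whose image has zero mean and
    identity covariance, i.e. the whitening step of isotropic reshaping."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                   # cov is symmetric PSD
    A = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # A = cov^{-1/2}
    return lambda x: A @ (x - mu)
```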

31 Removing the Dependence on $1/r$
So if we first put $S$ in near-isotropic position, we get the regret bound
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 6\, T^{3/4}\, d\left(\sqrt{CLR} + C\right)$
and, without the Lipschitz condition,
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 6\, T^{5/6}\, dC$
(compare with the earlier bounds $2T^{3/4}\sqrt{3RdC(L + \frac{C}{r})}$ and $3C\,T^{5/6}\sqrt[3]{\frac{dR}{r}}$).

32 Part 5 Adaptive Adversary

33 Expected Adaptive Zinkevich’s Theorem
Let $c_1, \ldots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions ($c_t$ may depend on $y_1, \ldots, y_{t-1}$). Let $g_1, \ldots, g_T$ be random vectors such that $\mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$). Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$. Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \le 3RG\sqrt{T}$

34 Proof
$y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$, $y_1 = 0$. Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$. Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly regular (projected) gradient descent on the $h_t$. From Zinkevich's Theorem:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T} \quad (1)$

35 Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] = \nabla c_t(y_t)$.
Notice that $\mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t] = \mathbb{E}[g_t - \nabla c_t(y_t) \mid y_1, c_1, \ldots, y_t, c_t] = \mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] - \nabla c_t(y_t) = 0$
$\mathbb{E}[y_t^T \xi_t] = \mathbb{E}\left[\mathbb{E}[y_t^T \xi_t \mid y_1, c_1, \ldots, y_t, c_t]\right] = \mathbb{E}\left[y_t^T\, \mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t]\right] = 0$
We get the following connection between $\mathbb{E}[h_t(y_t)]$ and $\mathbb{E}[c_t(y_t)]$:
$\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)] + \mathbb{E}[y_t^T \xi_t] = \mathbb{E}[c_t(y_t)] \quad (3)$

36 Proof Cont.
Recall $\xi_t = g_t - \nabla c_t(y_t)$, so $\|\xi_t\| \le \|g_t\| + \|\nabla c_t(y_t)\| \le 2G$.
For every $1 \le s < t \le T$ we have $\mathbb{E}[\xi_s^T \xi_t] = \mathbb{E}\left[\mathbb{E}[\xi_s^T \xi_t \mid y_1, c_1, \ldots, y_t, c_t]\right]$. Given $y_1, c_1, \ldots, y_t, c_t$ we know $g_s$ (so also $\xi_s$), and therefore $\mathbb{E}[\xi_s^T \xi_t] = \mathbb{E}\left[\xi_s^T\, \mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t]\right] = 0$.
We use this to get
$\left(\mathbb{E}\left\|\sum_{t=1}^{T} \xi_t\right\|\right)^2 \le \mathbb{E}\left\|\sum_{t=1}^{T} \xi_t\right\|^2 = \sum_{t=1}^{T}\mathbb{E}\|\xi_t\|^2 + 2\sum_{1 \le s < t \le T}\mathbb{E}[\xi_s^T \xi_t] = \sum_{t=1}^{T}\mathbb{E}\|\xi_t\|^2 \le \sum_{t=1}^{T}\mathbb{E}[(2G)^2] = 4TG^2$
so $\mathbb{E}\left\|\sum_{t=1}^{T} \xi_t\right\| \le 2G\sqrt{T}$.

37 Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\mathbb{E}\left\|\sum_{t=1}^{T} \xi_t\right\| \le 2G\sqrt{T}$ and $S \subseteq R\mathbb{B}$.
Now we connect $\mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right]$ and $\mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right]$:
$\mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \le \mathbb{E}\left|\sum_{t=1}^{T}\left(h_t(y) - c_t(y)\right)\right| = \mathbb{E}\left|y^T \sum_{t=1}^{T} \xi_t\right| \le \mathbb{E}\left[\|y\|\left\|\sum_{t=1}^{T} \xi_t\right\|\right] \le R\, \mathbb{E}\left\|\sum_{t=1}^{T} \xi_t\right\| \le 2RG\sqrt{T} \quad (2)$

38 Proof Cont.
Recall (1) $\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$, (2) $\mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \le 2RG\sqrt{T}$, and (3) $\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)]$. Then
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] = \sum_{t=1}^{T}\mathbb{E}[c_t(y_t)] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \overset{(3)}{=} \sum_{t=1}^{T}\mathbb{E}[h_t(y_t)] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \overset{(2)}{\le} \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] + 2RG\sqrt{T} = \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y)\right] + 2RG\sqrt{T} \overset{(1)}{\le} 3RG\sqrt{T}$

