Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient
Aviv Rosenberg, 10/01/18, Seminar on Experts and Bandits
Online Convex Optimization Problem
Convex set $S$.
In every iteration we choose $x_t \in S$ and then observe a convex cost function $c_t : S \to [-C, C]$ for some $C > 0$.
We want to minimize the regret
$\sum_{t=1}^{T} c_t(x_t) - \min_{x \in S} \sum_{t=1}^{T} c_t(x)$
Bandit Setting
The gradient descent approach: $x_{t+1} = x_t - \eta \nabla c_t(x_t)$.
Last week we saw an $O(\sqrt{T})$ regret bound.
But now, instead of $c_t$, we only observe the value $c_t(x_t)$, so we cannot compute $\nabla c_t(x_t)$!
We still want to use gradient descent.
Solution: estimate the gradient using a single point.
We will show an $O(T^{3/4})$ regret bound.
Notation and Assumptions
$\mathbb{B} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$; $\mathbb{S} = \{x \in \mathbb{R}^d : \|x\| = 1\}$
Expected regret: $\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{T} c_t(x)$
Projection of a point $x$ onto a convex set $S$: $P_S(x) = \arg\min_{z \in S} \|x - z\|$
Assume $S$ is a convex set such that $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$.
$(1-\alpha)S = \{(1-\alpha)x : x \in S\} \subseteq S$ is also convex, and $0 \in (1-\alpha)S \subseteq R\mathbb{B}$:
if $y \in (1-\alpha)S$ then $y = (1-\alpha)x = \alpha \cdot 0 + (1-\alpha)x \in S$, since $0 \in S$ and $S$ is convex.
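As an illustration of the projection operator (not part of the original slides), here is a minimal NumPy sketch for the simple special case where the feasible set is a centered Euclidean ball, for which $P_S$ has a closed form; projecting onto a general convex $S$ would require a numerical solver instead.

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection of x onto the ball {z : ||z|| <= radius}.

    Closed-form special case of P_S(x) = argmin_{z in S} ||x - z||;
    a general convex set S needs a convex-optimization solver.
    """
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x
    return (radius / norm) * x
```

For example, projecting onto $(1-\alpha)S$ when $S = R\mathbb{B}$ is just `project_onto_ball(x, (1 - alpha) * R)`.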
Part 1 Gradient Estimation
Gradient Estimation
For a function $c_t$ and $\delta > 0$ define the smoothed function
$\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y + \delta v)\right]$
Lemma: $\nabla \hat{c}_t(y) = \frac{d}{\delta} \mathbb{E}_{u \in \mathbb{S}}\left[c_t(y + \delta u)\, u\right]$
To get an unbiased estimator of $\nabla \hat{c}_t(y)$ we can sample a unit vector $u$ uniformly and compute $\frac{d}{\delta} c_t(y + \delta u)\, u$.
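A minimal NumPy sketch of this one-point estimator (an illustration added here, assuming a bandit oracle `cost_at` that returns the single observed value $c_t(y + \delta u)$):

```python
import numpy as np

def random_unit_vector(d, rng):
    """Sample u uniformly from the unit sphere by normalizing a Gaussian."""
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def one_point_gradient_estimate(cost_at, y, delta, rng):
    """Return g = (d/delta) * c(y + delta*u) * u, an unbiased estimate of the
    gradient of the smoothed function c_hat(y) = E_v[c(y + delta*v)]."""
    d = len(y)
    u = random_unit_vector(d, rng)
    return (d / delta) * cost_at(y + delta * u) * u
```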
Proof
Recall $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y + \delta v)\right]$.
For $d = 1$:
$\hat{c}_t(y) = \mathbb{E}_{v \in [-1,1]}\left[c_t(y + \delta v)\right] = \mathbb{E}_{v \in [-\delta,\delta]}\left[c_t(y + v)\right] = \frac{1}{2\delta} \int_{-\delta}^{\delta} c_t(y + v)\, dv$
Differentiate using the fundamental theorem of calculus:
$\nabla \hat{c}_t(y) = \hat{c}_t'(y) = \frac{c_t(y + \delta) - c_t(y - \delta)}{2\delta} = \frac{1}{\delta} \mathbb{E}_{u \in \{-1,1\}}\left[c_t(y + \delta u)\, u\right]$
Proof Cont.
For $d > 1$, Stokes' theorem gives
$\nabla \int_{\delta\mathbb{B}} c_t(y + v)\, dv = \int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du$
Dividing each side by the corresponding volume:
$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla\, \frac{\int_{\delta\mathbb{B}} c_t(y + v)\, dv}{\mathrm{Vol}_d(\delta\mathbb{B})} = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \frac{\int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du}{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}$
$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla\, \mathbb{E}_{v \in \delta\mathbb{B}}\left[c_t(y + v)\right] = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \delta\mathbb{S}}\left[c_t(y + u)\, \frac{u}{\|u\|}\right]$
$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla\, \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y + \delta v)\right] = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \mathbb{S}}\left[c_t(y + \delta u)\, u\right]$
Proof Cont.
Since $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y + \delta v)\right]$, the last equation reads
$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla \hat{c}_t(y) = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \mathbb{S}}\left[c_t(y + \delta u)\, u\right]$
$\nabla \hat{c}_t(y) = \frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})}\, \mathbb{E}_{u \in \mathbb{S}}\left[c_t(y + \delta u)\, u\right]$
The following fact concludes the proof:
$\frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})} = \frac{d}{\delta}$
For example, in $\mathbb{R}^2$: $\frac{2\pi\delta}{\pi\delta^2} = \frac{2}{\delta}$.
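For completeness, one way to verify this surface-to-volume fact in general dimension (a short derivation added here): $\mathrm{Vol}_d(\delta\mathbb{B}) = \delta^d\, \mathrm{Vol}_d(\mathbb{B})$, and the surface area is its derivative with respect to $\delta$, so
$\mathrm{Vol}_{d-1}(\delta\mathbb{S}) = \frac{\mathrm{d}}{\mathrm{d}\delta}\, \mathrm{Vol}_d(\delta\mathbb{B}) = d\, \delta^{d-1}\, \mathrm{Vol}_d(\mathbb{B}) \quad\Longrightarrow\quad \frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})} = \frac{d\, \delta^{d-1}\, \mathrm{Vol}_d(\mathbb{B})}{\delta^{d}\, \mathrm{Vol}_d(\mathbb{B})} = \frac{d}{\delta}$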
Part 2 Regret Bound for Estimated Gradients
Zinkevich's Theorem
Let $h_1, \dots, h_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions.
Let $y_1, \dots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta \nabla h_t(y_t)\right)$.
Let $G = \max_t \|\nabla h_t(y_t)\|$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$
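A minimal sketch of the update rule in Zinkevich's theorem (illustration only; `grads[t]` is assumed to return $\nabla h_t$ at a point, and `project` to implement $P_{(1-\alpha)S}$):

```python
import numpy as np

def online_gradient_descent(grads, project, d, eta, T):
    """Projected online gradient descent (Zinkevich's algorithm).

    grads[t](y) returns the gradient of h_t at y (full information);
    project(y) returns the Euclidean projection onto (1 - alpha) * S.
    """
    y = np.zeros(d)
    plays = []
    for t in range(T):
        plays.append(y)
        g = grads[t](y)            # gradient of h_t at y_t
        y = project(y - eta * g)   # y_{t+1} = P_{(1-alpha)S}(y_t - eta * grad)
    return plays
```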
Expected Zinkevich's Theorem
Let $c_1, \dots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions.
Let $g_1, \dots, g_T$ be random vectors such that $\mathbb{E}\left[g_t \mid y_t\right] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$).
Let $y_1, \dots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \sum_{t=1}^{T} c_t(y) \le RG\sqrt{T}$
Proof
Recall $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$.
Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly regular gradient descent on the functions $h_t$.
From Zinkevich's Theorem:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T} \quad (1)$
Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}\left[g_t \mid y_t\right] = \nabla c_t(y_t)$.
Notice that $\mathbb{E}\left[\xi_t \mid y_t\right] = \mathbb{E}\left[g_t - \nabla c_t(y_t) \mid y_t\right] = \mathbb{E}\left[g_t \mid y_t\right] - \nabla c_t(y_t) = 0$.
Therefore
$\mathbb{E}\left[y_t^T \xi_t\right] = \mathbb{E}\left[\mathbb{E}\left[y_t^T \xi_t \mid y_t\right]\right] = \mathbb{E}\left[y_t^T\, \mathbb{E}\left[\xi_t \mid y_t\right]\right] = 0$
$\mathbb{E}\left[y^T \xi_t\right] = y^T\, \mathbb{E}\left[\xi_t\right] = y^T\, \mathbb{E}\left[\mathbb{E}\left[\xi_t \mid y_t\right]\right] = 0$
We get the following connections:
$\mathbb{E}\left[h_t(y)\right] = \mathbb{E}\left[c_t(y)\right] + \mathbb{E}\left[y^T \xi_t\right] = c_t(y) \quad (2)$
$\mathbb{E}\left[h_t(y_t)\right] = \mathbb{E}\left[c_t(y_t)\right] + \mathbb{E}\left[y_t^T \xi_t\right] = \mathbb{E}\left[c_t(y_t)\right] \quad (3)$
Proof Cont.
Recall
(1) $\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$
(2) $\mathbb{E}\left[h_t(y)\right] = c_t(y)$
(3) $\mathbb{E}\left[h_t(y_t)\right] = \mathbb{E}\left[c_t(y_t)\right]$
Therefore
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t)\right] - \sum_{t=1}^{T} c_t(y) \stackrel{(3)}{=} \sum_{t=1}^{T} \mathbb{E}\left[h_t(y_t)\right] - \sum_{t=1}^{T} c_t(y) \stackrel{(2)}{=} \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t)\right] - \sum_{t=1}^{T} \mathbb{E}\left[h_t(y)\right] = \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y)\right] \stackrel{(1)}{\le} RG\sqrt{T}$
Part 3 BGD Algorithm
Ideal World Algorithm
$y_1 \leftarrow 0$
For $t \in \{1, \dots, T\}$:
  Select a unit vector $u_t$ uniformly at random
  Play $y_t$ and observe the cost $c_t(y_t)$
  Compute $g_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$ using $u_t$ (so that $\mathbb{E}\left[g_t \mid y_t\right] = \nabla \hat{c}_t(y_t)$)
  $y_{t+1} \leftarrow P_S\left(y_t - \eta g_t\right)$
But to compute $g_t$ we need $c_t(y_t + \delta u_t)$, so we need to play $x_t = y_t + \delta u_t$ instead.
Now we have two problems:
Is $x_t \in S$?
The regret is measured on $c_t(x_t)$, although we are running estimated gradient descent on $\hat{c}_t$ at the points $y_t$.
Bandit Gradient Descent Algorithm (BGD)
Parameters: $\eta > 0$, $\delta > 0$, $0 < \alpha < 1$
$y_1 \leftarrow 0$
For $t \in \{1, \dots, T\}$:
  Select a unit vector $u_t$ uniformly at random
  $x_t \leftarrow y_t + \delta u_t$
  Play $x_t$ and observe the cost $c_t(x_t) = c_t(y_t + \delta u_t)$
  $g_t \leftarrow \frac{d}{\delta} c_t(x_t)\, u_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$
  $y_{t+1} \leftarrow P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$
Is $x_t \in S$?
We have low regret for $\hat{c}_t(y_t)$ in $(1-\alpha)S$; we need to convert it into low regret for $c_t(x_t)$ in $S$.
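A minimal NumPy sketch of BGD (an illustration of the pseudocode above, not a reference implementation): `cost_oracles[t]` stands for the bandit oracle returning only the scalar $c_t(x_t)$, and `project_shrunk` is assumed to implement the projection onto $(1-\alpha)S$.

```python
import numpy as np

def bgd(cost_oracles, project_shrunk, d, eta, delta, T, seed=0):
    """Bandit Gradient Descent (BGD).

    cost_oracles[t](x) returns the scalar bandit feedback c_t(x);
    project_shrunk(y) projects y onto the shrunk set (1 - alpha) * S.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(d)
    played, costs = [], []
    for t in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # uniform random unit vector u_t
        x = y + delta * u                    # point actually played
        cost = cost_oracles[t](x)            # only c_t(x_t) is observed
        played.append(x)
        costs.append(cost)
        g = (d / delta) * cost * u           # one-point gradient estimate
        y = project_shrunk(y - eta * g)      # y_{t+1} = P_{(1-alpha)S}(y_t - eta * g_t)
    return played, costs
```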
Observation 1
For any $x \in S$:
$\sum_{t=1}^{T} c_t\left((1-\alpha)x\right) \le \sum_{t=1}^{T} c_t(x) + 2\alpha C T$
Proof. From convexity (and $|c_t| \le C$),
$c_t\left((1-\alpha)x\right) = c_t\left(\alpha \cdot 0 + (1-\alpha)x\right) \le \alpha c_t(0) + (1-\alpha) c_t(x) = c_t(x) + \alpha\left(c_t(0) - c_t(x)\right) \le c_t(x) + 2\alpha C$
Observation 2
For any $y \in (1-\alpha)S$ and any $x \in S$:
$c_t(x) - c_t(y) \le \frac{2C}{\alpha r}\, \|y - x\|$
Proof. Denote $\Delta = x - y$.
If $\|\Delta\| \ge \alpha r$ the observation follows from $c_t(x) - c_t(y) \le 2C \le \frac{2C}{\alpha r}\, \|y - x\|$.
Otherwise $\|\Delta\| < \alpha r$. Let $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$; then $z \in S$, because $y \in (1-\alpha)S$ and
$\alpha r \frac{\Delta}{\|\Delta\|} \in \alpha r \mathbb{B} \subseteq \alpha S \quad\Longrightarrow\quad z \in (1-\alpha)S + \alpha S \subseteq S$
Proof Cont.
Recall $\Delta = x - y$ and $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$.
Notice that $x = \frac{\|\Delta\|}{\alpha r}\, z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y$.
So from convexity,
$c_t(x) = c_t\left(\frac{\|\Delta\|}{\alpha r}\, z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y\right) \le \frac{\|\Delta\|}{\alpha r}\, c_t(z) + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) c_t(y) = c_t(y) + \frac{c_t(z) - c_t(y)}{\alpha r}\, \|\Delta\| \le c_t(y) + \frac{2C}{\alpha r}\, \|\Delta\|$
The other direction ($c_t(y) - c_t(x) \le \frac{2C}{\alpha r}\|y - x\|$) is proved in the same way.
BGD Regret Theorem
For any $T \ge \left(\frac{3Rd}{2r}\right)^2$ and for the following parameters
$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = \sqrt[3]{\frac{rR^2 d^2}{12T}}, \qquad \alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}$
BGD achieves, for every $x \in S$, the regret bound
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 3C\, T^{5/6}\, \sqrt[3]{\frac{dR}{r}}$
Proof
Recall $x_t = y_t + \delta u_t$, $g_t = \frac{d}{\delta} c_t(x_t)\, u_t$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
First we need to show that $x_t \in S$.
Notice that $(1-\alpha)S + \alpha r\mathbb{B} \subseteq (1-\alpha)S + \alpha S \subseteq S$.
Since $y_t \in (1-\alpha)S$, we just need to show that $\delta \le \alpha r$, where
$\delta = \sqrt[3]{\frac{rR^2 d^2}{12T}}, \qquad \alpha r = r\sqrt[3]{\frac{3Rd}{2r\sqrt{T}}} = \sqrt[3]{\frac{3Rr^2 d}{2\sqrt{T}}}$
This is true because $T \ge \left(\frac{3Rd}{2r}\right)^2$.
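Filling in the algebra for this last step (added for completeness): cubing both sides,
$\delta \le \alpha r \iff \frac{rR^2 d^2}{12T} \le \frac{3Rr^2 d}{2\sqrt{T}} \iff \frac{Rd}{18r} \le \sqrt{T}$
and the last inequality indeed holds whenever $T \ge \left(\frac{3Rd}{2r}\right)^2 \ge \left(\frac{Rd}{18r}\right)^2$.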
Proof Cont.
Recall $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y_t + \delta v)\right]$, $\nabla \hat{c}_t(y_t) = \frac{d}{\delta} \mathbb{E}_{u \in \mathbb{S}}\left[c_t(y_t + \delta u)\, u\right]$, $g_t = \frac{d}{\delta} c_t(x_t)\, u_t$, $x_t = y_t + \delta u_t$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
Now we want to bound the regret. We have
$\mathbb{E}\left[g_t \mid y_t\right] = \nabla \hat{c}_t(y_t), \qquad \|g_t\| = \left\|\frac{d}{\delta} c_t(x_t)\, u_t\right\| \le \frac{dC}{\delta} =: G$
The Expected Zinkevich Theorem, applied to the smoothed functions $\hat{c}_t$ with $\eta = \frac{R}{G\sqrt{T}} = \frac{\delta R}{dC\sqrt{T}}$, says that for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) \le RG\sqrt{T} = \frac{RdC\sqrt{T}}{\delta} \quad (1)$
Proof Cont.
Recall Observation 2: $|c_t(y) - c_t(x)| \le \frac{2C}{\alpha r}\, \|y - x\|$ for $y \in (1-\alpha)S$, $x \in S$, and $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}\left[c_t(y_t + \delta v)\right]$.
From Observation 2 we get
$\left|\hat{c}_t(y_t) - c_t(x_t)\right| \le \left|\hat{c}_t(y_t) - c_t(y_t)\right| + \left|c_t(y_t) - c_t(x_t)\right| \le 2\, \frac{2C}{\alpha r}\, \delta$
and similarly, for $y \in (1-\alpha)S$, $\left|\hat{c}_t(y) - c_t(y)\right| \le \frac{2C}{\alpha r}\, \delta$.
Now we get, for $y \in (1-\alpha)S$,
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) \le \mathbb{E}\left[\sum_{t=1}^{T} \left(\hat{c}_t(y_t) + 2\, \frac{2C}{\alpha r}\, \delta\right)\right] - \sum_{t=1}^{T} \left(\hat{c}_t(y) - \frac{2C}{\alpha r}\, \delta\right) = \dots$
Proof Cont.
Recall (1): $\mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) \le \frac{RdC\sqrt{T}}{\delta}$.
Continuing from the previous slide,
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) \le \mathbb{E}\left[\sum_{t=1}^{T} \left(\hat{c}_t(y_t) + 2\, \frac{2C}{\alpha r}\, \delta\right)\right] - \sum_{t=1}^{T} \left(\hat{c}_t(y) - \frac{2C}{\alpha r}\, \delta\right) = \mathbb{E}\left[\sum_{t=1}^{T} \hat{c}_t(y_t)\right] - \sum_{t=1}^{T} \hat{c}_t(y) + 3T\, \frac{2C}{\alpha r}\, \delta \le \frac{RdC\sqrt{T}}{\delta} + 3T\, \frac{2C}{\alpha r}\, \delta \quad (2)$
Proof Cont.
Recall (2): $\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) \le \frac{RdC\sqrt{T}}{\delta} + 3T\, \frac{2C}{\alpha r}\, \delta$ for $y \in (1-\alpha)S$, and Observation 1: $\sum_{t=1}^{T} c_t\left((1-\alpha)x\right) \le \sum_{t=1}^{T} c_t(x) + 2\alpha C T$.
Given $x \in S$, take $y = (1-\alpha)x \in (1-\alpha)S$, so we can use Observation 1:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t\left((1-\alpha)x\right) + 2\alpha C T = \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(y) + 2\alpha C T \le \frac{RdC\sqrt{T}}{\delta} + \frac{6C}{\alpha r}\, T\delta + 2\alpha C T$
Substituting the parameters $\delta = \sqrt[3]{\frac{rR^2 d^2}{12T}}$ and $\alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}$ finishes the proof.
BGD with Lipschitz Regret Theorem
If all $c_t$ are $L$-Lipschitz, then for $T$ sufficiently large and the parameters
$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = T^{-1/4}\sqrt{\frac{RdCr}{3(Lr + C)}}, \qquad \alpha = \frac{\delta}{r}$
BGD achieves the regret bound
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 2\, T^{3/4} \sqrt{3RdC\left(L + \frac{C}{r}\right)}$
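A sketch of where these parameter choices come from (added for completeness, following the same decomposition as in the previous proof, so treat the constants as indicative): under the Lipschitz assumption the smoothing and sampling errors are bounded by $L\delta$ instead of $\frac{2C}{\alpha r}\delta$, so with $\alpha = \frac{\delta}{r}$,
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le \frac{RdC\sqrt{T}}{\delta} + 3LT\delta + 2\alpha C T \le \frac{RdC\sqrt{T}}{\delta} + 3T\left(L + \frac{C}{r}\right)\delta$
Choosing $\delta = T^{-1/4}\sqrt{\frac{RdCr}{3(Lr + C)}}$ equalizes the two terms at $T^{3/4}\sqrt{3RdC\left(L + \frac{C}{r}\right)}$ each, which gives the stated bound.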
Part 4 Reshaping
Removing the Dependence on $1/r$
There are algorithms that, given a convex set $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$, find an affine transformation $T$ that puts $S$ in near-isotropic position, and run in time $O\left(d^4\, \mathrm{polylog}\left(d, \frac{R}{r}\right)\right)$.
$T(S) \subseteq \mathbb{R}^d$ is in isotropic position if the covariance matrix of a uniform random sample from $T(S)$ is the identity matrix.
This gives us $\mathbb{B} \subseteq T(S) \subseteq d\mathbb{B}$, so the new parameters are $R = d$ and $r = 1$.
Also, if $c_t$ is $L$-Lipschitz then $c_t \circ T^{-1}$ is $LR$-Lipschitz.
Removing the Dependence on $1/r$
So if we first put $S$ in near-isotropic position, we get the regret bound
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 6\, T^{3/4}\, d\left(\sqrt{CLR} + C\right)$
And without the Lipschitz condition:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] - \sum_{t=1}^{T} c_t(x) \le 6\, T^{5/6}\, dC$
Part 5 Adaptive Adversary
Expected Adaptive Zinkevich's Theorem
Let $c_1, \dots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions ($c_t$ may depend on $y_1, \dots, y_{t-1}$).
Let $g_1, \dots, g_T$ be random vectors such that $\mathbb{E}\left[g_t \mid y_1, c_1, \dots, y_t, c_t\right] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$).
Let $y_1, \dots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t) - \sum_{t=1}^{T} c_t(y)\right] \le 3RG\sqrt{T}$
Proof
Recall $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}\left(y_t - \eta g_t\right)$.
Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$.
Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly regular gradient descent on the functions $h_t$.
From Zinkevich's Theorem:
$\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T} \quad (1)$
Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}\left[g_t \mid y_1, c_1, \dots, y_t, c_t\right] = \nabla c_t(y_t)$.
Notice that
$\mathbb{E}\left[\xi_t \mid y_1, c_1, \dots, y_t, c_t\right] = \mathbb{E}\left[g_t - \nabla c_t(y_t) \mid y_1, c_1, \dots, y_t, c_t\right] = \mathbb{E}\left[g_t \mid y_1, c_1, \dots, y_t, c_t\right] - \nabla c_t(y_t) = 0$
$\mathbb{E}\left[y_t^T \xi_t\right] = \mathbb{E}\left[\mathbb{E}\left[y_t^T \xi_t \mid y_1, c_1, \dots, y_t, c_t\right]\right] = \mathbb{E}\left[y_t^T\, \mathbb{E}\left[\xi_t \mid y_1, c_1, \dots, y_t, c_t\right]\right] = 0$
We get the following connection between $\mathbb{E}\left[h_t(y_t)\right]$ and $\mathbb{E}\left[c_t(y_t)\right]$:
$\mathbb{E}\left[h_t(y_t)\right] = \mathbb{E}\left[c_t(y_t)\right] + \mathbb{E}\left[y_t^T \xi_t\right] = \mathbb{E}\left[c_t(y_t)\right] \quad (3)$
Proof Cont.
Recall $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}\left[g_t \mid y_1, c_1, \dots, y_t, c_t\right] = \nabla c_t(y_t)$; note that $\|\xi_t\| \le \|g_t\| + \|\nabla c_t(y_t)\| \le 2G$.
For every $1 \le s < t \le T$ we have $\mathbb{E}\left[\xi_s^T \xi_t\right] = \mathbb{E}\left[\mathbb{E}\left[\xi_s^T \xi_t \mid y_1, c_1, \dots, y_t, c_t\right]\right]$. Given $y_1, c_1, \dots, y_t, c_t$ we know $g_s$ (so also $\xi_s$), and therefore
$\mathbb{E}\left[\xi_s^T \xi_t\right] = \mathbb{E}\left[\xi_s^T\, \mathbb{E}\left[\xi_t \mid y_1, c_1, \dots, y_t, c_t\right]\right] = 0$
We use this to get
$\left(\mathbb{E}\left[\left\|\sum_{t=1}^{T} \xi_t\right\|\right]\right)^2 \le \mathbb{E}\left[\left\|\sum_{t=1}^{T} \xi_t\right\|^2\right] = \sum_{t=1}^{T} \mathbb{E}\left[\|\xi_t\|^2\right] + 2\sum_{1 \le s < t \le T} \mathbb{E}\left[\xi_s^T \xi_t\right] = \sum_{t=1}^{T} \mathbb{E}\left[\|\xi_t\|^2\right] \le \sum_{t=1}^{T} 4G^2 = 4TG^2$
Proof Cont.
Now we connect $\mathbb{E}\left[h_t(y)\right]$ and $\mathbb{E}\left[c_t(y)\right]$. Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\mathbb{E}\left[\left\|\sum_{t=1}^{T} \xi_t\right\|\right] \le 2G\sqrt{T}$ and $S \subseteq R\mathbb{B}$.
$\mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \le \mathbb{E}\left[\left|\sum_{t=1}^{T} \left(h_t(y) - c_t(y)\right)\right|\right] = \mathbb{E}\left[\left|y^T \sum_{t=1}^{T} \xi_t\right|\right] \le \mathbb{E}\left[\|y\| \left\|\sum_{t=1}^{T} \xi_t\right\|\right] \le R\, \mathbb{E}\left[\left\|\sum_{t=1}^{T} \xi_t\right\|\right] \le 2RG\sqrt{T} \quad (2)$
Proof Cont.
Recall
(1) $\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y) \le RG\sqrt{T}$
(2) $\mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \le 2RG\sqrt{T}$
(3) $\mathbb{E}\left[h_t(y_t)\right] = \mathbb{E}\left[c_t(y_t)\right]$
Therefore
$\mathbb{E}\left[\sum_{t=1}^{T} c_t(y_t) - \sum_{t=1}^{T} c_t(y)\right] = \sum_{t=1}^{T} \mathbb{E}\left[c_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \stackrel{(3)}{=} \sum_{t=1}^{T} \mathbb{E}\left[h_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} c_t(y)\right] \stackrel{(2)}{\le} \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} h_t(y)\right] + 2RG\sqrt{T} = \mathbb{E}\left[\sum_{t=1}^{T} h_t(y_t) - \sum_{t=1}^{T} h_t(y)\right] + 2RG\sqrt{T} \stackrel{(1)}{\le} 3RG\sqrt{T}$