
1 The Nonstochastic Multiarmed Bandit Problem
Seminar on Experts and Bandits (Fall). Barak Itkin

2 Problem setup
$K$ slot machines ("arms"), no experts
Rewards bounded in $[0,1]$ (can also be generalized to other bounds)
Partial information: you only learn about the arm you pulled
"No assumptions at all" on the slot machines: not even a fixed distribution; in fact, rewards can be adversarial

3 Motivation
Think of packet routing: multiple routes exist to reach the destination
Rewards are the round-trip times, which change depending on load in different parts of the network
Partial information: you only learn about the route you actually sent the packet on
"No assumptions at all" on load behavior: load can change dramatically over time

4 Back to the problem
The rewards of the arms are determined in advance, i.e. before the first arm is pulled
Adversarial = the assignment can be picked after the strategy is already known, but still before the game begins
We want to minimize the regret: "how much better could we have done?"
We'll start with the "weak" regret: comparing against always pulling a single arm (the "best" arm)

5 Our Goal
We would like to show an algorithm with the following bound on the "weak" regret: $O(\sqrt{K\,G_{\max}\ln K})$
$G_{\max}$ is the return of the best arm
This should work for any setup of arms, including random/adversarial setups

6 Difference from last week?
In the second half we saw: $K$ experts, each loses a certain fraction of what it gets
We can split our investment between multiple experts
We always learn about all the experts
Regret is compared to the single best expert ("weak" regret)
Where's the difference?
Solo investment: we choose only one arm per trial
Partial information: we learn only about the chosen arm
Bounds: last week $\ln K + \sqrt{2T\ln K}$; ours $O(\sqrt{K\,G_{\max}\ln K})$

7 Notations
$t$: the time step, also known as "trial"; $t \in \{1,2,\dots,T\}$
$x_i(t)$: the reward of arm $i$ at trial $t$; $x_i(t) \in [0,1]$
$A$: an algorithm, choosing the arm $i_t$ at trial $t$
The input at time $t$ is the arms chosen so far and their rewards; formally, an element of $(\{1,2,\dots,K\} \times [0,1])^{t-1}$

8 Notations++
$G_A(T)$: the return of algorithm $A$ at time horizon $T$: $G_A(T) = \sum_{t=1}^{T} x_{i_t}(t)$
Will be abbreviated as $G_A$ where $T$ is obvious
$G_{\max}(T)$: the best single-arm return at time horizon $T$: $G_{\max}(T) = \max_j \sum_{t=1}^{T} x_j(t)$
Will also be abbreviated as $G_{\max}$

9 Recalling our baseline
The naïve approach

10 Simplistic approach
Explore: check each arm $B$ times
Exploit: continue pulling only the (empirically) best arm
Profit? Only with a "fixed distribution" (no significant changes over time); may fail with arbitrary/adversarial rewards
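A minimal Python sketch of this baseline (the reward callback pull and the budgets B and T are illustrative assumptions, not from the slides):

```python
def explore_then_exploit(pull, K, B, T):
    """Naive baseline: pull every arm B times, then commit to the empirical best.
    pull(i) is an assumed callback returning the reward of arm i in [0, 1]."""
    totals = [0.0] * K
    for i in range(K):              # exploration phase: B pulls per arm
        for _ in range(B):
            totals[i] += pull(i)
    best = max(range(K), key=lambda i: totals[i])
    reward = sum(totals)
    for _ in range(T - K * B):      # exploitation phase: commit to the winner
        reward += pull(best)
    return reward
```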

11 Multiplicative Weights approach
Maintain a weight $w_i(t)$ for arm $i$ at trial $t$, starting with uniform weights
Use the weights to define a distribution $p_i(t)$, typically $p_i(t) = w_i(t) / \sum_{j=1}^{K} w_j(t)$
Pick an arm by sampling from the distribution
Update the weights based on the rewards of each action: when all rewards are known, multiply by a function $u$ of the reward, $w_i(t+1) \leftarrow w_i(t)\cdot u(x_i(t))$
We'll discuss the partial-information setup today
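A minimal Python sketch of one such full-information round, assuming the common exponential update $u(x) = e^{\eta x}$ with a learning rate $\eta$ (the slide leaves $u$ abstract):

```python
import math
import random

def multiplicative_weights_step(weights, rewards, eta=0.1):
    """One full-information multiplicative-weights round.
    All rewards are observed, so every weight is multiplied by u(x) = exp(eta * x)."""
    total = sum(weights)
    probs = [w / total for w in weights]          # p_i(t) = w_i(t) / sum_j w_j(t)
    chosen = random.choices(range(len(weights)), weights=probs)[0]
    new_weights = [w * math.exp(eta * x) for w, x in zip(weights, rewards)]
    return chosen, new_weights
```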

12 First attempt: The Exp3 algorithm

13 Exp3 – Exploration
Start with uniform weights (no surprise here)
Always encourage the algorithm to explore more arms: add an "exploration factor" $\gamma$ to the probabilities, so each arm always has probability at least $\gamma/K$
The amount of exploration is controlled by $\gamma \in (0,1]$, which can be fine-tuned later
The exploration factor does not change over time: rewards may be arbitrary or even adversarial, so we must always keep exploring

14 Estimation rationale
Question (story time): you randomly visit a shop on 20% of the days, and only on those days do you know its profit; still, you need to give a profit estimate for every day. What do you do?
Answer: on days you visit, estimate $5\,\cdot$ (that day's profit); on other days, estimate 0
More generally: estimate (observation / chance to observe) if seen, and 0 otherwise
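A quick check that the $5\times$ rule is unbiased: the factor 5 is exactly 1 / (chance to observe), so averaging over the visiting randomness,
$\mathbb{E}[\text{estimate}] = 0.2 \cdot \frac{\text{profit}}{0.2} + 0.8 \cdot 0 = \text{profit}$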

15 Exp3 – Weight updates
To update the weights, use "estimated rewards" instead of the actual rewards: $\hat{x}_{i_t}(t) = x_{i_t}(t) / p_{i_t}(t)$
For actions that weren't chosen, set $\hat{x}_i(t) = 0$
This gives an unbiased estimator: $\mathbb{E}[\hat{x}_i(t) \mid i_1, i_2, \dots, i_{t-1}] = p_i(t)\cdot x_i(t)/p_i(t) + (1-p_i(t))\cdot 0 = x_i(t)$
This equality holds for all actions, not just the chosen one
This is what makes the weight updates work under partial information

16 Exp3
Initialize: set $w_i(1) \leftarrow 1$ for all actions $i$
For each $t$:
Sample $i_t$ from $p_i(t) := (1-\gamma)\cdot\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$
Receive the reward $x_{i_t}(t)$
Update the weights with $w_i(t+1) \leftarrow w_i(t)\cdot\exp\!\left(\frac{\gamma\,\hat{x}_i(t)}{K}\right)$
Since $\hat{x}_i(t) = 0$ for all but the selected action, in practice only $w_{i_t}(t)$ is updated
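A minimal Python sketch of the loop above (the reward callback pull and the horizon T are illustrative assumptions, not part of the slides):

```python
import math
import random

def exp3(pull, K, T, gamma):
    """Sketch of Exp3; pull(i, t) is an assumed callback giving arm i's reward in [0, 1]."""
    weights = [1.0] * K                                 # w_i(1) = 1 for all arms
    total_reward = 0.0
    for t in range(1, T + 1):
        wsum = sum(weights)
        # Mix the weight distribution with uniform exploration of total mass gamma.
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]  # sample i_t from p(t)
        x = pull(i, t)                                  # only this arm's reward is observed
        total_reward += x
        xhat = x / probs[i]                             # estimated reward; 0 for unchosen arms
        weights[i] *= math.exp(gamma * xhat / K)        # so only the chosen weight changes
    return total_reward
```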

17 Bounds
Since the algorithm is randomized, we care about the expected "weak" regret: $\mathbb{E}[G_{\max} - G_{\text{Exp3}}] = G_{\max} - \mathbb{E}[G_{\text{Exp3}}]$
Theorem: $G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\,G_{\max}\cdot\gamma + \frac{K\ln K}{\gamma}$
No, this is not what we wanted ($\sqrt{K\,G_{\max}\ln K}$); we'll fix that later

18 Bound intuition
$G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\,G_{\max}\cdot\gamma + \frac{K\ln K}{\gamma}$
$\gamma$ is the exploration factor
As $\gamma \to 1$ we stop exploiting our weights and mostly look around "randomly", so the first term grows to the order of $G_{\max}$ itself
As $\gamma \to 0$ we exploit our weights too much and rarely check for changes in the other arms; the second term blows up, and it gets worse the more arms ($K$) we have
We must find the "sweet spot"

19 The sweet spot
$G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\,G_{\max}\cdot\gamma + \frac{K\ln K}{\gamma}$
For some $g \ge G_{\max}$, consider $\gamma = \min\left\{1, \sqrt{\frac{K\ln K}{(e-1)\,g}}\right\} \in (0,1]$
This yields $\ldots \le 2\sqrt{(e-1)\,g\,K\ln K} \approx 2.63\sqrt{g\,K\ln K}$
Which is essentially the bound we wanted ($O(\sqrt{K\,G_{\max}\ln K})$)
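To see where the constant comes from, plug this $\gamma$ into the bound in the typical case where the square root is at most 1 (so the $\min$ is not clipped), and bound $G_{\max}$ by $g$; both terms then become equal:
$(e-1)\,g\,\gamma + \frac{K\ln K}{\gamma} = (e-1)\,g\sqrt{\frac{K\ln K}{(e-1)\,g}} + K\ln K\sqrt{\frac{(e-1)\,g}{K\ln K}} = 2\sqrt{(e-1)\,g\,K\ln K}$
and $2\sqrt{e-1} \approx 2.63$.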

20 Proof
$G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\,G_{\max}\cdot\gamma + \frac{K\ln K}{\gamma}$
Enough talking, let's start the proof!
We assume $0 < \gamma < 1$; the bound holds trivially for $\gamma = 1$, since $e-1 \ge 1$ and the regret is at most $G_{\max}$
Furthermore, we'll rely on a few facts, as outlined on the next slides

21 Fact #1 Irrelevant to our cause
We just want numbering (starting at 2) to be consistent with the paper 

22 Fact #2
$\hat{x}_i(t) \le K/\gamma$
Proof: $\hat{x}_{i_t}(t) := x_{i_t}(t)/p_{i_t}(t)$; since $x_{i_t}(t) \le 1$, we get $\hat{x}_{i_t}(t) \le 1/p_{i_t}(t)$; since $p_i(t) = \ldots + \gamma/K \ge \gamma/K$, we get $\hat{x}_{i_t}(t) \le K/\gamma$

23 Fact #3
$\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) = x_{i_t}(t)$
Since $\hat{x}_i(t) := x_i(t)/p_i(t)$, each term $p_i(t)\,\hat{x}_i(t)$ reduces to $x_i(t)$
We defined $\hat{x}_i(t) = 0$ for all unobserved actions, so only the term for $i_t$ remains

24 Fact #4
$\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2 \le \sum_{i=1}^{K} \hat{x}_i(t)$
From the previous slide (fact #3), the left-hand side equals $x_{i_t}(t)\cdot\hat{x}_{i_t}(t)$
Since $x_{i_t}(t) \le 1$, this is at most $\hat{x}_{i_t}(t)$
As before, since unobserved actions contribute 0, this equals $\sum_{i=1}^{K} \hat{x}_i(t)$

25 Proof - outline
Denote $W_t := \sum_{i=1}^{K} w_i(t)$
Establish a relation between $W_{t+1}/W_t$ and $x_{i_t}(t)$
Establish a relation between $\ln(W_{t+1}/W_t)$ and $x_{i_t}(t)$, obtained with $1+x \le e^x$
Establish a relation between $\ln(W_{T+1}/W_1)$ and $G_{\text{Exp3}}$, obtained by summing over $t$
Compute a direct lower bound on $\ln(W_{T+1}/W_1)$
Apply the bound to $G_{\text{Exp3}}$

26 Proof
We begin with the definition of the weight sums: $\frac{W_{t+1}}{W_t} = \sum_{i=1}^{K} \frac{w_i(t+1)}{W_t}$
Recall the weight update definition $w_j(t+1) \leftarrow w_j(t)\cdot\exp\!\left(\frac{\gamma\,\hat{x}_j(t)}{K}\right)$, which gives
$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{K} \frac{w_i(t)}{W_t}\exp\!\left(\frac{\gamma}{K}\hat{x}_i(t)\right)$

27 Proof
$\frac{W_{t+1}}{W_t} = \ldots = \sum_{i=1}^{K} \frac{w_i(t)}{W_t}\exp\!\left(\frac{\gamma}{K}\hat{x}_i(t)\right)$
Recall the probability definition: $p_i(t) := (1-\gamma)\cdot\frac{w_i(t)}{W_t} + \frac{\gamma}{K}$
Some algebra gives $\frac{p_i(t) - \gamma/K}{1-\gamma} = \frac{w_i(t)}{W_t}$, so
$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\exp\!\left(\frac{\gamma}{K}\hat{x}_i(t)\right)$

28 Proof
$\frac{W_{t+1}}{W_t} = \ldots = \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\exp\!\left(\frac{\gamma}{K}\hat{x}_i(t)\right)$
Note the following inequality for $x \le 1$: $e^x \le 1 + x + (e-2)x^2$ (no, it's not a famous one; yes, it works, I tested it)
It applies here because $\frac{\gamma}{K}\hat{x}_i(t) \le 1$ by fact #2, so
$\frac{W_{t+1}}{W_t} \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\hat{x}_i(t)\right)^2\right]$

29 Proof
$\frac{W_{t+1}}{W_t} = \ldots \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\hat{x}_i(t)\right)^2\right]$
Opening the square brackets, we get (just ugly math):
$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot 1 = \sum_{i=1}^{K} \frac{(1-\gamma)\,w_i(t)/W_t}{1-\gamma} = \sum_{i=1}^{K} \frac{w_i(t)}{W_t} = 1$
$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot\frac{\gamma}{K}\hat{x}_i(t) = \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K}\left(p_i(t)\,\hat{x}_i(t) - \frac{\gamma}{K}\hat{x}_i(t)\right) \le \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)$
$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot(e-2)\left(\frac{\gamma}{K}\hat{x}_i(t)\right)^2 \le \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2$

30 Proof
$\frac{W_{t+1}}{W_t} = \ldots \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\hat{x}_i(t)\right)^2\right]$
Combining it all, we have
$\frac{W_{t+1}}{W_t} \le 1 + \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2$
Using fact #3 ($\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) = x_{i_t}(t)$) and fact #4 ($\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2 \le \sum_{i=1}^{K} \hat{x}_i(t)$):
$\frac{W_{t+1}}{W_t} \le 1 + \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$

31 Proof
$\frac{W_{t+1}}{W_t} \le 1 + \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$
Taking the log and using $1+x \le e^x$ (specifically $\ln(1+x) \le x$):
$\ln\frac{W_{t+1}}{W_t} \le \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$
Summing over $t$ from 1 to $T$ we get:
$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\sum_{t=1}^{T} x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$

32 Proof
$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\sum_{t=1}^{T} x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$
$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\,G_{\text{Exp3}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$
$(*)\quad \frac{1-\gamma}{\gamma/K}\ln\frac{W_{T+1}}{W_1} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$
We'll get back to this equation in a second

33 Proof
Choosing some action $j$, and since $w_j(T+1) \le W_{T+1}$, note that:
$\ln\frac{W_{T+1}}{W_1} \ge \ln\frac{w_j(T+1)}{W_1} = \ln w_j(T+1) - \ln W_1$
Recall that $w_j(t+1) \leftarrow w_j(t)\cdot\exp\!\left(\frac{\gamma\,\hat{x}_j(t)}{K}\right)$, so we can expand the log recursively:
$\ln w_j(T+1) = \frac{\gamma}{K}\hat{x}_j(T) + \ln w_j(T) = \ldots = \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) + \ln w_j(1)$
Recall that $w_j(1) = 1$ and $W_1 = K$ (uniform initialization), so:
$(\$)\quad \ln\frac{W_{T+1}}{W_1} \ge \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \ln K$

34 Proof
$(*)\quad \frac{1-\gamma}{\gamma/K}\ln\frac{W_{T+1}}{W_1} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$
$(\$)\quad \ln\frac{W_{T+1}}{W_1} \ge \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \ln K$
Combining $(*)$ and $(\$)$ we get:
$\frac{1-\gamma}{\gamma/K}\cdot\frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \frac{1-\gamma}{\gamma/K}\ln K - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$
$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$

35 Proof
$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$
Recall that $\mathbb{E}[\hat{x}_i(t) \mid i_1, i_2, \dots, i_{t-1}] = x_i(t)$. Taking expectations on both sides we get:
$(1-\gamma)\sum_{t=1}^{T} x_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{i=1}^{K}\sum_{t=1}^{T} x_i(t) \le \mathbb{E}[G_{\text{Exp3}}]$
Since $\sum_{t=1}^{T} x_c(t) \le G_{\max}$ for any $c$, and since the term carries a minus sign, we can write:
$(1-\gamma)\sum_{t=1}^{T} x_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\cdot K\,G_{\max} \le \mathbb{E}[G_{\text{Exp3}}]$

36 Proof
$(1-\gamma)\sum_{t=1}^{T} x_j(t) - \frac{K\ln K}{\gamma} - (e-2)\,\gamma\,G_{\max} \le \mathbb{E}[G_{\text{Exp3}}]$
Note this was true for any action $j$, including the maximal one!
$(1-\gamma)\,G_{\max} - \frac{K\ln K}{\gamma} - (e-2)\,\gamma\,G_{\max} \le \mathbb{E}[G_{\text{Exp3}}]$
$G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$
Q.E.D.(!)

37 Break(?)

38 Is it enough?
Maybe? If it were a clear "yes", I wouldn't have asked :P
What was our problem?

39 Room for improvement
We assumed an upper bound $g \ge G_{\max}$ is known
We then selected $\gamma = \min\left\{1, \sqrt{\frac{K\ln K}{(e-1)\,g}}\right\}$
This obtained the bound $G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le 2.63\sqrt{g\,K\ln K}$
If we don't know $G_{\max}$, we may overshoot $g$, for example by setting $g := T$
This would yield a less tight result :/

40 New goal
Maintain the same bound on the weak regret: $O(\sqrt{K\,G_{\max}\ln K})$
Don't assume a known $g \ge G_{\max}$
In fact, do better: maintain this bound uniformly throughout the execution*
*A formal definition will follow

41 Notations# (better than ++++)
$G_i(t+1)$: the return of action $i$ up to and including time $t$: $G_i(t+1) = \sum_{s=1}^{t} x_i(s)$
$\hat{G}_i(t+1)$: the estimated return of action $i$ up to time $t$: $\hat{G}_i(t+1) = \sum_{s=1}^{t} \hat{x}_i(s)$
$G_{\max}(t+1)$: the maximal return of any single action up to time $t$: $G_{\max}(t+1) = \max_i G_i(t+1)$

42 Re-stating our goal
At every time step $1 \le t \le T$, the weak regret up to that point should satisfy the bound $O(\sqrt{K\,G_{\max}(t+1)\ln K})$
This bound will hold all the way!
This also hints at how to do it: previously we guessed $g \ge G_{\max}$; instead, let's maintain $g \ge G_{\max}(t+1)$, updating $g$ and $\gamma$ as $G_{\max}(t)$ grows
Effectively, we are searching for the right $\gamma$!

43 Exp3.1 (creative names!)
Initialize: set $t \leftarrow 1$
For each "epoch" $r = 0, 1, 2, \dots$ do:
$g_r := 4^r\cdot\frac{K\ln K}{e-1}$
(Re-)initialize Exp3 with $\gamma_r := \min\left\{1, \sqrt{\frac{K\ln K}{(e-1)\,g_r}}\right\} = \min\{1, 1/2^r\}$
While $\max_i \hat{G}_i(t) \le g_r - K/\gamma_r$ do:
Do one step with Exp3 (updating our tracking of $\hat{G}_i(t+1)$ for all actions)
$t \leftarrow t+1$
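A minimal Python sketch of this epoch schedule wrapped around the Exp3 step from before (the pull callback and the horizon are illustrative assumptions; the epoch constants follow the slide):

```python
import math
import random

def exp3_1(pull, K, T):
    """Sketch of Exp3.1 (assumes K >= 2); pull(i, t) is an assumed reward callback."""
    c = K * math.log(K) / (math.e - 1)
    Ghat = [0.0] * K                          # estimated returns \hat{G}_i
    total_reward, t, r = 0.0, 1, 0
    while t <= T:                             # epochs r = 0, 1, 2, ...
        g_r = c * 4 ** r
        gamma = min(1.0, 2.0 ** (-r))         # = min(1, sqrt(K ln K / ((e-1) g_r)))
        weights = [1.0] * K                   # (re-)initialize Exp3
        while t <= T and max(Ghat) <= g_r - K / gamma:
            wsum = sum(weights)
            probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
            i = random.choices(range(K), weights=probs)[0]
            x = pull(i, t)
            total_reward += x
            xhat = x / probs[i]               # unbiased estimate of the observed reward
            Ghat[i] += xhat                   # tracked across epochs for the restart test
            weights[i] *= math.exp(gamma * xhat / K)
            t += 1
        r += 1
    return total_reward
```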

44 Bounds (Theorem 4.1)
Theorem 4.1: $G_{\max} - \mathbb{E}[G_{\text{Exp3.1}}] \le 8\sqrt{(e-1)\,G_{\max}\,K\ln K} + 8(e-1)K + 2K\ln K$
Proof outline:
Lemma 4.2: bound the weak regret per "epoch", by giving a lower bound on the actual reward
Lemma 4.3: bound the number of epochs for a finite horizon $T$
Combine both to obtain the final bound

45 Epoch notations
Let $T$ be the overall number of steps (needed only for the proof; we don't need to know it in advance)
Let $R$ be the overall number of epochs (same here)
Let $S_r$ and $T_r$ be the first and last steps ($t$) of epoch $r$; $S_1 = 1$, $T_R = T$
An epoch may be empty, in which case $S_r = T_r + 1 > T_r$
Also abbreviate $\hat{G}_{\max} := \max_i \hat{G}_i(T+1)$ and $G_{\max} := G_{\max}(T+1)$

46 Lemma 4.2
For any action $j$ and for every epoch $r$:
$\sum_{t=S_r}^{T_r} x_{i_t}(t) \ge \sum_{t=S_r}^{T_r} \hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}$
Holds trivially for empty epochs; we'll prove it for $T_r \ge S_r$

47 Proof
Quite a few slides back, we proved the following:
$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K}\hat{x}_i(t) \le G_{\text{Exp3}}$
Translating to our current notation, this becomes:
$(1-\gamma_r)\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\sum_{t=S_r}^{T_r}\sum_{i=1}^{K}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\sum_{i=1}^{K}\sum_{t=S_r}^{T_r}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$

48 Proof
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\sum_{i=1}^{K}\sum_{t=S_r}^{T_r}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$
Enlarging the subtracted estimated rewards to include all previous epochs as well:
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,\hat{G}_j(T_r+1) - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\sum_{i=1}^{K}\hat{G}_i(T_r+1) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$
From the termination condition, we know that for any action $i$, at any time $t$ in epoch $r$: $\hat{G}_i(t) \le g_r - K/\gamma_r$
From fact #2, we know that $\hat{x}_i(t) \le K/\gamma_r$
Together we get that for any action $\hat{G}_i(t+1) \le g_r$, and in particular $\hat{G}_i(T_r+1) \le g_r$, so:
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,g_r - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\cdot K\,g_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$

49 Proof
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,g_r - \frac{K\ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K}\cdot K\,g_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$
$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - (e-1)\,g_r\,\gamma_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$
Recall that $\gamma_r = \min\left\{1, \sqrt{\frac{K\ln K}{(e-1)\,g_r}}\right\}$
Substitute, and Q.E.D.

50 Lemma 4.3
Denote $c := \frac{K\ln K}{e-1}$ and $z := 2^{R-1}$, so that $g_r = c\cdot 4^r$
Then the number of epochs $R$ satisfies the inequality $2^{R-1} = z \le \frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}$
Holds trivially for $R = 0$; we'll prove it for $R \ge 1$

51 Proof
Regardless of the lemma, we know that:
$\hat{G}_{\max} = \max_i \hat{G}_i(T+1) = \max_i \hat{G}_i(T_R+1) \ge \max_i \hat{G}_i(T_{R-1}+1)$
$T_{R-1}$ is the last step of epoch $R-1$, so right after it the epoch continuation condition was violated:
$\max_i \hat{G}_i(T_{R-1}+1) > g_{R-1} - \frac{K}{\gamma_{R-1}} = c\cdot 4^{R-1} - \frac{K}{\min\{1, 2^{-(R-1)}\}} \ge c\cdot 4^{R-1} - K\cdot 2^{R-1} = c z^2 - K z$

52 Proof
So, we've shown that $\hat{G}_{\max} > c z^2 - K z$
$z = 2^{R-1}$ is the variable here, as we are trying to bound $R$
The right-hand side is a parabola in $z$; its minimum is obtained at $z = K/(2c)$ (reminder: the extremum of $a x^2 + b x + c$ is at $-b/2a$)
It increases monotonically for $z > K/(2c)$

53 Proof
Now, suppose the claim is false
Reversing the claimed inequality, we get $z > \frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}$
Both $z$ and the right-hand side are larger than $K/(2c)$, so we are in the increasing part of $c z^2 - K z$:
$c z^2 - K z > c\left(\frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}\right)^2 - K\left(\frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}\right) = \frac{K^2}{c} + 2K\sqrt{\frac{\hat{G}_{\max}}{c}} + \hat{G}_{\max} - \frac{K^2}{c} - K\sqrt{\frac{\hat{G}_{\max}}{c}} = \hat{G}_{\max} + K\sqrt{\frac{\hat{G}_{\max}}{c}} \ge \hat{G}_{\max}$

54 Proof
Assuming the claim is false, we got $c z^2 - K z \ge \hat{G}_{\max}$
But before, we proved that $\hat{G}_{\max} > c z^2 - K z$, a contradiction
Q.E.D.

55 Theorem 4.1
Our target was:
$G_{\max} - \mathbb{E}[G_{\text{Exp3.1}}] \le 8\sqrt{(e-1)\,G_{\max}\,K\ln K} + 8(e-1)K + 2K\ln K$
Let's begin

56 Proof
$G_{\text{Exp3.1}} := \sum_{t=1}^{T} x_{i_t}(t) = \sum_{r=0}^{R}\sum_{t=S_r}^{T_r} x_{i_t}(t)$
Lemma 4.2 says that for any action $j$: $\sum_{t=S_r}^{T_r} x_{i_t}(t) \ge \sum_{t=S_r}^{T_r}\hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}$
Then this also holds for the maximal action:
$G_{\text{Exp3.1}} \ge \max_j \sum_{r=0}^{R}\left[\sum_{t=S_r}^{T_r}\hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}\right]$

57 Proof
$G_{\text{Exp3.1}} \ge \max_j \sum_{r=0}^{R}\left[\sum_{t=S_r}^{T_r}\hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}\right]$
$= \max_j \hat{G}_j(T+1) - \sum_{r=0}^{R} 2\sqrt{(e-1)\cdot 4^r\cdot\frac{K\ln K}{e-1}\cdot K\ln K}$
$= \max_j \hat{G}_j(T+1) - 2K\ln K\cdot\sum_{r=0}^{R} 2^r$
$= \hat{G}_{\max} - 2K\ln K\,(2^{R+1} - 1)$

58 Proof
$G_{\text{Exp3.1}} \ge \hat{G}_{\max} - 2K\ln K\,(2^{R+1} - 1)$
Using Lemma 4.3, $2^{R+1} = 4z \le \frac{4(e-1)}{\ln K} + 4\sqrt{\frac{(e-1)\,\hat{G}_{\max}}{K\ln K}}$, we obtain
$G_{\text{Exp3.1}} \ge \hat{G}_{\max} - 2K\ln K\left(\frac{4(e-1)}{\ln K} + 4\sqrt{\frac{(e-1)\,\hat{G}_{\max}}{K\ln K}} - 1\right)$
$= \hat{G}_{\max} - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K} + 2K\ln K$
$\ge \hat{G}_{\max} - 2K\ln K - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$

59 Finishing the proof
$G_{\text{Exp3.1}} \ge \hat{G}_{\max} - 2K\ln K - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$
$\hat{G}_{\max} - G_{\text{Exp3.1}} \le 2K\ln K + 8(e-1)K + 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$
Using Jensen's inequality (where the random variable is $\hat{G}_{\max}$), we take expectations and complete the proof
Full details are in the paper; we don't have time at this point anyhow :P

60 Questions?

