The Nonstochastic Multiarmed Bandit Problem


The Nonstochastic Multiarmed Bandit Problem
Seminar on Experts and Bandits, Fall 2017-2018
Barak Itkin

Problem setup
- $K$ slot machines ("arms"), and no experts
- Rewards bounded in $[0,1]$ (can also be generalized to other bounds)
- Partial information: you only learn the reward of the arm you pulled
- "No assumptions at all" on the slot machines: not even a fixed distribution; in fact, the rewards can be adversarial

Motivation
Think of packet routing:
- Multiple routes exist to reach the destination
- Rewards are the round-trip times, which change depending on the load in different parts of the network
- Partial information: you only learn about the packet you actually sent
- "No assumptions at all" on load behavior: the load can change dramatically over time

Back to the problem
- The arm assignment is determined in advance, i.e. before the first arm is pulled
- Adversarial = the assignment can be picked after the strategy is already known, but still before the game begins
- We want to minimize the regret: "How much better could we have done?"
- We'll start with the "weak" regret: comparing against always pulling one arm (the "best" arm)

Our Goal
We would like to show an algorithm that has the following bound on the "weak" regret:
$$O\!\left(\sqrt{K\,G_{\max}\,\ln K}\right)$$
where $G_{\max}$ is the return of the best arm.
This should work for any setup of arms, including random/adversarial setups.

Difference from last week?
In the second half we saw:
- $K$ experts, each loses a certain fraction of what they get
- We can split our investment between multiple experts
- We always learn about all the experts
- Regret compared to the single best expert ("weak" regret)
Where's the difference?
- Solo investment: we choose only one expert
- Partial information: we learn only about the chosen expert
Bounds:
- Last week: $\ln K + \sqrt{2T\ln K}$
- Ours: $O\!\left(\sqrt{K\,G_{\max}\,\ln K}\right)$

Notations
- $t$ – the time step, also known as a "trial": $t \in \{1,2,\dots,T\}$
- $x_i(t)$ – the reward of arm $i$ at trial $t$: $x_i(t) \in [0,1]$
- $A$ – an algorithm, choosing the arm $i_t$ at trial $t$
  - The input at time $t$ is the arms chosen so far and their observed rewards
  - Formally: an element of $\left(\{1,2,\dots,K\}\times[0,1]\right)^{t-1}$

Notations++
- $G_A(T)$ – the return of algorithm $A$ at time horizon $T$:
  $G_A(T) = \sum_{t=1}^{T} x_{i_t}(t)$
  Will be abbreviated as $G_A$ where the $T$ is obvious.
- $G_{\max}(T)$ – the best single-arm return at time horizon $T$:
  $G_{\max}(T) = \max_j \sum_{t=1}^{T} x_j(t)$
  Will also be abbreviated as $G_{\max}$.

Recalling our baseline
The naïve approach

Simplistic approach
- Explore: check each arm $B$ times
- Exploit: continue pulling only the best arm
- Profit? Only with a "fixed distribution" (no significant changes over time); may fail with arbitrary/adversarial rewards
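A minimal sketch of this explore-then-exploit baseline, assuming a hypothetical reward oracle `pull(i)` that returns a reward in $[0,1]$ for arm `i` (the names and parameters here are illustrative, not from the paper):

```python
import random


def explore_then_exploit(pull, K, B, T):
    """Naive baseline: try each arm B times, then commit to the best one.

    `pull(i)` is an assumed reward oracle returning a value in [0, 1].
    This only makes sense when the reward distributions are fixed over time.
    """
    totals = [0.0] * K
    reward = 0.0
    # Explore: pull every arm B times.
    for i in range(K):
        for _ in range(B):
            x = pull(i)
            totals[i] += x
            reward += x
    # Exploit: commit to the empirically best arm for the remaining rounds.
    best = max(range(K), key=lambda i: totals[i])
    for _ in range(T - K * B):
        reward += pull(best)
    return reward
```

With adversarial rewards, the arm that looked best during exploration can stop paying right after we commit, which is exactly why this baseline fails.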

Multiplicative Weights approach
- Maintain a weight $w_i(t)$ for arm $i$ at trial $t$; start with uniform weights
- Use the weights to define a distribution $p_i(t)$, typically $p_i(t) = w_i(t) \big/ \sum_{j=1}^{K} w_j(t)$
- Pick an arm by sampling the distribution
- Update the weights based on the rewards of each action: when the rewards are known, multiply by a function $u$ of the reward, $w_i(t+1) \leftarrow w_i(t)\cdot u\!\left(x_i(t)\right)$
- We'll discuss the partial-information setup today
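For contrast with the partial-information case, here is a minimal sketch of one full-information multiplicative-weights step; the exponential update $u(x)=e^{\eta x}$ and the learning-rate name `eta` are illustrative assumptions, not the only possible choice:

```python
import math
import random


def mw_step(weights, rewards, eta=0.1):
    """One full-information multiplicative-weights step (illustrative sketch).

    All K rewards are observed, so every weight is multiplied by u(x_i).
    """
    total = sum(weights)
    probs = [w / total for w in weights]                       # distribution p_i(t)
    i = random.choices(range(len(weights)), weights=probs)[0]  # sampled arm
    new_weights = [w * math.exp(eta * x) for w, x in zip(weights, rewards)]
    return i, new_weights
```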

First attempt
The Exp3 algorithm

Exp3 – Exploration
- Start with uniform weights (no surprise here)
- Always encourage the algorithm to explore more arms:
  - Add an "exploration factor" $\gamma$ to the probability, so each arm always has probability at least $\gamma/K$
  - The exploration is controlled by $\gamma\in(0,1]$, which can be fine-tuned later
- The exploration factor does not change over time: rewards may be arbitrary or even adversarial, so we must always continue exploring

Estimation rationale
Question story time:
- You randomly visit a shop on 20% of the days
- Only on those days do you know their profit
- However, you need to give a profit estimate for every day
- What do you do?
Answer:
- On days you visit, estimate $5\cdot$ (the profit of that day)
- On other days, estimate $0$
More generally: estimate $\dfrac{\text{observation}}{\text{chance to observe}}$ if seen, and $0$ otherwise.
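A quick numerical sanity check of this rule; the profit value, the 20% visit probability, and the number of simulated days are made-up illustration parameters:

```python
import random


def estimated_daily_profit(true_profit=37.0, visit_prob=0.2, days=100_000):
    """Average the importance-weighted estimate over many days.

    On a visited day we record profit / visit_prob, otherwise 0; the average
    converges to the true daily profit even though most days go unobserved.
    """
    total = 0.0
    for _ in range(days):
        if random.random() < visit_prob:
            total += true_profit / visit_prob  # observation / chance to observe
        # unobserved days contribute 0
    return total / days


print(estimated_daily_profit())  # close to 37.0 in expectation
```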

Exp3 – Weight updates
- To update the weights, use "estimated rewards" instead of the actual rewards: $\hat{x}_i(t) = x_i(t)/p_i(t)$ for the chosen arm
- For actions that weren't chosen, treat the estimate as $\hat{x}_i(t) = 0$
- This creates an unbiased estimator:
  $$E\!\left[\hat{x}_i(t)\mid i_1,i_2,\dots,i_{t-1}\right] = p_i(t)\cdot \frac{x_i(t)}{p_i(t)} + \left(1-p_i(t)\right)\cdot 0 = x_i(t)$$
- This equality holds for all actions, not just the chosen one
- This helps with the weight updates under partial information

Exp3
Initialize: set $w_i(1) \leftarrow 1$ for all actions $i$.
For each $t$:
- Sample $i_t$ from $p_i(t) \triangleq (1-\gamma)\cdot\dfrac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \dfrac{\gamma}{K}$
- Receive the reward $x_{i_t}(t)$
- Update the weights with $w_i(t+1) \leftarrow w_i(t)\cdot \exp\!\left(\dfrac{\gamma\,\hat{x}_i(t)}{K}\right)$
  - $\hat{x}_i(t)=0$ for all but the selected action, so in practice we only update $w_{i_t}(t)$
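Putting the slide together, a minimal Python sketch of Exp3; the reward oracle `pull(i, t)` stands in for the environment and is an assumption for illustration:

```python
import math
import random


def exp3(pull, K, T, gamma):
    """Run Exp3 for T trials with exploration factor gamma in (0, 1]."""
    weights = [1.0] * K                      # w_i(1) = 1 for all arms
    total_reward = 0.0
    for t in range(1, T + 1):
        w_sum = sum(weights)
        # p_i(t) = (1 - gamma) * w_i(t) / sum_j w_j(t) + gamma / K
        probs = [(1 - gamma) * w / w_sum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]
        x = pull(i, t)                       # only the pulled arm is observed
        total_reward += x
        x_hat = x / probs[i]                 # importance-weighted estimate
        # x_hat is implicitly 0 for every other arm, so only w_i changes.
        weights[i] *= math.exp(gamma * x_hat / K)
    return total_reward
```

For instance, `exp3(lambda i, t: random.random() * (i + 1) / 5, K=5, T=10_000, gamma=0.1)` runs it against a toy stochastic environment.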

Bounds
Since the algorithm is randomized, we care about the expected "weak" regret:
$$E\!\left[G_{\max} - G_{\text{Exp3}}\right] = G_{\max} - E\!\left[G_{\text{Exp3}}\right]$$
Theorem:
$$G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$$
No, this is not what we wanted ($\sqrt{K\,G_{\max}\,\ln K}$); we'll fix that later.

Bound intuition
$$G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$$
$\gamma$ is the exploration factor:
- As $\gamma\to 1$, we stop exploiting our weights and instead prefer to look around "randomly"; the first term grows, so the regret bound approaches the order of $G_{\max}$
- As $\gamma\to 0$, we exploit our weights too much and check for fewer changes in the other arms; the second term blows up, and it becomes worse when we have more arms ($K$)
- We must find the "sweet spot"

The sweet spot
$$G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$$
For some $g \ge G_{\max}$, consider $\gamma = \min\left\{1, \sqrt{\dfrac{K\ln K}{(e-1)\,g}}\right\} \in (0,1]$.
This yields
$$\dots \le 2\sqrt{(e-1)\,g\,K\ln K} \approx 2.63\,\sqrt{g\,K\ln K},$$
which is similar to the bound we wanted ($O(\sqrt{K\,G_{\max}\,\ln K})$).
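A short verification of the substitution, assuming $g \ge G_{\max}$ and that the minimum is attained by the square-root term (i.e. $\gamma < 1$); the constant comes from $2\sqrt{e-1}\approx 2.63$:

```latex
\gamma = \sqrt{\tfrac{K\ln K}{(e-1)g}}
\;\Longrightarrow\;
(e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}
\;\le\; (e-1)\,\gamma\,g + \frac{K\ln K}{\gamma}
\;=\; \sqrt{(e-1)\,g\,K\ln K} + \sqrt{(e-1)\,g\,K\ln K}
\;=\; 2\sqrt{(e-1)\,g\,K\ln K}.
```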

Proof
$$G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$$
Enough talking, let's start the proof!
- We assume $0<\gamma<1$; the bound trivially holds for $\gamma=1$ (as $e-1\ge 1$)
- Furthermore, we'll want to rely on some facts, as outlined in the next slides

Fact #1
Irrelevant to our cause; we just want the numbering (starting at 2) to be consistent with the paper.

Fact #2
$\hat{x}_i(t) \le K/\gamma$
Proof: $\hat{x}_i(t) \triangleq x_i(t)/p_i(t)$.
- Since $x_i(t) \le 1$, we get $\hat{x}_i(t) \le 1/p_i(t)$
- Since $p_i(t) = \dots + \gamma/K \ge \gamma/K$, we get $\hat{x}_i(t) \le K/\gamma$

Fact #3
$\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) = x_{i_t}(t)$
- Since $\hat{x}_i(t) \triangleq x_i(t)/p_i(t)$, each term $p_i(t)\,\hat{x}_i(t)$ becomes $x_i(t)$
- We defined $\hat{x}_i(t)=0$ for all unobserved actions, so only the term for $i_t$ remains

Fact #4
$\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2 \le \sum_{i=1}^{K} \hat{x}_i(t)$
- From the previous slide (Fact #3), $\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2 = x_{i_t}(t)\cdot \hat{x}_{i_t}(t)$
- Since $x_{i_t}(t) \le 1$, this is $\le \hat{x}_{i_t}(t)$
- As before, since the unobserved actions contribute nothing, $\hat{x}_{i_t}(t) = \sum_{i=1}^{K} \hat{x}_i(t)$

Proof – outline
Denote $W_t \triangleq \sum_{i=1}^{K} w_i(t)$.
- Establish a relation between $W_{t+1}/W_t$ and $\hat{x}_{i_t}(t)$
- Establish a relation between $\ln\left(W_{t+1}/W_t\right)$ and $\hat{x}_{i_t}(t)$, obtained with $1+x \le e^x$
- Establish a relation between $\ln\left(W_{T+1}/W_1\right)$ and $G_{\text{Exp3}}$, obtained by summing over $t$
- Compute a direct bound on $\ln\left(W_{T+1}/W_1\right)$
- Apply the bound to $G_{\text{Exp3}}$

Proof
We begin with the definition of the weight sums:
$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{K} \frac{w_i(t+1)}{W_t}$$
Recalling the weight update definition $w_j(t+1) \leftarrow w_j(t)\cdot\exp\!\left(\gamma\,\hat{x}_j(t)/K\right)$:
$$= \sum_{i=1}^{K} \frac{w_i(t)}{W_t}\,\exp\!\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)$$

Proof
$$\frac{W_{t+1}}{W_t} = \dots = \sum_{i=1}^{K} \frac{w_i(t)}{W_t}\,\exp\!\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)$$
Recall the probability definition: $p_i(t) \triangleq (1-\gamma)\cdot\dfrac{w_i(t)}{W_t} + \dfrac{\gamma}{K}$.
Some algebra gives $\dfrac{p_i(t) - \gamma/K}{1-\gamma} = \dfrac{w_i(t)}{W_t}$, and so
$$= \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\,\exp\!\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)$$

Proof
$$\frac{W_{t+1}}{W_t} = \dots = \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\,\exp\!\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)$$
Note the following inequality for $x \le 1$:
$$e^x \le 1 + x + (e-2)\,x^2$$
(No, it's not a famous one. Yes, it works, I tested it.) It applies here because $\frac{\gamma}{K}\,\hat{x}_i(t) \le 1$ by Fact #2. Thus
$$\le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\,\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)^{2}\right]$$

Proof
$$\frac{W_{t+1}}{W_t} = \dots \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\,\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)^{2}\right]$$
Opening the square brackets, we get (just ugly math):
$$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot 1 = \sum_{i=1}^{K} \frac{(1-\gamma)\,\frac{w_i(t)}{W_t}}{1-\gamma} = \sum_{i=1}^{K} \frac{w_i(t)}{W_t} = 1$$
$$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot \frac{\gamma}{K}\,\hat{x}_i(t) = \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K}\left[p_i(t)\,\hat{x}_i(t) - \frac{\gamma}{K}\,\hat{x}_i(t)\right] \le \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)$$
$$\sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\cdot (e-2)\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)^{2} \le \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2$$

Proof
$$\frac{W_{t+1}}{W_t} = \dots \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma}\left[1 + \frac{\gamma}{K}\,\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\,\hat{x}_i(t)\right)^{2}\right]$$
Combining it all, we have
$$\le 1 + \frac{\gamma/K}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2$$
Using Fact #3 ($\sum_{i} p_i(t)\,\hat{x}_i(t) = x_{i_t}(t)$) and Fact #4 ($\sum_{i} p_i(t)\,\hat{x}_i(t)^2 \le \sum_{i} \hat{x}_i(t)$):
$$\le 1 + \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$$

Proof
$$\frac{W_{t+1}}{W_t} = \dots \le 1 + \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$$
Taking the log and using $1+x \le e^x$ (specifically $\ln(1+x) \le x$):
$$\ln\frac{W_{t+1}}{W_t} \le \frac{\gamma/K}{1-\gamma}\,x_{i_t}(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{i=1}^{K} \hat{x}_i(t)$$
Summing over $t$ from $1$ to $T$ (the sum of logs telescopes) we get:
$$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\sum_{t=1}^{T} x_{i_t}(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$$

Proof
$$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\sum_{t=1}^{T} x_{i_t}(t) + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$$
Since $\sum_{t=1}^{T} x_{i_t}(t) = G_{\text{Exp3}}$:
$$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma}\,G_{\text{Exp3}} + \frac{(e-2)\,(\gamma/K)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$$
Rearranging:
$$(*)\qquad \frac{1-\gamma}{\gamma/K}\,\ln\frac{W_{T+1}}{W_1} - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$$
We'll get back to this equation in a second.

Proof
Choosing some action $j$, and since $w_j(T+1) \le W_{T+1}$, note that:
$$\ln\frac{W_{T+1}}{W_1} \ge \ln\frac{w_j(T+1)}{W_1} = \ln w_j(T+1) - \ln W_1$$
Recall that $w_j(t+1) \leftarrow w_j(t)\cdot\exp\!\left(\gamma\,\hat{x}_j(t)/K\right)$, so we can expand the log recursively:
$$\ln w_j(T+1) = \frac{\gamma}{K}\,\hat{x}_j(T) + \ln w_j(T) = \dots = \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) + \ln w_j(1)$$
Recall that $w_j(1)=1$ and $W_1=K$ (uniform initialization), and so:
$$(\$)\qquad \ln\frac{W_{T+1}}{W_1} \ge \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \ln K$$

Proof
$$(*)\qquad \frac{1-\gamma}{\gamma/K}\,\ln\frac{W_{T+1}}{W_1} - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le G_{\text{Exp3}}$$
$$(\$)\qquad \ln\frac{W_{T+1}}{W_1} \ge \frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \ln K$$
Combining $(*)$ and $(\$)$ we get:
$$\frac{1-\gamma}{\gamma/K}\cdot\frac{\gamma}{K}\sum_{t=1}^{T}\hat{x}_j(t) - \frac{1-\gamma}{\gamma/K}\cdot\ln K - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K}\hat{x}_i(t) \le G_{\text{Exp3}}$$
$$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K}\hat{x}_i(t) \le G_{\text{Exp3}}$$
(dropping the $(1-\gamma)$ factor on the $\ln K$ term only makes the subtracted term larger, so the inequality still holds)

Proof
$$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K}\hat{x}_i(t) \le G_{\text{Exp3}}$$
Recall we saw that $E\!\left[\hat{x}_i(t)\mid i_1,i_2,\dots,i_{t-1}\right] = x_i(t)$. Taking expectations on both sides we get:
$$(1-\gamma)\sum_{t=1}^{T}x_j(t) - \frac{K\ln K}{\gamma} - \frac{(e-2)\,\gamma}{K}\sum_{i=1}^{K}\sum_{t=1}^{T}x_i(t) \le E\!\left[G_{\text{Exp3}}\right]$$
Since $\sum_{t=1}^{T}x_i(t) \le G_{\max}$ for any action $i$, and since that sum appears with a minus sign, we can do:
$$(1-\gamma)\sum_{t=1}^{T}x_j(t) - \frac{K\ln K}{\gamma} - \frac{(e-2)\,\gamma}{K}\cdot K\,G_{\max} \le E\!\left[G_{\text{Exp3}}\right]$$

Proof
$$(1-\gamma)\sum_{t=1}^{T}x_j(t) - \frac{K\ln K}{\gamma} - (e-2)\,\gamma\,G_{\max} \le E\!\left[G_{\text{Exp3}}\right]$$
Note this was true for any action $j$, which means including the maximal action!
$$(1-\gamma)\,G_{\max} - \frac{K\ln K}{\gamma} - (e-2)\,\gamma\,G_{\max} \le E\!\left[G_{\text{Exp3}}\right]$$
Rearranging:
$$G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le (e-1)\,\gamma\,G_{\max} + \frac{K\ln K}{\gamma}$$
Q.E.D.(!)

Break(?)

Is it enough?
Maybe? If it was a clear "yes", I wouldn't have asked it :P
So, what was our problem?

Room for improvement
- We assumed an upper limit $g \ge G_{\max}$ is known
- We then selected $\gamma = \min\left\{1, \sqrt{\dfrac{K\ln K}{(e-1)\,g}}\right\}$
- This obtained the bound $G_{\max} - E\!\left[G_{\text{Exp3}}\right] \le 2.63\,\sqrt{g\,K\ln K}$
- If we don't know $G_{\max}$, we may overshoot $g$, for example by setting $g := T$
- This would yield a less tight result :/

New goal
- Maintain the same bound on the weak regret: $O\!\left(\sqrt{K\,G_{\max}\,\ln K}\right)$
- Don't assume a known $g \ge G_{\max}$
- In fact, do better: maintain this bound uniformly throughout the execution*
*A formal definition will follow

Notations# (better than ++++)
- $G_i(t+1)$ – the return of action $i$ up to and including trial $t$:
  $G_i(t+1) = \sum_{s=1}^{t} x_i(s)$
- $\hat{G}_i(t+1)$ – the estimated return of action $i$ up to and including trial $t$:
  $\hat{G}_i(t+1) = \sum_{s=1}^{t} \hat{x}_i(s)$
- $G_{\max}(t+1)$ – the maximal return of any single action up to and including trial $t$:
  $G_{\max}(t+1) = \max_i G_i(t+1)$

Re-stating our goal
- At every time step $1 \le t \le T$, the weak regret up to that point should maintain the bound $O\!\left(\sqrt{K\,G_{\max}(t+1)\,\ln K}\right)$
- This bound will hold all the way through the run
- This also gives a hint on how to do it:
  - Previously we guessed $g \ge G_{\max}$
  - Instead, let's maintain $g \ge \hat{G}_{\max}(t+1)$, updating $g$ and $\gamma$ as $\hat{G}_{\max}(t)$ grows
  - Effectively, we are searching for the right $\gamma$!

Exp3.1 (creative names!)
Initialize: set $t \leftarrow 1$.
For each "epoch" $r = 0, 1, 2, \dots$ do:
- $g_r \triangleq \dfrac{K\ln K}{e-1}\cdot 4^{r}$
- (Re-)initialize Exp3 with $\gamma_r \triangleq \min\left\{1, \sqrt{\dfrac{K\ln K}{(e-1)\,g_r}}\right\} = \min\left\{1, 2^{-r}\right\}$
- While $\max_i \hat{G}_i(t) \le g_r - K/\gamma_r$ do:
  - Do one step with Exp3 (updating our tracking of $\hat{G}_i(t+1)$ for all actions)
  - $t \leftarrow t+1$
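A minimal Python sketch of Exp3.1 following the pseudocode above; the reward oracle `pull(i, t)` is again an assumed stand-in for the environment, and the sketch assumes $K \ge 2$ so that $\ln K > 0$:

```python
import math
import random


def exp3_1(pull, K, T):
    """Restart Exp3 in epochs r = 0, 1, 2, ... with a doubling guess g_r."""
    c = K * math.log(K) / (math.e - 1)
    G_hat = [0.0] * K                        # estimated returns \hat{G}_i(t)
    total_reward = 0.0
    t, r = 1, 0
    while t <= T:
        g_r = c * (4 ** r)
        gamma_r = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g_r)))
        weights = [1.0] * K                  # (re-)initialize Exp3
        # Run Exp3 while the epoch's continuation condition holds.
        while t <= T and max(G_hat) <= g_r - K / gamma_r:
            w_sum = sum(weights)
            probs = [(1 - gamma_r) * w / w_sum + gamma_r / K for w in weights]
            i = random.choices(range(K), weights=probs)[0]
            x = pull(i, t)
            total_reward += x
            x_hat = x / probs[i]
            G_hat[i] += x_hat                # track \hat{G}_i(t+1)
            weights[i] *= math.exp(gamma_r * x_hat / K)
            t += 1
        r += 1                               # epoch ends (it may have been empty)
    return total_reward
```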

Bounds (Theorem 4.1)
Theorem 4.1:
$$G_{\max} - E\!\left[G_{\text{Exp3.1}}\right] \le 8\sqrt{(e-1)\,G_{\max}\,K\ln K} + 8(e-1)K + 2K\ln K$$
Proof outline:
- Lemma 4.2: bound the weak regret per "epoch", by giving a lower bound on the actual reward
- Lemma 4.3: bound the number of epochs for a finite horizon $T$
- Combine both to obtain the final bound

Epoch notations
- Let $T$ be the overall number of steps (needed only for the proof; we don't need to know it in advance)
- Let $R$ be the overall number of epochs (same here)
- Let $S_r$ and $T_r$ be the first and last steps ($t$) of epoch $r$, so $S_0 = 1$ and $T_R = T$
- An epoch may be empty; in that case $S_r = T_r + 1 > T_r$
- $G_{\max} \triangleq G_{\max}(T+1)$, and similarly $\hat{G}_{\max} \triangleq \hat{G}_{\max}(T+1) = \max_i \hat{G}_i(T+1)$

Lemma 4.2
For any action $j$ and for every epoch $r$:
$$\sum_{t=S_r}^{T_r} x_{i_t}(t) \;\ge\; \sum_{t=S_r}^{T_r} \hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}$$
- Holds trivially for empty epochs
- We'll prove it for $T_r \ge S_r$

Proof
Some slides back, we proved the following:
$$(1-\gamma)\sum_{t=1}^{T}\hat{x}_j(t) - \frac{K\ln K}{\gamma} - \frac{(e-2)\,\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K}\hat{x}_i(t) \le G_{\text{Exp3}}$$
Translating to our current notations (one epoch of Exp3, run with parameter $\gamma_r$), this becomes
$$(1-\gamma_r)\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\sum_{t=S_r}^{T_r}\sum_{i=1}^{K}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\sum_{i=1}^{K}\sum_{t=S_r}^{T_r}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$

Proof
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\sum_{i=1}^{K}\sum_{t=S_r}^{T_r}\hat{x}_i(t) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$
The subtracted estimated-reward sums only grow if we merge in all previous epochs as well, so:
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,\hat{G}_j(T_r+1) - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\sum_{i=1}^{K}\hat{G}_i(T_r+1) \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$
From the termination condition, we know that for any action $i$, at any time $t$ in epoch $r$: $\hat{G}_i(t) \le g_r - K/\gamma_r$.
From Fact #2, we know that $\hat{x}_i(t) \le K/\gamma_r$.
Together we get that for any action, $\hat{G}_i(t+1) \le g_r$, and in particular $\hat{G}_i(T_r+1) \le g_r$. Therefore:
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,g_r - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\cdot K\,g_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$

Proof
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \gamma_r\,g_r - \frac{K\ln K}{\gamma_r} - \frac{(e-2)\,\gamma_r}{K}\cdot K\,g_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$
$$\sum_{t=S_r}^{T_r}\hat{x}_j(t) - \frac{K\ln K}{\gamma_r} - (e-1)\,g_r\,\gamma_r \le \sum_{t=S_r}^{T_r} x_{i_t}(t)$$
Recall that $\gamma_r = \min\left\{1, \sqrt{\dfrac{K\ln K}{(e-1)\,g_r}}\right\}$; substituting (as in "the sweet spot") gives the $2\sqrt{(e-1)\,g_r\,K\ln K}$ term, and Q.E.D.

Lemma 4.3
Denote $c \triangleq \dfrac{K\ln K}{e-1}$ and $z \triangleq 2^{R-1}$, so that $g_r \triangleq c\cdot 4^{r}$.
Then the number of epochs $R$ satisfies the inequality
$$2^{R-1} \triangleq z \le \frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}} + \frac{1}{2}$$
- Holds trivially for $R=0$
- We'll prove it for $R \ge 1$

Proof
Regardless of the lemma, we know that:
$$\hat{G}_{\max} \triangleq \hat{G}_{\max}(T+1) \triangleq \hat{G}_{\max}(T_R+1) \ge \hat{G}_{\max}(T_{R-1}+1)$$
$T_{R-1}$ is the last step of epoch $R-1$, so right after it the epoch continuation condition was violated:
$$> g_{R-1} - \frac{K}{\gamma_{R-1}} = c\cdot 4^{R-1} - \frac{K}{\min\left\{1, 2^{-(R-1)}\right\}} \ge c\cdot 4^{R-1} - K\cdot 2^{R-1} = c\,z^2 - K\,z$$

Proof
So, we've shown that $\hat{G}_{\max} > c\,z^2 - K\,z$.
- $z = 2^{R-1}$ is the variable here, as we are trying to bound $R$
- The function $c\,z^2 - K\,z$ is a parabola in $z$
- Its minimum is attained at $z = K/2c$ (reminder: the extremum of $az^2+bz+c$ is at $-b/2a$)
- It increases monotonically for $z > K/2c$

Proof
Now, suppose the claim is false. Reversing the original inequality, we get
$$z > \frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}} + \frac{1}{2} > \frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}$$
Both $z$ and the expression on the right are larger than $K/2c$, so we are in the increasing part of $c\,z^2 - K\,z$:
$$c\,z^2 - K\,z > c\left(\frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}\right)^{2} - K\left(\frac{K}{c} + \sqrt{\frac{\hat{G}_{\max}}{c}}\right)
= \frac{K^2}{c} + 2K\sqrt{\frac{\hat{G}_{\max}}{c}} + \hat{G}_{\max} - \frac{K^2}{c} - K\sqrt{\frac{\hat{G}_{\max}}{c}}
= \hat{G}_{\max} + K\sqrt{\frac{\hat{G}_{\max}}{c}} \ge \hat{G}_{\max}$$

Proof
Assuming the claim is false, we got $c\,z^2 - K\,z \ge \hat{G}_{\max}$.
But before, we proved that $\hat{G}_{\max} > c\,z^2 - K\,z$ – a contradiction.
Q.E.D.

Theorem 4.1
Our target was:
$$G_{\max} - E\!\left[G_{\text{Exp3.1}}\right] \le 8\sqrt{(e-1)\,G_{\max}\,K\ln K} + 8(e-1)K + 2K\ln K$$
Let's begin.

Proof
$$G_{\text{Exp3.1}} \triangleq \sum_{t=1}^{T} x_{i_t}(t) \triangleq \sum_{r=0}^{R}\sum_{t=S_r}^{T_r} x_{i_t}(t)$$
Lemma 4.2 says that for any action $j$ and epoch $r$:
$$\sum_{t=S_r}^{T_r} x_{i_t}(t) \ge \sum_{t=S_r}^{T_r} \hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}$$
Then this should also hold for the maximal action:
$$G_{\text{Exp3.1}} \ge \max_j \sum_{r=0}^{R}\left[\sum_{t=S_r}^{T_r} \hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}\right]$$

Proof
$$\ge \max_j \sum_{r=0}^{R}\left[\sum_{t=S_r}^{T_r} \hat{x}_j(t) - 2\sqrt{(e-1)\,g_r\,K\ln K}\right]
= \max_j \hat{G}_j(T+1) - \sum_{r=0}^{R} 2\sqrt{(e-1)\cdot 4^{r}\cdot\frac{K\ln K}{e-1}\cdot K\ln K}$$
$$= \max_j \hat{G}_j(T+1) - 2K\ln K\sum_{r=0}^{R} 2^{r} = \hat{G}_{\max} - 2K\ln K\left(2^{R+1}-1\right)$$

Proof
$$= \hat{G}_{\max} - 2K\ln K\left(2^{R+1}-1\right)$$
Using Lemma 4.3 ($2^{R+1} = 4z$, and $K/c = (e-1)/\ln K$) we obtain
$$\ge \hat{G}_{\max} - 2K\ln K\left[4\left(\frac{e-1}{\ln K} + \sqrt{\frac{(e-1)\,\hat{G}_{\max}}{K\ln K}} + \frac{1}{2}\right) - 1\right]$$
$$= \hat{G}_{\max} - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K} - 4K\ln K + 2K\ln K$$
$$= \hat{G}_{\max} - 2K\ln K - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$$

Finishing the proof
$$G_{\text{Exp3.1}} \ge \hat{G}_{\max} - 2K\ln K - 8(e-1)K - 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$$
$$\hat{G}_{\max} - G_{\text{Exp3.1}} \le 2K\ln K + 8(e-1)K + 8\sqrt{(e-1)\,\hat{G}_{\max}\,K\ln K}$$
Taking expectations and using Jensen's inequality (where the random variable is $\hat{G}_{\max}$, whose expectation is at least $G_{\max}$) completes the proof.
Full details are in the paper; we don't have time at this point anyhow :P

Questions?