Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient
Aviv Rosenberg, 10/01/18, Seminar on Experts and Bandits

Online Convex Optimization Problem
We are given a convex set $S$. In every iteration $t$ we choose $x_t \in S$ and only then receive a convex cost function $c_t : S \to [-C, C]$ for some $C > 0$. We want to minimize the regret
$$\sum_{t=1}^T c_t(x_t) - \min_{x \in S} \sum_{t=1}^T c_t(x).$$
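As a concrete illustration (a toy example added here, not from the talk), this is how the regret of a fixed sequence of plays could be computed offline, with the minimum over $S$ approximated by a finite grid:

```python
import numpy as np

def regret(plays, cost_fns, comparators):
    """Total incurred cost minus the cost of the best fixed point in hindsight
    (the minimum over S is approximated by a finite grid of comparators)."""
    incurred = sum(c(x) for c, x in zip(cost_fns, plays))
    best_fixed = min(sum(c(x) for c in cost_fns) for x in comparators)
    return incurred - best_fixed

# Toy example on S = [-1, 1] with c_t(x) = (x - z_t)^2 for random targets z_t.
rng = np.random.default_rng(0)
zs = rng.uniform(-1, 1, size=50)
costs = [lambda x, z=z: (x - z) ** 2 for z in zs]
plays = rng.uniform(-1, 1, size=50)        # an arbitrary (not very good) strategy
grid = np.linspace(-1, 1, 201)
print(regret(plays, costs, grid))
```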

Bandit Setting
The gradient descent approach: $x_{t+1} = x_t - \eta \nabla c_t(x_t)$. Last week we saw an $O(\sqrt{T})$ regret bound. But now, instead of observing the full function $c_t$, we only observe the scalar value $c_t(x_t)$, so we cannot compute $\nabla c_t(x_t)$. We still want to use gradient descent. Solution: estimate the gradient using a single point. We will show an $O(T^{3/4})$ regret bound.

Notation and Assumptions
$\mathbb{B} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ is the unit ball and $\mathbb{S} = \{x \in \mathbb{R}^d : \|x\| = 1\}$ is the unit sphere.
Expected regret: $\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^T c_t(x)$.
Projection of a point $x$ onto a convex set $S$: $P_S(x) = \arg\min_{z \in S} \|x - z\|$.
Assume $S$ is a convex set such that $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$.
The scaled set $(1-\alpha)S = \{(1-\alpha)x : x \in S\}$ is also convex and satisfies $0 \in (1-\alpha)S \subseteq R\mathbb{B}$; indeed, every $y \in (1-\alpha)S$ can be written as $y = (1-\alpha)x = \alpha \cdot 0 + (1-\alpha)x \in S$, since $0 \in S$ and $S$ is convex.
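To make the projection operator concrete, here is a minimal sketch assuming $S$ is a Euclidean ball centered at the origin, in which case $P_S$ has a closed form:

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection P_S(x) = argmin_{z in S} ||x - z|| for S = radius * B,
    the ball of the given radius centered at the origin."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x
    return (radius / norm) * x

# Example: a point outside the unit ball is rescaled back onto its boundary.
x = np.array([3.0, 4.0])            # ||x|| = 5
print(project_onto_ball(x, 1.0))    # [0.6 0.8]
```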

Part 1 Gradient Estimation

Gradient Estimation
For a function $c_t$ and $\delta > 0$ define the smoothed function $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$, where $v$ is drawn uniformly from the unit ball.
Lemma: $\nabla \hat{c}_t(y) = \frac{d}{\delta} \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u]$.
To get an unbiased estimator of $\nabla \hat{c}_t(y)$ we can sample a unit vector $u$ uniformly and compute $\frac{d}{\delta} c_t(y + \delta u)\, u$.
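A quick Monte Carlo sanity check of the lemma (a minimal sketch added here; the quadratic test cost, the point $y$, and the sample count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 3, 0.1

def c(x):
    # A smooth convex test cost: c(x) = ||x||^2, so grad c(x) = 2x
    # (and here grad c_hat(y) = 2y exactly, since averaging over the delta-ball
    # only adds a constant to a quadratic).
    return float(np.dot(x, x))

def random_unit_vector(dim):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

y = np.array([0.5, -0.2, 0.1])

# Average many one-point estimates (d/delta) * c(y + delta*u) * u.
n = 200_000
estimate = np.zeros(d)
for _ in range(n):
    u = random_unit_vector(d)
    estimate += (d / delta) * c(y + delta * u) * u
estimate /= n

print(estimate)   # close to grad c_hat(y) = 2y = [1.0, -0.4, 0.2]
print(2 * y)
```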

Proof
Recall $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$. For $d = 1$:
$$\hat{c}_t(y) = \mathbb{E}_{v \in [-1,1]}[c_t(y + \delta v)] = \mathbb{E}_{v \in [-\delta,\delta]}[c_t(y + v)] = \frac{1}{2\delta} \int_{-\delta}^{\delta} c_t(y + v)\, dv.$$
Differentiating using the fundamental theorem of calculus:
$$\nabla \hat{c}_t(y) = \hat{c}_t'(y) = \frac{c_t(y + \delta) - c_t(y - \delta)}{2\delta} = \frac{1}{\delta} \mathbb{E}_{u \in \{-1,1\}}[c_t(y + \delta u)\, u].$$

Proof Cont.
For $d > 1$, Stokes' theorem gives
$$\nabla \int_{\delta\mathbb{B}} c_t(y + v)\, dv = \int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du.$$
Normalizing each side by the corresponding volume:
$$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla \frac{\int_{\delta\mathbb{B}} c_t(y + v)\, dv}{\mathrm{Vol}_d(\delta\mathbb{B})} = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \frac{\int_{\delta\mathbb{S}} c_t(y + u)\, \frac{u}{\|u\|}\, du}{\mathrm{Vol}_{d-1}(\delta\mathbb{S})},$$
$$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla\, \mathbb{E}_{v \in \delta\mathbb{B}}[c_t(y + v)] = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \delta\mathbb{S}}\left[c_t(y + u)\, \frac{u}{\|u\|}\right],$$
$$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla\, \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)] = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u].$$

Proof Cont.
Since $\hat{c}_t(y) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y + \delta v)]$, the last equation becomes
$$\mathrm{Vol}_d(\delta\mathbb{B}) \cdot \nabla \hat{c}_t(y) = \mathrm{Vol}_{d-1}(\delta\mathbb{S}) \cdot \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u],$$
so
$$\nabla \hat{c}_t(y) = \frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})}\, \mathbb{E}_{u \in \mathbb{S}}[c_t(y + \delta u)\, u].$$
The following fact concludes the proof:
$$\frac{\mathrm{Vol}_{d-1}(\delta\mathbb{S})}{\mathrm{Vol}_d(\delta\mathbb{B})} = \frac{d}{\delta}.$$
For example, in $\mathbb{R}^2$: $\frac{2\pi\delta}{\pi\delta^2} = \frac{2}{\delta}$.

Part 2 Regret Bound for Estimated Gradients

Zinkevich's Theorem
Let $h_1, \ldots, h_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions.
Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta \nabla h_t(y_t))$.
Let $G = \max_t \|\nabla h_t(y_t)\|$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$$\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y) \le RG\sqrt{T}.$$
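A minimal sketch of the projected gradient descent update in the theorem (assuming, for concreteness, that the feasible set is a ball of radius $R$ so the projection has a closed form; the toy loss sequence is an illustrative choice, not from the talk):

```python
import numpy as np

def project_onto_ball(x, radius):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def online_gradient_descent(grad_fns, d, R, G):
    """Plays y_1 = 0 and y_{t+1} = P_S(y_t - eta * grad h_t(y_t)), eta = R/(G*sqrt(T)),
    with S taken here to be the ball of radius R."""
    T = len(grad_fns)
    eta = R / (G * np.sqrt(T))
    y = np.zeros(d)
    plays = []
    for grad in grad_fns:
        plays.append(y.copy())
        y = project_onto_ball(y - eta * grad(y), R)
    return plays

# Toy run: h_t(y) = ||y - z_t||^2 for random targets z_t, so grad h_t(y) = 2(y - z_t).
rng = np.random.default_rng(0)
targets = [rng.uniform(-0.5, 0.5, size=2) for _ in range(100)]
grads = [lambda y, z=z: 2 * (y - z) for z in targets]
plays = online_gradient_descent(grads, d=2, R=1.0, G=4.0)
print(plays[-1], np.mean(targets, axis=0))   # the iterate drifts toward the average target
```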

Expected Zinkevich's Theorem
Let $c_1, \ldots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions.
Let $g_1, \ldots, g_T$ be random vectors such that $\mathbb{E}[g_t \mid y_t] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$).
Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$$\mathbb{E}\left[\sum_{t=1}^T c_t(y_t)\right] - \sum_{t=1}^T c_t(y) \le RG\sqrt{T}.$$

Proof
Recall $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$.
Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly those of regular gradient descent on the functions $h_t$. From Zinkevich's Theorem:
$$\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y) \le RG\sqrt{T}. \qquad (1)$$

Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}[g_t \mid y_t] = \nabla c_t(y_t)$.
Notice that $\mathbb{E}[\xi_t \mid y_t] = \mathbb{E}[g_t - \nabla c_t(y_t) \mid y_t] = \mathbb{E}[g_t \mid y_t] - \nabla c_t(y_t) = 0$. Therefore
$$\mathbb{E}[y_t^T \xi_t] = \mathbb{E}\big[\mathbb{E}[y_t^T \xi_t \mid y_t]\big] = \mathbb{E}\big[y_t^T\, \mathbb{E}[\xi_t \mid y_t]\big] = 0, \qquad \mathbb{E}[y^T \xi_t] = y^T \mathbb{E}[\xi_t] = y^T\, \mathbb{E}\big[\mathbb{E}[\xi_t \mid y_t]\big] = 0.$$
We get the following connections:
$$\mathbb{E}[h_t(y)] = \mathbb{E}[c_t(y)] + \mathbb{E}[y^T \xi_t] = c_t(y), \qquad (2)$$
$$\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)] + \mathbb{E}[y_t^T \xi_t] = \mathbb{E}[c_t(y_t)]. \qquad (3)$$

Proof Cont.
Recall (1) $\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y) \le RG\sqrt{T}$, (2) $\mathbb{E}[h_t(y)] = c_t(y)$ and (3) $\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)]$. Combining them:
$$\mathbb{E}\left[\sum_{t=1}^T c_t(y_t)\right] - \sum_{t=1}^T c_t(y) \overset{(3)}{=} \sum_{t=1}^T \mathbb{E}[h_t(y_t)] - \sum_{t=1}^T c_t(y) \overset{(2)}{=} \mathbb{E}\left[\sum_{t=1}^T h_t(y_t)\right] - \sum_{t=1}^T \mathbb{E}[h_t(y)] = \mathbb{E}\left[\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y)\right] \overset{(1)}{\le} RG\sqrt{T}.$$

Part 3 BGD Algorithm

Ideal World Algorithm
Here $g_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$, so $\mathbb{E}[g_t \mid y_t] = \nabla \hat{c}_t(y_t)$.
$y_1 \leftarrow 0$
For $t \in \{1, \ldots, T\}$:
  Select a unit vector $u_t$ uniformly at random.
  Play $y_t$ and observe the cost $c_t(y_t)$.
  Compute $g_t$ using $u_t$.
  $y_{t+1} \leftarrow P_S(y_t - \eta g_t)$
To compute $g_t$ we actually need $c_t(y_t + \delta u_t)$, so we must play $x_t = y_t + \delta u_t$ instead. But now we have two problems: is $x_t \in S$? And the regret is measured on $c_t(x_t)$, although we are running estimated gradient descent on $\hat{c}_t$ at the points $y_t$.

Bandit Gradient Descent Algorithm (BGD)
Parameters: $\eta > 0$, $\delta > 0$, $0 < \alpha < 1$.
$y_1 \leftarrow 0$
For $t \in \{1, \ldots, T\}$:
  Select a unit vector $u_t$ uniformly at random.
  $x_t \leftarrow y_t + \delta u_t$
  Play $x_t$ and observe the cost $c_t(x_t) = c_t(y_t + \delta u_t)$.
  $g_t \leftarrow \frac{d}{\delta} c_t(x_t)\, u_t = \frac{d}{\delta} c_t(y_t + \delta u_t)\, u_t$
  $y_{t+1} \leftarrow P_{(1-\alpha)S}(y_t - \eta g_t)$
Compared to the ideal world algorithm, two things remain to check: is $x_t \in S$? And we have low regret for $\hat{c}_t(y_t)$ in $(1-\alpha)S$, which we need to convert to low regret for $c_t(x_t)$ in $S$.
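A minimal runnable sketch of BGD (again assuming $S$ is a Euclidean ball of radius $R$; the quadratic cost sequence and the parameter values are illustrative choices, not the tuned parameters of the theorem below):

```python
import numpy as np

rng = np.random.default_rng(0)

def project_onto_ball(x, radius):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def bgd(cost_fns, d, R, eta, delta, alpha):
    """Bandit Gradient Descent: play x_t = y_t + delta*u_t, form the one-point
    gradient estimate g_t = (d/delta)*c_t(x_t)*u_t, and step y inside (1-alpha)S."""
    y = np.zeros(d)
    total_cost = 0.0
    for c in cost_fns:
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                  # uniform unit vector
        x = y + delta * u                       # the point we actually play
        cost = c(x)                             # the only feedback we get
        total_cost += cost
        g = (d / delta) * cost * u              # one-point gradient estimate
        y = project_onto_ball(y - eta * g, (1 - alpha) * R)
    return total_cost

# Toy run: c_t(x) = ||x - z_t||^2 with slowly drifting targets z_t inside S.
T, d, R = 2000, 2, 1.0
targets = np.cumsum(rng.normal(scale=0.01, size=(T, d)), axis=0)
costs = [lambda x, z=z: float(np.dot(x - z, x - z)) for z in targets]
print(bgd(costs, d=d, R=R, eta=0.01, delta=0.1, alpha=0.1))
```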

Observation 1
For any $x \in S$:
$$\sum_{t=1}^T c_t((1-\alpha)x) \le \sum_{t=1}^T c_t(x) + 2\alpha CT.$$
Proof. From convexity,
$$c_t((1-\alpha)x) = c_t(\alpha \cdot 0 + (1-\alpha)x) \le \alpha c_t(0) + (1-\alpha)c_t(x) = c_t(x) + \alpha(c_t(0) - c_t(x)) \le c_t(x) + 2\alpha C,$$
and summing over $t$ gives the claim.

Observation 2
For any $y \in (1-\alpha)S$ and any $x \in S$:
$$|c_t(x) - c_t(y)| \le \frac{2C}{\alpha r}\|y - x\|.$$
Proof. Denote $\Delta = x - y$. If $\|\Delta\| \ge \alpha r$, the observation follows from $2C \le \frac{2C}{\alpha r}\|y - x\|$ (the costs lie in $[-C, C]$).
Otherwise $\|\Delta\| < \alpha r$. Let $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$. Then $z \in S$, since $y \in (1-\alpha)S$ and $\alpha r \frac{\Delta}{\|\Delta\|} \in \alpha r\mathbb{B} \subseteq \alpha S$, so $z \in (1-\alpha)S + \alpha S \subseteq S$.

Proof Cont.
Recall $\Delta = x - y$ and $z = y + \alpha r \frac{\Delta}{\|\Delta\|}$. Notice that $x = \frac{\|\Delta\|}{\alpha r} z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y$, so from convexity
$$c_t(x) = c_t\left(\frac{\|\Delta\|}{\alpha r} z + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) y\right) \le \frac{\|\Delta\|}{\alpha r} c_t(z) + \left(1 - \frac{\|\Delta\|}{\alpha r}\right) c_t(y) = c_t(y) + \frac{c_t(z) - c_t(y)}{\alpha r}\|\Delta\| \le c_t(y) + \frac{2C}{\alpha r}\|\Delta\|.$$
The other direction is proved in the same way.

BGD Regret Theorem
For any $T \ge \left(\frac{3Rd}{2r}\right)^2$ and for the following parameters
$$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = \sqrt[3]{\frac{rR^2d^2}{12T}}, \qquad \alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}},$$
BGD achieves, for every $x \in S$, the regret bound
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(x) \le 3C\, T^{5/6} \sqrt[3]{\frac{dR}{r}}.$$

Proof
Recall $g_t = \frac{d}{\delta} c_t(x_t)\, u_t$, $x_t = y_t + \delta u_t$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
First we need to show that $x_t \in S$. Notice that $(1-\alpha)S + \alpha r\mathbb{B} \subseteq (1-\alpha)S + \alpha S \subseteq S$. Since $y_t \in (1-\alpha)S$, we just need to show that $\delta \le \alpha r$:
$$\delta = \sqrt[3]{\frac{rR^2d^2}{12T}}, \qquad \alpha r = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}\cdot r = \sqrt[3]{\frac{3Rr^2d}{2\sqrt{T}}}.$$
This inequality holds because $T \ge \left(\frac{3Rd}{2r}\right)^2$.

Proof Cont.
Recall $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y_t + \delta v)]$, $\nabla \hat{c}_t(y_t) = \frac{d}{\delta}\mathbb{E}_{u \in \mathbb{S}}[c_t(y_t + \delta u)\, u]$, $g_t = \frac{d}{\delta} c_t(x_t)\, u_t$, $x_t = y_t + \delta u_t$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
Now we want to bound the regret. We have
$$\mathbb{E}[g_t \mid y_t] = \nabla \hat{c}_t(y_t), \qquad \|g_t\| = \left\|\frac{d}{\delta} c_t(x_t)\, u_t\right\| \le \frac{dC}{\delta} =: G.$$
The Expected Zinkevich Theorem, applied to the smoothed functions $\hat{c}_t$, says that for every $y \in (1-\alpha)S$ and $\eta = \frac{R}{G\sqrt{T}} = \frac{\delta R}{dC\sqrt{T}}$:
$$\mathbb{E}\left[\sum_{t=1}^T \hat{c}_t(y_t)\right] - \sum_{t=1}^T \hat{c}_t(y) \le RG\sqrt{T} = \frac{RdC\sqrt{T}}{\delta}. \qquad (1)$$

Proof Cont.
Recall Observation 2: $|c_t(x) - c_t(y)| \le \frac{2C}{\alpha r}\|y - x\|$ for $y \in (1-\alpha)S$ and $x \in S$, and recall $\hat{c}_t(y_t) = \mathbb{E}_{v \in \mathbb{B}}[c_t(y_t + \delta v)]$.
From Observation 2 we get (both $y_t + \delta v$ and $x_t = y_t + \delta u_t$ are within distance $\delta$ of $y_t$):
$$|\hat{c}_t(y_t) - c_t(x_t)| \le |\hat{c}_t(y_t) - c_t(y_t)| + |c_t(y_t) - c_t(x_t)| \le 2\cdot\frac{2C}{\alpha r}\delta, \qquad |\hat{c}_t(y) - c_t(y)| \le \frac{2C}{\alpha r}\delta.$$
Now we get, for $y \in (1-\alpha)S$:
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(y) \le \mathbb{E}\left[\sum_{t=1}^T \left(\hat{c}_t(y_t) + 2\cdot\frac{2C}{\alpha r}\delta\right)\right] - \sum_{t=1}^T \left(\hat{c}_t(y) - \frac{2C}{\alpha r}\delta\right)$$

Proof Cont.
Recall (1): $\mathbb{E}\left[\sum_{t=1}^T \hat{c}_t(y_t)\right] - \sum_{t=1}^T \hat{c}_t(y) \le \frac{RdC\sqrt{T}}{\delta}$. Continuing the chain from the previous slide:
$$= \mathbb{E}\left[\sum_{t=1}^T \hat{c}_t(y_t)\right] - \sum_{t=1}^T \hat{c}_t(y) + 3T\cdot\frac{2C}{\alpha r}\delta \overset{(1)}{\le} \frac{RdC\sqrt{T}}{\delta} + 3T\cdot\frac{2C}{\alpha r}\delta. \qquad (2)$$

Proof Cont.
Finally, take $y = (1-\alpha)x$ for some $x \in S$, so we can use Observation 1, $\sum_{t=1}^T c_t((1-\alpha)x) \le \sum_{t=1}^T c_t(x) + 2\alpha CT$, together with (2):
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(x) \le \mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t((1-\alpha)x) + 2\alpha CT = \mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(y) + 2\alpha CT \le \frac{RdC\sqrt{T}}{\delta} + \frac{6C}{\alpha r}\delta T + 2\alpha CT.$$
Substituting the parameters $\delta = \sqrt[3]{\frac{rR^2d^2}{12T}}$ and $\alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{T}}}$ finishes the proof.
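To see where the $T^{5/6}$ rate comes from, note how each of the three terms scales with $T$ after the substitution (a small bookkeeping step added here for completeness; only the dependence on $T$ is tracked, using $\delta \propto T^{-1/3}$ and $\alpha \propto T^{-1/6}$):
$$\frac{RdC\sqrt{T}}{\delta} \propto \frac{T^{1/2}}{T^{-1/3}} = T^{5/6}, \qquad \frac{6C}{\alpha r}\delta T \propto \frac{T^{-1/3}}{T^{-1/6}}\cdot T = T^{5/6}, \qquad 2\alpha CT \propto T^{-1/6}\cdot T = T^{5/6}.$$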

BGD with Lipschitz Regret Theorem
If all $c_t$ are $L$-Lipschitz, then for $T$ sufficiently large and the parameters
$$\eta = \frac{\delta R}{dC\sqrt{T}}, \qquad \delta = T^{-1/4}\sqrt{\frac{RdCr}{3(Lr + C)}}, \qquad \alpha = \frac{\delta}{r},$$
BGD achieves, for every $x \in S$, the regret bound
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(x) \le 2T^{3/4}\sqrt{3RdC\left(L + \frac{C}{r}\right)}.$$

Part 4 Reshaping

Removing the Dependence on 1/r
There are algorithms that, given a convex set $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$, find an affine transformation $T$ that puts $S$ in near-isotropic position, and run in time $O\left(d^4\,\mathrm{polylog}\left(d, \frac{R}{r}\right)\right)$.
A set $T(S) \subseteq \mathbb{R}^d$ is in isotropic position if the covariance matrix of a uniform random sample from $T(S)$ is the identity matrix.
This gives us $\mathbb{B} \subseteq T(S) \subseteq d\mathbb{B}$, so the new parameters are $R = d$ and $r = 1$. Also, if $c_t$ is $L$-Lipschitz then $c_t \circ T^{-1}$ is $LR$-Lipschitz.
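A minimal sketch of the whitening idea behind the isotropic-position step (this is just the covariance-based affine map applied to samples from an example ellipse; it is not the rounding algorithm referred to above, and the sampling scheme is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximately uniform samples from S: here an axis-aligned ellipse with r*B ⊆ S ⊆ R*B.
axes = np.array([5.0, 0.5])                               # R = 5, r = 0.5
raw = rng.uniform(-1, 1, size=(200_000, 2))
samples = raw[np.linalg.norm(raw, axis=1) <= 1] * axes    # uniform in the ellipse

mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)

# Affine map T(x) = A(x - mu) with A = cov^{-1/2}: the transformed samples have
# (approximately) identity covariance, i.e. T(S) is (near) isotropic.
eigvals, eigvecs = np.linalg.eigh(cov)
A = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
transformed = (samples - mu) @ A.T
print(np.cov(transformed, rowvar=False))                  # approximately the 2x2 identity
```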

Removing the Dependence on 1/r
So if we first put $S$ in near-isotropic position, we get the regret bound
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(x) \le 6T^{3/4} d\left(\sqrt{CLR} + C\right),$$
and without the Lipschitz condition
$$\mathbb{E}\left[\sum_{t=1}^T c_t(x_t)\right] - \sum_{t=1}^T c_t(x) \le 6T^{5/6} dC.$$
(Compare with the bounds before reshaping: $2T^{3/4}\sqrt{3RdC(L + \frac{C}{r})}$ and $3C\,T^{5/6}\sqrt[3]{\frac{dR}{r}}$.)

Part 5 Adaptive Adversary

Expected Adaptive Zinkevich's Theorem
Let $c_1, \ldots, c_T : (1-\alpha)S \to \mathbb{R}$ be convex, differentiable functions, where $c_t$ may depend on $y_1, \ldots, y_{t-1}$.
Let $g_1, \ldots, g_T$ be random vectors such that $\mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] = \nabla c_t(y_t)$ and $\|g_t\| \le G$ (which also implies $\|\nabla c_t(y_t)\| \le G$).
Let $y_1, \ldots, y_T \in (1-\alpha)S$ be defined by $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
Then for $\eta = \frac{R}{G\sqrt{T}}$ and for every $y \in (1-\alpha)S$:
$$\mathbb{E}\left[\sum_{t=1}^T c_t(y_t) - \sum_{t=1}^T c_t(y)\right] \le 3RG\sqrt{T}.$$

Proof
Recall $y_1 = 0$ and $y_{t+1} = P_{(1-\alpha)S}(y_t - \eta g_t)$.
Define $h_t : (1-\alpha)S \to \mathbb{R}$ by $h_t(y) = c_t(y) + y^T \xi_t$, where $\xi_t = g_t - \nabla c_t(y_t)$.
Notice that $\nabla h_t(y_t) = \nabla c_t(y_t) + \xi_t = g_t$, so our updates are exactly those of regular gradient descent on the functions $h_t$. From Zinkevich's Theorem:
$$\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y) \le RG\sqrt{T}. \qquad (1)$$

Proof Cont.
Recall $h_t(y) = c_t(y) + y^T \xi_t$, $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] = \nabla c_t(y_t)$.
Notice that
$$\mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t] = \mathbb{E}[g_t - \nabla c_t(y_t) \mid y_1, c_1, \ldots, y_t, c_t] = \mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] - \nabla c_t(y_t) = 0,$$
$$\mathbb{E}[y_t^T \xi_t] = \mathbb{E}\big[\mathbb{E}[y_t^T \xi_t \mid y_1, c_1, \ldots, y_t, c_t]\big] = \mathbb{E}\big[y_t^T\, \mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t]\big] = 0.$$
We get the following connection between $\mathbb{E}[h_t(y_t)]$ and $\mathbb{E}[c_t(y_t)]$:
$$\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)] + \mathbb{E}[y_t^T \xi_t] = \mathbb{E}[c_t(y_t)]. \qquad (3)$$

Proof Cont.
Recall $\xi_t = g_t - \nabla c_t(y_t)$ and $\mathbb{E}[g_t \mid y_1, c_1, \ldots, y_t, c_t] = \nabla c_t(y_t)$, so $\|\xi_t\| \le \|g_t\| + \|\nabla c_t(y_t)\| \le 2G$.
For every $1 \le s < t \le T$ we have
$$\mathbb{E}[\xi_s^T \xi_t] = \mathbb{E}\big[\mathbb{E}[\xi_s^T \xi_t \mid y_1, c_1, \ldots, y_t, c_t]\big].$$
Given $y_1, c_1, \ldots, y_t, c_t$ we know $g_s$ (and hence also $\xi_s$), therefore
$$\mathbb{E}[\xi_s^T \xi_t] = \mathbb{E}\big[\xi_s^T\, \mathbb{E}[\xi_t \mid y_1, c_1, \ldots, y_t, c_t]\big] = 0.$$
We use this to get
$$\left(\mathbb{E}\left\|\sum_{t=1}^T \xi_t\right\|\right)^2 \le \mathbb{E}\left\|\sum_{t=1}^T \xi_t\right\|^2 = \sum_{t=1}^T \mathbb{E}\|\xi_t\|^2 + 2\sum_{1 \le s < t \le T} \mathbb{E}[\xi_s^T \xi_t] = \sum_{t=1}^T \mathbb{E}\|\xi_t\|^2 \le \sum_{t=1}^T \mathbb{E}[(2G)^2] = 4TG^2.$$

Proof Cont.
Now we connect $\mathbb{E}[h_t(y)]$ and $\mathbb{E}[c_t(y)]$, using $h_t(y) = c_t(y) + y^T \xi_t$, $\mathbb{E}\left\|\sum_{t=1}^T \xi_t\right\| \le 2G\sqrt{T}$ and $S \subseteq R\mathbb{B}$:
$$\left|\mathbb{E}\left[\sum_{t=1}^T h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^T c_t(y)\right]\right| \le \mathbb{E}\left|\sum_{t=1}^T \big(h_t(y) - c_t(y)\big)\right| = \mathbb{E}\left|y^T \sum_{t=1}^T \xi_t\right| \le \mathbb{E}\left[\|y\|\left\|\sum_{t=1}^T \xi_t\right\|\right] \le R\,\mathbb{E}\left\|\sum_{t=1}^T \xi_t\right\| \le 2RG\sqrt{T}. \qquad (2)$$

Proof Cont.
Recall (1) $\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y) \le RG\sqrt{T}$, (2) $\left|\mathbb{E}\left[\sum_{t=1}^T h_t(y)\right] - \mathbb{E}\left[\sum_{t=1}^T c_t(y)\right]\right| \le 2RG\sqrt{T}$ and (3) $\mathbb{E}[h_t(y_t)] = \mathbb{E}[c_t(y_t)]$. Combining them:
$$\mathbb{E}\left[\sum_{t=1}^T c_t(y_t) - \sum_{t=1}^T c_t(y)\right] = \sum_{t=1}^T \mathbb{E}[c_t(y_t)] - \mathbb{E}\left[\sum_{t=1}^T c_t(y)\right] \overset{(3)}{=} \sum_{t=1}^T \mathbb{E}[h_t(y_t)] - \mathbb{E}\left[\sum_{t=1}^T c_t(y)\right] \overset{(2)}{\le} \mathbb{E}\left[\sum_{t=1}^T h_t(y_t)\right] - \mathbb{E}\left[\sum_{t=1}^T h_t(y)\right] + 2RG\sqrt{T} = \mathbb{E}\left[\sum_{t=1}^T h_t(y_t) - \sum_{t=1}^T h_t(y)\right] + 2RG\sqrt{T} \overset{(1)}{\le} 3RG\sqrt{T}.$$