Efficient Sequential Decision-Making in Structured Problems
Adam Tauman Kalai
Georgia Institute of Technology / Weizmann Institute / Toyota Technological Institute

BANDITS AND REGRET

[Figure: slot machines played over time.]

TIME-AVG REGRET = AVG REWARD OF BEST DECISION − AVG REWARD = 8 − 5 = 3

TWO APPROACHES

Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine. Thm: maximize (discounted) expected reward by pulling the arm of largest Gittins index.

Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]: Thm: for any sequence of [0,1] costs on $N$ machines, their algorithm achieves expected regret of $O(\sqrt{TN\log N})$.
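To make the nonstochastic guarantee concrete, here is a minimal sketch in the spirit of Exp3; the exploration rate and the update step are illustrative choices of ours, not the tuned constants of the paper:

```python
import numpy as np

def exp3(costs, gamma=0.1, rng=np.random.default_rng(0)):
    """Nonstochastic bandit sketch: costs is a (T, N) array of [0,1] costs;
    only the cost of the arm actually pulled is revealed to the learner."""
    T, N = costs.shape
    weights = np.ones(N)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / N
        arm = rng.choice(N, p=probs)
        c = costs[t, arm]                      # the only observed entry
        total += c
        weights[arm] *= np.exp(-gamma * (c / probs[arm]) / N)  # importance-weighted update
    best = costs.sum(axis=0).min()             # best fixed arm in hindsight
    return total - best                        # total (not averaged) regret

rng = np.random.default_rng(1)
print(exp3(rng.random((1000, 3))))             # grows roughly like sqrt(T N log N)
```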

STRUCTURED COMB-OPT

[Figure: a table of routes and times (25 min, 17 min, 44 min) and a clustering example with errors.]

Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification.
Problems not included: portfolio selection (nonlinear), online sudoku.

STRUCTURED COMB-OPT

Known decision set $S$.
Known LINEAR cost function $c: S \times [0,1]^d \to [0,1]$.
Unknown $w_1, w_2, \ldots, w_T \in [0,1]^d$.

On period $t = 1, 2, \ldots, T$:
  Alg. picks $s_t \in S$.
  Alg. pays and finds out $c(s_t, w_t)$.

REGRET $= \frac{1}{T}\sum_{t \le T} c(s_t, w_t) \;-\; \min_{s \in S} \frac{1}{T}\sum_{t \le T} c(s, w_t)$
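A small harness may help fix the notation. This is our own sketch, using the $\bar{s} \cdot w$ linear-embedding view introduced later in the talk; the names `decisions` and `policy` are ours:

```python
import numpy as np

def run_protocol(decisions, ws, policy):
    """decisions: (m, d) array of embedded decisions s_bar; ws: (T, d) hidden
    cost vectors; policy(t, history) -> index of the decision to play."""
    alg_cost, history = 0.0, []
    T = len(ws)
    for t in range(T):
        i = policy(t, history)
        cost = decisions[i] @ ws[t]            # c(s_t, w_t) = s_bar . w_t
        alg_cost += cost
        history.append((i, cost))              # only this scalar is revealed
    best = (decisions @ ws.sum(axis=0)).min()  # best fixed decision in hindsight
    return alg_cost / T - best / T             # time-averaged regret
```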

MAIN POINTS

Offline optimization oracle $M: [0,1]^d \to S$, $M(w) = \arg\min_{s \in S} c(s,w)$, e.g. shortest path. Easier than sequential decision-making!?

EXPLORATION: automatically find an exploration basis using $M$.
LOW REGRET: dimension matters more than # decisions.
EFFICIENCY: online algorithm uses offline black-box optimizer $M$.

MAIN RESULT

An algorithm that achieves: for any set $S$, any linear $c: S \times [0,1]^d \to [0,1]$, any $T \ge 1$, and any sequence $w_1, \ldots, w_T \in [0,1]^d$,

$\mathbb{E}[\text{regret of alg}] \le 15\, d\, T^{-1/3}$

Each update requires linear time and calls the offline optimizer $M$ with probability $O(d\,T^{-1/3})$. [AK04, MB04, DH06]

EXPLORE vs EXPLOIT

Find a good exploration basis using $M$.
On period $t = 1, 2, \ldots, T$:
- Explore, with probability $\gamma$: play $s_t :=$ a random element of the exploration basis; estimate $v_t$ from the observed cost.
- Exploit, with probability $1-\gamma$: play $s_t := M(\sum_{i<t} v_i + p)$, where $p$ is a random perturbation [Hannan57]; set $v_t := 0$.

Key property: $\mathbb{E}[v_t] = w_t$.
$\mathbb{E}[\text{calls to } M] = \gamma$ per period (the exploit choice only changes after an explore step). [AK04, MB04]
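A runnable sketch of this loop, under a simplifying assumption of ours: the exploration basis is the standard coordinate basis, so exploring coordinate $j$ observes $w_t[j]$ directly (the general algorithm routes the estimate back through the barycentric spanner). The parameters $\gamma$, $\eta$, and the perturbation scale are illustrative:

```python
import numpy as np

def bandit_fpl(decisions, ws, gamma=0.1, eta=0.1, rng=np.random.default_rng(0)):
    """Sketch of the explore/exploit scheme. decisions: (m, d) array of
    embedded decisions s_bar, whose first d rows are assumed to be the
    standard basis e_1..e_d (our simplification); ws: hidden (T, d) costs."""
    T, d = ws.shape
    v_sum = np.zeros(d)                    # running sum of estimates v_1..v_{t-1}
    p = rng.random(d) / eta                # one-shot random perturbation [Hannan57]
    total = 0.0
    for t in range(T):
        if rng.random() < gamma:           # EXPLORE
            j = rng.integers(d)
            cost = decisions[j] @ ws[t]    # = w_t[j] under our assumption
            v_t = np.zeros(d)
            v_t[j] = cost * d / gamma      # unbiased: E[v_t] = w_t
        else:                              # EXPLOIT
            i = int(np.argmin(decisions @ (v_sum + p)))  # offline oracle M
            cost = decisions[i] @ ws[t]
            v_t = np.zeros(d)
        total += cost
        v_sum += v_t
    return total / T                       # time-averaged cost
```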

REMAINDER OF TALK

EXPLORATION
- Good exploration basis: definition
- Finding one
EXPLOITATION
- Perturbation (randomized regularization)
- Stability analysis
OTHER DIRECTIONS
- Approximation algorithms
- Convex problems

EXPLORATION

GOING TO d DIMENSIONS

Linear cost function $c: S \times [0,1]^d \to [0,1]$.
Mapping $S \to [0,1]^d$: $\bar{s} = \big(c(s,(1,0,\ldots,0)),\, c(s,(0,1,\ldots,0)),\, \ldots,\, c(s,(0,\ldots,0,1))\big)$, so that $c(s,w) = \bar{s} \cdot w$.
$\bar{S} = \{\, \bar{s} \mid s \in S \,\}$, $K = \text{convex-hull}(\bar{S})$. WLOG $\dim(\bar{S}) = d$.

EXPLORATION BASIS

Def: $b_1, b_2, \ldots, b_d \in S$ is a 2-barycentric spanner if, for every $s \in S$, $\bar{s} = \sum_i \lambda_i \bar{b}_i$ for some $\lambda_1, \lambda_2, \ldots, \lambda_d \in [-2, 2]$.

Possible to find an exploration basis efficiently using the offline optimizer $M(w) = \arg\min_{s \in S} c(s,w)$. [AK04]

[Figure: $K$ with a bad (nearly collinear) basis vs. a good basis.]

EXPLORATION BASIS

Def: $b_1, b_2, \ldots, b_d \in S$ is a C-barycentric spanner if, for every $s \in S$, $\bar{s} = \sum_i \lambda_i \bar{b}_i$ for some $\lambda_1, \lambda_2, \ldots, \lambda_d \in [-C, C]$.

$\det(\bar{b}_1 \ldots \bar{b}_{i-1},\; \lambda_1 \bar{b}_1 + \cdots + \lambda_d \bar{b}_d,\; \bar{b}_{i+1} \ldots \bar{b}_d) = \lambda_i \det(\bar{b}_1 \ldots \bar{b}_d)$
$\Rightarrow \arg\max_{b_1, \ldots, b_d \in S} |\det(\bar{b}_1, \ldots, \bar{b}_d)|$ is a 1-barycentric spanner. [AK04]

EXPLORATION BASIS

Alg: start with any $b_1, \ldots, b_d \in S$ whose embeddings span $\mathbb{R}^d$; repeat: if for some $i$ there is an $s \in S$ with
$|\det(\bar{b}_1 \ldots \bar{b}_{i-1},\, \bar{s},\, \bar{b}_{i+1} \ldots \bar{b}_d)| > 2\,|\det(\bar{b}_1, \ldots, \bar{b}_d)|$,
replace $b_i$ with $s$. Since the determinant is linear in the swapped column, each search is a linear optimization in the direction $w$ of its cofactor vector and takes two calls to $M$ (one per sign); each swap doubles $|\det|$, so the algorithm halts after polynomially many swaps with a 2-barycentric spanner. [AK04]
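A brute-force sketch of the swap procedure, with the oracle $M$ replaced by an explicit scan over an (assumed finite) embedded decision set; the scan maximizes the same linear function of $\bar{s}$ that $M$ would:

```python
import numpy as np

def barycentric_spanner(S_bar, C=2.0):
    """S_bar: (m, d) array of embedded decisions, assumed to span R^d.
    Returns indices of a C-approximate barycentric spanner by repeated swaps;
    each swap grows |det| by a factor > C and |det| is bounded, so it halts."""
    m, d = S_bar.shape
    B, idx = S_bar[:d].copy(), list(range(d))  # any starting set; rank is
    changed = True                             # repaired below since C*0 = 0
    while changed:
        changed = False
        for i in range(d):
            base = abs(np.linalg.det(B))
            for j in range(m):
                trial = B.copy()
                trial[i] = S_bar[j]
                if abs(np.linalg.det(trial)) > C * base:
                    B, idx[i] = trial, j       # swap in the better decision
                    changed = True
                    break
    return idx
```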

EXPLOITATION

EXPLORE vs EXPLOIT

Find a good exploration basis using $M$.
On period $t = 1, 2, \ldots, T$:
- Explore, with probability $\gamma$: play $s_t :=$ a random element of the exploration basis; estimate $v_t$ from the observed cost.
- Exploit, with probability $1-\gamma$: play $s_t := M(\sum_{i<t} v_i + p)$, where $p$ is a random perturbation [Hannan57]; set $v_t := 0$.

Key property: $\mathbb{E}[v_t] = w_t$.
$\mathbb{E}[\text{calls to } M] = \gamma$ per period. [AK04, MB04]

INSTABILITY

Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$.

Natural idea: use $z_{t-1}$ on period $t$? REGRET = 1! E.g., with two decisions whose costs alternate $\tfrac{1}{2}, 0, 1, 0, 1, \ldots$ and $0, 1, 0, 1, 0, \ldots$, the leader switches every period and pays 1 every time.
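A tiny numeric check of this instability, using the alternating costs above (our concrete instantiation of the slide's $\tfrac{1}{2}$ example): follow-the-leader pays 1 every period after the first, while either fixed decision averages about $\tfrac{1}{2}$:

```python
import numpy as np

T = 1000
costs = np.zeros((T, 2))
costs[0, 0] = 0.5          # decision 0 costs: 0.5, 0, 1, 0, 1, ...
costs[1::2, 1] = 1.0       # decision 1 costs: 0,   1, 0, 1, 0, ...
costs[2::2, 0] = 1.0

cum, alg = np.zeros(2), 0.0
for t in range(T):
    leader = int(np.argmin(cum))   # follow the leader: z_{t-1}
    alg += costs[t, leader]
    cum += costs[t]
print(alg / T, cum.min() / T)      # ~1.0 vs ~0.5: the leader is always wrong
```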

STABILITY ANALYSIS [KV03]

Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$.

Lemma: the regret of using $z_t$ on period $t$ is $\le 0$.
Proof:
$\min_{s \in S}\; c(s, w_1) + c(s, w_2) + \cdots + c(s, w_T)$
$= c(z_T, w_1) + \cdots + c(z_T, w_{T-1}) + c(z_T, w_T)$
$\ge c(z_{T-1}, w_1) + \cdots + c(z_{T-1}, w_{T-1}) + c(z_T, w_T)$
$\ge \cdots \ge c(z_1, w_1) + c(z_2, w_2) + \cdots + c(z_T, w_T)$,
where each step uses that $z_t$ minimizes the sum of the first $t$ costs.

STABILITY ANALYSIS [KV03]

Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$.
Lemma: the regret of using $z_t$ on period $t$ is $\le 0$
$\Rightarrow$ regret of $z_{t-1}$ on $t$ is $\le \sum_{t \le T} c(z_{t-1}, w_t) - c(z_t, w_t)$.

Idea: regularize to achieve stability. Let $y_t = M(\sum_{i \le t} w_i + p)$, for a random $p \in [0,1]^d$ (suitably scaled):
$\mathbb{E}[\text{regret of } y_{t-1} \text{ on } t] \le \sum_{t \le T} \mathbb{E}[c(y_{t-1}, w_t) - c(y_t, w_t)] + (\text{a penalty bounded by the scale of } p)$

Strange: randomized regularization! And $y_t$ can be computed using $M$.
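Rerunning the same adversarial costs with Hannan-style perturbation shows the effect of the randomized regularization; the perturbation scale of 10 is an arbitrary illustrative choice:

```python
import numpy as np

# Same adversarial costs as before, but follow the *perturbed* leader
# [Hannan57, KV03]: add one random offset p to the cumulative costs.
T = 1000
costs = np.zeros((T, 2))
costs[0, 0] = 0.5
costs[1::2, 1] = 1.0
costs[2::2, 0] = 1.0

rng = np.random.default_rng(0)
avg, trials = 0.0, 100
for _ in range(trials):
    p = rng.random(2) * 10.0                       # illustrative scale
    cum, alg = np.zeros(2), 0.0
    for t in range(T):
        alg += costs[t, int(np.argmin(cum + p))]   # y_{t-1} = M(sum w_i + p)
        cum += costs[t]
    avg += alg / T
print(avg / trials)    # ~0.5: randomized regularization restores stability
```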

OTHER DIRECTIONS

BANDIT CONVEX OPT.

Convex feasible set $S \subseteq \mathbb{R}^d$.
Unknown sequence of concave functions $f_1, \ldots, f_T: S \to [0,1]$.
On period $t = 1, 2, \ldots, T$:
- Algorithm chooses $x_t \in S$.
- Algorithm pays and finds out $f_t(x_t)$.

Thm: $\forall$ concave $f_1, f_2, \ldots : S \to [0,1]$ and $\forall\, T_0, T \ge 1$, the bacterial ascent algorithm achieves vanishing expected average regret (the corresponding bound in [FKM05] is $O(T^{-1/4})$ average, i.e. $O(T^{3/4})$ total, regret).
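The algorithmic idea can be sketched with the one-point gradient estimate of [FKM05] (the talk's bacterial ascent is a variant; this sketch is not the exact algorithm): play $x + \delta u$ for a random unit vector $u$, and use $(d/\delta)\, f(x + \delta u)\, u$ as an estimate of the gradient of a smoothed version of $f$. The step sizes and the box projection below are illustrative choices:

```python
import numpy as np

def bandit_ascent(f, x0, T, delta=0.1, eta=0.01, rng=np.random.default_rng(0)):
    """Concave maximization from bandit feedback via a one-point gradient
    estimate; a real version projects onto (a shrunk copy of) the set S."""
    x = np.asarray(x0, dtype=float)
    d = x.size
    for t in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                    # random unit direction
        y = x + delta * u                         # the point actually played
        g_hat = (d / delta) * f(y) * u            # one-point gradient estimate
        x = np.clip(x + eta * g_hat, 0.0, 1.0)    # ascent step, kept in [0,1]^d
    return x

# Example: a concave "profit vs. advertising" curve peaking at x = 0.6;
# the iterate ends up near 0.6, up to the exploration noise.
print(bandit_ascent(lambda x: 1 - (x[0] - 0.6) ** 2, [0.1], T=5000))
```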

MOTIVATING EXAMPLE

A company has to decide how much to advertise among $d$ channels, within budget. Feedback is total profit, affected by external factors.

[Figure: concave profit curves $f_1, f_2, f_3, f_4$ of advertising spend, with plays $x_1, \ldots, x_4$ approaching the optimum $x^*$.]

BACTERIAL ASCENT

[Figure sequence over three slides: within $S$, alternating EXPLORE and EXPLOIT steps move the play from $x_0$ to $x_1$, $x_2$, $x_3$.]

APPROXIMATION ALGs

What if offline optimization is NP-hard? Example: repeated traveling salesman problem.
Suppose you have an $\alpha$-approximation algorithm $A$: $c(A(w), w) \le \alpha \min_{s \in S} c(s, w)$ for all $w \in [0,1]^d$.
Would like to achieve low $\alpha$-regret = our cost $- \ \alpha \cdot$ (min cost of best $s \in S$).
Possible using the convex-optimization approach above and transformations of approximation algorithms. [KKL07]

CONCLUSIONS

Can extend bandit algorithms to structured problems, with worst-case low-regret guarantees:
- Linear combinatorial optimization problems
- Convex optimization

Remarks:
- Works against adaptive adversaries as well
- Online efficiency = offline efficiency
- Can handle approximation algorithms: can achieve cost $\le (1+\epsilon) \cdot$ min cost $+\, O(1/\epsilon)$