1
Efficient Sequential Decision-Making in Structured Problems
Adam Tauman Kalai
Georgia Institute of Technology / Weizmann Institute / Toyota Technological Institute / National Institute of Corrections
2
BANDITS AND REGRET
[table: per-period rewards of several slot machines over time, with each machine's average; exact entries garbled in extraction]
REGRET = AVG REWARD OF BEST DECISION − AVG REWARD = 8 − 5 = 3
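A minimal sketch of the regret arithmetic on this slide; the reward table below is illustrative (the slide's exact entries were lost in extraction), but the computation is exactly the definition above:

```python
# Per-period rewards for three hypothetical slot machines (made-up numbers).
rewards = [
    [8, 8, 8, 6, 9],   # machine 0
    [1, 9, 5, 1, 1],   # machine 1
    [3, 4, 3, 2, 4],   # machine 2
]
# Suppose the algorithm's pulls earned these rewards:
alg_rewards = [8, 1, 5, 2, 9]

best_avg = max(sum(r) / len(r) for r in rewards)   # avg of best machine
alg_avg = sum(alg_rewards) / len(alg_rewards)      # algorithm's avg
print("regret =", best_avg - alg_avg)
```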
3
TWO APPROACHES
Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine.
Thm: Maximize (discounted) expected reward by pulling the arm of largest Gittins index.
Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]:
Thm: For any sequence of [0,1] costs on N machines, their algorithm achieves expected average regret of O(√(N log N / T)).
4
STRUCTURED COMB-OPT
[example tables: Route, Time (25 min, 17 min, 44 min); Clustering, Errors (40, 55, 19)]
Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification.
Problems not included: portfolio selection (nonlinear), online sudoku.
5
STRUCTURED COMB-OPT
Known decision set S.
Known LINEAR cost function c: S × [0,1]^d → [0,1].
Unknown w_1, w_2, … ∈ [0,1]^d.
On period t = 1, 2, …, T:
Alg. picks s_t ∈ S.
Alg. pays and finds out c(s_t, w_t).
REGRET = (1/T) ∑_{t≤T} c(s_t, w_t) − min_{s∈S} (1/T) ∑_{t≤T} c(s, w_t)
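A minimal sketch of this protocol and the regret bookkeeping. The 3-decision set (given by rows of cost profiles), the random adversary, and the trivial learner that always plays decision 0 are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
S_bar = np.array([[0.2, 0.9, 0.5],
                  [0.7, 0.1, 0.4],
                  [0.5, 0.5, 0.5]])    # row s is decision s; c(s, w) = S_bar[s] @ w
T, d = 200, 3
cost, cum_w = 0.0, np.zeros(d)
for t in range(T):
    w_t = rng.uniform(0, 1, size=d)    # unknown to the learner in advance
    s_t = 0                            # the learner's pick for period t
    cost += S_bar[s_t] @ w_t           # pays and observes c(s_t, w_t)
    cum_w += w_t                       # (kept only to evaluate regret)
regret = cost / T - (S_bar @ cum_w).min() / T
print(f"avg regret: {regret:.3f}")
```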
6
MAIN POINTS
Offline optimization M: [0,1]^d → S, M(w) = argmin_{s∈S} c(s,w), e.g. shortest path. Easier than sequential decision-making!?
EXPLORATION: automatically find an exploration basis using M.
LOW REGRET: dimension matters more than # of decisions.
EFFICIENCY: online algorithm uses offline black-box optimizer M.
7
MAIN RESULT
An algorithm that achieves: for any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w_1, …, w_T ∈ [0,1]^d,
E[regret of alg] ≤ 15 d T^(−1/3)
Each update requires linear time and calls the offline optimizer M with probability O(d T^(−1/3)). [AK04, MB04, DH06]
8
EXPLORE vs EXPLOIT
Find a good exploration basis using M.
On period t = 1, 2, …, T:
Explore with probability ε:
  Play s_t := a random element of the exploration basis
  Estimate v_t somehow
Exploit with probability 1−ε:
  Play s_t := M(∑_{i<t} v_i + p)    (p = random perturbation [Hannan57])
  v_t := 0
Key property: E[v_t] = w_t. E[# calls to M] = εT.
[AK04, MB04]
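A hedged sketch of this explore/exploit loop. The decision set S_bar, the adversary, the tuning of ε and of the perturbation scale, and the shortcut of taking coordinate-wise optimizers as the exploration basis (a stand-in for a true barycentric spanner, assumed to give an invertible matrix) are all illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 3000
S_bar = np.array([[0.9, 0.1, 0.2],      # cost profile of each decision
                  [0.1, 0.8, 0.3],
                  [0.2, 0.2, 0.9],
                  [0.5, 0.5, 0.5],
                  [0.3, 0.7, 0.6]])

def M(w):                                # offline optimizer (brute force here)
    return int(np.argmin(S_bar @ w))

basis = [M(-np.eye(d)[i]) for i in range(d)]  # crude exploration basis
B = S_bar[basis]                              # rows: basis cost profiles
Binv = np.linalg.inv(B)                       # assumed invertible

eps = T ** (-1 / 3)
p = rng.uniform(0, 1 / eps, size=d)           # Hannan-style perturbation
v_sum, W_sum, cost = np.zeros(d), np.zeros(d), 0.0
for t in range(T):
    w_t = rng.uniform(0, 1, size=d)           # adversary's hidden vector
    if rng.random() < eps:                    # EXPLORE
        j = rng.integers(d)
        s_t = basis[j]
        loss = float(S_bar[s_t] @ w_t)        # the only feedback received
        v_hat = Binv @ ((d / eps) * loss * np.eye(d)[j])  # E[v_hat] = w_t
    else:                                     # EXPLOIT
        s_t = M(v_sum + p)
        loss, v_hat = float(S_bar[s_t] @ w_t), np.zeros(d)
    v_sum += v_hat
    W_sum += w_t                              # for regret evaluation only
    cost += loss
print("avg regret:", cost / T - (S_bar @ W_sum).min() / T)  # small for large T
```

The estimator is unbiased because exploring with probability ε, picking basis element j with probability 1/d, and rescaling by d/ε makes E[v_hat] = B⁻¹(B w_t) = w_t, matching the key property on the slide.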
9
REMAINDER OF TALK
EXPLORATION: good exploration basis, definition; finding one.
EXPLOITATION: perturbation (randomized regularization); stability analysis.
OTHER DIRECTIONS: approximation algorithms; convex problems.
10
EXPLORATION
11
GOING TO d DIMENSIONS
Linear cost function c: S × [0,1]^d → [0,1].
Mapping S → [0,1]^d: s̄ = ( c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1)) )
c(s,w) = s̄ · w
S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄). WLOG dim(S̄) = d.
12
EXPLORATION BASIS
Def: b_1, b_2, …, b_d ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = ∑_i λ_i b̄_i for some λ_1, λ_2, …, λ_d ∈ [−2, 2].
Possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s,w). [AK04]
13
EXPLORATION BASIS
Def: b_1, b_2, …, b_d ∈ S is a C-barycentric spanner if, for every s ∈ S, s̄ = ∑_i λ_i b̄_i for some λ_1, λ_2, …, λ_d ∈ [−C, C].
Det(b̄_1 … b̄_{i−1}, λ_1 b̄_1 + … + λ_d b̄_d, b̄_{i+1} … b̄_d) = λ_i Det(b̄_1 … b̄_d)
⇒ argmax_{b_1,…,b_d ∈ S} |Det(b̄_1, …, b̄_d)| is a 1-BS. [AK04]
14
EXPLORATION BASIS
Alg: Repeat: for each i, let w be the direction such that Det(b̄_1 … b̄_{i−1}, x, b̄_{i+1} … b̄_d) = w · x (the determinant is linear in its i-th entry); use M(w) and M(−w) to find the s maximizing |w · s̄|; if |w · s̄| > 2 |Det(b̄_1, …, b̄_d)|, replace b_i by s. Each swap at least doubles |Det|, so the loop terminates with a 2-barycentric spanner. [AK04]
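A hedged sketch of this spanner search over an explicit finite decision set (rows of S_bar), which is an illustrative assumption; the point is that the loop only touches S through the linear optimizer M:

```python
import numpy as np

rng = np.random.default_rng(2)
S_bar = rng.uniform(0, 1, size=(50, 4))    # 50 decisions' cost profiles
d = S_bar.shape[1]

def M(w):                                   # offline linear optimizer
    return int(np.argmin(S_bar @ w))

def det_with(B, i, x):                      # Det(b_1..b_{i-1}, x, b_{i+1}..b_d)
    B2 = B.copy()
    B2[i] = x
    return np.linalg.det(B2)

idx = list(range(d))                        # start from any d decisions whose
B = S_bar[idx].copy()                       # profiles are linearly independent
C, changed = 2.0, True
while changed:
    changed = False
    for i in range(d):
        # Det(..., x, ...) is linear in x: recover its coefficient vector w
        w = np.array([det_with(B, i, e) for e in np.eye(d)])
        cand = max(M(w), M(-w), key=lambda s: abs(S_bar[s] @ w))
        if abs(S_bar[cand] @ w) > C * abs(np.linalg.det(B)) + 1e-12:
            idx[i], B[i] = cand, S_bar[cand]  # swap in: |Det| at least doubles
            changed = True
print("spanner rows:", idx)
```

Termination follows because |Det| grows by a factor of at least C = 2 per swap and is bounded above; on exit no decision can beat any b_i by more than a factor 2, which is exactly the 2-barycentric-spanner condition via the determinant identity on the previous slide.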
15
EXPLOITATION
16
EXPLORE vs EXPLOIT (recap)
Find a good exploration basis using M.
On period t = 1, 2, …, T:
Explore with probability ε:
  Play s_t := a random element of the exploration basis
  Estimate v_t somehow
Exploit with probability 1−ε:
  Play s_t := M(∑_{i<t} v_i + p)    (p = random perturbation [Hannan57])
  v_t := 0
Key property: E[v_t] = w_t. E[# calls to M] = εT.
[AK04, MB04]
17
INSTABILITY
Define z_t = M(∑_{i≤t} w_i) = argmin_{s∈S} ∑_{i≤t} c(s, w_i).
Natural idea: use z_{t−1} on period t? Regret does not vanish!
Example (d = 2): costs w_t = (½,0), (0,1), (1,0), (0,1), (1,0), … The leader flips every period, so the algorithm pays cost 1 on every period after the first, while the best fixed decision averages about ½.
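A tiny simulation of this instability example. Following the leader switches decisions every period, straight into the expensive one:

```python
import numpy as np

T = 1000
# Alternating costs from the slide: (1/2, 0), (0, 1), (1, 0), (0, 1), ...
w = [np.array([0.5, 0.0])] + [np.array([0.0, 1.0]) if t % 2 else
                              np.array([1.0, 0.0]) for t in range(1, T)]
cum, ftl_cost = np.zeros(2), 0.0
for t, w_t in enumerate(w):
    if t > 0:                        # play the leader of periods < t
        ftl_cost += w_t[np.argmin(cum)]
    cum += w_t
print("FTL avg cost:", ftl_cost / T)        # ~1.0
print("best fixed avg cost:", cum.min() / T)  # ~0.5
```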
18
STABILITY ANALYSIS [KV03]
Define z_t = M(∑_{i≤t} w_i) = argmin_{s∈S} ∑_{i≤t} c(s, w_i).
Lemma: Regret of using z_t on period t is ≤ 0.
Proof: min_{s∈S} c(s,w_1) + c(s,w_2) + … + c(s,w_T)
= c(z_T,w_1) + … + c(z_T,w_{T−1}) + c(z_T,w_T)
≥ c(z_{T−1},w_1) + … + c(z_{T−1},w_{T−1}) + c(z_T,w_T)
≥ … ≥ c(z_1,w_1) + c(z_2,w_2) + … + c(z_T,w_T). ∎
19
STABILITY ANALYSIS [KV03]
Define z_t = M(∑_{i≤t} w_i) = argmin_{s∈S} ∑_{i≤t} c(s, w_i).
Lemma: Regret of using z_t on period t is ≤ 0.
⇒ Regret of z_{t−1} on t ≤ ∑_{t≤T} [ c(z_{t−1},w_t) − c(z_t,w_t) ]
Idea: regularize to achieve stability.
Let y_t = M(∑_{i≤t} w_i + p), for random p ∈ [0,1]^d (suitably scaled).
E[Regret of y_{t−1} on t] ≤ ∑_{t≤T} E[ c(y_{t−1},w_t) − c(y_t,w_t) ] + a penalty for the perturbation.
Strange: randomized regularization! y_t can be computed using M.
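The same alternating sequence, but following the *perturbed* leader. The perturbation scale 1/ε with ε = 1/√T is an assumed tuning (the slide's constants were lost); a single random offset p, drawn once, stabilizes the argmin:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
w = [np.array([0.5, 0.0])] + [np.array([0.0, 1.0]) if t % 2 else
                              np.array([1.0, 0.0]) for t in range(1, T)]
fpl = []
for trial in range(20):                     # average over draws of p
    p = rng.uniform(0, np.sqrt(T), size=2)  # scale 1/eps, eps = 1/sqrt(T)
    cum, cost = np.zeros(2), 0.0
    for t, w_t in enumerate(w):
        if t > 0:
            cost += w_t[np.argmin(cum + p)]  # perturbed leader y_{t-1}
        cum += w_t
    fpl.append(cost / T)
print("FPL avg cost:", np.mean(fpl))        # ~0.5, vs FTL's ~1.0
```

With high probability the offset p dominates the bounded oscillation of the cumulative costs, so the perturbed leader almost never switches; that is the stability the analysis above quantifies.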
20
OTHER DIRECTIONS
21
BANDIT CONVEX OPT.
Convex feasible set S ⊆ R^d. Unknown sequence of concave functions f_1, …, f_T: S → [0,1].
On period t = 1, 2, …, T:
Algorithm chooses x_t ∈ S.
Algorithm pays and finds out f_t(x_t).
Thm: ∀ concave f_1, f_2, …: S → [0,1], ∀ T_0, T ≥ 1, the bacterial ascent algorithm achieves: [regret bound omitted]
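The deck's "bacterial ascent" is illustrated in the figures on the next slides; as a related sketch in the same one-point feedback model, here is the Flaxman-Kalai-McMahan style gradient estimator, where evaluating f at a single random nearby point gives an unbiased estimate of the gradient of a smoothed f. The payoff function, step sizes, and feasible set are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, delta, eta = 2, 20000, 0.2, 3e-4
x = np.zeros(d)                               # start at an interior point

def f(x):                                     # unknown concave payoff
    return 1.0 - ((x - 0.3) ** 2).sum()       # illustrative; peak at 0.3

for t in range(T):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                    # uniform direction on sphere
    payoff = f(x + delta * u)                 # single bandit evaluation
    g = (d / delta) * payoff * u              # one-point gradient estimate
    x = np.clip(x + eta * g, -0.8, 0.8)       # ascent step, stay inside S
print("final x:", x)                          # drifts toward (0.3, 0.3)
```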
22
MOTIVATING EXAMPLE
A company has to decide how much to advertise among d channels, within a budget. Feedback is total profit, affected by external factors.
[figure: profit-vs-advertising curves f_1, f_2, f_3, f_4, the chosen points x_1, …, x_4, and the optimum x*]
23
BACTERIAL ASCENT
[figures over three slides: the iterate moves x_0 → x_1 → x_2 → x_3 inside S, alternating EXPLORE and EXPLOIT steps]
26
APPROXIMATION ALGs
What if offline optimization is NP-hard? Example: repeated traveling salesman problem.
Suppose you have an α-approximation algorithm A: c(A(w), w) ≤ α · min_{s∈S} c(s,w) for all w ∈ [0,1]^d.
Would like to achieve low α-regret = our cost − α · (min cost of best s ∈ S).
Possible using the convex optimization approach above and transformations of approximation algorithms. [KKL07]
27
CONCLUSIONS
Can extend bandit algorithms to structured problems:
Guarantee worst-case low regret
Linear combinatorial optimization problems
Convex optimization
Remarks:
Works against adaptive adversaries as well
Online efficiency = offline efficiency
Can handle approximation algorithms: can achieve cost ≤ (1+ε) · min cost + O(1/ε)