Presentation on theme: "Efficient Sequential Decision-Making in Structured Problems Adam Tauman Kalai Georgia Institute of Technology Weizmann Institute Toyota Technological Institute."— Presentation transcript:

1 Efficient Sequential Decision-Making in Structured Problems Adam Tauman Kalai Georgia Institute of Technology Weizmann Institute Toyota Technological Institute National Institute of Corrections

2 BANDITS AND REGRET. [table: per-period rewards of several slot machines over time, with each machine's average reward; the best machine averages 8, the algorithm averages 5] REGRET = AVG REWARD OF BEST DECISION − AVG REWARD = 8 − 5 = 3.

3 TWO APPROACHES. Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine. Thm: maximize (discounted) expected reward by pulling the arm of largest Gittins index. Nonstochastic setting [Auer, Cesa-Bianchi, Freund, Schapire 95]: Thm: for any sequence of [0,1] costs on N machines, their algorithm achieves expected regret of O(√(N log N / T)).

4 STRUCTURED COMB-OPT. [example tables: Route / Time: 25 min, 17 min, 44 min; Clustering / Errors: 40, 55, 19] Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification. Problems not included: portfolio selection (nonlinear), online sudoku.

5 STRUCTURED COMB-OPT. Known decision set S. Known LINEAR cost function c: S × [0,1]^d → [0,1]. Unknown w_1, w_2, … ∈ [0,1]^d. On period t = 1, 2, …, T: Alg. picks s_t ∈ S. Alg. pays and finds out c(s_t, w_t). REGRET = (1/T) Σ_{t≤T} c(s_t, w_t) − min_{s∈S} (1/T) Σ_{t≤T} c(s, w_t).
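To make the regret definition on slide 5 concrete, here is a tiny helper (illustrative code, not from the talk; the name average_regret is mine) that computes average regret by brute force. Enumerating S is only sensible for toy instances; the structured setting assumes S is far too large for that, which is exactly why the offline optimizer M matters later.

```python
def average_regret(c, decisions_played, S, ws):
    """Average regret as on slide 5: the algorithm's average cost minus the
    average cost of the best single decision in hindsight.

    c                : known linear cost function, c(s, w) in [0, 1]
    decisions_played : the s_t actually played, one per period
    S                : the (small, toy) decision set, enumerated explicitly
    ws               : the cost vectors w_1 .. w_T
    """
    T = len(ws)
    avg_alg = sum(c(s_t, w_t) for s_t, w_t in zip(decisions_played, ws)) / T
    avg_best = min(sum(c(s, w_t) for w_t in ws) / T for s in S)
    return avg_alg - avg_best
```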

6 MAIN POINTS. Offline optimization M: [0,1]^d → S, M(w) = argmin_{s∈S} c(s,w), e.g. shortest path. Easier than sequential decision-making!? EXPLORATION: automatically find an exploration basis using M. LOW REGRET: dimension matters more than # decisions. EFFICIENCY: online algorithm uses offline black-box optimizer M.

7 MAIN RESULT. An algorithm that achieves: for any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w_1,…,w_T ∈ [0,1]^d, E[regret of alg] ≤ 15dT^{-1/3}. Each update requires linear time and calls the offline optimizer M with probability O(dT^{-1/3}). [AK04, MB04, DH06]

8 EXPLORE vs EXPLOIT. Find a good exploration basis using M. On period t = 1, 2, …, T: Explore with probability γ: play s_t := a random element of the exploration basis; estimate v_t somehow. Exploit with probability 1−γ: play s_t := M(Σ_{i<t} v_i + p); set v_t := 0. Here p is a random perturbation [Hannan57]. Key property: E[v_t] = w_t. E[calls to M] = γT. [AK04, MB04]
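A minimal simulation sketch of the explore/exploit scheme on slide 8, with assumed interfaces (M, c, basis, and the exploration rate gamma are illustrative names, not the talk's). The slide leaves "estimate v_t somehow" open; the estimator below, (d/γ) times the observed cost times a column of the inverted basis matrix, is one standard construction satisfying E[v_t] = w_t, and may differ from the talk's exact choice.

```python
import numpy as np

def bandit_fpl(basis, M, c, ws, gamma, rng=None):
    """Sketch of slide 8's explore/exploit loop (hypothetical interface).

    basis : exploration basis b_1..b_d (list of decisions)
    M     : offline optimizer, M(w) -> argmin over s in S of c(s, w)
    c     : known linear cost function c(s, w)
    ws    : hidden cost vectors w_1..w_T; only c(s_t, w_t) is revealed
    gamma : exploration probability
    """
    rng = rng or np.random.default_rng(0)
    d = len(basis)
    eye = np.eye(d)
    # Embedding bbar_j = (c(b_j, e_1), ..., c(b_j, e_d)); computable since c is known.
    B = np.array([[c(b, eye[k]) for k in range(d)] for b in basis])
    Binv = np.linalg.inv(B)              # used to build an unbiased estimate of w_t
    v_sum = np.zeros(d)                  # running sum v_1 + ... + v_{t-1}
    p = rng.uniform(0.0, 1.0, size=d)    # random perturbation [Hannan57]
    leader = M(v_sum + p)                # only recomputed when v_sum changes
    total = 0.0
    for w_t in ws:
        if rng.random() < gamma:                       # EXPLORE
            j = rng.integers(d)
            s_t = basis[j]
            loss = c(s_t, w_t)                         # the only feedback seen
            v_t = (d / gamma) * loss * Binv[:, j]      # unbiased: E[v_t] = w_t
            v_sum += v_t
            leader = M(v_sum + p)                      # re-solve after the estimate changes
        else:                                          # EXPLOIT
            s_t = leader                               # v_t = 0, leader unchanged
            loss = c(s_t, w_t)
        total += loss
    return total / len(ws)
```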

9 REMAINDER OF TALK. EXPLORATION: good exploration basis definition; finding one. EXPLOITATION: perturbation (randomized regularization); stability analysis. OTHER DIRECTIONS: approximation algorithms; convex problems.

10 EXPLORATION

11 GOING TO d-DIMENSIONS. Linear cost function c: S × [0,1]^d → [0,1]. Mapping S → [0,1]^d: s̄ = (c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1))), so that c(s,w) = s̄ · w. S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄). WLOG dim(S̄) = d.
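A toy illustration of this embedding, assuming routing as the structured problem (the graph, edge names, and delays below are made up for the example): each path s is identified with its 0/1 edge-indicator vector s̄, and the linear cost becomes a dot product with the vector of per-edge delays.

```python
import numpy as np

edges = ["a-b", "b-c", "a-c"]                 # d = 3 (hypothetical graph)
paths = {"a-b-c": np.array([1, 1, 0]),        # sbar for the path a -> b -> c
         "a-c":   np.array([0, 0, 1])}        # sbar for the direct edge a -> c

def c(sbar, w):
    # Linear cost: total delay of the path under per-edge delays w.
    return float(sbar @ w)

# The k-th coordinate of sbar is the cost against the k-th standard basis vector:
assert all(c(paths["a-b-c"], np.eye(3)[k]) == paths["a-b-c"][k] for k in range(3))

w_delays = np.array([0.2, 0.3, 0.7])
print(c(paths["a-b-c"], w_delays))            # 0.5 = total delay on a -> b -> c
```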

12 EXPLORATION BASIS. Def: an exploration basis b_1, b_2, …, b_d ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = Σ_i λ_i b̄_i for some λ_1, λ_2, …, λ_d ∈ [−2,2]. Possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s,w). [AK04] [figure: K with a good and a bad choice of basis]

13 EXPLORATION BASIS. Def: an exploration basis b_1, b_2, …, b_d ∈ S is a C-barycentric spanner if, for every s ∈ S, s̄ = Σ_i λ_i b̄_i for some λ_1, λ_2, …, λ_d ∈ [−C,C]. Det(b̄_1 … b̄_{i−1}, λ_1 b̄_1 + … + λ_d b̄_d, b̄_{i+1} … b̄_d) = λ_i Det(b̄_1 … b̄_d) ⇒ argmax_{b_1,…,b_d ∈ S} |Det(b̄_1,…,b̄_d)| is a 1-BS. [AK04]

14 EXPLORATION BASIS. Alg: repeat: let w be a direction such that Det(b̄_1 … b̄_{i−1}, λ_1 b̄_1 + … + λ_d b̄_d, b̄_{i+1} … b̄_d) = λ_i Det(b̄_1 … b̄_d) ⇒ argmax_{b_1,…,b_d ∈ S} |Det(b̄_1,…,b̄_d)| is a 1-BS. [AK04]
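The spanner-finding algorithm on slide 14 is only partially preserved in this transcript. Below is a hedged sketch in the spirit of the [AK04] construction: because the determinant is linear in any single column, the best replacement for one basis element is found by one or two calls to the offline optimizer, and columns are swapped whenever they grow |Det| by more than a factor C. The initialization, stopping rule, and the assumption that M accepts arbitrary real directions and returns embeddings are simplifications of this sketch, not claims about the talk's exact procedure.

```python
import numpy as np

def find_spanner(M, d, C=2.0):
    """Sketch of a C-approximate barycentric spanner search (AK04-style).

    M(w) is assumed to return the embedding sbar in R^d of argmin_{s in S} sbar . w,
    for arbitrary real directions w.  Assumes WLOG that the embedded set spans R^d.
    """

    def extreme(a):
        # det(X with column i replaced by v) is linear in v: it equals a . v,
        # so the largest |a . v| over S is attained at M(a) or M(-a).
        lo, hi = M(a), M(-a)
        return hi if abs(a @ hi) >= abs(a @ lo) else lo

    X = np.eye(d)                                   # placeholder columns, replaced below
    for i in range(d):                              # build an initial basis from S
        a = np.linalg.det(X) * np.linalg.inv(X)[i, :]   # linear functional for column i
        X[:, i] = extreme(a)
    changed = True
    while changed:                                  # swap until C-approximately optimal
        changed = False
        for i in range(d):
            a = np.linalg.det(X) * np.linalg.inv(X)[i, :]
            v = extreme(a)
            if abs(a @ v) > C * abs(a @ X[:, i]):   # strictly better column found: swap it in
                X[:, i] = v
                changed = True
    return [X[:, i].copy() for i in range(d)]
```

Each accepted swap multiplies |Det| by more than C > 1, and |Det| is bounded on a bounded set, so the loop terminates.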

15 EXPLOITATION

16 EXPLORE vs EXPLOIT. Find a good exploration basis using M. On period t = 1, 2, …, T: Explore with probability γ: play s_t := a random element of the exploration basis; estimate v_t somehow. Exploit with probability 1−γ: play s_t := M(Σ_{i<t} v_i + p); set v_t := 0. Here p is a random perturbation [Hannan57]. Key property: E[v_t] = w_t. E[calls to M] = γT. [AK04, MB04]

17 INSTABILITY. Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i). Natural idea: use z_{t−1} on period t? REGRET = 1! [cost table for two decisions: (½, 0), (0, 1), (1, 0), (0, 1), (1, 0), …]
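A tiny simulation of this instability, using my reading of the garbled cost table above (two decisions with alternating costs; the exact numbers are a reconstruction): follow-the-leader switches every period and pays close to 1 per period, while the best fixed decision pays about 1/2.

```python
import numpy as np

# Two decisions; per-period costs (1/2, 0), (0, 1), (1, 0), (0, 1), ...
costs = [np.array([0.5, 0.0])] + [
    np.array([0.0, 1.0]) if t % 2 == 0 else np.array([1.0, 0.0]) for t in range(20)
]

cum = np.zeros(2)
paid = 0.0
for w in costs:
    leader = int(np.argmin(cum))   # follow-the-leader: best decision on the past
    paid += w[leader]              # ...which is exactly the one that gets hit this period
    cum += w

best_fixed = min(sum(w[i] for w in costs) for i in range(2))
print(paid / len(costs), best_fixed / len(costs))   # roughly 1 per period vs. roughly 1/2
```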

18 STABILITY ANALYSIS [KV03]. Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i). Lemma: the regret of using z_t on period t is 0. Proof: min_{s∈S} c(s,w_1)+c(s,w_2)+…+c(s,w_T) = c(z_T,w_1)+…+c(z_T,w_{T−1})+c(z_T,w_T) ≥ c(z_{T−1},w_1)+…+c(z_{T−1},w_{T−1})+c(z_T,w_T) ≥ … ≥ c(z_1,w_1)+c(z_2,w_2)+…+c(z_T,w_T).

19 STABILITY ANALYSIS [KV03]. Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i). Lemma: regret of using z_t on period t is 0 ⇒ regret of using z_{t−1} on period t is ≤ Σ_{t≤T} [c(z_{t−1},w_t) − c(z_t,w_t)]. Idea: regularize to achieve stability. Let y_t = M(Σ_{i≤t} w_i + p), for random p ∈ [0,1]^d. E[regret of y_{t−1} on t] ≤ Σ_{t≤T} E[c(y_{t−1},w_t) − c(y_t,w_t)] + (perturbation term). Strange: randomized regularization! y_t can be computed using M.
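A minimal full-information sketch of the perturbed leader from slide 19: p is drawn once, uniformly from [0,1]^d exactly as written on the slide, and each period plays y_{t−1} = M(Σ_{i<t} w_i + p). The scaling of the perturbation needed for the actual regret bound is omitted here, and the interface names are assumptions of the sketch.

```python
import numpy as np

def follow_perturbed_leader(M, ws, rng=None):
    """Full-information perturbed leader: play M(sum of past w_i + p) each period.

    M  : offline optimizer, M(w) -> argmin over s in S of c(s, w)
    ws : the revealed cost vectors w_1 .. w_T (full information)
    """
    rng = rng or np.random.default_rng(0)
    d = len(ws[0])
    p = rng.uniform(0.0, 1.0, size=d)   # the randomized regularizer, drawn once
    cum = np.zeros(d)
    plays = []
    for w_t in ws:
        plays.append(M(cum + p))        # y_{t-1}: leader on the perturbed past
        cum += w_t                      # w_t is revealed after playing
    return plays
```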

20 OTHER DIRECTIONS

21 BANDIT CONVEX OPT. Convex feasible set S ⊆ R^d. Unknown sequence of concave functions f_1,…,f_T: S → [0,1]. On period t = 1,2,…,T: algorithm chooses x_t ∈ S; algorithm pays and finds out f_t(x_t). Thm: ∀ concave f_1, f_2, …: S → [0,1], ∀ T_0, T ≥ 1, the bacterial ascent algorithm achieves:
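For flavor, here is a hedged sketch of one standard approach to this bandit concave-maximization setting, single-point gradient-estimate ascent. It is not necessarily the talk's bacterial ascent algorithm; the step sizes, the projection helper, and all names are assumptions of the sketch.

```python
import numpy as np

def bandit_ascent(f_feedback, project, x0, T, delta=0.1, eta=0.01, rng=None):
    """Single-point gradient-estimate ascent under bandit feedback.

    f_feedback(t, x) : returns f_t(x), the only feedback available
    project(x)       : maps a point back into the feasible set S
    """
    rng = rng or np.random.default_rng(0)
    d = len(x0)
    x = np.array(x0, dtype=float)
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # random unit direction
        y = project(x + delta * u)          # EXPLORE: query a nearby point
        reward = f_feedback(t, y)           # bandit feedback: f_t at the queried point only
        g = (d / delta) * reward * u        # one-point estimate of a (smoothed) gradient
        x = project(x + eta * g)            # EXPLOIT: take a small ascent step
    return x
```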

22 MOTIVATING EXAMPLE. A company has to decide how much to advertise among d channels, within a budget. Feedback is total profit, affected by external factors. [figure: $PROFIT vs. $ADVERTISING curves f_1,…,f_4, queried at points x_1,…,x_4, with optimum x*]

23 BACTERIAL ASCENT [figure: feasible set S, an explore step around x_0 and an exploit step to x_1]

24 BACTERIAL ASCENT [figure: the same explore/exploit picture continued to x_2]

25 BACTERIAL ASCENT [figure: the same explore/exploit picture continued to x_3]

26 APPROXIMATION ALGs. What if offline optimization is NP-hard? Example: repeated traveling salesman problem. Suppose you have an α-approximation algorithm A: c(A(w),w) ≤ α · min_{s∈S} c(s,w) for all w ∈ [0,1]^d. Would like to achieve low α-regret = our cost − α · (min cost of best s ∈ S). Possible using the convex optimization approach above and transformations of approximation algorithms. [KKL07]

27 CONCLUSIONS. Can extend bandit algorithms to structured problems: guarantee worst-case low regret; linear combinatorial optimization problems; convex optimization. Remarks: works against adaptive adversaries as well; online efficiency = offline efficiency; can handle approximation algorithms; can achieve cost ≤ (1+ε) · min cost + O(1/ε).

