Efficient Sequential Decision-Making in Structured Problems Adam Tauman Kalai Georgia Institute of Technology Weizmann Institute Toyota Technological Institute National Institute of Corrections
BANDITS AND REGRET (figure axis: TIME) AVG REGRET = AVG REWARD OF BEST DECISION – AVG REWARD = 8 – 5 = 3
TWO APPROACHES Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine. Thm: maximize (discounted) expected reward by pulling the arm of largest Gittins index. Nonstochastic setting [Auer,Cesa-Bianchi,Freund,Schapire95]: Thm: for any sequence of [0,1] costs on N machines, their algorithm achieves expected regret of O(…)
STRUCTURED COMB-OPT Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification. Problems not included: portfolio selection (nonlinear), online sudoku. (figures: a Route/Time table with entries 25 min, 17 min, 44 min; a Clustering/Errors table)
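STRUCTURED COMB-OPT Known decision set $S$. Known LINEAR cost function $c: S \times [0,1]^d \to [0,1]$. Unknown $w_1, w_2, \ldots, w_T \in [0,1]^d$. On period $t = 1, 2, \ldots, T$: alg. picks $s_t \in S$; alg. pays and finds out $c(s_t, w_t)$. REGRET $= \frac{1}{T}\sum_{t=1}^{T} c(s_t, w_t) - \min_{s \in S} \frac{1}{T}\sum_{t=1}^{T} c(s, w_t)$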
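The protocol on this slide can be made concrete with a small sketch. It assumes regret is the per-period average (as the "AVG" wording on the BANDITS AND REGRET slide suggests) and that each decision is already written as a vector so that the cost is a dot product. The names `cost`, `average_regret` and the toy instance are hypothetical, chosen only for illustration.

```python
def cost(s, w):
    """Linear cost c(s, w) = s . w for equal-length sequences s and w."""
    return sum(si * wi for si, wi in zip(s, w))


def average_regret(decision_set, plays, weights):
    """Average cost paid minus the average cost of the best fixed decision."""
    T = len(weights)
    paid = sum(cost(s, w) for s, w in zip(plays, weights)) / T
    best = min(sum(cost(s, w) for w in weights) for s in decision_set) / T
    return paid - best


# Hypothetical toy instance: two decisions in d = 2 dimensions, T = 3 periods.
S = [(1, 0), (0, 1)]
ws = [(0.2, 0.8), (0.6, 0.4), (0.1, 0.9)]
plays = [(0, 1), (1, 0), (0, 1)]
print(average_regret(S, plays, ws))
```

Here the algorithm pays 0.8 + 0.6 + 0.9 while the best fixed decision (1, 0) pays 0.2 + 0.6 + 0.1, so the printed value is the difference of the two totals divided by T = 3.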
MAIN POINTS Offline optimization $M: [0,1]^d \to S$, $M(w) = \arg\min_{s \in S} c(s,w)$, e.g. shortest path. Easier than sequential decision-making!? EXPLORATION: automatically find exploration basis using $M$. LOW REGRET: dimension matters more than # decisions. EFFICIENCY: online algorithm uses offline black-box optimizer $M$.
MAIN RESULT An algorithm that achieves: for any set $S$, any linear $c: S \times [0,1]^d \to [0,1]$, any $T \ge 1$, and any sequence $w_1, \ldots, w_T \in [0,1]^d$, $E[\text{regret of alg}] \le 15\, d\, T^{-1/3}$. Each update requires linear time and calls the offline optimizer $M$ with probability $O(d T^{-1/3})$. [AK04,MB04,DH06]
EXPLORE vs EXPLOIT Find good exploration basis using $M$. On period $t = 1, 2, \ldots, T$: Explore with probability $\gamma$: play $s_t$ := a random element of the exploration basis; estimate $v_t$ somehow. Exploit with probability $1-\gamma$: play $s_t := M(\sum_{i<t} v_i + p)$, where $p$ is a random perturbation [Hannan57]; $v_t := 0$. Key property: $E[v_t] = w_t$. E[calls to M] = […]. [AK04, MB04]
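A minimal sketch of the explore/exploit loop on this slide, under simplifying assumptions that are not in the talk: the exploration basis is the standard basis e_1..e_d and lies in S (so the estimate needs no matrix inverse), the offline optimizer is brute force, and a fresh perturbation is drawn uniformly from [0, scale]^d on each exploit step. The names `explore_exploit`, `offline_opt`, `gamma` and `scale` are hypothetical.

```python
import random


def dot(s, w):
    return sum(si * wi for si, wi in zip(s, w))


def offline_opt(S, w):
    """Brute-force stand-in for M(w) = argmin over s in S of s . w."""
    return min(S, key=lambda s: dot(s, w))


def explore_exploit(S, costs, gamma, scale, rng):
    """Play one decision per period; only the scalar cost paid is observed.

    Assumption: the exploration basis is e_1..e_d and every e_j is in S.
    """
    d = len(costs[0])
    basis = [tuple(1 if k == j else 0 for k in range(d)) for j in range(d)]
    v_sum = [0.0] * d              # running sum of the estimates v_i, i < t
    plays, paid = [], []
    for w in costs:                # w is hidden from the learner
        if rng.random() < gamma:   # explore
            j = rng.randrange(d)
            s = basis[j]
            loss = dot(s, w)       # the only feedback received
            # e_j is explored with probability gamma / d, so scaling the
            # observed cost by d / gamma makes E[v_t] equal to w_t.
            v_sum[j] += (d / gamma) * loss
        else:                      # exploit on perturbed estimates
            p = [rng.uniform(0, scale) for _ in range(d)]
            s = offline_opt(S, [v + q for v, q in zip(v_sum, p)])
            loss = dot(s, w)       # v_t := 0, so v_sum is unchanged
        plays.append(s)
        paid.append(loss)
    return plays, paid
```

With a general exploration basis b_1..b_d the same idea applies, but the observed cost would have to be mapped back through the inverse of the basis matrix to keep the estimate unbiased.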
REMAINDER OF TALK EXPLORATION: good exploration basis definition; finding one. EXPLOITATION: perturbation (randomized regularization); stability analysis. OTHER DIRECTIONS: approximation algorithms; convex problems.
EXPLORATION
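GOING TO d-DIMENSIONS Linear cost function $c: S \times [0,1]^d \to [0,1]$. Mapping $S \to [0,1]^d$: $\mathbf{s} = (c(s,(1,0,\ldots,0)),\ c(s,(0,1,\ldots,0)),\ \ldots,\ c(s,(0,\ldots,0,1)))$, so $c(s,w) = \mathbf{s} \cdot w$. $\mathbf{S} = \{\mathbf{s} \mid s \in S\}$, $K = \text{convex-hull}(\mathbf{S})$. WLOG $\dim(\mathbf{S}) = d$. (figure: K)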
EXPLORATION BASIS Def: an exploration basis $b_1, b_2, \ldots, b_d \in S$ is a 2-barycentric spanner if, for every $s \in S$, $\mathbf{s} = \sum_i \lambda_i \mathbf{b}_i$ for some $\lambda_1, \lambda_2, \ldots, \lambda_d \in [-2,2]$. It is possible to find an exploration basis efficiently using the offline optimizer $M(w) = \arg\min_{s \in S} c(s,w)$ [AK04]. (figure: a bad basis and a good basis inside K)
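EXPLORATION BASIS Def: an exploration basis $b_1, b_2, \ldots, b_d \in S$ is a C-barycentric spanner if, for every $s \in S$, $\mathbf{s} = \sum_i \lambda_i \mathbf{b}_i$ for some $\lambda_1, \lambda_2, \ldots, \lambda_d \in [-C,C]$. $\det(\mathbf{b}_1, \ldots, \mathbf{b}_{i-1},\ \lambda_1 \mathbf{b}_1 + \ldots + \lambda_d \mathbf{b}_d,\ \mathbf{b}_{i+1}, \ldots, \mathbf{b}_d) = \lambda_i \det(\mathbf{b}_1, \ldots, \mathbf{b}_d)$ $\Rightarrow$ $\arg\max_{b_1,\ldots,b_d \in S} |\det(\mathbf{b}_1, \ldots, \mathbf{b}_d)|$ is a 1-BS [AK04].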
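EXPLORATION BASIS Alg: Repeat: let $w$ be a direction such that […]. $\det(\mathbf{b}_1, \ldots, \mathbf{b}_{i-1},\ \lambda_1 \mathbf{b}_1 + \ldots + \lambda_d \mathbf{b}_d,\ \mathbf{b}_{i+1}, \ldots, \mathbf{b}_d) = \lambda_i \det(\mathbf{b}_1, \ldots, \mathbf{b}_d)$ $\Rightarrow$ $\arg\max_{b_1,\ldots,b_d \in S} |\det(\mathbf{b}_1, \ldots, \mathbf{b}_d)|$ is a 1-BS [AK04].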
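The determinant identity suggests a swap-based search: the determinant is linear in the replaced vector, so the best replacement for b_i can be found with two calls to a linear optimizer. The sketch below is written in the spirit of [AK04] and is not taken from the slides. It assumes the oracle `M` accepts arbitrary real direction vectors (the talk defines M only on [0,1]^d), that S spans d dimensions, and that C > 1. The names `barycentric_spanner`, `direction`, `best_replacement` and `replace` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def det(rows):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in r] for r in rows]
    n, result = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[piv][c]) < 1e-12:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def replace(B, i, x):
    """Copy of B with row i replaced by x."""
    return B[:i] + [list(x)] + B[i + 1:]


def direction(B, i):
    """Vector w with det(replace(B, i, x)) == w . x for every x (cofactors)."""
    d = len(B)
    return [det(replace(B, i, [1.0 if k == j else 0.0 for k in range(d)]))
            for j in range(d)]


def best_replacement(M, B, i):
    """Element of S maximizing |det| when it replaces b_i: two oracle calls."""
    w = direction(B, i)
    candidates = [M(w), M([-x for x in w])]   # argmin and argmax of w . s
    return max(candidates, key=lambda x: abs(dot(w, x)))


def barycentric_spanner(M, d, C=2.0):
    # Start from the standard basis; these placeholders need not lie in S.
    B = [[1.0 if k == j else 0.0 for k in range(d)] for j in range(d)]
    for i in range(d):                 # phase 1: move every b_i into S
        B[i] = list(best_replacement(M, B, i))
    swapped = True
    while swapped:                     # phase 2: swap while |det| grows by > C
        swapped = False
        base = abs(det(B))
        for i in range(d):
            x = best_replacement(M, B, i)
            if abs(det(replace(B, i, x))) > C * base:
                B[i] = list(x)
                swapped = True
                break
    return B
```

When the loop stops, no element of S can increase |det| by more than a factor C in any position, which by the identity above means every coefficient lies in [-C, C]. Each accepted swap multiplies |det| by more than C, which is why C > 1 is needed for termination.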
EXPLOITATION
INSTABILITY Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$. Natural idea: use $z_{t-1}$ on period $t$? REGRET=1! ½
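A small sketch of why following the previous leader can be unstable. The instance is a standard textbook-style construction and an assumption here, not necessarily the example drawn on the slide: two decisions whose costs alternate so that the current leader is always the wrong choice for the next period. The names `follow_the_leader` and `alternating_costs` are hypothetical.

```python
def dot(s, w):
    return sum(si * wi for si, wi in zip(s, w))


def leader(S, total):
    """Brute-force stand-in for M: the decision with least cumulative cost."""
    return min(S, key=lambda s: dot(s, total))


def follow_the_leader(S, costs):
    """Average regret of playing z_{t-1} on period t."""
    total = [0.0] * len(costs[0])
    paid = 0.0
    for w in costs:
        z_prev = leader(S, total)          # computed before w is revealed
        paid += dot(z_prev, w)
        total = [a + b for a, b in zip(total, w)]
    best = dot(leader(S, total), total)    # best fixed decision in hindsight
    return (paid - best) / len(costs)


def alternating_costs(T):
    """First period breaks the tie, then the costly decision alternates."""
    rest = [(0.0, 1.0) if t % 2 == 0 else (1.0, 0.0) for t in range(2, T + 1)]
    return [(0.5, 0.0)] + rest


S = [(1, 0), (0, 1)]
print(follow_the_leader(S, alternating_costs(100)))
```

On this instance the leader pays 1 on every period after the first, while either fixed decision pays about 1/2 per period, so the average regret stays near 1/2 however large T is.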
STABILITY ANALYSIS [KV03] Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$. Lemma: regret of using $z_t$ on period $t$ is $\le 0$. Proof: $\min_{s \in S} [c(s,w_1) + c(s,w_2) + \ldots + c(s,w_T)] = c(z_T,w_1) + \ldots + c(z_T,w_{T-1}) + c(z_T,w_T) \ge c(z_{T-1},w_1) + \ldots + c(z_{T-1},w_{T-1}) + c(z_T,w_T) \ge \ldots \ge c(z_1,w_1) + c(z_2,w_2) + \ldots + c(z_T,w_T)$
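The lemma can be checked numerically on small instances. The sketch below assumes decisions are given as vectors with cost equal to a dot product and uses a brute-force argmin in place of M. It plays z_t (the leader including period t) on period t and compares the total with the best fixed decision. The name `be_the_leader_gap` is hypothetical.

```python
import random


def dot(s, w):
    return sum(si * wi for si, wi in zip(s, w))


def leader(S, total):
    """Brute-force stand-in for M: the decision with least cumulative cost."""
    return min(S, key=lambda s: dot(s, total))


def be_the_leader_gap(S, costs):
    """Total cost of playing z_t on period t minus the best fixed total."""
    total = [0.0] * len(costs[0])
    paid = 0.0
    for w in costs:
        total = [a + b for a, b in zip(total, w)]
        z_t = leader(S, total)             # uses w_t, so not playable online
        paid += dot(z_t, w)
    best = dot(leader(S, total), total)
    return paid - best


rng = random.Random(0)
S = [tuple(rng.random() for _ in range(3)) for _ in range(5)]
costs = [tuple(rng.random() for _ in range(3)) for _ in range(50)]
print(be_the_leader_gap(S, costs) <= 1e-9)
```

The gap is never positive, matching the chain of inequalities in the proof. The hypothetical strategy is not an online algorithm, since z_t depends on w_t, which is why the next step bounds the difference between z_{t-1} and z_t.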
STABILITY ANALYSIS [KV03] Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$. Lemma: regret of using $z_t$ on period $t$ is $\le 0$ $\Rightarrow$ regret of using $z_{t-1}$ on period $t$ is $\le \sum_{t \le T} [c(z_{t-1},w_t) - c(z_t,w_t)]$. Idea: regularize to achieve stability. Let $y_t = M(\sum_{i \le t} w_i + p)$, for random $p \in [0,1]^d$. $E[\text{regret of } y_{t-1} \text{ on } t] \le \sum_{t \le T} E[c(y_{t-1},w_t) - c(y_t,w_t)] + […]$. Strange: randomized regularization! $y_t$ can be computed using $M$.
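A hedged sketch of the perturbation idea: add a random vector to the cumulative costs before calling the optimizer, so that small changes in the totals rarely change the chosen decision. The slide draws p from [0,1]^d; the `scale` parameter below widens that range and is an assumption made for illustration, as are the fresh draw on every period, the alternating test instance, and the name `follow_perturbed_leader`.

```python
import random


def dot(s, w):
    return sum(si * wi for si, wi in zip(s, w))


def leader(S, total):
    """Brute-force stand-in for M: the decision with least cumulative cost."""
    return min(S, key=lambda s: dot(s, total))


def follow_perturbed_leader(S, costs, scale, rng):
    """Average regret of playing y_{t-1} = M(sum_{i<t} w_i + p) on period t."""
    d = len(costs[0])
    total = [0.0] * d
    paid = 0.0
    for w in costs:
        p = [rng.uniform(0, scale) for _ in range(d)]   # random perturbation
        y_prev = leader(S, [a + b for a, b in zip(total, p)])
        paid += dot(y_prev, w)
        total = [a + b for a, b in zip(total, w)]
    best = dot(leader(S, total), total)
    return (paid - best) / len(costs)


def alternating_costs(T):
    """Instance on which the unperturbed leader is wrong on every period."""
    rest = [(0.0, 1.0) if t % 2 == 0 else (1.0, 0.0) for t in range(2, T + 1)]
    return [(0.5, 0.0)] + rest


S = [(1, 0), (0, 1)]
print(follow_perturbed_leader(S, alternating_costs(1000), 10.0, random.Random(1)))
```

On the alternating instance the unperturbed leader has average regret 1/2, whereas with a perturbation range much larger than the per-period change in the totals the choice is nearly independent of the last cost vector and the average regret is far smaller. The trade-off is that a larger perturbation also slows the response to decisions that are consistently better.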
OTHER DIRECTIONS
BANDIT CONVEX OPT. Convex feasible set $S \subseteq \mathbb{R}^d$. Unknown sequence of concave functions $f_1, \ldots, f_T: S \to [0,1]$. On period $t = 1, 2, \ldots, T$: algorithm chooses $x_t \in S$; algorithm pays and finds out $f_t(x_t)$. Thm: $\forall$ concave $f_1, f_2, \ldots: S \to [0,1]$, $\forall\, T_0, T \ge 1$, the bacterial ascent algorithm achieves: […]
MOTIVATING EXAMPLE Company has to decide how much to advertise among d channels, within budget. Feedback is total profit, affected by external factors. (figure: PROFIT ($) versus ADVERTISING ($), showing curves f_1, ..., f_4, the points (x_t, f_t(x_t)) for t = 1, ..., 4, and x*)
BACTERIAL ASCENT (figure: feasible set S with iterates x_0, x_1, x_2, x_3 added one per slide, steps labeled EXPLORE and EXPLOIT)
APPROXIMATION ALGs What if offline optimization is NP-hard? Example: repeated traveling salesman problem. Suppose you have an $\alpha$-approximation algorithm $A$: $c(A(w),w) \le \alpha \min_{s \in S} c(s,w)$ for all $w \in [0,1]^d$. Would like to achieve low $\alpha$-regret = our cost – $\alpha$ (min cost of best $s \in S$). Possible using the convex optimization approach above and transformations of approximation algorithms [KKL07].
CONCLUSIONS Can extend bandit algorithms to structured problems: guarantee worst-case low regret for linear combinatorial optimization problems and convex optimization. Remarks: works against adaptive adversaries as well; online efficiency = offline efficiency; can handle approximation algorithms; can achieve cost $\le (1+\epsilon)$ min cost $+ O(1/\epsilon)$.