1
A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem
Matthew Streeter and Stephen Smith
Carnegie Mellon University
2
Outline
The max k-armed bandit problem
Previous work
Our distribution-free approach
Experimental evaluation
3
What is the max k-armed bandit problem?
4
The classical k-armed bandit
You are in a room with k slot machines
Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution D_i
Allowed n total pulls
Goal: maximize total payoff
> 50 years of papers
5
The max k-armed bandit
You are in a room with k slot machines
Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution D_i
Allowed n total pulls
Goal: maximize highest payoff
Introduced ~2003
6
Why study it?
7
Goal: improve multi-start heuristics
A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found
Examples:
HBSS (Bresina 1996)
VBSS (Cicirello & Smith 2005)
GRASPs (Feo & Resende 1995, and many others)
8
Application: selecting among heuristics
Given: some optimization problem and k randomized heuristics
Each time you run a heuristic, you get a solution with a certain quality
Allowed n runs
Goal: maximize quality of the best solution
9
The max k-armed bandit: example
Given n pulls, how can we maximize the (expected) maximum payoff?
If n=1, should pull blue arm (higher mean)
If n=1000, should mainly pull maroon arm (higher variance)
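To make the tradeoff concrete, here is a minimal simulation; the two payoff distributions are hypothetical stand-ins for the blue and maroon arms, chosen only so that one has the higher mean and the other the higher variance.

```python
import random

def pull_blue():
    # Hypothetical "blue" arm: higher mean, low variance.
    return random.gauss(0.5, 0.05)

def pull_maroon():
    # Hypothetical "maroon" arm: lower mean, high variance.
    return random.gauss(0.3, 0.2)

def expected_max(pull, n, trials=2000):
    # Monte Carlo estimate of E[max payoff] after n pulls of a single arm.
    return sum(max(pull() for _ in range(n)) for _ in range(trials)) / trials

for n in (1, 1000):
    print(n, round(expected_max(pull_blue, n), 3), round(expected_max(pull_maroon, n), 3))
# With n=1 the blue arm wins (higher mean); with n=1000 the maroon arm's
# heavier upper tail gives the larger expected maximum.
```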
10
Distributional assumptions?
Without distributional assumptions, the optimal strategy is not interesting.
For example, suppose payoffs are in {0,1}; the arms are shuffled so you don't know which is which.
The optimal strategy samples the arms in round-robin order!
You can't distinguish a "good" arm until you receive payoff 1, at which point the max payoff can't be improved.
11
Distributional assumptions?
All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution
Why? Extremal Types Theorem: let M_n = maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of M_n converges to a GEV distribution
GEV sometimes gives an excellent fit to payoff distributions we care about
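For reference, the standard GEV form the theorem refers to (with location μ, scale σ > 0, and shape ξ; not written out on the slide) is:

```latex
% Generalized extreme value (GEV) CDF:
G(x) = \exp\!\left\{ -\Bigl[\, 1 + \xi \,\tfrac{x-\mu}{\sigma} \Bigr]^{-1/\xi} \right\},
\qquad 1 + \xi \,\tfrac{x-\mu}{\sigma} > 0.

% The Gumbel case (assumed in the Cicirello & Smith work below) is the limit \xi \to 0:
G(x) = \exp\!\left\{ - e^{-(x-\mu)/\sigma} \right\}.
```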
12
Previous work
Cicirello & Smith (CP 2004, AAAI 2005): assumed Gumbel distributions (a special case of the GEV), no rigorous performance guarantees; good results selecting among heuristics for the RCPSP/max
Streeter & Smith (AAAI 2006): rigorous result for general GEV distributions, but no experimental evaluation
13
Our contributions
Threshold Ascent: a strategy that solves the max k-armed problem using a classical k-armed bandit solver as a subroutine
Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs in [0,1])
14
Threshold Ascent
Parameters: strategy S for the classical k-armed bandit, integer m > 0
Idea:
Initialize threshold t
Use S to maximize the number of payoffs that exceed t
Once m payoffs > t have been received, increase t and repeat
15
Threshold Ascent
Designed to work well when: for t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms
16
Threshold Ascent
Parameters: strategy S for the classical k-armed bandit, integer m > 0
Idea:
Initialize threshold t
Use S to maximize the number of payoffs that exceed t
Once m payoffs > t have been received, increase t and repeat
Notes:
m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t)
As t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero
We don't really start S from scratch each time we increase t
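A minimal sketch of this loop, under assumed interfaces: the classical strategy S is taken to expose choose_arm()/update(), and the rule used to raise t (to the m-th largest payoff seen so far) is an illustrative choice, not the paper's exact pseudocode.

```python
def threshold_ascent(arms, S, n, m):
    """Sketch of Threshold Ascent.
    arms : list of no-argument functions; arms[i]() returns one payoff
    S    : classical k-armed bandit strategy with choose_arm() / update(i, reward)
    n    : total number of pulls allowed
    m    : number of above-threshold payoffs required before raising t"""
    t = float("-inf")            # current threshold
    above_t = []                 # payoffs received so far that exceed t
    best = float("-inf")
    for _ in range(n):
        i = S.choose_arm()
        x = arms[i]()
        best = max(best, x)
        S.update(i, 1.0 if x > t else 0.0)   # S maximizes the number of payoffs > t
        if x > t:
            above_t.append(x)
        if len(above_t) >= m:
            t = min(above_t)                 # raise t, e.g. to the m-th largest payoff seen
            above_t = [p for p in above_t if p > t]
            # Note: S keeps its statistics -- it is not restarted from scratch.
    return best
```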
17
Interval Estimation
Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm's mean payoff and pulls the arm with the highest upper bound
[Figure: confidence intervals for the mean payoffs of Arm 1, Arm 2, and Arm 3]
18
Chernoff Interval Estimation
We analyze a variant of interval estimation with confidence intervals derived from Chernoff bounds
regret = μ* − average_payoff(strategy), where μ* = mean payoff of the best arm
We prove an O(√(μ*) · X) regret bound, where X = √(k (log n) / n)
Using Hoeffding's inequality gives only an O(X) bound (Auer et al. 2002); as μ* → 0, our bound is much better
Comparable bounds can be obtained using "multiplicative weight update" algorithms
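A sketch of this classical-bandit subroutine in the same assumed choose_arm()/update() interface as above; the Chernoff-style upper confidence bound uses illustrative constants, not necessarily the exact bound analyzed in the paper.

```python
import math

class ChernoffIntervalEstimation:
    """Sketch of interval estimation with Chernoff-style upper confidence bounds.
    Pulls the arm with the highest upper bound; payoffs assumed to lie in [0, 1]."""

    def __init__(self, k, delta=0.01):
        self.alpha = math.log(1.0 / delta)   # confidence parameter
        self.sums = [0.0] * k
        self.counts = [0] * k

    def _upper_bound(self, i):
        c = self.counts[i]
        if c == 0:
            return float("inf")              # try every arm at least once
        mean = self.sums[i] / c
        # The interval width scales with sqrt(mean): much tighter than a
        # Hoeffding interval when mean payoffs are small, which is the regime above.
        return mean + math.sqrt(2.0 * self.alpha * mean / c) + 2.0 * self.alpha / c

    def choose_arm(self):
        bounds = [self._upper_bound(i) for i in range(len(self.counts))]
        return bounds.index(max(bounds))

    def update(self, i, payoff):
        self.sums[i] += payoff
        self.counts[i] += 1
```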
19
Experimental Evaluation
20
The RCPSP/max
Assign start times to activities subject to resource and temporal constraints
Goal: find a schedule with minimum makespan
NP-hard, "one of the most intractable problems in operations research" (Mohring 2000)
Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)
21
Evaluation
Five multi-start heuristics; each is a randomized rule for greedily building a schedule:
LPF - "longest path following"
LST - "latest start time"
MST - "minimum slack time"
MTS - "most total successors"
RSM - "resource scheduling method"
Three max k-armed bandit strategies:
Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals)
round-robin sampling
QD-BEACON (Cicirello & Smith 2004, 2005)
Note: we use a less aggressive variant of interval estimation in these experiments
22
Evaluation
Ran on 169 instances from the ProGen/max library
For each instance, ran each of the five rules 10,000 times and saved the results to a file
For each of the three strategies, solve as a max 5-armed bandit with n=10,000 pulls
Define regret = difference between the maximum possible payoff and the maximum payoff actually obtained
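A sketch of how the saved runs can be replayed as bandit arms; the data layout and the call shown in the usage comment are hypothetical, and the slide does not specify how schedule quality is mapped to a payoff in [0,1].

```python
def make_replay_arms(saved_payoffs):
    """Wrap stored payoffs as pull functions so the bandit strategies above can
    be replayed against the saved data without re-running any heuristic.
    saved_payoffs[i] is the list of 10,000 pre-computed payoffs for rule i."""
    def make_arm(payoffs):
        it = iter(payoffs)
        return lambda: next(it)   # each pull replays the next stored run of this rule
    return [make_arm(p) for p in saved_payoffs]

# Hypothetical usage for one instance (n = 10,000 pulls, as above):
#   arms = make_replay_arms(saved_payoffs)
#   best = threshold_ascent(arms, ChernoffIntervalEstimation(k=5), n=10_000, m=100)
#   regret = max(max(p) for p in saved_payoffs) - best
```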
23
Results
Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five "pure" strategies
24
Summary & Conclusions
The max k-armed bandit problem is a simple online learning problem with applications to heuristic search
We described a new, distribution-free approach to the max k-armed bandit problem
Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max