A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem
Matthew Streeter and Stephen Smith
Carnegie Mellon University

Outline
The max k-armed bandit problem
Previous work
Our distribution-free approach
Experimental evaluation

What is the max k-armed bandit problem?

The classical k-armed bandit
You are in a room with k slot machines. Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i. You are allowed n total pulls.
Goal: maximize total payoff.
(> 50 years of papers)

The max k-armed bandit
You are in a room with k slot machines. Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i. You are allowed n total pulls.
Goal: maximize the highest single payoff.
(Introduced ~2003)
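To fix the interface, here is a minimal sketch (the names pull and choose are illustrative assumptions, not from the talk); for the classical problem, the same loop would return the sum of the payoffs instead of the maximum:

```python
def max_k_armed(pull, k, n, choose):
    """Generic max k-armed bandit loop: make n pulls, return the best payoff.

    pull(i)        -> one payoff drawn from arm i's unknown distribution D_i
    choose(t, obs) -> index of the arm to pull at step t, given observations so far
    """
    obs = [[] for _ in range(k)]   # payoff history per arm
    best = float("-inf")
    for t in range(n):
        i = choose(t, obs)
        x = pull(i)
        obs[i].append(x)
        best = max(best, x)        # objective: the single highest payoff seen
    return best
```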

Why study it?

Goal: improve multi-start heuristics
A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found.
Examples: HBSS (Bresina 1996), VBSS (Cicirello & Smith 2005), GRASPs (Feo & Resende 1995, and many others)

Application: selecting among heuristics
Given: some optimization problem and k randomized heuristics. Each time you run a heuristic, you get a solution of a certain quality. You are allowed n runs.
Goal: maximize the quality of the best solution found.

The max k-armed bandit: example
Given n pulls, how can we maximize the (expected) maximum payoff? With the two payoff distributions pictured on the slide: if n=1, you should pull the blue arm (higher mean); if n=1000, you should mainly pull the maroon arm (higher variance).
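A quick Monte Carlo check makes the flip concrete (a sketch; the Gaussian parameters are assumptions standing in for the distributions pictured on the slide):

```python
import random

# Hypothetical stand-ins for the two payoff distributions on the slide.
blue   = lambda: random.gauss(0.5, 0.05)  # higher mean, low variance
maroon = lambda: random.gauss(0.4, 0.30)  # lower mean, high variance

def expected_max(arm, n, trials=2000):
    """Monte Carlo estimate of E[max of n pulls from one arm]."""
    return sum(max(arm() for _ in range(n)) for _ in range(trials)) / trials

print(expected_max(blue, 1), expected_max(maroon, 1))        # n=1: blue wins
print(expected_max(blue, 1000), expected_max(maroon, 1000))  # n=1000: maroon wins
```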

Distributional assumptions?
Without distributional assumptions, the optimal strategy is not interesting. For example, suppose payoffs are in {0,1} and the arms are shuffled so you don't know which is which. The optimal strategy samples the arms in round-robin order: you can't distinguish a "good" arm until you receive a payoff of 1, at which point the max payoff can't be improved.

Distributional assumptions?
All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution.
Why? The Extremal Types Theorem: let M_n be the maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of M_n (suitably normalized) converges to a GEV distribution. And the GEV sometimes gives an excellent fit to the payoff distributions we care about.
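For reference, the GEV family (standard statement, not shown on the slide) has cumulative distribution function

$$G(x) = \exp\!\left(-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right), \qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0,$$

with location $\mu$, scale $\sigma > 0$, and shape $\xi$; the Gumbel case assumed by Cicirello & Smith is the limit $\xi \to 0$, where $G(x) = \exp\!\big(-e^{-(x-\mu)/\sigma}\big)$.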

Previous work
Cicirello & Smith (CP 2004, AAAI 2005): assumed Gumbel distributions (a special case of the GEV); no rigorous performance guarantees, but good results selecting among heuristics for the RCPSP/max.
Streeter & Smith (AAAI 2006): a rigorous result for general GEV distributions, but no experimental evaluation.

Our contributions
Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine.
Chernoff interval estimation: a strategy for the classical k-armed bandit problem that works well when mean payoffs are small (we assume payoffs in [0,1]).

Threshold Ascent
Parameters: a strategy S for the classical k-armed bandit, an integer m > 0.
Idea: initialize t ← −∞. Use S to maximize the number of payoffs that exceed t. Once m payoffs > t have been received, increase t and repeat.

Threshold Ascent
Designed to work well when, for t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms.

Threshold Ascent
Parameters: a strategy S for the classical k-armed bandit, an integer m > 0.
Idea: initialize t ← −∞. Use S to maximize the number of payoffs that exceed t. Once m payoffs > t have been received, increase t and repeat. (A sketch follows below.)
Notes: m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t); as t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero; we don't really start S from scratch each time we increase t.
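A minimal sketch of the idea as described above (the threshold-update rule, the {0,1} reward mapping, and the restart of S are simplifications for illustration; the paper's details differ, and the talk explicitly notes that S need not restart from scratch):

```python
def threshold_ascent(pull, n, m, new_strategy):
    """Sketch of Threshold Ascent.

    pull(i)        -> raw payoff from arm i
    new_strategy() -> fresh classical k-armed bandit strategy with methods
                      select() -> arm index, update(arm, reward in {0,1})
    """
    t = float("-inf")                        # current threshold
    S = new_strategy()
    payoffs, best = [], float("-inf")
    for _ in range(n):
        i = S.select()
        x = pull(i)
        payoffs.append(x)
        best = max(best, x)
        S.update(i, 1.0 if x > t else 0.0)   # S maximizes #payoffs above t
        above = [p for p in payoffs if p > t]
        if len(above) >= m:                  # m payoffs exceeded t: raise it
            t = min(above)                   # smallest raise that excludes one
            S = new_strategy()               # simplification: restart S
    return best
```

For instance, new_strategy could construct the interval-estimation strategy sketched under the next slide.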

Interval Estimation
Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm's mean payoff and pulls the arm with the highest upper bound.
[Figure: confidence intervals around the estimated mean payoffs of Arm 1, Arm 2, and Arm 3.]
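In code, the idea looks like this (a sketch; the normal-approximation intervals and the z parameter are assumptions for illustration, not the variant analyzed in the paper):

```python
import math

class IntervalEstimation:
    """Pull the arm with the highest upper confidence bound on its mean
    payoff (z = 2.33 approximates a one-sided 99% normal bound)."""

    def __init__(self, k, z=2.33):
        self.k, self.z = k, z
        self.n = [0] * k          # pulls per arm
        self.s = [0.0] * k        # sum of payoffs per arm
        self.s2 = [0.0] * k       # sum of squared payoffs per arm

    def select(self):
        for i in range(self.k):   # pull every arm once before comparing
            if self.n[i] == 0:
                return i
        def upper(i):
            mean = self.s[i] / self.n[i]
            var = max(self.s2[i] / self.n[i] - mean * mean, 0.0)
            return mean + self.z * math.sqrt(var / self.n[i])
        return max(range(self.k), key=upper)

    def update(self, i, x):
        self.n[i] += 1
        self.s[i] += x
        self.s2[i] += x * x
```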

Chernoff Interval Estimation
We analyze a variant of interval estimation whose confidence intervals are derived from Chernoff bounds.
Define regret = μ* − average_payoff(strategy), where μ* is the mean payoff of the best arm. We prove an O(√(μ*) · X) regret bound, where X = √(k (log n)/n). Using Hoeffding's inequality gives only O(X) (Auer et al. 2002), so as μ* → 0 our bound is much better. Comparable bounds can be obtained using "multiplicative weight update" algorithms.
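In display form (a reconstruction; μ* replaces symbols garbled in the transcript):

$$\text{regret} \;=\; \mu^{*} - \frac{1}{n}\sum_{t=1}^{n} x_t, \qquad \text{regret} \;=\; O\!\left(\sqrt{\mu^{*}}\,\sqrt{\frac{k \log n}{n}}\right) \;\;\text{vs. Hoeffding's}\;\; O\!\left(\sqrt{\frac{k \log n}{n}}\right).$$

Intuitively, the multiplicative Chernoff bound gives a confidence radius on an arm's empirical mean $\hat\mu_i$ that scales like $\sqrt{\hat\mu_i \log n / n_i}$ rather than $\sqrt{\log n / n_i}$, so intervals around low-mean arms shrink much faster.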

Experimental Evaluation

The RCPSP/max
Assign start times to activities subject to resource and temporal constraints.
Goal: find a schedule with minimum makespan. NP-hard; "one of the most intractable problems in operations research" (Möhring 2000). Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005).
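In symbols (a standard formulation sketch; the notation S_i, p_i, d_{ij}, r_{ik}, R_k is assumed, not from the slide): choose start times S_i to

$$\min\; \max_i \,(S_i + p_i) \quad \text{s.t.} \quad S_j - S_i \ge d_{ij} \;\;\forall (i,j) \in E, \qquad \sum_{i \,:\, S_i \le t < S_i + p_i} r_{ik} \le R_k \;\;\forall t, k,$$

where $p_i$ is the duration of activity $i$, the time lags $d_{ij}$ (possibly negative) encode both minimum and maximum temporal constraints, $r_{ik}$ is activity $i$'s demand for resource $k$, and $R_k$ is that resource's capacity.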

Evaluation
Five multi-start heuristics, each a randomized rule for greedily building a schedule:
LPF - "longest path following"
LST - "latest start time"
MST - "minimum slack time"
MTS - "most total successors"
RSM - "resource scheduling method"
Three max k-armed bandit strategies:
Threshold Ascent (m = 100, S = Chernoff interval estimation with 99% confidence intervals)
round-robin sampling
QD-BEACON (Cicirello & Smith 2004, 2005)
Note: we use a less aggressive variant of interval estimation in these experiments.

Evaluation
Ran on 169 instances from the ProGen/max library. For each instance, we ran each of the five rules 10,000 times and saved the results to a file. Each of the three strategies then solves the instance as a max 5-armed bandit with n = 10,000 pulls, replaying payoffs from the saved results (see the sketch below). Define regret as the difference between the maximum possible payoff and the maximum payoff actually obtained.
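The replay loop might look like this (a hypothetical sketch; the file format and the random-consumption protocol are assumptions, not stated in the talk):

```python
import random

def replay(saved, strategy, n=10_000):
    """Offline max 5-armed bandit replay.

    saved[i] -> list of the 10,000 solution qualities recorded for heuristic i
    strategy -> object with select() -> arm index and update(arm, payoff)
    """
    pools = [list(runs) for runs in saved]   # copy so pools can be consumed
    best = float("-inf")
    for _ in range(n):
        i = strategy.select()
        x = pools[i].pop(random.randrange(len(pools[i])))  # consume one run
        strategy.update(i, x)
        best = max(best, x)
    # regret = max(q for runs in saved for q in runs) - best
    return best
```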

Results
Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five "pure" strategies.

Summary & Conclusions
The max k-armed bandit problem is a simple online learning problem with applications to heuristic search. We described a new, distribution-free approach to it. Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max.