Exploration and Exploitation Strategies for the K-armed Bandit Problem by Alexander L. Strehl

K-armed Bandit Problem At each time step, one of k arms may be pulled. Each pull produces a reward generated by a fixed, bounded probability distribution that depends only on the arm pulled. The goal is to maximize the total average reward received over a finite number of time steps.
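A minimal sketch of this setting in Python. The Bernoulli payoff in {0, 1} is an illustrative assumption (any fixed, bounded distribution would do); the later sketches in this transcript reuse this class.

```python
import random

class Bandit:
    """A k-armed bandit with fixed, bounded payoff distributions.

    Assumption for illustration: each arm pays a Bernoulli reward in
    {0, 1}, which satisfies the fixed-and-bounded requirement above.
    """

    def __init__(self, means):
        self.means = means      # true (hidden) mean payoff of each arm
        self.k = len(means)

    def pull(self, arm):
        # The reward distribution depends only on which arm is pulled.
        return 1.0 if random.random() < self.means[arm] else 0.0
```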

The Greedy Strategy Compute the sample mean of an arm by dividing the total reward received from that arm by the number of times it has been pulled. At each time step, choose the arm with the highest sample mean.
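A sketch of the greedy rule over running totals; how to treat a never-pulled arm is an assumption, since the slide leaves it unspecified.

```python
def greedy_choice(totals, counts):
    """Choose the arm with the highest sample mean (total reward / pulls).

    Assumption: any arm that has never been pulled is tried first,
    since its sample mean is undefined.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)), key=lambda a: totals[a] / counts[a])
```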

Greedy Strategy Problems A pull may return a reward well below the arm's true mean payoff. The sample mean will then be far below the true mean, and for this reason the arm may never be chosen again, even if it has the highest true mean of any arm.

The Naive Strategy Pull each arm an equal number of times.

Naive Strategy Problems The Naive strategy makes no distinction between arms, so it pulls the worst arm as often as the best one. The total average reward is therefore generally sub-optimal.

The θ-IE Strategy The θ-IE strategy is a generalization of Kaelbling's IE (Interval Estimation) strategy [1]. For each arm, calculate a confidence interval centered around the sample mean. Choose the arm whose confidence interval has the highest upper tail. Fong introduced the parameter θ, which scales the size of the confidence intervals and often improves the performance of the algorithm. [1] Kaelbling, L.P. (1993). Learning in Embedded Systems. Cambridge, MA: MIT Press.
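The slide does not give the interval formula, so the sketch below uses an assumed Hoeffding-style width sqrt(ln t / n) for rewards in [0, 1], scaled by θ; the point is only the selection rule, namely picking the arm with the highest upper tail.

```python
import math

def theta_ie_choice(totals, counts, theta):
    """Choose the arm whose confidence interval has the highest upper tail.

    The interval width used here (theta * sqrt(ln t / n)) is an assumption;
    the slides only say that theta scales the interval size.
    """
    t = sum(counts)
    best_arm, best_upper = 0, float("-inf")
    for arm, (s, n) in enumerate(zip(totals, counts)):
        if n == 0:
            return arm              # an untried arm has an unbounded interval
        upper = s / n + theta * math.sqrt(math.log(t) / n)
        if upper > best_upper:
            best_arm, best_upper = arm, upper
    return best_arm
```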

Example Here we see that the θ-IE strategy is not the same as the greedy strategy. Arm A1 has a higher sample mean, and we are very confident that the true mean is close to the sample mean. Arm A2 has a much lower sample mean, but we are less confident about it. Using the θ-IE strategy we would choose to pull arm A2.
[Figure: confidence intervals for arms A1 and A2 on the payoff range [0, 1]; A2's interval is wider, with the higher upper tail.]

Example Continued After choosing arm A2 a number of times, its confidence interval shrinks. At this point we choose arm A1.
[Figure: the same intervals after more pulls of A2; A1's upper tail is now the higher one.]

Laplace (or Optimistic) Strategy Estimate the mean of an arm using the formula est := (S + Z) / (N + Z), where S is the total payoff of the arm, N is the number of times the arm has been pulled, and Z is some positive constant. At each time step choose the arm with the highest estimated mean. This can be thought of as assuming each arm has been pulled Z additional times, receiving the maximum reward of 1 on each of those pulls. A strategy very similar to this is discussed by Sutton and Barto [1]. [1] Sutton, Richard S. & Barto, Andrew G. (1998). Reinforcement Learning. Cambridge, MA: MIT Press.
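A sketch of the Laplace rule as stated on the slide. Note that an untried arm gets estimate Z / Z = 1, the maximum possible reward, which is what makes the strategy optimistic.

```python
def laplace_choice(totals, counts, Z):
    """Choose the arm with the highest optimistic estimate (S + Z) / (N + Z).

    Z acts as Z imaginary pulls that each paid the maximum reward of 1,
    so untried arms start at estimate 1 and are explored first.
    """
    return max(range(len(counts)),
               key=lambda a: (totals[a] + Z) / (counts[a] + Z))
```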

Laplace Strategy continued The parameter Z can be chosen large enough that the Laplace strategy satisfies the PAO (probably approximately optimal) conditions introduced by Greiner and Orponen. Define an ε-optimal arm to be one whose true mean is greater than (μ* - ε), where μ* is the mean of the optimal arm. For any ε and δ there exists a number Z such that the Laplace strategy is guaranteed to find an ε-optimal arm with probability (1 - δ). Fong [1] has shown that the θ-IE and Naive strategies also satisfy the PAO conditions. [1] Fong, Philip W.L. (1995). A Quantitative Study of Hypothesis Selection. Twelfth International Conference on Machine Learning, Tahoe City, California.

Experiments Four experiments were carried out. Each experiment consists of two phases: exploration (5000 trials) and exploitation (10000 trials). During the exploration phase any arm may be chosen; during the exploitation phase the greedy algorithm is used. The exploitation phase is included to penalize methods that pick sub-optimal arms. Each experiment was repeated 100 times and the results averaged.
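A sketch of this two-phase protocol, using the Bandit class and choice rules above; the arm means and the Z value in the usage lines are illustrative assumptions.

```python
def run_experiment(bandit, explore_choice, explore_steps=5000, exploit_steps=10000):
    """Exploration with the given rule, then greedy exploitation.

    Phase lengths follow the slides; returns the total reward collected.
    """
    totals = [0.0] * bandit.k
    counts = [0] * bandit.k
    total_reward = 0.0
    for step in range(explore_steps + exploit_steps):
        if step < explore_steps:
            arm = explore_choice(totals, counts)
        else:
            arm = greedy_choice(totals, counts)   # exploitation phase is greedy
        reward = bandit.pull(arm)
        totals[arm] += reward
        counts[arm] += 1
        total_reward += reward
    return total_reward

# Example usage (illustrative arm means and Z; average over repeats as in the slides):
# bandit = Bandit([0.2, 0.5, 0.8])
# total = run_experiment(bandit, lambda t, c: laplace_choice(t, c, Z=10))
```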

Experimental Results

Tuning the Parameter Z Experiment 1

Tuning the Parameter Z cont. Experiment 2

Tuning the Parameter Z cont. Experiment 3

Tuning the Parameter Z cont. Experiment 4

Conclusion The Laplace strategy is a simple alternative to the θ-IE strategy, and it obtained slightly better empirical results in these experiments. It is simpler in the sense that no confidence interval must be computed.

Future Work Average-case analysis of these methods. Calculation of optimal values of the parameters Z and θ based on the range and variance of the payoffs, as well as the number of trials.