Evaluation of Bayesian Reinforcement Learning Techniques as Applied to Multi-Armed Bandit and Blackjack Problems
Robert Sawyer and Shawn Harris

INTRODUCTION

In reinforcement learning (RL), an agent interacts with an environment, periodically receiving rewards. The agent knows its current state in the environment and which actions it is allowed to take from that state. The agent aims to learn an optimal policy, a mapping from states to actions, that maximizes the discounted future rewards received from the environment. RL methods aim to convert the agent's experiences with the environment into an optimal policy [1], which the agent can then follow to maximize its rewards. When an agent follows its optimal policy, the agent is said to be exploiting the environment. Conversely, an agent is exploring the environment when it takes actions it does not believe to be optimal in order to learn more about the environment. The agent must balance this tradeoff between exploration and exploitation to maximize its overall reward while ensuring it is not missing better action trajectories through the environment. This poster presents how reinforcement learning techniques can be used to solve Multi-Armed Bandit (MAB) problems and to learn strategies for blackjack.

MULTI-ARMED BANDIT

In a Multi-Armed Bandit (MAB) problem, the agent must learn which of K slot machines (armed bandits) gives a binary reward at the highest rate. There is only one state, from which the agent can pull any of the available arms. A good strategy for maximizing the cumulative reward must involve some combination of exploration (finding the best arm) and exploitation (using the arm that appears to have the best payout). In general, the agent should perform more exploration actions at the start, when the payouts of each arm are uncertain, and more exploitation actions later, when the agent has less uncertainty regarding the payouts of each arm.

A common frequentist method of action selection is known as ε-greedy. Under this action-selection strategy, the agent performs the "greedy" action (the one it believes to be best, the exploitation action) with probability 1 - ε and a random action (an exploration action) with probability ε. A more sophisticated version of this algorithm applies a decay factor to ε, so that fewer exploration actions are taken as the number of action-selection iterations increases.

Bayesian methods provide a natural way of incorporating the uncertainty of the agent's beliefs. In a Bayesian framework, the reward probability of each arm is treated as a random variable following a beta distribution. Modeling the agent's experiences as a binomial likelihood, a beta posterior can be used to model the reward probability of each arm. These reward probabilities are sampled whenever the agent needs to make a decision, and the agent pulls the arm with the maximum sampled reward probability (see Figure 1). This also provides a natural framework for determining when to stop exploration: because each arm's reward probability is a random variable with a beta posterior, the agent can calculate the probability that it has found the arm with the maximum reward probability. This addresses a major problem with the frequentist methods, whose ε and decay parameters vary in effectiveness depending on the underlying (unknown) reward distribution.

Figure 1: Bayesian (Thompson) Sampling.

Figure 2 shows an example of five arms with similar payouts. Note that as the number of iterations increases, the variance of the posterior distributions decreases and the probability of sampling from the best arm increases.

Figure 2: Evolution of Posterior Beta Distributions for the MAB Problem after different numbers of action-selection iterations.
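A minimal sketch of this Thompson-sampling loop for a Bernoulli bandit is shown below. It illustrates the general technique rather than the implementation used for the figures; the Beta(1, 1) priors, the arm payout rates, and the run length are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.5, 0.475, 0.45, 0.425, 0.4]  # hypothetical arm payout rates
n_arms = len(true_probs)

# Beta(1, 1) prior for each arm: alpha counts successes, beta counts failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(10_000):
    # Sample a reward probability for each arm from its current Beta posterior.
    samples = rng.beta(alpha, beta)
    # Pull the arm with the largest sampled reward probability; exploration and
    # exploitation are handled implicitly by the posterior uncertainty.
    arm = int(np.argmax(samples))
    reward = rng.random() < true_probs[arm]  # binary reward from the chosen arm
    # Conjugate update of the chosen arm's Beta posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

Because arms with wide posteriors occasionally produce large samples, uncertain arms keep being tried; as the posteriors sharpen, the sampling concentrates on the best arm, matching the behavior illustrated in Figure 2.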
METHOD COMPARISON

In addition to providing a natural way of selecting actions in the exploration-exploitation problem, the convergence criterion of the Bayesian methods provides a robust way to determine when to stop exploration and begin fully exploiting the agent's knowledge. By calculating the probability that it has found the arm with the maximum reward probability, the Bayesian agent can incorporate the uncertainty of the environment into its decision to stop exploring, while the ε-greedy agents depend on their fixed ε and decay parameters, which may be optimal or suboptimal depending on the environment. The two 10-armed bandit scenarios below illustrate this problem. The figures plot the cumulative reward of the different agents by iteration. The vertical lines indicate the iteration at which each agent converges to a greedy policy, i.e., when it stops exploring and begins exploiting the arm it believes is best.

In Figure 3, one arm has a reward probability of 0.9 while all other reward probabilities are below 0.5. In this scenario, the agent should be able to quickly determine which arm is best and then continue to exploit that arm. Using α = 0.95 as the convergence threshold, the Bayesian agent (shown in magenta) converges to the optimal arm within 500 iterations. Meanwhile, the ε-greedy agent with the quickest decay (the one that converges first, in blue) converges later and to an incorrect arm. The ε-greedy agents with slower decay factors converge to the correct arm, but at a much later iteration than the Bayesian agent.

In Figure 4, the arms have reward probabilities of 0.5, 0.475, 0.45, 0.425, and 0.4, with the remaining arms at 0.2. In this scenario, the two ε-greedy agents that converge before the Bayesian agent converge to incorrect arms (the 0.475 and 0.425 arms, respectively), indicating that more exploration was needed before converging.

Figure 3: One arm = 0.9, others < 0.5.
Figure 4: Five arms near 0.5.
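The convergence check itself can be estimated by jointly sampling from the arms' Beta posteriors, as in the sketch below. This is one plausible way to compute the stopping probability described above, not necessarily the implementation used for the experiments; the threshold of 0.95 follows the α quoted for Figure 3, and the posterior counts shown are hypothetical.

```python
import numpy as np

def prob_best_arm(alpha, beta, n_draws=5_000, rng=None):
    """Monte Carlo estimate of the probability that the arm with the highest
    posterior mean is truly the arm with the highest reward probability."""
    rng = rng or np.random.default_rng()
    # Draw joint samples from every arm's Beta posterior: shape (n_draws, n_arms).
    samples = rng.beta(alpha, beta, size=(n_draws, len(alpha)))
    best_by_mean = int(np.argmax(alpha / (alpha + beta)))
    # Fraction of joint draws in which that arm comes out on top.
    return float(np.mean(np.argmax(samples, axis=1) == best_by_mean))

# Example: stop exploring once the leading arm is best with probability > 0.95.
alpha = np.array([90.0, 40.0, 35.0, 30.0, 25.0])  # hypothetical success counts
beta = np.array([30.0, 60.0, 65.0, 70.0, 75.0])   # hypothetical failure counts
if prob_best_arm(alpha, beta) > 0.95:
    print("Converged: switch to the greedy policy.")
```

Under a criterion of this kind, the Bayesian agent's vertical line in Figures 3 and 4 corresponds to the first iteration at which the estimated probability exceeds the chosen α.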
BLACKJACK LEARNING

In the game of blackjack, the object is to reach a final score higher than the dealer's without exceeding 21, or to let the dealer draw additional cards in the hope that his or her hand will exceed 21. As a reinforcement learning problem, the agent's state is defined by the agent's card value, whether the agent has a usable ace, and the dealer's showing card. The agent must therefore learn a policy mapping these states to actions (stay or hit) that maximizes its chance of winning the hand. When learning this policy with no prior knowledge of the game, the agent must learn when to hit and when to stay through experience playing against the dealer. The agent uses this experience to model the transition probabilities between states and the reward function over these transitions. Both can be modeled in a Bayesian fashion: the transitions from a state to the next state under an action are modeled by a Dirichlet posterior distribution, and the rewards are modeled by a Beta posterior. The transition posterior is updated with transition counts, and the reward posterior is updated with delayed rewards of 0 or 1 according to whether the agent lost or won the hand (see the sketch after the references below).

Figure 5 shows the Bayesian agent's reward posterior after various numbers of hands played, for the state in which the agent has a card value of 15, no usable ace, and the dealer is showing a 9. In each state the agent makes its decision using a beta posterior over both the stay (black) and hit (red) actions, calculated from the rewards earned in previous hands and experiences with that state.

Figure 5: Change in Blackjack Posterior Distributions over Time.

Figure 6 shows three believed optimal policies for blackjack after different numbers of hands played. Each policy is represented by a grid of actions indexed by agent card value (rows) and dealer showing card (columns), for usable ace (top grids) and no usable ace (bottom grids). Each cell of the grid is either 0 for stay, 1 for hit, or "–" if the state was never seen by the agent.

Figure 6: Believed Optimal Policies after various numbers of hands played (N = 1,000; N = 10,000; N = 1,000,000), for usable ace (top) and no usable ace (bottom).

BLACKJACK PERFORMANCE

The performance of the various blackjack agents is shown in Figure 7, which plots cumulative hands won over iterations of hands played. After 100,000 hands, the optimal policy wins about 44.5% of total hands played, the Bayesian and ε-greedy agents win about 41%, and the random policy (red) wins about 30%. Figure 7 shows the early iterations to illustrate how the agents improve performance in the early stages of learning.

Figure 7: Blackjack cumulative reward.

CONCLUSIONS

Bayesian reinforcement learning provides a natural framework for action selection that balances exploration and exploitation. We compared the cumulative reward and convergence to the optimal policy of a common frequentist algorithm (ε-greedy) and a Bayesian algorithm (Thompson sampling) on the multi-armed bandit problem. While the cumulative reward did not vary significantly between algorithms, the Bayesian framework provides a more robust algorithm whose effectiveness does not depend on the suitability of hyperparameters. Furthermore, the Bayesian framework provides a natural way to determine when the agent should stop exploring and start exploiting, rather than relying on a fixed decay schedule. The importance of this convergence criterion was illustrated by examples in which the ε-greedy agents converge to the optimal policy later than the Bayesian agent in an "obvious arm" scenario, and in which the ε-greedy agents converge to a suboptimal policy, indicating that more exploration was needed. We also applied Bayesian methods to reinforcement learning in the blackjack setting, modeling both the transition probabilities and the reward function as random variables. These methods showed performance comparable to the frequentist methods. Overall, Bayesian methods provide an elegant approach to incorporating the agent's uncertainty about the environment dynamics into its action decisions when facing the exploration/exploitation dilemma in maximizing cumulative reward.

REFERENCES

[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
[2] Ghavamzadeh, Mohammad, et al. "Bayesian Reinforcement Learning: A Survey." Foundations and Trends in Machine Learning 8.5-6 (2015): 359-483.
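The sketch below, referenced in the Blackjack Learning section, illustrates the Dirichlet/Beta bookkeeping described there. It is an illustrative example rather than the code behind Figures 5-7: the state encoding (player total, usable ace, dealer showing card) follows the state definition above, while the uniform priors and the helper-function names are assumptions made for the sketch.

```python
import numpy as np
from collections import defaultdict

ACTIONS = ("stay", "hit")

# Transition model: Dirichlet posterior per (state, action), kept as pseudo-counts
# over observed next states. Reward model: Beta posterior per (state, action),
# updated with the delayed win/loss outcome of the hand.
transition_counts = defaultdict(lambda: defaultdict(lambda: 1.0))  # Dirichlet(1, ..., 1) prior
reward_params = defaultdict(lambda: np.ones(2))                    # Beta(1, 1) prior: [wins, losses]

def update_transition(state, action, next_state):
    # Conjugate Dirichlet update: add one pseudo-count to the observed next state.
    transition_counts[(state, action)][next_state] += 1.0

def update_rewards(visited, won):
    # Delayed reward: every (state, action) pair visited during the hand receives
    # a 1 if the hand was won and a 0 if it was lost.
    for state, action in visited:
        reward_params[(state, action)] += np.array([won, 1 - won])

def sample_win_prob(state, action, rng):
    # Thompson-style draw from the Beta posterior over the win probability.
    wins, losses = reward_params[(state, action)]
    return rng.beta(wins, losses)

def choose_action(state, rng):
    # Pick the action whose sampled win probability is largest.
    samples = [sample_win_prob(state, a, rng) for a in ACTIONS]
    return ACTIONS[int(np.argmax(samples))]

# Example state: (player total, usable ace, dealer showing card), as in Figure 5.
rng = np.random.default_rng(0)
state = (15, False, 9)
print(choose_action(state, rng))
```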