1 Monte Carlo Methods Week #5

2 Introduction
Monte Carlo (MC) methods
– do not assume complete knowledge of the environment (unlike DP methods, which assume a perfect environment model);
– require experience: sample sequences of states, actions, and rewards from on-line or simulated interaction with an environment;
– on-line experience lets the agent learn in an unknown environment, whereas simulated experience does require a model, but the model need only generate sample transitions, not the complete probability distributions.
In MC methods, the unknown value functions are estimated by averaging sample returns.

3 Set up
We assume the environment is a finite MDP.
Finite set of states: S
Finite set of actions: A(s)
The dynamics of the environment are given by
– a set of transition probabilities $P^{a}_{s s'} = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}$ – unknown, and
– the expected immediate rewards $R^{a}_{s s'} = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}$ – unknown,
for all $s \in S$, $s' \in S^{+}$, and $a \in A(s)$.

4 MC Policy Evaluation
Value of a state: the expected cumulative future discounted reward starting from that state.
To estimate a state's value from experience, the returns observed after visits to that state are averaged.
Every-visit MC method: estimates V^π(s) as the average of the returns following all visits to state s in a set of episodes.
First-visit MC method: estimates V^π(s) as the average of the returns following only the first visits to state s in a set of episodes.
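Written out (following the standard definition in [1], with γ the discount factor), the quantity being estimated is

$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right\}.$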

5 First-visit MC Method
Initialize:
  π ← the policy to be evaluated
  V ← an arbitrary state-value function
  Returns(s) ← an empty list, for all s ∈ S
Repeat forever:
  – Generate an episode using π
  – For each state s appearing in the episode:
      R ← return following the first occurrence of s
      Append R to Returns(s)
      V(s) ← average(Returns(s))
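As an illustration, here is a minimal Python sketch of first-visit MC policy evaluation. It assumes a hypothetical generate_episode(policy) helper that returns one episode as a list of (state, reward) pairs; the function and parameter names are illustrative, not part of the original slides.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, policy, num_episodes=10000, gamma=1.0):
    """Estimate V^pi by averaging returns that follow the first visit to each state."""
    returns = defaultdict(list)   # Returns(s): observed returns for state s
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Episode: [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]
        episode = generate_episode(policy)

        # Return following each time step, computed by working backwards.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = gamma * G + reward
            returns_after[t] = G

        # First-visit rule: only the return after the first occurrence of a state counts.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_after[t])
                V[state] = sum(returns[state]) / len(returns[state])

    return V
```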

6 MC Estimation of Action Values
Without a model, state values alone are not sufficient to determine a policy. (With a model, a one-step look-ahead is enough to find the action that leads to the best combination of reward and next state.)
We therefore need to estimate the value of each state-action pair, Q^π(s,a): the expected return when starting in state s, taking action a, and thereafter following π.
Problem: many state-action pairs may never be visited.
– We need to maintain exploration.
– Two assumptions solve the problem:
    Exploring starts: every state-action pair has a non-zero probability of being selected as the start of an episode.
    Infinite number of episodes: all state-action pairs are visited infinitely many times.
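For reference, the action value being estimated is, in the notation of [1],

$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}.$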

7 Generalized Policy Iteration in MC Methods
[GPI diagram: evaluation drives the value estimates toward the value function of the current policy π, while improvement makes π greedy with respect to the current value estimates.] Here π* and Q* denote the optimal policy and the optimal action values, respectively.

8 Policy Iteration: MC Version
$\pi_0 \xrightarrow{E} Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} Q^{*}$
(The diagram is an exact replication of the figure on page 97 of [1].)
In the diagram of policy iteration above, the transitions E and I stand for the policy evaluation and policy improvement phases of the policy iteration steps, respectively [1]. For any Q, the corresponding greedy policy selects an action with the maximal Q-value:
$\pi(s) = \arg\max_{a} Q(s,a)$

9 Removing the Second Assumption...
To remove the assumption of an infinite number of episodes, we use the same idea as in value iteration: instead of running policy evaluation to completion, we perform just a single iteration of policy evaluation between successive policy improvement steps, as we did in the discussion of dynamic programming. An extreme version is in-place value iteration, where only one state is updated at each iteration.

10 Monte Carlo ES Method
Initialize, for all s ∈ S, a ∈ A(s):
  Q(s,a) ← arbitrary
  π(s) ← arbitrary
  Returns(s,a) ← empty list
Repeat forever:
  – Generate an episode using exploring starts and π
  – For each pair (s,a) appearing in the episode:
      R ← return following the first occurrence of s,a
      Append R to Returns(s,a)
      Q(s,a) ← average(Returns(s,a))
  – For each s in the episode:
      π(s) ← arg max_a Q(s,a)
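A minimal Python sketch of Monte Carlo ES follows. It assumes a hypothetical environment object exposing states, actions(s), and a simulate(s, a, policy) helper that generates one episode starting from the given state-action pair (the exploring start); all of these names are illustrative assumptions.

```python
import random
from collections import defaultdict

def monte_carlo_es(env, num_episodes=100000, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit updates)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(env.actions(s)) for s in env.states}

    for _ in range(num_episodes):
        # Exploring starts: begin the episode from a random state-action pair.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        # Episode: [(s_0, a_0, r_1), (s_1, a_1, r_2), ...], following `policy` after the start.
        episode = env.simulate(s0, a0, policy)

        # Return following each time step, computed backwards.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = gamma * G + reward
            returns_after[t] = G

        # First-visit updates for every (s, a) pair in the episode.
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:
                seen.add((s, a))
                returns[(s, a)].append(returns_after[t])
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

        # Policy improvement: make the policy greedy w.r.t. the current Q.
        for s, _, _ in episode:
            policy[s] = max(env.actions(s), key=lambda a: Q[(s, a)])

    return policy, Q
```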

11 Removing the First Assumption...
To remove the assumption of exploring starts, the agent must keep selecting every action with non-zero probability. Two types of MC control that achieve this are:
– on-policy MC control, and
– off-policy MC control.

12 On-Policy MC Control
In on-policy control, the policy used to select actions is the same as the policy whose action values are being estimated. The underlying idea is that of GPI. Any of the exploratory action-selection policies discussed in Week #2 may be used in on-policy MC control. Without exploring starts, the policy is moved toward an ε-greedy policy rather than a fully greedy one.

13 On-Policy MC Control Algorithm
Initialize, for all s ∈ S, a ∈ A(s):
  Q(s,a) ← arbitrary
  Returns(s,a) ← empty list
  π ← an arbitrary ε-soft policy
Repeat forever:
  – Generate an episode using π
  – For each pair (s,a) appearing in the episode:
      R ← return following the first occurrence of s,a
      Append R to Returns(s,a)
      Q(s,a) ← average(Returns(s,a))
  – For each s in the episode:
      a* ← arg max_a Q(s,a)
      For all a ∈ A(s):
        π(s,a) ← 1 − ε + ε/|A(s)|  if a = a*
        π(s,a) ← ε/|A(s)|          if a ≠ a*
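A minimal Python sketch of on-policy first-visit MC control with an ε-greedy policy is given below. The environment interface (states, actions(s), reset(), and step(s, a) returning (next_state, reward, done)) is an assumption made for illustration only.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=100000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)
    returns = defaultdict(list)

    def epsilon_greedy_action(s):
        actions = env.actions(s)
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(s, a)])      # exploit

    for _ in range(num_episodes):
        # Generate an episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy_action(s)
            s_next, reward, done = env.step(s, a)
            episode.append((s, a, reward))
            s = s_next

        # Return following each time step, computed backwards.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][2]
            returns_after[t] = G

        # First-visit Q updates.
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:
                seen.add((s, a))
                returns[(s, a)].append(returns_after[t])
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        # Policy improvement is implicit: epsilon_greedy_action always acts
        # (nearly) greedily with respect to the latest Q estimates.

    return Q
```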

14 Off-Policy MC Control
In off-policy control, the policy used to select actions (the behavior policy, π_b) and the policy whose action values are being estimated (the estimation policy, π_e) are separate. The question is how the value estimates of the estimation policy can be updated correctly from episodes generated by the behavior policy.

15 Off-Policy MC Control... (2)
The first requirement for this to work is that every action taken under π_e is also taken, at least occasionally, under π_b.
Second, the differing probabilities with which the two policies generate any given sequence, starting from a specific visit to state s, must be accounted for.
That is, consider the i-th first visit to state s and the complete sequence of states and actions following that visit. We define p_{i,e}(s) and p_{i,b}(s) as the probabilities of that sequence occurring under policies π_e and π_b, respectively. Further, let R_i(s) denote the corresponding observed return from state s, and T_i(s) the time of termination of the i-th episode involving state s. Then, for a first visit at time t,
$p_{i,e}(s_t) = \prod_{k=t}^{T_i(s)-1} \pi_e(s_k, a_k)\, P^{a_k}_{s_k s_{k+1}}$,
and $p_{i,b}(s_t)$ is defined analogously with $\pi_b$ in place of $\pi_e$.

16 Off-Policy MC Control... (3)
Weighting each return by its relative probability of occurrence under π_e and π_b, i.e. by p_{i,e}(s)/p_{i,b}(s), the value estimate is obtained as
$V(s) = \dfrac{\sum_{i=1}^{n_s} \frac{p_{i,e}(s)}{p_{i,b}(s)} R_i(s)}{\sum_{i=1}^{n_s} \frac{p_{i,e}(s)}{p_{i,b}(s)}}$,
where n_s is the number of returns observed from state s.

17 Off-Policy MC Control... (4)
Looking at the probability ratio below, we see that it depends only on the two policies and not at all on the environment's dynamics: the transition probabilities $P^{a_k}_{s_k s_{k+1}}$ appear in both numerator and denominator and cancel.
$\dfrac{p_{i,e}(s_t)}{p_{i,b}(s_t)} = \dfrac{\prod_{k=t}^{T_i(s)-1} \pi_e(s_k, a_k)\, P^{a_k}_{s_k s_{k+1}}}{\prod_{k=t}^{T_i(s)-1} \pi_b(s_k, a_k)\, P^{a_k}_{s_k s_{k+1}}} = \prod_{k=t}^{T_i(s)-1} \dfrac{\pi_e(s_k, a_k)}{\pi_b(s_k, a_k)}$
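To make the cancellation concrete, here is a small Python sketch that computes the importance-sampling weight for one episode tail from the two policies alone; pi_e and pi_b are assumed to map a (state, action) pair to its selection probability (a hypothetical interface for illustration).

```python
def importance_weight(episode_tail, pi_e, pi_b):
    """Importance-sampling ratio over a tail of (state, action) pairs.

    Only the action-selection probabilities of the two policies are needed;
    the environment's transition probabilities cancel out of the ratio.
    """
    weight = 1.0
    for state, action in episode_tail:
        weight *= pi_e(state, action) / pi_b(state, action)
    return weight

# Example: deterministic estimation policy vs. a uniform-random behavior
# policy over two actions (illustrative values only).
greedy = {"s0": "a1", "s1": "a0"}
pi_e = lambda s, a: 1.0 if greedy[s] == a else 0.0
pi_b = lambda s, a: 0.5
print(importance_weight([("s0", "a1"), ("s1", "a0")], pi_e, pi_b))  # prints 4.0
```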

18 Off-Policy MC Control Algorithm
Initialize, for all s ∈ S, a ∈ A(s):
  Q(s,a) ← arbitrary
  N(s,a) ← 0    // numerator of Q(s,a)
  D(s,a) ← 0    // denominator of Q(s,a)
  π_e ← an arbitrary deterministic policy
Repeat forever:
  – Select a behavior policy π_b and use it to generate an episode
    s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ..., s_{T-1}, a_{T-1}, r_T, s_T
  – τ ← latest time at which a_τ ≠ π_e(s_τ)
  // ...continues on the next page

19 Off-Policy MC Control Algorithm (continued)
  – For each pair (s,a) appearing in the episode at time τ or later:
      t ← the time of first occurrence of (s,a) such that t ≥ τ
      w ← ∏_{k=t+1}^{T−1} 1/π_b(s_k, a_k)
      N(s,a) ← N(s,a) + w R_t
      D(s,a) ← D(s,a) + w
      Q(s,a) ← N(s,a) / D(s,a)
  – For each s ∈ S:
      π_e(s) ← arg max_a Q(s,a)
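Below is a minimal Python sketch of the off-policy algorithm with the N/D (weighted importance-sampling) bookkeeping above. The environment interface and the uniform-random behavior policy are assumptions made for illustration; they are not prescribed by the slides.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=100000, gamma=1.0):
    """Off-policy MC control: greedy estimation policy, weighted importance sampling."""
    Q = defaultdict(float)
    N = defaultdict(float)   # numerator of Q(s, a)
    D = defaultdict(float)   # denominator of Q(s, a)
    pi_e = {s: random.choice(env.actions(s)) for s in env.states}   # estimation policy

    for _ in range(num_episodes):
        # Behavior policy: uniform random, hence soft (covers every action of pi_e).
        episode, s, done = [], env.reset(), False
        while not done:
            a = random.choice(env.actions(s))
            s_next, reward, done = env.step(s, a)
            episode.append((s, a, reward))
            s = s_next
        T = len(episode)

        # tau: latest time at which the behavior action disagrees with pi_e
        # (tau = 0 if the whole episode agrees, so every step is usable).
        tau = 0
        for t in range(T - 1, -1, -1):
            s_t, a_t, _ = episode[t]
            if a_t != pi_e[s_t]:
                tau = t
                break

        # Return following each time step.
        G = 0.0
        returns_after = [0.0] * T
        for t in reversed(range(T)):
            G = gamma * G + episode[t][2]
            returns_after[t] = G

        # Weighted importance-sampling updates for first visits at time >= tau.
        seen = set()
        for t in range(tau, T):
            s_t, a_t, _ = episode[t]
            if (s_t, a_t) in seen:
                continue
            seen.add((s_t, a_t))
            w = 1.0
            for k in range(t + 1, T):
                s_k = episode[k][0]
                w *= len(env.actions(s_k))   # = 1 / pi_b(s_k, a_k) for a uniform behavior policy
            N[(s_t, a_t)] += w * returns_after[t]
            D[(s_t, a_t)] += w
            Q[(s_t, a_t)] = N[(s_t, a_t)] / D[(s_t, a_t)]

        # Improve the estimation policy greedily with respect to Q.
        for s in env.states:
            pi_e[s] = max(env.actions(s), key=lambda a: Q[(s, a)])

    return pi_e, Q
```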

20 References
[1] Sutton, R. S., and Barto, A. G., "Reinforcement Learning: An Introduction," MIT Press, 1998.