CMSC 471 – Spring 2014 Class #25 – Thursday, May 1

MDPs and the RL Problem
CMSC 471 – Spring 2014, Class #25 – Thursday, May 1
Reading: Russell & Norvig, Chapter 21.1–21.3
Thanks to Rich Sutton and Andy Barto for the use of their slides (modified with additional slides and an in-class exercise), from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

Learning Without a Model
Last time, we saw how to learn a value function and/or a policy from a transition model. What if we don't have a transition model?
- Idea #1: Explore the environment for a long time, record all transitions, learn the transition model, then apply value iteration / policy iteration. Slow, requires a lot of exploration, and gives no intermediate learning! (A sketch of the model-estimation step follows below.)
- Idea #2: Learn a value function (or policy) directly from interactions with the environment, while exploring.
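As a rough illustration of Idea #1 (not from the slides), here is a minimal Python sketch of the "learn the transition model" step: maximum-likelihood estimates built from recorded (s, a, r, s') transitions. The function name and data layout are assumptions for the example; the resulting model could then be handed to value iteration or policy iteration.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood model from recorded (s, a, r, s_next) tuples.

    Returns P_hat[(s, a)][s_next], the estimated transition probability,
    and R_hat[s], the average reward observed in state s.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s_next -> count
    reward_sum = defaultdict(float)
    reward_count = defaultdict(int)

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[s] += r
        reward_count[s] += 1

    P_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
             for sa, c in counts.items()}
    R_hat = {s: reward_sum[s] / reward_count[s] for s in reward_sum}
    return P_hat, R_hat
```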

Simple Monte Carlo
[Backup diagram omitted: the Monte Carlo backup runs from a state all the way to the end of the episode (terminal state T).]
Constant-α MC update: V(s_t) ← V(s_t) + α [R_t − V(s_t)], where R_t is the actual return following state s_t.

TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ.
Recall the simple every-visit Monte Carlo method: V(s_t) ← V(s_t) + α [R_t − V(s_t)]
  target: R_t, the actual return after time t.
The simplest temporal-difference method, TD(0): V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
  target: r_{t+1} + γ V(s_{t+1}), an estimate of the return.
γ: a discount factor in [0, 1] (relative value of future rewards).
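To make the two targets concrete, here is a small Python sketch (my own illustration, not from the slides) computing the Monte Carlo target, which is the actual discounted return, and the TD(0) target, which bootstraps from the current value estimate:

```python
def mc_target(rewards, gamma=0.9):
    """Actual discounted return following a state:
    R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def td_target(r_next, v_next, gamma=0.9):
    """TD(0) target, an estimate of the return: r_{t+1} + gamma * V(s_{t+1})."""
    return r_next + gamma * v_next

# Example: rewards observed after leaving some state until the episode ends.
print(mc_target([-0.1, -0.1, 1.0]))   # uses the whole episode
print(td_target(-0.1, 0.5))           # uses one step plus a value estimate
```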

Simplest TD Method
[Backup diagram omitted: the TD(0) backup runs only one step, from a state to its successor, rather than to the end of the episode.]
TD(0): V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]

Temporal Difference Learning
TD-learning: Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))
or equivalently: Uπ(s) ← α [R(s) + γ Uπ(s')] + (1 − α) Uπ(s)
where Uπ(s) is the previous utility estimate, Uπ(s') is the previous utility estimate for the successor state, R(s) is the observed reward, α is the learning rate, and γ is the discount rate.
General idea: iteratively update utility values, assuming that the current utility values for the other (local) states are correct.
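A minimal sketch of this update in Python (my own, using the notation above; the default α = γ = 0.9 is arbitrary but matches the exercise later in the lecture). It is called once per observed transition (s, reward, s') while following a fixed policy:

```python
def td_update(U, s, s_next, reward, alpha=0.9, gamma=0.9):
    """One TD-learning step: nudge U[s] toward the observed reward
    plus the discounted utility estimate of the successor state."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]
```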

Exploration vs. Exploitation
Problem with naive reinforcement learning: what action should the agent take?
- Best apparent action, based on learning to date (the greedy strategy): often prematurely converges to a suboptimal policy!
- Random action: will cover the entire state space, but is very expensive and slow to learn from. When should the agent stop being random?
- Answer: balance exploration (try random actions) with exploitation (use the best action so far).
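One standard way to strike this balance is ε-greedy action selection, sketched below (an illustration, not prescribed by the slide; the names and ε = 0.1 are arbitrary): with probability ε take a random action, otherwise take the best action found so far. Decaying ε over time is one common answer to "when to stop being random?".

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the
    action with the highest current Q-value in this state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```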

Q-Learning
Q-value: the value of taking action a in state s (as opposed to V, the value of state s).
Q-learning update: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
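A one-step Q-learning update in Python (a sketch under the assumption that Q is a dict keyed by (state, action) with missing entries treated as zero; the exercise on the next slide applies exactly this rule with α = γ = 0.9):

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.9, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(s, a)]
```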

Q-Learning Exercise
[Grid world: three states in a row, A – B – G; the available actions are ← (move left) and → (move right).]
Q-learning reminder: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
Starting state: A
Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
Action sequence: ←, ←, →, ←, →, →
All Q-values are initialized to zero (including Q(G, *))
Fill in the following table for the six Q-learning updates:

t | a_t | s_t | R_{t+1} | s_{t+1} | Q'(s_t, a_t)
0 | ←   | A   |         |         |
1 |     |     |         |         |
2 |     |     |         |         |
3 |     |     |         |         |
4 |     |     |         |         |
5 |     |     |         |         |

Q-Learning Exercise (answers)
[Grid world: three states in a row, A – B – G; the available actions are ← (move left) and → (move right).]
Q-learning reminder: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
Starting state: A
Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
Action sequence: ←, ←, →, ←, →, →
All Q-values are initialized to zero (including Q(G, *)); α and γ are 0.9
The six Q-learning updates:

t | a_t | s_t | R_{t+1} | s_{t+1} | Q'(s_t, a_t)
0 | ←   | A   | -1      | A       | -0.9
1 | ←   | A   | -1      | A       | -0.99
2 | →   | A   | -0.1    | B       | -0.09
3 | ←   | B   | -0.1    | A       | -0.162
4 | →   | A   | -0.1    | B       | -0.099
5 | →   | B   | +1      | G       | 0.9
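As a check on the table, the short script below (my own; it assumes the reconstructed left/right dynamics, with ← in A bumping the wall and staying in A) replays the episode and reproduces the Q' column:

```python
def run_exercise():
    alpha = gamma = 0.9
    # Deterministic dynamics of the 1-D grid A - B - G:
    # (state, action) -> (next state, reward received at time t+1)
    step = {('A', 'L'): ('A', -1.0), ('A', 'R'): ('B', -0.1),
            ('B', 'L'): ('A', -0.1), ('B', 'R'): ('G', +1.0)}
    actions = ['L', 'R']
    Q = {}                          # all Q-values start at zero
    s = 'A'
    for t, a in enumerate(['L', 'L', 'R', 'L', 'R', 'R']):
        s_next, r = step[(s, a)]
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        print(t, a, s, r, s_next, round(Q[(s, a)], 4))
        s = s_next

run_exercise()
# Q' column printed per step: -0.9, -0.99, -0.09,
# -0.1629 (the slide truncates this to -0.162), -0.099, 0.9
```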