CMSC 471 Fall 2009 RL using Dynamic Programming


CMSC 471 Fall 2009: RL using Dynamic Programming
Prof. Marie desJardins
Class #23/24 – Tuesday, 11/17 and 11/24
Thanks to Rich Sutton and Andy Barto for the use of their slides! (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)

Chapter 4: Dynamic Programming
Objectives of this chapter:
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions and, hence, optimal policies
- Discuss the efficiency and utility of DP

Policy Evaluation
Policy evaluation: for a given policy π, compute the state-value function V^π.
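Recall the Bellman equation for V^π, in the book's notation (P^a_{ss'} are the transition probabilities and R^a_{ss'} the expected immediate rewards):

V^\pi(s) = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}
         = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right]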

Iterative Methods
Iterative policy evaluation computes a sequence of approximate value functions V_0, V_1, ..., V_k, ... that converges to V^π; each step from V_k to V_{k+1} is a “sweep.” A sweep consists of applying a backup operation to each state. The full policy evaluation backup:
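In the book's notation, applied to every state s on each sweep:

V_{k+1}(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right] \quad \text{for all } s \in S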

Iterative Policy Evaluation
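A minimal Python sketch of the algorithm; the function name and the dictionary-based MDP representation below are illustrative choices, not from the slides, and rewards are simplified to depend only on s and a:

```python
def iterative_policy_evaluation(states, actions, P, R, policy, gamma=1.0, theta=1e-6):
    """Two-array iterative policy evaluation.

    states       : list of nonterminal states (terminal states omitted; their value is 0)
    P[s][a]      : dict mapping next state s2 -> transition probability P^a_{ss'}
    R[s][a]      : expected immediate reward for taking a in s (simplified: no s' dependence)
    policy[s][a] : probability of taking action a in state s under pi
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            # full policy evaluation backup for state s
            v = sum(policy[s][a] *
                    sum(p * (R[s][a] + gamma * V.get(s2, 0.0))   # terminal states default to 0
                        for s2, p in P[s][a].items())
                    for a in actions)
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new
        if delta < theta:          # stop when the largest change in a sweep is small
            return V
```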

A Small Gridworld
- An undiscounted, episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is –1 on every transition until the terminal state is reached

Iterative Policy Evaluation for the Small Gridworld
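As a usage sketch, the 4x4 gridworld under the equiprobable random policy can be written as follows; this assumes the iterative_policy_evaluation function sketched above, and the state encoding is an illustrative choice:

```python
# Nonterminal states 1..14 laid out row-major on a 4x4 grid whose corner cells
# (grid positions 0 and 15) are the single terminal state, labelled 0 here.
states = list(range(1, 15))
actions = ['up', 'down', 'left', 'right']

def step(s, a):
    row, col = divmod(s, 4)
    if a == 'up':    row = max(row - 1, 0)     # moves off the grid leave the state unchanged
    if a == 'down':  row = min(row + 1, 3)
    if a == 'left':  col = max(col - 1, 0)
    if a == 'right': col = min(col + 1, 3)
    s2 = row * 4 + col
    return 0 if s2 in (0, 15) else s2          # both shaded corners -> terminal state 0

P = {s: {a: {step(s, a): 1.0} for a in actions} for s in states}   # deterministic moves
R = {s: {a: -1.0 for a in actions} for s in states}                # reward -1 until termination
policy = {s: {a: 0.25 for a in actions} for s in states}           # equiprobable random policy

V = iterative_policy_evaluation(states, actions, P, R, policy, gamma=1.0)
# expected converged values (terminal state shown as 0 at both ends):
# [0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0]
print([round(V.get(s, 0.0)) for s in range(16)])
```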

Policy Improvement
Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to do an action a ≠ π(s)?
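The one-step lookahead value of taking a in s and thereafter following π is

Q^\pi(s,a) = E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
           = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right],

and it is better to switch to action a in state s whenever Q^π(s,a) > V^π(s).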

Policy Improvement Cont.
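Doing this for all states gives a new policy π' that is greedy with respect to V^π:

\pi'(s) = \arg\max_{a} Q^\pi(s,a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right],

and by the policy improvement theorem V^{π'}(s) ≥ V^π(s) for all s.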

Policy Improvement Cont.
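What if the improvement stops, i.e. V^{π'} = V^π? Then for all s

V^{\pi'}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi'}(s') \right],

which is the Bellman optimality equation, so V^{π'} = V* and both π and π' are optimal policies.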

Policy Iteration
π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) π_2 →(E) ... →(I) π* →(E) V*
where E denotes policy evaluation and I denotes policy improvement (“greedification”).

Policy Iteration
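A compact Python sketch of policy iteration, reusing the illustrative dictionary MDP representation from the earlier sketch; the names are assumptions, and gamma < 1 is assumed so that evaluating an arbitrary initial policy stays finite:

```python
def q_value(s, a, V, P, R, gamma):
    # one-step lookahead: Q(s, a) = sum_{s'} P^a_{ss'} [ R + gamma * V(s') ]
    return sum(p * (R[s][a] + gamma * V.get(s2, 0.0)) for s2, p in P[s][a].items())

def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    pi = {s: actions[0] for s in states}       # arbitrary initial deterministic policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. policy evaluation: in-place sweeps for the current deterministic pi
        while True:
            delta = 0.0
            for s in states:
                v = q_value(s, pi[s], V, P, R, gamma)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. policy improvement ("greedification"): make pi greedy with respect to V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, V, P, R, gamma))
            if q_value(s, best, V, P, R, gamma) > q_value(s, pi[s], V, P, R, gamma) + 1e-9:
                pi[s], stable = best, False
        if stable:                             # policy unchanged -> it is optimal
            return pi, V
```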

Value Iteration
Recall the full policy evaluation backup. The full value iteration backup is the same update with a max over actions in place of the expectation under π:
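Side by side, in the book's notation:

Policy evaluation backup:
V_{k+1}(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]

Value iteration backup:
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]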

Value Iteration Cont.
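A minimal Python sketch, with the same illustrative MDP representation as before:

```python
def value_iteration(states, actions, P, R, gamma=1.0, theta=1e-6):
    """In-place value iteration followed by greedy policy extraction."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # value iteration backup: max over actions instead of expectation under pi
            v = max(sum(p * (R[s][a] + gamma * V.get(s2, 0.0))
                        for s2, p in P[s][a].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # extract a deterministic policy greedy with respect to the converged V
    pi = {s: max(actions, key=lambda a: sum(p * (R[s][a] + gamma * V.get(s2, 0.0))
                                            for s2, p in P[s][a].items()))
          for s in states}
    return V, pi
```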

Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat, until a convergence criterion is met, pick a state at random and apply the appropriate backup. It still needs lots of computation, but it does not get locked into hopelessly long sweeps. Can you select states to back up intelligently? Yes: an agent’s experience can act as a guide.
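A short sketch of the idea, here with uniformly random state selection and again using the illustrative MDP dictionaries from the earlier sketches:

```python
import random

def asynchronous_value_iteration(states, actions, P, R, gamma=1.0, n_backups=100_000):
    """In-place backups applied to one randomly chosen state at a time, no sweeps."""
    V = {s: 0.0 for s in states}
    for _ in range(n_backups):                # in practice: until a convergence criterion is met
        s = random.choice(states)             # could instead be a state the agent just visited
        V[s] = max(sum(p * (R[s][a] + gamma * V.get(s2, 0.0))
                       for s2, p in P[s][a].items())
                   for a in actions)
    return V
```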

Generalized Policy Iteration
Generalized policy iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. The book offers a geometric metaphor for the convergence of GPI: the two processes pull the value function and the policy toward each other until they reach a joint fixed point, where the policy is greedy with respect to its own value function.

Efficiency of DP
- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., it often grows exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and it is well suited to parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Summary
- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Generalized Policy Iteration (GPI)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Bootstrapping: updating estimates based on other estimates