Slide 1: Chapter 9: Planning and Learning
(R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)

Objectives of this chapter:
- Use of environment models
- Integration of planning and learning methods

Slide 2: Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: a description of all possible outcomes and their probabilities (e.g., transition probabilities and expected rewards)
- Sample model: produces sample experiences (e.g., a simulation model)
- Both types of model can be used to produce simulated experience
- Sample models are often much easier to come by
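The slides define these models abstractly; as a minimal illustrative sketch (not from the slides), a tabular model built from counts can serve as both kinds: its normalized counts form a distribution model, and drawing from them makes it a sample model. All names here are hypothetical.

```python
import random
from collections import defaultdict

class EmpiricalModel:
    """Tabular model learned from experience, keyed by (state, action)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {(r, s'): n}

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][(r, s_next)] += 1

    def distribution(self, s, a):
        # distribution model: all observed outcomes and their probabilities
        outcomes = self.counts[(s, a)]
        total = sum(outcomes.values())
        return {outcome: n / total for outcome, n in outcomes.items()}

    def sample(self, s, a):
        # sample model: produce one simulated experience (r, s')
        outcomes = list(self.counts[(s, a)].items())
        results, weights = zip(*outcomes)
        return random.choices(results, weights=weights)[0]
```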

Slide 3: Planning
- Planning: any computational process that uses a model to create or improve a policy
- Planning in AI: state-space planning; plan-space planning (e.g., partial-order planning)
- We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience

Slide 4: Planning (cont.)
- Classical DP methods are state-space planning methods
- Heuristic search methods are state-space planning methods
- A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
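A minimal sketch of what the random-sample one-step tabular Q-planning loop might look like, assuming a sample model with a sample(s, a) method (as in the sketch above) and a Q table held in a defaultdict(float); the hyperparameter values are illustrative, not from the slides.

```python
import random

def q_planning_step(Q, model, states, actions, alpha=0.1, gamma=0.95):
    """One iteration of random-sample one-step tabular Q-planning."""
    # 1. pick a state-action pair at random
    s = random.choice(states)
    a = random.choice(actions)
    # 2. ask the sample model for a simulated outcome
    r, s_next = model.sample(s, a)
    # 3. apply a one-step tabular Q-learning backup to the simulated experience
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```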

Slide 5: Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.

Slide 6: Direct vs. Indirect RL
- Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions
- Direct methods are simpler and are not affected by bad models
- But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel

Slide 7: The Dyna Architecture (Sutton, 1990)

Slide 8: The Dyna-Q Algorithm
(algorithm box, with its steps annotated as direct RL, model learning, and planning)
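As a hedged reconstruction (not the slide's exact pseudocode), one Dyna-Q update can be written with the three annotated pieces side by side. The model here is a plain dict under a deterministic-environment assumption, and Q is assumed to be a defaultdict(float).

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, actions, n_planning=5,
                alpha=0.1, gamma=0.95):
    """One Dyna-Q update after taking action a in state s and observing r, s_next."""
    # direct RL: one-step Q-learning on the real experience
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

    # model learning: remember the observed outcome (deterministic-world assumption)
    model[(s, a)] = (r, s_next)

    # planning: n simulated one-step backups on previously experienced pairs
    for _ in range(n_planning):
        (ps, pa), (pr, pns) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(pns, b)] for b in actions)
                                - Q[(ps, pa)])
```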

Slide 9: Dyna-Q on a Simple Maze
- Rewards are 0 on all transitions until the goal is reached, when the reward is 1

Slide 10: Dyna-Q Snapshots: Midway in the 2nd Episode

Slide 11: When the Model Is Wrong: Blocking Maze
- The changed environment is harder
- Dyna-AC uses actor-critic learning
- Dyna-Q+ uses an exploration bonus: r + κ√n, where n is the time since that transition was last tried for real

Slide 12: Shortcut Maze
- The changed environment is easier

Slide 13: What is Dyna-Q+?
- Uses an "exploration bonus": keeps track of the time since each state-action pair was last tried for real
- An extra reward is added for simulated transitions, related to how long ago the state-action pair was tried: the longer it has gone unvisited, the larger the reward for visiting it
- The agent therefore actually "plans" how to visit long-unvisited states
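A small sketch of how the bonus might enter the planning backup; the symbols tau and kappa are my labels, not from the slides. Here tau is the number of time steps since the pair was last tried for real, and the simulated reward becomes r + kappa·sqrt(tau). During the planning step of Dyna-Q, r would be replaced by this bonus reward for the pair being simulated.

```python
import math

def bonus_reward(r, tau, kappa=1e-3):
    """Dyna-Q+ style planning reward: the longer a state-action pair has gone
    untried for real (tau time steps), the larger the extra reward."""
    return r + kappa * math.sqrt(tau)
```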

Slide 14: Prioritized Sweeping
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - When a new backup occurs, insert its predecessors into the queue according to their priorities
  - Always perform backups from the first pair in the queue
- Moore and Atkeson (1993); Peng and Williams (1993)

Slide 15: Prioritized Sweeping (algorithm box)
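A rough sketch of the loop described on the previous slide, assuming a deterministic tabular model (a dict mapping (s, a) to (r, s')), a predecessors map from each state to the pairs known to lead into it, and a max-priority queue implemented with heapq via negated priorities. All names and thresholds are hypothetical.

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, queue, actions,
                         n_backups=5, alpha=0.1, gamma=0.95, theta=1e-4):
    """Perform up to n_backups prioritized backups from the front of the queue."""
    for _ in range(n_backups):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)              # highest-priority pair first
        r, s_next = model[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # insert predecessors of s whose values would change a lot if backed up
        for (ps, pa) in predecessors.get(s, ()):
            pr, _ = model[(ps, pa)]
            priority = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if priority > theta:
                heapq.heappush(queue, (-priority, (ps, pa)))
```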

Slide 16: Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environment interaction
- Prioritized sweeping significantly improves the learning speed

Slide 17: Rod Maneuvering (Moore and Atkeson, 1993)
- Maneuver the long rod between obstacles to reach the goal
- Step size and rotation increment are fixed
- Solved with prioritized sweeping; 4 actions and a discretized state space

Slide 18: Full and Sample (One-Step) Backups

Slide 19: Full vs. Sample Backups
- Setting: b successor states, all equally likely; initial error = 1; all next states' values assumed correct
- After a full backup, the error is reduced to zero
- Computational cost is normalized to the time of one full backup
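For reference, the idealized analysis behind this comparison (as given in the book for this setting, so treat it as a hedged note): with b equally likely successors whose values are already correct and an initial error of 1, the error after t sample backups falls roughly as

```latex
\text{error}(t) \;\approx\; \sqrt{\frac{b-1}{b\,t}},
\qquad \text{whereas one full backup (costing about as much as } b \text{ sample backups) reduces the error to } 0.
```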

Slide 20: Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
- Advantages when function approximation is used (Chapter 8)
- Focusing of computation: vast, uninteresting parts of the state space can be (usefully) ignored
(figure labels: initial states, states reachable under optimal control, irrelevant states)

Slide 21: Trajectory Sampling Experiment
- One-step full tabular backups
- Uniform: cycled through all state-action pairs
- On-policy: backed up along simulated trajectories
- 200 randomly generated undiscounted episodic tasks
- 2 actions for each state, each leading to b equally likely next states
- 0.1 probability of transition to the terminal state on each step
- Expected reward on each transition drawn from a Gaussian with mean 0 and variance 1
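To make the two regimes concrete, here is a hedged sketch of the uniform and on-policy update schemes compared in this experiment; backup() stands in for the one-step full tabular backup, model.step() is a hypothetical sampler that returns the next state (None when terminal), and all other names are mine, not from the slides.

```python
import random

def uniform_sweep(Q, model, states, actions, backup):
    """Uniform regime: cycle through every state-action pair, backing each one up."""
    for s in states:
        for a in actions:
            backup(Q, model, s, a)

def on_policy_trajectories(Q, model, start_state, actions, backup,
                           epsilon=0.1, n_updates=1000):
    """On-policy regime: back up the pairs encountered along trajectories
    simulated from the start state with the current epsilon-greedy policy."""
    s = start_state
    for _ in range(n_updates):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        backup(Q, model, s, a)
        s = model.step(s, a)                 # hypothetical sampled transition
        if s is None:                        # terminal: start a new episode
            s = start_state
```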

Slide 22: Heuristic Search
- Used for action selection, not for changing a value function (the heuristic evaluation function)
- Backed-up values are computed, but typically discarded
- An extension of the idea of a greedy policy, only deeper
- Also suggests ways to select the states to back up: smart focusing

Slide 23: Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning
- Distribution of backups: the focus of computation
  - trajectory sampling: back up along simulated trajectories
  - prioritized sweeping
  - heuristic search
- Size of backups: full vs. sample; deep vs. shallow