
1 Planning and Learning, Week #9

2 Introduction (1)
Two types of methods in RL:
◦ Planning methods: those that require an environment model  dynamic programming, heuristic search
◦ Learning methods: those that function without a model  Monte Carlo (MC) and temporal-difference (TD) methods

3 Introduction (2)
MC methods and TD methods are distinct alternatives in RL, yet they are related through eligibility traces (ETs). In a similar way, planning methods and learning methods can be shown to be related to each other.

4 Models
A model is anything that helps the agent predict its environment's responses to the actions it takes:
model: (state s_t, action a_t) → (next state s_{t+1}, reward r_{t+1})

5 Model Types from the Planning Perspective
Two types of models in RL from the planning perspective:
◦ Distribution models: those that generate a description of all possibilities and their probabilities
◦ Sample models: those that produce only one of the possibilities, sampled according to the probabilities at hand

6 Distribution and Sample Models
Distribution models are stronger than sample models, since they can also be used to generate samples.
They are harder to obtain, since
 either there is not enough information to build one,
 or the information is available but too complex to derive the model's parameters analytically.
Both types of models generate simulated experience.

7 Simulated Experience in Models
Given an initial state and an action,
◦ a distribution model can generate
− all possible transitions and their occurrence probabilities,
− all possible episodes and their probabilities;
◦ a sample model can generate
− a possible transition,
− an entire sample episode.
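As a concrete illustration of the two model types, here is a minimal Python sketch for a hypothetical two-state toy MDP; the transition table, state names, and function names are assumptions made for illustration, not taken from [1].

import random

# Hypothetical distribution model for a toy MDP:
# maps (state, action) -> list of (probability, next_state, reward).
DISTRIBUTION_MODEL = {
    ("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a"): [(1.0, "s0", 0.0)],
}

def distribution(state, action):
    """Distribution model: all possible transitions and their probabilities."""
    return DISTRIBUTION_MODEL[(state, action)]

def sample(state, action):
    """Sample model: one transition, drawn according to the probabilities."""
    transitions = DISTRIBUTION_MODEL[(state, action)]
    probs = [p for p, _, _ in transitions]
    _, next_state, reward = random.choices(transitions, weights=probs, k=1)[0]
    return next_state, reward

print(distribution("s0", "a"))   # every possibility with its probability
print(sample("s0", "a"))         # a single simulated transition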

8 Planning
In the RL context, planning is defined as any computational process that, given a model as input, produces or improves a policy for interacting with the environment represented by that model:
model → planning → policy

9 Planning
Two distinct approaches to planning:
 State-space planning: involves a search through the state space for an optimal policy or a path to a goal.
 Plan-space planning: performs a search through a space of plans.
We will use state-space planning.

10 State-Space Planning
All state-space planning methods share a common structure, built on two ideas:
◦ they involve computing value functions as a key intermediate step toward improving the policy, and
◦ they compute their value functions by backup operations applied to simulated experience.

11 Diagram of State-Space Planning
model → simulated experience → backups → values → policy

12 How Learning and Planning Relate
The core of both learning and planning is the estimation of value functions by backup operations. The difference:
◦ learning uses real experience, which originates from the environment;
◦ planning uses simulated experience, which is generated by the model.

13 Further Differences between Planning and Learning
The origins of experience in learning and planning differ:
– simulated experience in planning, and
– real experience in learning.
This in turn leads to other differences, such as
– different ways of assessing performance, and
– how flexibly experience can be generated.

14 Advantages of Planning
A learning algorithm may be substituted for the key backup step of a planning method. Learning methods can then be applied to simulated experience as well as to real experience.

15 Random-Sample One-Step Tabular Q-Planning
1. Select a state s ∈ S and an action a ∈ A(s) at random.
2. Send s, a to a sample model and obtain a sample next state s' and a sample reward r.
3. Apply one-step tabular Q-learning to s, a, s', r.
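A minimal Python sketch of these three steps, assuming a tabular Q-table, a sample_model function supplied by the caller, and illustrative step-size and discount parameters (none of these names come from [1]); the backup in step 3 is the standard one-step tabular Q-learning update.

import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95                      # assumed step size and discount factor
Q = defaultdict(float)                        # tabular action values, Q[(s, a)]
STATES, ACTIONS = ["s0", "s1"], ["a0", "a1"]  # placeholder state/action sets

def q_planning_step(sample_model):
    """One iteration of random-sample one-step tabular Q-planning."""
    # 1. Select a state and an action at random.
    s, a = random.choice(STATES), random.choice(ACTIONS)
    # 2. Query the sample model for a simulated next state and reward.
    s_next, r = sample_model(s, a)
    # 3. Apply the one-step tabular Q-learning backup to (s, a, s', r):
    #    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])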

16 Integrating Planning, Acting, and Learning
Learning may be carried out together with planning, where,
– while interacting with the environment,
– the model is continuously improved/updated.

17 Learning, Planning and Acting
[Diagram] The Dyna cycle: value/policy –acting→ experience; experience –direct RL→ value/policy; experience –model learning→ model; model –planning→ value/policy. Direct RL updates values from real experience; indirect RL updates them through the learned model.

18 Dyna-Q Architecture
A planner (agent) uses experience for …
– … improving the model, and
– … improving state-action values/the policy.
The Dyna-Q architecture uses learning and planning together. A Dyna agent performs both the direct and the indirect RL indicated on slide 17, as shown in the algorithm on the next slide.

19 Algorithm for Dyna-Q-Learning
Initialize Q(s,a) and Model(s,a) for all s ∈ S and a ∈ A(s)
Do forever:
– s ← current (non-terminal) state
– a ← ε-greedy(s, Q)
– Execute a; observe s' and r
– Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]   // direct RL
– Model(s,a) ← s', r   // assuming a deterministic environment
– Repeat N times:   // planning
   s ← random previously observed state
   a ← random action previously taken in s
   s', r ← Model(s,a)
   Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]
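The same loop as a compact Python sketch; the env_step function, the action set, and the hyperparameters α, γ, ε, and N are assumptions for illustration. The two identical Q-learning backups correspond to direct RL (on the real transition) and indirect RL, i.e. planning (on transitions replayed from the model).

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, N = 0.1, 0.95, 0.1, 10   # assumed hyperparameters
ACTIONS = ["up", "down", "left", "right"]        # placeholder action set
Q = defaultdict(float)                           # Q[(s, a)]
model = {}                                       # Model[(s, a)] = (s', r)

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next):
    """One-step tabular Q-learning backup."""
    td_target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

def dyna_q_step(s, env_step):
    """One real interaction followed by N planning updates from the model."""
    a = epsilon_greedy(s)
    s_next, r = env_step(s, a)          # real experience
    q_update(s, a, r, s_next)           # direct RL
    model[(s, a)] = (s_next, r)         # model learning (deterministic env)
    for _ in range(N):                  # planning (indirect RL)
        ps, pa = random.choice(list(model.keys()))
        ps_next, pr = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)
    return s_next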

20 The Algorithm's Parameter N
As N grows, more planning updates are drawn from the model per real interaction, so the model's experience is exploited more deeply. The deeper this planning, the faster the convergence to the optimal policy.*
* Check Example 9.1 and Figures 9.5 and 9.6 in [1].

21 What If the Model Is Wrong… (1)
…or what if the environment is non-stationary?
If the environment has changed such that the current policy no longer leads to success, the agent will discover the new optimal policy sooner or later, provided sufficiently many episodes are run.
If, on the other hand, the environment changes so that a new optimal policy arises but the old (now sub-optimal) policy still reaches the goal state, then the new optimal policy might go undetected even with an ε-greedy policy.*
* Check Examples 9.2 and 9.3 in [1].

22 What If the Model Is Wrong… (2)
As in direct RL, there is no perfect and practical solution to the exploration/exploitation dilemma, but simple heuristics can be effective in such cases.
In the corresponding example* in [1], for each state-action pair the time elapsed since the pair was last tried in a real interaction with the environment is recorded. This elapsed time indicates how likely it is that the model is out of date for that pair. To give long-untried actions more chance, a special "bonus reward" r + κ√n is used in simulated experience, where r is the regular reward, n is the number of time steps since the state-action pair was last tried, and κ ∈ ℝ is small. The agent is thus kept encouraged to test such pairs and has a good chance of finding the optimal policy.
* Check Examples 9.2 and 9.3 in [1].
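The change needed in the planning updates is small. Below is a sketch of the bonus computation, assuming a global time-step counter and a last_tried table; the variable names and the value of κ are illustrative ([1] refers to this exploration-bonus variant as Dyna-Q+).

import math

KAPPA = 1e-3          # assumed small bonus weight
last_tried = {}       # time step at which each (s, a) was last tried for real
t = 0                 # global count of real time steps (incremented elsewhere)

def planning_reward(s, a, r):
    """Exploration-bonus reward used only in simulated (planning) backups."""
    n = t - last_tried.get((s, a), 0)   # steps since the pair's last real visit
    return r + KAPPA * math.sqrt(n)

# In the Dyna-Q planning loop of the previous sketch, the backup would become:
#   q_update(ps, pa, planning_reward(ps, pa, pr), ps_next)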

23 Prioritized Sweeping
In real-world problems, where the state-action space is usually tremendously large, unnecessary value updates (i.e., those that do not contribute to an optimal or near-optimal policy) must be minimized.
Prioritized sweeping is an improved version of the Dyna-Q algorithm in which only the values of those state-action pairs are updated that lead to state-action pairs whose values have just changed.
It is an effective method as long as the model has discrete states.

24 Prioritized Sweeping Algorithm
Initialize Q(s,a) and Model(s,a) for all s, a, and PQueue to empty
Do forever:
– s ← current (non-terminal) state
– a ← policy(s, Q)
– Execute a; observe s' and r
– Model(s,a) ← s', r   // assuming a deterministic environment
– p ← |r + γ max_a' Q(s',a') − Q(s,a)|
– if p > θ, then insert s,a into PQueue with priority p
– Repeat N times, while not isEmpty(PQueue):
   s,a ← first(PQueue)
   s',r ← Model(s,a)
   Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]
   Repeat for all s̄,ā predicted to lead to s:
      r̄ ← predicted reward
      p ← |r̄ + γ max_a Q(s,a) − Q(s̄,ā)|
      if p > θ, then insert s̄,ā into PQueue with priority p
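A minimal Python sketch of this loop, using heapq as the priority queue (priorities are negated because heapq is a min-heap); the predecessor bookkeeping, θ, and the other hyperparameters are assumptions for illustration, and states are assumed to be simple comparable values such as strings.

import heapq
from collections import defaultdict

ALPHA, GAMMA, THETA, N = 0.1, 0.95, 1e-4, 5   # assumed hyperparameters
ACTIONS = ["a0", "a1"]                         # placeholder action set
Q = defaultdict(float)                         # Q[(s, a)]
model = {}                                     # Model[(s, a)] = (s', r)
predecessors = defaultdict(set)                # s -> {(s_pred, a_pred)} leading to s
pqueue = []                                    # max-priority queue via negated keys

def priority(s, a, r, s_next):
    """Magnitude of the value change a backup of (s, a) would cause."""
    return abs(r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])

def real_step(s, a, r, s_next):
    """After a real transition: update the model and enqueue if the priority is large."""
    model[(s, a)] = (s_next, r)
    predecessors[s_next].add((s, a))
    p = priority(s, a, r, s_next)
    if p > THETA:
        heapq.heappush(pqueue, (-p, s, a))

def planning_sweep():
    """Up to N planning backups, ordered by urgency of the pending value change."""
    for _ in range(N):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)
        s_next, r = model[(s, a)]
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
                              - Q[(s, a)])
        for s_pred, a_pred in predecessors[s]:     # pairs predicted to lead to s
            _, r_pred = model[(s_pred, a_pred)]    # predicted reward
            p = priority(s_pred, a_pred, r_pred, s)
            if p > THETA:
                heapq.heappush(pqueue, (-p, s_pred, a_pred))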

25 References
[1] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, 1998.