ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 14: Planning and Learning
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015, October 27, 2015

Final projects - logistics
- Projects can be done in groups of up to 3 students
- Details on projects will be posted soon
- Students are encouraged to propose a topic
- Please email me your top three choices for a project along with a preferred date for your presentation
- Presentation dates: Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)
- Format: 20 min presentation + 5 min Q&A
  - ~5 min for background and motivation
  - ~15 min for description of your work, results, conclusions
- Written report due: Monday, Dec. 7 (format similar to a project report)

Final projects – sample topics
- DQN – playing Atari games using RL
- Tetris player using RL (and NN)
- Curiosity-based TD learning*
- Reinforcement Learning of Local Shape in the Game of Go
- AIBO learning to walk
- Study of value function definitions for TD learning
- Imitation learning in RL

Outline
- Introduction
- Use of environment models
- Integration of planning and learning methods

Introduction
- Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives
- We then showed how they can be seamlessly integrated using eligibility traces, as in TD(λ)
- Planning methods (e.g. dynamic programming and heuristic search)
  - Rely on knowledge of a model
  - Model – any information that helps the agent predict how the environment will behave
- Learning methods (Monte Carlo and temporal-difference learning)
  - Do not require a model
- Our goal: explore the extent to which the two kinds of methods can be intermixed

The original idea

The original idea (cont.)

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities
  - e.g. as used in dynamic programming
  - Example – for the sum of a dozen dice, produce all possible sums and their probabilities of occurring
- Sample models: produce just one sample experience
  - In our example – produce individual sums drawn according to this probability distribution
- Both types of models can be used to produce (mimic) simulated experience
- Often sample models are much easier to come by (see the sketch below)
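To make the dice example concrete, here is a minimal Python sketch (the function names `sample_model` and `distribution_model` are illustrative, not from the slides) contrasting a sample model, which returns one simulated outcome, with a distribution model, which enumerates every outcome and its probability:

```python
import random

NUM_DICE = 12  # "a dozen dice" from the slide

def sample_model():
    """Sample model: produce one simulated sum, drawn from the true distribution."""
    return sum(random.randint(1, 6) for _ in range(NUM_DICE))

def distribution_model(num_dice=NUM_DICE):
    """Distribution model: all possible sums and their probabilities,
    built by convolving one die at a time (tractable even for 12 dice)."""
    dist = {0: 1.0}
    for _ in range(num_dice):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6
        dist = new
    return dist

print(sample_model())          # e.g. 41 -- one piece of simulated experience
probs = distribution_model()   # full table: sum -> probability
print(probs[42])               # probability that twelve dice sum to 42
```

The distribution model is exhaustive and exact, but writing it down requires knowing (or computing) the full dynamics; the sample model only needs a way to roll the dice, which is why sample models are often much easier to come by.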

Planning
- Planning: any computational process that uses a model to create or improve a policy
- Planning in AI:
  - State-space planning (such as in RL) – search through the state space for a policy
  - Plan-space planning (e.g., partial-order planners, evolutionary methods)
- We take the following (unusual) view:
  - All state-space planning methods involve computing value functions, either explicitly or implicitly
  - They all apply backups to simulated experience

Planning (cont.)
- Classical DP methods are state-space planning methods
- So are heuristic search methods
- Learning methods require only experience as input; in many cases they can be applied to simulated experience just as well as to real experience
- Example: a planning method based on Q-learning – Random-Sample One-Step Tabular Q-Planning (sketched below)
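A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model `model(s, a)` that returns a simulated reward and next state (the function name, the `Q` table, and the hyperparameter values here are illustrative, not from the slides):

```python
import random
from collections import defaultdict

GAMMA, ALPHA = 0.95, 0.1
Q = defaultdict(float)          # Q[(state, action)] -> value estimate

def q_planning_step(model, states, actions):
    """One iteration of random-sample one-step tabular Q-planning:
    pick a state-action pair at random, ask the sample model for one
    simulated outcome, and apply a Q-learning backup to it."""
    s = random.choice(states)
    a = random.choice(actions)
    r, s_next = model(s, a)                          # one simulated (r, s') sample
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```

Repeating this step converges toward the same values Q-learning would learn from real experience, which is exactly the sense in which a learning rule plus a model yields a planning method.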

Learning, Planning, and Acting
- Two uses of real experience:
  - Model learning: to improve the model
  - Direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning
- Q: What are the advantages/disadvantages of each?

Direct vs. Indirect RL
- Indirect methods:
  - Make fuller use of experience: get a better policy with fewer environment interactions
- Direct methods:
  - Simpler
  - Not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
- Q: Which scheme do you think applies to humans?

The Dyna-Q Architecture (Sutton 1990)

The Dyna-Q Algorithm
- Each real interaction is used three ways: direct RL (a Q-learning update), model learning (updating the model), and planning
- The planning step is the random-sample one-step tabular Q-planning method, applied to previously observed state-action pairs (a sketch follows)
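A minimal tabular Dyna-Q sketch for a deterministic environment, assuming a hypothetical `env.step(s, a)` call that returns a reward and next state (all names, the epsilon-greedy policy, and the hyperparameter values are illustrative):

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON, N_PLANNING = 0.95, 0.1, 0.1, 5

Q = defaultdict(float)          # Q[(s, a)] -> value estimate
model = {}                      # model[(s, a)] = (reward, next_state), deterministic

def epsilon_greedy(s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_step(env, s, actions):
    """One Dyna-Q iteration: act, direct RL, model learning, then planning."""
    a = epsilon_greedy(s, actions)
    r, s_next = env.step(s, a)                      # real experience
    q_update(s, a, r, s_next, actions)              # direct RL
    model[(s, a)] = (r, s_next)                     # model learning
    for _ in range(N_PLANNING):                     # planning
        sp, ap = random.choice(list(model.keys()))  # previously observed pair
        rp, sp_next = model[(sp, ap)]
        q_update(sp, ap, rp, sp_next, actions)
    return s_next
```

The only difference from plain Q-learning is the planning loop at the end, which replays model-generated experience N_PLANNING times per real step.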

Dyna-Q on a Simple Maze
- Rewards = 0 until the goal is reached, when reward = 1

Dyna-Q Snapshots: Midway in the 2nd Episode
- Recall that in a planning context …
  - Exploration – trying actions that improve the model
  - Exploitation – behaving in the optimal way given the current model
- Balance between the two is always a key challenge!

Variations on the Dyna-Q agent
- (Regular) Dyna-Q
  - Soft exploration/exploitation with constant rewards
- Dyna-Q+
  - Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment)
  - If n is the number of steps elapsed between two consecutive visits to (s, a), then the bonus reward grows as a function of n
- Dyna-AC
  - Actor-critic learning rather than Q-learning

More on Dyna-Q+
- Uses an “exploration bonus”:
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - The agent (indirectly) “plans” how to visit long-unvisited states
- A sketch of the bonus appears below
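As one concrete form of the bonus, Sutton & Barto use r + κ·sqrt(τ), where τ is the number of time steps since (s, a) was last tried for real and κ is a small constant. A minimal sketch of how this would modify the planning reward in the earlier Dyna-Q code (the names `last_tried`, `t`, and `KAPPA` are illustrative):

```python
import math

KAPPA = 1e-3        # small exploration-bonus weight
last_tried = {}     # last_tried[(s, a)] = time step of the last real visit
t = 0               # global time-step counter, incremented on every real step

def bonus_reward(s, a, r):
    """Dyna-Q+ planning reward: the modeled reward plus kappa * sqrt(tau),
    where tau is the elapsed time since (s, a) was last tried for real."""
    tau = t - last_tried.get((s, a), 0)
    return r + KAPPA * math.sqrt(tau)
```

During planning backups, `bonus_reward(sp, ap, rp)` would replace the raw modeled reward, so long-unvisited pairs look artificially attractive and the planner routes the agent back toward them.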

When the Model is Wrong
- The maze example was oversimplified; in reality many things could go wrong
  - The environment could be stochastic
  - The model can be imperfect (local minima, stochasticity, or no convergence)
  - Partial experience could be misleading
- When the model is incorrect, the planning process will compute a suboptimal policy
- This is actually a learning opportunity: discovery and correction of the modeling error

When the Model is Wrong: Blocking Maze
- The changed environment is harder: the original path is blocked and a longer path must be found

Shortcut Maze
- The changed environment is easier: a shortcut opens up, which regular Dyna-Q may never discover, while Dyna-Q+ finds it thanks to its exploration bonus

Prioritized Sweeping
- In the Dyna agents presented, simulated transitions start in uniformly chosen state-action pairs – probably not optimal
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - When a new backup occurs, insert predecessors according to their priorities
  - Always perform backups from the first pair in the queue
- (Moore and Atkeson, 1993; Peng and Williams, 1993)
- A sketch of the queue mechanics appears below
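A minimal sketch of the prioritized-sweeping planning loop for a deterministic learned model, assuming a `model[(s, a)] = (r, s_next)` table and a `predecessors[s]` map of state-action pairs known to lead to s (all names, and the threshold THETA, are illustrative):

```python
import heapq
from collections import defaultdict

GAMMA, ALPHA, THETA, N_PLANNING = 0.95, 0.1, 1e-4, 5

Q = defaultdict(float)
model = {}                          # model[(s, a)] = (r, s_next)
predecessors = defaultdict(set)     # predecessors[s] = {(s_prev, a_prev), ...}
pqueue = []                         # max-priority queue via negated priorities

def priority(s, a, actions):
    """Magnitude of the change a backup of (s, a) would cause."""
    r, s_next = model[(s, a)]
    best_next = max(Q[(s_next, b)] for b in actions)
    return abs(r + GAMMA * best_next - Q[(s, a)])

def push_if_significant(s, a, actions):
    p = priority(s, a, actions)
    if p > THETA:
        heapq.heappush(pqueue, (-p, (s, a)))   # negate: heapq is a min-heap

def planning_sweep(actions):
    """Perform up to N_PLANNING backups, always from the head of the queue,
    then push each backed-up state's predecessors by their priorities."""
    for _ in range(N_PLANNING):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        for (sp, ap) in predecessors[s]:       # work backwards from s
            push_if_significant(sp, ap, actions)
```

After each real transition, the agent would also call `push_if_significant` on the pair it just experienced, so value changes propagate backwards through the most affected predecessors first.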

Prioritized Sweeping (cont.)

Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction

Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
  - Distribution constructed from experience (visits)
  - Advantageous when function approximation is used
- Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
  - (Figure labels: initial states, states reachable under optimal control, irrelevant states)
- A sketch follows
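A minimal sketch of trajectory sampling, assuming a sample model `model(s, a)` that returns a reward, next state, and termination flag, with an epsilon-greedy policy over the current Q estimates (all names and constants are illustrative). Instead of sweeping or sampling state-action pairs uniformly, backups are applied along a trajectory simulated under the current policy:

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
Q = defaultdict(float)

def epsilon_greedy(s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def trajectory_backup(model, start_state, actions, horizon=50):
    """Simulate one trajectory from start_state under the current epsilon-greedy
    policy and apply a one-step Q backup at every visited state-action pair,
    so planning effort follows the on-policy distribution of visits."""
    s = start_state
    for _ in range(horizon):
        a = epsilon_greedy(s, actions)
        r, s_next, done = model(s, a)              # sample-model rollout
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        if done:
            break
        s = s_next
```

Because trajectories start from realistic start states and follow the current policy, states that are unreachable or irrelevant under good behavior simply never get backed up.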

Summary
- Discussed the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning
  - Synergy among planning, acting, and model learning
- Distribution of backups: focus of the computation
  - Prioritized sweeping
  - Trajectory sampling: backups along trajectories