ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 6: Optimality Criterion in MDPs
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011, September 8, 2011

Outline
Optimal value functions (cont.)
Implementation considerations
Optimality and approximation

Recap on Value Functions
We define the state-value function for policy π as
  V^π(s) = E_π{ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s }.
Similarly, we define the action-value function for policy π as
  Q^π(s,a) = E_π{ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s, a_t = a }.
The Bellman equation for V^π is
  V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ].
The value function V^π(s) is the unique solution to its Bellman equation.
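As a concrete illustration of the Bellman equation, here is a minimal Python sketch of iterative policy evaluation for a small, fully known MDP. The array-based representation (P, R of shape S x A x S, pi of shape S x A) is an assumption made for illustration, not something specified in the lecture:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively solve V^pi(s) = sum_a pi(s,a) sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V^pi(s')).

    P  : (S, A, S) array of transition probabilities P[s, a, s']
    R  : (S, A, S) array of expected rewards R[s, a, s']
    pi : (S, A) array of action probabilities pi(a | s)
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = (pi * Q).sum(axis=1)   # expectation over the policy's action choice
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```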

Optimal Value Functions
A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e. π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s.
There is always at least one policy (an optimal policy, denoted π*) that is better than or equal to all other policies.
All optimal policies share the same optimal state-value function V*, and they also share the same optimal action-value function Q*, defined as shown below.
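In the standard Sutton & Barto notation, these definitions read:

```latex
V^{*}(s)   = \max_{\pi} V^{\pi}(s)   \quad \text{for all } s \in \mathcal{S}
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
```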

Optimal Value Functions (cont.)
The latter gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write Q*(s,a) = E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }.
Since V* is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation; written for the optimal policy, this is called the Bellman optimality equation.
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
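In the P^a_{ss'}, R^a_{ss'} notation of Sutton & Barto's first edition (the convention this deck appears to follow), the Bellman optimality equation for V* is:

```latex
V^{*}(s) = \max_{a}\, \mathbb{E}\left\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s,\ a_t = a \right\}
         = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]
```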

Optimal Value Functions (cont.)
[figure-only slide]

Optimal Value Functions (cont.)
The Bellman optimality equation for Q* is given below.
Backup diagrams: arcs have been added at the agent's choice points to indicate that the maximum over that choice is taken, rather than the expected value under some given policy.
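In the same notation, the Bellman optimality equation for Q* is:

```latex
Q^{*}(s,a) = \mathbb{E}\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right\}
           = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \right]
```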

Optimal Value Functions (cont.)
For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy.
The Bellman optimality equation is actually a system of equations, one for each state: N equations (one per state) in N unknowns, the values V*(s). Solving it this way assumes you know the dynamics of the environment.
Once one has V*(s), it is relatively easy to determine an optimal policy: for each state there will be one or more actions at which the maximum is attained in the Bellman optimality equation, and any policy that assigns nonzero probability only to those actions is an optimal policy.
This translates to a one-step search, i.e. greedy decisions with respect to V* will be optimal. A sketch of this procedure follows.
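In practice the nonlinear system is usually solved iteratively rather than in closed form. A minimal Python sketch of tabular value iteration followed by greedy policy extraction; as above, the (S, A, S)-shaped P and R arrays are an assumed representation, not part of the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve the Bellman optimality equations for a small, known MDP.

    P : (S, A, S) array of transition probabilities P[s, a, s']
    R : (S, A, S) array of expected rewards R[s, a, s']
    Returns V* and a deterministic greedy (hence optimal) policy.
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # one-step greedy search w.r.t. V*
        V = V_new
```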

Optimal Value Functions (cont.)
With Q*, the agent does not even have to do a one-step-ahead search: for any state s, it can simply pick any action that maximizes Q*(s,a).
The action-value function effectively embeds the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair, so the agent does not need to know anything about the dynamics of the environment.
Q: What are the implementation tradeoffs here?
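For example, acting greedily from a stored Q-table needs no model at all (a short sketch; the (S, A) table layout is assumed as above):

```python
import numpy as np

def greedy_action(Q, s):
    """Return an action maximizing Q[s, a]; no environment model is required.

    Tradeoff: a Q-table costs O(|S||A|) memory versus O(|S|) for V*,
    but choosing greedily from V* would require the transition model.
    """
    return int(np.argmax(Q[s]))
```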

Implementation Considerations
Computational complexity: how complex is it to evaluate the state-value and action-value functions, in software and in hardware?
Data flow constraints: which parts of the data need to be globally vs. locally available? What is the impact of memory bandwidth limitations?

Recycling Robot Revisited
A transition graph is a useful way to summarize the dynamics of a finite MDP: a state node for each possible state, and an action node for each possible state-action pair.
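The transition graph can equivalently be written down as a table of transitions. A Python sketch following the standard Sutton & Barto recycling-robot example (the numeric values of alpha, beta, r_search, and r_wait below are illustrative only, not taken from the slides):

```python
# Dynamics as {(state, action): [(probability, next_state, reward), ...]}
alpha, beta = 0.9, 0.6         # illustrative parameter values
r_search, r_wait = 2.0, 1.0    # illustrative reward values

transitions = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # -3: rescued with dead battery
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}
```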

Bellman Optimality Equations for the Recycling Robot
To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re, respectively.
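With these abbreviations, and following the standard Sutton & Barto treatment (r^s and r^w denote the expected search and wait rewards, and -3 is the penalty for being rescued with a depleted battery), the optimality equations are:

```latex
V^{*}(h) = \max\Big\{\, r^{s} + \gamma\big[\alpha V^{*}(h) + (1-\alpha) V^{*}(l)\big],\;
                        r^{w} + \gamma V^{*}(h) \,\Big\}

V^{*}(l) = \max\Big\{\, \beta r^{s} - 3(1-\beta) + \gamma\big[(1-\beta) V^{*}(h) + \beta V^{*}(l)\big],\;
                        r^{w} + \gamma V^{*}(l),\;
                        \gamma V^{*}(h) \,\Big\}
```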

Optimality and Approximation
Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens, since it usually involves a heavy computational load. Typically, agents compute approximations to the optimal policy.
A critical aspect of the problem facing the agent is always the computational resources available to it, in particular the amount of computation it can perform in a single time step.
Practical considerations are thus:
Computational complexity
Memory available (tabular methods apply only to small state sets)
Communication overhead (for distributed implementations)
Hardware vs. software

Are Approximations Good or Bad?
RL typically relies on approximation mechanisms (covered later), and this can be an opportunity: efficient "feature-extraction" style approximation may actually reduce "noise", and it makes it practical for us to address large-scale problems.
In general, making "bad" decisions in RL results in learning opportunities (online). The online nature of RL encourages learning more effectively from events that occur frequently, a property also supported in nature. Capturing regularities is a key property of RL.