Slide 1: ECE-517 Reinforcement Learning in Artificial Intelligence
Lecture 7: Finite Horizon MDPs, Dynamic Programming
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011, September 15, 2011

Slide 2: Outline

- Finite Horizon MDPs
- Dynamic Programming

Slide 3: Finite Horizon MDPs – Value Functions

The duration, or expected duration, of the process is finite. Let's consider the following return functions:
- The expected sum of rewards (over a horizon of N steps)
- The expected discounted sum of rewards
A sufficient condition for the discounted sum to converge is a bounded reward, |r_t| < r_max (together with γ < 1). Both criteria are written out below.
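The two return criteria above, reconstructed in standard Sutton & Barto notation (the slide's original equations are not recoverable from the transcript):

```latex
% Expected sum of rewards over a finite horizon of N steps
R_t = \mathbb{E}\!\left[\sum_{k=1}^{N} r_{t+k}\right]

% Expected discounted sum of rewards, with discount factor 0 \le \gamma < 1
R_t = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\right]
```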

Slide 4: Return Functions

If |r_t| < r_max holds, then the discounted return is bounded (see below); note that this bound is very sensitive to the value of γ.
Another criterion is the expected average reward (also below). Note that the limit defining it does not always exist!
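A reconstruction of the bound and of the average-reward criterion referenced above (the original formulas appeared as images on the slide):

```latex
% Bound on the discounted return when |r_t| < r_max
\left|\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right|
  \;\le\; \sum_{k=0}^{\infty} \gamma^{k} r_{\max}
  \;=\; \frac{r_{\max}}{1-\gamma}

% Expected average reward (the limit may fail to exist)
\rho^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=1}^{N} r_{t}\right]
```

The 1/(1 - γ) factor grows without bound as γ approaches 1, which is the sensitivity the slide points out.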

Slide 5: Relationship Between the Discounted and Finite-Horizon Returns

Consider a finite horizon problem where the horizon itself is random. Let's also assume that the final value for all states is zero. Let the horizon N be geometrically distributed with parameter γ, so that the probability of stopping at the N-th step is (1-γ)γ^(N-1).
Lemma: under the assumption that |r_t| < r_max, the expected return of this random-horizon problem equals the expected discounted return (stated formally below).
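A formal statement of the lemma, reconstructed under the assumption that the geometric parameter is the discount factor γ (the slide's own formulas are not in the transcript):

```latex
% N ~ Geometric(gamma): Pr{N = n} = (1-\gamma)\,\gamma^{\,n-1}, \quad n = 1, 2, \dots
\mathbb{E}_{N}\!\left[\,\mathbb{E}\!\left[\sum_{t=1}^{N} r_{t}\right]\right]
  \;=\;
\mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{\,t-1}\, r_{t}\right]
```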

Slide 6: Relationship Between the Discounted and Finite-Horizon Returns (cont.)

Proof: a sketch of the derivation is given below. ∆
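The derivation on the original slide is not recoverable; the following is the standard argument, which exchanges the order of summation (valid since |r_t| < r_max):

```latex
\begin{aligned}
\mathbb{E}_{N}\!\left[\sum_{t=1}^{N} r_{t}\right]
  &= \sum_{n=1}^{\infty} (1-\gamma)\,\gamma^{\,n-1} \sum_{t=1}^{n} \mathbb{E}[r_{t}]
     \quad\text{(condition on } N = n\text{)} \\
  &= \sum_{t=1}^{\infty} \mathbb{E}[r_{t}] \sum_{n=t}^{\infty} (1-\gamma)\,\gamma^{\,n-1}
     \quad\text{(swap the order of summation)} \\
  &= \sum_{t=1}^{\infty} \gamma^{\,t-1}\, \mathbb{E}[r_{t}]
     \quad\text{(the inner geometric sum equals } \Pr\{N \ge t\} = \gamma^{\,t-1}\text{)} \\
  &= \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{\,t-1}\, r_{t}\right].
\end{aligned}
```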

Slide 7: Outline

- Finite Horizon MDPs (cont.)
- Dynamic Programming

Slide 8: Example of a Finite Horizon MDP

Consider the following state diagram: [state-transition diagram shown as a figure on the original slide; its states, actions, and rewards are not recoverable from this transcript]

Slide 9: Why Do We Need DP Techniques?

Explicitly solving the Bellman optimality equation is hard; computing the optimal policy this way amounts to solving the RL problem directly. Doing so relies on the following three assumptions:
- We have perfect knowledge of the dynamics of the environment
- We have enough computational resources
- The Markov property holds
In reality, all three are problematic. In the game of backgammon, for example, the first and last conditions hold, but the computational resources are insufficient: the game has roughly 10^20 states.
In many cases we have to settle for approximate solutions (much more on that later ...).

Slide 10: Big Picture – Elementary Solution Methods

During the next few weeks we'll talk about techniques for solving the RL problem:
- Dynamic programming – well developed mathematically, but requires an accurate model of the environment
- Monte Carlo methods – do not require a model, but are not suitable for step-by-step incremental computation
- Temporal-difference learning – does not need a model and is fully incremental, but is more complex to analyze; launched the revisiting of RL as a pragmatic framework (1988)
The methods also differ in efficiency and speed of convergence to the optimal solution.

Slide 11: Dynamic Programming

Dynamic programming (DP) is the collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
DP constitutes a theoretically optimal methodology; in reality its use is often limited because DP is computationally expensive.
It is still important to understand DP, since it serves as a reference point for other methods, which aim to:
- do "just as well" as DP
- require less computation
- possibly require less memory
Most schemes will strive to achieve the same effect as DP, without the computational complexity involved.

Slide 12: Dynamic Programming (cont.)

We will assume finite MDPs (finite state and action sets).
The agent has knowledge of the transition probabilities and expected immediate rewards, i.e. the quantities written out below.
The key idea of DP (as in RL) is the use of value functions to derive optimal/good policies. We'll focus on the manner by which the values are computed.
Reminder: an optimal policy is easy to derive once the optimal value function (or action-value function) is attained.
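In the textbook (Sutton & Barto) notation that the course follows, these model quantities are:

```latex
% One-step transition probabilities
\mathcal{P}^{a}_{ss'} \;=\; \Pr\{\, s_{t+1} = s' \mid s_t = s,\; a_t = a \,\}

% Expected immediate rewards
\mathcal{R}^{a}_{ss'} \;=\; \mathbb{E}\{\, r_{t+1} \mid s_t = s,\; a_t = a,\; s_{t+1} = s' \,\}
```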

Slide 13: Dynamic Programming (cont.)

Applying the Bellman equation to the optimal value/action-value function yields the Bellman optimality equations (written out below).
DP algorithms are obtained by turning these Bellman equations into update rules. The rules successively improve the approximations of the desired value functions.
We will discuss two main approaches: policy iteration and value iteration.
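The Bellman optimality equations for V* and Q*, reconstructed in the same notation (the slide showed them as images):

```latex
V^{*}(s) \;=\; \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}
               \left[ \mathcal{R}^{a}_{ss'} + \gamma\, V^{*}(s') \right]

Q^{*}(s,a) \;=\; \sum_{s'} \mathcal{P}^{a}_{ss'}
               \left[ \mathcal{R}^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right]
```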

Slide 14: Method #1 – Policy Iteration

Policy iteration is a technique for obtaining the optimal policy. It comprises two complementary steps:
- Policy evaluation – updating the value function in view of the current policy (which can be sub-optimal)
- Policy improvement – updating the policy given the current value function (which can be sub-optimal)
The process converges by "bouncing" between these two steps.

Slide 15: Policy Evaluation

We'll consider how to compute the state-value function V^π for an arbitrary policy π.
Recall the Bellman equation for V^π (written out below), which assumes that policy π is always followed.
The existence of a unique solution is guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy.
The Bellman equation translates into |S| simultaneous linear equations with |S| unknowns (the state values). Assuming we have an initial guess, we can use the Bellman equation as an update rule.
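The Bellman equation for V^π that the slide recalls, reconstructed:

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'}
                 \left[ \mathcal{R}^{a}_{ss'} + \gamma\, V^{\pi}(s') \right]
```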

Slide 16: Policy Evaluation (cont.)

We can write the iterative update rule shown below. The sequence {V_k} converges to the correct value function as k → ∞.
In each iteration all state values are updated, a.k.a. a full backup. A similar method can be applied to state-action (Q(s,a)) functions.
An underlying assumption is that all states are visited in each iteration:
- The scheme is computationally heavy
- It can be distributed, given sufficient resources (Q: how?)
"In-place" schemes use a single array and update values based on the newest available estimates:
- These also converge to the correct solution
- The order in which states are backed up determines the rate of convergence
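The iterative update the slide refers to, reconstructed; each sweep applies it to every state:

```latex
V_{k+1}(s) \;=\; \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'}
                 \left[ \mathcal{R}^{a}_{ss'} + \gamma\, V_{k}(s') \right]
\qquad \text{for all } s \in \mathcal{S}
```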

Slide 17: Iterative Policy Evaluation Algorithm

A key consideration is the termination condition. A typical stopping condition for iterative policy evaluation is to end when the largest value change in a sweep falls below a small threshold θ, i.e. max_s |V_{k+1}(s) - V_k(s)| < θ (see the sketch below).
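The algorithm box on the original slide is not in the transcript; the following is a minimal Python sketch of iterative policy evaluation, assuming a model stored as dictionaries P[s][a] of (probability, next_state, reward) triples. The names policy_evaluation, P, and theta are illustrative, not from the course materials:

```python
def policy_evaluation(policy, P, states, gamma=0.9, theta=1e-6):
    """Iteratively evaluate V^pi for a given policy.

    policy[s][a] -- probability of taking action a in state s
    P[s][a]      -- list of (prob, next_state, reward) triples (assumed model format)
    """
    V = {s: 0.0 for s in states}          # initial guess V_0 = 0
    while True:
        delta = 0.0
        for s in states:                  # full sweep over the state set
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward in P[s][a]:
                    v_new += pi_sa * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                  # "in-place" update with the newest estimates
        if delta < theta:                 # stop when the largest change is below theta
            return V
```

The in-place update (writing into V during the sweep) corresponds to the "in-place" scheme mentioned on Slide 16; a two-array variant would copy V before each sweep.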

Slide 18: Policy Improvement

Policy evaluation deals with finding the value function under a given policy. However, we don't know whether that policy (and hence its value function) is optimal.
Policy improvement addresses this: it updates the policy whenever the current values reveal it to be non-optimal.
Suppose that for some arbitrary policy π we've computed the value function V^π (using policy evaluation). Let policy π' be defined such that in each state s it selects the action a that maximizes the first-step value (see below).
It can be shown that π' is at least as good as π, and if they are equally good then both are optimal policies.
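The greedy-policy definition the slide points to, reconstructed in the same notation:

```latex
\pi'(s) \;=\; \arg\max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}
              \left[ \mathcal{R}^{a}_{ss'} + \gamma\, V^{\pi}(s') \right]
         \;=\; \arg\max_{a}\, Q^{\pi}(s,a)
```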

Slide 19: Policy Improvement (cont.)

Consider the greedy policy π' that selects the action yielding the highest expected single-step return. Then, by definition, Q^π(s, π'(s)) ≥ V^π(s) for all states s; this is the condition of the policy improvement theorem.
The theorem states that if following the new policy for one step is at least as good in every state, then π' is at least as good a policy overall, i.e. V^π'(s) ≥ V^π(s) for all s (see below).
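The policy improvement theorem, stated explicitly (a reconstruction of the slide's formulas):

```latex
\text{If } Q^{\pi}\!\bigl(s, \pi'(s)\bigr) \;\ge\; V^{\pi}(s)
\quad \text{for all } s \in \mathcal{S},
\qquad \text{then} \qquad
V^{\pi'}(s) \;\ge\; V^{\pi}(s) \quad \text{for all } s \in \mathcal{S}.
```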

Slide 20: Proof of the Policy Improvement Theorem
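The derivation on the original slide is not recoverable from the transcript; the following is the standard argument (as in Sutton & Barto), obtained by repeatedly expanding V^π(s) ≤ Q^π(s, π'(s)):

```latex
\begin{aligned}
V^{\pi}(s) &\le Q^{\pi}\!\bigl(s, \pi'(s)\bigr) \\
  &= \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right] \\
  &\le \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma\, Q^{\pi}\!\bigl(s_{t+1}, \pi'(s_{t+1})\bigr) \mid s_t = s \right] \\
  &= \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} V^{\pi}(s_{t+2}) \mid s_t = s \right] \\
  &\;\;\vdots \\
  &\le \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s \right]
   \;=\; V^{\pi'}(s).
\end{aligned}
```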

Slide 21: Policy Iteration
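The algorithm box on this slide is likewise not in the transcript; below is a minimal Python sketch of policy iteration that alternates the two steps discussed above, reusing the hypothetical policy_evaluation helper and model format from the Slide 17 sketch:

```python
def policy_iteration(P, states, actions, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    P[s][a] -- list of (prob, next_state, reward) triples (assumed model format)
    """
    # Start from an arbitrary deterministic policy (first action in every state)
    policy = {s: {actions[0]: 1.0} for s in states}
    while True:
        # 1. Policy evaluation (sketch from Slide 17)
        V = policy_evaluation(policy, P, states, gamma, theta)

        # 2. Policy improvement: act greedily with respect to V
        stable = True
        for s in states:
            q = {a: sum(prob * (reward + gamma * V[s_next])
                        for prob, s_next, reward in P[s][a])
                 for a in actions}
            best_a = max(q, key=q.get)
            if best_a not in policy[s]:   # greedy action changed in this state
                stable = False
            policy[s] = {best_a: 1.0}

        if stable:                        # no state changed its greedy action
            return policy, V
```

Each outer iteration strictly improves the policy (by the policy improvement theorem), and with finitely many deterministic policies the loop terminates with an optimal policy.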