1 Markov Decision Processes: Infinite Horizon Problems. Alan Fern. (Based in part on slides by Craig Boutilier and Daniel Weld.)

2 What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S, A, R, T)
  Output: a policy that achieves an “optimal value”
- This depends on how we define the value of a policy
- There are several choices, and the solution algorithms depend on the choice
- We will consider two common choices:
  - Finite-Horizon Value
  - Infinite-Horizon Discounted Value

3 Discounted Infinite Horizon MDPs
- Defining value as total reward is problematic with infinite horizons
  - many or all policies have infinite expected reward
  - some MDPs are OK (e.g., zero-cost absorbing states)
- “Trick”: introduce a discount factor 0 ≤ β < 1
  - future rewards are discounted by β per time step
- Note: with discounting, the value of every policy is bounded (see the derivation below)
- Motivation: economic? probability of death? convenience?
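The slide's equation image is not in the transcript; as a sketch of the “Bounded Value” claim, writing the reward as R(s) with R_max = max_s |R(s)| (a notational assumption), the discounted value of any policy π satisfies:

```latex
V^{\pi}(s) \;=\; E\!\left[\sum_{t=0}^{\infty}\beta^{t}R(s_t)\,\middle|\,\pi,\,s_0=s\right]
\;\le\; \sum_{t=0}^{\infty}\beta^{t}R_{\max}
\;=\; \frac{R_{\max}}{1-\beta}
```

and symmetrically V^π(s) ≥ −R_max/(1−β), so values are finite whenever β < 1.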

5 Notes: Discounted Infinite Horizon
- Optimal policies are guaranteed to exist (Howard, 1960)
  - i.e., there is a policy that maximizes value at each state
- Furthermore, there is always an optimal stationary policy
  - Intuition: why would we change the action taken at s at a later time, when the same infinite future still lies ahead?
- We define V* to be the optimal value function.
  - That is, V*(s) = V^π(s) for some optimal stationary π

6 Policy Evaluation
- Value equation for a fixed policy π:
    V^π(s) = R(s) + β Σ_{s'} T(s, π(s), s') V^π(s')
  i.e., the immediate reward plus the discounted expected value of following the policy in the future
- The equation can be derived from the original definition of infinite-horizon discounted value

7 Policy Evaluation
- Value equation for a fixed policy (as above)
- How can we compute the value function for a fixed policy?
  - we are given R and T
  - it is a linear system with n variables and n constraints
    - variables are the values of the states: V(s1), …, V(sn)
    - constraints: one value equation (above) per state
  - use linear algebra to solve for V (e.g., matrix inverse)

9 Policy Evaluation via Matrix Inverse
- V^π and R are n-dimensional column vectors (one element for each state)
- T^π is an n×n matrix such that T^π(i, j) = T(s_i, π(s_i), s_j)
- Then V^π = R + β T^π V^π, so V^π = (I − β T^π)^{-1} R
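A minimal NumPy sketch of exact policy evaluation under the conventions above (the function name, and the assumption that R is a length-n vector and T_pi an n×n matrix, are mine, not the slides'):

```python
import numpy as np

def evaluate_policy(R, T_pi, beta):
    """Exact policy evaluation: solve V = R + beta * T_pi @ V.

    R    : length-n reward vector, R[i] = reward in state s_i
    T_pi : n x n matrix, T_pi[i, j] = T(s_i, pi(s_i), s_j)
    beta : discount factor, 0 <= beta < 1
    """
    n = len(R)
    # Solve (I - beta * T_pi) V = R; a linear solve is cheaper and more
    # numerically stable than forming the matrix inverse explicitly.
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)
```

A dense linear solve like this costs O(n^3) time in the number of states, which is the per-iteration cost of policy evaluation referenced later when comparing VI and PI.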

10 Computing an Optimal Value Function
- Bellman equation for the optimal value function:
    V*(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V*(s')
  i.e., the immediate reward plus the discounted expected value of the best action, assuming we get optimal value in the future
- Bellman proved this is always true for an optimal value function

12 Computing an Optimal Value Function
- Bellman equation for the optimal value function (as above)
- How can we solve this equation for V*?
  - The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation
- Idea: let’s pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration
  - Adjust the Bellman backup to take discounting into account.

Bellman Backups (Revisited)
[Figure: backup diagram. From state s, for each action (a1, a2) compute the expectation of V^k over the successor states (s1, s2, s3, s4); then compute the max over actions to obtain V^{k+1}(s).]
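A per-state sketch of that backup in NumPy (array shapes and names are my assumptions: T has shape (A, n, n) with T[a, i, j] = T(s_i, a, s_j), and R is a length-n reward vector):

```python
import numpy as np

def bellman_backup(s, V, R, T, beta):
    """One Bellman backup at state s, mirroring the diagram.

    V : current value function V^k, length-n array
    R : length-n reward vector
    T : shape (A, n, n) array, T[a, i, j] = T(s_i, a, s_j)
    """
    # "Compute Expectations": expected value of V^k over successors, one per action
    expectations = T[:, s, :] @ V          # shape (A,)
    # "Compute Max": best action's expectation, plus immediate reward, discounted
    return R[s] + beta * np.max(expectations)
```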

14 Value Iteration
- Can compute an optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but include the discount term):
    V^0(s) = 0  (or any initial guess)
    V^{k+1}(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V^k(s')
- Will it converge to the optimal value function as k gets large?
  - Yes.
  - Why?
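A vectorized sketch of the full loop, using the same array conventions as above and folding in the max-norm stopping test discussed on a later slide (the function name and eps threshold are mine):

```python
import numpy as np

def value_iteration(R, T, beta, eps=1e-6):
    """Discounted value iteration.

    R : length-n reward vector; T : shape (A, n, n) transition array;
    beta : discount factor in [0, 1); eps : stopping threshold.
    """
    V = np.zeros(len(R))
    while True:
        # Bellman backup applied to every state at once:
        # V_new(s) = R(s) + beta * max_a sum_{s'} T(s, a, s') V(s')
        V_new = R + beta * np.max(T @ V, axis=0)
        if np.max(np.abs(V_new - V)) <= eps:   # ||V^{k+1} - V^k|| <= eps
            return V_new
        V = V_new
```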

15 Convergence of Value Iteration
- Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup:
    B[V](s) = R(s) + β max_a Σ_{s'} T(s, a, s') V(s')
- Value iteration is just the iterative application of B:
    V^0 = an arbitrary initial value function
    V^{k+1} = B[V^k]

16 Convergence: Fixed Point Property
- Bellman equation for the optimal value function (as above)
- Fixed Point Property: the optimal value function V* is a fixed point of the Bellman backup operator B.
  - That is, B[V*] = V*

17 Convergence: Contraction Property
- Let ||V|| denote the max-norm of V, which returns the maximum element of the vector.
  - E.g., a vector whose largest element is 100 has max-norm 100
- B is a contraction operator with respect to the max-norm
  - For any V and V’:  ||B[V] − B[V’]|| ≤ β ||V − V’||
  - That is, applying B to any two value functions causes them to get closer together in the max-norm sense!
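The slides state this property without proof; a short standard argument (not from the slides) uses the fact that the max over actions is non-expansive and that transition probabilities sum to one:

```latex
\bigl|B[V](s)-B[V'](s)\bigr|
  \;=\; \beta\,\Bigl|\max_a \textstyle\sum_{s'} T(s,a,s')\,V(s') \;-\; \max_a \textstyle\sum_{s'} T(s,a,s')\,V'(s')\Bigr|
  \;\le\; \beta\,\max_a \textstyle\sum_{s'} T(s,a,s')\,\bigl|V(s')-V'(s')\bigr|
  \;\le\; \beta\,\|V-V'\|
```

Taking the max over s on the left-hand side gives ||B[V] − B[V’]|| ≤ β ||V − V’||.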

18 Convergence
- Using the properties of B we can prove convergence of value iteration.
- Proof:
  1. For any V: ||V* − B[V]|| = ||B[V*] − B[V]|| ≤ β ||V* − V||
  2. So applying a Bellman backup to any value function V brings us closer to V* by a constant factor β:
       ||V* − V^{k+1}|| = ||V* − B[V^k]|| ≤ β ||V* − V^k||
  3. This means that ||V^k − V*|| ≤ β^k ||V* − V^0||
  4. Thus lim_{k→∞} ||V^k − V*|| = 0, i.e., V^k converges to V*

20 Value Iteration: Stopping Condition
- Want to stop when we can guarantee the value function is near optimal.
- Key property (not hard to prove): if ||V^k − V^{k−1}|| ≤ ε then ||V^k − V*|| ≤ εβ/(1−β)
- Continue iterating until ||V^k − V^{k−1}|| ≤ ε
- Select ε small enough for the desired error guarantee
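A worked instance of the bound, with illustrative numbers that are not from the slides:

```latex
\beta = 0.9,\ \varepsilon = 0.01:\qquad
\|V^k - V^*\| \;\le\; \frac{\varepsilon\beta}{1-\beta} \;=\; \frac{0.01 \times 0.9}{0.1} \;=\; 0.09
```

So stopping when successive iterates differ by at most 0.01 guarantees the value estimate is within 0.09 of optimal.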

21 How to Act
- Given a V^k from value iteration that closely approximates V*, what should we use as our policy?
- Use the greedy policy (one-step lookahead):
    greedy[V^k](s) = argmax_a Σ_{s'} T(s, a, s') V^k(s')
- Note that the value of the greedy policy may not be equal to V^k
  - Why?
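A NumPy sketch of greedy policy extraction under the same conventions as before; since the reward is written here as a function of the state only, it does not affect the argmax (shapes and the function name are again my assumptions):

```python
import numpy as np

def greedy_policy(V, T):
    """One-step-lookahead (greedy) policy for a value function V.

    T : shape (A, n, n) array, T[a, i, j] = T(s_i, a, s_j)
    Returns a length-n array pi with pi[s] = argmax_a sum_{s'} T(s, a, s') V(s').
    """
    expected = T @ V                 # shape (A, n): expected next value per action, per state
    return np.argmax(expected, axis=0)
```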

23 How to Act
- Use the greedy policy (one-step lookahead), as above
- We care about the value of the greedy policy, V^g
  - This is how good the policy will be in practice.
- How close is V^g to V*?

24 Value of Greedy Policy
- Define V^g to be the value of this greedy policy
  - This is likely not the same as V^k
- Property: if ||V^k − V*|| ≤ λ then ||V^g − V*|| ≤ 2λβ/(1−β)
  - Thus, V^g is not too far from optimal if V^k is close to optimal
  - Set the stopping condition so that V^g has the desired accuracy
- Furthermore, there is a finite k such that the greedy policy is optimal
  - That is, even if the value estimate is off, the greedy policy is optimal once the estimate is close enough. Why?
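Chaining this property with the earlier stopping condition gives an end-to-end guarantee; the combination is implied by the slides rather than stated explicitly:

```latex
\|V^k - V^{k-1}\| \le \varepsilon
\;\Longrightarrow\;
\|V^k - V^*\| \le \lambda = \frac{\varepsilon\beta}{1-\beta}
\;\Longrightarrow\;
\|V^g - V^*\| \le \frac{2\lambda\beta}{1-\beta} = \frac{2\varepsilon\beta^{2}}{(1-\beta)^{2}}
```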

25 Optimization via Policy Iteration
- Recall: given a policy, we can compute its value exactly (policy evaluation)
- Policy iteration exploits this: it alternates steps of policy evaluation and policy improvement
  1. Choose a random policy π
  2. Loop:
     (a) Evaluate V^π
     (b) For each s in S, set π’(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s')   [policy improvement]
     (c) Replace π with π’
     Until no improving action is possible at any state
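A compact NumPy sketch of the loop, reusing the exact-evaluation and greedy-improvement steps sketched above (array conventions, names, and the lowest-index tie-breaking of np.argmax are my choices, not the slides'):

```python
import numpy as np

def policy_iteration(R, T, beta):
    """Policy iteration: alternate exact evaluation and greedy improvement.

    R : length-n reward vector; T : shape (A, n, n) transition array;
    beta : discount factor in [0, 1).
    """
    n = T.shape[1]
    pi = np.zeros(n, dtype=int)                  # start from an arbitrary policy
    while True:
        # (a) Policy evaluation: solve V = R + beta * T_pi V exactly
        T_pi = T[pi, np.arange(n), :]            # n x n matrix induced by pi
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)
        # (b) Policy improvement: greedy one-step lookahead on V^pi
        pi_new = np.argmax(T @ V, axis=0)
        # (c) Stop when no state changes its action
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```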

26 Policy Iteration: Convergence
- Policy improvement guarantees that π’ is no worse than π. Further, if π is not optimal, then π’ is strictly better in at least one state.
  - Local improvements lead to global improvement!
  - For a proof sketch see …; I’ll walk you through a proof in a HW problem
- Convergence assured
  - No local maxima in value space (i.e., an optimal policy exists)
  - Since there are a finite number of policies and each step improves value, PI must converge to an optimal policy
- Gives the exact value of the optimal policy

27 Policy Iteration Complexity
- Each iteration runs in polynomial time in the number of states and actions
- There are at most |A|^n policies and PI never repeats a policy
  - So at most an exponential number of iterations
  - Not a very good complexity bound
- Empirically O(n) iterations are required
  - Challenge: try to generate an MDP that requires more than n iterations
- Still no polynomial bound on the number of PI iterations (open problem)!
  - But maybe not anymore …

28 Value Iteration vs. Policy Iteration
- Which is faster, VI or PI?
  - It depends on the problem
- VI takes more iterations than PI, but PI requires more time per iteration
  - PI must perform policy evaluation on each iteration, which involves solving a linear system
- VI is easier to implement since it does not require the policy evaluation step
- We will see that both algorithms serve as inspiration for more advanced algorithms

29 Recap: things you should know
- What is an MDP?
- What is a policy?
  - Stationary and non-stationary
- What is a value function?
  - Finite-horizon and infinite-horizon
- How to evaluate policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
- How to optimize policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
  - Why are they correct?