Comparison: Value Iteration vs. Policy Iteration

Presentation transcript:

Comparison: Value Iteration vs. Policy Iteration (MDPs)

Value iteration. Two important properties of contractions:
(i) A contraction has only one fixed point; if there were two fixed points, they would not get closer together when the function was applied, so it would not be a contraction.
(ii) When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.
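As a quick illustration (added here, not on the slide): the map B(x) = x/2 on the real numbers is a contraction with the single fixed point 0, and repeated application drives any starting value to that fixed point:

\lvert B(x) - B(y) \rvert = \tfrac{1}{2}\,\lvert x - y \rvert ,
\qquad B(0) = 0 ,
\qquad B^{n}(x) = x / 2^{n} \to 0 \ \text{as } n \to \infty .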

Value iteration. Let Ui denote the vector of utilities for all the states at the i-th iteration, and write B for the Bellman update operator, so that Ui+1 ← B Ui. Then the Bellman update equation can be written as

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_i(s') .
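As a rough sketch (not part of the slides), the update above can be written in a few lines of Python. The state names, transition table P, rewards R, and discount gamma below are made-up toy values, with P[s][a] holding (next_state, probability) pairs.

gamma = 0.9
R = {"s0": 0.0, "s1": 1.0}
P = {
    "s0": {"stay": [("s0", 0.8), ("s1", 0.2)], "go": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}

def bellman_update(U):
    # One synchronous Bellman update: U'(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
    return {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in P[s][a]) for a in P[s]
        )
        for s in P
    }

U = {s: 0.0 for s in P}   # U_0 = 0
U = bellman_update(U)     # U_1 = B U_0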

Value iteration. We use the max norm, which measures the length of a vector by the absolute value of its biggest component:

\lVert U \rVert = \max_s \lvert U(s) \rvert .

Let Ui and Ui' be any two utility vectors. Then we have

\lVert B\,U_i - B\,U_i' \rVert \le \gamma \,\lVert U_i - U_i' \rVert ,        (17.7)

that is, the Bellman update B is a contraction by a factor of γ on the space of utility vectors.

Value iteration. In particular, we can replace Ui' in Equation (17.7) with the true utilities U, for which B U = U. Then we obtain the inequality

\lVert B\,U_i - U \rVert \le \gamma \,\lVert U_i - U \rVert ,

so the error is reduced by a factor of at least γ on each iteration. The figure on the next slide (AIMA Figure 17.5(b)) shows how N, the number of iterations required to reach an error of at most ε, varies with γ for different values of the ratio ε/R_max.
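For context (this step follows the standard AIMA argument and was not spelled out on the slide): since utilities are bounded by ±R_max/(1−γ), the initial error is at most 2R_max/(1−γ) and shrinks by a factor of γ per iteration, which gives the number of iterations N needed to reach an error of at most ε:

\gamma^{N} \cdot \frac{2 R_{\max}}{1-\gamma} \le \epsilon
\quad\Longrightarrow\quad
N = \left\lceil \frac{\log\bigl( 2 R_{\max} / (\epsilon (1-\gamma)) \bigr)}{\log(1/\gamma)} \right\rceil .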

Value iteration. [Figure omitted: how the number of iterations N varies with γ; see AIMA Figure 17.5(b).]

Value iteration. From the contraction property (Equation (17.7)), it can be shown that if the update is small (i.e., no state's utility changes by much), then the error, compared with the true utility function, also is small. More precisely,

if \ \lVert U_{i+1} - U_i \rVert < \epsilon (1-\gamma)/\gamma \ then \ \lVert U_{i+1} - U \rVert < \epsilon .
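Putting the update and this termination test together, a minimal value-iteration loop might look like the sketch below (illustrative only; P, R, and gamma use the same made-up format as in the earlier sketch).

def value_iteration(P, R, gamma, epsilon=1e-4):
    # P[s][a] = [(next_state, prob), ...], R[s] = reward, 0 < gamma < 1.
    U = {s: 0.0 for s in P}
    while True:
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[s][a]) for a in P[s]
            )
            for s in P
        }
        delta = max(abs(U_next[s] - U[s]) for s in P)  # ||U_{i+1} - U_i|| in the max norm
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:      # termination test from this slide
            return U                                   # then ||U - U_true|| < epsilon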

Value iteration: policy loss. U^{πi}(s) is the utility obtained if πi is executed starting in s. The policy loss \lVert U^{\pi_i} - U \rVert is the most the agent can lose by executing πi instead of the optimal policy π*.

Value iteration. The policy loss of πi is connected to the error in Ui by the following inequality:

if \ \lVert U_i - U \rVert < \epsilon \ then \ \lVert U^{\pi_i} - U \rVert < 2\epsilon\gamma/(1-\gamma) .

Policy iteration. The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0:
Policy evaluation: given a policy πi, calculate Ui = U^{πi}, the utility of each state if πi were to be executed.
Policy improvement: calculate a new MEU policy πi+1, using one-step look-ahead based on Ui (as in Equation (17.4)).
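As an illustration (not from the slides), here is a minimal Python sketch of these two alternating steps; it reuses the made-up P, R, gamma format from the value-iteration sketches and approximates policy evaluation with repeated fixed-policy updates rather than an exact solve.

def q_value(U, s, a, P, R, gamma):
    # Expected utility of doing action a in state s and then following utilities U.
    return R[s] + gamma * sum(p * U[s2] for s2, p in P[s][a])

def evaluate_policy(pi, P, R, gamma, sweeps=50):
    # Approximate policy evaluation: repeated simplified Bellman updates for a fixed pi.
    U = {s: 0.0 for s in P}
    for _ in range(sweeps):
        U = {s: q_value(U, s, pi[s], P, R, gamma) for s in P}
    return U

def policy_iteration(P, R, gamma):
    pi = {s: next(iter(P[s])) for s in P}          # arbitrary initial policy pi_0
    while True:
        U = evaluate_policy(pi, P, R, gamma)       # policy evaluation
        new_pi = {                                 # policy improvement: one-step MEU look-ahead
            s: max(P[s], key=lambda a: q_value(U, s, a, P, R, gamma))
            for s in P
        }
        if new_pi == pi:                           # policy unchanged, so it is optimal
            return pi, U
        pi = new_pi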

Policy iteration. [Slide figure omitted.]

Policy iteration. For n states, we have n linear equations with n unknowns, which can be solved exactly in time O(n^3) by standard linear-algebra methods. For large state spaces, O(n^3) time might be prohibitive; modified policy iteration instead performs some number of simplified value-iteration steps (simplified because the policy is fixed). The simplified Bellman update for this process is

U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s)) \, U_i(s') .
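To make the O(n^3) exact-evaluation step concrete, here is a small NumPy sketch (illustrative, not from the slides): for a fixed policy πi, the evaluation equations U = R + γ P_π U are solved directly, where P_pi is a made-up matrix with P_pi[s, s'] = P(s' | s, πi(s)).

import numpy as np

def exact_policy_evaluation(P_pi, R, gamma):
    # Solve U = R + gamma * P_pi @ U exactly, i.e. (I - gamma * P_pi) U = R  (O(n^3)).
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Toy example with made-up numbers: two states under some fixed policy.
P_pi = np.array([[0.8, 0.2],
                 [0.0, 1.0]])
R = np.array([0.0, 1.0])
U_pi = exact_policy_evaluation(P_pi, R, gamma=0.9)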

Policy iteration In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration. 13