Markov Systems, Markov Decision Processes, and Dynamic Programming

Markov Systems, Markov Decision Processes, and Dynamic Programming
Prediction and Search in Probabilistic Worlds

Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm | awm@cs.cmu.edu | 412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2002, 2004, Andrew W. Moore. April 21st, 2002

Discounted Rewards
An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + … = Infinity
What's wrong with this argument?

Discounted Rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
- Because of the chance of obliteration
- Because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s Future Discounted Sum of Rewards?
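Under that assumption the stream of 20K-per-year payments no longer sums to infinity; as a quick worked check using the geometric series:

$$ 20 + 0.9\cdot 20 + 0.9^2\cdot 20 + \cdots \;=\; \frac{20}{1-0.9} \;=\; 200 \quad\text{(i.e., \$200K)}. $$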

Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is
(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + … (an infinite sum)
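A minimal sketch of this definition in code; the reward stream and horizon below are illustrative, not taken from the slides:

```python
def discounted_sum(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite stream of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: 20K per year for 40 years, discount factor 0.9.
# The result approaches 20 / (1 - 0.9) = 200 as the horizon grows.
print(discounted_sum([20] * 40, 0.9))
```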

The Academic Life
[State diagram] Five states with per-year rewards: A. Assistant Prof (+20), B. Assoc. Prof (+60), T. Tenured Prof (+400), S. On the Street (+10), D. Dead; the arcs between them are labelled with transition probabilities (0.2, 0.3, 0.6, 0.7, …).
Assume discount factor γ = 0.9.
Define:
J_A = expected discounted future rewards starting in state A
J_B = expected discounted future rewards starting in state B
J_T, J_S, J_D = the same for states T, S, and D
How do we compute J_A, J_B, J_T, J_S, J_D?

A Markov System with Rewards…
- Has a set of states {S_1, S_2, …, S_N}
- Has a transition probability matrix P = [P_ij], an N×N matrix with P_ij = Prob(Next = S_j | This = S_i)
- Each state has a reward {r_1, r_2, …, r_N}
- There's a discount factor γ, with 0 < γ < 1
On each time step:
0. Assume your state is S_i
1. You are given reward r_i
2. You randomly move to another state, with P(NextState = S_j | This = S_i) = P_ij
3. All future rewards are discounted by γ
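To make the definition concrete, here is one way to hold a Markov system with rewards in Python. This is a sketch with placeholder numbers (a hypothetical three-state system), not one of the slides' examples:

```python
import random

states = ["S1", "S2", "S3"]
P = [[0.5, 0.5, 0.0],    # P[i][j] = Prob(Next = S_j | This = S_i); each row sums to 1
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
r = [1.0, 0.0, -2.0]     # reward received in each state
gamma = 0.9              # discount factor, 0 < gamma < 1

def step(i):
    """Sample the index of the next state from row i of the transition matrix."""
    return random.choices(range(len(states)), weights=P[i])[0]

def simulate(i, n_steps):
    """Discounted sum of rewards from one random n_steps-long run starting in state i."""
    total, discount = 0.0, 1.0
    for _ in range(n_steps):
        total += discount * r[i]
        discount *= gamma
        i = step(i)
    return total
```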

Solving a Markov System
Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then
J*(S_i) = r_i + γ × (expected future rewards starting from your next state)
        = r_i + γ ( P_i1 J*(S_1) + P_i2 J*(S_2) + ··· + P_iN J*(S_N) )
Using vector notation, write J = (J*(S_1), …, J*(S_N))ᵀ, R = (r_1, …, r_N)ᵀ, and let P be the N×N transition matrix.
Question: can you invent a closed-form expression for J in terms of R, P and γ?

Solving a Markov System with Matrix Inversion
Upside: you get an exact answer.
Downside: if you have 100,000 states, you're solving a 100,000 by 100,000 system of equations.
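Stacking the equations J*(S_i) = r_i + γ(P_i1 J*(S_1) + ··· + P_iN J*(S_N)) into vector form gives J = R + γPJ, hence (I − γP)J = R and J = (I − γP)⁻¹R: the matrix inversion referred to above. A sketch with NumPy, again using placeholder numbers rather than one of the slides' examples:

```python
import numpy as np

def solve_markov_system(P, r, gamma):
    """Exact solution of J = R + gamma * P J, i.e. J = (I - gamma*P)^(-1) R."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Hypothetical three-state system (same placeholder numbers as the sketch above).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
r = [1.0, 0.0, -2.0]
print(solve_markov_system(P, r, 0.9))
```

In practice one calls a linear solver rather than forming the inverse explicitly; either way, the cost grows roughly cubically with the number of states, which is the downside noted above.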

Value Iteration: another way to solve a Markov System
Define:
J1(Si) = expected discounted sum of rewards over the next 1 time step
J2(Si) = expected discounted sum of rewards during the next 2 steps
J3(Si) = expected discounted sum of rewards during the next 3 steps
:
Jk(Si) = expected discounted sum of rewards during the next k steps
J1(Si) = ri (what?)
J2(Si) = (what?)
Jk+1(Si) = (what?)
(N = number of states)

Let's do Value Iteration
[State diagram] A three-state Markov system with γ = 0.5: SUN (reward +4), WIND (reward 0), HAIL (reward -8); every transition arc has probability 1/2.

k    Jk(SUN)   Jk(WIND)   Jk(HAIL)
1      4          0         -8
2      5         -1        -10
3      5         -1.25     -10.75
4      4.94      -1.44     -11
5      4.88      -1.52     -11.11
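As a sanity check on the second row of the table, assuming the usual arc structure for this example (from SUN the next state is SUN or WIND, from WIND it is SUN or HAIL, and from HAIL it is WIND or HAIL, each with probability 1/2):

$$ J^2(\mathrm{WIND}) = 0 + 0.5\left(\tfrac{1}{2}\cdot 4 + \tfrac{1}{2}\cdot(-8)\right) = -1, \qquad J^2(\mathrm{HAIL}) = -8 + 0.5\left(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot(-8)\right) = -10, $$

matching the table.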

Value Iteration for solving Markov Systems
Compute J1(Si) for each i
Compute J2(Si) for each i
:
Compute Jk(Si) for each i
As k → ∞, Jk(Si) → J*(Si). Why?
When to stop? When max_i | Jk+1(Si) − Jk(Si) | < ξ
This is faster than matrix inversion if the transition matrix is sparse.
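A sketch of this loop in Python, using the same (P, r, gamma) representation as the earlier sketch; the threshold xi plays the role of ξ:

```python
import numpy as np

def value_iterate_markov_system(P, r, gamma, xi=1e-6, max_iters=10000):
    """Iterate J_{k+1} = r + gamma * P @ J_k until the largest per-state change is below xi."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    J = r.copy()                       # J1(S_i) = r_i
    for _ in range(max_iters):
        J_next = r + gamma * (P @ J)   # one backup over all states at once
        if np.max(np.abs(J_next - J)) < xi:
            return J_next
        J = J_next
    return J
```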

A Markov Decision Process (γ = 0.9)
You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).
[State diagram] Four states: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10). Each state offers the two actions S and A, and each action's outgoing arcs are labelled with transition probabilities of 1 or 1/2.

Markov Decision Processes
An MDP has…
- A set of states {s_1, …, s_N}
- A set of actions {a_1, …, a_M}
- A set of rewards {r_1, …, r_N} (one for each state)
- A transition probability function P^k_ij = Prob(Next = s_j | This = s_i and I use action a_k)
On each step:
0. Call the current state S_i
1. Receive reward r_i
2. Choose an action ∈ {a_1, …, a_M}
3. If you choose action a_k, you'll move to state S_j with probability P^k_ij
4. All future rewards are discounted by γ
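One concrete way to hold such an MDP in Python. The states, actions, rewards, and discount factor below are those of the startup example; the transition table is only an illustrative placeholder, since the exact arcs are hard to read off the transcript:

```python
# T[s][a] is a list of (probability, next_state) pairs that sums to 1.
states = ["PU", "PF", "RU", "RF"]
actions = ["S", "A"]                     # Save or Advertise
reward = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
gamma = 0.9

# Hypothetical transition structure, for illustration only.
T = {
    "PU": {"S": [(1.0, "PU")],              "A": [(0.5, "PU"), (0.5, "PF")]},
    "PF": {"S": [(0.5, "PU"), (0.5, "RF")], "A": [(1.0, "PF")]},
    "RU": {"S": [(0.5, "PU"), (0.5, "RU")], "A": [(0.5, "PU"), (0.5, "PF")]},
    "RF": {"S": [(0.5, "RU"), (0.5, "RF")], "A": [(1.0, "PF")]},
}
```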

A Policy
A policy is a mapping from states to actions.
[Example] Two policies for the startup MDP are shown as STATE → ACTION tables (Policy Number 1 and Policy Number 2), together with the transition diagrams they induce.
- How many possible policies are there in our example?
- Which of the above two policies is best?
- How do you compute the optimal policy?
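Once a policy is fixed there is no action choice left, so the MDP collapses to a plain Markov system, which we already know how to evaluate. A sketch that reuses the T, reward, and gamma containers from the previous snippet; the policy shown is hypothetical. (With 4 states and 2 actions there are 2^4 = 16 deterministic policies.)

```python
def evaluate_policy(policy, T, reward, gamma, xi=1e-6):
    """Value of every state under a fixed policy, by iterating on the induced Markov system."""
    J = {s: reward[s] for s in T}
    while True:
        J_next = {s: reward[s] + gamma * sum(p * J[s2] for p, s2 in T[s][policy[s]])
                  for s in T}
        if max(abs(J_next[s] - J[s]) for s in T) < xi:
            return J_next
        J = J_next

policy_1 = {"PU": "S", "PF": "A", "RU": "S", "RF": "A"}   # hypothetical example policy
print(evaluate_policy(policy_1, T, reward, gamma))
```

Comparing the returned values state by state gives a direct way to compare two candidate policies.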

Interesting Fact
For every MDP there exists an optimal policy: a policy such that, for every possible start state, there is no better option than to follow the policy.

Computing the Optimal Policy
Idea One: run through all possible policies and select the best.
What's the problem?

Optimal Value Function
Define J*(Si) = expected discounted future rewards, starting from state Si, assuming we use the optimal policy.
[Example MDP] Three states S1 (+0), S2 (+3), S3 (+2), each with actions A and B; the arcs carry transition probabilities of 1, 1/2, or 1/3. Assume γ = 0.9.
Question: what (by inspection) is an optimal policy for this MDP?
What is J*(S1)? What is J*(S2)? What is J*(S3)?

Computing the Optimal Value Function with Value Iteration
Define Jk(Si) = the maximum possible expected sum of discounted rewards I can get if I start at state Si and I live for k time steps.
Note:
1. Without a policy (the plain Markov system) we had Jk(Si) = expected discounted sum of rewards during the next k steps.
2. J1(Si) = ri
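Combining this definition with the MDP dynamics gives the standard backup (writing P^a_ij for the probability of moving from S_i to S_j when action a is chosen):

$$ J^{k+1}(S_i) \;=\; \max_{a} \Bigl[\, r_i + \gamma \sum_{j=1}^{N} P^{a}_{ij}\, J^{k}(S_j) \Bigr], \qquad J^{1}(S_i) = r_i . $$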

Let's compute Jk(Si) for our example

k    Jk(PU)   Jk(PF)   Jk(RU)   Jk(RF)
1      0        0       10       10
2      0        4.5     14.5     19
3      2.03     6.53    25.08    18.55
4      3.852   12.20    29.63    19.26
5      7.22    15.07    32.00    20.40
6     10.03    17.65    33.58    22.43

Bellman's Equation: Value Iteration for solving MDPs
Compute J1(Si) for all i
Compute J2(Si) for all i
:
Compute Jn(Si) for all i
… until converged.
Also known as Dynamic Programming.
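A sketch of this dynamic-programming loop in Python, using the T / reward / gamma representation from the MDP sketch above:

```python
def value_iteration(T, reward, gamma, xi=1e-6):
    """Compute J*(s) by repeating the Bellman backup until the largest per-state change is below xi."""
    J = {s: reward[s] for s in T}                        # J1(s) = r_s
    while True:
        J_next = {}
        for s in T:
            J_next[s] = max(reward[s] + gamma * sum(p * J[s2] for p, s2 in T[s][a])
                            for a in T[s])               # best action's backed-up value
        if max(abs(J_next[s] - J[s]) for s in T) < xi:
            return J_next
        J = J_next
```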

Finding the Optimal Policy
- Compute J*(Si) for all i using Value Iteration (a.k.a. Dynamic Programming).
- Define the best action in state Si as the action that is greedy with respect to J* (see below).
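In its standard form, the best action in state S_i is the one that is greedy with respect to J*:

$$ \pi^{*}(S_i) \;=\; \arg\max_{a} \Bigl[\, r_i + \gamma \sum_{j=1}^{N} P^{a}_{ij}\, J^{*}(S_j) \Bigr]. $$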

Applications of MDPs
This extends the search algorithms of your first lectures to the case of probabilistic next states. Many important problems are MDPs:
- Robot path planning
- Travel route planning
- Elevator scheduling
- Bank customer retention
- Autonomous aircraft navigation
- Manufacturing processes
- Network switching & routing