Markov Decision Processes Chapter 17 Mausam

Planning Agent An agent repeatedly decides "What action next?", sending actions to the environment and receiving percepts back. Environments vary along several dimensions:
- Static vs. Dynamic
- Fully vs. Partially Observable
- Perfect vs. Noisy
- Deterministic vs. Stochastic
- Instantaneous vs. Durative

Classical Planning The same agent/environment loop ("What action next?"), with an environment that is:
- Static
- Fully Observable
- Perfect
- Deterministic
- Instantaneous

Stochastic Planning: MDPs The same loop, but action outcomes are now stochastic; the environment is:
- Static
- Fully Observable
- Perfect
- Stochastic
- Instantaneous

MDP vs. Decision Theory
- Decision theory: episodic decisions
- MDP: sequential decisions

Markov Decision Process (MDP)
- S: a set of states (possibly factored, giving a Factored MDP)
- A: a set of actions
- Pr(s'|s,a): transition model (a.k.a. M^a_{s,s'})
- C(s,a,s'): cost model
- G: set of goals (absorbing or non-absorbing)
- s_0: start state
- γ: discount factor
- R(s,a,s'): reward model

Objective of an MDP Find a policy π: S → A which optimizes one of:
- minimizes expected cost to reach a goal
- maximizes expected reward
- maximizes expected (reward − cost)
given a horizon that is finite, infinite, or indefinite; assuming full observability; with discounted or undiscounted accumulation.

Role of Discount Factor (γ)
- Keeps the total reward / total cost finite; useful for infinite-horizon problems.
- Intuition (economics): money today is worth more than money tomorrow.
- Total reward: r_1 + γ r_2 + γ² r_3 + …
- Total cost: c_1 + γ c_2 + γ² c_3 + …
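A one-line check (not from the slides) of why discounting keeps the total finite: if every reward satisfies |r_t| ≤ R_max and 0 ≤ γ < 1, the geometric series gives
\[
\Bigl|\sum_{t=1}^{\infty} \gamma^{t-1} r_t\Bigr| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma},
\]
and the same bound applies to the total cost with C_max in place of R_max.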

Examples of MDPs
- Goal-directed, indefinite-horizon, cost-minimization MDP: most often studied in the planning and graph theory communities.
- Infinite-horizon, discounted-reward-maximization MDP: most often studied in the machine learning, economics, and operations research communities; the most popular model.
- Oversubscription planning (non-absorbing goals, reward-maximization MDP): a relatively recent model.

AND/OR Acyclic Graphs vs. MDPs
[Figure: an acyclic AND/OR graph and a cyclic MDP over states P, Q, R, S, T with goal G, actions a, b, c, and costs C(a) = 5, C(b) = 10, C(c) = 1.]
- Acyclic graph: expectimin works. V(Q) = V(R) = V(S) = V(T) = 1 and V(P) = 6, achieved by action a.
- MDP (with cycles): expectimin does not work (it loops forever). Here V(R) = V(S) = V(T) = 1 and Q(P,b) = 11, but Q(P,a) depends on Q(P,a) itself ("suppose I decide to take a in P"); solving the resulting fixed-point equation gives Q(P,a) = 13.5.

Bellman Equations for MDP 1 Define J*(s), the optimal cost, as the minimum expected cost to reach a goal from state s. J* should satisfy the following equation:
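The equation itself was rendered as an image on the original slide; a standard reconstruction for this goal-directed cost formulation, in the Pr, C, G notation defined earlier, is
\[
J^*(s) \;=\;
\begin{cases}
0, & s \in G,\\[4pt]
\displaystyle\min_{a \in A} \sum_{s'} \Pr(s' \mid s,a)\,\bigl[\,C(s,a,s') + J^*(s')\,\bigr], & \text{otherwise.}
\end{cases}
\]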

Bellman Equations for MDP 2 Define V*(s), the optimal value, as the maximum expected discounted reward obtainable from state s. V* should satisfy the following equation:
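The formula was likewise an image on the slide; the standard discounted-reward Bellman equation, in this chapter's notation, is
\[
V^*(s) \;=\; \max_{a \in A} \sum_{s'} \Pr(s' \mid s,a)\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr].
\]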

Bellman Backup (MDP 2) Given an estimate V_n of the value function V*, backing up V_n at state s produces a new estimate V_{n+1}:
- Q_{n+1}(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_n(s') ], i.e. the value of the strategy "execute action a in s, then execute π_n", where π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s,a)
- V_{n+1}(s) = max_{a ∈ Ap(s)} Q_{n+1}(s,a)

Bellman Backup
[Figure: a worked backup at state s0 with actions a1, a2, a3 leading to successors s1, s2, s3 whose current values are V0 = 0, V0 = 1, and V0 = 2. For example, Q1(s0,a2) = 5 + 0.9 × 1 + 0.1 × 2. Taking the max over the three Q-values gives V1 = 6.5, and a_greedy is the action achieving that max.]

Value iteration [Bellman'57]
- Assign an arbitrary value V_0 to each state.
- Repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s (iteration n+1).
- Until max_s |V_{n+1}(s) − V_n(s)| < ε, where |V_{n+1}(s) − V_n(s)| is the residual at s; this stopping rule is ε-convergence.
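A minimal Python sketch of value iteration (not from the slides); the tabular arrays P[s, a, s'] = Pr(s'|s,a) and R[s, a, s'] are representation choices assumed for the sketch.

    import numpy as np

    # Value iteration for the discounted-reward formulation (MDP 2).
    # P[s, a, s2] = Pr(s2 | s, a); R[s, a, s2] = reward for that transition.
    def value_iteration(P, R, gamma=0.9, eps=1e-6):
        n_s, n_a, _ = P.shape
        V = np.zeros(n_s)                                  # arbitrary V_0
        while True:
            # Bellman backup at every state: Q(s,a) = sum_s' P * (R + gamma * V)
            Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
            V_new = Q.max(axis=1)
            residual = np.abs(V_new - V).max()             # max_s |V_{n+1}(s) - V_n(s)|
            V = V_new
            if residual < eps:                             # eps-convergence
                return V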

Comments
- A decision-theoretic algorithm; dynamic programming; a fixed-point computation.
- A probabilistic version of the Bellman-Ford algorithm for shortest-path computation (MDP 1 is the Stochastic Shortest Path problem).
- Time complexity: one iteration is O(|S|² |A|); the number of iterations is poly(|S|, |A|, 1/ε, 1/(1−γ)).
- Space complexity: O(|S|).
- Factored MDPs (= planning under uncertainty): exponential space, exponential time.

Convergence Properties
- V_n → V* in the limit as n → ∞.
- ε-convergence: the V_n function is within ε of V*.
- Optimality: the current greedy policy is within 2ε of optimal.
- Monotonicity:
  - V_0 ≤ V* pointwise ⇒ V_n ≤ V* pointwise (V_n is monotonic from below).
  - V_0 ≥ V* pointwise ⇒ V_n ≥ V* pointwise (V_n is monotonic from above).
  - Otherwise V_n is non-monotonic.

Policy Computation
- The optimal policy is stationary and time-independent (for infinite/indefinite-horizon problems):
  π*(s) = argmax_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
- Policy evaluation: for a fixed policy π, V^π satisfies
  V^π(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ V^π(s') ]
  This is a system of linear equations in |S| variables.
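As a concrete illustration (not from the slides) of "a system of linear equations in |S| variables", policy evaluation can be written as one NumPy linear solve; the P_pi and R_pi array names are assumptions of the sketch.

    import numpy as np

    # Evaluate a fixed policy pi by solving (I - gamma * P_pi) V = R_pi, where
    # P_pi[s, s2] = Pr(s2 | s, pi(s)) and R_pi[s] is the expected immediate
    # reward of taking pi(s) in s.
    def policy_evaluation(P_pi, R_pi, gamma=0.9):
        n = P_pi.shape[0]                  # |S| equations in |S| unknowns
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)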

Changing the Search Space
- Value iteration: search in value space; compute the resulting policy.
- Policy iteration: search in policy space; compute the resulting value.

Policy iteration [Howard'60]
- Assign an arbitrary policy π_0 to each state.
- Repeat:
  - Policy evaluation: compute V_{n+1}, the evaluation of π_n.
  - Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a).
- Until π_{n+1} = π_n.
- Advantage: searching in a finite (policy) space, as opposed to an uncountably infinite (value) space, so convergence is faster; all other properties follow.
- Policy evaluation is costly, O(n³); Modified Policy Iteration approximates it by value iteration with the policy held fixed.
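A sketch of the policy iteration loop in the same tabular conventions as the earlier sketches (an illustration, not the slides' own code).

    import numpy as np

    # Policy iteration: alternate exact policy evaluation and greedy improvement.
    # P[s, a, s2] = Pr(s2 | s, a); R[s, a, s2] = reward; states/actions are ints.
    def policy_iteration(P, R, gamma=0.9):
        n_s, n_a, _ = P.shape
        pi = np.zeros(n_s, dtype=int)                        # arbitrary pi_0
        while True:
            # Policy evaluation: solve the |S| linear equations for V^pi.
            P_pi = P[np.arange(n_s), pi]                     # (|S|, |S|)
            R_pi = (P_pi * R[np.arange(n_s), pi]).sum(axis=1)
            V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
            # Policy improvement: pi_{n+1}(s) = argmax_a Q(s, a).
            Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):                   # pi_{n+1} == pi_n
                return pi, V
            pi = pi_new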

Modified Policy iteration
- Assign an arbitrary policy π_0 to each state.
- Repeat:
  - Policy evaluation: compute V_{n+1}, an approximate evaluation of π_n.
  - Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a).
- Until π_{n+1} = π_n.
- Advantage: probably the most competitive synchronous dynamic programming algorithm.

Applications  Stochastic Games  Robotics: navigation, helicopter manuevers…  Finance: options, investments  Communication Networks  Medicine: Radiation planning for cancer  Controlling workflows  Optimize bidding decisions in auctions  Traffic flow optimization  Aircraft queueing for landing; airline meal provisioning  Optimizing software on mobiles  Forest firefighting  …

Extensions
- Heuristic search + dynamic programming: AO*, LAO*, RTDP, …
- Factored MDPs: add planning-graph-style heuristics; use goal regression to generalize better.
- Hierarchical MDPs: a hierarchy of sub-tasks and actions, to scale better.
- Reinforcement Learning: learn the probabilities and rewards; act while learning; connections to psychology.
- Partially Observable Markov Decision Processes: noisy sensors, partially observable environment; popular in robotics.

Summary: Decision Making
- Classical planning: sequential decision making in a deterministic world; domain-independent heuristic generation.
- Decision theory: one-step decision making under uncertainty; value of information; utility theory.
- Markov Decision Processes: sequential decision making under uncertainty.

Asynchronous Value Iteration
- States may be backed up in any order, instead of iteration by iteration.
- As long as every state is backed up infinitely often, asynchronous value iteration converges to the optimal value function.

Asynch VI: Prioritized Sweeping
- Why back up a state if the values of its successors have not changed?
- Prefer backing up a state whose successors changed the most.
- Maintain a priority queue of (state, expected change in value).
- Back up states in priority order.
- After backing up a state, update the priorities of all its predecessors.
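A simplified Python sketch of the prioritized-sweeping order (an illustration, not the slides' algorithm verbatim): a max-priority queue emulated with heapq by negating priorities, and a precomputed predecessors map assumed as input.

    import heapq
    import numpy as np

    # Asynchronous VI, prioritized-sweeping style: back up the state whose value
    # is expected to change the most, then enqueue its predecessors.
    def prioritized_sweeping(P, R, predecessors, gamma=0.9, eps=1e-6, max_backups=100000):
        n_s = P.shape[0]
        V = np.zeros(n_s)
        heap = [(-float("inf"), s) for s in range(n_s)]      # seed: every state once
        heapq.heapify(heap)
        for _ in range(max_backups):
            if not heap:
                break                                        # queue drained: done
            _, s = heapq.heappop(heap)
            q_s = (P[s] * (R[s] + gamma * V[None, :])).sum(axis=1)  # backup at s
            change = abs(q_s.max() - V[s])
            V[s] = q_s.max()
            if change > eps:
                # Simplification: the classic rule also scales the priority by the
                # predecessor's transition probability into s.
                for p in predecessors[s]:
                    heapq.heappush(heap, (-change, p))
        return V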

Asynch VI: Real Time Dynamic Programming [Barto, Bradtke, Singh'95]
- Trial: simulate the greedy policy starting from the start state, performing a Bellman backup on each visited state.
- RTDP: repeat trials until the value function converges.
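One trial of RTDP, sketched for the goal-directed cost formulation (MDP 1); the goals set, the C cost array, and the rng-based simulator are assumptions of this sketch.

    import numpy as np

    # One RTDP trial: follow the greedy (minimum expected cost) policy from the
    # start state, doing a Bellman backup at each visited state, until a goal.
    def rtdp_trial(P, C, J, s0, goals, rng=None):
        rng = rng or np.random.default_rng()
        s = s0
        while s not in goals:
            Q = (P[s] * (C[s] + J[None, :])).sum(axis=1)   # backup the visited state
            J[s] = Q.min()
            a = Q.argmin()                                  # greedy action
            s = rng.choice(len(J), p=P[s, a])               # simulate one outcome
        return J

    # Full RTDP repeats rtdp_trial until J (approximately) stops changing.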

RTDP Trial
[Figure: one trial of RTDP toward the Goal. At the current state s0, with successor values V_n, the agent computes Q_{n+1}(s0,a) for each of the actions a1, a2, a3 (a "Min" over expected costs), sets V_{n+1}(s0), and follows the greedy action a_greedy = a2 to a sampled successor.]

Comments
- Property: if all states are visited infinitely often, then V_n → V*.
- Advantage: anytime behavior; the more probable states are explored quickly.
- Disadvantage: complete convergence can be slow!

Reinforcement Learning

- We still have an MDP, and we are still looking for a policy π.
- New twist: we don't know Pr and/or R, i.e., we don't know which states are good or what the actions do.
- We must actually try out actions to learn.

Model based methods
- Visit different states, perform different actions.
- Estimate Pr and R.
- Once the model is built, plan using value iteration or other methods.
- Con: requires a huge amount of data.

Model free methods
- Directly learn Q*(s,a) values.
- sample = R(s,a,s') + γ max_{a'} Q_n(s',a')
- Nudge the old estimate toward the new sample:
- Q_{n+1}(s,a) ← (1 − α) Q_n(s,a) + α · sample
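The update above as a tiny Python function (an illustrative sketch; the dictionary Q-table and the argument names are assumptions, not from the slides).

    from collections import defaultdict

    Q = defaultdict(float)                 # tabular Q(s, a), initialised to 0.0

    # One Q-learning update after observing the transition (s, a, r, s2):
    # nudge Q(s, a) toward the sample r + gamma * max_a' Q(s2, a').
    def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
        sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample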

Properties  Converges to optimal if If you explore enough If you make learning rate (  ) small enough But not decrease it too quickly ∑ i  (s,a,i) = ∞ ∑ i   (s,a,i) < ∞ where i is the number of visits to (s,a)

Model based vs. Model Free RL
- Model based: estimates O(|S|² |A|) parameters; requires relatively more data for learning; can make use of background knowledge easily.
- Model free: estimates O(|S| |A|) parameters; requires relatively less data for learning.

Exploration vs. Exploitation
- Exploration: choose actions that visit new states, to obtain more data for better learning.
- Exploitation: choose actions that maximize reward under the currently learnt model.
- ε-greedy:
  - At each time step, flip a coin.
  - With probability ε, take a random action.
  - With probability 1 − ε, take the current greedy action.
- Lower ε over time: increase exploitation as more learning has happened.
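ε-greedy selection as described above, in the same sketch conventions as the Q-learning update (illustrative, not from the slides).

    import random

    # With probability eps explore (random action); otherwise exploit the
    # current greedy action. Decay eps over time to shift toward exploitation.
    def epsilon_greedy(Q, s, actions, eps):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])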

Q-learning  Problems Too many states to visit during learning Q(s,a) is still a BIG table  We want to generalize from small set of training examples  Techniques Value function approximators Policy approximators Hierarchical Reinforcement Learning

Task Hierarchy: MAXQ Decomposition [Dietterich'00]
[Figure: a task hierarchy rooted at Root, whose sub-tasks include Take, Give, Navigate(loc), Deliver, and Fetch, with lower-level tasks Extend-arm, Grab, Release and the primitive moves Move_e, Move_w, Move_s, Move_n.]
- Children of a task are unordered.

Partially Observable Markov Decision Processes

Partially Observable MDPs The same agent/environment loop ("What action next?"), but now the environment is:
- Static
- Partially Observable
- Noisy
- Stochastic
- Instantaneous
- Unpredictable

Stochastic, Fully Observable

Stochastic, Partially Observable

POMDPs  In POMDPs we apply the very same idea as in MDPs.  Since the state is not observable, the agent has to make its decisions based on the belief state which is a posterior distribution over states.  Let b be the belief of the agent about the current state  POMDPs compute a value function over belief space: γ a b, a a

POMDPs  Each belief is a probability distribution, value fn is a function of an entire probability distribution.  Problematic, since probability distributions are continuous.  Also, we have to deal with huge complexity of belief spaces.  For finite worlds with finite state, action, and observation spaces and finite horizons, we can represent the value functions by piecewise linear functions.

Applications  Robotic control helicopter maneuvering, autonomous vehicles Mars rover - path planning, oversubscription planning elevator planning  Game playing - backgammon, tetris, checkers  Neuroscience  Computational Finance, Sequential Auctions  Assisting elderly in simple tasks  Spoken dialog management  Communication Networks – switching, routing, flow control  War planning, evacuation planning