4/1 Agenda: Markov Decision Processes (& Decision Theoretic Planning)


MDPs as a way of Reducing Planning Complexity
– MDPs provide a normative basis for talking about optimal plans (policies) in the context of stochastic actions and complex reward models.
– Optimal policies for MDPs can be computed in polynomial time.
– In contrast, even classical planning is NP-complete or PSPACE-complete (depending on whether the plans are of polynomial or exponential length).
– So: convert planning problems to MDP problems, and we get polynomial-time performance.
  – To see this, note that the sorting problem can be written as a planning problem, but sorting takes only polynomial time; thus the inherent complexity of planning is only polynomial, and all the MDP conversion does is let planning exhibit its inherent polynomial complexity.

MDP Complexity: The Real Deal
– Complexity results are stated in terms of the size of the input (measured in some way).
– MDP complexity results are typically given in terms of the state-space size; planning complexity results are typically given in terms of the factored input (state variables).
– The state space is already exponential in the number of state variables, so polynomial in the state space means exponential in the factored representation.
  – More depressingly, optimal policy construction for POMDPs is exponential for finite horizons and undecidable for infinite horizons, even with the input size measured in terms of the explicit state space.
– So clearly, we don't compile planning problems into the MDP model for efficiency…

Forget your homework grading. Forget your project grading. We’ll make it look like you remembered

Agenda
– General (FO)MDP model
  – Action (transition) model; action cost model
  – Reward model
  – Histories; horizon
  – Policies
  – Optimal value and policy
  – Value iteration / policy iteration / RTDP
– Special cases of the MDP model relevant to planning
  – Pure cost models (goal states are absorbing)
  – Reward/cost models
  – Over-subscription models
  – Connections to heuristic search
  – Efficient approaches for policy construction

Markov Decision Process (MDP)
– S: a set of states
– A: a set of actions
– Pr(s'|s,a): transition model (aka M^a_{s,s'})
– C(s,a,s'): cost model
– G: set of goals
– s_0: start state
– γ: discount factor
– R(s,a,s'): reward model
(A minimal code sketch of this tuple follows below.)
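
The following is a minimal sketch, not from the slides, of how the tuple above could be represented in Python; the class name MDP, the dictionary-based transition encoding, and the tiny two-state example are all assumptions made for illustration.

```python
# Sketch: one possible encoding of the MDP tuple (S, A, Pr, C, G, s0, gamma, R).
# Transitions are stored as trans[(s, a)] = [(s_next, probability), ...].

class MDP:
    def __init__(self, states, actions, trans, rewards, gamma=0.9, start=None, goals=None):
        self.S = list(states)      # S: set of states
        self.A = list(actions)     # A: set of actions
        self.T = trans             # Pr(s'|s,a) as {(s, a): [(s', p), ...]}
        self.R = rewards           # R(s): state reward (could also be R(s,a,s'))
        self.gamma = gamma         # discount factor
        self.s0 = start            # start state
        self.G = set(goals or [])  # goal (absorbing) states

# A tiny two-state, two-action example used by the later sketches:
mdp = MDP(
    states=["s1", "s2"],
    actions=["a", "b"],
    trans={
        ("s1", "a"): [("s1", 0.2), ("s2", 0.8)],
        ("s1", "b"): [("s1", 1.0)],
        ("s2", "a"): [("s2", 1.0)],
        ("s2", "b"): [("s1", 0.5), ("s2", 0.5)],
    },
    rewards={"s1": 0.0, "s2": 1.0},
)
```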

Objective of a Fully Observable MDP
– Find a policy π: S → A
– which optimises: minimises expected cost to reach a goal / maximises expected reward / maximises expected (reward − cost)
– given a finite / infinite / indefinite horizon
– assuming full observability
– discounted or undiscounted.

Histories; Value of Histories; (expected) Value of a policy; Optimal Value & Bellman Principle

Policy evaluation vs. Optimal Value Function in Finite vs. Infinite Horizon

[Can generalize to have action costs C(a,s).] If the transition matrix M^a_{ij} is not known a priori, then we have a reinforcement learning scenario.

What does a solution to an MDP look like?
– The solution should give the optimal action to do in each state (called a "policy")
  – A policy is a function from states to actions (*see the finite-horizon case below*)
  – Not a sequence of actions anymore; this is needed because the actions are non-deterministic
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
– How do we get the best policy?
  – Pick the policy that gives the maximal expected reward
  – For each policy: simulate it (take the actions suggested by the policy) to get behavior traces, evaluate the behavior traces, and take the average value of the traces (a brute-force sketch of this follows below)
– We will concentrate on infinite-horizon problems (an infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
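
To make the |A|^|S| count and the "simulate and average" idea concrete, here is a brute-force sketch over the toy mdp object assumed in the earlier code block; the helper names simulate and brute_force_best_policy are illustrative, not from the slides.

```python
import itertools
import random

def simulate(mdp, policy, start, steps=50):
    """Run one behavior trace under the policy and return its discounted reward."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(steps):
        total += discount * mdp.R[s]
        next_states, probs = zip(*mdp.T[(s, policy[s])])
        s = random.choices(next_states, probs)[0]
        discount *= mdp.gamma
    return total

def brute_force_best_policy(mdp, start, traces=200):
    """Enumerate all |A|^|S| policies and keep the one with the best average trace value."""
    best_policy, best_value = None, float("-inf")
    for assignment in itertools.product(mdp.A, repeat=len(mdp.S)):
        policy = dict(zip(mdp.S, assignment))
        value = sum(simulate(mdp, policy, start) for _ in range(traces)) / traces
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value

# Feasible only for tiny MDPs: 2 states x 2 actions = 4 policies here,
# but 20 states x 4 actions would already be 4^20 (about 10^12) policies.
print(brute_force_best_policy(mdp, start="s1"))
```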

Horizon & Policy
– How long should behavior traces be?
  – Each trace is no longer than k (finite-horizon case): the policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is)
    – E.g., financial portfolio advice for yuppies vs. retirees
  – No limit on the length of the trace (infinite-horizon case): the policy is not horizon-dependent
– We will concentrate on infinite-horizon problems (an infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
"If you are twenty and not a liberal, you are heartless; if you are sixty and not a conservative, you are mindless." --Churchill

How to handle unbounded state sequences?
If we don't have a horizon, then we can have potentially infinitely long state sequences. Three ways to handle them:
1. Use a discounted reward model (the i-th state in the sequence contributes only γ^i R(s_i))
2. Assume that the policy is proper (i.e., each sequence terminates in an absorbing state with non-zero probability)
3. Consider the "average reward per step"

How to evaluate a policy?
– Step 1: Define the utility of a sequence of states in terms of their rewards
  – Assume "stationarity" of preferences: if you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today
  – Then there are only two reasonable ways to define the utility of a sequence of states:
    – U(s_1, s_2, …, s_n) = Σ_i R(s_i)
    – U(s_1, s_2, …, s_n) = Σ_i γ^i R(s_i), with 0 ≤ γ ≤ 1; the maximum utility is bounded from above by R_max / (1 − γ) (a small code sketch of this sum follows below)
– Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it: E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
– Step 3: The optimal policy π* is the one that maximizes this expectation: π* = argmax_π E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
  – Since there are only |A|^|S| different policies, you can evaluate them all in finite time (Haa haa..)
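
A quick sketch (the function name is an assumption) of the discounted utility of one finite trace, matching U(s_1, …, s_n) = Σ_i γ^i R(s_i) above:

```python
def discounted_utility(rewards, gamma=0.9):
    """Utility of a state sequence, given its per-state rewards: sum_i gamma^i * R(s_i)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: a 4-state trace; for infinite traces the sum is bounded above by R_max / (1 - gamma).
print(discounted_utility([0.0, 0.0, 1.0, 1.0], gamma=0.9))  # 0.81 + 0.729 = 1.539
```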

Utility of a State
– The (long-term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
  – U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]
– The true utility of a state s is just its utility w.r.t. the optimal policy: U(s) = U^{π*}(s)
– Thus U and π* are closely related:
  – π*(s) = argmax_a Σ_{s'} M^a_{ss'} U(s')
– As are the utilities of neighboring states:
  – U(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U(s')   (the Bellman equation)

Called the value function U*: U*(s) is the maximal expected utility (value) of s assuming the optimal policy is followed. Think of these as related to the h*() values of heuristic search.

Optimal policies depend on the rewards.

Bellman Equations as a basis for computing the optimal policy
– Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
  – Yes… the optimal value and the optimal policy are related by the Bellman equations:
    – U(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U(s')
    – π*(s) = argmax_a Σ_{s'} M^a_{ss'} U(s')
– The equations can be solved exactly through
  – "value iteration" (iteratively compute U and then compute π*); a code sketch follows below
  – "policy iteration" (iterate over policies)
– Or solved approximately through "real-time dynamic programming"
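
A minimal sketch of value iteration over the MDP class assumed earlier (the function name and stopping threshold are illustrative choices, not from the slides); it repeatedly applies the Bellman backup and then extracts the greedy policy:

```python
def value_iteration(mdp, eps=1e-6, max_iters=10_000):
    """Repeat U(s) <- R(s) + gamma * max_a sum_s' Pr(s'|s,a) U(s') until convergence."""
    U = {s: 0.0 for s in mdp.S}
    for _ in range(max_iters):
        U_new = {
            s: mdp.R[s] + mdp.gamma * max(
                sum(p * U[s2] for s2, p in mdp.T[(s, a)]) for a in mdp.A
            )
            for s in mdp.S
        }
        converged = max(abs(U_new[s] - U[s]) for s in mdp.S) < eps  # max-norm test
        U = U_new
        if converged:
            break
    # Extract the greedy policy: pi*(s) = argmax_a sum_s' Pr(s'|s,a) U(s').
    pi = {
        s: max(mdp.A, key=lambda a, s=s: sum(p * U[s2] for s2, p in mdp.T[(s, a)]))
        for s in mdp.S
    }
    return U, pi

U_star, pi_star = value_iteration(mdp)
print(U_star, pi_star)
```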

U(i) = R(i) + γ max_a Σ_j M^a_{ij} U(j)
[Figure: grid-world transition diagram with move probabilities 0.8 and 0.1.]

Value Iteration Demo: mdp/vi.html
Things to note:
– The way the values change (states far from the absorbing states may first decrease and then increase their values)
– The difference in convergence speed between the policy and the values

Why do the values come down first? Why do some states reach their optimal value faster?
Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.

Terminating Value Iteration
– The basic idea is to terminate value iteration when the values have "converged" (i.e., are not changing much from iteration to iteration)
  – Set a threshold ε and stop when the change across two consecutive iterations is less than ε
  – There is a minor problem, since the value is a vector: we bound the maximum change allowed in any of the dimensions between two successive iterations by ε
    – The max norm ||·|| of a vector is the maximal value among all its dimensions
    – We are basically terminating when ||U_i − U_{i+1}|| < ε (a one-line sketch of this test follows below)
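
In code, the max-norm test is a one-liner over the utility vectors (a sketch reusing the dictionary representation assumed in the earlier examples):

```python
def max_norm_diff(U_old, U_new):
    """||U_i - U_{i+1}||: the largest change in any state's utility between iterations."""
    return max(abs(U_new[s] - U_old[s]) for s in U_old)

# Terminate value iteration when max_norm_diff(U_old, U_new) < eps.
```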

Policies converge earlier than values
– There is a finite number of policies but an infinite number of value functions, so entire regions of the value-vector space are mapped to a single policy; hence policies may converge faster than values. This suggests searching in the space of policies.
– Given a utility vector U_i we can compute the greedy policy π_{U_i} (a sketch follows below)
– The policy loss of π_{U_i} is ||U^{π_{U_i}} − U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension)
[Figure: for an MDP with 2 states and 2 actions, the value space with axes V(S_1) and V(S_2) is partitioned into regions P1–P4, each corresponding to one policy, with U* marked.]
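
A sketch of computing the greedy policy for a given utility vector, reusing the assumed MDP representation from earlier (the function name is illustrative):

```python
def greedy_policy(mdp, U):
    """pi_U(s) = argmax_a sum_s' Pr(s'|s,a) * U(s'): the policy that is greedy w.r.t. U."""
    return {
        s: max(mdp.A, key=lambda a, s=s: sum(p * U[s2] for s2, p in mdp.T[(s, a)]))
        for s in mdp.S
    }

# Even a rough utility vector (e.g., after a few value-iteration sweeps) may already
# yield the optimal policy, which is why policies often converge before values do.
print(greedy_policy(mdp, {"s1": 0.0, "s2": 1.0}))
```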

Policy evaluation: with a fixed policy the update no longer has the "max" operation, so we get n linear equations with n unknowns. We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (without the "max"). A NumPy sketch of the exact solution follows below.
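
A sketch of the exact solution using NumPy and the dictionary MDP encoding assumed earlier: for a fixed policy π we have U = R + γ T_π U, so we solve (I − γ T_π) U = R.

```python
import numpy as np

def evaluate_policy_exactly(mdp, policy):
    """Solve U = R + gamma * T_pi U, i.e. (I - gamma * T_pi) U = R, for a fixed policy."""
    idx = {s: i for i, s in enumerate(mdp.S)}
    n = len(mdp.S)
    T_pi = np.zeros((n, n))
    for s in mdp.S:
        for s2, p in mdp.T[(s, policy[s])]:
            T_pi[idx[s], idx[s2]] += p
    R = np.array([mdp.R[s] for s in mdp.S])
    U = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R)
    return {s: U[idx[s]] for s in mdp.S}

# Policy-iteration style usage: evaluate the current policy exactly, then improve it greedily.
print(evaluate_policy_exactly(mdp, {"s1": "a", "s2": "a"}))
```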

Bellman equations when actions have costs
– The model discussed in class ignores action costs and only thinks of state rewards
  – C(s,a) is the cost of doing action a in state s; assume costs are just negative rewards
  – The Bellman equation then becomes U(s) = R(s) + γ max_a [ −C(s,a) + Σ_{s'} M^a_{ss'} U(s') ]
  – Notice that the only difference is that −C(s,a) is now inside the maximization (a sketch of this backup follows below)
– With this model, we can talk about "partial satisfaction" planning problems, where actions have costs, goals have utilities, and the optimal plan may not satisfy all goals.
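
A sketch of the cost-aware Bellman backup for a single state; the cost dictionary C[(s, a)] is an assumed addition to the earlier toy MDP, not something defined in the slides.

```python
def backup_with_costs(mdp, C, U, s):
    """U(s) = R(s) + gamma * max_a [ -C(s,a) + sum_s' Pr(s'|s,a) U(s') ]."""
    return mdp.R[s] + mdp.gamma * max(
        -C[(s, a)] + sum(p * U[s2] for s2, p in mdp.T[(s, a)])
        for a in mdp.A
    )

# Example: a uniform action cost of 0.1 on the toy MDP; -C(s,a) sits inside the max.
C = {(s, a): 0.1 for s in mdp.S for a in mdp.A}
print(backup_with_costs(mdp, C, {"s1": 0.0, "s2": 1.0}, "s1"))
```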