Decision Making Under Uncertainty Russell and Norvig: ch 16, 17 CMSC421 – Fall 2005.

Slides:

Advertisements

Similar presentations

Decision Making Under Uncertainty Russell and Norvig: ch 16, 17 CMSC 671 – Fall 2005 material from Lise Getoor, Jean-Claude Latombe, and Daphne Koller.

Advertisements

Markov Decision Processes (MDPs) read Ch utility-based agents –goals encoded in utility function U(s), or U:S  effects of actions encoded in.

Making Simple Decisions Chapter 16 Some material borrowed from Jean-Claude Latombe and Daphne Koller by way of Marie desJadines,

Making Simple Decisions

Markov Decision Process

Partially Observable Markov Decision Process (POMDP)

SA-1 Probabilistic Robotics Planning and Control: Partially Observable Markov Decision Processes.

CSE-573 Artificial Intelligence Partially-Observable MDPS (POMDPs)

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY CSIT 5220: Reasoning and Decision under Uncertainty L09: Graphical Models for Decision Problems Nevin.

Decision Theoretic Planning

MDP Presentation CS594 Automated Optimal Decision Making Sohail M Yousof Advanced Artificial Intelligence.

An Introduction to Markov Decision Processes Sarah Hickmott

Markov Decision Processes

Planning under Uncertainty

POMDPs: Partially Observable Markov Decision Processes Advanced AI

SA-1 1 Probabilistic Robotics Planning and Control: Markov Decision Processes.

KI Kunstmatige Intelligentie / RuG Markov Decision Processes AIMA, Chapter 17.

Nov 14 th  Homework 4 due  Project 4 due 11/26.

Making Decisions under Probabilistic Uncertainty (Where an agent optimizes what it gets on average, but it may get more... or less ) R&N: Chap. 17, Sect.

Decision Making Under Uncertainty Russell and Norvig: ch 16 CMSC421 – Fall 2006.

Markov Decision Processes

Department of Computer Science Undergraduate Events More

Decision Making Under Uncertainty Russell and Norvig: ch 16, 17 CMSC421 – Fall 2003 material from Jean-Claude Latombe, and Daphne Koller.

Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

CMSC 671 Fall 2003 Class #26 – Wednesday, November 26 Russell & Norvig 16.1 – 16.5 Some material borrowed from Jean-Claude Latombe and Daphne Koller by.

Utility Theory & MDPs Tamara Berg CS Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.

Instructor: Vincent Conitzer

MAKING COMPLEX DEClSlONS

CSE-473 Artificial Intelligence Partially-Observable MDPS (POMDPs)

CSE-573 Reinforcement Learning POMDPs. Planning What action next? PerceptsActions Environment Static vs. Dynamic Fully vs. Partially Observable Perfect.

Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect ,Chap. 17 CS121 – Winter 2003.

Decision Making Under Uncertainty CMSC 671 – Fall 2010 R&N, Chapters , , material from Lise Getoor, Jean-Claude Latombe, and.

Decision Theoretic Planning. Decisions Under Uncertainty  Some areas of AI (e.g., planning) focus on decision making in domains where the environment.

Decision Making Under Uncertainty CMSC 471 – Spring 2014 Class #12– Thursday, March 6 R&N, Chapters , material from Lise Getoor, Jean-Claude.

Decision Making Under Uncertainty CMSC 471 – Spring 2041 Class #25– Tuesday, April 29 R&N, material from Lise Getoor, Jean-Claude Latombe, and.

CPS 570: Artificial Intelligence Markov decision processes, POMDPs

Automated Planning and Decision Making Prof. Ronen Brafman Automated Planning and Decision Making Fully Observable MDP.

Department of Computer Science Undergraduate Events More

1 Chapter 17 2 nd Part Making Complex Decisions --- Decision-theoretic Agent Design Xin Lu 11/04/2002.

Markov Decision Process (MDP)

Web-Mining Agents Agents and Rational Behavior Decision-Making under Uncertainty Simple Decisions Ralf Möller Universität zu Lübeck Institut für Informationssysteme.

Planning Under Uncertainty. Sensing error Partial observability Unpredictable dynamics Other agents.

Making Simple Decisions Chapter 16 Some material borrowed from Jean-Claude Latombe and Daphne Koller by way of Marie desJadines,

Markov Decision Processes AIMA: 17.1, 17.2 (excluding ), 17.3.

Web-Mining Agents Agents and Rational Behavior Decision-Making under Uncertainty Complex Decisions Ralf Möller Universität zu Lübeck Institut für Informationssysteme.

Decision Making Under Uncertainty CMSC 471 – Fall 2011 Class #23-24 – Thursday, November 17 / Tuesday, November 22 R&N, Chapters , ,

Instructor: Eyal Amir Grad TAs: Wen Pu, Yonatan Bisk Undergrad TAs: Sam Johnson, Nikhil Johri CS 440 / ECE 448 Introduction to Artificial Intelligence.

Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Nevin L. Zhang Room 3504, phone: ,

Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Maximum Expected Utility

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 3

Making Simple Decisions

CS b659: Intelligent Robotics

ECE457 Applied Artificial Intelligence Fall 2007 Lecture #10

Making complex decisions

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 3

Markov Decision Processes

Markov Decision Processes

Markov Decision Processes

13. Acting under Uncertainty Wolfram Burgard and Bernhard Nebel

Instructor: Vincent Conitzer

Chapter 17 – Making Complex Decisions

CMSC 471 – Fall 2011 Class #25 – Tuesday, November 29

Making Simple Decisions

CS 416 Artificial Intelligence

Reinforcement Learning Dealing with Partial Observability

Reinforcement Nisheeth 18th January 2019.

Deciding Under Probabilistic Uncertainty

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 3

Presentation transcript:

Decision Making Under Uncertainty Russell and Norvig: ch 16, 17 CMSC421 – Fall 2005

Utility-Based Agent environment agent ? sensors actuators

Non-deterministic vs. Probabilistic Uncertainty ? bac {a,b,c}  decision that is best for worst case ? bac {a(p a ),b(p b ),c(p c )}  decision that maximizes expected utility value Non-deterministic modelProbabilistic model ~ Adversarial search

Expected Utility Random variable X with n values x 1,…,x n and distribution (p 1,…,p n ) E.g.: X is the state reached after doing an action A under uncertainty Function U of X E.g., U is the utility of a state The expected utility of A is EU[A] =  i=1,…,n p(x i |A)U(x i )

s0s0 s3s3 s2s2 s1s1 A EU(A1) = 100 x x x 0.1 = = 62 One State/One Action Example

s0s0 s3s3 s2s2 s1s1 A A2 s4s EU(AI) = 62 EU(A2) = 74 EU(S0) = max{EU(A1),EU(A2)} = 74 One State/Two Actions Example

s0s0 s3s3 s2s2 s1s1 A A2 s4s EU(A1) = 62 – 5 = 57 EU(A2) = 74 – 25 = 49 EU(S0) = max{EU(A1),EU(A2)} = Introducing Action Costs

MEU Principle rational agent should choose the action that maximizes agent’s expected utility this is the basis of the field of decision theory normative criterion for rational choice of action

Not quite… Must have complete model of: Actions Utilities States Even if you have a complete model, will be computationally intractable In fact, a truly rational agent takes into account the utility of reasoning as well---bounded rationality Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision theoretic problems than ever before

We’ll look at Decision Theoretic Planning Simple decision making (ch. 16) Sequential decision making (ch. 17)

Decision Networks Extend BNs to handle actions and utilities Also called Influence diagrams Make use of BN inference Can do Value of Information calculations

Decision Networks cont. Chance nodes: random variables, as in BNs Decision nodes: actions that decision maker can take Utility/value nodes: the utility of the outcome state.

R&N example

Prenatal Testing Example

Umbrella Network rain Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0

Evaluating Decision Networks Set the evidence variables for current state For each possible value of the decision node: Set decision node to that value Calculate the posterior probability of the parent nodes of the utility node, using BN inference Calculate the resulting utility for action return the action with the highest utility

Umbrella Network rain Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0 umbrainP(umb,rain | take) #1 #2: EU(take)

Umbrella Network rain Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0 umbrainP(umb,rain | ~take) #1 #2: EU(~take)

Value of Information (VOI) suppose agent’s current knowledge is E. The value of the current best action  is the value of the new best action (after new evidence E’ is obtained): the value of information for E’ is:

Umbrella Network rain forecast Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0 RFP(F|R)

VOI VOI(forecast)= P(rainy)EU(  rainy ) + P(~rainy)EU(  ~rainy ) – EU(  )

Umbrella Network rain forecast Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0 RFP(F|R) FRP(R|F) P(F=rainy) = 0.4

umbrainP(umb,rain | take, rainy) #1: EU(take|rainy) umbrainP(umb,rain | take, ~rainy) #3: EU(take|~rainy) umbrainP(umb,rain | ~take, rainy) #2: EU(~take|rainy) umbrainP(umb,rain |~take, ~rainy) #4: EU(~take|~rainy)

Umbrella Network rain forecast Take Umbrella happiness take/don’t take P(rain) = 0.4 U(~umb, ~rain) = 100 U(~umb, rain) = -100 U(umb,~rain) = 0 U(umb,rain) = -25 umbrella P(umb|take) = 1.0 P(~umb|~take)=1.0 RFP(F|R) FRP(R|F) P(F=rainy) = 0.4

VOI VOI(forecast)= P(rainy)EU(  rainy ) + P(~rainy)EU(  ~rainy ) – EU(  )

Sequential Decision Making Finite Horizon Infinite Horizon

Simple Robot Navigation Problem In each state, the possible actions are U, D, R, and L

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move)

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move) With probability 0.1 the robot moves left one square (if the robot is already in the leftmost row, then it does not move)

Markov Property The transition properties depend only on the current state, not on previous history (how that state was reached)

Sequence of Actions Planned sequence of actions: (U, R) [3,2]

Sequence of Actions Planned sequence of actions: (U, R) U is executed [3,2] [4,2][3,3][3,2]

Histories Planned sequence of actions: (U, R) U has been executed R is executed There are 9 possible sequences of states – called histories – and 6 possible final states for the robot! [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Probability of Reaching the Goal P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2]) Note importance of Markov property in this derivation P([3,3] | U.[3,2]) = 0.8 P([4,2] | U.[3,2]) = 0.1 P([4,3] | R.[3,3]) = 0.8 P([4,3] | R.[4,2]) = 0.1 P([4,3] | (U,R).[3,2]) = 0.65

Utility Function [4,3] provides power supply [4,2] is a sand area from which the robot cannot escape

Utility Function [4,3] provides power supply [4,2] is a sand area from which the robot cannot escape The robot needs to recharge its batteries

Utility Function [4,3] provides power supply [4,2] is a sand area from which the robot cannot escape The robot needs to recharge its batteries [4,3] or [4,2] are terminal states

Utility of a History [4,3] provides power supply [4,2] is a sand area from which the robot cannot escape The robot needs to recharge its batteries [4,3] or [4,2] are terminal states The utility of a history is defined by the utility of the last state (+1 or –1) minus n/25, where n is the number of moves

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2]

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories: U =  h U h P(h) [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Optimal Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories The optimal sequence is the one with maximal utility [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Optimal Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories The optimal sequence is the one with maximal utility But is the optimal action sequence what we want to compute? [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1] only if the sequence is executed blindly!

Accessible or observable state Repeat:  s  sensed state  If s is terminal then exit  a  choose action (given s)  Perform a Reactive Agent Algorithm

Policy (Reactive/Closed-Loop Strategy) A policy  is a complete mapping from states to actions

Repeat:  s  sensed state  If s is terminal then exit  a   (s)  Perform a Reactive Agent Algorithm

Optimal Policy +1 A policy  is a complete mapping from states to actions The optimal policy  * is the one that always yields a history (ending at a terminal state) with maximal expected utility Makes sense because of Markov property Note that [3,2] is a “dangerous” state that the optimal policy tries to avoid

Optimal Policy +1 A policy  is a complete mapping from states to actions The optimal policy  * is the one that always yields a history with maximal expected utility This problem is called a Markov Decision Problem (MDP) How to compute  *?

Additive Utility History H = (s 0,s 1,…,s n ) The utility of H is additive iff: U (s 0,s 1,…,s n ) = R (0) + U (s 1,…,s n ) =  R (i) Reward

Additive Utility History H = (s 0,s 1,…,s n ) The utility of H is additive iff: U (s 0,s 1,…,s n ) = R (0) + U (s 1,…,s n ) =  R (i) Robot navigation example: R (n) = +1 if s n = [4,3] R (n) = -1 if s n = [4,2] R (i) = -1/25 if i = 0, …, n-1

Principle of Max Expected Utility History H = (s 0,s 1,…,s n ) Utility of H: U (s 0,s 1,…,s n ) =  R (i)  First-step analysis  U (i) = R (i) + max a  k P (k | a.i) U (k)  *(i) = arg max a  k P (k | a.i) U (k) +1

Value Iteration Initialize the utility of each non-terminal state s i to U 0 (i) = 0 For t = 0, 1, 2, …, do: U t+1 (i)  R (i) + max a  k P (k | a.i) U t (k)

Value Iteration Initialize the utility of each non-terminal state s i to U 0 (i) = 0 For t = 0, 1, 2, …, do: U t+1 (i)  R (i) + max a  k P (k | a.i) U t (k) U t ([3,1]) t Note the importance of terminal states and connectivity of the state-transition graph

Policy Iteration Pick a policy  at random

Policy Iteration Pick a policy  at random Repeat: Compute the utility of each state for  U t+1 (i)  R (i) +  k P (k |  (i).i) U t (k)

Policy Iteration Pick a policy  at random Repeat: Compute the utility of each state for  U t+1 (i)  R (i) +  k P (k |  (i).i) U t (k) Compute the policy  ’ given these utilities  ’(i) = arg max a  k P (k | a.i) U (k)

Policy Iteration Pick a policy  at random Repeat: Compute the utility of each state for  U t+1 (i)  R (i) +  k P (k |  (i).i) U t (k) Compute the policy  ’ given these utilities  ’(i) = arg max a  k P(k | a.i) U (k) If  ’ =  then return  Or solve the set of linear equations: U (i) = R (i) +  k P (k |  (i).i) U (k) (often a sparse system)

Example: Tracking a Target target robot The robot must keep the target in view The target’s trajectory is not known in advance

Example: Tracking a Target

Infinite Horizon In many problems, e.g., the robot navigation example, histories are potentially unbounded and the same state can be reached many times One trick: Use discounting to make infinite Horizon problem mathematically tractible What if the robot lives forever?

POMDP (Partially Observable Markov Decision Problem) A sensing operation returns multiple states, with a probability distribution Choosing the action that maximizes the expected utility of this state distribution assuming “state utilities” computed as above is not good enough, and actually does not make sense (is not rational)

Example: Target Tracking There is uncertainty in the robot’s and target’s positions, and this uncertainty grows with further motion There is a risk that the target escape behind the corner requiring the robot to move appropriately But there is a positioning landmark nearby. Should the robot tries to reduce position uncertainty?

Summary Decision making under uncertainty Utility function Optimal policy Maximal expected utility Value iteration Policy iteration