CS 416 Artificial Intelligence, Lecture 20: Making Complex Decisions (Chapter 17)

Midterm Results AVG: 72, MED: 75, STD: 12. Rough dividing lines at: 58 (C), 72 (B), 85 (A).

Assignment 1 Results AVG: 87, MED: 94, STD: 19. How to interpret the grade sheet…

Interpreting the grade sheet…
You see the tests we ran listed in the first column.
The metrics we accumulated are:
– Solution depth, nodes created, nodes accessed, fringe size
– All metrics are normalized by dividing by the value obtained using one of the good solutions from last year
The first four columns show these normalized metrics averaged across the entire class's submissions.
The next four columns show these normalized metrics for your submission…
– Ex: A value of "1" for "Solution" means your code found a solution at the same depth as the solution from last year. The class average for "Solution" might be 1.28 because some submissions searched longer and thus increased the average.

Interpreting the grade sheet
SLOW = more than 30 seconds to complete
– 66% credit given to reflect partial credit even though we never obtained firm results
N/A = the test would not even launch correctly… it might have crashed or ended without output
– 33% credit given to reflect that N/A frequently occurs when no attempt was made to create an implementation
If you have an N/A but you think your code reflects partial credit, let us know.

Gambler's Ruin
Consider working out examples of gambler's ruin for $4 and $8 by hand.
Ben created some graphs to show the solution of gambler's ruin for $8.
$0 bets are not permitted!

$8-ruin using batch update
Converges after three iterations. The value vector is only updated after a complete iteration has completed.

$8-ruin using in-place updating
Convergence occurs more quickly. Updates to the value function occur in-place, starting from $1.
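For anyone who wants to reproduce these runs, here is a minimal sketch of both update schemes. The model details are assumptions (a fair coin with win probability 0.5, a win pays the amount bet, a reward of 1 only for reaching the $8 goal, no discounting), so the exact values and sweep counts may differ from the assignment's setup.

// Value iteration for the $8 gambler's ruin: batch vs. in-place updates.
// Assumed model (not taken from the slides): win probability 0.5, a win pays
// the amount bet, a loss forfeits it, reward 1 only for reaching $8, and the
// states $0 and $8 are terminal.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

const int GOAL = 8;
const double P_WIN = 0.5;

// Returns the converged value vector; reports the number of sweeps via 'sweeps'.
std::vector<double> valueIteration(bool inPlace, int &sweeps, double tol = 1e-9) {
    std::vector<double> V(GOAL + 1, 0.0);
    V[GOAL] = 1.0;                        // utility of reaching the goal
    sweeps = 0;
    while (true) {
        std::vector<double> old = V;      // snapshot for the batch variant and the convergence test
        const std::vector<double> &src = inPlace ? V : old;  // in-place reads values updated earlier this sweep
        for (int s = 1; s < GOAL; ++s) {  // sweep the states $1 .. $7, starting from $1
            double best = 0.0;
            int maxBet = std::min(s, GOAL - s);   // $0 bets are not permitted
            for (int b = 1; b <= maxBet; ++b)
                best = std::max(best, P_WIN * src[s + b] + (1 - P_WIN) * src[s - b]);
            V[s] = best;
        }
        ++sweeps;
        double delta = 0.0;
        for (int s = 0; s <= GOAL; ++s) delta = std::max(delta, std::fabs(V[s] - old[s]));
        if (delta < tol) return V;
    }
}

int main() {
    int batchSweeps, inPlaceSweeps;
    valueIteration(false, batchSweeps);   // batch: value vector effectively updated only after a full sweep
    valueIteration(true, inPlaceSweeps);  // in-place: typically converges in fewer sweeps
    std::printf("batch sweeps: %d, in-place sweeps: %d\n", batchSweeps, inPlaceSweeps);
}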

$100-ruin A more detailed graph than provided in the assignment

Trying it by hand
Assume value update is working… What's the best action at $5?
(A table of values for the states $1, $2, … appears on the slide.)
When tied… pick the smallest action.
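To check the hand computation at $5, a small helper along the lines below can score each legal bet against a converged value vector and break ties toward the smallest bet. The goal and win probability are parameters (and the value vector in main is the fair-coin solution s/8), so they are assumptions to be swapped for whatever the assignment actually uses.

// Greedy action extraction for gambler's ruin, with ties broken toward the smallest bet.
#include <algorithm>
#include <cstdio>
#include <vector>

int bestAction(int s, const std::vector<double> &V, int goal = 8, double pWin = 0.5) {
    int maxBet = std::min(s, goal - s);            // $0 bets are not permitted
    double bestQ = -1.0;
    int bestBet = 0;
    for (int b = 1; b <= maxBet; ++b) {
        double q = pWin * V[s + b] + (1 - pWin) * V[s - b];
        if (q > bestQ + 1e-12) {                   // only a strictly better bet replaces the incumbent,
            bestQ = q;                             // so ties keep the earlier (smaller) bet
            bestBet = b;
        }
    }
    return bestBet;
}

int main() {
    // Example: for a fair coin the values converge to s/8, so every bet at $5
    // ties and the smallest bet ($1) is returned.
    std::vector<double> V = {0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0};
    std::printf("best bet at $5: %d\n", bestAction(5, V));
}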

Office hours
Sunday: 4 – 5 in Thornton Stacks.
Send email to Ben by Saturday at midnight to reserve a slot.
Also make sure you have stepped through your code (say, for the $8 example) to make sure that it is implementing your logic.

Compilation
Just for grins, take your Visual Studio code and compile it using g++:
g++ foo.cpp -o foo -Wall

Partially observable Markov Decision Processes (POMDPs)
Relationship to MDPs
Value and Policy Iteration assume you know a lot about the world:
– current state, action, next state, reward for state, …
In the real world, you don't know exactly what state you're in:
– Is the car in front braking hard or braking lightly?
– Can you successfully kick the ball to your teammate?

Partially observable
Consider not knowing what state you're in…
Go left, left, left, left, left
Go up, up, up, up, up
– You're probably in the upper-left corner
Go right, right, right, right, right

Extending the MDP model
MDPs have an explicit transition function T(s, a, s').
We add O(s, o):
– the probability of observing o when in state s
We add the belief state, b:
– the probability distribution over all possible states
– b(s) = belief that you are in state s
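As a data-structure sketch only (the state and observation names below are invented purely for illustration), the added pieces can be packaged like this:

// The pieces a POMDP adds on top of an MDP, kept as plain maps.
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <tuple>
#include <utility>

using State = std::string;
using Action = std::string;
using Obs = std::string;

struct POMDP {
    std::map<std::tuple<State, Action, State>, double> T;  // T(s, a, s') = P(s' | s, a)
    std::map<std::pair<State, Obs>, double> O;              // O(s, o) = P(observe o | in state s)
    std::map<State, double> R;                               // reward for being in state s
};

// A belief state b is just a probability distribution over states.
using Belief = std::map<State, double>;

int main() {
    Belief b = {{"upper-left", 0.7}, {"upper-right", 0.3}};  // example values, made up
    double total = 0.0;
    for (const auto &entry : b) total += entry.second;
    assert(std::fabs(total - 1.0) < 1e-9);                   // a belief must sum to 1
}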

Two parts to the problem
Figure out what state you're in: use Filtering from Chapter 15.
Figure out what to do in that state: Bellman's equation is useful again.
The optimal action depends only on the agent's current belief state.
Update b(s) and π(s) / U(s) after each iteration.

Selecting an action
α is the normalizing constant that makes the belief state sum to 1.
b' = FORWARD(b, a, o)
The optimal policy maps belief states to actions.
– Note that the n-dimensional belief state is continuous: each belief value is a number between 0 and 1.
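The update FORWARD performs is the standard filtering step from Chapter 15; written out,

    b'(s') = \alpha \, O(s', o) \sum_{s} T(s, a, s') \, b(s)

where the leading factor is the normalizing constant that makes the new belief sum to 1.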

A slight hitch
The previous slide required that you know the outcome o of action a in order to update the belief state.
If the policy is supposed to navigate through belief space, we want to know what belief state we're moving into before executing action a.

Predicting future belief states
Suppose you know action a was performed when in belief state b. What is the probability of receiving observation o?
– b provides a guess about the initial state
– a is known
– Any observation could be realized… any subsequent state could be realized… any new belief state could be realized

Predicting future belief states The probability of perceiving o, given action a and belief state b, is given by summing over all the actual states the agent might reach
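Written out, that sum is

    P(o \mid a, b) = \sum_{s'} O(s', o) \sum_{s} T(s, a, s') \, b(s)

where the inner sum is the probability of reaching s' and O(s', o) is the probability of then observing o there.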

Predicting future belief states
We just computed the odds of receiving o. We want the new belief state.
Let τ(b, a, b') be the belief transition function. Its observation-dependent term, P(b' | b, a, o), is equal to 1 if b′ = FORWARD(b, a, o) and 0 otherwise.

Predicted future belief states
Combining the previous two slides gives a transition model through belief states.
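Concretely, the belief-state transition model is

    \tau(b, a, b') = P(b' \mid b, a) = \sum_{o} P(b' \mid b, a, o) \, P(o \mid a, b)

where P(b' \mid b, a, o) equals 1 when b' = FORWARD(b, a, o) and 0 otherwise, and P(o \mid a, b) is the sum from the previous slide.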

Relating POMDPs to MDPs
We've found a model for transitions through belief states.
– Note MDPs had transitions through states (the real things)
We need a model for rewards based on beliefs.
– Note MDPs had a reward function based on state
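The standard construction takes the expected reward under the belief:

    \rho(b) = \sum_{s} b(s) \, R(s)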

Bringing it all together
We've constructed a representation of POMDPs that makes them look like MDPs.
– Value and Policy Iteration can be used for POMDPs
– The optimal policy π*(b) of the MDP belief-state representation is also optimal for the physical-state POMDP representation

Continuous vs. discrete
Our POMDP in MDP form is continuous.
– Cluster the continuous space into regions and try to solve for approximations within these regions

Final answer to POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
– It's deterministic (it already takes into account the absence of observations)
– It has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …)
– It is successful 86.6% of the time
In general, POMDPs with a few dozen states are nearly impossible to optimize.