Slide 1: CS 416 Artificial Intelligence
Lecture 20: Making Complex Decisions (Chapter 17)
Slide 2: Midterm Results
AVG: 72, MED: 75, STD: 12. Rough dividing lines at: 58 (C), 72 (B), 85 (A).
Slide 3: Assignment 1 Results
AVG: 87, MED: 94, STD: 19. How to interpret the grade sheet…
Slide 4: Interpreting the grade sheet
- The tests we ran are listed in the first column.
- The metrics we accumulated are: solution depth, nodes created, nodes accessed, and fringe size.
- All metrics are normalized by dividing by the value obtained using one of the good solutions from last year.
- The first four columns show these normalized metrics averaged across the entire class's submissions.
- The next four columns show these normalized metrics for your submission.
- Example: a value of 1 for "Solution" means your code found a solution at the same depth as last year's solution. The class average for "Solution" might be 1.28 because some submissions searched longer and thus increased the average.
Slide 5: Interpreting the grade sheet
- SLOW = more than 30 seconds to complete. 66% credit was given to reflect partial credit, even though we never obtained firm results.
- N/A = the test would not even launch correctly; it might have crashed or ended without output. 33% credit was given to reflect that N/A frequently occurs when no attempt was made to create an implementation.
- If you have an N/A but you think your code deserves partial credit, let us know.
Slide 6: Gambler's Ruin
- Consider working out examples of gambler's ruin for $4 and $8 by hand.
- Ben created some graphs to show the solution of gambler's ruin for $8.
- $0 bets are not permitted!
Slide 7: $8-ruin using batch update
Converges after three iterations. The value vector is only updated after a complete sweep has finished.
Slide 8: $8-ruin using in-place updating
Convergence occurs more quickly. Updates to the value function occur in place, starting from $1.
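The batch and in-place sweeps can be sketched in a few lines. This is a minimal illustration, not the assignment's required implementation; it assumes p(win) = 0.4 (the win probability that makes the value table on the "Trying it by hand" slide self-consistent) and legal bets of $1 up to min(s, 8 − s).

```python
# Value iteration for the $8 gambler's-ruin example.
# Assumptions: p(win) = 0.4, states $0..$8, reaching $8 is worth 1,
# reaching $0 is worth 0, no discounting, $0 bets are not permitted.

P_WIN = 0.4
GOAL = 8

def value_iteration(in_place=False, tol=1e-9):
    V = [0.0] * (GOAL + 1)
    V[GOAL] = 1.0                            # winning state
    sweeps = 0
    while True:
        source = V if in_place else list(V)  # batch reads a frozen copy
        delta = 0.0
        for s in range(1, GOAL):             # sweep from $1 upward
            best = max(P_WIN * source[s + bet] + (1 - P_WIN) * source[s - bet]
                       for bet in range(1, min(s, GOAL - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best                      # in-place: visible within the sweep
        sweeps += 1
        if delta < tol:
            return V, sweeps

V_batch, n_batch = value_iteration(in_place=False)
V_inplace, n_inplace = value_iteration(in_place=True)
print([round(v, 3) for v in V_batch[1:]])
```

Both variants converge to the same value vector (.064, .16, .256, .4, .496, .64, .784, 1 for $1 through $8), matching the table on the "Trying it by hand" slide; comparing n_batch and n_inplace lets you check the claim that in-place updating converges in fewer sweeps.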
Slide 9: $100-ruin
A more detailed graph than the one provided in the assignment.
Slide 10: Trying it by hand
Assume the value update is working… What's the best action at $5?

  State:  $1    $2    $3    $4    $5    $6    $7    $8
  Value:  .064  .16   .256  .4    .496  .64   .784  1

When tied, pick the smallest action.
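The hand computation can be checked by scoring each legal bet at $5 against the table above. As before, p(win) = 0.4 is an assumption (it is the probability under which the table's numbers are self-consistent):

```python
# Q-value of each legal bet at $5, using the value table from the slide
# and the assumed win probability p = 0.4.
V = {0: 0.0, 1: .064, 2: .16, 3: .256, 4: .4, 5: .496, 6: .64, 7: .784, 8: 1.0}
p = 0.4
s = 5

q = {bet: p * V[s + bet] + (1 - p) * V[s - bet]
     for bet in range(1, min(s, 8 - s) + 1)}      # bets of $1, $2, $3
best = max(q.values())
# when tied, pick the smallest action
action = min(b for b, val in q.items() if abs(val - best) < 1e-9)
print(q)
print(action)
```

Betting $1 and betting $3 tie at 0.496 (betting $2 only scores 0.4672), so the tie-breaking rule picks the $1 bet.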
Slide 11: Office hours
- Sunday: 4 – 5 in Thornton Stacks.
- Send email to Ben (hocking@virginia.edu) by Saturday at midnight to reserve a slot.
- Also make sure you have stepped through your code (say, for the $8 example) to verify that it implements your logic.
Slide 12: Compilation
Just for grins, take your Visual Studio code and compile it using g++:

  g++ foo.cpp -o foo -Wall
Slide 13: Partially Observable Markov Decision Processes (POMDPs)
Relationship to MDPs:
- Value and policy iteration assume you know a lot about the world: current state, action, next state, reward for state, …
- In the real world, you don't know exactly what state you're in:
  - Is the car in front braking hard or braking lightly?
  - Can you successfully kick the ball to your teammate?
Slide 14: Partially observable
Consider not knowing what state you're in…
- Go left, left, left, left, left.
- Go up, up, up, up, up. You're probably in the upper-left corner.
- Go right, right, right, right, right.
Slide 15: Extending the MDP model
MDPs have an explicit transition function T(s, a, s′). For POMDPs:
- We add O(s, o): the probability of observing o when in state s.
- We add the belief state b: a probability distribution over all possible states, where b(s) is the belief that you are in state s.
Slide 16: Two parts to the problem
- Figure out what state you're in: use filtering from Chapter 15.
- Figure out what to do in that state: Bellman's equation is useful again.
The optimal action depends only on the agent's current belief state. Update b(s) and π(s) / U(s) after each iteration.
Slide 17: Selecting an action
The belief update is

  b′(s′) = α O(s′, o) Σ_s T(s, a, s′) b(s)

where α is a normalizing constant that makes the belief state sum to 1; in short, b′ = FORWARD(b, a, o).
The optimal policy maps belief states to actions. Note that the n-dimensional belief state is continuous: each belief value is a number between 0 and 1.
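The FORWARD update can be sketched directly from the formula. The two-state world below (the state names, observation names, and the T and O numbers) is invented purely for illustration:

```python
# Sketch of b' = FORWARD(b, a, o):
#   b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)
# where alpha normalizes the new belief so it sums to 1.

def forward(b, a, o, T, O):
    """Return the updated belief state after doing action a and observing o."""
    unnormalized = {s2: O[s2][o] * sum(T[s][a][s2] * b[s] for s in b)
                    for s2 in b}
    alpha = 1.0 / sum(unnormalized.values())
    return {s2: alpha * p for s2, p in unnormalized.items()}

# Toy two-state example (all numbers made up for illustration):
T = {'s1': {'go': {'s1': 0.7, 's2': 0.3}},
     's2': {'go': {'s1': 0.2, 's2': 0.8}}}
O = {'s1': {'beep': 0.9, 'silence': 0.1},
     's2': {'beep': 0.4, 'silence': 0.6}}
b = {'s1': 0.5, 's2': 0.5}

b_next = forward(b, 'go', 'beep', T, O)   # belief shifts toward s1
```

Hearing the beep (much likelier in s1) shifts the belief toward s1: with these toy numbers, b_next['s1'] works out to 0.648.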
Slide 18: A slight hitch
The previous slide required that you know the observation o received after action a in order to update the belief state. If the policy is supposed to navigate through belief space, we want to know what belief state we're moving into before executing action a.
Slide 19: Predicting future belief states
Suppose you know action a was performed when in belief state b. What is the probability of receiving observation o?
- b provides a guess about the initial state.
- a is known.
- Any observation could be realized… any subsequent state could be realized… any new belief state could be realized.
Slide 20: Predicting future belief states
The probability of perceiving o, given action a and belief state b, is given by summing over all the actual states the agent might reach:

  P(o | a, b) = Σ_{s′} O(s′, o) Σ_s T(s, a, s′) b(s)
Slide 21: Predicting future belief states
We just computed the probability of receiving o; now we want the new belief state. Let τ(b, a, b′) be the belief transition function, built from P(b′ | o, a, b), which equals 1 if b′ = FORWARD(b, a, o) and 0 otherwise.
Slide 22: Predicted future belief states
Combining the previous two slides:

  τ(b, a, b′) = P(b′ | a, b) = Σ_o P(b′ | o, a, b) P(o | a, b)

This is a transition model through belief states.
Slide 23: Relating POMDPs to MDPs
- We've found a model for transitions through belief states. (Note: MDPs had transitions through states, the real things.)
- We need a model for rewards based on beliefs. (Note: MDPs had a reward function based on state.)
Slide 24: Bringing it all together
We've constructed a representation of POMDPs that makes them look like MDPs:
- Value and policy iteration can be used for POMDPs.
- The optimal policy π*(b) of the MDP belief-state representation is also optimal for the physical-state POMDP representation.
Slide 25: Continuous vs. discrete
Our POMDP in MDP form is continuous. One approach: cluster the continuous space into regions and try to solve for approximations within those regions.
Slide 26: Final answer to the POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
- It's deterministic (it already takes into account the absence of observations).
- It has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …).
- It is successful 86.6% of the time.
In general, POMDPs with even a few dozen states are nearly impossible to optimize.