Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003.

Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003

Non-deterministic vs. Probabilistic Uncertainty Non-deterministic model ? bac {a,b,c}  decision that is best for worst case ~ Adversarial search ? bac {a(p a ),b(p b ),c(p c )}  decision that maximizes expected utility value Probabilistic model

s0s0 s3s3 s2s2 s1s1 A1 0.20.70.1 1005070  utility of state U(S0) = 100 x 0.2 + 50 x 0.7 + 70 x 0.1 = 20 + 35 + 7 = 62 One State/One Action Example

s0s0 s3s3 s2s2 s1s1 A1 0.20.70.1 1005070 A2 s4s4 0.20.8 80 U1(S0) = 62 U2(S0) = 74 U(S0) = max{U1(S0),U2(S0)} = 74 One State/Two Actions Example

s0s0 s3s3 s2s2 s1s1 A1 0.20.70.1 1005070 A2 s4s4 0.20.8 80 U1(S0) = 62 – 5 = 57 U2(S0) = 74 – 25 = 49 U(S0) = max{U1(S0),U2(S0)} = 57 -5-25 Introducing Action Costs

Example: Finding Juliet A robot, Romeo, is in Charles’ office and must deliver a letter to Juliet Juliet is either in her office, or in the conference room. Without other prior knowledge, each possibility has probability 0.5 The robot’s goal is to minimize the time spent in transit Charles’ off. Juliet’s off. Conf. room 10min 5min 10min

Example: Finding Juliet States are: S0: Romeo in Charles’ office S1: Romeo in Juliet’s office and Juliet here S2: Romeo in Juliet’s office and Juliet not here S3: Romeo in conference room and Juliet here S4: Romeo in conference room and Juliet not here Actions are: GJO (go to Juliet’s office) GCR (go to conference room)

Utility Computation 0 GJO GCR 1 1 2 3 4 3 0 0 0 0 -10 -15 -10 Charles’ off. Juliet’s off. Conf. room 10min 5min 10min

n-Step Decision Process There is a single initial state States reached after i steps are all different from those reached after j  i steps Each state i has a its reward R(i) Each state reached after n steps is terminal The goal is to maximize the sum of rewards 01 2 3

n-Step Decision Process Utility of state i: U (i) = R (i) + max a  k P(k | a.i) U (k) 0123 i Two possible actions from i : a 1 and a 2 a 1 leads to k 11 with probability P 11 or k 12 with probability P 12 P 11 = P(k 11 | a 1.i) P 12 = P(k 12 | a 1.i) a 2 leads to k 21 with probability P 21 or k 22 with probability P 22 U(i) = R(i) + max {P 11 U(k 11 ) + P 12 U(k 12 ), P 21 U(k 21 ) + P 22 U(k 22 )}

n-Step Decision Process For j = n-1, n-2, …, 0 do: For every state s i attained after step j  Compute the utility of s i  Label that state with the corresponding best action Utility of state i: U (i) = R (i) + max a  k P(k | a.i) U (k) Best choice of action at state i:  *(i) = arg max a  k P(k | a.i) U (k) 0123 i Optimal policy

n-Step Decision Process … Utility of state i: U (i) = max a (  k P(k | a.i) U (k) – C a ) Best choice of action at state i:  *(i) = arg max a (  k P(k|a.i) U(k) – C a ) 0123 i … with costs on actions instead of rewards on states

0 GJO GCR 1 1 2 3 4 3 0123 i

0 GJO GCR 1 2 4 3 0123 i

Target Tracking target robot The robot must keep the target in view The target’s trajectory is not known in advance The environment may or may not be known

States Are Indexed by Time State = (robot-position, target-position, time) Action = (stop, up, down, right, left) Outcome of an action = 5 possible states, each with probability 0.2 ([i,j], [u,v], t) ([i+1,j], [u,v], t+1) ([i+1,j], [u-1,v], t+1) ([i+1,j], [u+1,v], t+1) ([i+1,j], [u,v-1], t+1) ([i+1,j], [u,v+1], t+1) right Each state has 25 successors

h-Step Planning Process Planning horizon h “Terminal” states: States where the target is not visible States at depth h Reward function: Target visible  +1 Target not visible  0 Maximizing the sum of rewards ~ maximizing escape time R (state) =  t where 0 <  < 1 discounting

h-Step Planning Process Planning horizon h The planner computes the optimal policy over this tree of states, but only the first step of the policy is executed. Then everything is repeated again … (sliding horizon)

h-Step Planning Process Planning horizon h

h-Step Planning Process Planning horizon h h is chosen such that the computation of the optimal policy over the tree can be computed in one increment of time

Example With No Planner

Example With Planner

Other Example

h-Step Planning Process Planning horizon h The optimal policy over this tree is not the optimal policy that would have been computed if a prior model of the environment had been available, along with an arbitrarily fast computer

0 GJO GCR 1 1 2 3 4 3 0123 i

Simple Robot Navigation Problem In each state, the possible actions are U, D, R, and L

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move)

Probabilistic Transition Model In each state, the possible actions are U, D, R, and L The effect of U is as follows (transition model): With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move) With probability 0.1 the robot moves left one square (if the robot is already in the leftmost row, then it does not move)

Markov Property The transition properties depend only on the current state, not on previous history (how that state was reached)

Sequence of Actions Planned sequence of actions: (U, R) U is executed 2 3 1 4321

Sequence of Actions Planned sequence of actions: (U, R) U is executed R is executed 2 3 1 4321

Sequence of Actions Planned sequence of actions: (U, R) 2 3 1 4321 [3,2]

Sequence of Actions Planned sequence of actions: (U, R) U is executed 2 3 1 4321 [3,2] [4,2][3,3][3,2]

Histories Planned sequence of actions: (U, R) U has been executed R is executed There are 9 possible sequences of states – called histories – and 6 possible final states for the robot! 4321 2 3 1 [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Probability of Reaching the Goal P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2]) 2 3 1 4321 Note importance of Markov property in this derivation P([3,3] | U.[3,2]) = 0.8 P([4,2] | U.[3,2]) = 0.1 P([4,3] | R.[3,3]) = 0.8 P([4,3] | R.[4,2]) = 0.1 P([4,3] | (U,R).[3,2]) = 0.65

Utility Function The robot needs to recharge its batteries [4,3] provides power supply [4,2] is a sand area from which the robot cannot escape [4,3] or [4,2] are terminal states Reward of a terminal state: +1 or -1 Reward of a non-terminal state: -1/25 Utility of a history: sum of rewards of traversed states Goal: Maximize the utility of the history +1 2 3 1 4321

+1 2 3 1 4321 Histories are potentially unbounded and the same state can be reached many times

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2] 2 3 1 4321

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability 2 3 1 4321 [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Utility of an Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories: U =  h U h P(h) 2 3 1 4321 [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Optimal Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories The optimal sequence is the one with maximal utility 2 3 1 4321 [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1]

Optimal Action Sequence +1 Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories The optimal sequence is the one with maximal utility But is the optimal action sequence what we want to compute? 2 3 1 4321 [3,2] [4,2][3,3][3,2] [3,3][3,2][4,1][4,2][4,3][3,1] NO!! Except if sequence is executed blindly (open-loop strategy)

observable state Repeat:  s  sensed state  If s is terminal then exit  a  choose action (given s)  Perform a Reactive Agent Algorithm

Utility of state i: U (i) = max a (  k P(k | a.i) U (k) – C a ) Best choice of action at state i:  *(i) = arg max a (  k P(k|a.i) U(k) – C a ) 0123 i

Policy (Reactive/Closed-Loop Strategy) A policy  is a complete mapping from states to actions +1 2 3 1 4321

Repeat:  s  sensed state  If s is terminal then exit  a   (s)  Perform a Reactive Agent Algorithm

Optimal Policy +1 A policy  is a complete mapping from states to actions The optimal policy  * is the one that always yields a history (ending at a terminal state) with maximal expected utility 2 3 1 4321 Makes sense because of Markov property Note that [3,2] is a “dangerous” state that the optimal policy tries to avoid

Optimal Policy +1 A policy  is a complete mapping from states to actions The optimal policy  * is the one that always yields a history with maximal expected utility 2 3 1 4321 This problem is called a Markov Decision Problem (MDP) How to compute  *?

Utility of state i: U (i) = max a (  k P(k | a.i) U (k) – C a ) Best choice of action at state i:  *(i) = arg max a (  k P(k|a.i) U(k) – C a ) 0123 i

The trick used in target-tracking (indexing state by time) can be applied …  +1 … but would yield a large tree + sub-optimal policy

First-Step Analysis Simulate one step: U (i) = R (i) + max a  k P (k | a.i) U (k)  *(i) = arg max a  k P (k | a.i) U (k) (  Principle of Max Expected Utility) +1 ?

What is the Difference?  *(i) = arg max a  k P (k | a.i) U (k) U (i) = R (i) + max a  k P (k | a.i) U (k) 0123 +1

Value Iteration Initialize the utility of each non-terminal state s i to U 0 (i) = 0 For t = 0, 1, 2, …, do: U t+1 (i)  R (i) + max a  k P (k | a.i) U t (k) +1 2 3 1 4321

Value Iteration Initialize the utility of each non-terminal state s i to U 0 (i) = 0 For t = 0, 1, 2, …, do: U t+1 (i)  R (i) + max a  k P (k | a.i) U t (k) U t ([3,1]) t 0302010 0.611 0.5 0 +1 2 3 1 4321 0.7050.6550.3880.611 0.762 0.8120.8680.918 0.660 Note the importance of terminal states and connectivity of the state-transition graph Not very different from indexing states by time

Policy Iteration Pick a policy  at random

Policy Iteration Pick a policy  at random Repeat: Compute the utility of each state for  U (i) = R (i) +  k P (k |  (i).i) U (k) Set of linear equations (often a sparse system)

Policy Iteration Pick a policy  at random Repeat: Compute the utility of each state for  U (i) = R (i) +  k P (k |  (i).i) U (k) Compute the policy  ’ given these utilities  ’(i) = arg max a  k P (k | a.i) U (k) If  ’ =  then return  else  ’

Application of First Step Analysis Computing the Probability of Folding p fold of a Protein Unfolded stateFolded state p fold 1- p fold HIV integrase

Computation Through Simulation 10K to 30K independent simulations

Capture the stochastic nature of molecular motion by a probabilistic roadmap vivi vjvj P ij

Edge probabilities Follow Metropolis criteria: Self-transition probability: vjvj vivi P ij P ii

F: Folded setU: Unfolded set First-Step Analysis P ij i k j l m P ik P il P im Let f i = p fold (i) After one step: f i = P ii f i + P ij f j + P ik f k + P il f l + P im f m =1  One linear equation per node  Solution gives p fold for all nodes  No explicit simulation run  All pathways are taken into account  Sparse linear system

Partially Observable Markov Decision Problem Uncertainty in sensing: A sensing operation returns multiple states, with a probability distribution

Example: Target Tracking There is uncertainty in the robot’s and target’s positions, and this uncertainty grows with further motion There is a risk that the target escape behind the corner requiring the robot to move appropriately But there is a positioning landmark nearby. Should the robot tries to reduce position uncertainty?

Summary Probabilistic uncertainty Utility function Optimal policy Maximal expected utility Value iteration Policy iteration

Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003.

Similar presentations

Presentation on theme: "Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003.

Similar presentations

Presentation on theme: "Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003."— Presentation transcript:

Similar presentations

About project

Feedback