1 Automated Planning and Decision Making, Prof. Ronen Brafman (2007): Fully Observable MDP

2 Markov Decision Process (MDP)
○ S: the state space.
○ A: the set of actions.
○ Tr: S×A → Π[S], a probability distribution over S. Example of the transition function: Tr(s, a, s') = p is the probability of reaching s' from s with action a (s is the state before, a the action, s' the state after).
○ R: S×A → ℝ, the reward for doing action a ∈ A at state s.
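Not from the slides: a minimal Python sketch of one way to store this tuple, assuming states and actions are indexed by integers and Tr, R are dense numpy arrays (the names MDP, Tr, R, gamma are illustrative assumptions).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    Tr: np.ndarray        # Tr[s, a, s'] = probability of reaching s' from s with a
    R: np.ndarray         # R[s, a]      = reward for doing a in state s
    gamma: float = 0.9    # discount factor, example value (used from slide 7 onward)

    def __post_init__(self):
        # every Tr[s, a, :] must be a probability distribution over S
        assert np.allclose(self.Tr.sum(axis=2), 1.0)
```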

3 Example of an MDP
S = {s1, …, s11}
A = {↑, ↓, ←, →}
Uncertainty on movement: the intended direction succeeds with probability 0.8, and the agent slips to each of the two perpendicular directions with probability 0.1 (for example when moving up).
An attempt to move into a water square keeps you in place.
[diagram: a grid of states s1…s11 with a +1 square and a -1 square]
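A hedged sketch of the movement model just described (0.8 to the intended direction, 0.1 to each perpendicular one, blocked moves stay in place). The helpers step(s, d) and perpendicular(d) are hypothetical, and the exact grid layout of the slide is not reconstructed.

```python
def transition(s, a, step, perpendicular):
    """Distribution over next states for moving in direction a from s.

    step(s, d) is a hypothetical deterministic move that returns s itself when
    the move is blocked (e.g. into a water square); perpendicular(a) returns
    the two directions orthogonal to a.
    """
    probs = {}
    for d, p in [(a, 0.8)] + [(d, 0.1) for d in perpendicular(a)]:
        s2 = step(s, d)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```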

4 How long will the agent be acting?
Finite horizon
○ The agent takes n actions, for some fixed n.
Unbounded horizon
○ The agent takes a finite number of actions, but n is not known ahead of time.
Infinite horizon
○ The agent keeps acting forever (∞ steps forward).

5 What is a “plan”?
A classical plan is a sequence of actions a1, a2, …, an. That can't work here:
○ If I performed a1, it is not clear that I will be able to perform a2, because I don't know what state I will be in.
○ Even if I can execute a2 after a1, depending on the outcome it is not clear that a2 is the best action.
○ However, I can observe the state once I'm done with a1.
Solution:
○ History h: the sequence of all past states and actions, s0, a1, s1, a2, s2, …, an, sn.
○ A policy p maps a history h to an action, p(h) ∈ A. It tells us what to do as a function of the history.

6 Infinite Horizon: Stationary Policy
Our model is Markovian, so we should not care about past history. But what about time?
When the horizon is infinite, time is meaningless: we always have infinitely many steps left to follow. (What happens when the horizon is finite?)
If we examine the search tree, nodes representing the same state at different levels have isomorphic sub-trees.
This is why a policy can be taken to be a function p: S → A.
[diagram: two occurrences of the same state S_i at different levels, with isomorphic sub-trees]

7 Infinite Horizon
We want the best policy. How do we evaluate a policy?
○ Consider a trajectory derived by some policy:
Discounted reward: ∑_{i=1..∞} γ^i R(s_i, a_i), with 0 < γ < 1.
Average reward: lim_{n→∞} (1/n) ∑_{i=1..n} R(s_i, a_i); when the limit does not exist, take the lower bound over converging subsequences, liminf_{n→∞} (1/n) ∑_{i=1..n} R(s_i, a_i).
○ The value of a policy is the expected (average) value over all possible trajectories.
[diagram: an infinite trajectory s1, a1, s2, a2, s3, a3, s4, …]
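As a small illustration (mine, not the slides'), the discounted return of one sampled finite prefix of a trajectory:

```python
def discounted_return(rewards, gamma):
    """rewards[i] = R(s_i, a_i) along one sampled trajectory, 0 < gamma < 1."""
    return sum(gamma ** i * r for i, r in enumerate(rewards, start=1))

# e.g. three steps of reward -0.04 with gamma = 0.9:
# discounted_return([-0.04, -0.04, -0.04], 0.9)  ->  about -0.0976
```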

8 Infinite Horizon, Discounted Reward: Value Function
V_p: S → ℝ, the value function: the expected discounted sum of rewards, with respect to an initial state s and the policy p.
Claim: ∃ p* such that ∀ s ∈ S and for every policy p, V_{p*}(s) ≥ V_p(s).
Intuitive explanation: take two policies p1, p2 and two states s1, s2, where p1 is better at s1 (V_{p1}(s1) ≥ V_{p2}(s1)) and p2 is better at s2 (V_{p2}(s2) ≥ V_{p1}(s2)). The policy that acts like p1 at s1 and like p2 at s2 is at least as good as both, and the number of (stationary, deterministic) policies is finite.

9 Infinite Horizon: Value Function
How do we calculate V_p?
○ V_p(s1) = R(s1, p(s1)) + γ ∑_{s'∈S} Tr(s1, p(s1), s') V_p(s')
○ V_p(s2) = R(s2, p(s2)) + γ ∑_{s'∈S} Tr(s2, p(s2), s') V_p(s')
○ …
○ V_p(sn) = R(sn, p(sn)) + γ ∑_{s'∈S} Tr(sn, p(sn), s') V_p(s')
Now solve n linear equations in n variables (n = |S|).
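A sketch of this computation in numpy, reusing the assumed arrays R[s, a] and Tr[s, a, s'] from the earlier sketch: the n equations are (I - γ·Tr_p)·V_p = R_p, which one linear solve handles.

```python
import numpy as np

def evaluate_policy(R, Tr, gamma, p):
    """p[s] is the action the policy chooses at s; returns the vector V_p."""
    n = R.shape[0]
    Tr_p = Tr[np.arange(n), p, :]   # Tr_p[s, s'] = Tr(s, p(s), s')
    R_p = R[np.arange(n), p]        # R_p[s]      = R(s, p(s))
    # V_p = R_p + gamma * Tr_p V_p  <=>  (I - gamma * Tr_p) V_p = R_p
    return np.linalg.solve(np.eye(n) - gamma * Tr_p, R_p)
```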

10 Infinite Horizon: From Value to Policy
Denote the optimal policy by p* and its value function by V*.
Given V* we can derive p*(s) for every s ∈ S:
p*(s) = argmax_{a∈A} [R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V*(s')]
[diagram: a state S_i whose actions lead to successors S_j, S_k, S_l, S_m with probabilities such as 0.3, 0.7, 0.5]
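A short sketch of this extraction with the same assumed arrays; q[s, a] below is the bracketed expression above.

```python
import numpy as np

def greedy_policy(R, Tr, gamma, v):
    q = R + gamma * Tr @ v      # q[s, a] = R(s, a) + gamma * sum_s' Tr(s, a, s') v(s')
    return q.argmax(axis=1)     # the maximizing action for every state
```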

11 MDP Algorithms
Value Iteration
○ “Finding V* without knowing p*”, using the Bellman equations.
○ V*(s) = max_{a∈A} [R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V*(s')]
○ Because of the max this is no longer a system of n linear equations, so what now?
○ Iterate the solution!
V*_0(s) = max_{a∈A} R(s, a)
…
V*_{i+1}(s) = max_{a∈A} [R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V*_i(s')]
○ Claim: V*_i → V* as i → ∞.
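A sketch of the iteration with the same assumed arrays, stopping when successive estimates are close (the stopping rule is my addition, not the slide's).

```python
import numpy as np

def value_iteration(R, Tr, gamma, eps=1e-6):
    v = R.max(axis=1)                              # V*_0(s) = max_a R(s, a)
    while True:
        v_next = (R + gamma * Tr @ v).max(axis=1)  # one Bellman backup
        if np.max(np.abs(v_next - v)) < eps:
            return v_next                          # approximately V*
        v = v_next
```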

12 MDP Algorithms
Value Iteration, Key Points
○ Given V* it is easy to find p*.
○ The Bellman equations define V* directly.
○ Using the iterative calculation it is possible to estimate V*.

13 MDP Algorithms
Policy Iteration (Howard)
○ “Start at some arbitrary policy and improve it.”
○ Algorithm:
1 (Initial policy) Choose an initial policy p.
2 (Policy evaluation) Evaluate V_p.
3 (Policy improvement) If there is a state s and an action a ≠ p(s) for which V_p(s) < R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V_p(s'), then
3.1 set p(s) = a
3.2 go to 2.
4 End.
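A sketch of the loop with the same assumed arrays, reusing evaluate_policy and greedy_policy from the earlier sketches; it improves every state at once rather than one state per pass, a common variant of step 3.

```python
import numpy as np

def policy_iteration(R, Tr, gamma):
    p = np.zeros(R.shape[0], dtype=int)          # 1. arbitrary initial policy
    while True:
        v = evaluate_policy(R, Tr, gamma, p)     # 2. policy evaluation
        p_new = greedy_policy(R, Tr, gamma, v)   # 3. policy improvement
        if np.array_equal(p_new, p):             # no state can be improved
            return p, v                          # 4. end: p is optimal
        p = p_new
```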

14 MDP Algorithms
Policy Iteration, Pseudo-Proof
Assume the algorithm ended, so there is nothing more to improve:
∀ s ∈ S, ∀ a ∈ A: V_p(s) ≥ R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V_p(s')
⇒ V_p(s) = max_{a∈A} [R(s, a) + γ ∑_{s'∈S} Tr(s, a, s') V_p(s')]
⇒ V_p = V* ⇒ p = p*

15 Linear Programming
Objective: min/max ∑_{i=1..n} c_i x_i
Constraints: linear inequalities such as
a_1 x_1 + a_2 x_2 + … + a_n x_n ≥ c
b_1 x_1 + b_2 x_2 + … + b_n x_n ≥ d
The x_i are real, non-negative variables.
A solution is an assignment to x_1..x_n that satisfies all the constraints and optimizes the objective.

16 Linear Programming: Finding the Optimal Value
The variables are V*(s_i) for s_i ∈ S:
minimize ∑_i V*(s_i) subject to
○ V*(s1) ≥ R(s1, a1) + γ ∑_{s'∈S} Tr(s1, a1, s') V*(s')
  V*(s1) ≥ R(s1, a2) + γ ∑_{s'∈S} Tr(s1, a2, s') V*(s')
  …
○ V*(s2) ≥ R(s2, a1) + γ ∑_{s'∈S} Tr(s2, a1, s') V*(s')
  V*(s2) ≥ R(s2, a2) + γ ∑_{s'∈S} Tr(s2, a2, s') V*(s')
  …
○ …
Complexity? Poly(|S|·|A|)
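A sketch of this LP using scipy.optimize.linprog with the same assumed arrays; each (s, a) pair contributes one constraint, rewritten into the A_ub·x ≤ b_ub form that linprog expects.

```python
import numpy as np
from scipy.optimize import linprog

def lp_optimal_values(R, Tr, gamma):
    """Minimize sum_s V(s) s.t. V(s) >= R(s,a) + gamma * sum_s' Tr(s,a,s') V(s')."""
    n_s, n_a = R.shape
    c = np.ones(n_s)                    # objective: minimize the sum of values
    A_ub, b_ub = [], []
    for s in range(n_s):
        for a in range(n_a):
            # rewrite the constraint as  (gamma * Tr(s,a,:) - e_s) . V <= -R(s,a)
            row = gamma * Tr[s, a, :].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_s, method="highs")
    return res.x                        # the optimal values V*(s)
```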

17 Finite Horizon
The search space is an AND/OR tree.
Recall the previous example:
S = {s1, …, s11}
A = {↑, ↓, ←, →}
R(s, a) =
○ -0.04 + 1 if s = s4
○ -0.04 - 1 if s = s7
○ -0.04 + 0 otherwise
Uncertainty on movement: the intended direction succeeds with probability 0.8, each perpendicular direction with probability 0.1.
An attempt to move into a water square keeps you in place.
[diagram: the same grid of states s1…s11 with the +1 and -1 squares]

18 Finite Horizon: AND/OR Tree
Behavior is defined by the choices made at the OR nodes.
[diagram: an AND/OR tree rooted at S8. At each OR node (a state such as S8, S5, S9, S1) an action ←, →, ↑ or ↓ is chosen; each AND node (an action) branches on its outcomes with probabilities 0.8 and 0.1. A backed-up value of -0.08 is shown.]

19 Finite Horizon: How do we solve the MDP using the tree?
Backward induction
○ Evaluate the node values bottom-up.
○ At an OR node, take the maximal value among its children.
○ At an AND node, take the expected value (probability-weighted average) of its children.
○ This can be wasteful: we evaluate all paths! Instead, identical states (nodes) at the same level get the same value.
Key point: the value of a node depends only on its sub-tree (what can be done from now on) and not on the “history” that got us to it (the parent tree).

20 Finite Horizon: Algorithm for Solving the MDP
1 Build an AND/OR tree T rooted at the initial state.
2 Assign the leaves their values according to R.
3 For each level L above the leaves, bottom-up:
3.1 Compute the immediate reward plus the expected value of the children, assign it to the node, and record the maximizing action:
V(s, t) = max_{a∈A} [R(s, a) + ∑_{s'} Tr(s, a, s') V(s', t+1)]
where t is the number of steps made and s is the state.
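A sketch of this backward induction carried out over (state, steps-remaining) pairs instead of the explicit tree, so that identical states at the same level share one value, as slide 19 suggests (array names as in the earlier sketches).

```python
import numpy as np

def finite_horizon(R, Tr, horizon):
    n_s, _ = R.shape
    v = np.zeros(n_s)                   # value with 0 steps left to act
    policy = []
    for _ in range(horizon):
        q = R + Tr @ v                  # q[s, a] = R(s, a) + sum_s' Tr(s, a, s') v(s')
        policy.append(q.argmax(axis=1)) # best action with one more step to go
        v = q.max(axis=1)
    policy.reverse()                    # policy[t] = action to take at step t
    return v, policy                    # v[s] = best value of s with `horizon` steps
```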

21 AO*: A* for AND/OR Trees
Still, we would have to evaluate the entire tree, and many paths are irrelevant.
Using forward heuristic search, we can focus on the relevant parts of the tree.
For non-leaf nodes we need a heuristic function.
At every step, choose which node to expand next (starting at the root).
Using an admissible heuristic function, we can prune early.

22 AO*
Maintain an open list O of unexpanded nodes. Initially O contains the initial state.
1 At each step:
1.1 Choose an open node in the sub-tree that matches the “current optimal” policy.
1.2 Expand it one step and add the new non-terminal leaves to O.
1.3 Compute the new optimal policy for the new tree.
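A hedged sketch of this loop (mine, not the course's code). The generative helpers actions(s), outcomes(s, a) returning (probability, next state) pairs, reward(s, a), and the heuristic h(s) are assumed to be supplied by the caller; instead of an explicit open list, the sketch re-finds an open node by walking the current best partial policy from the root.

```python
from dataclasses import dataclass, field

@dataclass
class AndNode:                 # an action: nature picks the outcome
    q: float = 0.0
    children: list = field(default_factory=list)   # [(probability, OrNode), ...]

@dataclass
class OrNode:                  # a state: the agent picks an action
    state: object
    depth: int
    value: float               # heuristic estimate at first, backed-up value later
    children: dict = field(default_factory=dict)   # action -> AndNode
    expanded: bool = False

def has_open(node, horizon):
    """Is there still an unexpanded node under this node's best partial policy?"""
    if node.depth >= horizon:
        return False
    if not node.expanded:
        return True
    best = max(node.children.values(), key=lambda an: an.q)
    return any(has_open(c, horizon) for _, c in best.children)

def ao_star(s0, horizon, actions, outcomes, reward, h):
    root = OrNode(s0, 0, h(s0))
    while has_open(root, horizon):
        # 1.1 follow the current best partial policy down to an open node
        node, path = root, []
        while node.expanded:
            best = max(node.children.values(), key=lambda an: an.q)
            node, path = next(c for _, c in best.children
                              if has_open(c, horizon)), path + [node]
        # 1.2 expand it one step; new non-terminal leaves get heuristic values
        for a in actions(node.state):
            an = AndNode()
            for prob, s2 in outcomes(node.state, a):
                val = h(s2) if node.depth + 1 < horizon else 0.0
                an.children.append((prob, OrNode(s2, node.depth + 1, val)))
            node.children[a] = an
        node.expanded = True
        # 1.3 recompute the values (and hence the optimal policy) up to the root
        for n in [node] + path[::-1]:
            for a, an in n.children.items():
                an.q = reward(n.state, a) + sum(p * c.value for p, c in an.children)
            n.value = max(an.q for an in n.children.values())
    return root   # root.value is optimal; follow max-q AndNodes to read the policy
```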

23 AO* Pruning
[diagram: the sub-tree marked “currently optimal” caused the expansion of one sub-tree; after that expansion a different sub-tree became “optimal”, and the portion of the tree that must be examined is greatly reduced]

24 AND/OR Graph
[diagram: an AND/OR graph for a rover domain with states built from At(Start), At(R1), At(R2), At(R3), HavePic(R1), HavePic(R2), HavePic(R3) and Lost, connected by the actions Navigate(Start, R1), Navigate(Start, R2), Navigate(R1, R3), TakePic(R1), TakePic(R2), TakePic(R3)]

25 A Possible Solution
[diagram: the same AND/OR graph with one sub-graph highlighted as a possible solution (policy)]

26–29 How to Solve: Dynamic Programming
[diagrams, four slides: the rover AND/OR graph annotated with rewards ($10, $20, $8) and outcome probabilities 0.75 / 0.25. Values are backed up bottom-up: terminal leaves get V = 0, action nodes get Q-values (e.g. Q = 6, 8, 12, 15, 16, 20), and each OR node takes the maximal Q among its children (e.g. V = 6, 8, 15, 16, 20), until the best choice at the root is determined.]

30–40 AO*
[diagrams, eleven slides: a step-by-step AO* run on the same rover graph, with nodes marked open, closed, or terminal. Initially only the root and its two Navigate actions are present, with heuristic estimates H = 20 and H = 24 on the unexpanded successors. At each step the open node under the currently best partial policy is expanded, its new leaves receive heuristic values (e.g. H = 6, H = 8), rewards ($10, $20) enter the backups, and the V and Q values along the path are recomputed (e.g. V = 15, 16, 18, 20). The run ends with DONE on slide 40, once the best policy contains no open nodes.]

