
1 Approximate POMDP planning: Overcoming the curse of history!
Presented by: Joelle Pineau. Joint work with: Geoff Gordon and Sebastian Thrun. Machine Learning Lunch, March 10, 2003.

2 To use or not to use a POMDP
POMDPs provide a rich framework for sequential decision-making, which can model: varying rewards across actions and goals; uncertainty in the action effects; uncertainty in the state of the world.

3 Existing applications of POMDPs
Maintenance scheduling [Puterman, 1994]; Robot navigation [Koenig & Simmons, 1995; Roy & Thrun, 1999]; Helicopter control [Bagnell & Schneider, 2001; Ng et al., 2002]; Dialogue modeling [Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000]; Preference elicitation [Boutilier, 2002]

4 Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function. [Figure: what goes on: hidden states s_{t-1} → s_t]

5 Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function. [Figure: what goes on: states s_{t-1}, s_t, actions a_{t-1}, a_t chosen by a policy π(s), rewards r_{t-1}, r_t]

6 Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function. [Figure: what goes on: states, actions (policy π(s)), rewards; what we see: observations o_{t-1}, o_t]

7 Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function. [Figure: what goes on: states, actions, rewards; what we see: observations o_{t-1}, o_t; what we infer: beliefs b_{t-1}, b_t, with the policy now π(b)]
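The inference step from b_{t-1} to b_t is the standard Bayes filter over the POMDP model (T, O). A minimal sketch in Python, assuming dictionary-based T and O tables (the data layout is illustrative, not from the talk):

```python
def belief_update(b, a, o, states, T, O):
    """Bayes-filter belief update: b'(s2) is proportional to O(s2,a,o) * sum_s T(s,a,s2) * b(s).

    b           : dict state -> probability (current belief)
    T[s][a][s2] : transition probability P(s2 | s, a)
    O[s2][a][o] : observation probability P(o | s2, a)
    """
    new_b = {}
    for s2 in states:
        predicted = sum(T[s][a][s2] * b[s] for s in states)  # prediction step
        new_b[s2] = O[s2][a][o] * predicted                  # correction step
    norm = sum(new_b.values())                               # equals P(o | b, a)
    if norm == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return {s2: p / norm for s2, p in new_b.items()}
```

The normalizer is exactly P(o | b, a), which reappears later in the value-iteration backup over beliefs.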

8 Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1. E.g. let S={s1, s2}. [Figure: the belief simplex is the interval 0 ≤ P(s1) ≤ 1]

9 Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1. E.g. let S={s1, s2, s3}. [Figure: the belief simplex is a triangle in the (P(s1), P(s2)) plane]

10 Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1. E.g. let S={s1, s2, s3, s4}. [Figure: the belief simplex is a tetrahedron over (P(s1), P(s2), P(s3))]

11 The first curse of POMDP planning
The curse of dimensionality: dimension of the belief = # of states, so the dimension of the planning problem = # of states. Related to the MDP curse of dimensionality.

12 Planning for POMDPs
Learning a value function V(b), ∀ b∈B. Learning an action-selection policy π(b), ∀ b∈B.
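In standard form (with γ the discount factor, which is not listed in the tuple above), these two quantities are the belief-MDP Bellman equations, where b^{a,o} denotes the belief obtained from b after taking action a and observing o:

```latex
\[
  V(b) = \max_{a \in A}\Big[\sum_{s \in S} b(s)\,R(s,a)
         + \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o})\Big],
  \qquad
  \pi(b) = \operatorname*{argmax}_{a \in A}\Big[\sum_{s \in S} b(s)\,R(s,a)
         + \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o})\Big].
\]
```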

13 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. [Figure: V0(b) plotted over the belief b = P(s1), with a table of iteration vs. # hyper-planes]

14 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. [Figure: V1(b) plotted over the belief b = P(s1), with a table of iteration vs. # hyper-planes]

15 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. [Figure: V1(b) plotted over the belief b = P(s1), with a table of iteration vs. # hyper-planes]

16 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. Iteration vs. # hyper-planes: 2 → 27. [Figure: V2(b) over b = P(s1)]

17 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. Iteration vs. # hyper-planes: 2 → 27. [Figure: V2(b) over b = P(s1)]

18 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2. Iteration vs. # hyper-planes: 2 → 27, ..., 4 → 14,348,907. [Figure: V2(b) over b = P(s1)]
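The blow-up follows the worst-case recurrence for exact backups, |Γ_{n+1}| = |A| · |Γ_n|^|Ω| (one new hyper-plane per choice of an action and of one old hyper-plane per observation). Checking it for this |A|=3, |Ω|=2 example reproduces the counts on the slide:

```python
# Worst-case count of hyper-planes (alpha-vectors) in exact value iteration:
# |Gamma_{n+1}| = |A| * |Gamma_n| ** |Omega|, starting from a single vector.
A, Omega = 3, 2
count = 1
for n in range(1, 5):
    count = A * count ** Omega
    print(n, count)   # prints: 1 3, 2 27, 3 2187, 4 14348907
```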

19 Properties of exact value iteration
Value function is always piecewise-linear convex. Many hyper-planes can be pruned away. |S|=2, |A|=3, |Ω|=2. Iteration vs. # hyper-planes after pruning: 0 → 1, 1 → 3, 2 → 5, 3 → 9, 4 → 7. [Figure: V2(b) over b = P(s1)]

20 Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8. Iteration vs. # hyper-planes: 0 → 1, 1 → 5, 2 → ????? Not for this problem!

21 The second curse of POMDP planning
The curse of dimensionality: the dimension of each hyper-plane = # of states. The curse of history: the number of hyper-planes grows exponentially with the planning horizon.

22 The second curse of POMDP planning
The curse of dimensionality: the dimension of each hyper-plane = # of states. The curse of history: the number of hyper-planes grows exponentially with the planning horizon. Complexity of POMDP value iteration: [equation, with its "dimensionality" and "history" factors labeled]
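In its usual form, the per-iteration cost of exact value iteration is roughly the product below (a reconstruction; the slide's own equation image is not in the transcript), so the |S|² factor carries the curse of dimensionality and the |V_n|^|Ω| factor the curse of history:

```latex
\[
  \underbrace{|S|^2}_{\text{dimensionality}} \;\cdot\;
  \underbrace{|A|\,|V_n|^{|\Omega|}}_{\text{history}}
\]
```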

23 Possible approximation approaches
- Ignore the belief [Littman et al., 1995]: overcomes both curses; very fast; performs poorly in high-entropy beliefs.
- Discretize the belief [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]: overcomes the curse of history (sort of); scales exponentially with # states.
- Compress the belief [Poupart & Boutilier, 2002; Roy & Gordon, 2002]: overcomes the curse of dimensionality.
- Plan for trajectories [Baxter & Bartlett, 2000; Ng & Jordan, 2002]: can diminish both curses; requires a restricted policy class; local minima, slow-changing gradients.

24 A new algorithm: Point-based value iteration
Main idea: Select a small set of belief points. [Figure: value function V(b) over P(s1), with belief points b0, b1, b2]

25 A new algorithm: Point-based value iteration
Main idea: Select a small set of belief points. Plan for those belief points only. [Figure: V(b) over P(s1), with belief points b0, b1, b2]

26 A new algorithm: Point-based value iteration
Main idea: Select a small set of belief points → focus on reachable beliefs. Plan for those belief points only. [Figure: V(b) over P(s1); belief points b0, b1, b2 reached via (a,o) transitions]

27 A new algorithm: Point-based value iteration
Main idea: Select a small set of belief points → focus on reachable beliefs. Plan for those belief points only → learn the value and its gradient. [Figure: V(b) over P(s1); belief points b0, b1, b2 reached via (a,o) transitions]

28 Point-based value update
[Figure: value function V(b) over P(s1), with belief points b0, b1, b2]

29 Point-based value update
Initialize the value function (…and skip ahead a few iterations). [Figure: Vn(b) over P(s1), with belief points b0, b1, b2]

30 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: [Figure: Vn(b) over P(s1), highlighting one belief point b]

31 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. [Figure: Vn(b) over P(s1), with b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

32 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. [Figure: Vn(b) over P(s1), with b, its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, and the best hyper-plane value at each]

33 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. Sum over observations. [Figure: Vn(b) over P(s1), with b, its successors, and the best value at each]

34 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. Sum over observations. [Figure: Vn(b) over P(s1); the best per-(a,o) values brought back to b]

35 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. Sum over observations. [Figure: Vn+1(b) over P(s1), with one candidate vector per action (b^{a1}, b^{a2}) at b]

36 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. Sum over observations. Max over actions. [Figure: Vn+1(b) over P(s1), with one candidate vector per action (b^{a1}, b^{a2}) at b]

37 Point-based value update
Initialize the value function (…and skip ahead a few iterations). For each b∈B: For each (a,o): project forward b → b^{a,o} and find the best value. Sum over observations. Max over actions. [Figure: Vn+1(b) over P(s1), with belief points b0, b1, b2]
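Putting the three steps together (project, sum over observations, max over actions), one point-based backup at a single belief b can be sketched as below. The array conventions for T, O, R and the helper name are assumptions for illustration, not taken from the talk:

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma):
    """One PBVI backup at belief point b.

    b     : np.ndarray (S,)            current belief point
    Gamma : list of np.ndarray (S,)    alpha-vectors representing V_n
    T     : np.ndarray (A, S, S)       T[a, s, s2] = P(s2 | s, a)
    O     : np.ndarray (A, S, O)       O[a, s2, o] = P(o | s2, a)
    R     : np.ndarray (A, S)          immediate reward R(s, a)
    Returns the alpha-vector that defines V_{n+1} at b.
    """
    n_actions, n_obs = T.shape[0], O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        alpha_a = R[a].astype(float)
        for o in range(n_obs):
            # Project every alpha in V_n back through (T, O) for this (a, o) ...
            projections = [gamma * T[a] @ (O[a, :, o] * alpha) for alpha in Gamma]
            # ... and keep only the projection that is best at this belief point.
            alpha_a = alpha_a + max(projections, key=lambda v: float(v @ b))
        # Max over actions: keep the action whose summed vector is best at b.
        val = float(alpha_a @ b)
        if val > best_val:
            best_vec, best_val = alpha_a, val
    return best_vec
```

Running this backup once for every b∈B turns Vn into Vn+1 with at most |B| vectors, which is what the complexity comparison on the next slide exploits.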

38 Complexity of value update
Complexity of one value update, where S = # states, A = # actions, Ω = # observations, n = # solution vectors at iteration n, B = # belief points:
Step             Exact update   Point-based update
I - Projection   S²AΩn          S²AΩB
II - Sum         S²An^Ω         SAΩB²
III - Max        SAn^Ω          SAB

39 Theoretical properties of point-based updates
Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm, εn = ||V_n^B - V_n^*||, is bounded by:
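The bound has the form below (as given in the accompanying IJCAI-03 PBVI paper), where ε_B measures how densely B samples the set Δ̄ of reachable beliefs under the 1-norm:

```latex
\[
  \varepsilon_n \;\le\; \frac{(R_{\max} - R_{\min})\,\varepsilon_B}{(1-\gamma)^2},
  \qquad
  \varepsilon_B \;=\; \max_{b' \in \bar{\Delta}}\; \min_{b \in B}\; \lVert b - b' \rVert_1 .
\]
```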

40 Back to the full algorithm
Main idea: Select a small set of belief points → PART II. Plan for those belief points only → PART I. [Figure: V(b) over P(s1); belief points b0, b1, b2 reached via (a,o) transitions]

41 Experimental results: Lasertag domain
State space = RobotPosition × OpponentPosition. Observable: RobotPosition (always); OpponentPosition (only if same as Robot). Action space = {North, South, East, West, Tag}. Opponent strategy: move away from robot w/ Pr=0.8. |S|=870, |A|=5, |Ω|=30

42 Performance of PBVI on Lasertag domain
[Figure: Lasertag performance comparison; opponent tagged 59% of trials vs. 17% of trials]

43 Performance on well-known POMDPs
Problems: Maze33 (|S|=36, |A|=5, |Ω|=17), Hallway (|S|=60, |A|=5, |Ω|=20), Hallway2 (|S|=92, |A|=5, |Ω|=17). Methods compared: QMDP, Grid, PBUA, PBVI.
Maze33: Reward 0.198 / 0.94 / 2.30 / 2.25; Time(s) 0.19 / n.v. / 12166 / 3448; B n.a. / 174 / 660 / 470; %Goal 47 / n.v. / 100 / 95
Hallway: Reward 0.261 / n.v. / 0.53; Time(s) 0.51 / n.v. / 450 / 288; B n.a. / 300 / 86; %Goal 22 / 98 / 100
Hallway2: Reward 0.109 / n.v. / 0.35 / 0.34; Time(s) 1.44 / n.v. / 27898 / 360; B n.a. / 337 / 1840 / 95

44 Back to the full algorithm
Main idea: Select a small set of belief points → PART II. Plan for those belief points only → PART I. [Figure: V(b) over P(s1); belief points b0, b1, b2 reached via (a,o) transitions]

45 Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs.

46 Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs. [Figure: over P(s1), belief b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} reached by the labeled (a,o) pairs]

47 Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs. What can we learn from MDP exploration techniques? Select widely-spaced beliefs, rather than near-by beliefs. [Figure: over P(s1), belief b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

48 Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs. What can we learn from MDP exploration techniques? Select widely-spaced beliefs, rather than near-by beliefs. [Figure: over P(s1), belief b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

49 How does PBVI actually select belief points?
Start with B ← {b0}. For any belief point b∈B: [Figure: P(s1) axis with the point b]

50 How does PBVI actually select belief points?
Start with B ← {b0}. For any belief point b∈B: For each action a∈A: generate a new belief b^a by applying a and stochastically picking an observation o. [Figure: P(s1) axis; b^{a1} generated from b via (a1,o2)]

51 How does PBVI actually select belief points?
Start with B ← {b0}. For any belief point b∈B: For each action a∈A: generate a new belief b^a by applying a and stochastically picking an observation o. [Figure: P(s1) axis; b^{a1} and b^{a2} generated from b via (a1,o2) and (a2,o2)]

52 How does PBVI actually select belief points?
Start with B ← {b0}. For any belief point b∈B: For each action a∈A: generate a new belief b^a by applying a and stochastically picking an observation o. Add to B the belief b^a which is farthest away from any b∈B: B ← argmax_{b^a} [ min_{b∈B} Σ_s |b^a(s) - b(s)| ]. [Figure: P(s1) axis; b^{a1} and b^{a2} generated from b]

53 How does PBVI actually select belief points?
Start with B ← {b0}. For any belief point b∈B: For each action a∈A: generate a new belief b^a by applying a and stochastically picking an observation o. Add to B the belief b^a which is farthest away from any b∈B: B ← argmax_{b^a} [ min_{b∈B} Σ_s |b^a(s) - b(s)| ]. Repeat until |B| = set size. [Figure: P(s1) axis; b^{a1} and b^{a2} generated from b]
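A sketch of one expansion pass in Python, reusing the array conventions of the backup sketch above (the single sampled observation per action follows the slide; the helper name and everything else is an illustrative assumption):

```python
import numpy as np

def expand_belief_set(B, T, O, rng=None):
    """One PBVI expansion phase: for each b in B, try every action once with a
    stochastically sampled observation, and keep the successor belief that is
    farthest (1-norm) from the current set B."""
    rng = rng or np.random.default_rng()
    n_actions, n_states, _ = T.shape
    n_obs = O.shape[2]
    new_points = []
    for b in B:
        candidates = []
        for a in range(n_actions):
            s = rng.choice(n_states, p=b)          # sample a state from b
            s2 = rng.choice(n_states, p=T[a, s])   # sample the next state
            o = rng.choice(n_obs, p=O[a, s2])      # sample an observation
            ba = O[a, :, o] * (b @ T[a])           # Bayes update of b under (a, o)
            if ba.sum() > 0:
                candidates.append(ba / ba.sum())
        if candidates:
            # B <- argmax_{b^a} [ min_{b' in B} sum_s |b^a(s) - b'(s)| ]
            farthest = max(candidates,
                           key=lambda c: min(np.abs(c - bp).sum() for bp in B))
            new_points.append(farthest)
    return B + new_points    # |B| roughly doubles each phase
```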

54 The anytime PBVI algorithm
Alternate between: growing the set of belief points (e.g. B doubles in size every time); planning for those belief points. Terminate when you run out of time or have a good policy.

55 The anytime PBVI algorithm
Alternate between: growing the set of belief points (e.g. B doubles in size every time); planning for those belief points. Terminate when you run out of time or have a good policy. Lasertag results: 13 phases, |B|=1334, ran out of time!

56 The anytime PBVI algorithm
Alternate between: growing the set of belief points (e.g. B doubles in size every time); planning for those belief points. Terminate when you run out of time or have a good policy. Lasertag results: 13 phases, |B|=1334, ran out of time! Hallway2 results: 8 phases, |B|=95, found a good policy.
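The anytime loop then just alternates the two sketches above (a rough outline; the pessimistic initialization and the fixed number of backups per phase are assumptions, not details from the talk):

```python
import numpy as np

def anytime_pbvi(b0, T, O, R, gamma, n_phases=8, backups_per_phase=10):
    """Anytime PBVI outline: alternate belief-set expansion and planning,
    reusing point_based_backup and expand_belief_set from the sketches above."""
    B = [np.asarray(b0, dtype=float)]
    # One common initialization: a single pessimistic vector R_min / (1 - gamma).
    Gamma = [np.full(len(b0), R.min() / (1.0 - gamma))]
    for _ in range(n_phases):                  # or: until time runs out
        for _ in range(backups_per_phase):     # plan for the current points
            Gamma = [point_based_backup(b, Gamma, T, O, R, gamma) for b in B]
        B = expand_belief_set(B, T, O)         # grow B (roughly doubling it)
    return Gamma, B
```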

57 Alternative belief expansion heuristics
Compare 4 approaches to belief expansion: Random (RA).

58 Alternative belief expansion heuristics
Compare 4 approaches to belief expansion: Random (RA); Stochastic Simulation with Random Action (SSRA). [Figure: over P(s1), successors b^{a1,o2} and b^{a2,o2} of b]

59 Alternative belief expansion heuristics
Compare 4 approaches to belief expansion: Random (RA); Stochastic Simulation with Random Action (SSRA); Stochastic Simulation with Greedy Action (SSGA), π(b)=a2. [Figure: over P(s1), successor b^{a2,o2} of b]

60 Validation of the belief expansion heuristic
Compare 4 approaches to belief expansion: Random (RA); Stochastic Simulation with Random Action (SSRA); Stochastic Simulation with Greedy Action (SSGA); Stochastic Simulation with Explorative Action (SSEA), max_a ||b - b^a||. [Figure: over P(s1), successors b^{a1,o2} and b^{a2,o2} of b]

61 Validation of the belief expansion heuristic
Hallway domain: |S|=60, |A|=5, |Ω|=20

62 Validation of the belief expansion heuristic
Hallway2 domain: |S|=92, |A|=5, |Ω|=17

63 Validation of the belief expansion heuristic
Tag domain: |S|=870, |A|=5, |Ω|=30

64 Summary
POMDPs suffer from the curse of history: the # of beliefs grows exponentially with the planning horizon. PBVI addresses the curse of history by limiting planning to a small set of likely beliefs. Strengths of PBVI include: anytime algorithm; polynomial-time value updates; bounded approximation error; empirical results showing we can solve problems up to ~1000 states.

