1
Approximate POMDP planning: Overcoming the curse of history!
Presented by: Joelle Pineau Joint work with: Geoff Gordon and Sebastian Thrun Machine Learning Lunch - March 10, 2003
2
To use or not to use a POMDP
POMDPs provide a rich framework for sequential decision-making, which can model:
- varying rewards across actions and goals
- uncertainty in the action effects
- uncertainty in the state of the world
3
Existing applications of POMDPs
- Maintenance scheduling [Puterman, 1994]
- Robot navigation [Koenig & Simmons, 1995; Roy & Thrun, 1999]
- Helicopter control [Bagnell & Schneider, 2001; Ng et al., 2002]
- Dialogue modeling [Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000]
- Preference elicitation [Boutilier, 2002]
4
Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }:
S = state set
A = action set
Ω = observation set
b(s) = initial belief
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
[Figure, build 1. What goes on: hidden states s_t-1, s_t]
5
Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }:
S = state set
A = action set
Ω = observation set
b(s) = initial belief
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
[Figure, build 2. What goes on: hidden states s_t-1, s_t and rewards r_t-1, r_t (functions of the state); actions a_t-1, a_t]
6
Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }:
S = state set
A = action set
Ω = observation set
b(s) = initial belief
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
[Figure, build 3. What goes on: states s_t-1, s_t, observations o_t-1, o_t, rewards r_t-1, r_t; what we see: actions a_t-1, a_t]
7
Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }:
S = state set
A = action set
Ω = observation set
b(s) = initial belief
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
[Figure, build 4. What goes on: states s_t-1, s_t, observations o_t-1, o_t, rewards r_t-1, r_t; what we see: actions a_t-1, a_t (chosen as a function of the belief); what we infer: beliefs b_t-1, b_t]
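To make the tuple concrete, here is a minimal sketch of how the model components could be stored as dense arrays (the NumPy layout and names are illustrative assumptions, not the presenters' code):

    import numpy as np

    class POMDP:
        """Tabular POMDP {S, A, Omega, b, T, O, R} stored as dense arrays."""
        def __init__(self, n_states, n_actions, n_obs):
            self.n_states, self.n_actions, self.n_obs = n_states, n_actions, n_obs
            self.b0 = np.full(n_states, 1.0 / n_states)           # initial belief b(s), uniform placeholder
            self.T = np.zeros((n_actions, n_states, n_states))    # T[a, s, s'] = P(s' | s, a)
            self.O = np.zeros((n_actions, n_states, n_obs))       # O[a, s', o] = P(o | s', a)
            self.R = np.zeros((n_actions, n_states))               # R[a, s]     = immediate reward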
8
Understanding the belief state
A belief is a probability distribution over states: b(s) ≥ 0 and Σ_s b(s) = 1, where Dim(B) = |S|-1
E.g. let S = {s1, s2}
[Figure: 1-D belief simplex, axis P(s1) from 0 to 1]
9
Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1
E.g. let S = {s1, s2, s3}
[Figure: 2-D belief simplex, axes P(s1) and P(s2), each from 0 to 1]
10
Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1
E.g. let S = {s1, s2, s3, s4}
[Figure: 3-D belief simplex, axes P(s1), P(s2), P(s3)]
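Keeping the belief up to date is a Bayes filter; the sketch below (same hypothetical array layout as above) is also the b → b^{a,o} projection used later in the talk:

    def belief_update(pomdp, b, a, o):
        """Return b^{a,o}, with b^{a,o}(s') proportional to O(s',a,o) * sum_s T(s,a,s') b(s)."""
        predicted = pomdp.T[a].T @ b               # sum_s T(s,a,s') b(s), indexed by s'
        unnormalized = pomdp.O[a, :, o] * predicted
        norm = unnormalized.sum()                  # equals P(o | b, a)
        return unnormalized / norm if norm > 0 else b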
11
The first curse of POMDP planning
The curse of dimensionality:
- dimension of the belief = # of states
- dimension of the planning problem = # of states
- related to the MDP curse of dimensionality
12
Planning for POMDPs
Learning a value function V(b), ∀b ∈ B:
Learning an action-selection policy π(b), ∀b ∈ B:
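Both objects satisfy the standard Bellman equation over beliefs (a sketch in the notation of the previous slides; γ is the discount factor and b^{a,o} is the belief after taking a and observing o):

    V(b)    = \max_{a \in A} \Big[ \sum_{s} b(s)\,R(s,a) + \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o}) \Big]
    \pi(b)  = \arg\max_{a \in A} \Big[ \sum_{s} b(s)\,R(s,a) + \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o}) \Big]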
13
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Figure: V0(b) plotted over the belief b = P(s1); table of iteration vs. # hyper-planes]
14
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Figure: V1(b) plotted over the belief b = P(s1); table of iteration vs. # hyper-planes]
15
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Figure: V1(b) plotted over the belief b = P(s1); table of iteration vs. # hyper-planes]
16
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration 2: 27 hyper-planes
[Figure: V2(b) plotted over the belief b = P(s1)]
17
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration 2: 27 hyper-planes
[Figure: V2(b) plotted over the belief b = P(s1)]
18
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration 2: 27 hyper-planes, growing to 14,348,907 within a few more iterations
[Figure: V2(b) plotted over the belief b = P(s1)]
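These counts are consistent with the standard recurrence for un-pruned exact backups, where every new hyper-plane pairs one action with one previous hyper-plane per observation:

    |\Gamma_{n+1}| = |A| \cdot |\Gamma_n|^{|\Omega|},
    \qquad |A|=3,\ |\Omega|=2: \quad 1 \to 3 \to 27 \to 2187 \to 14{,}348{,}907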
19
Properties of exact value iteration
The value function is always piecewise-linear and convex.
Many hyper-planes can be pruned away.
|S|=2, |A|=3, |Ω|=2
Iteration vs. # hyper-planes (after pruning): 0 → 1, 1 → 3, 2 → 5, 3 → 9, 4 → 7, …
[Figure: V2(b) plotted over the belief b = P(s1)]
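Piecewise-linear convexity means Vn can be represented by a finite set of hyper-planes (α-vectors) Γn and evaluated at any belief as a maximum of dot products:

    V_n(b) = \max_{\alpha \in \Gamma_n} \sum_{s} \alpha(s)\, b(s)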
20
Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8
Iteration vs. # hyper-planes: 0 → 1, 1 → 5, 2 → ?????, …
Not for this problem!
21
The second curse of POMDP planning
The curse of dimensionality: the dimension of each hyper-plane = # of states.
The curse of history: the number of hyper-planes grows exponentially with the planning horizon.
22
The second curse of POMDP planning
The curse of dimensionality: the dimension of each hyper-plane = # of states.
The curse of history: the number of hyper-planes grows exponentially with the planning horizon.
Complexity of one exact POMDP value iteration backup: roughly S²·A·n^|Ω| (n = # hyper-planes at the previous iteration); the S² factor is the curse of dimensionality, the n^|Ω| factor is the curse of history.
23
Possible approximation approaches
Ignore the belief [Littman et al., 1995]: overcomes both curses; very fast; performs poorly in high-entropy beliefs.
Discretize the belief [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]: overcomes the curse of history (sort of); scales exponentially with # states.
Compress the belief [Poupart & Boutilier, 2002; Roy & Gordon, 2002]: overcomes the curse of dimensionality.
Plan for trajectories [Baxter & Bartlett, 2000; Ng & Jordan, 2002]: can diminish both curses; requires a restricted policy class; local minima, slow-changing gradients.
24
A new algorithm: Point-based value iteration
Main idea: Select a small set of belief points.
[Figure: V(b) over b = P(s1), with belief points b0, b1, b2]
25
A new algorithm: Point-based value iteration
Main idea:
- Select a small set of belief points
- Plan for those belief points only
[Figure: V(b) over b = P(s1), with belief points b0, b1, b2]
26
A new algorithm: Point-based value iteration
Main idea:
- Select a small set of belief points
  - Focus on reachable beliefs
- Plan for those belief points only
[Figure: V(b) over b = P(s1); belief points b0, b1, b2 linked by (a,o) transitions]
27
A new algorithm: Point-based value iteration
Main idea:
- Select a small set of belief points
  - Focus on reachable beliefs
- Plan for those belief points only
  - Learn the value and its gradient
[Figure: V(b) over b = P(s1); belief points b0, b1, b2 linked by (a,o) transitions]
28
Point-based value update
[Figure: V(b) over b = P(s1), with belief points b0, b1, b2]
29
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
[Figure: Vn(b) over b = P(s1), with belief points b0, b1, b2]
30
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
[Figure: Vn(b) over b = P(s1), focused on a single belief point b]
31
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
[Figure: Vn(b) over b = P(s1); b projected forward to b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]
32
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
[Figure: best value of Vn at each projected point b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, over b = P(s1)]
33
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
  Sum over observations:
[Figure: best value of Vn at each projected point b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, over b = P(s1)]
34
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
  Sum over observations:
[Figure: the per-observation values combined back at the original point b, over b = P(s1)]
35
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
  Sum over observations:
[Figure: the two action-conditional backed-up values (for a1 and a2) at b, on Vn+1 over b = P(s1)]
36
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
  Sum over observations:
  Max over actions:
[Figure: the two action-conditional backed-up values (for a1 and a2) at b, on Vn+1 over b = P(s1)]
37
Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
  For each (a,o): project forward b → b^{a,o} and find the best value:
  Sum over observations:
  Max over actions:
[Figure: the updated Vn+1(b) over b = P(s1), with belief points b0, b1, b2]
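Putting the three steps together, here is a compact sketch of one point-based backup at a single belief point (hypothetical helper names, reusing the array layout and belief conventions from the earlier sketches; gamma is the discount factor):

    import numpy as np

    def point_based_backup(pomdp, b, alphas, gamma=0.95):
        """One backup at belief b: project through (a,o), sum over observations, max over actions.
        `alphas` is the current set Gamma_n of alpha-vectors (length-|S| arrays)."""
        best_value, best_alpha = -np.inf, None
        for a in range(pomdp.n_actions):
            alpha_a = pomdp.R[a].copy()                        # immediate reward for action a
            for o in range(pomdp.n_obs):
                # Project each alpha-vector: alpha^{a,o}(s) = gamma * sum_{s'} T(s,a,s') O(s',a,o) alpha(s')
                projected = [gamma * pomdp.T[a] @ (pomdp.O[a, :, o] * alpha) for alpha in alphas]
                # "Find best value" at b for this (a,o), then sum over observations.
                alpha_a += max(projected, key=lambda v: v @ b)
            if alpha_a @ b > best_value:                       # max over actions, evaluated at b
                best_value, best_alpha = alpha_a @ b, alpha_a
        return best_alpha

Running this once for every b in B yields the (at most) |B| hyper-planes that make up Vn+1.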
38
Complexity of value update
Cost of one value update, step by step:
  I - Projection: exact S²·A·Ω·n | point-based S²·A·Ω·B
  II - Sum: exact S²·A·n^Ω | point-based S·A·Ω·B²
  III - Max: exact S·A·n^Ω | point-based S·A·B
where: S = # states, A = # actions, Ω = # observations, n = # solution vectors at iteration n, B = # belief points
39
Theoretical properties of point-based updates
Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm, ε_n = ||V_n^B - V_n^*||∞, is bounded by:
  ε_n ≤ (R_max - R_min) · ε_B / (1 - γ)²
where ε_B is the maximum distance from any reachable belief to its closest point in B.
40
Back to the full algorithm
Main idea:
- Select a small set of belief points (PART II)
- Plan for those belief points only (PART I)
[Figure: V(b) over b = P(s1); belief points b0, b1, b2 linked by (a,o) transitions]
41
Experimental results: Lasertag domain
State space = RobotPosition × OpponentPosition
Observable: RobotPosition (always); OpponentPosition (only when it coincides with the robot's position)
Action space = {North, South, East, West, Tag}
Opponent strategy: move away from the robot w/ Pr = 0.8
|S|=870, |A|=5, |Ω|=30
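A tiny sketch of the observation rule stated above (the function name and tuple encoding are illustrative assumptions, not the actual domain definition):

    def lasertag_observation(robot_pos, opponent_pos):
        """The robot always observes its own position; the opponent only when co-located."""
        if opponent_pos == robot_pos:
            return (robot_pos, opponent_pos)   # opponent observed
        return (robot_pos, None)               # opponent not seen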
42
Performance of PBVI on Lasertag domain
[Figures: "Opponent tagged 59% of trials"; "Opponent tagged 17% of trials"]
43
Performance on well-known POMDPs
Maze33 (|S|=36, |A|=5, |Ω|=17), methods QMDP / Grid / PBUA / PBVI:
  Reward: 0.198 / 0.94 / 2.30 / 2.25
  Time (s): 0.19 / n.v. / 12166 / 3448
  |B|: n.a. / 174 / 660 / 470
  %Goal: 47 / n.v. / 100 / 95
Hallway (|S|=60, |A|=5, |Ω|=20):
  Reward: 0.261 / n.v. / 0.53
  Time (s): 0.51 / n.v. / 450 / 288
  |B|: n.a. / 300 / 86
  %Goal: 22 / 98 / 100
Hallway2 (|S|=92, |A|=5, |Ω|=17):
  Reward: 0.109 / n.v. / 0.35 / 0.34
  Time (s): 1.44 / n.v. / 27898 / 360
  |B|: n.a. / 337 / 1840 / 95
44
Back to the full algorithm
Main idea:
- Select a small set of belief points (PART II)
- Plan for those belief points only (PART I)
[Figure: V(b) over b = P(s1); belief points b0, b1, b2 linked by (a,o) transitions]
45
Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs.
46
Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs.
[Figure: a belief b and its one-step successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} along the P(s1) axis]
47
Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs.
What can we learn from MDP exploration techniques? Select widely-spaced beliefs, rather than near-by beliefs.
[Figure: a belief b and its one-step successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} along the P(s1) axis]
48
Selecting good belief points
What can we learn from policy search methods? Focus on reachable beliefs.
What can we learn from MDP exploration techniques? Select widely-spaced beliefs, rather than near-by beliefs.
[Figure: a belief b and its one-step successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} along the P(s1) axis]
49
How does PBVI actually select belief points?
Start with B ← {b0}.
For any belief point b ∈ B:
[Figure: the belief b on the P(s1) axis]
50
How does PBVI actually select belief points?
Start with B ← {b0}.
For any belief point b ∈ B:
  For each action a ∈ A: generate a new belief b^a by applying a and stochastically picking an observation o.
[Figure: b and b^{a1} (reached via a1, o2) on the P(s1) axis]
51
How does PBVI actually select belief points?
Start with B ← {b0}.
For any belief point b ∈ B:
  For each action a ∈ A: generate a new belief b^a by applying a and stochastically picking an observation o.
[Figure: b, b^{a1} (via a1, o2) and b^{a2} (via a2, o2) on the P(s1) axis]
52
How does PBVI actually select belief points?
Start with B ← {b0}.
For any belief point b ∈ B:
  For each action a ∈ A: generate a new belief b^a by applying a and stochastically picking an observation o.
Add to B the belief b^a which is farthest away from any b' ∈ B:
  B ← B ∪ argmax_{b^a} [ min_{b'∈B} Σ_s |b^a(s) - b'(s)| ]
[Figure: b, b^{a1} (via a1, o2) and b^{a2} (via a2, o2) on the P(s1) axis]
53
How does PBVI actually select belief points?
Start with B ← {b0}.
For any belief point b ∈ B:
  For each action a ∈ A: generate a new belief b^a by applying a and stochastically picking an observation o.
Add to B the belief b^a which is farthest away from any b' ∈ B:
  B ← B ∪ argmax_{b^a} [ min_{b'∈B} Σ_s |b^a(s) - b'(s)| ]
Repeat until |B| reaches the desired set size.
[Figure: b, b^{a1} (via a1, o2) and b^{a2} (via a2, o2) on the P(s1) axis]
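A sketch of this expansion step, reusing the earlier array layout and belief_update (the stochastic simulation of o and the L1 farthest-point rule follow the procedure above; the names are assumptions):

    def expand_belief_set(pomdp, B, rng=None):
        """For each b in B, add the simulated successor belief farthest (L1) from the current set."""
        rng = rng or np.random.default_rng()
        new_points = []
        for b in B:
            candidates = []
            for a in range(pomdp.n_actions):
                predicted = pomdp.T[a].T @ b           # P(s' | b, a)
                p_obs = pomdp.O[a].T @ predicted       # P(o | b, a), one entry per observation
                o = rng.choice(pomdp.n_obs, p=p_obs / p_obs.sum())
                candidates.append(belief_update(pomdp, b, a, o))
            # Keep the candidate b^a farthest (L1 distance) from every belief already in B.
            farthest = max(candidates, key=lambda ba: min(np.abs(ba - bp).sum() for bp in B))
            new_points.append(farthest)
        return B + new_points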
54
The anytime PBVI algorithm
Alternate between:
- Growing the set of belief points (e.g., B doubles in size every time)
- Planning for those belief points
Terminate when you run out of time or have a good policy.
55
The anytime PBVI algorithm
Alternate between:
- Growing the set of belief points (e.g., B doubles in size every time)
- Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results: 13 phases, |B|=1334, ran out of time!
56
The anytime PBVI algorithm
Alternate between:
- Growing the set of belief points (e.g., B doubles in size every time)
- Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results: 13 phases, |B|=1334, ran out of time!
Hallway2 results: 8 phases, |B|=95, found a good policy.
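The anytime outer loop is short; this sketch ties together the earlier backup and expansion sketches (phase counts and the value-function initialization are simplified assumptions):

    def anytime_pbvi(pomdp, n_phases=8, backups_per_phase=10, gamma=0.95):
        """Alternate between value updates on the current belief set B and expansion of B."""
        B = [pomdp.b0.copy()]
        alphas = [pomdp.R.min() / (1.0 - gamma) * np.ones(pomdp.n_states)]   # pessimistic initial vector
        for _ in range(n_phases):
            for _ in range(backups_per_phase):                               # plan for the current B
                alphas = [point_based_backup(pomdp, b, alphas, gamma) for b in B]
            B = expand_belief_set(pomdp, B)                                  # then roughly double B
        return alphas, B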
57
Alternative belief expansion heuristics
Compare 4 approaches to belief expansion:
- Random (RA)
[Figure: the P(s1) belief axis]
58
Alternative belief expansion heuristics
Compare 4 approaches to belief expansion:
- Random (RA)
- Stochastic Simulation with Random Action (SSRA)
[Figure: b with simulated successors b^{a1,o2} and b^{a2,o2} on the P(s1) axis]
59
Alternative belief expansion heuristics
Compare 4 approaches to belief expansion:
- Random (RA)
- Stochastic Simulation with Random Action (SSRA)
- Stochastic Simulation with Greedy Action (SSGA)
[Figure: the greedy action π(b) = a2 leads from b to b^{a2,o2} on the P(s1) axis]
60
Validation of the belief expansion heuristic
Compare 4 approaches to belief expansion:
- Random (RA)
- Stochastic Simulation with Random Action (SSRA)
- Stochastic Simulation with Greedy Action (SSGA)
- Stochastic Simulation with Explorative Action (SSEA): max_a ||b - b^a||
[Figure: b with simulated successors b^{a1,o2} and b^{a2,o2} on the P(s1) axis]
61
Validation of the belief expansion heuristic
Hallway domain: |S|=60, |A|=5, |Ω|=20
62
Validation of the belief expansion heuristic
Hallway2 domain: |S|=92, |A|=5, |Ω|=17
63
Validation of the belief expansion heuristic
Tag domain: |S|=870, |A|=5, |Ω|=30
64
Summary
POMDPs suffer from the curse of history: the # of beliefs grows exponentially with the planning horizon.
PBVI addresses the curse of history by limiting planning to a small set of likely beliefs.
Strengths of PBVI include:
- anytime algorithm
- polynomial-time value updates
- bounded approximation error
- empirical results showing we can solve problems up to ~1000 states