1 Fast approximate POMDP planning: Overcoming the curse of history!
Point-based value iteration: an anytime algorithm for POMDPs
Joelle Pineau, Geoff Gordon and Sebastian Thrun, CMU
Workshop on Advances in Machine Learning - June, 2003

2 Why use a POMDP?
POMDPs provide a rich framework for sequential decision-making, which can model:
– varying rewards across actions and goals
– actions with random effects
– uncertainty in the state of the world

3 Existing applications of POMDPs
– Maintenance scheduling » Puterman, 1994
– Robot navigation » Koenig & Simmons, 1995; Roy & Thrun, 1999
– Helicopter control » Bagnell & Schneider, 2001; Ng et al., 2002
– Dialogue modeling » Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000
– Preference elicitation » Boutilier, 2002

4 POMDP Model
A POMDP is the tuple { S, A, Ω, T, O, R }:
S = state set
A = action set
Ω = observation set
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
What goes on: the hidden states s_{t-1}, s_t and the actions a_{t-1}, a_t
What we see: the observations o_{t-1}, o_t
What we infer: the beliefs b_{t-1}, b_t
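
For concreteness, the model components above can be packaged as plain arrays. This is an illustrative sketch only; the class name, array layout and discount field are my assumptions, not something defined on the slides:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """Container for the { S, A, Omega, T, O, R } tuple (illustrative sketch)."""
    T: np.ndarray        # T[a, s, s'] = P(s' | s, a), shape (|A|, |S|, |S|)
    O: np.ndarray        # O[a, s', o] = P(o | s', a), shape (|A|, |S|, |Omega|)
    R: np.ndarray        # R[a, s]     = immediate reward, shape (|A|, |S|)
    gamma: float = 0.95  # discount factor (assumed value)

    @property
    def n_states(self) -> int:
        return self.T.shape[1]

    @property
    def n_actions(self) -> int:
        return self.T.shape[0]

    @property
    def n_obs(self) -> int:
        return self.O.shape[2]
```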

5 Understanding the belief state
A belief is a probability distribution over states, so Dim(B) = |S|-1.
– E.g. let S = {s_1, s_2}: a belief is a single number P(s_1) in [0, 1].

6 Understanding the belief state
A belief is a probability distribution over states, so Dim(B) = |S|-1.
– E.g. let S = {s_1, s_2, s_3}: a belief is a point in the 2-dimensional simplex over (P(s_1), P(s_2)).

7 Understanding the belief state
A belief is a probability distribution over states, so Dim(B) = |S|-1.
– E.g. let S = {s_1, s_2, s_3, s_4}: a belief is a point in the 3-dimensional simplex over (P(s_1), P(s_2), P(s_3)).
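
The belief is maintained by Bayesian filtering. A minimal sketch of the update b → b^{a,o} (using the illustrative `POMDP` container above; the function name is an assumption):

```python
import numpy as np

def belief_update(model: POMDP, b: np.ndarray, a: int, o: int) -> np.ndarray:
    """Compute b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') b(s), then normalize."""
    predicted = model.T[a].T @ b            # sum_s T(s,a,s') b(s), shape (|S|,)
    unnormalized = model.O[a, :, o] * predicted
    norm = unnormalized.sum()
    if norm == 0.0:                         # observation impossible under this belief
        raise ValueError("Observation has zero probability under belief b")
    return unnormalized / norm
```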

8 The first curse of POMDP planning
The curse of dimensionality:
– the dimension of the planning problem = # of states
– related to the MDP curse of dimensionality

9 POMDP value functions
V(b) = expected total discounted future reward starting from b.
Represent V as the upper surface of a set of hyper-planes: V is piecewise-linear and convex.
Backup operator T: V → TV.
[Figure: V(b) plotted over P(s_1), as the upper surface of the hyper-planes]
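
Evaluating such a value function at a belief is just a max over dot products with the hyper-planes (α-vectors). A small sketch, with `Gamma` holding one hyper-plane per row (names are illustrative):

```python
import numpy as np

def value(Gamma: np.ndarray, b: np.ndarray) -> float:
    """V(b) = max over alpha of alpha . b, where Gamma has shape (n_vectors, |S|)."""
    return float(np.max(Gamma @ b))

def best_alpha(Gamma: np.ndarray, b: np.ndarray) -> int:
    """Index of the hyper-plane that attains the max at b."""
    return int(np.argmax(Gamma @ b))
```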

10 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration  # hyper-planes
0          1

11 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration  # hyper-planes
0          1
1          3

12 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration  # hyper-planes
0          1
1          3
2          27

13 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration  # hyper-planes
0          1
1          3
2          27
3          2,187

14 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration  # hyper-planes
0          1
1          3
2          27
3          2,187
4          14,348,907
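
These counts follow the standard recurrence for an unpruned exact backup, |Γ_{n+1}| = |A| · |Γ_n|^|Ω|. A quick sanity check (the function name is illustrative):

```python
def unpruned_growth(n_actions: int, n_obs: int, horizon: int) -> list[int]:
    """Number of hyper-planes per iteration for an unpruned exact backup."""
    counts = [1]                                  # V_0 represented by a single vector
    for _ in range(horizon):
        counts.append(n_actions * counts[-1] ** n_obs)
    return counts

print(unpruned_growth(n_actions=3, n_obs=2, horizon=4))
# [1, 3, 27, 2187, 14348907]
```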

15 Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
→ Many hyper-planes can be pruned away.
Iteration  # hyper-planes
0          1
1          3
2          5
3          9
4          7
5          13
10         27
15         47
20         59

16 Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8
Iteration  # hyper-planes
0          1
1          5
2          213
3          ?????
… Not for this problem!

17 Certainly not for this problem!
|S|=576, |A|=19, |Ω|=17
State features: { RobotLocation, ReminderGoal, UserLocation, UserMotionGoal, UserStatus, UserSpeechGoal }
[Figure: map of the environment, labeling the physiotherapy room, the patient room, and the robot home]

18 The second curse of POMDP planning
The curse of dimensionality:
– the dimension of each hyper-plane = # of states
The curse of history:
– the number of hyper-planes grows exponentially with the planning horizon

19 The second curse of POMDP planning
The curse of dimensionality:
– the dimension of each hyper-plane = # of states
The curse of history:
– the number of hyper-planes grows exponentially with the planning horizon
Complexity of one exact value-iteration backup: roughly O(S²·A·Ω·Γ_n + S·A·Γ_n^Ω), where the S² factor reflects the curse of dimensionality and the Γ_n^Ω factor reflects the curse of history.

20 Possible approximation approaches
Ignore the belief: [figure: states s_0, s_1, s_2]
– overcomes both curses; very fast
– performs poorly in high-entropy beliefs
[Littman et al., 1995]
Discretize the belief:
– overcomes the curse of history (sort of)
– scales exponentially with # states
[Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]
Compress the belief:
– overcomes the curse of dimensionality
[Poupart & Boutilier, 2002; Roy & Gordon, 2002]
Plan for trajectories:
– can diminish both curses
– requires a restricted policy class
– local minima, small gradients
[Baxter & Bartlett, 2000; Ng & Jordan, 2002]

21 A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
[Figure: belief points b_0, b_1, b_2 on the P(s_1) axis under V(b)]

22 A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
– Plan for those belief points only

23 A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points → Focus on reachable beliefs
– Plan for those belief points only
[Figure: new beliefs reached from b_0, b_1, b_2 by taking an action a and receiving an observation o]

24 A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points → Focus on reachable beliefs
– Plan for those belief points only → Learn the value and its gradient

25 Point-based value update
[Figure: value function over P(s_1) with belief points b_0, b_1, b_2]

26 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
[Figure: V_n(b) over P(s_1) with belief points b_0, b_1, b_2]

27 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
[Figure: V_n(b) with a single belief point b highlighted]

28 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
[Figure: b and its projections b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

29 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
[Figure: the best hyper-planes α_b^{a1,o1}, α_b^{a2,o1}, α_b^{a2,o2}, α_b^{a1,o2} at the projected beliefs]

30 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
– Sum over observations:
[Figure: projected beliefs and their best hyper-planes α_b^{a1,o1}, α_b^{a2,o1}, α_b^{a2,o2}, α_b^{a1,o2}]

31 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
– Sum over observations:
[Figure: the hyper-planes α_b^{a1,o1}, α_b^{a2,o1}, α_b^{a2,o2}, α_b^{a1,o2} evaluated at b]

32 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
– Sum over observations:
[Figure: the per-action hyper-planes α_b^{a1}, α_b^{a2} and the new value V_{n+1}(b)]

33 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
– Sum over observations:
– Max over actions:
[Figure: the per-action hyper-planes α_b^{a1}, α_b^{a2} and V_{n+1}(b)]

34 Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find the best value:
– Sum over observations:
– Max over actions:
[Figure: the updated value function V_{n+1} at b_0, b_1, b_2]
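
Putting the three steps (project forward, sum over observations, max over actions) together, here is a minimal sketch of one point-based backup at a belief b. It assumes the illustrative `POMDP` container from earlier and is a simplified rendering of the procedure on these slides, not the authors' code:

```python
import numpy as np

def point_based_backup(model: POMDP, Gamma: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One point-based backup at belief b; returns the new hyper-plane for b.

    Gamma: current set of hyper-planes, shape (n_vectors, |S|).
    """
    best_value, best_vector = -np.inf, None
    for a in range(model.n_actions):
        # Start from the immediate reward for action a.
        alpha_a = model.R[a].astype(float)
        for o in range(model.n_obs):
            # Project every hyper-plane through (a, o):
            # alpha_ao(s) = gamma * sum_{s'} O(s',a,o) * alpha(s') * T(s,a,s')
            projected = model.gamma * (Gamma * model.O[a, :, o]) @ model.T[a].T
            # Keep the projection that is best at b, and sum over observations.
            alpha_a += projected[np.argmax(projected @ b)]
        # Max over actions.
        value_a = alpha_a @ b
        if value_a > best_value:
            best_value, best_vector = value_a, alpha_a
    return best_vector
```

A full point-based value update simply applies this backup to every b ∈ B and keeps the resulting vectors as the next solution set, so the number of hyper-planes never exceeds |B|.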

35 Complexity of value update

                  Exact update      Point-based update
I - Projection    S² A Ω Γ_n        S² A Ω B
II - Sum          S A Γ_n^Ω         S A Ω B²
III - Max         S A Γ_{n+1}       S A B

where: S = # states, A = # actions, Ω = # observations, Γ_n = # solution vectors at iteration n, B = # belief points

36 A bound on the approximation error
We can bound the error of the point-based backup operator; the bound depends on how densely we sample belief points.
– Let Δ̄ be the set of reachable beliefs.
– Let B be the set of belief points.
Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm ε_n = ||V_n^B - V_n^*|| is bounded by:
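
As stated in the PBVI paper (Pineau, Gordon & Thrun, 2003), the bound is:

$$
\epsilon_n = \left\lVert V_n^B - V_n^* \right\rVert_\infty
\;\le\; \frac{(R_{\max} - R_{\min})\,\epsilon_B}{(1-\gamma)^2},
\qquad
\epsilon_B = \max_{b' \in \bar{\Delta}} \; \min_{b \in B} \lVert b - b' \rVert_1 ,
$$

where γ is the discount factor. The more densely B covers the reachable set Δ̄ (the smaller ε_B), the tighter the bound.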

37 Experimental results: Lasertag domain
State space = RobotPosition × OpponentPosition
Observable: RobotPosition - always; OpponentPosition - only when it coincides with the robot's position
Action space = {North, South, East, West, Tag}
Opponent strategy: move away from the robot with Pr = 0.8
|S|=870, |A|=5, |Ω|=30

38 Performance of PBVI on Lasertag domain
[Figure: comparison of two policies; one tags the opponent in 70% of trials, the other in 17% of trials]

39 Performance on well-known POMDPs

Maze33 (|S|=36, |A|=5, |Ω|=17):
Method  Reward  Time(s)  B
QMDP    0.198   0.19     -
Grid    0.94    n.v.     174
PBUA    2.30    12166    660
PBVI    2.25    3448     470

Hallway (|S|=60, |A|=5, |Ω|=20):
Method  %Goal  Reward  Time(s)  B
QMDP    47     0.261   0.51     -
Grid    n.v.   n.v.    n.v.     n.v.
PBUA    100    0.53    450      300
PBVI    95     0.53    288      86

Hallway2 (|S|=92, |A|=5, |Ω|=17):
Method  %Goal  Reward  Time(s)  B
QMDP    22     0.109   1.44     -
Grid    98     n.v.    n.v.     337
PBUA    100    0.35    27898    1840
PBVI    98     0.34    360      95

40 Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
[Figure: from belief b, the reachable successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

41 Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
How can we avoid including all reachable beliefs?
– Reachability analysis considers all actions, but picks the observation stochastically.
[Figure: from b, only one successor per action is kept, e.g. b^{a1,o2} and b^{a2,o1}]

42 Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
How can we avoid including all reachable beliefs?
– Reachability analysis considers all actions, but picks the observation stochastically.
What can we learn from our error bound?
– Select widely-spaced beliefs, rather than nearby beliefs. (A sketch of this expansion heuristic follows below.)
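
A minimal sketch of this belief expansion heuristic, assuming the `POMDP` container and `belief_update` above. The forward-simulation details (how the state and observation are sampled) are my assumptions and are not spelled out on the slides:

```python
import numpy as np

def expand_beliefs(model: POMDP, B: list[np.ndarray], rng=np.random) -> list[np.ndarray]:
    """Grow B greedily: for each b, simulate one successor per action and keep
    the successor that is farthest (L1 distance) from the current belief set."""
    new_points = []
    for b in B:
        candidates = []
        for a in range(model.n_actions):
            s = rng.choice(model.n_states, p=b)                  # sample a state from b
            s_next = rng.choice(model.n_states, p=model.T[a, s]) # sample a successor state
            o = rng.choice(model.n_obs, p=model.O[a, s_next])    # sample an observation
            candidates.append(belief_update(model, b, a, o))
        # Distance from each candidate to the nearest belief already kept.
        dist = [min(np.abs(c - bb).sum() for bb in B + new_points) for c in candidates]
        new_points.append(candidates[int(np.argmax(dist))])
    return B + new_points    # B roughly doubles in size
```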

43 Validation of the belief expansion heuristic
Hallway domain: |S|=60, |A|=5, |Ω|=20

44 Validation of the belief expansion heuristic
Tag domain: |S|=870, |A|=5, |Ω|=30

45 The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size each time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.

46 The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size each time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results:
– 13 phases: |B|=1334
– ran out of time!

47 The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size each time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results:
– 13 phases: |B|=1334
– ran out of time!
Hallway2 results:
– 8 phases: |B|=95
– found a good policy. (A sketch of the full anytime loop follows below.)
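
Tying the earlier sketches together, the anytime loop might look like the following. This is again only a sketch under the assumptions above; the zero initialization, the stopping rule, and the number of backup sweeps per phase are illustrative choices, not taken from the slides:

```python
import numpy as np

def anytime_pbvi(model: POMDP, b0: np.ndarray, n_phases: int = 8, sweeps_per_phase: int = 10):
    """Alternate belief-set expansion with point-based value-iteration sweeps."""
    B = [b0]                                   # start from the initial belief
    Gamma = np.zeros((1, model.n_states))      # trivial initial value function (illustrative)
    for _ in range(n_phases):
        for _ in range(sweeps_per_phase):      # plan for the current belief set
            Gamma = np.unique(
                np.array([point_based_backup(model, Gamma, b) for b in B]), axis=0)
        B = expand_beliefs(model, B)           # roughly double the belief set
    return Gamma, B
```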

48 Summary
POMDPs suffer from the curse of history:
» the # of beliefs grows exponentially with the planning horizon.
PBVI addresses the curse of history by limiting planning to a small set of likely beliefs.
Strengths of PBVI include:
» anytime algorithm;
» polynomial-time value updates;
» bounded approximation error;
» empirical results showing we can solve problems of up to 870 states.

49 Recent work
Current hurdle to solving even larger POMDPs: PBVI complexity is O(S²AΩB + SAΩB²).
– Addressing S²:
» Combine PBVI with belief compression techniques.
But sparse transition matrices mean: S² → S.
– Addressing B²:
» Use ball-trees to structure the belief points.
» Find better belief selection heuristics.

