Approximate POMDP planning: Overcoming the curse of history!


Approximate POMDP planning: Overcoming the curse of history!
Presented by: Joelle Pineau
Joint work with: Geoff Gordon and Sebastian Thrun
Machine Learning Lunch - March 10, 2003

To use or not to use a POMDP
POMDPs provide a rich framework for sequential decision-making, which can model:
- varying rewards across actions and goals
- uncertainty in the action effects
- uncertainty in the state of the world

Existing applications of POMDPs
- Maintenance scheduling [Puterman, 1994]
- Robot navigation [Koenig & Simmons, 1995; Roy & Thrun, 1999]
- Helicopter control [Bagnell & Schneider, 2001; Ng et al., 2002]
- Dialogue modeling [Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000]
- Preference elicitation [Boutilier, 2002]

Graphical Model Representation
A POMDP is an n-tuple { S, A, Ω, b, T, O, R }:
- S = state set
- A = action set
- Ω = observation set
- b(s) = initial belief
- T(s,a,s') = state-to-state transition probabilities
- O(s,a,o) = observation generation probabilities
- R(s,a) = reward function
(figure, built up over several slides: what goes on: hidden states st-1, st with rewards rt-1, rt; what we see: actions at-1, at and observations ot-1, ot; what we infer: beliefs bt-1, bt, from which the policy π(b) picks actions)
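
The inferred belief bt is maintained by Bayesian filtering on top of the model above. A minimal numpy sketch of that update; the array layout T[a][s, s'] and O[a][s', o] is an assumed convention, not something specified on the slides:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes-filter update of the belief after taking action a and observing o.

    b: belief over states, shape (S,)
    T: T[a][s, s'] = P(s' | s, a)
    O: O[a][s', o] = P(o | a, s')  (one common reading of the slide's O(s,a,o))
    """
    b_next = O[a][:, o] * (b @ T[a])   # P(o|a,s') * sum_s b(s) T(s,a,s')
    return b_next / b_next.sum()       # normalize; assumes P(o|b,a) > 0
```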

Understanding the belief state
A belief is a probability distribution over states, so Dim(B) = |S| - 1.
E.g.:
- S = {s1, s2}: beliefs lie on the line segment 0 ≤ P(s1) ≤ 1
- S = {s1, s2, s3}: beliefs lie in a triangle over (P(s1), P(s2))
- S = {s1, s2, s3, s4}: beliefs lie in a tetrahedron over (P(s1), P(s2), P(s3))

The first curse of POMDP planning
The curse of dimensionality:
- dimension of the belief = # of states
- dimension of the planning problem = # of states
- related to the MDP curse of dimensionality

Planning for POMDPs
- Learning a value function V(b), ∀b ∈ B
- Learning an action-selection policy π(b), ∀b ∈ B
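
In the point-based methods discussed below, both objects are represented by a set Γ of hyper-planes (the alpha-vectors of the next slides), each tagged with an action. A small sketch under that assumption, with a hypothetical two-state example:

```python
import numpy as np

def value_and_policy(b, Gamma):
    """Evaluate V(b) and pi(b) from a set of (alpha-vector, action) pairs.

    V(b) is the upper envelope of the hyper-planes: V(b) = max_alpha alpha . b,
    and pi(b) is the action attached to the maximizing hyper-plane.
    """
    best_alpha, best_action = max(Gamma, key=lambda av: av[0] @ b)
    return best_alpha @ b, best_action

# Hypothetical two-state example: two hyper-planes over b = (P(s1), P(s2)).
Gamma = [(np.array([1.0, 0.0]), 0), (np.array([0.2, 0.8]), 1)]
print(value_and_policy(np.array([0.9, 0.1]), Gamma))   # -> value 0.9, action 0
```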

Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2

Iteration    # hyper-planes
0            1
1            3
2            27
3            2187
4            14,348,907

(figure: the hyper-planes of Vn(b) plotted over b = P(s1))
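
As a sanity check on these counts (not a formula shown on the slides): if no hyper-plane is dominated, one exact backup turns |Γn| hyper-planes into |A|·|Γn|^|Ω|, which reproduces the table for |A|=3, |Ω|=2:

```python
# Growth of the hyper-plane count under exact value iteration without pruning:
# |Gamma_{n+1}| = |A| * |Gamma_n| ** |Omega|
A, Omega = 3, 2      # the "simple problem" above
count = 1            # iteration 0 starts from a single hyper-plane
for n in range(5):
    print(n, count)  # prints 1, 3, 27, 2187, 14348907
    count = A * count ** Omega
```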

Properties of exact value iteration
- The value function is always piecewise-linear and convex.
- Many hyper-planes can be pruned away.
|S|=2, |A|=3, |Ω|=2

Iteration    # hyper-planes (after pruning)
0            1
1            3
2            5
3            9
4            7
5            13
10           27
15           47
20           59
…

Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8

Iteration    # hyper-planes
0            1
1            5
2            213
3            ?????
…

Not for this problem!

The second curse of POMDP planning
- The curse of dimensionality: the dimension of each hyper-plane = # of states.
- The curse of history: the number of hyper-planes grows exponentially with the planning horizon.
The complexity of POMDP value iteration combines both: a dimensionality term that grows with the number of states, multiplied by a history term that is exponential in the planning horizon.

Possible approximation approaches
- Ignore the belief: overcomes both curses; very fast; performs poorly in high-entropy beliefs. [Littman et al., 1995]
- Discretize the belief: overcomes the curse of history (sort of); scales exponentially with # states. [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]
- Compress the belief: overcomes the curse of dimensionality. [Poupart & Boutilier, 2002; Roy & Gordon, 2002]
- Plan for trajectories: can diminish both curses; requires a restricted policy class; local minima, slow-changing gradients. [Baxter & Bartlett, 2000; Ng & Jordan, 2002]

A new algorithm: Point-based value iteration
Main idea:
- Select a small set of belief points → focus on reachable beliefs.
- Plan for those belief points only → learn the value and its gradient.
(figure: V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) arrows showing reachability)

Point-based value update
Initialize the value function (…and skip ahead a few iterations).
For each b ∈ B:
- For each (a,o): project forward b → b^{a,o} and find the best value at the projected belief.
- Sum over observations: this yields one candidate vector per action (b^{a1} and b^{a2} in the figure).
- Max over actions: the best candidate gives the new value Vn+1(b).
(figure: Vn(b) over P(s1), showing b and its projections b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2})
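
A minimal numpy sketch of this backup, mirroring the three steps above; the array conventions (T[a][s, s'], O[a][s', o], R[a][s]) and the overall function shape are assumptions, not code from the talk:

```python
import numpy as np

def point_based_backup(B, Gamma, T, O, R, gamma=0.95):
    """One point-based value update over a finite belief set B.

    B:     list of beliefs, each an array of shape (S,)
    Gamma: list of (alpha, action) pairs from iteration n
    T:     T[a][s, s'] transition probabilities
    O:     O[a][s', o] observation probabilities
    R:     R[a][s]     expected immediate reward
    """
    A, num_obs = len(T), O[0].shape[1]

    # Step 1 -- projection: g_{a,o}(s) = gamma * sum_{s'} T(s,a,s') O(s',a,o) alpha(s')
    proj = {(a, o): [gamma * (T[a] @ (O[a][:, o] * alpha)) for alpha, _ in Gamma]
            for a in range(A) for o in range(num_obs)}

    new_Gamma = []
    for b in B:
        best = None
        for a in range(A):
            # Step 2 -- sum over observations: keep the best projection for each o at this b
            alpha_ab = R[a] + sum(max(proj[(a, o)], key=lambda g: g @ b)
                                  for o in range(num_obs))
            # Step 3 -- max over actions
            if best is None or alpha_ab @ b > best[0] @ b:
                best = (alpha_ab, a)
        new_Gamma.append(best)
    return new_Gamma
```

Note that every maximization only scans candidates generated for the points in B, which is what keeps the update polynomial (next slide).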

Complexity of value update

                   Exact update      Point-based update
I   - Projection   S² A Ω Γn         S² A Ω B
II  - Sum          S² A Γn^Ω         S A Ω B²
III - Max          S A Γn^Ω          S A B

where: S = # states, A = # actions, Ω = # observations, Γn = # solution vectors at iteration n, B = # belief points.

Theoretical properties of point-based updates
Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm, εn = ||Vn^B − Vn^*||∞, is bounded by:
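
For reference, the bound as stated in the accompanying PBVI paper (Pineau, Gordon & Thrun, 2003) is

$$\varepsilon_n \le \frac{(R_{\max} - R_{\min})\,\delta_B}{(1-\gamma)^2}, \qquad \delta_B = \max_{b' \in \bar{\Delta}} \min_{b \in B} \lVert b - b' \rVert_1,$$

where δB is the density of the belief set B over the reachable beliefs and γ is the discount factor.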

Back to the full algorithm
Main idea:
- Select a small set of belief points → PART II
- Plan for those belief points only → PART I

Experimental results: Lasertag domain
- State space = RobotPosition × OpponentPosition
- Observable: RobotPosition (always); OpponentPosition (only if same as Robot)
- Action space = {North, South, East, West, Tag}
- Opponent strategy: move away from robot w/ Pr=0.8
- |S|=870, |A|=5, |Ω|=30

Performance of PBVI on Lasertag domain
(figure: opponent tagged 59% of trials vs. opponent tagged 17% of trials)

Performance on well-known POMDPs
Methods compared: QMDP, Grid, PBUA, PBVI.

Maze33 (|S|=36, |A|=5, |Ω|=17)
  Reward:   0.198 / 0.94 / 2.30 / 2.25
  Time (s): 0.19 / n.v. / 12166 / 3448
  |B|:      n.a. / 174 / 660 / 470
  %Goal:    47 / n.v. / 100 / 95

Hallway (|S|=60, |A|=5, |Ω|=20)
  Reward:   0.261 / n.v. / 0.53
  Time (s): 0.51 / n.v. / 450 / 288
  |B|:      n.a. / 300 / 86
  %Goal:    22 / 98 / 100

Hallway2 (|S|=92, |A|=5, |Ω|=17)
  Reward:   0.109 / n.v. / 0.35 / 0.34
  Time (s): 1.44 / n.v. / 27898 / 360
  |B|:      n.a. / 337 / 1840 / 95

Back to the full algorithm
Main idea:
- Select a small set of belief points → PART II
- Plan for those belief points only → PART I

Selecting good belief points
- What can we learn from policy search methods? Focus on reachable beliefs.
- What can we learn from MDP exploration techniques? Select widely-spaced beliefs, rather than near-by beliefs.
(figure: from b, the beliefs b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} reachable in one (a,o) step, over P(s1))

How does PBVI actually select belief points?
- Start with B ← {b0}.
- For any belief point b ∈ B:
  - For each action a ∈ A: generate a new belief b^a by applying a and stochastically picking an observation o.
- Add to B the belief b^a which is farthest away from any b' ∈ B:
  B ← argmax_{b^a} [ min_{b'∈B} Σs |b^a(s) − b'(s)| ]
- Repeat until |B| = set size.
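
A sketch of one such expansion step, reusing the belief_update filter sketched earlier; the forward-simulation details (sampling a state from b, then a successor state and an observation) and adding one farthest candidate per existing belief (so that B roughly doubles, as on the next slide) are one reading of the bullets above, not code from the talk:

```python
import numpy as np

def expand_belief_set(B, T, O, rng, num_obs):
    """One PBVI-style expansion step over the current belief set B.

    For every b in B, simulate each action once, then keep the resulting
    belief that is farthest (L1 distance) from everything already in B.
    T[a][s, s'] and O[a][s', o] are assumed array layouts.
    """
    new_points = []
    for b in B:
        candidates = []
        for a in range(len(T)):
            s = rng.choice(len(b), p=b)                    # sample a state from b
            s_next = rng.choice(T[a].shape[1], p=T[a][s])  # sample a successor state
            o = rng.choice(num_obs, p=O[a][s_next])        # sample an observation
            candidates.append(belief_update(b, a, o, T, O))
        # keep the candidate farthest from every belief already in B
        dist_to_B = lambda ba: min(np.abs(ba - bp).sum() for bp in B)
        new_points.append(max(candidates, key=dist_to_B))
    return B + new_points
```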

The anytime PBVI algorithm
Alternate between:
- growing the set of belief points (e.g. B doubles in size every time);
- planning for those belief points.
Terminate when you run out of time or have a good policy.
- Lasertag results: 13 phases, |B|=1334 → ran out of time!
- Hallway2 results: 8 phases, |B|=95 → found a good policy.
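
Putting the pieces together, a hypothetical driver for the anytime loop above, built on the earlier sketches (point_based_backup, expand_belief_set); the parameter names and the number of backups per phase are assumptions:

```python
import time

def anytime_pbvi(b0, T, O, R, rng, num_obs, backups_per_phase=10, time_budget_s=3600.0):
    """Anytime PBVI: alternate planning on the current belief set with growing it."""
    start = time.time()
    B = [b0]
    Gamma = [(R[a].copy(), a) for a in range(len(T))]   # crude initial value function
    while time.time() - start < time_budget_s:          # ...or stop once the policy is good enough
        for _ in range(backups_per_phase):              # plan for the current belief points
            Gamma = point_based_backup(B, Gamma, T, O, R)
        B = expand_belief_set(B, T, O, rng, num_obs)    # grow B (roughly doubles)
    return B, Gamma
```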

Alternative belief expansion heuristics
Compare 4 approaches to belief expansion:
- Random (RA)
- Stochastic Simulation with Random Action (SSRA)
- Stochastic Simulation with Greedy Action (SSGA), i.e. simulate π(b)
- Stochastic Simulation with Explorative Action (SSEA), i.e. pick max_a ||b − b^a||
(figure: candidate beliefs b^{a,o} reached from b, over P(s1))
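
A rough sketch of how each of the four heuristics could generate a candidate belief, again reusing belief_update; the specific choices here (uniform Dirichlet sampling for RA, ε-greedy action selection for SSGA with a hypothetical ε) are assumptions, not definitions given in the talk:

```python
import numpy as np

def propose_candidate(b, heuristic, T, O, Gamma, rng, num_obs, eps=0.1):
    """Generate one candidate belief under RA / SSRA / SSGA / SSEA."""
    A = len(T)
    if heuristic == "RA":                       # a random point on the belief simplex
        return rng.dirichlet(np.ones(len(b)))
    if heuristic == "SSRA":                     # simulate a single random action
        actions = [rng.integers(A)]
    elif heuristic == "SSGA":                   # mostly follow the current greedy policy
        greedy = max(Gamma, key=lambda av: av[0] @ b)[1]
        actions = [rng.integers(A) if rng.random() < eps else greedy]
    else:                                       # SSEA: try every action, keep the farthest
        actions = range(A)
    candidates = []
    for a in actions:
        s = rng.choice(len(b), p=b)                     # sample a state from b
        s_next = rng.choice(T[a].shape[1], p=T[a][s])   # sample a successor state
        o = rng.choice(num_obs, p=O[a][s_next])         # sample an observation
        candidates.append(belief_update(b, a, o, T, O))
    return max(candidates, key=lambda ba: np.abs(ba - b).sum())
```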

Validation of the belief expansion heuristic
Hallway domain: |S|=60, |A|=5, |Ω|=20

Validation of the belief expansion heuristic
Hallway2 domain: |S|=92, |A|=5, |Ω|=17

Validation of the belief expansion heuristic
Tag domain: |S|=870, |A|=5, |Ω|=30

Summary
- POMDPs suffer from the curse of history: the # of beliefs grows exponentially with the planning horizon.
- PBVI addresses the curse of history by limiting planning to a small set of likely beliefs.
- Strengths of PBVI include: anytime algorithm; polynomial-time value updates; bounded approximation error; empirical results showing we can solve problems up to ~1000 states.