Solving POMDPs Using Quadratically Constrained Linear Programs. Christopher Amato, University of Massachusetts Amherst, Department of Computer Science.

Presentation transcript:

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science. Solving POMDPs Using Quadratically Constrained Linear Programs. Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein. University of Massachusetts Amherst, January 6, 2006.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 2 Overview: POMDPs and their solutions; fixing memory with controllers; previous approaches; representing the optimal controller; some experimental results.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 3 POMDPs. Partially observable Markov decision process (POMDP): the agent interacts with the environment, making sequential decisions under uncertainty. At each stage it receives an observation rather than the actual state, along with an immediate reward. [Figure: agent-environment loop, with action a sent to the environment and observation o and reward r returned.]

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 4 POMDP definition. A POMDP can be defined with the following tuple: M = ⟨S, A, P, R, Ω, O⟩, where
S is a finite set of states with designated initial state distribution b0,
A is a finite set of actions,
P is the state transition model: P(s'|s, a),
R is the reward model: R(s, a),
Ω is a finite set of observations, and
O is the observation model: O(o|s', a).
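To make the tuple concrete, here is a minimal sketch in Python (numpy) of a toy two-state POMDP in this form; the numbers and array layout are invented for illustration and do not come from the talk. It also shows that at each step the agent observes only (o, r), never the underlying state.

import numpy as np

# Illustrative toy POMDP matching the tuple <S, A, P, R, Omega, O> above.
# All numbers are made up for illustration.
S, A, Omega = 2, 2, 2                      # |S| states, |A| actions, |Omega| observations
b0 = np.array([0.5, 0.5])                  # initial state distribution
P = np.zeros((A, S, S))                    # P[a, s, s2] = P(s2 | s, a)
P[0] = [[0.9, 0.1], [0.1, 0.9]]
P[1] = [[0.5, 0.5], [0.5, 0.5]]
R = np.array([[1.0, 0.0], [0.0, 1.0]])     # R[s, a]
O = np.zeros((A, S, Omega))                # O[a, s2, o] = O(o | s2, a)
O[0] = [[0.8, 0.2], [0.2, 0.8]]
O[1] = [[0.8, 0.2], [0.2, 0.8]]

rng = np.random.default_rng(0)
s = rng.choice(S, p=b0)                    # the true state is hidden from the agent
a = rng.integers(A)                        # some action
s_next = rng.choice(S, p=P[a, s])
o = rng.choice(Omega, p=O[a, s_next])      # the agent sees only (o, r), never s
r = R[s, a]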

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 5 POMDP solutions. A policy is a mapping π : Ω* → A, from observation histories to actions. The goal is to maximize the expected discounted reward over an infinite horizon, E[ Σ_{t=0..∞} γ^t R(s_t, a_t) ], using a discount factor γ.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 6 Example POMDP: Hallway. Minimize the number of steps to the starred square for a given start state distribution. States: grid cells with orientation. Actions: turn left, turn right, turn around, move forward, stay. Transitions: noisy. Observations: the red lines marked in the figure. Goal: the starred square.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 7 Previous work. Optimal algorithms: large space requirement; can only solve small problems. Approximation algorithms: provide weak optimality guarantees, if any.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 8 Policies as controllers. Fixed memory; randomness is used to offset memory limitations. Action selection: ψ : Q → ΔA. Node transitions: η : Q × A × Ω → ΔQ. The value of node q in state s is given by the Bellman equation:
V(q,s) = Σ_a P(a|q) [ R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_o O(o|s',a) Σ_{q'} P(q'|q,a,o) V(q',s') ]
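Since the controller parameters are fixed, the Bellman equation above is a linear system in the values V(q,s). The following is a minimal Python sketch of that evaluation step; the array shapes and the function name evaluate_controller are assumptions of this sketch, not notation from the talk.

import numpy as np

def evaluate_controller(psi, eta, P, R, O, gamma):
    """Solve the Bellman equation above for V(q, s) with fixed controller parameters.
    Assumed shapes:
      psi[q, a]        = P(a | q)
      eta[q, a, o, q2] = P(q2 | q, a, o)
      P[a, s, s2]      = P(s2 | s, a)
      R[s, a]          = R(s, a)
      O[a, s2, o]      = O(o | s2, a)
    """
    Q, A = psi.shape
    S = R.shape[0]
    n = Q * S
    M = np.zeros((n, n))          # linear system M x = b, with x = V flattened over (q, s)
    b = np.zeros(n)
    for q in range(Q):
        for s in range(S):
            i = q * S + s
            M[i, i] = 1.0
            b[i] = sum(psi[q, a] * R[s, a] for a in range(A))
            for a in range(A):
                for s2 in range(S):
                    for o in range(O.shape[2]):
                        for q2 in range(Q):
                            coeff = gamma * psi[q, a] * P[a, s, s2] * O[a, s2, o] * eta[q, a, o, q2]
                            M[i, q2 * S + s2] -= coeff
    return np.linalg.solve(M, b).reshape(Q, S)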

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 9 Controller example. A stochastic controller with 2 nodes, 2 actions, and 2 observations. Parameters: action probabilities P(a|q) and node transition probabilities P(q'|q,a,o). [Figure: two controller nodes with action labels a1, a2 and transitions labeled by observations o1, o2.]
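For concreteness, a controller of this shape could be encoded as arrays compatible with the evaluation sketch above. The probability values below are invented, since the figure's numbers are not recoverable from the transcript.

import numpy as np

# A 2-node, 2-action, 2-observation stochastic controller (hypothetical values).
psi = np.array([[0.7, 0.3],        # node q1 (index 0): P(a1|q1)=0.7, P(a2|q1)=0.3
                [0.0, 1.0]])       # node q2 (index 1): always a2
eta = np.zeros((2, 2, 2, 2))       # eta[q, a, o, q2] = P(q2 | q, a, o)
eta[0, :, 0, :] = [1.0, 0.0]       # from q1, observation o1 -> stay in q1
eta[0, :, 1, :] = [0.0, 1.0]       # from q1, observation o2 -> move to q2
eta[1, :, :, :] = [0.0, 1.0]       # q2 is absorbing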

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 10 Optimal controllers. How do we set the parameters of the controller? Deterministic controllers: traditional methods such as branch and bound (Meuleau et al. 99). Stochastic controllers: continuous optimization.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 11 Gradient ascent. Gradient ascent (GA), Meuleau et al. 99. Create a cross-product MDP from the POMDP and the controller; matrix operations then allow a gradient to be calculated.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 12 Problems with GA: incomplete gradient calculation; computationally challenging; locally optimal.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 13 BPI. Bounded Policy Iteration (BPI), Poupart & Boutilier 03. Alternates between improvement and evaluation until convergence. Improvement: for each node, find a probability distribution over one-step lookahead values that is greater than the current node's value for all states. Evaluation: find the values of all nodes in all states.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 14 BPI linear program. For a given node q.
Variables: x(a) = P(a|q), x(q',a,o) = P(q',a|q,o).
Objective: maximize ε (the improvement).
Improvement constraints, ∀s ∈ S:
V(q,s) + ε ≤ Σ_a x(a) R(s,a) + γ Σ_a Σ_{s'} P(s'|s,a) Σ_o O(o|s',a) Σ_{q'} x(q',a,o) V(q',s')
Probability constraints: Σ_a x(a) = 1 and, ∀a ∈ A, o ∈ Ω, Σ_{q'} x(q',a,o) = x(a); also, all probabilities must be nonnegative.
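As a rough sketch of how the improvement LP above might be assembled for one node with an off-the-shelf solver, the following uses scipy.optimize.linprog. The variable packing, helper index functions, and the name bpi_improvement_lp are assumptions of this sketch, not BPI's original implementation; array shapes follow the earlier sketches.

import numpy as np
from scipy.optimize import linprog

def bpi_improvement_lp(q, V, P, R, O, gamma):
    """Assemble and solve the improvement LP for node q.  V[q, s] are the current
    node values, as returned by evaluate_controller above."""
    Qn, S = V.shape
    A = R.shape[1]
    Obs = O.shape[2]
    nx = A + Qn * A * Obs                        # x(a) entries, then x(q', a, o) entries
    def xi(a):          return 1 + a             # position of x(a) in the variable vector
    def xqi(q2, a, o):  return 1 + A + (q2 * A + a) * Obs + o

    c = np.zeros(1 + nx); c[0] = -1.0            # maximize eps == minimize -eps
    A_ub = np.zeros((S, 1 + nx)); b_ub = np.zeros(S)
    for s in range(S):                           # eps - sum_a R x(a) - gamma * (...) <= -V(q,s)
        A_ub[s, 0] = 1.0
        b_ub[s] = -V[q, s]
        for a in range(A):
            A_ub[s, xi(a)] -= R[s, a]
            for o in range(Obs):
                for q2 in range(Qn):
                    w = gamma * sum(P[a, s, s2] * O[a, s2, o] * V[q2, s2] for s2 in range(S))
                    A_ub[s, xqi(q2, a, o)] -= w

    A_eq = np.zeros((1 + A * Obs, 1 + nx)); b_eq = np.zeros(1 + A * Obs)
    A_eq[0, [xi(a) for a in range(A)]] = 1.0     # sum_a x(a) = 1
    b_eq[0] = 1.0
    row = 1
    for a in range(A):
        for o in range(Obs):
            A_eq[row, xi(a)] = -1.0              # sum_{q'} x(q', a, o) = x(a)
            for q2 in range(Qn):
                A_eq[row, xqi(q2, a, o)] = 1.0
            row += 1

    bounds = [(None, None)] + [(0, None)] * nx   # eps is free, probabilities >= 0
    return linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

If the resulting ε is positive, node q's parameters can be replaced by the x values found, followed by re-evaluation, which is the alternation described on the previous slide.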

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 15 Problems with BPI: difficult to improve the value for all states; may require more nodes for a given start state; the linear program (one-step lookahead) results in local optimality; must add nodes when stuck.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 16 QCLP optimization. Quadratically constrained linear program (QCLP): consider the node value as a variable; improvement and evaluation happen all in one step; add constraints to maintain valid values.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 17 QCLP intuition. The value variable allows improvement and evaluation at the same time (infinite lookahead). While the iterative process of BPI can “get stuck”, the QCLP provides the globally optimal solution.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 18 QCLP representation. Variables: x(q',a,q,o) = P(q',a|q,o) and y(q,s) = V(q,s). Objective: maximize the value. Constraints: value constraints ∀s ∈ S, q ∈ Q; probability constraints ∀q ∈ Q, a ∈ A, o ∈ Ω; also, all probabilities must sum to 1 and be nonnegative.
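The objective and constraint equations appear as images in the original slide and do not survive in the transcript. The following reconstruction is a sketch assembled from the variables above and the controller Bellman equation on slide 8, so the exact indexing should be read as my transcription rather than a verbatim copy of the slide:

maximize    Σ_s b0(s) y(q0, s)

subject to, ∀q ∈ Q, s ∈ S (value constraints):
    y(q,s) = Σ_a [ (Σ_{q'} x(q',a,q,o1)) R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_o O(o|s',a) Σ_{q'} x(q',a,q,o) y(q',s') ]

∀q ∈ Q, o ∈ Ω:              Σ_{q',a} x(q',a,q,o) = 1
∀q ∈ Q, a ∈ A, o, ō ∈ Ω:    Σ_{q'} x(q',a,q,o) = Σ_{q'} x(q',a,q,ō)
∀q', a, q, o:               x(q',a,q,o) ≥ 0

Here q0 is the controller's initial node and o1 is an arbitrary fixed observation; the second family of probability constraints makes Σ_{q'} x(q',a,q,o) independent of o, so it can serve as P(a|q) in the reward term. The products of x and y variables in the value constraints are what make the program quadratically constrained rather than linear.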

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 19 Optimality. Theorem: an optimal solution of the QCLP results in an optimal stochastic controller for the given size and initial state distribution.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 20 Pros and cons of the QCLP. Pros: retains fixed memory and an efficient policy representation; represents the optimal policy for a given size; takes advantage of a known start state. Cons: difficult to solve optimally.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 21 Experiments. Nonlinear programming algorithm (SNOPT), a sequential quadratic programming (SQP) solver that guarantees a locally optimal solution, run on the NEOS server. 10 random initial controllers for a range of sizes. Compare the QCLP with BPI.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 22 Results. [Figure: (a) best and (b) mean results of the QCLP and BPI on the hallway domain (57 states, 21 observations, 5 actions).]

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 23 Results. [Figure: (a) best and (b) mean results of the QCLP and BPI on the machine maintenance domain (256 states, 16 observations, 4 actions).]

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 24 Results. Computation time is comparable to BPI; the increase as controller size grows is offset by better performance. [Table: computation times on the Hallway and Machine domains.]

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 25 Conclusion. Introduced a new fixed-size optimal representation. Showed consistent improvement over BPI with a locally optimal solver. In general, the QCLP may allow small optimal controllers to be found; it may also provide concise near-optimal approximations of large controllers.

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science 26 Future work. Investigate more specialized solution techniques for the QCLP formulation. Greater experimentation and comparison with other methods. Extension to the multiagent case.