Optimal Fixed-Size Controllers for Decentralized POMDPs
Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein
Department of Computer Science, University of Massachusetts Amherst
May 9, 2006

Overview
- DEC-POMDPs and their solutions
- Fixing memory with controllers
- Previous approaches
- Representing the optimal controller
- Some experimental results

DEC-POMDPs
- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward
[Figure: each agent i sends an action a_i to the shared environment and receives a local observation o_i; both agents receive the joint reward r.]

DEC-POMDP definition
A two-agent DEC-POMDP can be defined with the tuple M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩:
- S, a finite set of states with designated initial state distribution b0
- A1 and A2, each agent's finite set of actions
- P, the state transition model: P(s' | s, a1, a2)
- R, the reward model: R(s, a1, a2)
- Ω1 and Ω2, each agent's finite set of observations
- O, the observation model: O(o1, o2 | s', a1, a2)
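To make the pieces concrete, here is a minimal sketch of how this tuple could be held in code. The class name, array layouts, and the randomly generated toy instance are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP <S, A1, A2, P, R, Omega1, Omega2, O> stored as arrays."""
    b0: np.ndarray     # initial state distribution,       shape (|S|,)
    P: np.ndarray      # P(s' | s, a1, a2),                shape (|S|, |A1|, |A2|, |S|)
    R: np.ndarray      # R(s, a1, a2),                     shape (|S|, |A1|, |A2|)
    O: np.ndarray      # O(o1, o2 | s', a1, a2),           shape (|S|, |A1|, |A2|, |O1|, |O2|)
    gamma: float = 0.9

# Illustrative random toy instance: 2 states, 2 actions and 2 observations per agent.
S, A, Z = 2, 2, 2
rng = np.random.default_rng(0)
P = rng.random((S, A, A, S)); P /= P.sum(axis=-1, keepdims=True)
O = rng.random((S, A, A, Z, Z)); O /= O.sum(axis=(-2, -1), keepdims=True)
R = rng.random((S, A, A))
toy = DecPOMDP(b0=np.array([1.0, 0.0]), P=P, R=R, O=O)
```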

DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω_i* → A_i, allowing distributed execution
- A joint policy is a policy for each agent
- The goal is to maximize expected discounted reward over an infinite horizon
- A discount factor, γ, is used to calculate this
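As an illustration of distributed execution, here is a small sketch that simulates one run and accumulates the discounted reward. It assumes the array layouts from the sketch above; the two policies are arbitrary callables over local observation histories, introduced purely for illustration.

```python
import numpy as np

def rollout(P, R, O, b0, gamma, policy1, policy2, T=200, seed=0):
    """Estimate the discounted return of one distributed execution.

    policy1 / policy2 map an agent's own observation history (a tuple of
    observation indices) to an action index; each agent acts only on its
    local history, never on the state or the other agent's observations.
    """
    rng = np.random.default_rng(seed)
    s = rng.choice(len(b0), p=b0)
    h1, h2, ret = (), (), 0.0
    for t in range(T):
        a1, a2 = policy1(h1), policy2(h2)
        ret += (gamma ** t) * R[s, a1, a2]
        s = rng.choice(P.shape[-1], p=P[s, a1, a2])
        # joint observation drawn from O(o1, o2 | s', a1, a2)
        joint = O[s, a1, a2].ravel()
        idx = rng.choice(joint.size, p=joint)
        o1, o2 = np.unravel_index(idx, O.shape[3:])
        h1, h2 = h1 + (int(o1),), h2 + (int(o2),)
    return ret

# Averaging rollout(...) over many seeds estimates a joint policy's value.
```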

Example: Grid World
- States: grid cell pairs
- Actions: move up, move down, move left, move right, or stay
- Transitions: noisy
- Observations: the red lines in the figure
- Goal: share the same square

Previous work
- Optimal algorithms
  - Very large space and time requirements
  - Can only solve small problems
- Approximation algorithms
  - Provide weak optimality guarantees, if any

Policies as controllers
- Finite-state controller for each agent i
- Fixed memory
- Randomness used to offset memory limitations
- Action selection, ψ : Q_i → ΔA_i
- Node transitions, η : Q_i × A_i × Ω_i → ΔQ_i
- The value of a pair of nodes at a state is given by the Bellman equation:

V(q1, q2, s) = Σ_{a1, a2} P(a1|q1) P(a2|q2) [ R(s, a1, a2) + γ Σ_{s'} P(s'|s, a1, a2) Σ_{o1, o2} O(o1, o2|s', a1, a2) Σ_{q1', q2'} P(q1'|q1, a1, o1) P(q2'|q2, a2, o2) V(q1', q2', s') ]

where the subscripts denote the agents and lowercase letters are elements of the corresponding uppercase sets above.
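A sketch of how a pair of controllers could be evaluated by iterating this Bellman equation to a fixed point; the array layouts (psi[i][q, a] for ψ and eta[i][q, a, o, q'] for η) are assumptions carried over from the earlier sketches, not the authors' code.

```python
import numpy as np

def evaluate_controllers(P, R, O, gamma, psi, eta, tol=1e-8):
    """Evaluate a pair of stochastic controllers by iterating the Bellman equation.

    psi[i][q, a]        -- P(a | q)         action selection of agent i
    eta[i][q, a, o, q'] -- P(q' | q, a, o)  node transitions of agent i
    Returns V[q1, q2, s], the value when agent 1 starts in node q1,
    agent 2 starts in node q2, and the state is s.
    """
    Q1, Q2, S = psi[0].shape[0], psi[1].shape[0], R.shape[0]
    A1, A2, Z1, Z2 = psi[0].shape[1], psi[1].shape[1], O.shape[3], O.shape[4]
    V = np.zeros((Q1, Q2, S))
    while True:
        Vnew = np.zeros_like(V)
        for q1, q2, s in np.ndindex(Q1, Q2, S):
            v = 0.0
            for a1, a2 in np.ndindex(A1, A2):
                pa = psi[0][q1, a1] * psi[1][q2, a2]
                if pa == 0.0:
                    continue
                nxt = 0.0
                for s2, o1, o2 in np.ndindex(S, Z1, Z2):
                    w = P[s, a1, a2, s2] * O[s2, a1, a2, o1, o2]
                    if w > 0.0:
                        # expectation over both agents' successor nodes
                        nxt += w * (eta[0][q1, a1, o1] @ V[:, :, s2] @ eta[1][q2, a2, o2])
                v += pa * (R[s, a1, a2] + gamma * nxt)
            Vnew[q1, q2, s] = v
        if np.max(np.abs(Vnew - V)) < tol:
            return Vnew
        V = Vnew
```

With designated start nodes and the initial state distribution b0, the controllers' value is then b0 @ V[q1_start, q2_start, :].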

Controller example
- Stochastic controller for a single agent
- 2 nodes, 2 actions, 2 observations
- Parameters: P(a|q) and P(q'|q,a,o)
[Figure: a two-node stochastic controller diagram with action and transition probabilities.]

Optimal controllers
- How do we set the parameters of the controllers?
- Deterministic controllers: traditional methods such as best-first search (Szer and Charpillet 05)
- Stochastic controllers: continuous optimization

Decentralized BPI
- Decentralized Bounded Policy Iteration (DEC-BPI) (Bernstein, Hansen and Zilberstein 05)
- Alternates between improvement and evaluation until convergence
- Improvement: for each node of each agent's controller, find a probability distribution over actions and node transitions whose one-step lookahead value is greater than the current node's value for all states and all nodes of the other agents' controllers
- Evaluation: finds the values of all nodes in all states

DEC-BPI: Linear program
For a given node q_i:
- Variables: ε and P(a_i, q_i' | q_i, o_i)
- Objective: maximize ε (the improvement)
- Improvement constraints: for all s ∈ S and q_{-i} ∈ Q_{-i}, the one-step lookahead value of the new parameters must exceed the node's current value by at least ε
- Probability constraints: for all a_i ∈ A_i; also, all probabilities must sum to 1 and be nonnegative
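A rough sketch of how this improvement step could be posed with scipy.optimize.linprog. Following the usual BPI construction, it uses the combined variables x(a1) = P(a1|q1) and x(a1, o1, q1') = P(a1|q1) P(q1'|q1, a1, o1) so the one-step lookahead stays linear; the function name, array layouts, and two-agent specialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dec_bpi_improve_node(q1, V, psi2, eta2, P, R, O, gamma):
    """One DEC-BPI-style improvement LP for node q1 of agent 1.

    V[q1, q2, s] are the current joint-node values, psi2/eta2 the other
    agent's (fixed) controller, and P, R, O the DEC-POMDP model arrays.
    """
    Q1, Q2, S = V.shape
    A1, A2 = R.shape[1], R.shape[2]
    Z1, Z2 = O.shape[3], O.shape[4]
    var_a = lambda a1: 1 + a1                                   # x(a1)
    var_aoq = lambda a1, o1, q1p: 1 + A1 + (a1 * Z1 + o1) * Q1 + q1p   # x(a1,o1,q1')
    n = 1 + A1 + A1 * Z1 * Q1

    c = np.zeros(n); c[0] = -1.0            # maximize eps (linprog minimizes)

    # Improvement constraints: new lookahead value >= V[q1,q2,s] + eps for all s, q2.
    A_ub, b_ub = [], []
    for q2 in range(Q2):
        for s in range(S):
            row = np.zeros(n); row[0] = 1.0
            for a1 in range(A1):
                for a2 in range(A2):
                    p2 = psi2[q2, a2]
                    if p2 == 0.0:
                        continue
                    row[var_a(a1)] -= p2 * R[s, a1, a2]
                    for s2 in range(S):
                        for o1 in range(Z1):
                            for o2 in range(Z2):
                                w = p2 * P[s, a1, a2, s2] * O[s2, a1, a2, o1, o2]
                                if w == 0.0:
                                    continue
                                for q1p in range(Q1):
                                    ev = eta2[q2, a2, o2] @ V[q1p, :, s2]
                                    row[var_aoq(a1, o1, q1p)] -= gamma * w * ev
            A_ub.append(row); b_ub.append(-V[q1, q2, s])

    # Consistency: sum_a x(a) = 1 and sum_{q1'} x(a,o,q1') = x(a) for each (a, o).
    A_eq, b_eq = [], []
    row = np.zeros(n); row[1:1 + A1] = 1.0
    A_eq.append(row); b_eq.append(1.0)
    for a1 in range(A1):
        for o1 in range(Z1):
            row = np.zeros(n); row[var_a(a1)] = -1.0
            for q1p in range(Q1):
                row[var_aoq(a1, o1, q1p)] = 1.0
            A_eq.append(row); b_eq.append(0.0)

    bounds = [(None, None)] + [(0.0, 1.0)] * (n - 1)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=bounds, method="highs")
    eps, x = res.x[0], res.x
    new_psi = np.array([x[var_a(a1)] for a1 in range(A1)])
    new_eta = np.array([[[x[var_aoq(a1, o1, q1p)] / max(new_psi[a1], 1e-12)
                          for q1p in range(Q1)] for o1 in range(Z1)] for a1 in range(A1)])
    return eps, new_psi, new_eta
```

If the returned ε is positive, node q1's parameters can be replaced by the new distributions; otherwise the node is left unchanged.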

Problems with DEC-BPI
- Difficult to improve value for all states and all of the other agents' controllers
- May require more nodes for a given start state
- Linear program (one-step lookahead) results in local optimality
- A correlation device can somewhat improve performance

Optimal controllers
- Use nonlinear programming (NLP)
- Consider node values as variables
- Improvement and evaluation all in one step
- Add constraints to maintain valid values

NLP intuition
- The value variable allows improvement and evaluation at the same time (infinite lookahead)
- While the iterative process of DEC-BPI can "get stuck" in a local optimum, the NLP defines the globally optimal solution

NLP representation
- Variables: the controller parameters and a value variable for each joint node and state
- Objective: maximize the value of the initial nodes under the initial state distribution
- Constraints: for all s ∈ S and all joint nodes q ∈ Q, the Bellman equation must hold
- Linear constraints are needed to ensure the controllers are independent
- Also, all probabilities must sum to 1 and be nonnegative
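For intuition, here is a hedged sketch of the same idea posed directly over ψ, η, and the value variables, with SciPy's SLSQP standing in as a generic local solver. It assumes both agents share the same action and observation sets and that node 0 of each controller is the start node; the formulation in the paper uses a different variable encoding (hence the independence constraints noted above), so this is a simplified stand-in rather than the exact program.

```python
import numpy as np
from scipy.optimize import minimize

def controller_nlp(P, R, O, b0, gamma, Q=2, seed=0):
    """Sketch of the fixed-size controller NLP: both agents' parameters and
    the joint-node values V[q1, q2, s] are decision variables, the Bellman
    equation is imposed as an equality constraint, and the objective is the
    value of the start nodes under b0."""
    S, A, Z = R.shape[0], R.shape[1], O.shape[3]       # assumes A1 = A2, Z1 = Z2
    n_psi, n_eta, n_v = 2 * Q * A, 2 * Q * A * Z * Q, Q * Q * S

    def unpack(x):
        psi = x[:n_psi].reshape(2, Q, A)               # P(a | q) per agent
        eta = x[n_psi:n_psi + n_eta].reshape(2, Q, A, Z, Q)   # P(q' | q, a, o)
        V = x[n_psi + n_eta:].reshape(Q, Q, S)
        return psi, eta, V

    def backup(psi, eta, V):
        """Right-hand side of the Bellman equation at every (q1, q2, s)."""
        T = np.zeros((Q, Q, S))
        for q1, q2, s in np.ndindex(Q, Q, S):
            v = 0.0
            for a1, a2 in np.ndindex(A, A):
                nxt = 0.0
                for s2, o1, o2 in np.ndindex(S, Z, Z):
                    w = P[s, a1, a2, s2] * O[s2, a1, a2, o1, o2]
                    nxt += w * (eta[0, q1, a1, o1] @ V[:, :, s2] @ eta[1, q2, a2, o2])
                v += psi[0, q1, a1] * psi[1, q2, a2] * (R[s, a1, a2] + gamma * nxt)
            T[q1, q2, s] = v
        return T

    objective = lambda x: -float(b0 @ unpack(x)[2][0, 0, :])
    bellman = lambda x: (backup(*unpack(x)) - unpack(x)[2]).ravel()
    simplex = lambda x: np.concatenate([unpack(x)[0].sum(-1).ravel() - 1.0,
                                        unpack(x)[1].sum(-1).ravel() - 1.0])

    x0 = np.random.default_rng(seed).random(n_psi + n_eta + n_v)
    bounds = [(0.0, 1.0)] * (n_psi + n_eta) + [(None, None)] * n_v
    cons = [{"type": "eq", "fun": bellman}, {"type": "eq", "fun": simplex}]
    res = minimize(objective, x0, method="SLSQP", bounds=bounds,
                   constraints=cons, options={"maxiter": 300})
    psi, eta, V = unpack(res.x)
    return psi, eta, V, -res.fun
```

Starting such a local solver from several random controllers, as in the experiments below, yields locally optimal solutions of the program.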

Optimality
Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.

Pros and cons of the NLP
- Pros
  - Retains fixed memory and an efficient policy representation
  - Represents the optimal policy for the given size
  - Takes advantage of a known start state
- Cons
  - Difficult to solve optimally

Experiments
- Nonlinear programming solvers (SNOPT and filter): sequential quadratic programming (SQP)
  - Guarantee a locally optimal solution
  - Run on the NEOS server
- 10 random initial controllers for a range of sizes
- Compared the NLP with DEC-BPI, with and without a small correlation device

U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science 20 Two agents share a broadcast channel (4 states, 5 obs, 2 acts) Very simple near-optimal policy mean quality of the NLP and DEC-BPI implementations Results: Broadcast Channel

Results: Recycling Robots
- Recycling robot domain (4 states, 2 obs, 3 acts)
[Figure: mean quality of the NLP and DEC-BPI implementations on the recycling robot domain.]

Results: Grid World
- Meeting in a grid (16 states, 2 obs, 5 acts)
[Figure: mean quality of the NLP and DEC-BPI implementations on the meeting-in-a-grid domain.]

Results: Running time
- Running time mostly comparable to DEC-BPI with a correlation device
- The increase as controller size grows is offset by better performance
[Table: running times on the Broadcast, Recycling, and Grid domains.]

Conclusion
- Defined the optimal fixed-size stochastic controller using an NLP
- Showed consistent improvement over DEC-BPI with locally optimal solvers
- In general, the NLP may allow small optimal controllers to be found
- It may also provide concise near-optimal approximations of large controllers

Future Work
- Explore more efficient NLP formulations
- Investigate more specialized solution techniques for the NLP formulation
- Greater experimentation and comparison with other methods