Dynamic Programming for Partially Observable Stochastic Games Daniel S. Bernstein University of Massachusetts Amherst in collaboration with Christopher Amato, Eric A. Hansen, Shlomo Zilberstein June 23, 2004

Extending the MDP Framework The MDP framework can be extended to incorporate partial observability and multiple agents Can we still do dynamic programming? – Lots of work on the single-agent case (POMDP) Sondik 78, Cassandra et al. 97, Hansen 98 – Some work on the multi-agent case, but limited theoretical guarantees Varaiya & Walrand 78, Nair et al. 03

Our contribution We extend DP to the multi-agent case For cooperative agents (DEC-POMDP): – First optimal DP algorithm For noncooperative agents: – First DP algorithm for iterated elimination of dominated strategies Unifies ideas from game theory and partially observable MDPs

Game Theory Normal form game Only one decision to make – no dynamics A mixed strategy is a distribution over strategies
         b1      b2
  a1    3, 3    0, 4
  a2    4, 0    1, 1

Solving Games One approach to solving games is iterated elimination of dominated strategies Roughly speaking, this removes all unreasonable strategies Unfortunately, can’t always prune down to a single strategy per player

Dominance A strategy is dominated if, for every joint distribution over the other players' strategies, some other strategy is at least as good The dominance test can be done using linear programming [Figure: a 3 x 2 game in which strategy a3 is dominated]
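As a concrete illustration, here is a minimal sketch of that LP test in Python (using NumPy and SciPy; the function name and interface are our own, not from the talk): a row strategy is dominated if some mixture of the other rows does at least as well against every column of the payoff matrix.

```python
import numpy as np
from scipy.optimize import linprog

def is_dominated(U, i, strict=False, tol=1e-9):
    """Is row strategy i dominated by a mixture of the other rows of U?

    U[j, b] is the row player's payoff for its strategy j against the
    opponent's strategy b.  Solve:
        max eps  s.t.  sum_j x_j * U[j, b] >= U[i, b] + eps  for every column b,
                       x a probability distribution over the other rows.
    """
    others = [j for j in range(U.shape[0]) if j != i]
    n, m = len(others), U.shape[1]
    c = np.zeros(n + 1)
    c[-1] = -1.0                                 # maximize eps == minimize -eps
    # For each column b:  -sum_j x_j * U[j, b] + eps <= -U[i, b]
    A_ub = np.hstack([-U[others].T, np.ones((m, 1))])
    b_ub = -U[i]
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # mixture sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    eps = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                  b_eq=b_eq, bounds=bounds).x[-1]
    return eps > tol if strict else eps >= -tol

# Row player's own payoffs from the Game Theory slide above:
U_row = np.array([[3.0, 0.0],    # a1
                  [4.0, 1.0]])   # a2
print(is_dominated(U_row, 0))    # True: a1 is dominated by a2
```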

Dynamic Programming for POMDPs We'll start with some important concepts: the policy tree, the linear value function, and the belief state [Figure: an example policy tree over actions a1–a3 branching on observations o1, o2; a linear value function over states s1, s2, s3; a belief state as a distribution over states]
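To make these concepts concrete, here is a small sketch of our own (assuming a tabular POMDP stored in arrays R, P, O): a policy tree, and the linear value function over hidden states that it induces; its value at a belief state b is simply the dot product of b with that vector.

```python
from dataclasses import dataclass, field
from typing import Dict
import numpy as np

@dataclass
class PolicyTree:
    action: int
    children: Dict[int, "PolicyTree"] = field(default_factory=dict)  # observation -> subtree

def tree_value(tree, R, P, O, gamma=1.0):
    """Value vector of a policy tree, one entry per hidden state.

    R[a, s]      immediate reward for action a in state s
    P[a, s, s']  transition probability
    O[a, s', o]  probability of observing o after action a lands in state s'
    """
    v = np.array(R[tree.action], dtype=float)
    for o, sub in tree.children.items():
        v_sub = tree_value(sub, R, P, O, gamma)
        # expected (discounted) value of the subtree followed after observing o
        v += gamma * P[tree.action] @ (O[tree.action, :, o] * v_sub)
    return v

# Value of a tree at a belief state b (a distribution over hidden states):
#   b @ tree_value(tree, R, P, O)
```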

Dynamic Programming
[Figures: starting from the one-step trees for actions a1 and a2 in a two-state POMDP (states s1, s2), an exhaustive backup produces all eight two-step policy trees; trees whose value functions are dominated everywhere on the belief simplex over s1 and s2 are then pruned, leaving four]

Properties of Dynamic Programming After T steps, the best policy tree for s0 is contained in the set of trees that remain The pruning test is exactly the same as in elimination of dominated strategies in normal form games
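The backup step behind the preceding figures can be sketched as follows (our own code, reusing the PolicyTree class above): every depth-(t+1) tree is built by picking a root action and assigning one of the current depth-t trees to each observation.

```python
from itertools import product

def exhaustive_backup(trees, actions, observations):
    """One DP step before pruning: all |A| * |trees|^|O| deeper policy trees."""
    return [PolicyTree(a, dict(zip(observations, subs)))
            for a in actions
            for subs in product(trees, repeat=len(observations))]
```

Pruning can then reuse the same LP as on the Dominance slide, with one column per hidden state (U[p, s] = tree_value of tree p at state s), which is exactly the point made above.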

Partially Observable Stochastic Game Multiple agents control a Markov process Each can have a different observation and reward function [Figure: agents 1 and 2 send actions a1, a2 to the world and receive observations and rewards (o1, r1) and (o2, r2)]

POSG – Formal Definition A POSG is a tuple ⟨S, A1, A2, P, R1, R2, Ω1, Ω2, O⟩, where – S is a finite state set, with initial state s0 – A1, A2 are finite action sets – P(s, a1, a2, s′) is the state transition function – R1(s, a1, a2) and R2(s, a1, a2) are reward functions – Ω1, Ω2 are finite observation sets – O(s, o1, o2) is the observation function Straightforward generalization to n agents
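For reference, the tuple above could be held in a small tabular container like this (a sketch with our own naming; integer indices stand in for the finite sets, and the observation function is kept exactly as written on the slide):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POSG:
    """Two-agent POSG in tabular form."""
    P: np.ndarray    # P[s, a1, a2, s']   state transition probabilities
    R1: np.ndarray   # R1[s, a1, a2]      agent 1's reward
    R2: np.ndarray   # R2[s, a1, a2]      agent 2's reward
    O: np.ndarray    # O[s, o1, o2]       joint observation probabilities
    s0: int = 0      # initial state
```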

POSG – More Definitions A local policy is a mapping δi : Ωi* → Ai A joint policy is a pair ⟨δ1, δ2⟩ Each agent wants to maximize its own expected reward over T steps Although execution is distributed, planning is centralized

Strategy Elimination in POSGs Could simply convert to normal form But the number of strategies is doubly exponential in the horizon length [Figure: the induced normal-form payoff matrix, with one row per strategy of agent 1, one column per strategy of agent 2, and both agents' rewards in each entry]
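The doubly exponential blow-up is easy to check with a quick count (a small illustrative calculation of ours): a deterministic depth-T policy tree has 1 + |Ω| + ... + |Ω|^(T-1) nodes, and each node independently picks one of |A| actions.

```python
def num_policy_trees(n_actions, n_obs, horizon):
    """Number of deterministic depth-`horizon` policy trees for one agent."""
    nodes = sum(n_obs ** t for t in range(horizon))  # 1 + |O| + ... + |O|^(T-1)
    return n_actions ** nodes

for T in range(1, 6):                    # 2 actions, 2 observations
    print(T, num_policy_trees(2, 2, T))  # 2, 8, 128, 32768, 2147483648
```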

A Better Way to Do Elimination We use dynamic programming to eliminate dominated strategies without first converting to normal form Pruning a subtree eliminates the set of trees containing it [Figure: pruning one two-step subtree eliminates every taller policy tree that contains it]

Generalizing Dynamic Programming Build policy trees as in the single-agent case The pruning rule is a natural generalization:
                        What to prune    Space for pruning
  Normal form game      strategy         Δ(strategies of other agents)
  POMDP                 policy tree      Δ(states)
  POSG                  policy tree      Δ(states × policy trees of other agents)
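A sketch of the POSG row of that table in code (our own simplification, reusing is_dominated from the Dominance slide): an agent's tree is pruned iff it is dominated over the space of distributions on (state, other-agent tree) pairs, i.e. the same LP with one column per such pair. In the full algorithm the two agents alternate pruning until neither set changes, so the other-tree dimension shrinks as the other agent prunes.

```python
def prune_agent(value, tol=1e-9):
    """Prune one agent's dominated policy trees.

    value[p, s, q] is this agent's expected T-step value when it follows its
    tree p, the hidden state is s, and the other agent follows its tree q.
    A tree is kept iff no mixture of the other surviving trees is at least as
    good for every (s, q) pair, i.e. everywhere in the generalized belief space.
    """
    keep = list(range(value.shape[0]))
    changed = True
    while changed and len(keep) > 1:
        changed = False
        # one 'column' per (state, other-agent tree) pair; rows = surviving trees
        U = value.reshape(value.shape[0], -1)[keep]
        for row, p in enumerate(keep):
            if is_dominated(U, row, tol=tol):
                keep.remove(p)
                changed = True
                break            # re-test against the reduced set
    return keep
```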

Dynamic Programming
[Figures: in the multiagent DP iteration, each agent's one-step trees for actions a1 and a2 are backed up to all two-step policy trees, and dominated trees are then pruned alternately from the two agents' sets until no more can be removed]

Correctness of Dynamic Programming Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG. Corollary: DP can be used to find an optimal joint policy in a cooperative POSG.

Dynamic Programming in Practice Initial empirical results show that much pruning is possible Can solve problems with small state sets And we can import ideas from the POMDP literature to scale up to larger problems Boutilier & Poole 96, Hauskrecht 00, Feng & Hansen 00, Hansen & Zhou 03, Theocharous & Kaelbling 03

Conclusion First exact DP algorithm for POSGs Natural combination of two ideas – Iterated elimination of dominated strategies – Dynamic programming for POMDPs Initial experiments on small problems, ideas for scaling to larger problems