1
Dynamic Programming for Partially Observable Stochastic Games
Daniel S. Bernstein, University of Massachusetts Amherst
In collaboration with Christopher Amato, Eric A. Hansen, and Shlomo Zilberstein
June 23, 2004
2
Extending the MDP Framework
The MDP framework can be extended to incorporate partial observability and multiple agents. Can we still do dynamic programming?
– Lots of work on the single-agent case (POMDP): Sondik 78; Cassandra et al. 97; Hansen 98
– Some work on the multi-agent case, but with limited theoretical guarantees: Varaiya & Walrand 78; Nair et al. 03
3
Our Contribution
We extend DP to the multi-agent case.
– For cooperative agents (DEC-POMDP): the first optimal DP algorithm
– For noncooperative agents: the first DP algorithm for iterated elimination of dominated strategies
This unifies ideas from game theory and partially observable MDPs.
4
Game Theory
A normal form game involves only one decision per player – there are no dynamics. A mixed strategy is a distribution over (pure) strategies. Example payoff matrix, with the row player's payoff listed first:

        b1    b2
  a1   3,3   0,4
  a2   4,0   1,1
5
Solving Games
One approach to solving games is iterated elimination of dominated strategies. Roughly speaking, this removes all unreasonable strategies. Unfortunately, it can't always prune the game down to a single strategy per player.
6
Dominance
A strategy is dominated if, for every joint distribution over the other players' strategies, there is another strategy that is at least as good. The dominance test can be carried out with linear programming, as sketched below.
[Figure: a game with row strategies a1, a2, a3 and column strategies b1, b2, in which a3 is dominated]
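To make the test concrete, here is a minimal sketch of the LP, assuming a payoff matrix U where U[i, j] is the row player's payoff for pure strategy i against opponent pure strategy j; the function name and tolerance are illustrative, not from the talk.

```python
# Strategy k is (weakly) dominated if some mixture x of the other row
# strategies earns at least as much against every opponent column.
import numpy as np
from scipy.optimize import linprog

def is_dominated(U, k):
    others = [i for i in range(U.shape[0]) if i != k]
    n = len(others)
    # Variables: mixture weights x_1..x_n, then a slack eps.
    # Maximize eps  <=>  minimize -eps.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For each opponent column j:  sum_i x_i U[i, j] >= U[k, j] + eps,
    # rewritten as  -sum_i x_i U[i, j] + eps <= -U[k, j].
    A_ub = np.hstack([-U[others].T, np.ones((U.shape[1], 1))])
    b_ub = -U[k]
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # weights sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]              # eps is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.success and -res.fun >= -1e-9   # max eps >= 0 => dominated

# In the example game, a1 is dominated (by a2) for the row player:
U = np.array([[3.0, 0.0],
              [4.0, 1.0]])
print(is_dominated(U, 0))  # True
```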
7
Dynamic Programming for POMDPs
We'll start with some important concepts:
[Figure: three panels: a policy tree (a root action branching on observations o1, o2 into further actions a1, a2, a3); a piecewise-linear value function; and a belief state, e.g. (s1: 0.25, s2: 0.40, s3: 0.35)]
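A small illustration of how these concepts fit together (the numbers are made up): a policy tree's value is linear in the belief state, so the best tree at a belief b is found by taking a dot product with each tree's value vector.

```python
import numpy as np

b = np.array([0.25, 0.40, 0.35])            # belief over s1, s2, s3
value_vectors = {"tree_a": np.array([1.0, 0.0, 2.0]),
                 "tree_b": np.array([0.5, 1.5, 0.5])}
best = max(value_vectors, key=lambda t: value_vectors[t] @ b)
print(best, value_vectors[best] @ b)        # tree_a 0.95
```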
8
Dynamic Programming
[Figure: value functions of the one-step policies a1 and a2 over the belief space between s1 and s2]
9
[Figure: value functions of all eight two-step policy trees (every root action paired with every assignment of a1/a2 to the o1 and o2 branches) over the belief space between s1 and s2]
10
[Figure: the four two-step policy trees that survive pruning]
11
[Figure: the value functions of the surviving trees over the belief space between s1 and s2]
12
Properties of Dynamic Programming
After T steps, the best policy tree for the initial state s0 is contained in the set of trees that survive pruning. The pruning test is exactly the same as in elimination of dominated strategies in normal form games.
13
Partially Observable Stochastic Game
Multiple agents control a Markov process. Each can have a different observation and reward function.
[Figure: agents 1 and 2 send actions a1, a2 to the world and receive observations and rewards (o1, r1) and (o2, r2)]
14
POSG – Formal Definition
A POSG is a tuple ⟨S, A1, A2, P, R1, R2, Ω1, Ω2, O⟩, where
– S is a finite state set, with initial state s0
– A1, A2 are finite action sets
– P(s, a1, a2, s′) is the state transition function
– R1(s, a1, a2) and R2(s, a1, a2) are reward functions
– Ω1, Ω2 are finite observation sets
– O(s, o1, o2) is the observation function
There is a straightforward generalization to n agents.
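For concreteness, the tuple might be encoded as a data structure like the following; this is a sketch, and the field names are illustrative rather than from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POSG:
    states: Sequence[str]        # S; states[0] is the initial state s0
    actions1: Sequence[str]      # A1
    actions2: Sequence[str]      # A2
    P: Callable[[str, str, str, str], float]  # P(s, a1, a2, s')
    R1: Callable[[str, str, str], float]      # R1(s, a1, a2)
    R2: Callable[[str, str, str], float]      # R2(s, a1, a2)
    obs1: Sequence[str]          # Omega1
    obs2: Sequence[str]          # Omega2
    O: Callable[[str, str, str], float]       # O(s, o1, o2)
```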
15
POSG – More Definitions
– A local policy is a mapping δi : Ωi* → Ai
– A joint policy is a pair ⟨δ1, δ2⟩
– Each agent wants to maximize its own expected reward over T steps
– Although execution is distributed, planning is centralized
A local policy can be written down directly as a table from observation histories to actions, as in the sketch below.
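A minimal sketch, with made-up action and observation names:

```python
# A horizon-2 local policy delta_i encoded as a table from observation
# histories (tuples in Omega_i*) to actions in A_i.
local_policy = {
    (): "a1",        # action taken before any observation
    ("o1",): "a2",   # action after observing o1
    ("o2",): "a1",   # action after observing o2
}

def act(policy, history):
    """Look up the action for the observation history seen so far."""
    return policy[tuple(history)]

print(act(local_policy, ["o1"]))  # a2
```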
16
Strategy Elimination in POSGs
We could simply convert the POSG to normal form and eliminate there. But the number of strategies is doubly exponential in the horizon length: with |A| actions and |Ω| observations, each agent has |A|^((|Ω|^T − 1)/(|Ω| − 1)) distinct depth-T policy trees (see the sanity check below).
[Table: the induced normal-form game, a matrix of payoff pairs (R1ij, R2ij) with one row and column per deterministic policy]
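A quick sanity check of that count (it assumes at least two observations; the function name is illustrative):

```python
def num_policy_trees(num_actions: int, num_obs: int, horizon: int) -> int:
    # A complete depth-T tree has (|O|^T - 1)/(|O| - 1) decision nodes,
    # and each node independently picks one of |A| actions.
    nodes = (num_obs**horizon - 1) // (num_obs - 1)
    return num_actions**nodes

# With 2 actions and 2 observations: 2, 8, 128, 32768 trees at horizons 1-4.
print([num_policy_trees(2, 2, T) for T in (1, 2, 3, 4)])
```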
17
A Better Way to Do Elimination
We use dynamic programming to eliminate dominated strategies without first converting to normal form. Pruning a subtree eliminates the set of trees containing it.
[Figure: pruning a dominated subtree eliminates every larger policy tree that contains it]
18
Generalizing Dynamic Programming
Build policy trees as in the single-agent case; the pruning rule is a natural generalization (a sketch follows the table below).
What to prune, and the space of distributions the pruning test ranges over:
– Normal form game: prune a strategy; test over distributions on the other players' strategies
– POMDP: prune a policy tree; test over distributions on states
– POSG: prune a policy tree; test over distributions on states × the other agents' policy trees
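Here is a high-level sketch of one iteration of the resulting algorithm for a two-agent POSG. This is a reconstruction under assumptions, not the authors' code; `dominated(tree, own_set, other_set)` stands for the LP test from the earlier slide, now over distributions on states × the other agent's trees.

```python
from itertools import product

def backup(trees, actions, observations):
    """Exhaustive backup: a root action plus one old subtree per observation."""
    return [(a, dict(zip(observations, subs)))
            for a in actions
            for subs in product(trees, repeat=len(observations))]

def dp_iteration(trees1, trees2, actions1, actions2, observations, dominated):
    # Grow every depth-(t+1) tree from the depth-t sets.
    trees1 = backup(trees1, actions1, observations)
    trees2 = backup(trees2, actions2, observations)
    # Iterated elimination: alternate pruning until neither set shrinks.
    while True:
        keep1 = [t for t in trees1 if not dominated(t, trees1, trees2)]
        keep2 = [t for t in trees2 if not dominated(t, trees2, keep1)]
        if len(keep1) == len(trees1) and len(keep2) == len(trees2):
            return trees1, trees2
        trees1, trees2 = keep1, keep2
```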
19
Dynamic Programming
[Figure: the initial one-step policy trees, a1 and a2, for each of the two agents]
20
[Figure sequence, slides 20–24: an exhaustive backup builds all eight two-step policy trees for each agent; dominated trees are then pruned from the two agents' sets, one elimination per frame, until no further eliminations are possible]
26
Correctness of Dynamic Programming
Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG.
Corollary: DP can be used to find an optimal joint policy in a cooperative POSG.
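To illustrate the corollary (the names here are assumptions, not from the talk): given the surviving tree sets and an evaluator V for the joint value of a tree pair at the known initial state s0, an optimal cooperative joint policy is the best-scoring pair.

```python
def best_joint_policy(trees1, trees2, V, s0):
    """Pick the tree pair with the highest joint value at the initial state."""
    return max(((t1, t2) for t1 in trees1 for t2 in trees2),
               key=lambda p: V(p[0], p[1], s0))
```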
27
Dynamic Programming in Practice
– Initial empirical results show that much pruning is possible
– We can solve problems with small state sets
– We can import ideas from the POMDP literature to scale up to larger problems: Boutilier & Poole 96; Hauskrecht 00; Feng & Hansen 00; Hansen & Zhou 03; Theocharous & Kaelbling 03
28
Conclusion
– First exact DP algorithm for POSGs
– A natural combination of two ideas: iterated elimination of dominated strategies, and dynamic programming for POMDPs
– Initial experiments on small problems, with ideas for scaling to larger problems