POMDPs: Slides based on Hansen et al.'s tutorial + R&N 3rd Ed., Sec. 17.4.

1 POMDPs Slides based on Hansen et al.'s tutorial + R&N 3rd Ed., Sec. 17.4

2 Planning using Partially Observable Markov Decision Processes: A Tutorial Presenters: Eric Hansen, Mississippi State University Daniel Bernstein, University of Massachusetts/Amherst Zhengzhu Feng, University of Massachusetts/Amherst Rong Zhou, Mississippi State University

3 Introduction and foundations Definition of POMDP Goals, rewards and optimality criteria Examples and applications Computational complexity Belief states and Bayesian conditioning

4 Planning under partial observability [Diagram: an agent with a goal acts on the environment and receives imperfect observations of it]

5 Two Approaches to Planning under Partial Observability Nondeterministic planning Uncertainty is represented by set of possible states No possibility is considered more likely than any other Probabilistic (decision-theoretic) planning Uncertainty is represented by probability distribution over possible states In this tutorial we consider the second, more general approach

6 Markov models

7 Definition of POMDP [Diagram: hidden states s0, s1, s2, ...; observations z0, z1, z2, ...; actions a0, a1, a2, ...; rewards r0, r1, r2, ...]

8 Goals, rewards and optimality criteria Rewards are additive and time-separable, and objective is to maximize expected total reward Traditional planning goals can be encoded in reward function Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state, and assigning all other states negative reward. POMDP allows partial satisfaction of goals and tradeoffs among competing goals Planning horizon can be finite, infinite or indefinite

9 Machine Maintenance Canonical application of POMDPs in Operations Research

10 Robot Navigation Canonical application of POMDPs in AI. Toy example from Russell & Norvig's AI textbook [Diagram: 4x3 grid world with a Start cell and +1 / –1 terminal states; moves succeed with probability 0.8 and slip sideways with probability 0.1 each]. Actions: N, S, E, W, Stop. Observations: sense surrounding walls.

11 Many other applications Helicopter control [Bagnell & Schneider 2001] Dialogue management [Roy, Pineau & Thrun 2000] Preference elicitation [Boutilier 2002] Optimal search and sensor scheduling [Krishnamurthy & Singh 2000] Medical diagnosis and treatment [Hauskrecht & Fraser 2000] Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]

12 Computational complexity Finite-horizon: PSPACE-hard [Papadimitriou & Tsitsiklis 1987]; NP-complete if unobservable. Infinite-horizon: undecidable [Madani, Hanks & Condon 1999]; NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]; NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]

13 Planning for fully observable MDPs Dynamic programming Value iteration [Bellman 1957] Policy iteration [Howard 1960] Scaling up State aggregation and factored representation [Dearden & Boutilier 1997] Hierarchical task decomposition [Dietterich 2000] Heuristic search [Barto et al 1995; Hansen & Zilberstein 2001] Sparse sampling [Kearns et al 2003]

14 Value iteration [Diagram: the DP update improves the value function, from an initial value function to an ε-optimal value function and ε-optimal policy] Finds exact solutions for finite-horizon problems. Finds ε-optimal solutions for infinite-horizon problems.

15 Value-Iteration (Recap) DP update: a step in value iteration. MDP: S, a finite set of states in the world; A, a finite set of actions; T: S x A -> Π(S) (e.g. T(s,a,s') = 0.2); R: S x A -> ℝ (e.g. R(s,a) = 10). Algorithm (sketched below).
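A minimal value-iteration sketch for the fully observable MDP of this slide. The dict-based model layout and the stopping tolerance are illustrative choices, not part of the tutorial.

```python
# Minimal value-iteration sketch for a fully observable MDP (slide 15).
# T[s][a] is a dict {s_next: prob}; R[s][a] is the immediate reward.
# The model format, gamma, and eps are assumptions of this sketch.

def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            # DP update: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
            V_new[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in A
            )
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new
```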

16 POMDP tuple: S, A, T, R of the MDP, plus Ω, a finite set of observations, and O: S x A -> Π(Ω). Belief state (information state): b, a probability distribution over S, e.g. b(s1).
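One way to package the POMDP tuple as plain Python data; the container name and field layout are illustrative. Note that keying the observation function by (action, next state, observation) is a common convention, slightly different from the slide's O: S x A -> Π(Ω).

```python
# A sketch of the POMDP tuple of slide 16 as a plain data container.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]                               # S
    actions: List[str]                              # A
    observations: List[str]                         # Omega
    T: Dict[Tuple[str, str, str], float]            # T(s, a, s')
    R: Dict[Tuple[str, str], float]                 # R(s, a)
    O: Dict[Tuple[str, str, str], float]            # O(a, s', z) -- assumed keying

# A belief state is then just a distribution over states:
# b = {"s1": 0.25, "s2": 0.75}
```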

17 POMDP Goal is to maximize expected long-term reward from the initial state distribution. The state is not directly observed. [Diagram: the agent sends action a to the world and receives observation o]

18 Two sources of POMDP complexity Curse of dimensionality: size of the state space (shared by other planning problems). Curse of memory: size of the value function (number of vectors) or, equivalently, size of the controller (memory); unique to POMDPs. The complexity of each iteration of DP depends on both dimensionality and memory.

19 Two representations of policy Policy maps history to action Since history grows exponentially with horizon, it needs to be summarized, especially in infinite-horizon case Two ways to summarize history belief state finite-state automaton – partitions history into finite number of “states”

20 Belief simplex [Diagram: for 2 states, the belief simplex is the segment between (1, 0) and (0, 1); for 3 states, it is the triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1)]

21 Belief state has Markov property The process of maintaining the belief state is Markovian: for any belief state, the successor belief state depends only on the action and observation. [Diagram: the belief interval from P(s0) = 0 to P(s0) = 1, with transitions labeled by actions a1, a2 and observations z1, z2]

22 Belief-state MDP State space: the belief simplex. Actions: same as before. State transition function: P(b'|b,a) = Σ_{e∈E} P(b'|b,a,e) P(e|b,a). Reward function: r(b,a) = Σ_{s∈S} b(s) r(s,a). Bellman optimality equation: (the sum over successor belief states should really be an integration, since the belief space is continuous)

23 Belief-state controller [Diagram: "State Estimation" P(b|b,a,e) updates the current belief state b (register) after action a and observation e; the policy π maps the belief state to an action] Update the belief state after each action and observation. The policy maps belief state to action. The policy is found by solving the belief-state MDP.

24 POMDP as MDP in Belief Space

25 POMDP - SE SE, the State Estimator, updates the belief state based on the previous belief state, the last action, and the current observation: SE(b, a, o) = b'
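A sketch of the state estimator SE(b, a, o) = b' using the standard Bayes update b'(s') ∝ O(a, s', o) · Σ_s T(s, a, s') b(s) (the same update written out on the later "Belief state" slide). The dict-keyed model access matches the container sketched above and is an assumption of this sketch.

```python
# State estimator SE(b, a, o) = b' (slide 25): Bayesian belief update.

def state_estimator(b, a, o, states, T, O):
    unnorm = {}
    for s2 in states:
        pred = sum(T[(s, a, s2)] * b[s] for s in states)   # prediction step
        unnorm[s2] = O[(a, s2, o)] * pred                   # correction step
    norm = sum(unnorm.values())          # this is P(o | b, a)
    if norm == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return {s2: v / norm for s2, v in unnorm.items()}
```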

26 Dynamic Programming for POMDPs We'll start with some important concepts: policy tree, linear value function, belief state. [Diagram: a policy tree with actions a1, a2, a3 at its nodes and observation branches o1, o2; a linear value function over states s1, s2; an example belief state b(s1) = 0.25, b(s2) = 0.40, b(s3) = 0.35]

27 Dynamic Programming for POMDPs [Diagram: value functions of the one-step plans a1 and a2, drawn as lines over the belief space between s1 and s2]

28 [Diagram: all eight two-step policy trees for two actions (a1, a2) and two observations (o1, o2), with their value functions over the belief space between s1 and s2]

29 [Diagram: the four two-step policy trees that remain useful after pruning, with their value functions over the belief space between s1 and s2]

30 [Diagram: the resulting upper-surface value function over the belief space between s1 and s2]

31 POMDP Value Iteration: Basic Idea [Finite Horizon Case]

32 First Problem Solved Key insight: the value function is piecewise linear & convex (PWLC). Convexity makes intuitive sense: in the middle of belief space there is high entropy, we can't select actions appropriately, and we get less long-term reward; near the corners of the simplex there is low entropy, we take actions more likely to be appropriate for the current world state, and we gain more reward. Each line (hyperplane) is represented by a vector of its coefficients, e.g. V(b) = c1 * b(s1) + c2 * (1 - b(s1)). To find the value at b, find the vector with the largest dot product with b.
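A small sketch of evaluating a PWLC value function as the largest dot product with the belief; representing each alpha vector as a list ordered like the state list is an assumption here.

```python
# Evaluating a piecewise-linear convex value function (slide 32):
# each alpha vector is one hyperplane, and V(b) is the max dot product with b.

def evaluate(b, alpha_vectors, states):
    def dot(alpha):
        return sum(alpha[i] * b[states[i]] for i in range(len(states)))
    best = max(alpha_vectors, key=dot)
    return dot(best), best      # value at b, and the maximizing vector

# In the two-state case each alpha vector has the form [c1, c2],
# so V(b) = max over vectors of c1*b(s1) + c2*(1 - b(s1)).
```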

33 POMDP Value Iteration, Phase 1: One-action plans Two states: 0 and 1. R(0) = 0; R(1) = 1. [stay] -> 0.9 stay, 0.1 go; [go] -> 0.9 go, 0.1 stay. The sensor reports the correct state with probability 0.6. Discount factor = 1.
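The example model of this slide written out as tables. Reading "[stay] -> 0.9 stay; 0.1 go" as "action stay keeps the current state with probability 0.9 and switches it with probability 0.1" (and symmetrically for go) is my interpretation of the slide, as is the dict key layout.

```python
# The two-state example of slide 33 as explicit model tables (a sketch).

states = [0, 1]
actions = ["stay", "go"]
observations = [0, 1]          # the sensor's report of the state
gamma = 1.0                    # discount factor, as on the slide

R = {0: 0.0, 1: 1.0}           # reward depends only on the state here

T = {}
for s in states:
    other = 1 - s
    T[(s, "stay", s)], T[(s, "stay", other)] = 0.9, 0.1   # stay: mostly stays
    T[(s, "go", other)], T[(s, "go", s)] = 0.9, 0.1       # go: mostly switches

# Sensor reports the true state with probability 0.6, regardless of action.
O = {}
for a in actions:
    for s2 in states:
        O[(a, s2, s2)] = 0.6
        O[(a, s2, 1 - s2)] = 0.4
```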

34 POMDP Value Iteration, Phase 2: Two-action (conditional) plans [Diagram: value functions of conditional plans, e.g. one starting with stay, over the belief interval from 0 to 1]

35

36 Point-based Value Iteration: Approximating with Exemplar Belief States
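A minimal point-based backup in the spirit of this slide: keep one alpha vector per exemplar belief point instead of all vectors. PBVI itself is due to Pineau et al. (2003); the dict-keyed model access and vector representation below are assumptions of this sketch, not the tutorial's code.

```python
# Point-based value iteration backup (slide 36), a sketch.

def pbvi_backup(belief_points, Gamma, states, actions, observations,
                T, O, R, gamma=0.95):
    n = len(states)

    def backproject(alpha, a, z):
        # gamma * sum_s' O(a,s',z) T(s,a,s') alpha(s'), as a vector over s
        return [gamma * sum(O[(a, s2, z)] * T[(s, a, s2)] * alpha[j]
                            for j, s2 in enumerate(states))
                for s in states]

    def dot(v, b):
        return sum(v[i] * b[states[i]] for i in range(n))

    new_Gamma = []
    for b in belief_points:
        best_vec, best_val = None, float("-inf")
        for a in actions:
            vec = [R[(s, a)] for s in states]          # immediate reward
            for z in observations:
                # for this (a, z), keep the back-projected vector that is best at b
                cand = max((backproject(al, a, z) for al in Gamma),
                           key=lambda v: dot(v, b))
                vec = [vec[i] + cand[i] for i in range(n)]
            if dot(vec, b) > best_val:
                best_vec, best_val = vec, dot(vec, b)
        new_Gamma.append(best_vec)                     # one vector per belief point
    return new_Gamma
```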

37 Solving infinite-horizon POMDPs Value iteration: iteration of dynamic programming operator computes value function that is arbitrarily close to optimal Optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory But in many cases, as Sondik (1978) and Kaelbling et al (1998) noticed, value iteration converges to a finite set of vectors. In these cases, an optimal policy is equivalent to a finite-state controller.

38 Policy evaluation As in the fully observable case, policy evaluation involves solving a system of linear equations. There is one unknown (and one equation) for each pair of system state and controller node. [Diagram: a two-node controller (q1, q2) with observation-labeled transitions, and the induced chain over the (state, node) pairs s1q1, s2q1, s1q2, s2q2]
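A sketch of that linear system for a finite-state controller, solved with NumPy. The controller encoding (node_action[q], node_succ[(q, z)]) is an assumed representation, not the tutorial's.

```python
# Policy evaluation for a finite-state controller (slide 38):
# V(s,q) = R(s,a_q) + gamma * sum_{s',z} T(s,a_q,s') O(a_q,s',z) V(s', succ(q,z))
import numpy as np

def evaluate_controller(states, observations, nodes, node_action, node_succ,
                        T, O, R, gamma=0.95):
    idx = {(s, q): i for i, (s, q) in
           enumerate((s, q) for s in states for q in nodes)}
    n = len(idx)
    A = np.eye(n)              # coefficient matrix of (I - gamma*M)
    c = np.zeros(n)            # immediate rewards
    for (s, q), i in idx.items():
        a = node_action[q]
        c[i] = R[(s, a)]
        for s2 in states:
            for z in observations:
                q2 = node_succ[(q, z)]
                A[i, idx[(s2, q2)]] -= gamma * T[(s, a, s2)] * O[(a, s2, z)]
    V = np.linalg.solve(A, c)
    return {key: V[i] for key, i in idx.items()}
```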

39 Policy improvement [Diagram: a finite-state controller before and after a DP update; nodes are labeled with actions (a0, a1), edges with observations (z0, z1), and each node corresponds to a linear segment of the value function V(b)]

40 Online Action Selection for POMDPs

41 POMDP Look-Ahead Tree Compare RTDP (for MDPs): here the uncertainty is in the observations, not the action outcomes.

42 Per-Iteration Complexity of POMDP value iteration: the number of α-vectors needed at the t-th iteration, times the time for computing each α-vector.

43 Approximating the POMDP value function with bounds It is possible to get approximate value functions for a POMDP in two ways. Over-constrain it to be a NOMDP: you get the blind value function, which ignores the observation; a "conformant" policy; for infinite horizon it will be the same action always (only |A| policies); under-estimates the value (over-estimates the cost). Relax it to be a FOMDP: you assume the state is fully observable; a "state-based" policy; over-estimates the value (under-estimates the cost).

44 Upper bounds for leaf nodes can come from FOMDP VI and lower bounds from NOMDP VI Observations are written as o or z

45 Comparing POMDPs with Non-deterministic conditional planning [Table: POMDP case vs. non-deterministic case]

46

47 RTDP-Bel doesn’t do look ahead, and also stores the current estimate of value function (see update)

48 ---SLIDES BEYOND THIS NOT COVERED--

49 POMDP - SE

50 POMDP - Π Focus on the Π component. POMDP -> "Belief MDP" with MDP parameters: S => B, the set of belief states; A => same; T => τ(b,a,b'); R => ρ(b,a). Solve with the value-iteration algorithm.

51 POMDP - Π τ(b,a,b’) ρ(b, a)

52 Two Problems How to represent the value function over a continuous belief space? How to update value function V_t from V_{t-1}? POMDP -> MDP: S => B, the set of belief states; A => same; T => τ(b,a,b'); R => ρ(b,a).

53 Running Example POMDP with two states (s1 and s2), two actions (a1 and a2), three observations (z1, z2, z3). 1D belief space for a 2-state POMDP: the probability that the state is s1.

54 Second Problem We can't iterate over all belief states (infinitely many) for value iteration, but: given the vectors representing V_{t-1}, generate the vectors representing V_t.

55 Horizon 1 No future: the value function consists only of immediate reward, e.g. R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5. For b = (0.25, 0.75): Value of doing a1 = 1 x b(s1) + 0 x b(s2) = 1 x 0.25 + 0 x 0.75 = 0.25. Value of doing a2 = 0 x b(s1) + 1.5 x b(s2) = 0 x 0.25 + 1.5 x 0.75 = 1.125.

56 Second Problem Break problem down into 3 steps -Compute value of belief state given action and observation -Compute value of belief state given action -Compute value of belief state

57 Horizon 2 – Given action & obs If in belief state b, what is the best value of doing action a1 and seeing z1? Best value = best value of immediate action + best value of next action. Best value of immediate action = horizon 1 value function.

58 Horizon 2 – Given action & obs Assume the immediate action is a1 and the observation is z1. What's the best action for the b' that results from the initial b when we perform a1 and observe z1? Not feasible to do this for all belief states (infinitely many).

59 Horizon 2 – Given action & obs Construct function over entire (initial) belief space from horizon 1 value function with belief transformation built in

60 Horizon 2 – Given action & obs S(a1, z1) corresponds to the paper's S(), with the following built in: the horizon 1 value function; the belief transformation; the "weight" of seeing z after performing a; the discount factor; the immediate reward. S() is PWLC.
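A sketch of building the transformed sets S(a, z). Following a common convention in the incremental-pruning literature, the immediate reward is split evenly across observations so that the later cross-sum over z adds back the full reward; that convention and the dict-keyed model access are assumptions here.

```python
# Building the transformed set S(a, z) of slide 60 from the previous-stage
# vectors: each new vector folds in reward, the weight of seeing z,
# the discount factor, the belief transformation, and V_{t-1}.

def transformed_set(Gamma_prev, a, z, states, observations, T, O, R,
                    gamma=0.95):
    vectors = []
    for alpha in Gamma_prev:                 # alpha is a list over states
        vec = []
        for s in states:
            future = sum(O[(a, s2, z)] * T[(s, a, s2)] * alpha[j]
                         for j, s2 in enumerate(states))
            vec.append(R[(s, a)] / len(observations) + gamma * future)
        vectors.append(vec)
    return vectors
```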

61 Second Problem Break problem down into 3 steps -Compute value of belief state given action and observation -Compute value of belief state given action -Compute value of belief state

62 Horizon 2 – Given action What is the horizon 2 value of a belief state given that the immediate action is a1? [Diagram: at horizon 2, do action a1; at horizon 1, do action ... ?]

63 Horizon 2 – Given action What’s the best strategy at b? How to compute line (vector) representing best strategy at b? (easy) How many strategies are there in figure? What’s the max number of strategies (after taking immediate action a1)?

64 Horizon 2 – Given action How can we represent the 4 regions (strategies) as a value function? Note: each region is a strategy

65 Horizon 2 – Given action Sum up the vectors representing each region. A sum of vectors is a vector (add lines, get lines). Corresponds to the paper's transformation.
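That "sum up vectors" step is a cross-sum: pick one vector from each per-observation set S(a, z) and add them, for every combination. The brute-force itertools version below is only a sketch; incremental pruning exists precisely to avoid enumerating all combinations.

```python
# Cross-sum of per-observation vector sets (slide 65), brute force.
from itertools import product

def cross_sum(sets_per_observation):
    """sets_per_observation: list of vector sets, one per observation z."""
    result = []
    for combo in product(*sets_per_observation):
        summed = [sum(vals) for vals in zip(*combo)]   # add lines, get lines
        result.append(summed)
    return result
```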

66 Horizon 2 – Given action What does each region represent? Why is this step hard (alluded to in paper)?

67 Second Problem Break problem down into 3 steps -Compute value of belief state given action and observation -Compute value of belief state given action -Compute value of belief state

68 Horizon 2 [Diagram: the horizon-2 value function is the union (upper surface) of the value functions for immediate actions a1 and a2]

69 Horizon 2 This tells you how to act! [Diagram: the value function with each belief region labeled by its best initial action]

70 Purge

71 Second Problem Break problem down into 3 steps -Compute value of belief state given action and observation -Compute value of belief state given action -Compute value of belief state Use horizon 2 value function to update horizon 3’s...

72 The Hard Step It is easy to visually inspect the figure to obtain the different regions, but in a higher-dimensional space, with many actions and observations, this is a hard problem.

73 Naïve way - Enumerate How does Incremental Pruning do it?

74 Incremental Pruning How does IP improve on the naive method? Will IP ever do worse than the naive method? [Diagram: combinations of per-observation sets, interleaved with purge/filter steps]

75 Incremental Pruning What other novel idea(s) are in IP? RR: come up with a smaller set D as the argument to Dominate(). RR has more linear programs but fewer constraints in the worst case. Empirically, the reduction in constraints saves more time than the additional linear programs require.

76 Incremental Pruning What other novel idea(s) are in IP? RR: come up with a smaller set D as the argument to Dominate(). Why are the terms after the union (∪) needed?

77 Identifying a Witness Witness Theorem: Let Ua be a set of vectors representing the value function. Let u be in Ua (e.g. u = α_{z1,a2} + α_{z2,a1} + α_{z3,a1}). If there is a vector v which differs from u in one observation (e.g. v = α_{z1,a1} + α_{z2,a1} + α_{z3,a1}) and there is a b such that b·v > b·u, then Ua is not equal to the true value function.

78 Witness Algorithm Randomly choose a belief state b. Compute the vector representing the best value at b (easy). Add the vector to the agenda. While the agenda is not empty: get the vector V_top from the top of the agenda; b' = Dominate(V_top, Ua); if b' is not null (there is a witness), compute the vector u for the best value at b' and add it to Ua, then compute all vectors v that differ from u at one observation and add them to the agenda.

79 Linear Support If the value function is incorrect, the biggest difference is at the edges (by convexity).

80 Linear Support

81 Policy Iteration Only for infinite-horizon problems. Takes many fewer iterations to converge than value iteration. [Diagram: start from an initial policy, alternate evaluate policy and improve policy (DP update) until an ε-optimal policy is reached]

82 Belief state Under partial observability, the entire history of the process may be relevant for decision making, but a vector of state probabilities updated by Bayesian conditioning, called a belief state, contains all relevant information from the history. Equation for updating belief state b after action a and observation z, to create belief state b': b'(s') = Σ_{s∈S} P(s'|s,a) P(z|s',a) b(s) / P(z|b,a)

83 Basic Planning Algorithms for POMDPs Grid-based approximation Value iteration Policy iteration

84 Crude but widely-used approximation Solve the fully-observable MDP. Choose the action based on the belief state and the value function of the fully-observable MDP: argmax_a Σ_{s∈S} b(s) [r(s,a) + Σ_{s'} P(s'|s,a) V(s')]. Advantage: can solve very large problems. Drawback: assumes perfect information after each action, so it can't handle information-gathering.
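A sketch of that action-selection rule (often called QMDP): solve the underlying MDP, then weight its Q-values by the current belief. The discount parameter (default 1.0 to match the slide's formula) and dict layout are assumptions of this sketch.

```python
# Action selection from the MDP approximation of slide 84.

def qmdp_action(b, V_mdp, states, actions, T, R, gamma=1.0):
    def q(a):
        # sum_s b(s) [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        return sum(
            b[s] * (R[(s, a)] +
                    gamma * sum(T[(s, a, s2)] * V_mdp[s2] for s2 in states))
            for s in states)
    return max(actions, key=q)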

85 Grid-based approximation Oldest approach to solving POMDPs Continuous state space of belief-state MDP approximated by finite set of grid points Evaluate non-grid points using interpolation “Crude approximation” of previous slide is simplest kind of grid Grid-based approximation is used for other continuous-space problems

86 Convexity of optimal value function [Diagram: a convex value function V(b) over the belief interval from P(s0) = 1 to P(s0) = 0]

87 Interpolation and convex combination Interpolation requires finding a subset of grid points of which the query belief is a "convex combination". Among the many possible convex combinations, the problem is to find a good one quickly.

88 Fixed-resolution regular grid [Lovejoy 1991] M = 1 M = 2 M = 4

89 Non-regular grid [Hauskrecht 1997; Brafman 1997]

90 Variable-resolution regular grid [Zhou & Hansen 2001] Combines advantages of previous methods: Regular grid: Efficient interpolation based on (generalized) triangulation Variable resolution: Avoids explosion in grid size. Grid is only refined where needed.

91 Grid refinement in variable-resolution regular grid [Diagram: a refining point whose M = 2 sub-simplex is refined into an M = 4 sub-simplex around the maximum-error successor belief state]

92 Interpolation in variable-resolution regular grid [Diagram: a virtual point, the smallest complete sub-simplex containing it, and the virtual sub-simplex used for interpolation]

93 Time to solve grid-MDP as function of grid size

94 Larger grids allow better approximations

95 Backup: Grid-based vs. vector-based [Diagram: the current belief state and its successor belief states for each action a0, a1 and observation z0, z1] Grid-based: compute the backed-up value of the current belief state, evaluating non-grid successor belief states using interpolation. Vector-based: compute a vector of backed-up values, one for each underlying state.

96 Piecewise linear and convex value function [Smallwood & Sondik 1973] [Diagram: V(b) over the belief interval from P(s0) = 1 to P(s0) = 0, as the upper surface of linear segments α0, α1, α2]

97 Value of policy tree [Diagram: a policy tree for horizon 2 rooted at a0 with observation branches z0, z1, and its linear value function with endpoints 5.0 at b = 0 and 2.0 at b = 1] V(0.4) = 0.6 * 5.0 + 0.4 * 2.0 = 3 + 0.8 = 3.8

98 Number of policy trees: |A|^((|Z|^T − 1)/(|Z| − 1)) at horizon T. Example for |A| = 4 and |Z| = 2: Horizon 0: 1; Horizon 1: 4; Horizon 2: 64; Horizon 3: 16,384; Horizon 4: 1,073,741,824.

99 Pointwise dominance Pruning a dominated vector has no effect on the value of any belief state. [Diagram: V(b) over the belief interval, with a pointwise dominated vector lying below another vector everywhere]

100 Linear program test for dominance Variables: b(s) for each s ∈ S, and d. Maximize: d. Constraints: b · α ≥ b · α' + d for every other vector α'; Σ_s b(s) = 1, b(s) ≥ 0. Prune vector α if d ≤ 0.
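A sketch of that linear program using scipy.optimize.linprog (SciPy is my choice of solver, not the tutorial's): maximize d subject to b·α ≥ b·α' + d for every other vector, with b a probability distribution, and prune α if the optimal d ≤ 0.

```python
# LP dominance test of slides 100-101 (a sketch).
import numpy as np
from scipy.optimize import linprog

def is_dominated(alpha, others, tol=1e-9):
    if not others:
        return False                          # nothing can dominate alpha
    alpha = np.asarray(alpha, dtype=float)
    n = len(alpha)
    # decision variables: b(s_1), ..., b(s_n), d ; objective: maximize d
    c = np.zeros(n + 1)
    c[-1] = -1.0                              # linprog minimizes, so use -d
    A_ub, b_ub = [], []
    for other in others:
        row = np.append(np.asarray(other, dtype=float) - alpha, 1.0)
        A_ub.append(row)                      # b·(other - alpha) + d <= 0
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]       # sum_s b(s) = 1
    b_eq = [1.0]
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    d_opt = -res.fun
    return d_opt <= tol                       # no belief state prefers alpha
```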

101 Linear program test for dominance [Diagram: V(b) over the belief interval with a dominated vector that lies below the upper surface but is not pointwise dominated by any single vector] This test can identify any dominated vector, including vectors that are not pointwise dominated.

102 Dynamic programming operator Given a set of vectors Γ_{t-1} representing value function V_{t-1}, a set of vectors Γ_t representing value function V_t is computed as follows: generate all |A| |Γ_{t-1}|^{|Z|} stage-t vectors, then prune dominated vectors. This very simple algorithm is attributed to Monahan. More efficient algorithms avoid the expensive step of generating all possible vectors.
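A compact sketch of this enumerate-then-prune update. Only cheap pointwise dominance (slide 99) is shown for the pruning step; a full implementation would also run the LP test above. The dict-keyed model access and reward-splitting convention are again assumptions.

```python
# Monahan-style DP update (slide 102), brute force: enumerate all
# |A| * |Gamma|^|Z| candidate vectors, then prune pointwise-dominated ones.
from itertools import product

def dp_update(Gamma_prev, states, actions, observations, T, O, R, gamma=0.95):
    nZ = len(observations)
    candidates = []
    for a in actions:
        per_obs = []
        for z in observations:
            per_obs.append([
                [R[(s, a)] / nZ + gamma *
                 sum(O[(a, s2, z)] * T[(s, a, s2)] * alpha[j]
                     for j, s2 in enumerate(states))
                 for s in states]
                for alpha in Gamma_prev])
        for combo in product(*per_obs):                     # cross-sum over z
            candidates.append([sum(v) for v in zip(*combo)])

    def dominated(v, others):
        # strict pointwise dominance: some u is >= everywhere and > somewhere
        return any(all(u[i] >= v[i] for i in range(len(v))) and
                   any(u[i] > v[i] for i in range(len(v)))
                   for u in others)

    return [v for v in candidates
            if not dominated(v, [u for u in candidates if u is not v])]
```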

103 Incremental pruning Each α^{a,z,k} is an |S|-vector associated with a particular action a, observation z, and successor vector α^k. A value vector can be decomposed into |Z| components, one for each observation.

104 Policy graph [Diagram: for a 2-state example, the value function V(b) with segments α0, α1, α2 and the corresponding policy graph whose nodes are the vectors and whose edges are labeled by observations z0, z1]

105 Policy iteration for POMDPs Sondik’s (1978) algorithm represents policy as a mapping from belief states to actions only works under special assumptions very difficult to implement never used Hansen’s (1998) algorithm represents policy as a finite-state controller fully general easy to implement faster than value iteration

106 Properties of policy iteration Theoretical: monotonically improves the finite-state controller; converges to an ε-optimal finite-state controller after a finite number of iterations. Empirical: runs from 10 to over 100 times faster than value iteration.

107 Improved performance

108 Stochastic finite-state controllers A further generalization of the policy search approach. A finite-state controller is stochastic when the action choice at each node is stochastic (based on a probability distribution for that node) and the transition to the next node is stochastic (based on a probability distribution for that transition). Because a stochastic policy is a continuous function of its parameters, gradient ascent can be used to improve it [Meuleau et al 1999; Baxter & Bartlett 2000]. Policy iteration can also be generalized to find stochastic finite-state controllers [Poupart & Boutilier 2004]

109 Scaling up State abstraction and factored representation Belief compression Forward search and sampling approaches Hierarchical task decomposition

110 State abstraction and factored representation of POMDP DP algorithms are typically state-based Most AI representations are “feature-based” |S| is typically exponential in the number of features (or variables) – the “curse of dimensionality” State-based representations for problems with more than a few variables are impractical Factored representations exploit regularities in transition and observation probabilities, and reward

111 Example: Part-painting problem [Draper, Hanks, Weld 1994] Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO). Actions: Inspect, Paint, Ship, Reject, Notify. Cost function: cost of 1 for each action; cost of 1 for shipping an unflawed part that is not painted; cost of 10 for shipping a flawed part or rejecting an unflawed part. Initial belief state: Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|¬FL) = 0.0, Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0

112 Factored representation of MDP [Boutilier et al. 1995; Hoey, St. Aubin, Hu, & Boutilier 1999] A dynamic Bayesian network captures variable independence; an algebraic decision diagram captures value independence. [Diagram: a dynamic belief network over the state variables with its probability tables and the equivalent decision diagrams; e.g. Pr(FL'=T) = 1.0 if FL=T and 0.0 otherwise, and Pr(PA'=T) = 1.0 if PA=T, 0.95 if PA, SH, RE, NO are all false, and 0.0 if PA=F and any of SH, RE, NO is true]

113 Decision diagrams [Diagram: a binary decision diagram (BDD) over variables X, Y, Z with terminal nodes TRUE/FALSE, and an algebraic decision diagram (ADD) over the same variables with numeric leaves such as 5.8, 3.6, 18.6, 9.5]

114 Operations on decision diagrams Addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value. The complexity of the operators depends on the size of the decision diagrams, not the number of states! [Diagram: an example ADD addition, combining two ADDs leaf-wise to produce a new ADD]

115 Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000] Factored representation of value function: replace |S|-vectors with ADDs that only make relevant state distinctions Two steps of DP algorithm Generate new ADDs for value function Prune dominated ADDs State abstraction is based on aggregating states with the same value

116 Generation step: symbolic implementation [Diagram: new ADDs α^k_{t+1} are generated from the previous-stage ADDs α^i_t by combining the action's transition-probability, observation-probability, and reward ADDs for observations obs1, obs2, obs3]

117 Pruning step: symbolic implementation Pruning is the most computationally expensive part of the algorithm: a linear program must be solved for each (potential) ADD in the value function. Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency.

118 Improved performance Degree of abstraction = number of abstract states / number of primitive states [Results omitted]

119 Optimal plan (controller) for part-painting problem [Diagram: a finite-state controller over abstract states described by formulas over FL, BL, PA, PR, NO, with actions Inspect, Paint, Ship, Reject, Notify and transitions labeled OK / ~OK]

120 Approximate state aggregation Simplify each ADD in the value function by merging leaves that differ in value by less than ε (e.g. ε = 0.4).

121 Approximate pruning Prune vectors from the value function that add less than ε to the value of any belief state. [Diagram: vectors α1, α2, α3, α4 over the belief interval from (1,0) to (0,1), with one vector pruned because it adds at most ε]

122 Error bound These two methods of approximation share the same error bound: "weak convergence," i.e., convergence to within 2ε/(1−γ) of optimal (where γ is the discount factor). After "weak convergence," decreasing ε allows further improvement. Starting with a relatively high ε and gradually decreasing it accelerates convergence.

123 Approximate dynamic programming Strategy: ignore differences of value less than some threshold ε. The complementary methods, approximate state aggregation and approximate pruning, address the two sources of complexity: the size of the state space and the size of the value function (memory).

124 Belief compression Reduce the dimensionality of belief space by approximating the belief state. Examples of approximate belief states: the tuple of most-likely state plus the entropy of the belief state [Roy & Thrun 1999]; belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]. Standard POMDP algorithms can be applied in the lower-dimensional belief space, e.g., grid-based approximation.

125 Forward search [Diagram: a look-ahead tree from the current belief state, alternating action branches (a0, a1) and observation branches (z0, z1)]

126 Sparse sampling Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al 2000; Ng & Jordan 2000]. Remarkably, the complexity is independent of the size of the state space! An on-line planner selects an ε-optimal action for the current belief state.

127 State-space decomposition For some POMDPs, each action/observation pair identifies a specific region of the state space

128 Motivating Example Continued A “deterministic observation” reveals that world is in one of a small number of possible states Same for “hybrid POMDPs”, which are POMDPs with some fully observable and some partially observable state variables

129 Region-based dynamic programming Tetrahedron and surfaces

130 Hierarchical task decomposition We have considered abstraction in state space Now we consider abstraction in action space For fully observable MDPs: Options [Sutton, Precup & Singh 1999] HAMs [Parr & Russell 1997] Region-based decomposition [Hauskrecht et al 1998] MAXQ [Dietterich 2000] Hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed Hierarchical optimality (Parr and Russell) Recursive optimality (Dietterich)

131 Hierarchical approach to POMDPs Theocharous & Mahadevan (2002) based on hierarchical hidden Markov model approximation ~1000 state robot hallway-navigation problem Pineau et al (2003) based on Dietterich’s MAXQ decomposition approximation ~1000 state robot navigation and dialogue Hansen & Zhou (2003) also based on Dietterich’s MAXQ decomposition convergence guarantees and epsilon-optimality

132 Macro action as finite-state controller Allows exact modeling of the macro's effects: macro state transition probabilities and macro rewards. [Diagram: a navigation macro as a finite-state controller with action nodes North, South, East, West, Stop and transitions labeled goal / clear / wall]

133 Taxi example [Dietterich 2000]

134 Task hierarchy [Dietterich 2000] [Diagram: Taxi at the root; subtasks Get and Put; Get uses Pickup and Navigate, Put uses Putdown and Navigate; Navigate uses the primitive actions North, South, East, West]

135 Hierarchical finite-state controller [Diagram: Get and Put sub-controllers, each invoking a Navigate sub-controller plus Pickup or Putdown, bottoming out in the primitive actions North, South, East, West, Stop]

136 MAXQ-hierarchical policy iteration Create an initial sub-controller for each sub-POMDP in the hierarchy. Repeat until the error bound is less than ε: identify the subtask that contributes most to the overall error; use policy iteration to improve the corresponding controller; for each node of the controller, create an abstract action (for the parent task) and compute its model; propagate the error up through the hierarchy.

137 Modular structure of controller

138 Complexity reduction Per-iteration complexity of policy iteration: |A| |Q|^{|Z|}, where A is the set of actions, Z is the set of observations, and Q is the set of controller nodes. Per-iteration complexity of hierarchical PI: |A| Σ_i |Q_i|^{|Z|}, where |Q| = Σ_i |Q_i|. With hierarchical decomposition, the complexity is a sum over the subproblems instead of a product.

139 Scalability MAXQ-hierarchical policy iteration can solve any POMDP, if it can decompose it into sub-POMDPs that can be solved by policy iteration. Although each sub-controller is limited in size, the hierarchical controller is not limited in size. Although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size.

140 Multi-Agent Planning with POMDPs Partially observable stochastic games Generalized dynamic programming

141 Multi-Agent Planning with POMDPs Many planning problems involve multiple agents acting in a partially observable environment. The POMDP framework can be extended to address this. [Diagram: two agents each send actions a1, a2 to the world and receive observations and rewards (z1, r1), (z2, r2)]

142 Partially observable stochastic game (POSG) A POSG is ⟨S, A1, A2, Z1, Z2, P, r1, r2⟩, where S is a finite state set with initial state s0; A1, A2 are finite action sets; Z1, Z2 are finite observation sets; P(s'|s, a1, a2) is the state transition function; P(z1, z2 | s, a1, a2) is the observation function; r1(s, a1, a2) and r2(s, a1, a2) are reward functions. Special cases: all agents share the same reward function; zero-sum games.

143 Plans and policies A local policy is a mapping π_i : Z_i* -> A_i. A joint policy is a pair ⟨π_1, π_2⟩. Each agent wants to maximize its own long-term expected reward. Although execution is distributed, planning can be centralized.

144 Beliefs in POSGs With a single agent, a belief is a distribution over states How does this generalize to multiple agents? Could have beliefs over beliefs over beliefs, but there is no algorithm for working with these

145 Example States: grid cell pairs. Actions: move up, down, left, right. Transitions: noisy. Goal: pick up balls. Observations: red lines.

146 Another Example States: who has a message to send? Actions: send or don't send. Reward: +1 for a successful broadcast, 0 if there is a collision or the channel is not used. Observations: was there a collision? (noisy)

147 Strategy Elimination in POSGs Could simply convert to normal form, but the number of strategies is doubly exponential in the horizon length. [Diagram: the normal-form payoff matrix with entries (R_ij^1, R_ij^2) for each pair of strategies]

148 Generalized dynamic programming Initialize 1-step policy trees to be actions. Repeat: evaluate all pairs of t-step trees from the current sets; iteratively prune dominated policy trees; form exhaustive sets of (t+1)-step trees from the remaining t-step trees.

149 What Generalized DP Does The algorithm performs iterated elimination of dominated strategies in the normal form game without first writing it down For cooperative POSGs, the final sets contain the optimal joint policy

150 Some Implementation Issues As before, pruning can be done using linear programming Algorithm keeps value function and policy trees in memory (unlike POMDP case) Currently no way to prune in an incremental fashion

151 A Better Way to Do Elimination We use dynamic programming to eliminate dominated strategies without first converting to normal form. Pruning a subtree eliminates the set of trees containing it. [Diagram: pruning one subtree eliminates every larger policy tree that contains it]

152 Dynamic Programming Build the policy tree sets for both agents simultaneously. Prune using a generalized belief space. [Diagram: a generalized belief over pairs of world state and the other agent's candidate policy tree, i.e. the cross product of the state space and the other agent's tree set]

153 Dynamic Programming [Diagram: initial 1-step policy trees a1, a2 for each agent]

154-158 [Diagrams: successive iterations of the generalized DP, showing the exhaustive sets of 2-step policy trees for both agents and the dominated trees being pruned step by step]

159

160 Normal Form Representation The normal form representation of a POSG is huge (doubly exponential), so we cannot work with it directly. [Example normal form game: rows a1, a2 vs. columns b1, b2 with payoffs (3,3), (0,4) / (4,0), (1,1)]

161 Complexity of POSGs The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002]. Implications: the problem is provably intractable (because P ≠ NEXP); it probably requires doubly exponential time to solve in the worst case.

162 Generalized beliefs Belief states can be generalized to include uncertainty about the other agents' future policies [Hansen et al. 2004]. [Diagram: a generalized belief over pairs of world state and the other agent's candidate policy tree]

163 What Generalized DP Does When a subtree is pruned, the trees containing it are eliminated from the normal form game; these policies are dominated. [Diagram: pruning one subtree and the corresponding trees eliminated from the game]

164 Ongoing work Infinite-horizon extension using finite-state controllers to represent policies Can get convergence to optimality in the cooperative case Dealing with computational complexity Could be doubly exponential in the horizon However, approximation techniques from the POMDP literature apply

165 Wrap up Summary and review of key ideas Discussion of future research directions and open problems

