Solving Factored POMDPs with Linear Value Functions
Carlos Guestrin, Daphne Koller (Stanford University), Ronald Parr (Duke University)

Policy Iteration for POMDPs [Hansen '98]
[Figure: a two-node finite-state controller (node 1: action a1, node 2: action a2) with observation transitions O1, O2, shown next to plots of the value function over belief space b with α-vectors α1, α2 and, after the backup, α3]
Value Determination → DP Step → Policy Improvement

Policy Iteration for POMDPs [Hansen '98]
[Figure: value-function plots over belief b with α-vectors α1, α2, α3; after the DP step and policy improvement the controller has three nodes (1:a1, 3:a1, 2:a2) with observation edges O1, O2]
Value Determination → DP Step → Policy Improvement

POMDP Complexity
POMDPs have multiple sources of complexity:
- The number of vectors can grow exponentially:
  - Avoid generating unneeded facets: Witness, IP, etc.;
  - Approximate by discarding similar vectors, etc.
- Each vector has a large representation:
  - One dimension for each state; 2^n dimensions for n state variables;
  - Can try structured representations of the vectors. [Boutilier & Poole '96] [Hansen & Feng '00]
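
To make the second source of complexity concrete: with n binary state variables, a flat representation needs one dimension per state, so each α-vector lives in a space of size

\[ |S| = 2^n, \qquad \alpha \in \mathbb{R}^{2^n}, \]

and already n = 30 variables give more than 10^9 entries per vector.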

Factored POMDPs
- Total reward is a sum of sub-rewards: R = R1 + R2;
- Only a subset of the variables is observed;
- Actions only change small parts of the model.
[Figure: dynamic Bayesian network over variables X, Y, Z at time t and X', Y', Z' at time t+1, with reward nodes R1, R2, observation nodes OX', OZ', and action nodes AX, AZ]
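
As a concrete (assumed) instance of such a factored model, a DBN of roughly the shape sketched above might factor the transition and reward models as

\[
P(x', y', z' \mid x, y, z, a) \;=\; P(x' \mid x, a)\, P(y' \mid x, y)\, P(z' \mid y, z),
\qquad
R(x, y, z) \;=\; R_1(y) + R_2(z),
\]

with each observation variable (e.g. O_{X'}) depending only on a single next-state variable and the local action; the exact parent sets here are illustrative, not taken from the slide.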

Exploiting Structure
- Structure in the model might imply structure in the vectors;
- Structured vectors approach: [Boutilier & Poole '96], [Hansen & Feng '00]
  - Within a vector, many dimensions may be equivalent;
  - Collapse them using a tree;
  - Works well if the DBN structure leads to a clean decomposition;
  - Doesn't always hold up, even in MDPs.
[Figure: α-vectors α1, α2 over belief b = P(XYZ), with a decision tree over X and Z representing a vector compactly]

Our Approach
- Not all structured POMDPs have structured vectors;
- Embed structure into the value function space a priori:
  - Project α-vectors into a structured vector space;
  - Efficiently find the closest approximation to the "true" α-vectors.
- Linear combination of structured features.
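
Concretely, the value function is restricted to a linear combination of basis functions, each defined over a small subset of the state variables (notation assumed here):

\[
V(\mathbf{x}) \;\approx\; \sum_{i=1}^{k} w_i\, h_i(\mathbf{c}_i),
\]

where \(\mathbf{c}_i\) is the assignment to the small variable subset \(C_i\) that \(h_i\) depends on; each α-vector is then represented compactly by its weight vector w.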

Exploiting Structure in PI and Incremental Pruning
[Figure: the policy-iteration loop (Value Determination → DP Step → Policy Improvement) with the controller and α-vector plots; the operations to factor are Best, Pointwise Dominates, and Value Determination, with Best highlighted]

Factored Best
- Want to find the vector with the highest value for a given belief state;
- Factorization decomposes the dot product.
[Figure: α-vectors α1, α2, α3 over belief space, with the best vector identified at a particular belief point b]
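
In symbols (notation assumed): the Best operation picks \(\arg\max_j \alpha_j \cdot b\), and for a factored vector \(\alpha_j = \sum_i w_i^{(j)} h_i\) the dot product decomposes as

\[
\alpha_j \cdot b \;=\; \sum_i w_i^{(j)} \sum_{\mathbf{c}_i \in \mathrm{Dom}(C_i)} h_i(\mathbf{c}_i)\, P_b(C_i = \mathbf{c}_i),
\]

so only the marginals of the belief state over each basis-function domain are needed.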

Factored Best Example
- Assume 4 state variables and 3 basis functions;
- The decomposition of the dot product gives summands that depend only on marginal probabilities (see the sketch below).
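
A minimal sketch of this decomposition, with made-up basis functions, weights, and a uniform belief (none of these numbers are from the talk):

import itertools

# Hypothetical illustration: 4 binary state variables X1..X4 and 3 basis
# functions over pairs of variables, so computing alpha . b needs only
# pairwise marginals of the belief, never the full 2^4 joint.
domains = [("X1", "X2"), ("X2", "X3"), ("X3", "X4")]
h = [
    {(a, b): float(a == b) for a in (0, 1) for b in (0, 1)},   # h1(X1, X2)
    {(a, b): float(a and b) for a in (0, 1) for b in (0, 1)},  # h2(X2, X3)
    {(a, b): float(a or b) for a in (0, 1) for b in (0, 1)},   # h3(X3, X4)
]
w = [1.0, 0.5, -2.0]  # weights of this alpha-vector in the basis

# Pairwise belief marginals over each basis domain (here, a uniform belief).
marginals = [
    {assign: 0.25 for assign in itertools.product((0, 1), repeat=2)}
    for _ in domains
]

# alpha . b = sum_i w_i * sum_{c_i} h_i(c_i) * P_b(C_i = c_i)
value = sum(
    w[i] * sum(h[i][c] * marginals[i][c] for c in marginals[i])
    for i in range(len(h))
)
print(value)  # -0.875 for these made-up numbers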

Factored Best Properties
- Avoids exponential blowup in the belief state representation;
- Exponential in the size of the basis function domains;
- Suggests a belief state decomposition:
  - Factored Best only requires marginals;
  - Useful at execution time;
- Monitoring the belief state:
  - Can represent the belief state as a product of marginals; [Boyen & Koller '98]
  - Analyze the policy loss from belief state approximation. [McAllester & Singh '99] [Poupart & Boutilier '01]
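
In particular, following [Boyen & Koller '98], the belief can be monitored approximately as a product of marginals over small clusters (cluster notation assumed),

\[ b_t(x_1, \ldots, x_n) \;\approx\; \prod_j P_t(C_j = \mathbf{c}_j), \]

which supplies exactly the marginals that Factored Best needs at execution time.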

Exploiting Structure in PI and Incremental Pruning
[Figure: the same policy-iteration loop (Value Determination → DP Step → Policy Improvement); the factored operations are Best, Pointwise Dominates, and Value Determination, with Pointwise Dominates highlighted]

Pointwise Domination
- Does α2 dominate α4 pointwise? Check whether the minimum of α2 − α4 over all states is ≥ 0;
- With factored value functions this is a minimization over an exponential state space!
- Minimization over a factored function is efficient with cost networks. [Bertele and Brioschi '72], [Dechter '99]
[Figure: α-vectors α1 through α4 over belief space b]
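
The dominance test can be carried out by variable elimination over the factored difference α2 − α4. Below is a minimal illustrative sketch (not the authors' implementation), assuming binary variables and factors stored as (scope, table) pairs:

import itertools

def min_factored(factors, variables):
    """Minimize sum_i f_i over all assignments to binary `variables`,
    where each factor is a pair (scope, table) and `table` maps an
    assignment tuple of `scope` to a value (variable elimination)."""
    factors = list(factors)
    for v in variables:
        related = [f for f in factors if v in f[0]]
        factors = [f for f in factors if v not in f[0]]
        if not related:
            continue
        # New factor over the union of the related scopes, with v minimized out.
        scope = tuple(sorted({u for s, _ in related for u in s} - {v}))
        table = {}
        for assign in itertools.product((0, 1), repeat=len(scope)):
            ctx = dict(zip(scope, assign))
            vals = []
            for x_v in (0, 1):
                ctx[v] = x_v
                vals.append(sum(t[tuple(ctx[u] for u in s)] for s, t in related))
            table[assign] = min(vals)
        factors.append((scope, table))
    # All variables eliminated: the remaining factors are constants.
    return sum(t[()] for _, t in factors)

# Example with made-up basis functions and weights: with factored vectors
# alpha_j = sum_i w_i^(j) h_i, build one factor per basis function with
# table (w2_i - w4_i) * h_i and check that the minimum is >= 0.
h1 = (("X1", "X2"), {(a, b): float(a == b) for a in (0, 1) for b in (0, 1)})
h2 = (("X2", "X3"), {(a, b): float(a and b) for a in (0, 1) for b in (0, 1)})
w2, w4 = [1.0, 2.0], [0.5, 1.0]
diff = [(s, {c: (w2[i] - w4[i]) * t[c] for c in t})
        for i, (s, t) in enumerate([h1, h2])]
print(min_factored(diff, ["X1", "X2", "X3"]) >= 0)  # True for these numbers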

Exploiting Structure in PI and Incremental Pruning
[Figure: the same policy-iteration loop (Value Determination → DP Step → Policy Improvement); the factored operations are Best, Pointwise Dominates, and Value Determination, with Value Determination highlighted]

Value Determination
[Figure: a three-node controller (1:a1, 3:a1, 2:a2) with observation edges O1, O2, alongside the corresponding α-vectors; the value of the policy starting from node 1 is its immediate value plus the expected future reward, branching on whether O1 or O2 is observed]
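
Written out (in a standard form for finite-state-controller policy iteration; notation assumed), value determination solves the linear system

\[
\alpha_i(s) \;=\; R(s, a_i) \;+\; \gamma \sum_{s'} P(s' \mid s, a_i) \sum_{o} P(o \mid s', a_i)\, \alpha_{l(i,o)}(s'),
\]

for every controller node i and state s, where \(a_i\) is the action of node i and \(l(i, o)\) is the node reached after observing o.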

Approximate Value Determination
- Exact value determination requires an exponential number of equations;
- A factored approximation is efficient:
  - Find the best approximation in max-norm;
  - The algorithm exploits the factored model;
- Analogous to the factored MDP case (see the Max-norm Projections IJCAI talk on Thursday).
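
Schematically (notation assumed, by analogy with the factored-MDP max-norm projection), instead of solving the linear system exactly one searches for weights

\[
w^{*} \;=\; \arg\min_{w}\; \max_{i, s} \Bigl| \sum_k w_k^{(i)} h_k(s) \;-\; \Bigl[ R(s, a_i) + \gamma \sum_{s'} P(s' \mid s, a_i) \sum_o P(o \mid s', a_i) \sum_k w_k^{(l(i,o))} h_k(s') \Bigr] \Bigr|,
\]

an optimization that can be posed as a linear program whose exponentially many constraints are represented compactly using cost networks.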

Exploiting Structure: Summary
[Figure: the policy-iteration loop (Value Determination → DP Step → Policy Improvement) with the controller and α-vector plots, summarizing where the factored operations fit]

Conclusions
- Factored POMDPs can represent complex systems;
- Factorization in the model doesn't always imply factorization in the solution:
  - Linear approximation reduces the dimensionality of the problem;
  - We can efficiently find the closest linear approximation;
- Standard POMDP algorithms can be modified to use factored linear value functions efficiently;
- Complexity is a function of the DBN and basis structure.

Our Approach
[Figure: projection from the exact value-function space, with one dimension per state (axes b(s1), b(s2)), onto the space spanned by basis functions h1(s), h2(s), with one dimension per feature (far fewer than the number of states)]