
Max-norm Projections for Factored MDPs. Carlos Guestrin, Daphne Koller (Stanford University); Ronald Parr (Duke University)

Motivation. MDPs plan over atomic system states; a policy specifies an action at every state; polynomial-time algorithms exist for finding the optimal policy. But the number of states is exponential in the number of state variables.

Motivation: BNs meet MDPs. Real-world MDPs have hundreds of variables and googols of states. Can we exploit problem-specific structure, both for representation and for planning? Goal: merge BNs and MDPs for efficient computation.

Factored MDPs [Boutilier et al.]. The total reward is a sum of sub-rewards: R = R_1 + R_2. [Figure: two-slice dynamic Bayesian network over variables X, Y, Z from time t to t+1, with local reward nodes R_1 and R_2.] Actions only change small parts of the model. Value function: the value of the policy starting at state s.
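The value-function definition on this slide appears only as an image in the transcript; one standard way to write the discounted value of a policy π (a reconstruction, not copied from the slide) is:

```latex
V_\pi(s) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, R\bigl(s_t, \pi(s_t)\bigr) \,\middle|\, s_0 = s \right], \qquad 0 \le \gamma < 1 .
```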

Exploiting Structure. Model structure may imply a structured value function. Structured value-function approach [Boutilier et al. '95]: collapse the value function using a tree; works well only when many states have the same value. [Figure: decision tree over variables X and Z.]

Decomposable Value Functions. Approximate the value function by a linear combination of restricted-domain basis functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00]. Each h_i depends only on the status of some small part(s) of a complex system: the status of one machine, the inventory of one store. With k basis functions and 2^n states, the basis functions form the columns of a matrix A whose entry in row s and column i is h_i(s).
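In symbols, the slide's matrix picture says that the approximate value function is a linear combination of the basis functions (notation mine, matching the slide's A):

```latex
\tilde{V}(s) \;=\; \sum_{i=1}^{k} w_i\, h_i(s),
\qquad\text{i.e.}\qquad
\tilde{V} \;=\; A\,w ,
\quad A \in \mathbb{R}^{2^n \times k},\; A_{s,i} = h_i(s),
```

with the weight vector w chosen by the projections discussed on the following slides.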

Our Approach. Embed structure into the value-function space a priori: project into a structured vector space of factored value functions (a linear combination of structured features), and efficiently find the closest approximation to the "true" value.

Policy Iteration. Guess V; π = greedy(V); V = value of acting on π. Value determination (value = reward + discounted expected future value) involves a 2^n x 2^n transition matrix and 2^n x 1 value and reward vectors.
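The value-determination equation is shown only as an image; the standard matrix form it annotates (a reconstruction from the slide's labels) is

```latex
V_\pi \;=\; R_\pi \;+\; \gamma\, P_\pi V_\pi
\qquad\Longrightarrow\qquad
V_\pi \;=\; (I - \gamma P_\pi)^{-1} R_\pi ,
```

where P_π is the 2^n x 2^n transition matrix under π and R_π is the 2^n x 1 reward vector.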

Approximate Policy Iteration. Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t (approximate value determination).

Approximate Value Determination. We need a projection of the value function into the space spanned by the basis functions (an L_d projection). Previous work uses L_2 and weighted-L_2 projections [Koller & Parr '99, '00].
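The projection itself is shown only as an image; the generic L_d projection the slide refers to can be written (my notation) as

```latex
w^{*} \;=\; \arg\min_{w}\; \bigl\lVert A\,w \,-\, V_\pi \bigr\rVert_{d} ,
```

with d = 2 (possibly weighted) in the prior work, and d = ∞, the max-norm, in this paper.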

Analysis of Approximate PI. Theorem: we should be doing projections in the max-norm!
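The theorem's statement is not in the transcript; a standard bound of this flavor (in the style of Bertsekas and Tsitsiklis; the constants in the paper's own theorem may differ) says that if every projection step has max-norm error at most ε, then

```latex
\limsup_{t \to \infty}\; \bigl\lVert V^{*} - V_{\pi_t} \bigr\rVert_{\infty}
\;\le\; \frac{2\gamma\,\varepsilon}{(1-\gamma)^{2}} ,
```

which is why controlling the max-norm projection error, rather than the L_2 error, is the natural objective.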

Approximate PI, Revisited. Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t (approximate value determination). Two ingredients: an analysis motivating projections in the max-norm, and an efficient algorithm for max-norm projection.

Efficient Max-norm Projection Computing max-norm for fixed weights; Cost networks; Efficient max-norm projection.

Max over Large State Spaces. For fixed weights w, compute the max-norm error max_s | Σ_i w_i h_i(s) − V_π(s) |. Naively this is a maximization over exponentially many states; however, if the basis and target are functions of only a few variables each, we can do it efficiently: cost networks can maximize over large state spaces when the function is factored.

Efficient Max-norm Projection Computing max-norm for fixed weights; Cost networks; Efficient max-norm projection.

Cost Networks. We can use variable elimination to maximize over the state space [Bertelè & Brioschi '72]. [Figure: cost network over variables A, B, C, D.] As in Bayesian networks, the cost of maximization is exponential in the size of the largest factor; here we need only 16 operations instead of 64.
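A minimal runnable sketch of this idea, on a made-up additive function over four binary variables (the factor tables below are illustrative, not from the paper): pushing each max inside eliminates one variable at a time, touching only small tables.

```python
# Cost-network maximization by variable elimination, on an illustrative
# additive function f(a, b, c, d) = f1(a, b) + f2(b, c) + f3(c, d)
# over binary variables. The table values are made up for the example.
import itertools

f1 = {(a, b): 2 * a - b for a in (0, 1) for b in (0, 1)}
f2 = {(b, c): b * c + 1 for b in (0, 1) for c in (0, 1)}
f3 = {(c, d): 3 * d - 2 * c for c in (0, 1) for d in (0, 1)}

# Brute force: enumerate all 2^4 assignments.
brute = max(f1[a, b] + f2[b, c] + f3[c, d]
            for a, b, c, d in itertools.product((0, 1), repeat=4))

# Variable elimination: push each max inward, one variable at a time.
g_c = {c: max(f3[c, d] for d in (0, 1)) for c in (0, 1)}            # eliminate d
h_b = {b: max(f2[b, c] + g_c[c] for c in (0, 1)) for b in (0, 1)}   # eliminate c
m_a = {a: max(f1[a, b] + h_b[b] for b in (0, 1)) for a in (0, 1)}   # eliminate b
elim = max(m_a.values())                                            # eliminate a

assert brute == elim
print("max value:", elim)
```

Each elimination step only builds a table over the variables that share a factor with the eliminated one, which is where the exponential-to-polynomial saving comes from.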

Efficient Max-norm Projection Computing max-norm for fixed weights; Cost networks; Efficient max-norm projection.

Max-norm Projection. Algorithm for finding the weights w that minimize ||A w − b||_∞: solve by linear programming [Cheney '82].
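One standard way to write this Chebyshev-approximation LP (a reconstruction, since the slide shows it only as an image) introduces a scalar slack φ:

```latex
\min_{w,\;\phi}\;\; \phi
\qquad\text{subject to}\qquad
-\phi \;\le\; \sum_{i=1}^{k} w_i\, h_i(s) \;-\; b(s) \;\le\; \phi
\quad \text{for every state } s .
```

Written explicitly, this has two linear constraints per state, which is why the next slide is about representing these constraints compactly.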

Representing the Constraints. The explicit representation has constraints for every state, which is exponential (|S| = 2^n). If the basis and target are factored, we can use cost networks to represent the constraints compactly.

Approximate Policy Iteration. Guess w_0; π_t = greedy(A w_t) (policy improvement); A w_{t+1} ≈ value of acting on π_t. How do we represent the policy? How do we update it efficiently?

What about the Policy? Contextual action model: each action's DBN differs from the default model only in a few places. [Figure: DBNs over X, Y, Z for the default model, Action 1, and Action 2.] Factored value functions and a factored model yield a compact policy description. Theorem [Koller & Parr '00]: the greedy policy forms a decision list: if <context 1> then action 1, else if <context 2> then action 2, else if <context 3> then action 1, and so on.
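A minimal sketch of what such a decision-list policy looks like operationally (the contexts, variable names, and actions below are illustrative placeholders, not the paper's):

```python
# A decision-list policy: rules are (context, action) pairs, where a context
# is a partial assignment to state variables; the first matching rule fires.
def decision_list_action(state, rules, default_action):
    for context, action in rules:
        if all(state.get(var) == val for var, val in context.items()):
            return action
    return default_action

rules = [
    ({"X": 0, "Y": 0}, "action_1"),   # if X=0 and Y=0 then action 1
    ({"Z": 0}, "action_2"),           # else if Z=0 then action 2
]
print(decision_list_action({"X": 1, "Y": 0, "Z": 0}, rules, "action_1"))
```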

Factored Policy Iteration: Summary. Guess V; π = greedy(V); V = value of acting on π. Structure induces a decision-list policy, and the key operations are isomorphic to Bayesian-network inference. Time per iteration is reduced from O((2^n)^3) to O(poly(k, n, C)), where C is the size of the largest factor in the cost network (a function of the structure), k is the number of basis functions (k << 2^n), and poly denotes the complexity of the LP solver, in practice close to linear.
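For concreteness, here is a runnable toy sketch of the outer loop on an MDP small enough to enumerate explicitly; the explicit matrices and brute-force value determination below stand in for the factored, cost-network-based operations described above, and all numbers, basis functions, and the use of scipy.optimize.linprog are illustrative assumptions.

```python
# Toy approximate policy iteration with a max-norm (L-infinity) projection,
# on a tiny fully enumerated MDP. Transition, reward, and basis values are
# made up; the real algorithm never enumerates states like this.
import numpy as np
from scipy.optimize import linprog

n_states, gamma = 4, 0.9
P = [np.full((4, 4), 0.25), np.eye(4)]              # P[a][s, s'] for 2 actions
R = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]])  # R[s, a]
A = np.array([[1, 0], [1, 1], [0, 1], [1, 1]], float)           # basis h1, h2

w = np.zeros(2)
for _ in range(20):
    # Policy improvement: greedy with respect to the approximate value A @ w.
    q = np.stack([R[:, a] + gamma * P[a] @ (A @ w) for a in (0, 1)], axis=1)
    pi = q.argmax(axis=1)
    # Value determination for pi (brute force here; factored in the paper).
    P_pi = np.stack([P[pi[s]][s] for s in range(n_states)])
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi,
                           R[np.arange(n_states), pi])
    # Max-norm projection: minimize phi subject to |A @ w - v_pi| <= phi.
    c = np.r_[np.zeros(2), 1.0]                     # variables (w1, w2, phi)
    A_ub = np.block([[A, -np.ones((4, 1))], [-A, -np.ones((4, 1))]])
    b_ub = np.r_[v_pi, -v_pi]
    w = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3).x[:2]

print("weights:", w, "max-norm error:", np.max(np.abs(A @ w - v_pi)))
```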

Network Management Problem. Computers are connected in a network; each computer can fail with some probability; if a computer fails, it increases the probability that its neighbors will fail; at every time step, the sysadmin must decide which computer to fix. [Figure: example topologies: bidirectional ring, ring and star, star, three legs, ring of rings.]
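A toy simulator of the ring variant of this domain (the failure probabilities, ring size, and the naive fix-the-first-failed-machine policy are all illustrative assumptions, not the paper's parameters):

```python
# Toy sysadmin-on-a-ring simulator: working[i] is 1 if machine i is up.
import random

def step(working, fix, p_fail=0.1, p_neighbor=0.3):
    """One transition: the repaired machine comes back up; every other
    machine may fail, more likely if its ring predecessor is already down."""
    n = len(working)
    nxt = []
    for i in range(n):
        if i == fix:
            nxt.append(1)
        else:
            p = p_fail + (p_neighbor if working[(i - 1) % n] == 0 else 0.0)
            nxt.append(0 if random.random() < p else working[i])
    return nxt

state = [1] * 8
for t in range(5):
    fix = state.index(0) if 0 in state else 0   # naive policy: fix first failed machine
    state = step(state, fix)
    print(t, state)
```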

Comparing Projections: L_2 vs. L_∞. [Figure: approximation error for L_2 and L_∞ projections, each with single-variable and pair bases.] The max-norm projection is also much more efficient: a single cost network rather than many BN inferences, and use of a very efficient LP package (CPLEX).

Results on Larger Problems: Running Time. Runs in time O(n^3), not O((2^n)^3).

Results on Larger Problems: Error Bounds. The error remains bounded.

Conclusions. Max-norm projection directly minimizes the error bounds; the closed-form projection operation provides an exponential reduction in complexity; exploit structure to reduce computation costs and solve very large MDPs efficiently.

Future Work. POMDPs (IJCAI '01 workshop paper); additional structure: factored actions, relational representations, CSI; multi-agent systems; linear-program solution for MDPs.