1 Black Box and Generalized Algorithms for Planning in Uncertain Domains Thesis Proposal, Dept. of Computer Science, Carnegie Mellon University H. Brendan McMahan

2 Outline: The Problem and Approach (motivating examples; goals and techniques; MDPs and uncertainty); Example Algorithms; Proposed Future Work.

3 Mars Rover Mission Planning. Human control is not realistic; collect data while conserving power and bandwidth. References: First Experiments in the Robotic Investigation of Life in the Atacama Desert of Chile, D. Wettergreen et al.; Recent Progress in Local and Global Traversability for Planetary Rovers, S. Singh et al.

4 Autonomous Helicopter Control. 6+ continuous state dimensions; complex, non-linear dynamics; high failure cost. References: Inverted Autonomous Helicopter Flight via Reinforcement Learning, A. Ng et al.; Autonomous Helicopter Control Using Reinforcement Learning Policy Search Methods, J. Bagnell and J. Schneider.

5 Online Shortest Path Problem Getting from my (old) house to CMU each day:

6 Other Domains

7 Goal: planning multiple decisions over time to achieve goals or minimize cost in uncertain domains, i.e., domains that are NOT deterministic, fully observable, and perfectly modeled.

8 The Black Box Approach. [Diagram: Hard Planning Problem → New Algorithm → Easier Problems → Fast Existing Algorithm → Solutions → Solution.]

9 The Generalization Approach. [Diagram: Hard Planning Problem → Generalization of (Fast) Existing Algorithm → Solution.]

10 Two Examples. Black Box Approach: an MDP algorithm (e.g., value iteration) is used as a black box by oracle algorithms for MDPs with unknown costs. Generalization Approach: Dijkstra's algorithm (shortest paths) generalizes to algorithms for stochastic shortest paths.

11 Benefits of Using Black Boxes: use fast, optimized, mature implementations; pick the implementation best suited to a specific domain; be able to use algorithms not even invented yet; theoretical advantages.

12 Benefits of Generalization New intuitions Some performance guarantees for free

13 Markov Decision Processes. An MDP (S, A, P, c) consists of: S, a finite set of states; A, a finite set of actions; dynamics P(y | x, a); and costs c(x, a). [Figure: "A Research MDP" with goal state New Idea, states No New Ideas and Hungry, actions A = {eat, wait, work}, and costs $1.00, $0.10, $4.75.]
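Value iteration, mentioned on slide 10 as a black-box MDP solver, is the canonical algorithm for this definition. Below is a minimal sketch for a cost-minimizing MDP; the dictionary layout (P[x][a] as a list of (next_state, probability) pairs, c[x][a] as a scalar cost) and the discount factor gamma are illustrative assumptions rather than the thesis's implementation.

def value_iteration(S, A, P, c, gamma=0.95, tol=1e-6):
    # P[x][a]: list of (y, prob) pairs; c[x][a]: cost of taking action a in state x.
    v = {x: 0.0 for x in S}
    while True:
        delta = 0.0
        for x in S:
            new = min(c[x][a] + gamma * sum(p * v[y] for y, p in P[x][a]) for a in A)
            delta = max(delta, abs(new - v[x]))
            v[x] = new
        if delta < tol:
            break
    # Extract a greedy policy with respect to the converged values.
    pi = {x: min(A, key=lambda a, x=x: c[x][a] + gamma * sum(p * v[y] for y, p in P[x][a]))
          for x in S}
    return v, pi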

14 Simple Example Domain. Robot path-planning problem: actions = {8 neighbors}; cost = Euclidean distance; probability p of a random action.
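To make the example concrete, here is one way to build the slide's grid dynamics in the same (S, A, P, c) layout as the sketch above; treating the noise as a uniformly random replacement action and having blocked moves stay in place are modeling assumptions, since the slide does not pin these down.

import math

MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def grid_mdp(width, height, obstacles, p=0.1):
    # Returns S, A, P, c in the P[x][a] -> [(y, prob)] and c[x][a] layout used above.
    S = [(i, j) for i in range(width) for j in range(height) if (i, j) not in obstacles]
    valid = set(S)
    def step(x, a):
        y = (x[0] + a[0], x[1] + a[1])
        return y if y in valid else x  # bumping into a wall or obstacle: stay put
    P, c = {}, {}
    for x in S:
        P[x], c[x] = {}, {}
        for a in MOVES:
            c[x][a] = math.hypot(a[0], a[1])   # Euclidean length of the move
            outcomes = {step(x, a): 1.0 - p}   # intended move succeeds with prob. 1 - p
            for b in MOVES:                    # otherwise a uniformly random move occurs
                y = step(x, b)
                outcomes[y] = outcomes.get(y, 0.0) + p / len(MOVES)
            P[x][a] = list(outcomes.items())
    return S, MOVES, P, c

Feeding grid_mdp(...) into the value_iteration sketch above exercises the same interface.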

15 Types of Uncertainty Outcome Uncertainty (MDPs) Partial Observability (POMDPs) Model Uncertainty (families of MDPs, RL) Modeling Other Agents (Agent Uncertainty?)

16 The Curse of Dimensionality. |S| is exponential in the number of state variables: <x, y, vx, vy, battery_power, door_open, another_door_open, goal_x, goal_y, bob_x, bob_y, …>

17 Outline: The Problem and Approach; Example Algorithms (MDPs with unknown costs; generalizing Dijkstra's algorithm); Proposed Future Work.

18 Unknown Costs, Offline Version. A game with two players: the Planner chooses a policy for an MDP with known dynamics; the Sentry chooses a cost function from a set K = {c_1, …, c_k} of possible cost functions.

19 Avoiding Detection by Sensors. The Planner (robot) picks policies (paths); the Sentry picks cost functions (sensor placements).

20 Matrix Game Formulation. Matrix game M: the Planner (rows) selects a policy π; the Sentry (columns) selects a cost function c; M(π, c) = [total cost of π under costs c]. Goal: find a minimax solution to M. An optimal mixed strategy for the Planner is a distribution over deterministic policies (paths).

21 Interpretations. Model uncertainty → unknown cost function; partial observability → a fixed but unobservable cost function; agent uncertainty → an adversary picks the cost function.

22 How to Solve It. Problem: the matrix M is exponentially big. Solution: it can be represented compactly as a Linear Program (LP). Problem: the LP still takes much too long to solve. Solution: the Single Oracle Algorithm, taking advantage of fast black-box MDP algorithms.

23 Single Oracle Algorithm. F is a small set of policies; M' is the matrix game where the Planner must play from F. We can solve M' efficiently; it is only |F| × |K| in size. [Figure: example with |F| = 2.]

24 Single Oracle Algorithm If only … we knew it was sufficient for the Planner to randomize among a small set of strategies and we could find that set of strategies.

25 Single Oracle Algorithm: 1. Use an MDP algorithm to find an optimal policy π against the fixed cost function c. 2. Add π to F. 3. Solve M' and let c be the expected cost function under the Sentry's optimal mixed strategy. Repeat until convergence (see the iteration slides that follow).
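A minimal sketch of this loop under stated assumptions: mdp_best_response and policy_cost are hypothetical callbacks standing in for the black-box MDP solver and for evaluating a policy's total cost under a fixed cost function, and the small game M' is solved with a standard zero-sum LP via scipy.

import numpy as np
from scipy.optimize import linprog

def sentry_mixture(M):
    # Solve the |F| x |K| zero-sum game M' (rows minimize, columns maximize)
    # for the Sentry's optimal mixed strategy q and the game value.
    n_rows, n_cols = M.shape
    obj = np.zeros(n_cols + 1)
    obj[-1] = -1.0                                   # maximize v == minimize -v
    A_ub = np.hstack([-M, np.ones((n_rows, 1))])     # v - M[i, :] @ q <= 0 for every row i
    b_ub = np.zeros(n_rows)
    A_eq = np.hstack([np.ones((1, n_cols)), np.zeros((1, 1))])   # q sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_cols + [(None, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_cols], res.x[-1]

def single_oracle(cost_functions, mdp_best_response, policy_cost, c0, iterations=20):
    # cost_functions: the set K of cost vectors; c0: an initial fixed cost function.
    # A real implementation would stop once the best response no longer improves the value.
    F, c = [], c0
    for _ in range(iterations):
        pi = mdp_best_response(c)                    # step 1: best response to fixed c
        F.append(pi)                                 # step 2: grow the policy set F
        M = np.array([[policy_cost(p, ck) for ck in cost_functions] for p in F])
        q, value = sentry_mixture(M)                 # step 3: solve the small game M'
        c = sum(qk * np.asarray(ck) for qk, ck in zip(q, cost_functions))
    return F, q, value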

26 Example Run: Initialization Fix policy (blue path) Solve M’ to find red sensor field (cost vector), fix this as c

27 Iteration 1: Best Response. Solve for the best-response policy π (new blue line); add π to F. Red: fixed cost vector (expected field of view). Blue: shortest path given costs.

28 Iteration 1: Solve the Game. Solve M'. Minimax equilibrium: red is a mixture of costs; blue is a mixture of paths from F.

29 Iteration 2: Best Response. Solve for the best-response policy π (new blue line); add π to F. Red: fixed cost vector (expected field of view). Blue: shortest path given costs.

30 Iteration 2: Solve the Game. Solve M'. Minimax equilibrium: red is a mixture of costs; blue is a mixture of paths from F.

31 Iteration 6: Convergence. [Figure: the solution to M' and the best response.]

32 Unknown Costs, Online Version Go from my house to CMU each day Model as a graph

33 A Shortest Path Problem? If we knew all the edge costs, it would be easy! But traffic, downed trees → uncertainty.

34 Limited Observations. Each day, we observe only the total length of the path we actually took to get to CMU. BGA Algorithm: keep estimates of the edge lengths; most days, follow the FPL algorithm [Kalai and Vempala, 2003]: pick the shortest path with respect to the estimated lengths plus a little noise; occasionally, play a "random" path to make sure we have good estimates of the edge lengths.
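A sketch of one day's decision under this scheme; shortest_path and random_path are hypothetical oracles over the road graph, and the exponential perturbation plus a fixed exploration probability are stand-in choices, not the tuned ones analyzed for BGA/FPL.

import random

def bga_choose_path(est_lengths, shortest_path, random_path,
                    explore_prob=0.05, noise_scale=1.0):
    # est_lengths: dict mapping edge -> current length estimate.
    if random.random() < explore_prob:
        return random_path()              # occasional exploration day to refresh estimates
    perturbed = {e: length + noise_scale * random.expovariate(1.0)
                 for e, length in est_lengths.items()}
    return shortest_path(perturbed)       # FPL: shortest path w.r.t. estimates plus noise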

35 Dijkstra's Algorithm. [Figure: graph with goal G, states x_1 … x_4, and value labels v'.] Keeps states on a priority queue; pops states in order of increasing distance and updates their predecessors. Prioritized Sweeping [A. Moore and C. Atkeson, 1993; D. Andre et al., 1998] has a similar structure, but doesn't reduce to Dijkstra's algorithm.
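For reference, a compact sketch of the Dijkstra variant the slide describes, run backwards from a goal set so that predecessors are relaxed; preds[x] as a list of (predecessor, edge_cost) pairs is an assumed data layout.

import heapq

def dijkstra_to_goal(states, preds, goals):
    # Computes v[x] = shortest distance from x to the nearest goal.
    v = {x: float('inf') for x in states}
    pq = []
    for g in goals:
        v[g] = 0.0
        heapq.heappush(pq, (0.0, g))
    closed = set()
    while pq:
        d, x = heapq.heappop(pq)          # pop states in order of increasing distance
        if x in closed:
            continue
        closed.add(x)                     # once popped, v[x] is optimal
        for w, cost in preds[x]:          # relax the predecessors of x
            if d + cost < v[w]:
                v[w] = d + cost
                heapq.heappush(pq, (v[w], w))
    return v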

36 Prioritized Sweeping. When we pop a state x: back it up, then update the priorities of its predecessors. [Figure: states w, w_1, w_2, x_1, y_1, y_2, y_3; values of red states are updated based on the values of purple states.]
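A prioritized-sweeping-style sketch in the same (S, A, P, c) layout as earlier; the priority used here (the popped state's value change scaled by the predecessor's transition probability) follows the classic Moore and Atkeson formulation and is not necessarily the thesis's priority function.

import heapq

def prioritized_sweeping(S, A, P, c, preds, n_backups=10000, theta=1e-6):
    # preds[x]: iterable of (w, a) pairs with P(x | w, a) > 0.
    v = {x: 0.0 for x in S}
    def backup(x):
        return min(c[x][a] + sum(p * v[y] for y, p in P[x][a]) for a in A)
    pq = [(-abs(backup(x) - v[x]), x) for x in S]   # seed with initial Bellman errors
    heapq.heapify(pq)
    for _ in range(n_backups):
        if not pq:
            break
        neg_prio, x = heapq.heappop(pq)
        if -neg_prio < theta:
            break
        new = backup(x)                             # pop x and back it up
        delta, v[x] = abs(new - v[x]), new
        for w, a in preds[x]:                       # update priorities of predecessors
            prio = delta * max(p for y, p in P[w][a] if y == x)
            if prio > theta:
                heapq.heappush(pq, (-prio, w))      # lazy duplicates are harmless here
    return v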

37 Improved Prioritized Sweeping. When we pop a state x, its value has already been updated; update the values and priorities of its predecessors. [Figure: states w, w_1, w_2, x_1, y_1, y_2, y_3; values of red states are updated based on the values of purple states.]

38 Priority Function Intuitions. Update the state: with the lowest value (closest to the goal); whose value is most accurately known (for Dijkstra's algorithm, the popped state's optimal value is known exactly; this is the state whose value will change the least in the future); or whose value has changed the most since it was last updated.

39 Comparison. [Figures: IPS on a deterministic domain vs. PS on the same problem; dark red indicates states recently popped from the queue, lighter means less recently.]

40 Outline: The Problem and Approach; Example Algorithms; Proposed Future Work (Bounded RTDP and extensions; large action spaces; details of proposed contributions).

41 Bounded RTDP. RTDP: a fixed start state means many states are irrelevant; sample and back up along start → goal trajectories. BRTDP adds performance guarantees and much faster convergence (often better than HDP, LRTDP, and LAO*).
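A sketch of a single BRTDP trial in the same model layout, maintaining lower and upper value bounds and sampling successors in proportion to their bound gap; the termination constant tau and other details follow the BRTDP paper only loosely, and goal states are assumed to have both bounds fixed at zero by the caller.

import random

def brtdp_trial(x0, A, P, c, goals, v_lo, v_up, tau=10.0):
    # P[x][a]: list of (y, prob); c[x][a]: cost; v_lo / v_up: bound dicts, mutated in place.
    def q(v, x, a):
        return c[x][a] + sum(p * v[y] for y, p in P[x][a])
    traj, x = [], x0
    while x not in goals:
        traj.append(x)
        v_up[x] = min(q(v_up, x, a) for a in A)     # back up the upper bound
        a = min(A, key=lambda a: q(v_lo, x, a))     # act greedily on the lower bound
        v_lo[x] = q(v_lo, x, a)                     # back up the lower bound
        gaps = [(y, p * (v_up[y] - v_lo[y])) for y, p in P[x][a]]
        B = sum(g for _, g in gaps)
        if B <= (v_up[x0] - v_lo[x0]) / tau:        # bounds tight enough: end the trial
            break
        r = random.uniform(0, B)                    # sample the successor proportional to its gap
        for y, g in gaps:
            r -= g
            if r <= 0:
                x = y
                break
        else:
            x = gaps[-1][0]                         # guard against floating-point round-off
    for x in reversed(traj):                        # back up both bounds on the way back
        v_up[x] = min(q(v_up, x, a) for a in A)
        v_lo[x] = min(q(v_lo, x, a) for a in A)
    return traj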

42 Dijkstra and BRTDP. Dijkstra-style scheduling of backups for BRTDP: sample multiple trajectories; use a priority queue to schedule backups of states on all trajectories.

43 Dijkstra, BRTDP, and POMDPs. HSVI [T. Smith and R. Simmons] is like BRTDP, but for POMDPs; the same trick should apply, with even more benefit because backups are more expensive. [Figure: piecewise-linear belief-space value function over states x_1, x_2.]

44 Large Action Spaces. (Prioritized) Policy Iteration already has an advantage. A better tradeoff between policy evaluation and policy improvement? Structured sets of actions? Application of experts/bandits algorithms?

45 Details: Proposed Contributions. Discussion of algorithms already developed: Oracle Algorithms, BGA, IPS, BRTDP, and several others. At least two significant new algorithmic contributions, for example: the BRTDP + Dijkstra algorithm and its extension to POMDPs; an improved version of PPI to handle large action spaces; or something else, such as generalizations of conjugate-gradient linear solvers to MDPs, extensions of the technique for finding upper bounds introduced in the BRTDP paper, or algorithms for efficiently solving restricted classes of POMDPs.

46 Details: Proposed Contributions. At least one significant new theoretical contribution, for example: an approximation algorithm for the Canadian Traveler's Problem or stochastic TSP; results connecting online algorithms / MDP techniques to stochastic optimization; or new contributions on bandit-style online algorithms, perhaps with applications to MDPs.

47 Summary. Motivating problems; black boxes: MDPs with unknown costs; generalization: reducing to Dijkstra; future work: BRTDP + Dijkstra, large action spaces.

48 Questions?

49 Relationships of Algorithms Discussed

50 Iteration 3: Best Response. Solve for the best-response policy π (new blue line); add π to F. Red: fixed cost vector (expected field of view). Blue: shortest path given costs.

51 Representations, Algorithms. Representations: simulation dynamics model; factored representations (DBNs, etc.); STRIPS-style languages. Algorithms: policy search, …; generalizations of value iteration, …