
Solving Large Markov Decision Processes
Yilan Gu
Dept. of Computer Science, University of Toronto
April 12, 2004

2 Outline
 - Introduction: what's the problem?
 - Temporal abstraction
 - Logical representation of MDPs
 - Potential future directions

3 Markov Decision Processes (MDPs)
Decision-theoretic planning and learning problems are often modeled as MDPs.
An MDP is a model M = ⟨S, A, T, R⟩ consisting of:
 - a set of environment states S,
 - a set of actions A,
 - a transition function T: S × A × S → [0,1], with T(s, a, s') = Pr(s' | s, a),
 - a reward function R: S × A → ℝ.
A policy is a function π: S → A.
Expected cumulative reward -- value function V^π: S → ℝ.
The Bellman equation: V^π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V^π(s')

4 MDP Example
 - S = {(1,1), (1,2), …, (8,8)}
 - A = {up, down, left, right}
 - e.g., T((2,2), up, (1,2)) = 0.8, T((2,2), up, (2,1)) = 0.1, T((2,2), up, (2,3)) = 0.1, T((2,2), up, s') = 0 for s' ∉ {(1,2), (2,1), (2,3)}
 - R((1,8)) = 1, R(s) = -1 for s ≠ (1,8)
[Figure: the 8×8 grid world, with the action "up" taken in cell (2,2) succeeding with probability 0.8]
Notice: explicit representation of the model.
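To make the "explicit representation" point concrete, here is a minimal sketch (not from the slides) of how this grid world's model might be written out state by state; the 0.8/0.1/0.1 slip probabilities are the ones given above, while treating off-grid moves as staying in place is my own assumption, since the slide does not say.

```python
# Explicit tabulation of the 8x8 grid-world MDP from the slide.
# Assumption: the intended move succeeds with probability 0.8 and slips to each
# perpendicular neighbour with probability 0.1; off-grid moves stay in place.

STATES = [(r, c) for r in range(1, 9) for c in range(1, 9)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def clip(cell):
    r, c = cell
    return (min(max(r, 1), 8), min(max(c, 1), 8))

def T(s, a, s_next):
    """Explicit transition probability T(s, a, s')."""
    dr, dc = MOVES[a]
    intended = clip((s[0] + dr, s[1] + dc))
    slips = [clip((s[0] + dc, s[1] + dr)), clip((s[0] - dc, s[1] - dr))]
    p = 0.8 if s_next == intended else 0.0
    p += 0.1 * slips.count(s_next)
    return p

def R(s):
    """Reward: +1 at the goal cell (1,8), -1 everywhere else, as on the slide."""
    return 1 if s == (1, 8) else -1

# e.g. T((2, 2), "up", (1, 2)) == 0.8, matching the slide.
```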

5 Conventional Solution Algorithms for MDPs
Goal: find an optimal policy π* such that V*(s) = V^{π*}(s) ≥ V^π(s) for all s ∈ S and all policies π.
Conventional algorithms:
 - dynamic programming: value iteration and policy iteration,
 - decision tree search algorithms, etc.
Example: value iteration
 - Begin with an arbitrary V_0.
 - In each iteration n > 0, for every s ∈ S:
   Q_n(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') V_{n-1}(s') for every a;
   V_n(s) := max_a Q_n(s, a).
 - As n → ∞, V_n(s) → V*(s).
Problem: it does not scale up!
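For concreteness, a minimal value-iteration sketch (not part of the original slides), reusing STATES, MOVES, T and R from the grid-world sketch above; the discount factor and stopping tolerance are illustrative assumptions.

```python
# Value iteration over the explicit grid-world model above.
# GAMMA and TOL are illustrative choices, not values from the slides.

GAMMA, TOL = 0.95, 1e-6

def value_iteration():
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {}
        for s in STATES:
            q_values = [R(s) + GAMMA * sum(T(s, a, s2) * V[s2] for s2 in STATES)
                        for a in MOVES]
            V_new[s] = max(q_values)          # V_n(s) := max_a Q_n(s, a)
        if max(abs(V_new[s] - V[s]) for s in STATES) < TOL:
            return V_new
        V = V_new
```

Every sweep touches every (s, a, s') triple of the explicit model, which is exactly the scaling problem the rest of the talk addresses.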

6 Solving Large MDPs (Part I)
Temporal abstraction approaches (basic idea):
 - solving MDPs hierarchically,
 - using complex actions or subtasks to compress the scale of the state space.
Representing and solving MDPs in a logical way (basic idea):
 - logically representing environment features,
 - aggregating 'similar' states,
 - representing the effects of actions compactly using logical structures, and eliminating unaffected features during reasoning.

7 Options (Macro-Actions)
Example:
 - Partition {S_1, S_2, S_3, S_4}
 - A macro-action -- a local policy π_i : S_i → A on region S_i
 - Exit periphery, e.g., EPer(S_1) = {(3,5), (5,3)}
 - Discounted transition model T_i : S_i × {π_i} × EPer(S_i) → [0,1]
 - Discounted reward model R_i : S_i × {π_i} → ℝ
[Figure: the grid world partitioned into regions S_1, …, S_4]
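One way to read the discounted transition and reward models above is as the fixed point of local Bellman-style equations over the region. The sketch below is my own reading (in the spirit of the macro-action literature), not code from the talk; it assumes the exit periphery lies outside the region and uses a fixed number of sweeps instead of a convergence test.

```python
# Discounted models of a macro-action pi_i on a region, by fixed-point iteration:
#   T_i(s, e) = gamma * ( T(s, pi_i(s), e) + sum_{x in region} T(s, pi_i(s), x) * T_i(x, e) )
#   R_i(s)    = R(s) + gamma * sum_{x in region} T(s, pi_i(s), x) * R_i(x)
# Assumption: `exits` (the exit periphery) is disjoint from `region`.

def macro_models(region, exits, pi_i, T, R, gamma=0.95, sweeps=500):
    Ti = {(s, e): 0.0 for s in region for e in exits}
    Ri = {s: 0.0 for s in region}
    for _ in range(sweeps):
        for s in region:
            a = pi_i[s]
            for e in exits:
                Ti[(s, e)] = gamma * (T(s, a, e)
                                      + sum(T(s, a, x) * Ti[(x, e)] for x in region))
            Ri[s] = R(s) + gamma * sum(T(s, a, x) * Ri[x] for x in region)
    return Ti, Ri
```

The resulting T_i and R_i are the quantities the abstract MDP on the next slide uses as its transition and reward models.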

8 Abstract MDP M’= S ’=  Eper( S i ), e.g., {(4,3),(3,4),(5,3),(3,5),(6,4),(4,6), (5,6),(6,5)}. A ’=  A i, where A i is a set of macro-actions on region S i. Transition model T’: S ’  A ’  S ’  [0,1] T’(s,  i, s’) = T i (s,  i,s’) if s  S i, s’  Eper( S i ); T’(s,  i, s’) = 0 otherwise. Reward model R’: S ’  A ’  R R’(s,  i ) = R i (s,  i ) for any s’  Eper( S i ).

9 Other Temporal Abstraction Approaches
 - Options [Sutton 1995; Singh, Sutton and Precup 1999]
 - Macro-actions [Hauskrecht et al. 1998; Parr 1998] – fixed policies
 - Hierarchical abstract machines (HAMs) [Parr and Russell 1997; Andre and Russell 2001, 2002] – finite controllers
 - MAXQ methods [Dietterich 1998, 2000] – goal-oriented subtasks
 - Etc.

10 Solving Large MDPs (Part II)
Temporal abstraction approaches (basic idea):
 - solving MDPs hierarchically,
 - using complex actions or subtasks to compress the scale of the state space.
Representing and solving MDPs logically (basic idea):
 - logically representing environment features,
 - aggregating 'similar' states,
 - representing the effects of actions compactly using logical structures, and eliminating unaffected features during reasoning.

11 First-Order MDPs
 - Using the stochastic situation calculus to model decision-theoretic planning problems
 - Underlying model: first-order MDPs (FOMDPs)
 - Solving FOMDPs using symbolic dynamic programming

12 Stochastic Situation Calculus (I)
 - Using choice axioms to specify the possible outcomes n_i(x) of any stochastic action a(x)
   Example: choice(delCoff(x), a) ≡ a = delCoffS(x) ∨ a = delCoffF(x)
 - Situations: S_0, do(a, s)
 - Fluents F(x, s) – modeling environment features compactly
   Examples: office(x, s), coffeeReq(x, s), holdingCoffee(s)
 - Basic action theory: using successor state axioms to describe the effect of the actions' outcomes on each fluent
   coffeeReq(x, do(a, s)) ≡ coffeeReq(x, s) ∧ ¬(a = delCoffS(x))
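As a toy illustration (assumptions mine, not from the talk): if a situation is represented as the list of actions applied to S_0, the successor state axiom for coffeeReq can be evaluated by straightforward recursion over that list.

```python
# Toy evaluation of the fluent coffeeReq through situations, mirroring
#   coffeeReq(x, do(a, s)) ≡ coffeeReq(x, s) ∧ ¬(a = delCoffS(x)).
# A situation is a list of ground actions applied to S0 (the empty list).

def coffee_req(x, situation, initially_requested):
    if not situation:                           # situation is S0
        return x in initially_requested
    *s, a = situation                           # situation is do(a, s)
    return coffee_req(x, s, initially_requested) and a != ("delCoffS", x)

# e.g. coffee_req("ann", [("delCoffS", "ann")], {"ann"}) is False:
# the request is gone once coffee was successfully delivered to ann.
```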

13 Stochastic Situation Calculus (II)
 - Asserting probabilities (which may depend on conditions of the current situation)
   Example: prob(delCoffS(x), delCoff(x), s) = case[hot, 0.9; ¬hot, 0.7]
 - Specifying rewards/costs conditionally
   Example: R(do(a, s)) = case[∃x. a = delCoffS(x), 10; ¬∃x. a = delCoffS(x), 0]
 - stGolog programs, policies
   proc π(x)
     if ¬holdingCoffee then getCoffee
     else (?(coffeeReq(x)); delCoffee(x))
   end proc

14 Symbolic Dynamic Programming
 - Representing the value function V_{n-1}(s) logically: case[φ_1(s), v_1; …; φ_m(s), v_m]
 - Input: the system described in the stochastic SitCal, and V_{n-1}(s)
 - Output (also in case format):
   – Q-functions: Q_n(a(x), s) = R(s) + γ Σ_i prob(n_i(x), a(x), s) V_{n-1}(do(n_i(x), s))
   – Value function: V_n(s) = max_a Q_n(a, s), expressed logically via (∃a)(∀b) Q_n(a, s) ≥ Q_n(b, s)
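To make the case notation more tangible, here is a toy sketch (not the actual implementation behind the talk) of a case function as a list of (formula, value) pairs, together with the cross-sum used when combining reward and discounted future-value cases; regression through successor state axioms and logical simplification of inconsistent partitions are omitted.

```python
# A case function is a list of (formula, value) pairs whose formulas partition
# the state space; formulas are kept as opaque strings in this sketch.

Case = list  # list of (formula_str, value) pairs

def cross_sum(c1: Case, c2: Case) -> Case:
    """Combine two case functions: conjoin formulas, add values."""
    return [(f"({f1}) & ({f2})", v1 + v2) for f1, v1 in c1 for f2, v2 in c2]

def scale(c: Case, k: float) -> Case:
    return [(f, k * v) for f, v in c]

# e.g. combining the reward case with a discounted future-value case:
R_case = [("EX x. a = delCoffS(x)", 10), ("~EX x. a = delCoffS(x)", 0)]
V_case = [("hot", 5.0), ("~hot", 3.0)]
Q_case = cross_sum(R_case, scale(V_case, 0.9))
# In real symbolic DP, inconsistent partitions would be pruned and equal-value
# partitions merged by logical simplification.
```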

15 Other Logical Representations
 - First-order MDPs [e.g., Boutilier et al. 2000; Boutilier, Reiter and Price 2001]
 - Factored MDPs [e.g., Boutilier and Dearden 1994; Boutilier, Dearden and Goldszmidt 1995; Hoey et al. 1999]
 - Relational MDPs [e.g., Guestrin et al. 2003]
 - Integrated Bayesian Agent Language (IBAL) [Pfeffer 2001]
 - Etc. [e.g., Bacchus 1993; Poole 1995]

16 Our Attempt: Combining temporal abstraction with logical representations of MDPs.

17 Motivation
[Figure: two cities, cityA and cityB, with houses houseA and houseB and relational facts such as living(X, houseA) and inCity(Y, cityA)]

18 Prior Work
 - MAXQ approaches [Dietterich 2000] and the PHAMs method [Andre and Russell 2001]
   – Using variables to represent state features
   – Propositional representations
 - Extending DTGolog with options [Ferrein, Fritz and Lakemeyer 2003]
   – Specifying options with the SitCal and Golog programs
   – Benefit: reusable when entering the exact same region
   – Shortcoming: options are tied to explicit regions, and are therefore not reusable in 'similar' regions

19 Our Idea and Potential Directions
Given any stGolog program (a macro-action schema), for example:
   proc getCoffee(X)
     if ¬holdingCoffee then getCoffee
     else (while coffeeReq(X) do delCoffee(X))
   end proc
Basic idea – inspired by macro-actions [Boutilier et al. 1998]:
 - Analyzing the macro-action to find what it affects
   Example: holdingCoffee, coffeeReq(X)
 - Preprocessing discounted transition and reward models
   Example: tr(holdingCoffee ∧ coffeeReq(X), getCoffee(X), ¬holdingCoffee ∧ ¬coffeeReq(X))

20 (Continued)
Using and re-using macro-actions as primitive actions.
Benefits:
 - Schematic: free variables in the macro-actions can represent a whole class of objects with the same characteristics, even infinitely many objects.
 - Reusable in similar regions, not only in the exact same region.

THE END Thank you!