An Overview of MAXQ Hierarchical Reinforcement Learning. Thomas G. Dietterich, Oregon State University. Presenter: ZhiWei

Motivation Traditional reinforcement learning algorithms treat the state space of the Markov Decision Process as a single “flat” search space. The drawback of this approach is that it does not scale to tasks with a complex, hierarchical structure, e.g., robot soccer or air traffic control. To overcome this problem, i.e., to make reinforcement learning hierarchical, we need to introduce mechanisms for abstraction and sharing. This paper describes an initial effort in this direction.

A learning example

A learning example (cont’d) Task: the taxi starts in a randomly-chosen cell and the passenger is at one of four special locations (R, G, B, Y). The passenger has a desired destination, and the job of the taxi is to drive to the passenger, pick him/her up, drive to the passenger’s destination, and drop him/her off. Six primitive actions are available: North, South, East, West, Pickup, and Putdown. Reward: each action receives -1; putting the passenger down at the destination receives +20; attempting to pick up a non-existent passenger or to put the passenger down at the wrong place receives -10; running into a wall has no effect but still incurs the usual reward of -1.
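For a sense of scale, assuming the standard 5-by-5 grid of the taxi domain, the flat state space contains

\[
\underbrace{25}_{\text{taxi cells}} \times \underbrace{5}_{\text{passenger: R, G, B, Y, or in the taxi}} \times \underbrace{4}_{\text{destinations}} = 500 \ \text{states},
\]

small enough for tabular methods, but already showing how the state variables multiply.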

Q-learning algorithm For any MDP, there exists at least one optimal policy. All optimal policies share the same optimal value function, which satisfies the Bellman equation; the Q function is defined from the same quantities.
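In standard notation, with reward function R, transition probabilities P, and discount factor γ, these are

\[
V^*(s) = \max_{a}\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\Big],
\qquad
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a'),
\]

so that \(V^*(s) = \max_a Q^*(s,a)\) and an optimal policy is \(\pi^*(s) = \arg\max_a Q^*(s,a)\).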

Q-learning algorithm (cont’d) Value function example:

Q-learning algorithm (cont’d) Learning Process:
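A minimal sketch of the tabular Q-learning loop described here (not the paper's code; it assumes a hypothetical environment object with `reset()` and `step(action)` returning `(next_state, reward, done)`):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions."""
    Q = defaultdict(float)  # Q[(state, action)] defaults to 0.0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Move Q(s, a) toward r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
```

Applied to the taxi domain, `actions` would be the six primitive actions, and the learned table Q[(s, a)] defines the greedy policy.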

Hierarchical Q-learning In normal Q-learning, the action a is simple, i.e., one of the available primitive actions. Could a also be complex, e.g., a subroutine that takes many primitive actions and then exits? Yes! The learning algorithm still works (hierarchical Q-learning).
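When a is such a subroutine that runs for N primitive steps, collecting rewards r_1, ..., r_N before returning in state s', the only change to the update is that the reward and the discount on the successor state span those N steps (the semi-Markov form of Q-learning):

\[
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \Big[\, r_1 + \gamma r_2 + \dots + \gamma^{N-1} r_N + \gamma^{N} \max_{a'} Q(s', a') \,\Big].
\]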

Hierarchical Q-learning (cont’d) Assumption: some hierarchical structure is given.
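For the taxi task, the hierarchy given in the MAXQ paper is: Root invokes Get and Put; Get invokes Pickup and Navigate(t); Put invokes Putdown and Navigate(t); Navigate(t) invokes the four movement actions. Encoded as a plain Python dict purely for illustration:

```python
# Task hierarchy for the taxi problem: each composite task lists the
# child subtasks / primitive actions it is allowed to invoke.
# Navigate(t) is parameterized by a target location t in {R, G, B, Y}.
TAXI_HIERARCHY = {
    "Root":        ["Get", "Put"],
    "Get":         ["Pickup", "Navigate(t)"],
    "Put":         ["Putdown", "Navigate(t)"],
    "Navigate(t)": ["North", "South", "East", "West"],
}
```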

HSMQ Alg. (Task Decomposition)
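A rough sketch of HSMQ's (Hierarchical Semi-Markov Q-learning's) recursive control structure, not the paper's exact pseudocode: each subtask keeps its own Q table over its children and is updated with the semi-Markov rule above. The environment interface and the helper methods `is_terminal` / `is_primitive` are assumed for illustration; `Q` is e.g. a `defaultdict(float)` keyed by (task, state, child).

```python
import random

def hsmq(env, task, state, Q, hierarchy, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Run subtask `task` from `state`; return (total reward, steps taken, final state)."""
    total_reward, steps = 0.0, 0
    while not task.is_terminal(state):
        children = hierarchy[task]

        # Epsilon-greedy choice among this subtask's own children.
        if random.random() < epsilon:
            action = random.choice(children)
        else:
            action = max(children, key=lambda a: Q[(task, state, a)])

        if action.is_primitive():
            next_state, reward = env.step(action)
            r, n = reward, 1
        else:
            # Recursively execute the child subtask until it terminates.
            r, n, next_state = hsmq(env, action, state, Q, hierarchy,
                                    alpha, gamma, epsilon)

        # Semi-Markov Q-learning update for the parent task.
        if task.is_terminal(next_state):
            best_next = 0.0
        else:
            best_next = max(Q[(task, next_state, a)] for a in children)
        Q[(task, state, action)] += alpha * (
            r + gamma**n * best_next - Q[(task, state, action)])

        total_reward += gamma**steps * r
        steps += n
        state = next_state
    return total_reward, steps, state
```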

MAXQ Alg. (Value Fun. Decomposition) We want to obtain some sharing (compactness) in the representation of the value function. Re-write Q(p, s, a) as Q(p, s, a) = V(a, s) + C(p, s, a), where V(a, s) is the expected total reward received while executing action a, and C(p, s, a) is the expected total reward of completing the parent task p after a has returned.
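The definition is recursive: V of a composite action is the best Q value among its own children, while V of a primitive action is its expected one-step reward:

\[
Q(p, s, a) = V(a, s) + C(p, s, a),
\qquad
V(a, s) =
\begin{cases}
\max_{a'} Q(a, s, a') & \text{if } a \text{ is composite},\\
E[\, r \mid s, a \,] & \text{if } a \text{ is primitive}.
\end{cases}
\]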

MAXQ Alg. (cont’d) An example
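Following the taxi example from the paper (illustrative form): if in state s the hierarchical policy's next primitive action is North, chosen inside Navigate(R), which in turn was chosen inside Get, then the value of s unrolls into one V term for the leaf plus one completion term per ancestor task:

\[
V(\mathrm{Root}, s) = V(\mathrm{North}, s) + C(\mathrm{Navigate(R)}, s, \mathrm{North})
+ C(\mathrm{Get}, s, \mathrm{Navigate(R)}) + C(\mathrm{Root}, s, \mathrm{Get}).
\]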

MAXQ Alg. (cont’d)
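Learning in MAXQ then amounts to learning the C values (the V values of primitive actions are learned directly). A rough form of the MAXQ-Q update, ignoring pseudo-rewards and other refinements of the full algorithm, is

\[
a^{*} = \arg\max_{a'} \big[\, V(a', s') + C(p, s', a') \,\big],
\qquad
C(p, s, a) \leftarrow (1-\alpha)\, C(p, s, a) + \alpha\, \gamma^{N}\big[\, V(a^{*}, s') + C(p, s', a^{*}) \,\big],
\]

where s' is the state observed when subtask a returns after N steps.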

State Abstraction Three fundamental forms:
- Irrelevant variables: e.g., the passenger’s location is irrelevant for the Navigate and Put subtasks and can thus be ignored.
- Funnel abstraction: a funnel action is an action that maps a large number of initial states into a small number of resulting states. E.g., the Navigate(t) action maps any state into a state where the taxi is at location t. This means the completion cost is independent of the taxi’s location: it is the same for all initial locations of the taxi.

State Abstraction (cont’d)
- Structure constraints: e.g., if a task is terminated in a state s, then there is no need to represent its completion cost in that state. Also, in some states the termination predicate of the child task implies the termination predicate of the parent task.
Effect:
- Reduces the amount of memory needed to represent the Q-function: 14,000 Q values are required for flat Q-learning, 3,000 for HSMQ (with the irrelevant-variable abstraction), and only 632 values for C() and V() in MAXQ.
- Learning is faster.

State Abstraction (cont’d)

Limitations
- A recursively optimal policy is not necessarily an optimal policy.
- The algorithms are model-free Q-learning. Model-based algorithms (that is, algorithms that try to learn P(s’|s,a) and R(s’|s,a)) are generally much more efficient because they remember past experience rather than having to re-experience it.