Introduction to Hierarchical Reinforcement Learning. Jervis Pinto. Slides adapted from Ron Parr (ICML 2005 Rich Representations for Reinforcement Learning Workshop) and Tom Dietterich (ICML 1999).

Contents: Purpose of HRL; Design issues; Defining sub-components (sub-tasks, options, macros); Learning problems and algorithms; MAXQ; ALISP (briefly); Problems with HRL. What we will not discuss: learning sub-tasks, the multi-agent setting, etc.

Contents: Purpose of HRL; Defining sub-components (sub-tasks, options, macros); Learning problems and algorithms; MAXQ; ALISP (briefly); Problems with HRL

Purpose of HRL: “Flat” RL works well, but only on small problems. We want to scale up to complex behaviors, but the curses of dimensionality kick in quickly! Humans don’t think at the granularity of primitive actions.

Goals of HRL. Scale-up: decompose large problems into smaller ones. Transfer: share/reuse tasks. Key ideas: impose constraints on the value function or policy (i.e., structure), and build learning algorithms that exploit those constraints.

Solution

Contents: Purpose of HRL; Design issues; Defining sub-components (sub-tasks, options, macros); Learning problems and algorithms; MAXQ; ALISP (briefly); Problems with HRL

Design issues: How do we define the components (options, partial policies, sub-tasks)? Which learning algorithms? How do we overcome sub-optimality? State abstraction?

Example problem

Contents: Purpose of HRL; Design issues; Defining sub-components (sub-tasks, options, macros); Learning problems and algorithms; MAXQ in detail; ALISP in detail; Problems with HRL

Learning problems: (1) Given a set of options, learn a policy over those options. (2) Given a hierarchy of partial policies, learn a policy for the entire problem (HAMQ, ALISPQ). (3) Given a set of sub-tasks, learn policies for each sub-task. (4) Given a set of sub-tasks, learn policies for the entire problem (MAXQ).
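As a concrete illustration of the first setting, below is a minimal sketch of SMDP Q-learning over a fixed set of options. It assumes a hypothetical environment/option interface (env.reset(), and option.run() returning the next state, the reward discounted within the option, the number of steps taken, and a done flag); none of these names come from the slides.

```python
import random
from collections import defaultdict

def smdp_q_learning(env, options, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """SMDP Q-learning: each option is treated as a temporally extended action,
    and the bootstrap term is discounted by gamma**k for a k-step option."""
    Q = defaultdict(float)  # Q[(state, option_index)], states assumed hashable
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice over options
            if random.random() < eps:
                o = random.randrange(len(options))
            else:
                o = max(range(len(options)), key=lambda i: Q[(s, i)])
            # run the option's internal policy until it terminates
            s_next, r_disc, k, done = options[o].run(env, s, gamma)
            target = r_disc
            if not done:
                target += gamma ** k * max(Q[(s_next, i)] for i in range(len(options)))
            Q[(s, o)] += alpha * (target - Q[(s, o)])
            s = s_next
    return Q
```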

Observations

Learning with partial policies

Hierarchies of Abstract Machines (HAM)

Learn policies for a given set of sub-tasks

Learning hierarchical sub-tasks

Example: locally optimal vs. optimal for the entire task

Contents Purpose of HRL Design issues Defining sub-components (sub-task, options, macros) Learning problems and algorithms MAXQ ALISP (briefly) Problems with HRL

Recap: MAXQ. Break the original MDP into multiple sub-MDPs. Each sub-MDP is treated as a temporally extended action. Define a hierarchy of sub-MDPs (sub-tasks). Each sub-task M_i is defined by: T_i = set of terminal states; A_i = set of child actions (which may be other sub-tasks); R'_i = local reward function.
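For concreteness, such a sub-task can be captured by a tiny data structure; the following sketch is illustrative only (the field names are made up, not from the slides).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class SubTask:
    """A MAXQ sub-task M_i: termination set T_i (as a predicate), child
    actions A_i (other sub-tasks or primitive action ids), and a local
    reward R'_i used only inside this sub-task."""
    name: str
    is_terminal: Callable[[object], bool]                                 # T_i
    children: List[Union["SubTask", int]] = field(default_factory=list)   # A_i
    local_reward: Callable[[object], float] = lambda s: 0.0               # R'_i
```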

Taxi: passengers appear at one of 4 special locations. -1 reward at every timestep, +20 for delivering the passenger to the destination, -10 for illegal actions. 500 states, 6 primitive actions. Sub-tasks: Navigate, Get, Put, Root.

Sub-task hierarchy
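The hierarchy slide itself is a figure. For Taxi, the standard hierarchy in Dietterich's MAXQ paper is Root over {Get, Put}, Get over {Navigate(source), Pickup}, Put over {Navigate(destination), Putdown}, and Navigate(t) over the four move actions. A sketch using the hypothetical SubTask structure above (state attribute names are assumptions, and the parameterized Navigate is simplified to a single sub-task whose target is stored in the state):

```python
# Primitive action ids (hypothetical encoding of the 6 Taxi actions)
NORTH, SOUTH, EAST, WEST, PICKUP, PUTDOWN = range(6)

navigate = SubTask("Navigate",
                   is_terminal=lambda s: s.taxi_pos == s.target,
                   children=[NORTH, SOUTH, EAST, WEST])
get = SubTask("Get",
              is_terminal=lambda s: s.passenger_in_taxi,
              children=[navigate, PICKUP])
put = SubTask("Put",
              is_terminal=lambda s: s.passenger_delivered,
              children=[navigate, PUTDOWN])
root = SubTask("Root",
               is_terminal=lambda s: s.passenger_delivered,
               children=[get, put])
```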

Value function decomposition: Q(i, s, a) = V(a, s) + C(i, s, a), where C(i, s, a) is the “completion function”, the expected discounted reward for completing sub-task i after child action a finishes in state s. The completion function must be learned!

MAXQ hierarchy: Q nodes store the completion function C(i, s, a). Composite Max nodes compute their values V(i, s) using the recursive equation (see below). Primitive Max nodes estimate their value V(i, s) directly.
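The recursive equation referred to above can be written out as follows (a reconstruction following Dietterich's MAXQ value-function decomposition, where i is a sub-task and a one of its child actions):

```latex
\begin{align*}
Q(i, s, a) &= V(a, s) + C(i, s, a) \\
V(i, s) &=
  \begin{cases}
    \max_{a \in A_i} Q(i, s, a) & \text{if } i \text{ is composite} \\
    \mathbb{E}\{ r \mid s, i \} & \text{if } i \text{ is primitive}
  \end{cases}
\end{align*}
```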

How do we learn the completion function? Basic idea: C(i, s, a) = V(i, s'), the value of the state s' in which child a terminates. For simplicity, assume all local rewards are 0. Then at some sub-task i, when child a terminates with its accumulated reward, update C(i, s, a) toward the (discounted) value V_t(i, s'). V_t(i, s') must be computed recursively by searching for the best path through the tree (expensive if done naively!).
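A minimal sketch of this recursion and update, assuming zero local rewards and the hypothetical SubTask objects sketched earlier (C and V_primitive are tables keyed by hashable states; all names are illustrative):

```python
from collections import defaultdict

def name_of(a):
    """Children are either primitive action ids (ints) or SubTask objects."""
    return a if isinstance(a, int) else a.name

def evaluate(node, s, C, V_primitive):
    """V(node, s): recursive best-path search through the task tree.
    This is the step that is expensive if done naively."""
    if isinstance(node, int):                         # primitive action
        return V_primitive[(node, s)]
    # composite: V(i, s) = max_a [ V(a, s) + C(i, s, a) ]
    return max(evaluate(a, s, C, V_primitive) + C[(node.name, s, name_of(a))]
               for a in node.children)

def update_completion(C, i, s, a, s_next, n_steps, V_primitive,
                      alpha=0.1, gamma=0.99):
    """When child a terminates in s_next after n_steps, move C(i, s, a)
    toward the discounted value of completing sub-task i from s_next."""
    key = (i.name, s, name_of(a))
    target = gamma ** n_steps * evaluate(i, s_next, C, V_primitive)
    C[key] += alpha * (target - C[key])

# Tables default to 0 for unseen entries:
C = defaultdict(float)
V_primitive = defaultdict(float)
```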

When local rewards may be arbitrary: use two different completion functions, C and C'. C is reported externally, while C' is used internally. The updates change a little (see below). Intuition: C'(i, s, a) is used to learn a locally optimal policy at the sub-task, so it adds the local reward in its update; C updates using the best action a* chosen according to C'.
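Written out, the modified updates look roughly as follows (a reconstruction in the spirit of Dietterich's MAXQ-Q learning rules, not copied from the slide; N is the number of steps child a executed and s' the state where it terminated):

```latex
\begin{align*}
a^* &= \arg\max_{a'} \big[ C'_t(i, s', a') + V_t(a', s') \big] \\
C'_{t+1}(i, s, a) &= (1-\alpha_t)\, C'_t(i, s, a)
  + \alpha_t\, \gamma^{N} \big[ R'_i(s') + C'_t(i, s', a^*) + V_t(a^*, s') \big] \\
C_{t+1}(i, s, a) &= (1-\alpha_t)\, C_t(i, s, a)
  + \alpha_t\, \gamma^{N} \big[ C_t(i, s', a^*) + V_t(a^*, s') \big]
\end{align*}
```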

Sacrificing (hierarchical) optimality for… state abstraction! Compact value functions (i.e., C and C' become functions of fewer variables), so far fewer values need to be learned (632 values for Taxi, compared with the full flat table). E.g., “passenger in car or not” is irrelevant to the navigation task. Two main types of state abstraction: (1) variables that are irrelevant to the sub-task; (2) funnel actions: sub-tasks always end in a small fixed set of states, irrespective of what the child does.
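In code, state abstraction amounts to indexing each sub-task's completion function by only the variables that matter to it; a hypothetical sketch for Taxi (attribute names are assumptions, not from the slides):

```python
def abstract_state(subtask_name, s):
    """Project the full Taxi state onto the variables a sub-task cares about,
    so its completion function depends on fewer variables."""
    if subtask_name == "Navigate":
        return (s.taxi_pos, s.target)        # passenger status is irrelevant here
    if subtask_name == "Get":
        return (s.taxi_pos, s.passenger_loc)
    if subtask_name == "Put":
        return (s.taxi_pos, s.destination)
    return (s.taxi_pos, s.passenger_loc, s.passenger_in_taxi, s.destination)

# Completion values would then be stored as
# C[(subtask_name, abstract_state(subtask_name, s), child_name)]
```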

Trade-off between hierarchical optimality and state abstraction: preventing sub-tasks from being context-dependent leads to compact value functions, but we must pay a (hopefully small!) price in optimality.

ALISP: a hierarchy of partial policies, like HAM, integrated into Lisp. Decomposition is along procedure boundaries, and execution proceeds from choice point to choice point. Main idea: a 3-part Q-function decomposition, Q_r, Q_c, Q_e, where Q_e is the expected reward outside the subroutine until the end of the program. This gives hierarchical optimality!
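Written out, the three-part decomposition (following Andre and Russell's ALisp work, with ω the joint environment-plus-program state) is:

```latex
Q(\omega, a) \;=\; Q_r(\omega, a) \;+\; Q_c(\omega, a) \;+\; Q_e(\omega, a)
```

Here Q_r is the expected reward while executing the chosen action itself, Q_c the expected reward until the current subroutine exits, and Q_e the expected reward from that exit until the end of the program; maximizing their sum is what yields hierarchical optimality.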

Why not? (Parr 2005) Some cool ideas and algorithms, but no killer apps or wide acceptance yet. A good idea that needs more refinement: more user friendliness, more rigor in specification. Recent / active research: learning the hierarchy automatically; the multi-agent setting (Concurrent ALISP); efficient exploration (RMAXQ).

Conclusion: Decompose the MDP into sub-problems. Structure + MDP (typically) = SMDP, so apply some variant of SMDP Q-learning. Decompose the value function and apply state abstraction to accelerate learning. Thank you!