Hierarchical Reinforcement Learning
Ersin Basaran, 19/03/2005

Outline
Reinforcement Learning
  - RL Agent
  - Policy
Hierarchical Reinforcement Learning
  - The Need
  - Sub-Goal Detection
  - State Clusters
  - Border States
  - Continuous State and/or Action Spaces
  - Options
  - Macro Q-Learning with Parallel Option Discovery
Experimental Results

Reinforcement Learning
The agent observes the state and takes an action according to the policy.
The policy is a function from the state space to the action space.
The policy can be deterministic or non-deterministic.
State and action spaces can be discrete, continuous, or hybrid.
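
A minimal sketch of the two policy types, using hypothetical state names ("s0", "s1") and actions ("left", "right"):

```python
import random

# A deterministic policy maps each state directly to an action
# (state and action names here are hypothetical).
deterministic_policy = {"s0": "left", "s1": "right"}

# A non-deterministic (stochastic) policy maps each state to a
# probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def act(policy, state):
    """Return an action for `state`, sampling if the policy is stochastic."""
    choice = policy[state]
    if isinstance(choice, dict):
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice
```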

RL Agent
No model of the environment is assumed.
The agent observes state s, takes action a, and moves to state s', observing reward r.
The agent tries to maximize the total expected reward (the return).
Finite state machine view: s --(a, r)--> s'.
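
A sketch of this interaction loop and of accumulating the discounted return; the env.reset()/env.step() interface, the policy callable, and the discount factor are assumptions for illustration:

```python
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and accumulate the discounted return."""
    state = env.reset()
    done, ret, discount = False, 0.0, 1.0
    while not done:
        action = policy(state)                    # act according to the current policy
        next_state, reward, done = env.step(action)
        ret += discount * reward                  # return the agent tries to maximize
        discount *= gamma
        state = next_state
    return ret
```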

Policy
In a flat RL model, the policy is a map from each state to a primitive action.
Under the optimal policy, the action taken by the agent yields the highest return at each step.
The policy can be kept in tabular form for small state and action spaces.
Function approximators can be used for large (or continuous) state or action spaces.
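
A sketch of the tabular case: a Q-table over hypothetical primitive actions, a greedy flat policy derived from it, and a standard one-step Q-learning update (learning rate and discount are assumed values):

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]           # hypothetical primitive actions
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})  # tabular action values

def greedy_action(state):
    """Flat policy: map a state to the primitive action with the highest value."""
    return max(Q[state], key=Q[state].get)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning update toward the best value in the next state."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```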

The Need for Hierarchical RL
It increases performance.
Applying RL to problems with large action and/or state spaces becomes feasible.
Detecting sub-goals helps the agent define abstract actions over the primitive actions.
Sub-goals and abstract actions can be reused across different tasks in the same domain, so knowledge is transferred between tasks.
The policy of the agent can be translated into natural language.

Sub-goal Detection
A sub-goal can be a single state, a subset of the state space, or a constraint on the state space.
Reaching a sub-goal should help the agent reach the main goal (obtain the highest return).
Sub-goals must be discovered by the agent autonomously.

State Clusters
The states within a cluster are strongly connected to each other.
The number of state transitions between clusters is small.
The states at the two ends of a transition between different clusters are sub-goal candidates.
Clusters can be hierarchical: different clusters can belong to the same cluster at a higher level.
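
A sketch of how cluster-crossing transitions yield candidate sub-goals, assuming some clustering of the observed transition graph has already been computed (the transition list and cluster labels are hypothetical inputs):

```python
def crossing_subgoal_candidates(transitions, cluster_of):
    """transitions: iterable of (s, s_next) pairs observed by the agent.
    cluster_of: dict mapping each state to its cluster label.
    Returns the states at either end of a transition that crosses clusters."""
    candidates = set()
    for s, s_next in transitions:
        if cluster_of[s] != cluster_of[s_next]:   # transition between two clusters
            candidates.update((s, s_next))        # both endpoints are sub-goal candidates
    return candidates
```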

Border States
Some actions cannot be applied in some states; such states are defined as border states.
Border states are assumed to form a transition sequence: the agent can travel along the border states by taking some actions.
Each end of this transition sequence is a candidate sub-goal, assuming the agent has sufficiently explored the environment.

Border State Detection
For discrete action and state spaces, define:
F(s): the set of states that can be reached from state s in one time unit.
G(s): the set of actions that cause no state transition when applied in state s.
H(s): the set of actions that move the agent to a different state when applied in state s.

Border State Detection
Detect the longest state sequence s_0, s_1, s_2, ..., s_{k-1}, s_k that satisfies the following constraints:
s_i ∈ F(s_{i+1}) or s_{i+1} ∈ F(s_i) for 0 ≤ i < k
G(s_i) ∩ G(s_{i+1}) ≠ ∅ for 0 < i < k-1
H(s_0) ∩ G(s_1) ≠ ∅
H(s_k) ∩ G(s_{k-1}) ≠ ∅
s_0 and s_k are candidate sub-goals.
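
A rough sketch of checking these constraints for a given candidate sequence, with F, G, and H supplied as precomputed dictionaries mapping each state to a set (the function name and data layout are assumptions):

```python
def is_border_sequence(seq, F, G, H):
    """seq: list of states [s_0, ..., s_k]; F, G, H: dicts mapping a state to a set
    (reachable states, no-transition actions, transition actions, respectively)."""
    k = len(seq) - 1
    if k < 1:
        return False
    # Consecutive states must be reachable from one another in one step.
    if not all(seq[i] in F[seq[i + 1]] or seq[i + 1] in F[seq[i]] for i in range(k)):
        return False
    # Interior neighbors must share at least one action that causes no transition.
    if not all(G[seq[i]] & G[seq[i + 1]] for i in range(1, k - 1)):
        return False
    # Each endpoint must have an action that moves the agent away but is
    # blocked at the neighboring border state.
    return bool(H[seq[0]] & G[seq[1]]) and bool(H[seq[k]] & G[seq[k - 1]])
```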

Border States in Continuous State and Action Spaces
The environment is assumed to be bounded.
State and action vectors can include both continuous and discrete dimensions.
The derivative of the state vector with respect to the action vector can be used.
Border state regions have small derivatives for some action vectors.
A large change in these derivatives indicates a border state region.
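
One way to approximate that derivative is a finite-difference estimate around a given state-action pair; this sketch assumes a deterministic, resettable step function and is purely illustrative (all names are hypothetical):

```python
import numpy as np

def state_action_jacobian(step_from, state, action, eps=1e-3):
    """Finite-difference estimate of d(next_state)/d(action).
    step_from(state, action) -> next_state is assumed deterministic here."""
    base = np.asarray(step_from(state, action), dtype=float)
    action = np.asarray(action, dtype=float)
    jac = np.zeros((base.size, action.size))
    for j in range(action.size):
        perturbed = action.copy()
        perturbed[j] += eps
        jac[:, j] = (np.asarray(step_from(state, perturbed), dtype=float) - base) / eps
    return jac  # near-zero columns suggest actions that barely move the state (border region)
```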

Options
An option is a policy.
It can be local (defined on a subset of the state space) or global.
The option policy can use primitive actions or other options, so it is hierarchical.
Options are used to reach sub-goals.

Macro Q-Learning with Parallel Option Discovery
The agent starts with no sub-goals and no options.
It detects sub-goals and learns the option policies and the main policy simultaneously.
Options are formed and removed from the model according to the sub-goal detection algorithm.
When a possible sub-goal is detected, a new option is added to the model, with a policy for reaching this sub-goal.
All option policies are updated in parallel.
The agent generates an internal reward when a sub-goal is reached.
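
A high-level sketch of how one step of such a scheme could be organized; the environment interface, the sub-goal detector, the exploration scheme, and the minimal Option stand-in are assumptions, not the author's exact algorithm:

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]            # hypothetical primitive actions

def new_q_table():
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})

class Option:
    """Minimal stand-in for an option: its own Q-table plus an internal reward
    that fires when the detected sub-goal state is reached."""
    def __init__(self, subgoal, bonus=1.0):
        self.subgoal, self.bonus, self.q = subgoal, bonus, new_q_table()

def td_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    q[s][a] += alpha * (r + gamma * max(q[s_next].values()) - q[s][a])

def learning_step(env, state, main_q, options, detect_subgoals, epsilon=0.1):
    """One step: act, update the main policy, update every option policy from
    the same experience, and create options for newly detected sub-goals."""
    if random.random() < epsilon:                    # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(main_q[state], key=main_q[state].get)
    next_state, reward, done = env.step(action)

    td_update(main_q, state, action, reward, next_state)         # main policy
    for opt in options:                                           # all options in parallel
        internal = reward + (opt.bonus if next_state == opt.subgoal else 0.0)
        td_update(opt.q, state, action, internal, next_state)

    options.extend(Option(g) for g in detect_subgoals(state, next_state))
    return next_state, done
```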

Macro Q-Learning with Parallel Option Discovery
An option is defined by the tuple O = (π_o, β_o, I_o, Q_o, r_o), where π_o is the option policy, β_o the termination condition, I_o the initiation set, Q_o the Q-values for the option, and r_o the internal reward signal associated with the option.
The intra-option learning method is used.
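
A sketch of this tuple as a data structure, together with a one-step intra-option style update in which every option whose policy agrees with the executed primitive action is updated from the same experience; the field types, termination handling, and update form are assumptions for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

ACTIONS = ["up", "down", "left", "right"]            # hypothetical primitive actions

@dataclass
class Option:
    """O = (pi_o, beta_o, I_o, Q_o, r_o), mirroring the tuple on the slide."""
    policy: Callable[[str], str]                     # pi_o: action chosen by the option
    beta: Callable[[str], float]                     # beta_o: termination probability
    init_set: Set[str]                               # I_o: states where the option may start
    q: Dict[str, Dict[str, float]] = field(
        default_factory=lambda: defaultdict(lambda: {a: 0.0 for a in ACTIONS}))  # Q_o
    internal_reward: Callable[[str], float] = lambda s: 0.0                      # r_o

def intra_option_update(options, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Update every option whose policy would have taken the executed action,
    so all option value functions learn in parallel from the same experience."""
    for o in options:
        if o.policy(s) != a:
            continue
        r_o = r + o.internal_reward(s_next)
        cont = o.q[s_next][o.policy(s_next)]         # value of continuing the option
        stop = max(o.q[s_next].values())             # value of terminating and acting greedily
        target = r_o + gamma * ((1 - o.beta(s_next)) * cont + o.beta(s_next) * stop)
        o.q[s][a] += alpha * (target - o.q[s][a])
```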

Experiments
(Result plots comparing flat RL and hierarchical RL.)

Options in HRL

Questions and Suggestions!!!