1
An Overview of MAXQ Hierarchical Reinforcement Learning. Thomas G. Dietterich, Oregon State University. Presenter: ZhiWei
2
Motivation
- Traditional reinforcement learning algorithms treat the state space of the Markov Decision Process as a single "flat" search space.
- Drawback of this approach: it does not scale to tasks that have a complex, hierarchical structure, e.g., robot soccer or air traffic control.
- To overcome this problem, i.e., to make reinforcement learning hierarchical, we need to introduce mechanisms for abstraction and sharing.
- This paper describes an initial effort in this direction.
3
A learning example
4
A learning example (cont'd)
- Task: the taxi starts in a randomly-chosen cell and the passenger is at one of the four special locations (R, G, B, Y). The passenger has a desired destination, and the job of the taxi is to go to the passenger, pick him/her up, go to the passenger's destination, and drop him/her off.
- Six available primitive actions: North, South, East, West, Pickup, and Putdown.
- Reward: each action receives -1; putting the passenger down at the destination receives +20; attempting to pick up a non-existent passenger or to put the passenger down at the wrong place receives -10; running into walls has no effect but entails the usual reward of -1.
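As a concrete illustration of this reward scheme (not from the slides), here is a minimal per-step reward sketch in Python; the helper predicates taxi_at_passenger, passenger_in_taxi, and taxi_at_destination are hypothetical names, not from the paper:

```python
# Sketch of the taxi reward scheme described above. The helper predicates
# (taxi_at_passenger, passenger_in_taxi, taxi_at_destination) are hypothetical.
def reward(state, action):
    if action == "Pickup":
        # -1 for the step, -10 for attempting to pick up a non-existent passenger
        return -1 if taxi_at_passenger(state) else -10
    if action == "Putdown":
        # +20 for delivering the passenger at the destination, -10 for a wrong drop-off
        if passenger_in_taxi(state) and taxi_at_destination(state):
            return 20
        return -10
    # North/South/East/West always cost -1, even when blocked by a wall
    return -1
```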
5
Q-learning algorithm
- For any MDP, there exist one or more optimal policies. All these policies share the same optimal value function, which satisfies the Bellman equation (written out below).
- The Q function gives the value of taking an action in a state and then acting optimally afterwards.
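In standard notation, with transition model P(s'|s,a), reward R(s,a), and discount factor gamma, these take the familiar form:

```latex
V^*(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big]

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a'),
\qquad V^*(s) = \max_{a} Q^*(s,a)
```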
6
Q-learning algorithm (cont’d) Value function example:
7
Q-learning algorithm (cont’d) Learning Process:
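A minimal tabular Q-learning loop, sketched in Python under the assumption of a Gym-style environment whose reset() returns a state and whose step(action) returns (next_state, reward, done); all names here are illustrative, not from the paper:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning; Q is a dict keyed by (state, action), defaulting to 0."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the current Q estimates
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # one-step TD update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```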
8
Hierarchical Q-learning
- Action a is generally simple, e.g., one of the available primitive actions (normal Q-learning).
- Could action a also be complex, e.g., a subroutine that takes many primitive actions and then exits? Yes! The learning algorithm still works (hierarchical Q-learning).
9
Hierarchical Q-learning (cont’d) Assumption: some hierarchical structure is given.
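For the taxi task, the paper's hierarchy decomposes the problem into Get and Put subtasks plus a parameterized Navigate(t) subtask; written here as a Python dictionary purely for concreteness (the dictionary form is an illustration, not the paper's notation):

```python
# Taxi task hierarchy: each composite subtask maps to its child actions.
# Navigate is parameterized by a target t in {R, G, B, Y}; its children are
# the four movement primitives.
TAXI_HIERARCHY = {
    "Root":     ["Get", "Put"],
    "Get":      ["Pickup", "Navigate(source)"],
    "Put":      ["Putdown", "Navigate(destination)"],
    "Navigate": ["North", "South", "East", "West"],
}
```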
10
HSMQ Alg. (Task Decomposition)
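A sketch of the HSMQ idea (each subtask runs its own semi-Markov Q-learning over its children), not the paper's exact pseudocode; `env`, `hierarchy`, `primitive`, and `terminated` are assumed helpers, and the update discounts the backed-up value by the number of primitive steps the child consumed:

```python
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(subtask, state, action)]
ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1  # illustrative constants

def hsmq(env, state, task, hierarchy, primitive, terminated):
    """Execute `task` from `state`; return (total reward, primitive steps, final state)."""
    if primitive(task):
        next_state, reward, _ = env.step(task)
        return reward, 1, next_state

    total, steps = 0.0, 0
    while not terminated(task, state):
        actions = hierarchy[task]
        # epsilon-greedy choice among this subtask's children
        if random.random() < EPSILON:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(task, state, a)])
        # recursively execute the child; it may take many primitive steps
        reward, n, next_state = hsmq(env, state, action, hierarchy, primitive, terminated)
        # SMDP Q-learning update: discount the backed-up value by gamma**n
        best_next = 0.0 if terminated(task, next_state) else \
            max(Q[(task, next_state, a)] for a in actions)
        Q[(task, state, action)] += ALPHA * (
            reward + GAMMA**n * best_next - Q[(task, state, action)])
        total += GAMMA**steps * reward
        steps += n
        state = next_state
    return total, steps, state
```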
11
MAXQ Alg. (Value Function Decomposition)
- Want to obtain some sharing (compactness) in the representation of the value function.
- Re-write Q(p, s, a) as Q(p, s, a) = V(a, s) + C(p, s, a), where V(a, s) is the expected total reward received while executing action a, and C(p, s, a) is the expected reward of completing parent task p after a has returned.
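Applied recursively down the task hierarchy, the decomposition reads as follows, with V defined directly from the reward model at primitive actions:

```latex
Q(p, s, a) = V(a, s) + C(p, s, a)

V(a, s) =
\begin{cases}
\max_{a'} Q(a, s, a') & \text{if } a \text{ is composite} \\
\sum_{s'} P(s' \mid s, a)\, R(s' \mid s, a) & \text{if } a \text{ is primitive}
\end{cases}
```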
12
MAXQ Alg. (cont’d) An example
13
MAXQ Alg. (cont’d)
15
State Abstraction
Three fundamental forms:
- Irrelevant variables: e.g., the passenger location is irrelevant for the navigate and put subtasks and can thus be ignored.
- Funnel abstraction: a funnel action is an action that causes a large number of initial states to be mapped into a small number of resulting states. E.g., the navigate(t) action maps any state into a state where the taxi is at location t. This means the completion cost is independent of the location of the taxi; it is the same for all initial locations of the taxi.
16
State Abstraction (cont'd)
- Structure constraints: e.g., if a task is terminated in a state s, then there is no need to represent its completion cost in that state. Also, in some states, the termination predicate of the child task implies the termination predicate of the parent task.
Effect:
- Reduces the amount of memory needed to represent the Q-function: 3,000 Q values are required for flat Q-learning and 14,000 for HSMQ, but only 632 for the C() and V() functions in MAXQ with these abstractions.
- Learning is faster.
17
State Abstraction (cont’d)
18
Limitations
- Recursively optimal policies are not necessarily globally optimal.
- Model-free Q-learning: model-based algorithms (that is, algorithms that try to learn P(s'|s,a) and R(s'|s,a)) are generally much more efficient, because they remember past experience rather than having to re-experience it.