Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
R.S. Sutton, D. Precup, and S. Singh, Artificial Intelligence 112, 1999
Talk by Jangmin O, SNU BioIntelligence Lab.
Time abstraction: a sequence of primitive machine instructions

  mov ax, dseg    (t = 1)
  mov ds, ax
  ...
  mul bx          (t = k-1)
  mov C, dx       (t = k)

can be treated as a single abstract operation: mult a, b, c
MDP : small, discrete-time transitions
SMDP : larger, possibly continuous-time transitions; each action is indivisible
Options : enable an MDP trajectory to be analyzed in either way
The Reinforcement Learning (MDP) Framework
S : state set, A : action set
Transition dynamics: p^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
One-step expected rewards: r^a_s = E{ r_{t+1} | s_t = s, a_t = a }
Agent's objective: learn a policy π maximizing the value function.
[Figure: agent-environment loop. In state s_t the agent emits a_t; the environment returns r_{t+1} and the next state s_{t+1}.]
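A minimal sketch (not from the slides) of representing these quantities for a small, hypothetical tabular MDP: P holds the transition dynamics p^a_{ss'}, R the one-step expected rewards r^a_s, and step() samples one turn of the agent-environment loop.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, for illustration only.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = p^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalize each (s, a) row into a distribution

# R[s, a] = r^a_s = E{ r_{t+1} | s_t = s, a_t = a }
R = rng.random((n_states, n_actions))

def step(s, a):
    """Sample the environment's response (r_{t+1}, s_{t+1}) to taking action a in state s."""
    s_next = rng.choice(n_states, p=P[s, a])
    return R[s, a], s_next

reward, s_next = step(s=0, a=1)          # one turn of the agent-environment loop
```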
Value function:
V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }
Bellman equations:
state:  V^π(s) = Σ_{a ∈ A_s} π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} V^π(s') ]
action: Q^π(s, a) = r^a_s + γ Σ_{s'} p^a_{ss'} Σ_{a'} π(s', a') Q^π(s', a')
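A sketch of evaluating a policy with the state Bellman equation above, using the same tabular representation; the 3-state MDP and the uniform random policy are placeholders, not from the slides.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                      # p^a_{ss'}
R = rng.random((n_states, n_actions))                  # r^a_s
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy

def policy_evaluation(pi, P, R, gamma, tol=1e-8):
    """Iterate V(s) <- sum_a pi(s,a) [ r^a_s + gamma sum_s' p^a_{ss'} V(s') ] to a fixed point."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)                        # Q[s, a]: one-step lookahead values
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_pi = policy_evaluation(pi, P, R, gamma)
```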
2. Options
Markov Options
Definition: a generalization of primitive actions to temporally extended courses of action.
o = ⟨I, π, β⟩
Policy π : S × A → [0, 1]
Termination condition β : S⁺ → [0, 1]
Initiation set I ⊆ S
Execution: if o is initiated at (s_t, ...), then a_t is chosen according to π(s_t, ·), and o terminates at (s_{t+1}, ...) with probability β(s_{t+1}).
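A sketch of the triple o = ⟨I, π, β⟩ as a small Python structure. The 1-D corridor, the deterministic state-to-action policy, and the "go right" option are illustrative assumptions, not from the paper; the primitive() helper shows how an ordinary action fits the same interface as a one-step option.

```python
from dataclasses import dataclass
from typing import Callable, Set

State, Action = int, int          # illustrative encodings for a 1-D corridor of cells 0..10

@dataclass
class MarkovOption:
    initiation_set: Set[State]              # I: states where the option may be started
    policy: Callable[[State], Action]       # pi, written here as a deterministic state -> action map
    termination: Callable[[State], float]   # beta(s): probability of terminating upon reaching s

# Hypothetical option: keep moving right (action 1) until the rightmost cell is reached.
go_right = MarkovOption(
    initiation_set=set(range(10)),
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 10 else 0.0,
)

def primitive(a: Action, states: Set[State]) -> MarkovOption:
    """A primitive action as a one-step option: available wherever a is, terminates immediately."""
    return MarkovOption(initiation_set=states, policy=lambda s: a, termination=lambda s: 1.0)
```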
Semi-Markov Options
Semi-Markov: the policy and termination condition may depend on all events since the option was initiated.
History h_{tτ} : s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, ..., r_τ, s_τ, the time-stamped sequence of events from the option's initiation at time t up to the current time τ.
Markov: π(s_τ, ·), β(s_τ); semi-Markov: π(h_{tτ}, ·), β(h_{tτ}).
Policy π : Ω × A → [0, 1]; termination condition β : Ω → [0, 1], where Ω is the set of all histories.
Policies over Options
Generalization: A_s ⊆ O_s; each action is a specialized version of an option. A primitive action a corresponds to the option with I = {s : a ∈ A_s}, π(s, a) = 1 for all s ∈ I, and β(s) = 1 for all s ∈ S, i.e. a one-step option available whenever a is available.
Policy over options μ : S × O → [0, 1]; selecting option o in state s_t executes it until it terminates at some later state s_{t+k}.
Flat policy: π = flat(μ), the policy over primitive actions induced by μ.
State-value function: V^μ(s) = E{ r_{t+1} + γ r_{t+2} + ... | E(μ, s, t) }, where E(μ, s, t) is the event of the semi-Markov flat policy corresponding to μ being initiated in s at time t.
Option-value function: Q^μ(s, o), the value of taking option o in state s ∈ I under policy μ. It is defined through the composed option oμ: follow o until it terminates, then start selecting according to μ.
Since oμ is semi-Markov, we use E(oμ, h, t), the event of oμ continuing from history h at time t: a_t is chosen according to o(h, ·); with probability given by the termination condition at t+1, β(h a_t r_{t+1} s_{t+1}), the option terminates; otherwise a_{t+1} is chosen according to o(h a_t r_{t+1} s_{t+1}, ·), and so on.
3. SMDP (Option-to-Option) Methods
MDP + Options = SMDP
Theorem 1. For any MDP, and any set of options defined on that MDP, the decision process that selects only among those options, executing each to termination, is an SMDP.
An SMDP consists of: a set of states; a set of actions; an expected cumulative discounted reward for each state-action pair; and a well-defined joint distribution of the next state and transit time.
These quantities are well defined here because the underlying MDP is Markov and the options are semi-Markov.
Model of Options
Reward part: r^o_s = E{ r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k} | E(o, s, t) }, the expected discounted reward accumulated while o runs from s.
State-prediction part: p^o_{ss'} = Σ_{k=1}^∞ p(s', k) γ^k, where p(s', k) is the probability that the option terminates in s' after k steps (the discount γ^k is folded into the model).
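A sketch of estimating both parts of an option's model by Monte Carlo rollouts: average the discounted reward for r^o_s and the γ^k indicator of the terminal state for p^o_{ss'}. The option and step interfaces are the hypothetical ones sketched earlier, and the option is assumed to terminate with probability 1.

```python
import random
from collections import defaultdict

def estimate_option_model(s0, option, step, gamma=0.9, n_rollouts=1000):
    """Monte Carlo estimate of the reward part r^o_{s0} and state-prediction part p^o_{s0,s'}.

    `option` exposes .policy(s) and .termination(s); `step(s, a)` returns (reward, next_state).
    Both are the hypothetical interfaces sketched earlier.
    """
    r_o = 0.0
    p_o = defaultdict(float)                 # p^o_{s0,s'} = E[ gamma^k * 1{ terminates in s' } ]
    for _ in range(n_rollouts):
        s, g, disc = s0, 0.0, 1.0
        while True:                          # execute the option once, to termination
            r, s = step(s, option.policy(s))
            g += disc * r                    # accumulate r_{t+1} + gamma r_{t+2} + ...
            disc *= gamma                    # after k steps, disc == gamma^k
            if random.random() < option.termination(s):
                break
        r_o += g / n_rollouts
        p_o[s] += disc / n_rollouts
    return r_o, dict(p_o)
```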
Value functions over options:
state:  V^μ(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | E(flat(μ), s, t) }
action (option): Q^μ(s, o) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | E(oμ, s, t) }
Bellman equations over options (discount folded into the option models):
state:  V^μ(s) = Σ_{o ∈ O_s} μ(s, o) [ r^o_s + Σ_{s'} p^o_{ss'} V^μ(s') ]
action: Q^μ(s, o) = r^o_s + Σ_{s'} p^o_{ss'} Σ_{o' ∈ O_{s'}} μ(s', o') Q^μ(s', o')
3.1 SMDP Planning
SVI (synchronous value iteration)
state:  V_{k+1}(s) = max_{o ∈ O_s} [ r^o_s + Σ_{s'} p^o_{ss'} V_k(s') ], converging to V*_O
action: Q_{k+1}(s, o) = r^o_s + Σ_{s'} p^o_{ss'} max_{o' ∈ O_{s'}} Q_k(s', o'), converging to Q*_O
V*_O vs. V*: in general V*_O(s) ≤ V*(s); the two coincide when O contains all the primitive actions.
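A sketch of SVI using stored option models, with the discount already folded into the state-prediction part as above. The array layout and the assumption that every option is available in every state are simplifications for illustration.

```python
import numpy as np

def svi(R_opt, P_opt, n_iters=100):
    """Synchronous value iteration over options.

    R_opt[o, s]    = r^o_s   (expected discounted reward while o runs from s)
    P_opt[o, s, x] = p^o_{sx} (terminal-state prediction with gamma folded in)
    Assumes, for simplicity, that every option is available in every state.
    """
    n_options, n_states = R_opt.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R_opt + P_opt @ V        # Q[o, s] = r^o_s + sum_x p^o_{sx} V(x)
        V = Q.max(axis=0)            # V_{k+1}(s) = max_o Q[o, s]
    return V, Q
```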
Example – grid world. Reward at the goal = +1, 0 at all other states.
Hallway options: take the agent from anywhere within a room to one of the two hallway cells leading out of that room.
The policy underlying one of the eight hallway options:
β(s) = 0 within the room, β(s) = 1 outside the room (the option terminates as soon as it leaves the room).
Value functions of the states
Value functions of the states (goal location changed)
3.2 SMDP Value Learning
SMDP version of Q-learning
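The slide's update equation did not survive extraction. In the paper, SMDP Q-learning backs up each option execution as a whole: when option o, initiated in s, terminates in s' after k steps with cumulative discounted reward r, the update is Q(s, o) ← Q(s, o) + α [ r + γ^k max_{o' ∈ O_{s'}} Q(s', o') − Q(s, o) ]. A sketch, reusing the hypothetical option and environment interfaces from the earlier examples; the ε-greedy selection and episode truncation are illustrative choices.

```python
import random
from collections import defaultdict

def smdp_q_learning(s0, options, step, gamma=0.9, alpha=0.1, epsilon=0.1, n_episodes=500):
    """SMDP Q-learning: one backup per completed option execution.
    Assumes at least one option is available in every state reached."""
    Q = defaultdict(float)                               # Q[(state, option index)]
    for _ in range(n_episodes):
        s = s0
        for _ in range(100):                             # truncate episodes, for the sketch
            avail = [i for i, o in enumerate(options) if s in o.initiation_set]
            if random.random() < epsilon:
                i = random.choice(avail)                 # explore
            else:
                i = max(avail, key=lambda j: Q[(s, j)])  # exploit
            o, s_start, ret, disc = options[i], s, 0.0, 1.0
            while True:                                  # execute o to its natural termination
                r, s = step(s, o.policy(s))
                ret += disc * r
                disc *= gamma                            # disc == gamma^k at termination
                if random.random() < o.termination(s):
                    break
            best_next = max(Q[(s, j)] for j, o2 in enumerate(options) if s in o2.initiation_set)
            Q[(s_start, i)] += alpha * (ret + disc * best_next - Q[(s_start, i)])
    return Q
```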
Example
4 Interrupting Options
Idea
In the SMDP framework, options are treated as opaque, indivisible units.
Interrupting options: terminate an option before it would terminate naturally according to its termination condition.
At each intermediate step of an option, compare the value of continuing with o against the value of interrupting o and allowing a switch to a new option.
Preparation
Policy μ' over a new set of options O': μ'(s, o') = μ(s, o) for all s ∈ S.
o' : o plus the ability to terminate whenever switching seems better than continuing, according to Q^μ.
o' = ⟨I, π, β'⟩, with β'(s) = 1 whenever Q^μ(s, o) < V^μ(s).
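A sketch of executing an option with interruption: at every intermediate step, if Q(s, o) < V(s) = max_{o'} Q(s, o'), the option is cut short (β'(s) = 1) so a better option can be selected. Q, the option list, and step() are the hypothetical structures used in the earlier sketches.

```python
import random

def run_with_interruption(s, option_index, options, Q, step, gamma=0.9):
    """Execute an option, but let it terminate early whenever Q(s, o) < V(s) = max_{o'} Q(s, o'),
    i.e. whenever switching to another option looks better than continuing."""
    o, ret, disc = options[option_index], 0.0, 1.0
    while True:
        r, s = step(s, o.policy(s))
        ret += disc * r
        disc *= gamma
        if random.random() < o.termination(s):
            return ret, s, "natural termination"
        v = max(Q[(s, j)] for j in range(len(options)))   # V(s) under the current Q
        if Q[(s, option_index)] < v:                      # set beta'(s) = 1: interrupt here
            return ret, s, "interrupted"
```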
Theorem 2 (Interrupting)
For any MDP, any set of options O, and any Markov policy μ : S × O → [0, 1], define a new set of options, O', with a one-to-one mapping between the two option sets as follows: for every o = ⟨I, π, β⟩ ∈ O we define a corresponding o' = ⟨I, π, β'⟩ ∈ O', where β' = β except that for any history h that ends in state s and in which Q^μ(h, o) < V^μ(s), we may choose to set β'(h) = 1. Any histories whose termination conditions are changed in this way are called interrupted histories. Let the interrupted policy μ' be such that for all s ∈ S and for all o' ∈ O', μ'(s, o') = μ(s, o), where o is the option in O corresponding to o'. Then
(i) V^{μ'}(s) ≥ V^μ(s) for all s ∈ S.
(ii) If from state s ∈ S there is a non-zero probability of encountering an interrupted history upon initiating μ' in s, then V^{μ'}(s) > V^μ(s).
Proof of Theorem 2 (sketch)
For an interrupted history, the idea is to show that executing o' and then following μ is not less than (at least as good as) executing o to its natural termination and then following μ.
Then, unrolling μ' one step at a time in this way completes the proof.
Example: a 2-dimensional continuous space.
Primitive move: 0.01 in any direction.
7 controllers (options): each takes the agent to a landmark along a direct path, but is applicable only within a limited range of that landmark.
Reward: −1 on all transitions.
5 Intra-Option Model Learning
Model Learning in SMDP
Update equations, applied when option o, initiated in s, terminates in s' after k steps with cumulative discounted reward R:
r^o_s ← r^o_s + α [ R − r^o_s ]
p^o_{sx} ← p^o_{sx} + α [ γ^k δ_{s'x} − p^o_{sx} ]   for all x ∈ S
Disadvantages: the model is updated only when the option terminates, and only for one option at a time.
For Markov options: intra-option methods instead learn from fragments of experience "within" an option, and can learn about an option without ever executing it (off-policy learning).
Intra-option model learning (Markov option o)
Bellman equations for the model:
reward:     r^o_s = Σ_a π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} (1 − β(s')) r^o_{s'} ]
transition: p^o_{sx} = Σ_a π(s, a) γ Σ_{s'} p^a_{ss'} [ (1 − β(s')) p^o_{s'x} + β(s') δ_{s'x} ]
One-step intra-option model learning: after each transition (s_t, a_t, r_{t+1}, s_{t+1}), update the model of every Markov option whose policy is consistent with a_t in s_t:
r^o_{s_t} ← r^o_{s_t} + α [ r_{t+1} + γ (1 − β(s_{t+1})) r^o_{s_{t+1}} − r^o_{s_t} ]
p^o_{s_t x} ← p^o_{s_t x} + α [ γ (1 − β(s_{t+1})) p^o_{s_{t+1} x} + γ β(s_{t+1}) δ_{s_{t+1} x} − p^o_{s_t x} ]   for all x ∈ S
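A sketch of the one-step intra-option model-learning updates above, applied after a single environment transition to every Markov option consistent with the action taken. r_model and p_model are assumed to be defaultdict(float) tables standing in for r^o and p^o; the deterministic option policies are a simplification.

```python
def intra_option_model_update(transition, options, r_model, p_model, states,
                              gamma=0.9, alpha=0.1):
    """One-step intra-option model updates for a single transition (s, a, r, s').

    r_model and p_model are defaultdict(float) tables standing in for r^o and p^o,
    keyed by (option index, state) and (option index, state, terminal state)."""
    s, a, r, s_next = transition
    for i, o in enumerate(options):
        if o.policy(s) != a:                  # only options consistent with the action taken
            continue
        beta = o.termination(s_next)
        # reward part: r^o_s <- r^o_s + alpha [ r + gamma (1 - beta) r^o_{s'} - r^o_s ]
        target_r = r + gamma * (1.0 - beta) * r_model[(i, s_next)]
        r_model[(i, s)] += alpha * (target_r - r_model[(i, s)])
        # state-prediction part, for every candidate terminal state x
        for x in states:
            target_p = gamma * ((1.0 - beta) * p_model[(i, s_next, x)]
                                + beta * (1.0 if x == s_next else 0.0))
            p_model[(i, s, x)] += alpha * (target_p - p_model[(i, s, x)])
```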
Example
6 Intra-Option Value Learning
SMDP case: a semi-Markov option must be completed before it can be evaluated; each execution yields only one training example, obtained after the option terminates.
For Markov options, intra-option methods are available: more training examples can be extracted from the same experience, and learning is off-policy.
Off-policy idea: make full use of whatever experience occurs, irrespective of which options were responsible for generating it. The target is Q*_O(s, o).
For a semi-Markov option executed from s_t to its termination at s_{t+k}: one training example.
For Markov options: many examples, one per intermediate step, for every consistent option.
Learning method
U*(s, o): the value of a state-option pair given that the Markov option o is executing upon arrival in state s:
U*(s, o) = (1 − β(s)) Q*_O(s, o) + β(s) max_{o' ∈ O_s} Q*_O(s, o')
Bellman equation:
Q*_O(s, o) = Σ_a π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} U*(s', o) ]
Off-policy one-step temporal difference update
Q-update, applied after each transition (s_t, a_t, r_{t+1}, s_{t+1}):
Q_O(s_t, o) ← Q_O(s_t, o) + α [ r_{t+1} + γ U(s_{t+1}, o) − Q_O(s_t, o) ]
where U(s, o) = (1 − β(s)) Q_O(s, o) + β(s) max_{o' ∈ O} Q_O(s, o').
Off-policy learning: apply this update rule to every option o consistent with the action taken, a_t.
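A sketch of this off-policy one-step update, applied after each transition to every Markov option whose (here deterministic) policy would have chosen a_t in s_t. Q is a hypothetical defaultdict(float) keyed by (state, option index).

```python
def intra_option_q_update(transition, options, Q, gamma=0.9, alpha=0.1):
    """Off-policy one-step intra-option Q-learning update for one transition (s, a, r, s').
    Q is a defaultdict(float) keyed by (state, option index)."""
    s, a, r, s_next = transition

    def U(state, i):
        # U(s, o) = (1 - beta(s)) Q(s, o) + beta(s) max_{o'} Q(s, o')
        beta = options[i].termination(state)
        v = max(Q[(state, j)] for j in range(len(options)))
        return (1.0 - beta) * Q[(state, i)] + beta * v

    for i, o in enumerate(options):
        if o.policy(s) != a:                  # update every option consistent with a_t
            continue
        Q[(s, i)] += alpha * (r + gamma * U(s_next, i) - Q[(s, i)])
```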
The options' values are learned without ever selecting the options themselves: the experience was generated by selecting randomly among primitive actions.
7 Subgoals for Learning Options
Subgoals
An option is typically intended to achieve a subgoal of some kind; each option's policy can be adapted to better achieve its subgoal.
Formulating a subgoal: a terminal subgoal value g(s), defined on a subset of states G ⊆ S, expressing how desirable it is to terminate in each state of G.
O_g : the options that terminate only and always in G.
V^o_g(s) : a new state-value function for options o ∈ O_g, the expected value of the cumulative reward if option o is initiated in state s, plus the subgoal value g(s') of the state s' in which it terminates.
Learning method
Action values: Q^o_g(s, a) = V^{ao}_g(s), the value of taking a in s and then following o.
Q-learning-style update after each transition (s_t, a_t, r_{t+1}, s_{t+1}):
if s_{t+1} ∉ G:  Q^o_g(s_t, a_t) ← Q^o_g(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q^o_g(s_{t+1}, a') − Q^o_g(s_t, a_t) ]
if s_{t+1} ∈ G:  Q^o_g(s_t, a_t) ← Q^o_g(s_t, a_t) + α [ r_{t+1} + γ g(s_{t+1}) − Q^o_g(s_t, a_t) ]
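A sketch of this subgoal-conditioned update: an ordinary Q-learning backup outside G, and a backup onto the terminal subgoal value g(s') when the transition enters G. Q_g, G, g, and the action set are hypothetical stand-ins.

```python
def subgoal_q_update(transition, Q_g, G, g, actions, gamma=0.9, alpha=0.1):
    """Q-learning toward a subgoal: ordinary backup outside G, terminal value g(s') inside G.
    Q_g is a defaultdict(float) keyed by (state, action); g maps subgoal states to values."""
    s, a, r, s_next = transition
    if s_next in G:
        target = r + gamma * g(s_next)                               # terminate: bootstrap from g
    else:
        target = r + gamma * max(Q_g[(s_next, b)] for b in actions)  # ordinary Q-learning backup
    Q_g[(s, a)] += alpha * (target - Q_g[(s, a)])
```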
For each hallway option: subgoal value g = +1 for the target hallway, 0 for all states outside the option's room.
8. Conclusion
Representing knowledge flexibly at multiple levels of temporal abstraction.
A framework within the context of reinforcement learning and MDPs, lying between MDPs and SMDPs.
Each set of options defines an SMDP.
Mixtures of actions at different time scales can be analyzed.
Options can be interrupted, constructed, and decomposed.