Model-Free vs. Model-Based RL: Q, SARSA, & E³
Administrivia
Reminder: Office hours tomorrow truncated to 9:00-10:15 AM
    Can schedule other times if necessary
Final projects
    Final presentations Dec 2, 7, 9
    20 min (max) presentations; 3 or 4 per day
    Sign up for presentation slots today!
The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Action space A;
        Discount γ (0 <= γ < 1); Learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
    s = get_current_world_state()
    a = pick_next_action(Q, s)
    (r, s') = act_in_world(a)
    Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
} Until (bored)
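As a concrete illustration, here is a minimal Python sketch of the same update. The environment interface (reset()/step(a) returning (s', r, done)) and the ε-greedy action choice are assumptions standing in for get_current_world_state, act_in_world, and pick_next_action; they are not specified on the slide.

```python
import numpy as np

def q_learn(env, n_states, n_actions, gamma=0.9, alpha=0.1,
            epsilon=0.1, n_steps=10_000):
    """Tabular Q-learning sketch. Assumes a Gym-like env with
    reset() -> s and step(a) -> (s_next, r, done) over integer states."""
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy stand-in for pick_next_action(Q, s)
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)            # act_in_world(a)
        # off-policy target: best action at s', whatever we actually do next
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```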
SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Action space A;
        Discount γ (0 <= γ < 1); Learning rate α (0 <= α < 1)
Outputs: Q
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
    (r, s') = act_in_world(a)
    a' = pick_next_action(Q, s')
    Q(s, a) = Q(s, a) + α*(r + γ*Q(s', a') - Q(s, a))
    a = a'; s = s';
} Until (bored)
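The corresponding Python sketch for SARSA, under the same assumed environment interface; the only change from the Q-learning sketch is that the target uses the action actually chosen at s'.

```python
import numpy as np

def sarsa_learn(env, n_states, n_actions, gamma=0.9, alpha=0.1,
                epsilon=0.1, n_steps=10_000):
    """Tabular SARSA sketch under the same assumed Gym-like interface."""
    def pick_next_action(Q, s):
        # epsilon-greedy stand-in for the slides' pick_next_action
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    a = pick_next_action(Q, s)
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = pick_next_action(Q, s_next)
        # on-policy target: the action we will actually take at s'
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
        if done:
            s = env.reset()
            a = pick_next_action(Q, s)
    return Q
```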
SARSA vs. Q
SARSA and Q-learning are very similar.
SARSA updates Q(s,a) for the policy it's actually executing:
    lets the pick_next_action() function choose the action used in the update.
Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q:
    uses max_a' to choose the action used in the update, which might differ from the action it actually executes at s'.
In practice: Q-learning will learn the "true" π*, but SARSA will learn about what it's actually doing.
Exploration can get Q-learning in trouble...
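Side by side, the only difference between the two updates is the backup target at s':

\[
\text{Q-learning:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\bigr]
\]
\[
\text{SARSA:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl[r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\bigr]
\]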
Radioactive breadcrumbs
Can now define eligibility traces for SARSA.
In addition to the Q(s,a) table, keep an e(s,a) table:
    records an "eligibility" (a real number) for each state/action pair.
At every step (each (s,a,r,s',a') tuple):
    increment e(s,a) for the current (s,a) pair by 1
    update all Q(s'',a'') values in proportion to their e(s'',a'')
    decay all e(s'',a'') by a factor of λγ
Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.
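A minimal sketch of that per-step bookkeeping, assuming NumPy arrays Q and e of shape (n_states, n_actions) and a SARSA TD error delta already computed for the current (s, a):

```python
import numpy as np

def breadcrumb_step(Q, e, s, a, delta, alpha=0.1, gamma=0.9, lam=0.9):
    """One step of accumulating-trace bookkeeping (sketch).
    Q, e: (n_states, n_actions) arrays; delta: TD error at (s, a)."""
    e[s, a] += 1.0            # drop a breadcrumb on the current pair
    Q += alpha * delta * e    # every pair learns in proportion to its eligibility
    e *= lam * gamma          # all breadcrumbs decay by a factor of λγ
    return Q, e
```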
SARSA(λ)-learning algorithm

Algorithm: SARSA(λ)_learn
Inputs: S; A; γ (0 <= γ < 1); α (0 <= α < 1); λ (0 <= λ < 1)
Outputs: Q
e(s, a) = 0                          // for all s, a
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
    (r, s') = act_in_world(a)
    a' = pick_next_action(Q, s')
    δ = r + γ*Q(s', a') - Q(s, a)
    e(s, a) += 1
    foreach (s'', a'') pair in (S × A) {
        Q(s'', a'') = Q(s'', a'') + α*e(s'', a'')*δ
        e(s'', a'') *= λγ
    }
    a = a'; s = s';
} Until (bored)
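Putting it together, a Python sketch of SARSA(λ) with accumulating traces, under the same assumed environment interface as before; clearing the traces at episode boundaries is an added convention, not something shown on the slide.

```python
import numpy as np

def sarsa_lambda_learn(env, n_states, n_actions, gamma=0.9, alpha=0.1,
                       lam=0.9, epsilon=0.1, n_steps=10_000):
    """Tabular SARSA(lambda) with accumulating eligibility traces (sketch)."""
    def pick_next_action(Q, s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    e = np.zeros((n_states, n_actions))
    s = env.reset()
    a = pick_next_action(Q, s)
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = pick_next_action(Q, s_next)
        delta = r + gamma * Q[s_next, a_next] - Q[s, a]
        e[s, a] += 1.0
        # the foreach over S x A becomes two vectorized array operations
        Q += alpha * delta * e
        e *= lam * gamma
        s, a = s_next, a_next
        if done:
            e[:] = 0.0                 # assumed: clear traces between episodes
            s = env.reset()
            a = pick_next_action(Q, s)
    return Q
```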
The trail of crumbs
(figure from Sutton & Barto, Sec. 7.5)
The trail of crumbs
(figure from Sutton & Barto, Sec. 7.5; λ=0 case)
Eligibility for a single state
(figure from Sutton & Barto, Sec. 7.5: e(s_i, a_j) over time, with jumps at the 1st visit, 2nd visit, ...)
Eligibility trace followup
The eligibility trace allows:
    tracking where the agent has been
    backup of rewards over longer periods
    credit assignment: state/action pairs are rewarded for having contributed to getting to the reward
Why does it work?
The "forward view" of eligibility
Original SARSA did a "one-step" backup:
(diagram: information is backed up into Q(s,a) from r_t and Q(s_{t+1}, a_{t+1}); the rest of the trajectory is not used)
The "forward view" of eligibility
Original SARSA did a "one-step" backup.
Could also do a "two-step" backup:
(diagram: information is backed up into Q(s,a) from r_t, r_{t+1}, and Q(s_{t+2}, a_{t+2}); the rest of the trajectory is not used)
The "forward view" of eligibility
Original SARSA did a "one-step" backup.
Could also do a "two-step" backup.
Or even an "n-step" backup:
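Written out, the n-step backup target generalizes the one- and two-step targets above (this is the standard n-step return of Sutton & Barto, Ch. 7, with the reward indexing used on these slides):

\[
R_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} Q(s_{t+n}, a_{t+n})
\]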
The "forward view" of eligibility
Small-step backups (n=1, n=2, etc.) are slow and nearsighted.
Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects.
Want a way to combine them.
Can take a weighted average of different backups, e.g.:
The "forward view" of eligibility
(diagram: an example weighted average of two backups, with weights 1/3 and 2/3)
How do you know which number of steps to average over? And what the weights should be?
Accumulating eligibility traces are just a clever way to easily average over all n:
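Concretely, weighting the n-step target \(R_t^{(n)}\) by \((1-\lambda)\lambda^{n-1}\) gives the λ-return (Sutton & Barto, Ch. 7); the weights sum to 1, and SARSA(λ) with accumulating traces is an incremental way of backing up toward this target:

\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}
\]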
The "forward view" of eligibility
(diagram: the n-step backups are weighted in proportion to λ^0, λ^1, λ^2, ..., λ^{n-1})
Replacing traces
The kind just described are accumulating eligibility traces:
    every time you return to a state/action pair, add extra eligibility (+1).
There are also replacing eligibility traces:
    every time you return to a state/action pair, reset e(s,a) to 1.
    Works better sometimes.
Sutton & Barto, Sec. 7.8
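The only change from the accumulating-trace sketch is one line of the per-step bookkeeping; a small helper makes the contrast explicit (array conventions as in the earlier sketches):

```python
import numpy as np

def bump_trace(e, s, a, replacing=False):
    """Mark a visit to (s, a) in the eligibility table e (sketch)."""
    if replacing:
        e[s, a] = 1.0     # replacing trace: cap eligibility at 1 on each revisit
    else:
        e[s, a] += 1.0    # accumulating trace: eligibility piles up on revisits
    return e

# example: three visits to the same pair
e = np.zeros((5, 2))
for _ in range(3):
    bump_trace(e, 0, 1)                  # accumulating -> e[0, 1] == 3.0
print(e[0, 1])
e = np.zeros((5, 2))
for _ in range(3):
    bump_trace(e, 0, 1, replacing=True)  # replacing -> e[0, 1] == 1.0
print(e[0, 1])
```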
Model-free vs. Model-based
What do you know?
Both Q-learning and SARSA(λ) are model-free methods:
    a.k.a. value-based methods
    learn a Q function
    never learn T or R explicitly
At the end of learning, the agent knows how to act, but doesn't explicitly know anything about the environment.
Also, no guarantees about the explore/exploit tradeoff.
Sometimes, you want one or both of the above.
Model-based methods
Model-based methods, on the other hand, do explicitly learn T & R.
At the end of learning, you have the entire model M = 〈S, A, T, R〉
    and also have π*.
At least one model-based method also guarantees explore/exploit tradeoff properties.
E³: Efficient Explore & Exploit algorithm
Kearns & Singh, Machine Learning 49, 2002
Explicitly keeps a T matrix and an R table.
Plan (policy iteration) with the current T & R -> current π.
Every state/action entry in T and R:
    can be marked known or unknown
    has a #visits counter, nv(s,a)
After every 〈s,a,r,s'〉 tuple, update T & R (running average).
When nv(s,a) > NVthresh, mark the cell as known & re-plan.
When all states are known, done learning & have π*.
The E³ algorithm

Algorithm: E3_learn_sketch        // only an overview
Inputs: S; A; γ (0 <= γ < 1); NVthresh; R_max; Var_max
Outputs: T, R, π*
Initialization:
    R(s) = R_max                   // for all s
    T(s, a, s') = 1/|S|            // for all s, a, s'
    known(s, a) = 0; nv(s, a) = 0  // for all s, a
    π = policy_iter(S, A, T, R)
The E³ algorithm

Algorithm: E3_learn_sketch        // continued
Repeat {
    s = get_current_world_state()
    a = π(s)
    (r, s') = act_in_world(a)
    T(s, a, s') = (1 + T(s, a, s')*nv(s, a)) / (nv(s, a) + 1)
    nv(s, a)++
    if (nv(s, a) > NVthresh) {
        known(s, a) = 1
        π = policy_iter(S, A, T, R)
    }
} Until (all (s, a) known)
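A Python sketch of the model-update bookkeeping from the loop above. Only the T update, visit counting, and known-marking are shown; policy_iter, the R running average, and the rest of E³'s known-state machinery are omitted. Rescaling the unobserved entries of T(s,a,·) so the row stays a probability distribution is an assumption; the slide only shows the update for the observed s'.

```python
import numpy as np

def e3_model_update(T, nv, known, s, a, s_next, nv_thresh):
    """One <s, a, s'> update of the learned transition model (sketch).
    T: (n_states, n_actions, n_states) array, rows initialized to 1/|S|;
    nv: (n_states, n_actions) visit counts; known: boolean array, same shape."""
    n = nv[s, a]
    # running average: the observed entry becomes (T*n + 1)/(n + 1), as on the
    # slide; the other entries are scaled by n/(n + 1) to keep the row normalized
    T[s, a] *= n / (n + 1)
    T[s, a, s_next] += 1.0 / (n + 1)
    nv[s, a] = n + 1
    if nv[s, a] > nv_thresh and not known[s, a]:
        known[s, a] = True
        return True   # signal the caller to re-run policy_iter
    return False
```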