1
Eligibility traces: The “atomic breadcrumbs” approach to RL
2
Administrivia
- P2M1 grades out this morning; let me know if you didn't get a gradesheet
- Reminder: Q3 on Nov 10 (threads & synchronization)
- P2M3 due today
- Happy (very nearly) Halloween!
3
Timeline
Last time:
- Miscellany
- Design problem from midterm (Q5)
- Factory methods design pattern
- Polymorphism vs. explicit tests: case study
- More on Q-learning
Today:
- Eligibility traces & SARSA(λ)
- Design exercise
4
Q-learning in math...
The Q-learning rule says: update the current Q value a fraction of the way toward the reward plus the discounted best next-state Q value:
Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a))
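For a concrete sense of the update (made-up numbers, just for illustration), suppose α = 0.65, γ = 0.9, r = 0, Q(s,a) = 0.2, and max_a' Q(s',a') = 0.5. Then:
Q(s,a) ← 0.2 + 0.65 × (0 + 0.9 × 0.5 − 0.2) = 0.2 + 0.65 × 0.25 = 0.3625
so Q(s,a) moves 65% of the way toward the one-step target r + γ max_a' Q(s',a') = 0.45.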
5
Q-learning in code...
public class MyAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    double Qnow   = _policy.get(start).get(act);   // current Q(s,a)
    double Qnext  = _policy.get(end).findMaxQ();   // max_a' Q(s',a')
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.get(start).put(act, Qrevised);
  }
}
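The slide leaves _policy's type implicit. Purely as a sketch of the per-state table its calls seem to assume (get, put, findMaxQ), something like the hypothetical class below would fit; it is not the course's actual implementation.

import java.util.HashMap;
import java.util.Map;

// Sketch of a per-state Q table: maps actions to Q values.
public class ActionQTable<A> {
  private final Map<A, Double> _qValues = new HashMap<>();

  // Q(s,a) for this state; actions never updated default to 0.0
  public double get(A a) { return _qValues.getOrDefault(a, 0.0); }

  public void put(A a, double q) { _qValues.put(a, q); }

  // max_a Q(s,a); returns 0.0 if no action has been updated yet
  // (assumes untried actions start at Q = 0)
  public double findMaxQ() {
    double best = 0.0;
    for (double q : _qValues.values()) best = Math.max(best, q);
    return best;
  }
}

Here untried actions default to Q = 0; the real class may initialize its values differently.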
6
Why does it work?
Won't give a full explanation here.
Basic intuition: each step of experience "backs up" reward from the goal state toward the beginning.
[Figure: chain of states leading to the goal; r = 0 everywhere except the goal (r = +5); each step "backs up" a chunk of r and Q to the previous state.]
7
Q-learning in action: 15×15 maze world; R(goal) = 1; R(other) = 0; γ = 0.9; α = 0.65
8
Q-learning in action: Initial policy
9
Q-learning in action: After 20 trials
10
Q-learning in action: After 30 trials
11
Q-learning in action: After 100 trials
12
Q-learning in action: After 150 trials
13
Q-learning in action: After 200 trials
14
Q-learning in action: After 250 trials
15
Q-learning in action: After 300 trials
16
Q-learning in action: After 350 trials
17
Q-learning in action: After 400 trials
19
Well, it looks good anyway
But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)?
- Every 10 episodes, "freeze" the policy (turn off learning)
- Measure avg time to goal from a number of starting states
- Average over a number of test trials to iron out noise
- Plot the learning curve: # episodes of learning vs. avg performance (see the sketch below)
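A minimal sketch of that measurement loop, under assumed helpers: MazeEnv, runLearningEpisode, and runFrozenEpisode are illustrative names, not the project's actual API.

import java.util.List;

// Sketch of the learning-curve protocol: learn, periodically freeze and measure.
public class LearningCurve {
  interface MazeEnv {
    void runLearningEpisode(Agent agent);               // learning on (hypothetical)
    int runFrozenEpisode(Agent agent, State2d start);   // learning off; returns steps to goal (hypothetical)
  }

  public static void generate(Agent agent, MazeEnv env, List<State2d> testStarts,
                              int totalEpisodes, int evalEvery, int testTrials) {
    for (int episode = 1; episode <= totalEpisodes; episode++) {
      env.runLearningEpisode(agent);                    // one learning episode
      if (episode % evalEvery == 0) {                   // e.g. every 10 episodes
        double total = 0.0;
        int n = 0;
        for (State2d start : testStarts) {
          for (int t = 0; t < testTrials; t++) {        // repeat to iron out noise
            total += env.runFrozenEpisode(agent, start);
            n++;
          }
        }
        System.out.println(episode + "\t" + (total / n)); // one point on the curve
      }
    }
  }
}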
20
Learning performance
21
Notes on learning performance
- After 400 learning episodes, it still hasn't asymptoted
- Note: that's ~700,000 steps of experience!!!
- Q-learning is really, really slow!!!
- The same holds for many RL methods (sadly)
22
That's so inefficient!
The big problem with Q-learning: each step of experience only "backs up" information by one step in the world, so it takes a really, really long time to back up info all the way to the start state.
How can we do better? We'd like to propagate info further on each step. Ideas?
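To see the one-step bottleneck concretely, here is a small illustrative simulation (not from the lecture) of one-step Q-learning on a 5-state corridor with reward only at the far end; the nonzero Q values creep backward by just one state per episode.

// States 0..4, deterministic "right" moves, reward +5 only on reaching state 4.
// Each full left-to-right episode moves the nonzero Q values back by at most one state.
public class ChainBackupDemo {
  public static void main(String[] args) {
    final int N = 5;
    double alpha = 0.65, gamma = 0.9;
    double[] q = new double[N - 1];                   // Q(s, right) for s = 0..3

    for (int episode = 1; episode <= 4; episode++) {
      for (int s = 0; s < N - 1; s++) {               // walk the corridor left to right
        double r = (s == N - 2) ? 5.0 : 0.0;          // +5 on entering the goal
        double qNext = (s == N - 2) ? 0.0 : q[s + 1]; // max_a' Q(s',a'); goal is terminal
        q[s] += alpha * (r + gamma * qNext - q[s]);
      }
      System.out.println("after episode " + episode + ": " + java.util.Arrays.toString(q));
    }
  }
}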
23
Eligibility traces
Key idea: keep extra information around:
- which states were visited this trial
- how long ago they were visited
This extra bookkeeping info is called the "eligibility trace", written e(s,a).
On each step, update all state/action pairs in proportion to their eligibility.
Efficiency note: you really only need to update the (s,a) pairs where e(s,a) != 0 (one way to do that is sketched below).
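One way to honor that efficiency note (a sketch, not the course's class): keep the traces in a map keyed by state and action, so that sweeping over "all eligible pairs" only touches entries with e(s,a) != 0.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical eligibility-trace table: only pairs with nonzero traces are stored.
public class EligibilityTable<S, A> {
  private final Map<S, Map<A, Double>> _e = new HashMap<>();

  // e(s,a); pairs never visited (or dropped) read as 0.0
  public double get(S s, A a) {
    Map<A, Double> row = _e.get(s);
    return (row == null) ? 0.0 : row.getOrDefault(a, 0.0);
  }

  // bump e(s,a) by 1 when (s,a) is visited
  public void visit(S s, A a) {
    _e.computeIfAbsent(s, k -> new HashMap<>()).merge(a, 1.0, Double::sum);
  }

  // decay every stored trace by gamma*lambda, dropping entries that get tiny
  public void decayAll(double gamma, double lambda, double cutoff) {
    for (Map<A, Double> row : _e.values()) {
      Iterator<Map.Entry<A, Double>> it = row.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry<A, Double> entry = it.next();
        double decayed = entry.getValue() * gamma * lambda;
        if (decayed < cutoff) it.remove(); else entry.setValue(decayed);
      }
    }
  }
}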
24
Radioactive breadcrumbs
At every step (i.e., every (s,a,r,s',a') tuple):
- Increment e(s,a) for the current (s,a) pair by 1
- For every pair (s'',a'') in S × A:
  - Update Q(s'',a'') in proportion to e(s'',a'')
  - Decay the trace: e(s'',a'') *= λγ
Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.
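Written out (this matches the code on the next slide), each step computes the one-step TD error δ and then sweeps the eligible pairs:
δ = r + γ Q(s',a') − Q(s,a)
e(s,a) ← e(s,a) + 1
for every (s'',a'') with e(s'',a'') ≠ 0:
  Q(s'',a'') ← Q(s'',a'') + α δ e(s'',a'')
  e(s'',a'') ← γ λ e(s'',a'')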
25
The SARSA(λ) code
public class SARSAlAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    Action nextAct = pickAction(end);                 // on-policy: the action we'll actually take next
    double Qnow   = _policy.get(start, act);
    double Qnext  = _policy.get(end, nextAct);
    double delta  = r + _gamma * Qnext - Qnow;        // one-step TD error
    setElig(start, act, getElig(start, act) + 1.0);   // mark (s,a) as eligible
    for (SAPair p : getEligiblePairs()) {             // only pairs with e != 0
      double currQ = _policy.get(p.getS(), p.getA());
      _policy.set(p.getS(), p.getA(),
                  currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
      setElig(p.getS(), p.getA(),
              getElig(p.getS(), p.getA()) * _gamma * _lambda);
    }
  }
}
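A quick sanity check on the code above: with λ = 0 the traces decay to zero right after each sweep, so the loop collapses to ordinary one-step SARSA; with λ near 1, credit for each reward is spread much further back along the states visited during the trial.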
26
The SARSA(λ) picture
[Figure: gridworld trajectory; the agent starts here and attaches eligibility to the state/action pairs it visits along the way.]
32
Design Exercise: Experimental Rig
33
Design exercise
For M4/Rollout, you need to be able to:
- Train the agent for many trials, with many steps per trial
- Generate learning curves for the agent's learning:
  - Run some trials with learning turned on
  - Freeze learning
  - Run some trials with learning turned off
  - Average steps-to-goal over those trials
  - Save the average as one point in the curve
Design: objects/methods to support this learning framework (one possible sketch follows below)
Support: different learning algorithms, different environments, different params, a variable number of trials/steps, etc.
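One possible shape for that framework, purely as a sketch; the interface and method names below are illustrative guesses, not the project's required design.

// One possible decomposition for the experimental rig.
interface Environment {
  State2d reset();                              // start a new trial, return the start state
  SARSTuple step(State2d s, Action a);          // apply an action, return the resulting tuple
  boolean atGoal(State2d s);
}

interface LearningAgent extends Agent {         // Agent is the course's existing interface
  Action pickAction(State2d s);
}

class ExperimentRunner {
  private final LearningAgent _agent;
  private final Environment _env;
  private final int _maxSteps;

  ExperimentRunner(LearningAgent agent, Environment env, int maxStepsPerTrial) {
    _agent = agent;
    _env = env;
    _maxSteps = maxStepsPerTrial;
  }

  // Run one trial; learning is on or off ("frozen") per the flag.
  // Returns steps taken to reach the goal (capped at _maxSteps).
  int runTrial(boolean learning) {
    State2d s = _env.reset();
    int steps = 0;
    while (!_env.atGoal(s) && steps < _maxSteps) {
      Action a = _agent.pickAction(s);
      SARSTuple t = _env.step(s, a);
      if (learning) _agent.updateModel(t);      // frozen trials skip the update
      s = t.getNextState();
      steps++;
    }
    return steps;
  }

  // One learning-curve point: train for a while, then freeze and average
  // steps-to-goal over several test trials.
  double curvePoint(int trainTrials, int testTrials) {
    for (int i = 0; i < trainTrials; i++) runTrial(true);
    double total = 0.0;
    for (int i = 0; i < testTrials; i++) total += runTrial(false);
    return total / testTrials;
  }
}

Separating the Environment from the LearningAgent is what lets the same runner drive different learning algorithms, different worlds, and different parameter settings.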