Eligibility traces: The “atomic breadcrumbs” approach to RL.

1 Eligibility traces: The “atomic breadcrumbs” approach to RL

2 Administrivia: P2M1 grades went out this morning; let me know if you didn't get a gradesheet. Reminder: Q3 on Nov 10 (topic: threads & synchronization). P2M3 due today. Happy (very nearly) Halloween!

3 Timeline. Last time: miscellany; design problem from the midterm (Q5); factory methods design pattern; polymorphism vs. explicit tests (case study); more on Q-learning. Today: eligibility traces & SARSA(λ); design exercise.

4 Q-learning in math... The Q-learning rule says: move the current Q value a fraction of the way toward the reward plus the discounted value of the best next-state action: Q(s,a) ← Q(s,a) + α(r + γ·max over a′ of Q(s′,a′) − Q(s,a))
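
For concreteness, a quick check with made-up numbers (these values are illustrative, not from the slides): suppose Q(s,a) = 2.0, r = 1.0, γ = 0.9, α = 0.5, and the best next-state value is max over a′ of Q(s′,a′) = 3.0. Then Q(s,a) ← 2.0 + 0.5·(1.0 + 0.9·3.0 − 2.0) = 2.0 + 0.5·1.7 = 2.85, i.e. the estimate moves partway toward the one-step target.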

5 Q-learning in code...

    public class MyAgent implements Agent {
        public void updateModel(SARSTuple s) {
            State2d start = s.getInitState();
            State2d end   = s.getNextState();
            Action act    = s.getAction();
            double r      = s.getReward();
            double Qnow   = _policy.get(start).get(act);   // current estimate Q(s,a)
            double Qnext  = _policy.get(end).findMaxQ();   // max over a' of Q(s',a')
            double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
            _policy.get(start).put(act, Qrevised);         // write back the new estimate
        }
    }

6 Why does it work? Won't give a full explanation here. Basic intuition: each step of experience "backs up" reward from the goal state toward the beginning. (Figure: a chain of states ending in the goal state, with r=0 along the way and r=+5 at the goal; at each step, a chunk of r and Q is "backed up" to the previous state.)
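
To see the one-step-at-a-time propagation concretely, here is a tiny made-up chain (illustrative numbers, not from the slides): states s1 → s2 → goal, reward +5 only on entering the goal, all Q values initially 0, α = 0.5, γ = 0.9. During the first episode, the agent updates Q(s1) while Q(s2) is still 0, so Q(s1) stays 0; the later update at s2 moves Q(s2) to 0 + 0.5·(5 − 0) = 2.5. Only in the second episode does s1 benefit: Q(s1) becomes 0 + 0.5·(0 + 0.9·2.5 − 0) = 1.125. If the agent walks the chain front-to-back, each extra step of distance from the goal costs at least one more episode before any reward information arrives.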

7 Q-learning in action: 15x15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65

8 Q-learning in action: initial policy

9 Q-learning in action: after 20 trials

10 Q-learning in action: after 30 trials

11 Q-learning in action: after 100 trials

12 Q-learning in action: after 150 trials

13 Q-learning in action: after 200 trials

14 Q-learning in action: after 250 trials

15 Q-learning in action: after 300 trials

16 Q-learning in action: after 350 trials

17 Q-learning in action: after 400 trials

18 Well, it looks good anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)?

19 Well, it looks good anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)? Every 10 episodes, "freeze" the policy (turn off learning); measure average time-to-goal from a number of starting states; average over a number of test trials to iron out noise; plot the learning curve: # episodes of learning vs. average performance.
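
The measurement recipe on this slide maps directly onto a small loop. Below is a minimal sketch of it, assuming a hypothetical runEpisode method (supplied elsewhere) that steps the agent through the maze, calls its update only while learning is on, and returns steps-to-goal; none of these names come from the course code.

    // Hypothetical rig for producing learning-curve points; all names assumed.
    public abstract class LearningCurveRig {

        // Run one episode from some start state; update the agent's model only
        // if learn == true; return the number of steps taken to reach the goal.
        protected abstract int runEpisode(boolean learn);

        // Every 10 learning episodes, freeze learning, average steps-to-goal
        // over testTrials evaluation runs, and record one point of the curve.
        public double[] learningCurve(int totalEpisodes, int testTrials) {
            double[] curve = new double[totalEpisodes / 10];
            for (int ep = 1; ep <= totalEpisodes; ep++) {
                runEpisode(true);                     // learning on
                if (ep % 10 == 0) {
                    double sum = 0.0;
                    for (int t = 0; t < testTrials; t++) {
                        sum += runEpisode(false);     // learning frozen
                    }
                    curve[ep / 10 - 1] = sum / testTrials;
                }
            }
            return curve;
        }
    }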

20 Learning performance

21 Notes on learning perf. After 400 learning episodes, it still hasn't asymptoted. Note: that's ~700,000 steps of experience! Q-learning is really, really slow! The same holds for many RL methods (sadly).

22 That's so inefficient! The big problem with Q-learning: each step of experience only "backs up" information by one step in the world, so it takes a really, really long time to back up info all the way to the start state. How can we do better? We want to propagate info further on each step. Ideas?

23 Eligibility traces. Key idea: keep extra information around: which states were visited this trial, and how long ago they were visited. This extra bookkeeping info is called an "eligibility trace", written e(s,a). On each step, update all state/action pairs in proportion to their eligibility. Efficiency note: you really only need to update the (s,a) pairs where e(s,a) != 0.
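
As one concrete (and assumed, not from the slides) way to keep that bookkeeping, the helpers getElig/setElig/getEligiblePairs used in the SARSA(λ) code on slide 25 could sit on top of a simple hash map from state/action pairs to trace values. The SAPair constructor below is hypothetical, and the sketch assumes SAPair implements equals()/hashCode() so it can serve as a map key.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical backing store for e(s,a).
    public class EligibilityTable {
        private final Map<SAPair, Double> _elig = new HashMap<>();

        public double getElig(State2d s, Action a) {
            return _elig.getOrDefault(new SAPair(s, a), 0.0);   // unseen pairs have trace 0
        }

        public void setElig(State2d s, Action a, double value) {
            _elig.put(new SAPair(s, a), value);
        }

        // Per the efficiency note: only pairs that have actually been touched
        // (and so have a nonzero trace) are ever stored or iterated over.
        public Set<SAPair> getEligiblePairs() {
            return _elig.keySet();
        }
    }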

24 Radioactive breadcrumbs. At every step (i.e., each (s,a,r,s',a') tuple): increment e(s,a) for the current (s,a) pair by 1; then, for every pair (s'',a'') in S × A: update Q(s'',a'') in proportion to e(s'',a''), and decay e(s'',a'') by a factor of λγ, i.e. e(s'',a'') *= lambda*gamma. Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.
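
Spelled out as equations (this restates what the code on the next slide does, rather than adding anything beyond the slide): first compute the one-step error δ = r + γ·Q(s′,a′) − Q(s,a); then, for every pair (s′′,a′′): Q(s′′,a′′) ← Q(s′′,a′′) + α·δ·e(s′′,a′′) and e(s′′,a′′) ← γλ·e(s′′,a′′).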

25 The SARSA(λ) code

    public class SARSAlAgent implements Agent {
        public void updateModel(SARSTuple s) {
            State2d start  = s.getInitState();
            State2d end    = s.getNextState();
            Action act     = s.getAction();
            double r       = s.getReward();
            Action nextAct = pickAction(end);               // on-policy: the action to be taken next
            double Qnow    = _policy.get(start, act);
            double Qnext   = _policy.get(end, nextAct);
            double delta   = r + _gamma * Qnext - Qnow;     // one-step TD error
            setElig(start, act, getElig(start, act) + 1.0); // bump trace for the current pair
            for (SAPair p : getEligiblePairs()) {
                double currQ = _policy.get(p.getS(), p.getA());
                _policy.set(p.getS(), p.getA(),
                            currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
                setElig(p.getS(), p.getA(),
                        getElig(p.getS(), p.getA()) * _gamma * _lambda);   // decay trace by gamma*lambda
            }
        }
    }
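
One practical aside (an implementation detail, not something on the slide): because every trace shrinks by a factor of γλ on each step, entries whose e(s,a) has decayed to essentially zero can safely be dropped from the eligible set, which keeps this inner loop short in the spirit of the efficiency note on slide 23.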

26 The SARSA(λ) picture. (Figure: the agent starts here and attaches eligibility to the state/action pairs it visits.)

27 The SARSA(λ) picture

28-31 (figure-only slides with no transcript text)

32 Design Exercise: Experimental Rig

33 Design exercise. For M4/Rollout, we need to be able to: train an agent for many trials (and many steps per trial), and generate learning curves for the agent's learning: run some trials with learning turned on; freeze learning; run some trials with learning turned off; average steps-to-goal over those trials; save that average as one point in the curve. Design: objects/methods to support this learning framework. It should support different learning algorithms, different environments, different parameters, a variable number of trials/steps, etc. (One possible decomposition is sketched below.)
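
As a hedged starting point only (every name below is an assumption, not the intended answer to the exercise), one possible decomposition separates the environment, the agent, and a runner that owns the trial loop. State2d, Action, SARSTuple, and Agent are the project types already used on the earlier slides.

    // Hypothetical interfaces/classes for the experimental rig.
    interface Environment {
        State2d getStartState();               // could randomize the start per trial
        boolean isGoal(State2d s);
        SARSTuple step(State2d s, Action a);   // apply a, observe reward and next state
    }

    interface LearningAgent extends Agent {
        Action pickAction(State2d s);
        void setLearning(boolean on);          // lets the rig freeze/unfreeze learning
    }

    class ExperimentRunner {
        private final LearningAgent _agent;
        private final Environment _env;

        ExperimentRunner(LearningAgent agent, Environment env) {
            _agent = agent;
            _env = env;
        }

        // One trial: follow the agent's policy until the goal is reached,
        // updating the model only when learning is on; return steps-to-goal.
        int runTrial(boolean learn) {
            _agent.setLearning(learn);
            State2d s = _env.getStartState();
            int steps = 0;
            while (!_env.isGoal(s)) {
                Action a = _agent.pickAction(s);
                SARSTuple t = _env.step(s, a);
                if (learn) {
                    _agent.updateModel(t);
                }
                s = t.getNextState();
                steps++;
            }
            return steps;
        }
    }

Swapping in a different learning algorithm, environment, or parameter setting then only touches the object handed to the runner, which is the kind of flexibility the slide asks for.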

