Eligibility traces: The “atomic breadcrumbs” approach to RL.

1 Eligibility traces: The “atomic breadcrumbs” approach to RL

2 Administrivia: P2M1 grades went out this morning; let me know if you didn't get a gradesheet. Reminder: Q3 on Nov 10 (topic: threads & synchronization). P2M3 due today. Happy (very nearly) Halloween!

3 Timeline. Last time: miscellany; design problem from the midterm (Q5); factory methods design pattern; polymorphism vs. explicit tests (case study); more on Q-learning. Today: eligibility traces & SARSA(λ); design exercise.

4 Q-learning in math... The Q-learning rule says: move the current Q value a fraction of the way toward the reward plus the discounted value of the best next-state action: Q(s,a) ← Q(s,a) + α(r + γ·max over a′ of Q(s′,a′) − Q(s,a))
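
For concreteness, a quick check with made-up numbers (these values are illustrative, not from the slides): suppose Q(s,a) = 2.0, r = 1.0, γ = 0.9, α = 0.5, and the best next-state value is max over a′ of Q(s′,a′) = 3.0. Then Q(s,a) ← 2.0 + 0.5·(1.0 + 0.9·3.0 − 2.0) = 2.0 + 0.5·1.7 = 2.85, i.e. the estimate moves partway toward the one-step target.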

5 Q-learning in code...

    public class MyAgent implements Agent {
        public void updateModel(SARSTuple s) {
            State2d start = s.getInitState();
            State2d end   = s.getNextState();
            Action act    = s.getAction();
            double r      = s.getReward();
            double Qnow   = _policy.get(start).get(act);   // current estimate Q(s,a)
            double Qnext  = _policy.get(end).findMaxQ();   // max over a' of Q(s',a')
            double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
            _policy.get(start).put(act, Qrevised);         // write back the new estimate
        }
    }

6 Why does it work? Won't give a full explanation here. Basic intuition: each step of experience "backs up" reward from the goal state toward the beginning. (Figure: a chain of states ending in the goal state, with r=0 along the way and r=+5 at the goal; at each step, a chunk of r and Q is "backed up" to the previous state.)
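
To see the one-step-at-a-time propagation concretely, here is a tiny made-up chain (illustrative numbers, not from the slides): states s1 → s2 → goal, reward +5 only on entering the goal, all Q values initially 0, α = 0.5, γ = 0.9. During the first episode, the agent updates Q(s1) while Q(s2) is still 0, so Q(s1) stays 0; the later update at s2 moves Q(s2) to 0 + 0.5·(5 − 0) = 2.5. Only in the second episode does s1 benefit: Q(s1) becomes 0 + 0.5·(0 + 0.9·2.5 − 0) = 1.125. If the agent walks the chain front-to-back, each extra step of distance from the goal costs at least one more episode before any reward information arrives.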

7 Q-learning in action: 15x15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65

8 Q-learning in action: initial policy

9 Q-learning in action: after 20 trials

10 Q-learning in action: after 30 trials

11 Q-learning in action: after 100 trials

12 Q-learning in action: after 150 trials

13 Q-learning in action: after 200 trials

14 Q-learning in action: after 250 trials

15 Q-learning in action: after 300 trials

16 Q-learning in action: after 350 trials

17 Q-learning in action: after 400 trials

18 Well, it looks good anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)?

19 Well, it looks good anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)? Every 10 episodes, "freeze" the policy (turn off learning); measure average time-to-goal from a number of starting states; average over a number of test trials to iron out noise; plot the learning curve: # episodes of learning vs. average performance.
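
The measurement recipe on this slide maps directly onto a small loop. Below is a minimal sketch of it, assuming a hypothetical runEpisode method (supplied elsewhere) that steps the agent through the maze, calls its update only while learning is on, and returns steps-to-goal; none of these names come from the course code.

    // Hypothetical rig for producing learning-curve points; all names assumed.
    public abstract class LearningCurveRig {

        // Run one episode from some start state; update the agent's model only
        // if learn == true; return the number of steps taken to reach the goal.
        protected abstract int runEpisode(boolean learn);

        // Every 10 learning episodes, freeze learning, average steps-to-goal
        // over testTrials evaluation runs, and record one point of the curve.
        public double[] learningCurve(int totalEpisodes, int testTrials) {
            double[] curve = new double[totalEpisodes / 10];
            for (int ep = 1; ep <= totalEpisodes; ep++) {
                runEpisode(true);                     // learning on
                if (ep % 10 == 0) {
                    double sum = 0.0;
                    for (int t = 0; t < testTrials; t++) {
                        sum += runEpisode(false);     // learning frozen
                    }
                    curve[ep / 10 - 1] = sum / testTrials;
                }
            }
            return curve;
        }
    }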

20 Learning performance

21 Notes on learning perf. After 400 learning episodes, it still hasn't asymptoted. Note: that's ~700,000 steps of experience! Q-learning is really, really slow! The same holds for many RL methods (sadly).

22 That's so inefficient! The big problem with Q-learning: each step of experience only "backs up" information by one step in the world, so it takes a really, really long time to back up info all the way to the start state. How can we do better? We want to propagate info further on each step. Ideas?

23 Eligibility traces. Key idea: keep extra information around: which states were visited this trial, and how long ago they were visited. This extra bookkeeping info is called an "eligibility trace", written e(s,a). On each step, update all state/action pairs in proportion to their eligibility. Efficiency note: you really only need to update the (s,a) pairs where e(s,a) != 0.
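
As one concrete (and assumed, not from the slides) way to keep that bookkeeping, the helpers getElig/setElig/getEligiblePairs used in the SARSA(λ) code on slide 25 could sit on top of a simple hash map from state/action pairs to trace values. The SAPair constructor below is hypothetical, and the sketch assumes SAPair implements equals()/hashCode() so it can serve as a map key.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical backing store for e(s,a).
    public class EligibilityTable {
        private final Map<SAPair, Double> _elig = new HashMap<>();

        public double getElig(State2d s, Action a) {
            return _elig.getOrDefault(new SAPair(s, a), 0.0);   // unseen pairs have trace 0
        }

        public void setElig(State2d s, Action a, double value) {
            _elig.put(new SAPair(s, a), value);
        }

        // Per the efficiency note: only pairs that have actually been touched
        // (and so have a nonzero trace) are ever stored or iterated over.
        public Set<SAPair> getEligiblePairs() {
            return _elig.keySet();
        }
    }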

24 Radioactive breadcrumbs. At every step (i.e., each (s,a,r,s',a') tuple): increment e(s,a) for the current (s,a) pair by 1; then, for every pair (s'',a'') in S × A: update Q(s'',a'') in proportion to e(s'',a''), and decay e(s'',a'') by a factor of λγ, i.e. e(s'',a'') *= lambda*gamma. Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.
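
Spelled out as equations (this restates what the code on the next slide does, rather than adding anything beyond the slide): first compute the one-step error δ = r + γ·Q(s′,a′) − Q(s,a); then, for every pair (s′′,a′′): Q(s′′,a′′) ← Q(s′′,a′′) + α·δ·e(s′′,a′′) and e(s′′,a′′) ← γλ·e(s′′,a′′).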

25 The SARSA(λ) code

    public class SARSAlAgent implements Agent {
        public void updateModel(SARSTuple s) {
            State2d start  = s.getInitState();
            State2d end    = s.getNextState();
            Action act     = s.getAction();
            double r       = s.getReward();
            Action nextAct = pickAction(end);               // on-policy: the action to be taken next
            double Qnow    = _policy.get(start, act);
            double Qnext   = _policy.get(end, nextAct);
            double delta   = r + _gamma * Qnext - Qnow;     // one-step TD error
            setElig(start, act, getElig(start, act) + 1.0); // bump trace for the current pair
            for (SAPair p : getEligiblePairs()) {
                double currQ = _policy.get(p.getS(), p.getA());
                _policy.set(p.getS(), p.getA(),
                            currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
                setElig(p.getS(), p.getA(),
                        getElig(p.getS(), p.getA()) * _gamma * _lambda);   // decay trace by gamma*lambda
            }
        }
    }
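
One practical aside (an implementation detail, not something on the slide): because every trace shrinks by a factor of γλ on each step, entries whose e(s,a) has decayed to essentially zero can safely be dropped from the eligible set, which keeps this inner loop short in the spirit of the efficiency note on slide 23.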

26 The SARSA(λ) picture. (Figure: the agent starts here and attaches eligibility to the state/action pairs it visits.)

27 The SARSA(λ) picture

28-31 (figure-only slides with no transcript text)

32 Design Exercise: Experimental Rig

33 Design exercise. For M4/Rollout, we need to be able to: train an agent for many trials (and many steps per trial), and generate learning curves for the agent's learning: run some trials with learning turned on; freeze learning; run some trials with learning turned off; average steps-to-goal over those trials; save that average as one point in the curve. Design: objects/methods to support this learning framework. It should support different learning algorithms, different environments, different parameters, a variable number of trials/steps, etc. (One possible decomposition is sketched below.)
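
As a hedged starting point only (every name below is an assumption, not the intended answer to the exercise), one possible decomposition separates the environment, the agent, and a runner that owns the trial loop. State2d, Action, SARSTuple, and Agent are the project types already used on the earlier slides.

    // Hypothetical interfaces/classes for the experimental rig.
    interface Environment {
        State2d getStartState();               // could randomize the start per trial
        boolean isGoal(State2d s);
        SARSTuple step(State2d s, Action a);   // apply a, observe reward and next state
    }

    interface LearningAgent extends Agent {
        Action pickAction(State2d s);
        void setLearning(boolean on);          // lets the rig freeze/unfreeze learning
    }

    class ExperimentRunner {
        private final LearningAgent _agent;
        private final Environment _env;

        ExperimentRunner(LearningAgent agent, Environment env) {
            _agent = agent;
            _env = env;
        }

        // One trial: follow the agent's policy until the goal is reached,
        // updating the model only when learning is on; return steps-to-goal.
        int runTrial(boolean learn) {
            _agent.setLearning(learn);
            State2d s = _env.getStartState();
            int steps = 0;
            while (!_env.isGoal(s)) {
                Action a = _agent.pickAction(s);
                SARSTuple t = _env.step(s, a);
                if (learn) {
                    _agent.updateModel(t);
                }
                s = t.getNextState();
                steps++;
            }
            return steps;
        }
    }

Swapping in a different learning algorithm, environment, or parameter setting then only touches the object handed to the runner, which is the kind of flexibility the slide asks for.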

