Odds & Ends
Administrivia
- Reminder: Q3 Nov 10
- CS outreach: UNM SOE holding open house for HS seniors
  - Want CS dept participation
  - We want to show off the coolest things in CS
  - Come demo your P1 and P2 code!
  - Contact me or Lynne Jacobson
The bird of time...
Last time:
- Eligibility traces
- The SARSA(λ) algorithm
- Design exercise
This time:
- Tip o’ the day
- Notes on exploration
- Design exercise, cont’d.
Tip o’ the day
Micro-experiments
Often, often, often when hacking:
- “How the heck does that function work?”
- “The docs don’t say what happens when you hand null to the constructor...”
- “Uhhh... Will this work if I do it this way?”
- “WTF does that mean?”
Could spend a bunch of time in the docs
Or... could just go and try it
Tip o’ the day
Answer: micro-experiments
- Write a very small (<50 line) test program to make sure you understand what the thing does (see the sketch below)
- Think: homework assignment from CS152
- Quick to write
- Answers the question better than the docs can
- Builds your intuition about what the machine is doing
- Using the debugger to watch is also good
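For example, a micro-experiment probing “what does String.split() actually do with empty fields and a trailing delimiter?” might look like this (a throwaway sketch of the idea, not part of any course code):

  // MicroExperiment.java -- run it, look at the output, throw it away
  public class MicroExperiment {
      public static void main(String[] args) {
          String s = "a,,b,";
          String[] parts = s.split(",");
          System.out.println("number of parts: " + parts.length);
          for (String p : parts) {
              System.out.println("[" + p + "]");
          }
      }
  }

Ten lines, compiles in seconds, and the answer it prints is more trustworthy than your memory of the docs.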
Action selection in RL
Q learning in code...

public class MyAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    // Q-learning: evaluate the next state by its best (greedy) action
    Action nextAct = _policy.argmaxAct(end);
    double Qnow = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.set(start, act, Qrevised);
  }
}
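The body of updateModel() is the standard one-step Q-learning update, with getAlpha() and getGamma() supplying the step size $\alpha$ and discount $\gamma$:

  $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$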
The SARSA(λ) code

public class SARSAlAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    // SARSA: evaluate the next state by the action the agent actually picks
    Action nextAct = pickAction(end);
    double Qnow = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    double delta = r + _gamma * Qnext - Qnow;
    // bump the eligibility of the pair we just visited
    setElig(start, act, getElig(start, act) + 1.0);
    // push the TD error back along all eligible pairs, then decay their traces
    for (SAPair p : getEligiblePairs()) {
      double currQ = _policy.get(p.getS(), p.getA());
      _policy.set(p.getS(), p.getA(),
                  currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
      setElig(p.getS(), p.getA(),
              getElig(p.getS(), p.getA()) * _gamma * _lambda);
    }
  }
}
Q & SARSA(λ): Key diffs
Use of eligibility traces
- Q-learning updates only a single step of history
- SARSA(λ) keeps a record of visited state/action pairs: e(s,a)
- Updates each Q(s,a) value in proportion to e(s,a) (see the update rules below)
- Decays e(s,a) by γλ each step
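Written out, one SARSA(λ) update with accumulating traces (matching the code above) is:

  $\delta = r + \gamma Q(s',a') - Q(s,a)$
  $e(s,a) \leftarrow e(s,a) + 1$   (for the pair just visited)
  $Q(\tilde{s},\tilde{a}) \leftarrow Q(\tilde{s},\tilde{a}) + \alpha \, \delta \, e(\tilde{s},\tilde{a})$   (for every eligible pair)
  $e(\tilde{s},\tilde{a}) \leftarrow \gamma \lambda \, e(\tilde{s},\tilde{a})$   (for every eligible pair)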
Q & SARSA(λ): Key diffs
How the “next state” action is picked
- Q: nextAct=_policy.argmaxAct(end)
  - Picks the “best” action in the next state
- SARSA: nextAct=RLAgent.pickAction(end)
  - Picks the action the agent itself would actually take next
Huh? What’s the difference?
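The difference is in the backup target. Q-learning backs up the value of the greedy action in the next state, while SARSA backs up the value of the action its own (possibly exploring) policy actually selects:

  Q-learning target: $r + \gamma \max_{a'} Q(s',a')$        SARSA target: $r + \gamma Q(s',a')$, where $a'$ is the action the agent will really take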
Exploration vs. exploitation
- Sometimes, the agent wants to do something other than the “best currently known action”
- Why? If the agent never tries anything new, it may never discover that there’s a better answer out there...
- Called the “exploration vs. exploitation” tradeoff
- Is it better to “explore” to find new stuff, or to “exploit” what you already know?
ε-Greedy exploration
Answer:
- “Most of the time” do the best known thing: act=argmax_a( Q(s,a) )
- “Rarely” try something random: act=pickAtRandom(allActionSet)
ε-greedy exploration policies:
- “rarely” == prob ε
- “most of the time” == prob 1-ε
ε-Greedy in code

public class eGreedyAgent implements RLAgent {
  // implements the e-greedy exploration policy
  public Action pickAction(State2d s) {
    final double rVal = _rand.nextDouble();
    if (rVal < _epsilon) {
      // with probability epsilon: explore -- pick an action uniformly at random
      return randPick(_ASet);
    }
    // otherwise: exploit -- take the greedy action under the current policy
    return _policy.argmaxAct(s);
  }
  private final Set<Action> _ASet;
  private final double _epsilon;
}
Design Exercise: Experimental Rig
Design exercise
For M4/Rollout, need to be able to:
- Train the agent for many trials/steps per trial
- Generate learning curves for the agent’s learning:
  - Run some trials w/ learning turned on
  - Freeze learning
  - Run some trials w/ learning turned off
  - Average steps-to-goal over those trials
  - Save the average as one point in the curve
Design: objects/methods to support this learning framework (one possible sketch below)
Support: different learning algs, different environments, different params, variable # of trials/steps, etc.
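One way the pieces could fit together (a rough sketch only -- LearningAgent, Environment, and every method name below are assumptions made up for illustration, not the interfaces provided in the framework):

  import java.util.ArrayList;
  import java.util.List;

  interface LearningAgent {
      void setLearning(boolean on);        // hypothetical: freeze/unfreeze updates
  }

  interface Environment {
      int runTrial(LearningAgent agent);   // hypothetical: run one trial, return steps-to-goal
  }

  public class Experiment {
      private final LearningAgent _agent;  // any learning alg: Q, SARSA(lambda), ...
      private final Environment _env;      // any environment

      public Experiment(LearningAgent agent, Environment env) {
          _agent = agent;
          _env = env;
      }

      // One learning curve: one averaged steps-to-goal point per training block.
      public List<Double> learningCurve(int blocks, int trainTrialsPerBlock, int evalTrials) {
          List<Double> curve = new ArrayList<>();
          for (int b = 0; b < blocks; b++) {
              _agent.setLearning(true);                // train with learning on
              for (int t = 0; t < trainTrialsPerBlock; t++) {
                  _env.runTrial(_agent);
              }
              _agent.setLearning(false);               // freeze learning
              double totalSteps = 0.0;
              for (int t = 0; t < evalTrials; t++) {   // evaluate with learning off
                  totalSteps += _env.runTrial(_agent);
              }
              curve.add(totalSteps / evalTrials);      // one point on the curve
          }
          return curve;
      }
  }

Varying the learning algorithm, the environment, or the parameters then just means passing different objects into Experiment.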