Q-learning cont’d & other stuff: a day of miscellany
Definition o’ the day Method: A trick that you use more than once.
Administrivia Midterm back today. P2M1 back tomorrow (fingers crossed). P2M3 due on Thurs. P2 rollout Nov 8. Q3 Nov 10: threads & synchronization. Know: what a thread is; what synchronization is for / why you need it; how you do synchronization in Java (see the sketch below).
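As a reminder of what "synchronization in Java" means for Q3, here is a minimal sketch (the CounterDemo class, its field names, and the loop counts are invented for illustration; only the general Thread/synchronized machinery is standard Java):

    // Minimal sketch: two threads incrementing a shared counter.
    // Without "synchronized", count++ (a read-modify-write) can
    // interleave across threads and lose updates.
    public class CounterDemo {
        private int count = 0;

        // Only one thread at a time may run a synchronized method
        // on the same object; this protects count.
        public synchronized void increment() { count++; }
        public synchronized int get() { return count; }

        public static void main(String[] args) throws InterruptedException {
            final CounterDemo c = new CounterDemo();
            Runnable work = new Runnable() {
                public void run() {
                    for (int i = 0; i < 100000; i++) c.increment();
                }
            };
            Thread t1 = new Thread(work);
            Thread t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join(); t2.join();          // wait for both threads to finish
            System.out.println(c.get());   // always 200000 with synchronization
        }
    }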
Yesterday & today Last time: midterm exam. Before that: Q-learning algorithm. Today: midterm back; midterm discussion/design; factory method design pattern; design principle o’ the day; more notes on Q-learning.
Midterm Graded and back: μ=54, σ=15, median=58.
Design principle o’ the day Use polymorphism instead of tests. “In an OO language, if you’re writing a switch statement, 80% of the time you’re doing the wrong thing.” Similar for if -- many (not all) if statements can be avoided through careful use of polymorphism. Instead of testing data, make each piece of data know what to do with itself.
Polymorphism vs tests
Bad old procedural programming way:
    Vec2d v = compute();
    if (v.type == VT_BLUE) {
        // do blue thing
    } else {
        // do red thing
    }
Good, shiny new OO way:
    Vec2d v = processorObject.compute();
    v.doYourThing();
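Fleshed out a little (BlueVec2d, RedVec2d, and the println bodies are invented for illustration; only Vec2d, the VT_BLUE type-tag test, and doYourThing() come from the slide):

    // Each subclass knows its own behavior; callers never
    // inspect a type tag like VT_BLUE.
    public abstract class Vec2d {
        public abstract void doYourThing();
    }

    class BlueVec2d extends Vec2d {
        public void doYourThing() { System.out.println("doing the blue thing"); }
    }

    class RedVec2d extends Vec2d {
        public void doYourThing() { System.out.println("doing the red thing"); }
    }

The caller’s two lines never change no matter which concrete type compute() decides to return; adding a green variant later means adding one new subclass, not editing every if/switch that tests the tag.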
Polymorphism case study Current research code that I’m writing (similar to P2, except for continuous worlds). As of this writing: 2,036 lines of code (incl. comments); 40 classes; 4 packages; 0 switch statements; 40 occurrences of the string “if” (1.9%).
A closer look Of those 40 occurrences of if: 6 in comments; 4 in .equals(); downcasts (if (o instanceof TypeBlah)); a single method testing intersection of line segments -- very tricky, lots of mathematical special cases. Beyond that... only 6 if statements in 1,906 lines of code == 0.3% of the code...
Q-learning, cont’d
Review: Q functions The implicit policy uses the idea of a “Q function”: Q : S × A → Reals. For each action at each state, Q says how good/bad that action is: if Q(s_i, a_1) > Q(s_i, a_2), then a_1 is a “better” action than a_2 at state s_i. Represented in code with a Map from an Action to the value (Q) of that Action (e.g. Map<Action, Double>).
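A rough sketch of what that could look like (the nested-map layout, the class name QTableSketch, and the default-to-0.0 behavior are assumptions for illustration, not a prescribed design; the real _policy in the update code further down evidently maps each state to an object that also offers findMaxQ()):

    import java.util.HashMap;
    import java.util.Map;

    public class QTableSketch {
        // The whole Q function: one Action-to-Q map per state.
        private final Map<State2d, Map<Action, Double>> _policy =
            new HashMap<State2d, Map<Action, Double>>();

        // Q(s, a): look up the state's map, then the action's value.
        public double getQ(State2d s, Action a) {
            Map<Action, Double> qAtS = _policy.get(s);
            if (qAtS == null || qAtS.get(a) == null) return 0.0;  // unseen => default
            return qAtS.get(a);
        }

        // Store a revised Q(s, a).
        public void putQ(State2d s, Action a, double q) {
            Map<Action, Double> qAtS = _policy.get(s);
            if (qAtS == null) {
                qAtS = new HashMap<Action, Double>();
                _policy.put(s, qAtS);
            }
            qAtS.put(a, q);
        }
    }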
“Where should I go now?” At state s_29: Q(s_29, FWD) = 2.38; Q(s_29, BACK) = 1.79; Q(s_29, TURNCLOCK) = 3.49; Q(s_29, TURNCC) = 0.74; Q(s_29, NOOP) = 2.03 ⇒ “Best thing to do is turn clockwise”
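Choosing a move is then just an argmax over that per-state map. A minimal sketch of such a helper (bestAction is an invented name, not part of any provided API; it could live in the sketch class above):

    // Return the action with the largest Q value at one state.
    public static Action bestAction(Map<Action, Double> qAtState) {
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> entry : qAtState.entrySet()) {
            if (entry.getValue() > bestQ) {
                bestQ = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;   // with the s_29 numbers above, this is TURNCLOCK
    }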
Q-learning in math... The Q-learning rule says: update the current Q with a fraction of the next-state Q value: Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
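A quick numeric example (all numbers made up): suppose α = 0.1, γ = 0.9, the current Q(s,a) = 2.0, the step earned r = 1.0, and the best known Q at the next state is Q(s′,a′) = 3.0. Then Q(s,a) ← 2.0 + 0.1·(1.0 + 0.9·3.0 − 2.0) = 2.0 + 0.1·1.7 = 2.17: the estimate moves toward the new evidence, but only by a fraction (α) of the discrepancy.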
Q-learning in code...
    public class MyAgent implements Agent {
        public void updateModel(SARSTuple s) {
            State2d start = s.getInitState();
            State2d end   = s.getNextState();
            Action  act   = s.getAction();
            double  r     = s.getReward();

            // Current estimate Q(s,a), and the best known Q at the
            // next state (the gamma*Q(s',a') term in the rule).
            double Qnow  = _policy.get(start).get(act);
            double Qnext = _policy.get(end).findMaxQ();

            // Q(s,a) <- Q(s,a) + alpha*(r + gamma*Qnext - Q(s,a))
            double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
            _policy.get(start).put(act, Qrevised);
        }
    }
Advice on params Q-learning requires 2 parameters: α : “learning rate” Good range for α : γ : “discount factor” Good range for γ :
What’s going on? (Diagram: the agent takes action a from state s to state s′.) The agent begins one step at state s and examines the Q value for each action; it takes action a and ends up at s′, getting reward r. Now it wants to revise Q(s,a) at the start state. For that it needs a Q value for some action at the end state s′, so it picks the best currently known action at s′ (call it a′), and sets Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a)).
Why does it work? Won’t give a full explanation here. Basic intuition: each step of experience “backs up” reward from the goal state toward the beginning. (Diagram: a chain of states ending in a goal state with r=+5, r=0 elsewhere; each step “backs up” a chunk of r and Q to the previous state.)
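To make the backup concrete (numbers made up): take a three-state chain s_1 → s_2 → goal, where the step into the goal earns r = +5 and every other step earns r = 0; let α = 0.5, γ = 0.9, and start all Q values at 0. On the first run down the chain, the s_1 → s_2 update changes nothing (everything downstream is still 0), but the s_2 → goal step sets Q(s_2, FWD) ← 0 + 0.5·(5 + 0 − 0) = 2.5. On the second run, the s_1 → s_2 step sees that 2.5 through the γQ(s′,a′) term and sets Q(s_1, FWD) ← 0 + 0.5·(0 + 0.9·2.5 − 0) = 1.125, while Q(s_2, FWD) climbs to 2.5 + 0.5·(5 − 2.5) = 3.75. Each pass pushes a chunk of the goal reward one state further back.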