
1 Q learning cont’d & other stuff A day of miscellany

2 Definition o’ the day Method: A trick that you use more than once.

3 Administrivia Midterm back today. P2M1 back tomorrow (fingers crossed). P2M3 due on Thursday. P2 rollout Nov 8. Q3 Nov 10: threads & synchronization. Know: what a thread is, what synchronization is for and why you need it, and how you do synchronization in Java.

4 Yesterday & today Last time: midterm exam. Before that: the Q-learning algorithm. Today: midterm back, midterm discussion/design, factory method design pattern, design principle o’ the day, and more notes on Q-learning.

5 Midterm Graded and back: μ = 54, σ = 15, median = 58.

6 Design principle o’ the day Use polymorphism instead of tests. “In an OO language, if you’re writing a switch statement, 80% of the time you’re doing the wrong thing.” The same goes for if: many (not all) if statements can be avoided through careful use of polymorphism. Instead of testing data, make each piece of data know what to do with itself.

7 Polymorphism vs tests
Bad old procedural programming way:
Vec2d v = compute();
if (v.type == VT_BLUE) {
    // do blue thing
} else {
    // do red thing
}
Good, shiny new OO way:
Vec2d v = processorObject.compute();
v.doYourThing();
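To make the contrast concrete, here is a minimal Java sketch of the same idea; the BlueVec2d/RedVec2d subclasses and the println bodies are illustrative stand-ins, not the actual research or course code.
// Hypothetical sketch: each Vec2d subclass knows how to "do its thing",
// so callers never test a type tag.
abstract class Vec2d {
    abstract void doYourThing();
}

class BlueVec2d extends Vec2d {
    void doYourThing() { System.out.println("doing the blue thing"); }
}

class RedVec2d extends Vec2d {
    void doYourThing() { System.out.println("doing the red thing"); }
}

class PolymorphismDemo {
    public static void main(String[] args) {
        Vec2d v = new BlueVec2d();   // in real code this would come from compute()
        v.doYourThing();             // dynamic dispatch replaces the if/switch
    }
}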

8 Polymorphism case study Current research code that I’m writing (similar to P2, except for continuous worlds). As of this writing: 2,036 lines of code (including comments), 40 classes, 4 packages, 0 switch statements, 40 occurrences of the string “if” (1.9%).

9 A closer look Of those 40 occurrences of if: 6 are in comments; 4 are in .equals() downcasts: if (o instanceof TypeBlah)...; 24 are in a single method testing intersection of line segments -- very tricky, with lots of mathematical special cases. Beyond that... only 6 if statements in 1,906 lines of code == 0.3% of the code...

10 Q-learning, cont’d

11 Review: Q functions The implicit policy uses the idea of a “Q function”: Q : S × A → Reals. For each action at each state, it says how good or bad that action is. If Q(s_i, a_1) > Q(s_i, a_2), then a_1 is a “better” action than a_2 at state s_i. Represented in code with a Map: a mapping from an Action to the value (Q) of that Action.
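As a rough illustration of that representation (the class names, stub types, and default-to-zero behavior here are assumptions for the sketch, not the actual P2 code):
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a tabular Q function. State2d and Action stand in
// for the project's own classes; a real State2d would need equals/hashCode.
enum Action { FWD, BACK, TURNCLOCK, TURNCC, NOOP }
class State2d { /* whatever describes a world state */ }

class QTable {
    // For each state, a map from each Action to its current Q value
    private final Map<State2d, Map<Action, Double>> table = new HashMap<>();

    double get(State2d s, Action a) {
        // Unseen (state, action) pairs default to a Q value of 0.0
        return table.getOrDefault(s, new HashMap<>()).getOrDefault(a, 0.0);
    }

    void put(State2d s, Action a, double q) {
        table.computeIfAbsent(s, k -> new HashMap<>()).put(a, q);
    }
}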

12 “Where should I go now?” At state s29: Q(s29, FWD) = 2.38; Q(s29, BACK) = 1.79; Q(s29, TURNCLOCK) = 3.49; Q(s29, TURNCC) = 0.74; Q(s29, NOOP) = 2.03 ⇒ “Best thing to do is turn clockwise”

13 Q learning in math... The Q-learning rule says: update the current Q value with a fraction of the next state’s Q value: Q(s,a) ← Q(s,a) + α(r + γ·Q(s′,a′) − Q(s,a))
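As a worked example with made-up numbers (α = 0.5, γ = 0.9 are purely illustrative): if Q(s,a) = 2.0, the step earned r = 1.0, and the best known action at s′ has Q(s′,a′) = 3.0, then Q(s,a) ← 2.0 + 0.5(1.0 + 0.9·3.0 − 2.0) = 2.0 + 0.5(1.7) = 2.85.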

14 Q learning in code...
public class MyAgent implements Agent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();
        double Qnow = _policy.get(start).get(act);    // current estimate of Q(start, act)
        double Qnext = _policy.get(end).findMaxQ();   // best known Q value at the next state
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);  // Q-learning update
        _policy.get(start).put(act, Qrevised);
    }
}
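The findMaxQ() call above just asks the end state’s action table for its largest Q value. A minimal sketch of such a helper, assuming the per-state table is a Map from Action to Double (an assumption about the project code, not its actual API):
import java.util.Collections;
import java.util.Map;

// Hypothetical helper (Action as in the earlier sketch): the largest Q value
// over all actions known at a state; 0.0 if the state has never been seen.
class QHelpers {
    static double findMaxQ(Map<Action, Double> qValuesForState) {
        if (qValuesForState == null || qValuesForState.isEmpty()) {
            return 0.0;
        }
        return Collections.max(qValuesForState.values());
    }
}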

15 Advice on params Q-learning requires 2 parameters: α, the “learning rate” (good range: 0.3 to 0.7), and γ, the “discount factor” (good range: 0.9 to 0.999).

16 What’s going on? [Diagram: one step from state s to state s′ via action a.] The agent begins one step at state s and examines the Q value for each action. It takes action a, ends up at s′, and gets reward r. It now wants to revise Q(s,a) at the start state, so it needs a Q value for some action at the end state s′: pick the best currently known action at s′, call it a′. Then set Q(s,a) ← Q(s,a) + α(r + γ·Q(s′,a′) − Q(s,a)).
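Putting that step into code, a rough sketch of one experience step might look like the following; World, chooseAction, maxOverActions, and the alpha/gamma parameters are illustrative names assumed for this sketch, not the assignment’s actual API.
// Hypothetical sketch of one step of the agent/environment loop.
static void oneStep(World world, QTable qTable, double alpha, double gamma) {
    State2d s = world.currentState();          // agent begins at state s
    Action a = chooseAction(qTable, s);        // examine Q values, pick an action
    double r = world.takeAction(a);            // act; the world returns reward r
    State2d sPrime = world.currentState();     // agent ends up at s'

    double qNow  = qTable.get(s, a);                  // current Q(s,a)
    double qNext = qTable.maxOverActions(sPrime);     // best known Q at s', i.e. Q(s',a')
    qTable.put(s, a, qNow + alpha * (r + gamma * qNext - qNow));
}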

17 Why does it work? We won’t give a full explanation here. Basic intuition: each step of experience “backs up” reward from the goal state toward the beginning. [Diagram: a chain of states with r = 0 everywhere except the goal state, where r = +5; at each step, a chunk of r and Q is “backed up” to the previous state.]
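For a rough feel, with made-up numbers (α = 0.5, γ = 0.9, all Q values starting at 0): the first time the agent steps into the goal and earns r = +5, the previous state’s Q rises to 0 + 0.5(5 + 0.9·0 − 0) = 2.5; on a later episode, the state two steps from the goal picks up 0 + 0.5(0 + 0.9·2.5 − 0) ≈ 1.1, and so on back toward the start.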

