Q learning cont’d & other stuff: A day of miscellany
Definition o’ the day Method: A trick that you use more than once.
Administrivia
  Midterm back today
  P2M1 back tomorrow (fingers crossed)
  P2M3 due on Thurs
  P2 rollout Nov 8
  Q3 Nov 10: threads & synchronization
    Know: what a thread is, what synchronization is for/why you need it, how you do synchronization in Java
Yesterday & today
  Last time: midterm exam
  Before that: Q-learning algorithm
  Today:
    Midterm back
    Midterm discussion/design
    Factory method design pattern
    Design principle o’ the day
    More notes on Q learning
Midterm
  Graded and back: μ = 54, σ = 15, median = 58
Design principle o’ the day: use polymorphism instead of tests
  “In an OO language, if you’re writing a switch statement, 80% of the time, you’re doing the wrong thing”
  Similar for if -- many (not all) if statements can be avoided through careful use of polymorphism
  Instead of testing data, make each data thing know what to do with itself.
Polymorphism vs tests
  Bad old procedural programming way:
    Vec2d v = compute();
    if (v.type == VT_BLUE) {
        // do blue thing
    } else {
        // do red thing
    }
  Good, shiny new OO way:
    Vec2d v = processorObject.compute();
    v.doYourThing();
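To make the OO version concrete, here is a minimal Java sketch of the same idea, reusing the names from the slide; the BlueVec2d/RedVec2d subclasses and the VecFactory interface are hypothetical, added only to show how each object carries its own behavior so the caller never tests a type tag.

    // Hypothetical sketch: the type test disappears because each
    // subclass overrides doYourThing() with its own behavior.
    abstract class Vec2d {
        abstract void doYourThing();
    }

    class BlueVec2d extends Vec2d {
        void doYourThing() { /* do blue thing */ }
    }

    class RedVec2d extends Vec2d {
        void doYourThing() { /* do red thing */ }
    }

    interface VecFactory {
        Vec2d compute();   // decides which concrete subclass to build
    }

    class Client {
        void process(VecFactory processorObject) {
            Vec2d v = processorObject.compute();
            v.doYourThing();   // no if/switch on a type field
        }
    }

The decision about which concrete class to create is made once, in the factory; after that, dynamic dispatch replaces every later test, which is also the heart of the factory method pattern on today’s agenda.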
Polymorphism case study
  Current research code that I’m writing (similar to P2, except for continuous worlds)
  As of this writing:
    2,036 lines of code (incl. comments)
    40 classes
    4 packages
    0 switch statements
    40 occurrences of the string “if” (1.9%)
A closer look
  Of those 40 occurrences of “if”:
    6 in comments
    4 in .equals() downcasts: if (o instanceof TypeBlah) ...
    24 in a single method testing intersection of line segments -- very tricky; lots of mathematical special cases
  Beyond that... only 6 if statements in 1,906 lines of code == 0.3% of the code...
Q-learning, cont’d
Review: Q functions
  Implicit policy uses the idea of a “Q function”: Q : S × A → Reals
  For each action at each state, says how good/bad that action is
  If Q(s_i, a_1) > Q(s_i, a_2), then a_1 is a “better” action than a_2 at state s_i
  Represented in code with a Map: a mapping from an Action to the value (Q) of that Action
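A minimal sketch of that representation in Java, reusing the State2d and Action types from the course code shown later; the QTable class itself and its zero defaults are assumptions made for illustration.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: Q(s,a) stored as a map from each state to a
    // per-state map from Action to that action's Q value.
    public class QTable {
        private final Map<State2d, Map<Action, Double>> _q = new HashMap<>();

        public double get(State2d s, Action a) {
            Map<Action, Double> row = _q.get(s);
            if (row == null) {
                return 0.0;                     // unseen state: assume Q = 0
            }
            return row.getOrDefault(a, 0.0);    // unseen action: assume Q = 0
        }

        public void put(State2d s, Action a, double q) {
            _q.computeIfAbsent(s, k -> new HashMap<>()).put(a, q);
        }
    }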
“Where should I go now?” At state s_29:
  Q(s_29, FWD) = 2.38
  Q(s_29, BACK) = 1.79
  Q(s_29, TURNCLOCK) = 3.49
  Q(s_29, TURNCC) = 0.74
  Q(s_29, NOOP) = 2.03
  ⇒ “Best thing to do is turn clockwise”
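A small Java sketch of that choice, assuming the per-state Q values sit in a Map<Action, Double> as described on the previous slide; the bestAction() helper is a hypothetical method, not part of the course code.

    import java.util.Map;

    // Hypothetical helper: return the action with the largest Q value at one
    // state -- TURNCLOCK in the s_29 example above.
    public static Action bestAction(Map<Action, Double> qForState) {
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : qForState.entrySet()) {
            if (e.getValue() > bestQ) {
                bestQ = e.getValue();
                best = e.getKey();
            }
        }
        return best;   // null if no actions are known at this state
    }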
Q learning in math...
  Q learning rule says: update the current Q with a fraction of the next state’s Q value:
  Q(s,a) ← Q(s,a) + α(r + γQ(s’,a’) - Q(s,a))
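As a worked example with made-up numbers: take α = 0.5, γ = 0.9, current estimate Q(s,a) = 2.0, reward r = 1.0, and best next-state value Q(s’,a’) = 3.4. Then Q(s,a) ← 2.0 + 0.5·(1.0 + 0.9·3.4 - 2.0) = 2.0 + 0.5·2.06 = 3.03, so the estimate moves partway toward the target r + γQ(s’,a’) = 4.06.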
Q learning in code...

  public class MyAgent implements Agent {
      public void updateModel(SARSTuple s) {
          State2d start = s.getInitState();
          State2d end = s.getNextState();
          Action act = s.getAction();
          double r = s.getReward();
          // current estimate of Q(s,a)
          double Qnow = _policy.get(start).get(act);
          // best known value at the next state: Q(s',a')
          double Qnext = _policy.get(end).findMaxQ();
          // the update rule from the previous slide
          double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
          _policy.get(start).put(act, Qrevised);
      }
  }
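The code relies on a findMaxQ() helper on the per-state table, which the slides don’t show. A minimal sketch of what it might do, assuming the per-state table wraps a Map<Action, Double> in a field named _actionValues (a hypothetical name):

    // Hypothetical sketch of findMaxQ(): the largest Q value over all
    // actions currently known at this state.
    public double findMaxQ() {
        if (_actionValues.isEmpty()) {
            return 0.0;   // assumption: a brand-new state is worth 0
        }
        double best = Double.NEGATIVE_INFINITY;
        for (double q : _actionValues.values()) {
            if (q > best) {
                best = q;
            }
        }
        return best;
    }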
Advice on params
  Q-learning requires 2 parameters:
    α: “learning rate”. Good range for α: 0.3 ... 0.7
    γ: “discount factor”. Good range for γ: 0.9 ... 0.999
What’s going on? One step, from s to s’:
  The agent begins the step at state s and examines the Q value for each action
  The agent takes action a and ends up at s’; gets reward r
  It now wants to revise Q(s,a) at the start state
  For that it needs a Q value for some action at the end state s’: pick the best currently known action at s’ == a’
  Set Q(s,a) = Q(s,a) + α(r + γQ(s’,a’) - Q(s,a))
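A hedged sketch of those steps in Java, tying them back to the updateModel() code shown earlier. The Environment interface, its method names, chooseAction(), and the SARSTuple constructor’s argument order are all assumptions made for illustration, not the course API.

    // Hypothetical one step of the agent/environment loop.
    public void runOneStep(Environment env, MyAgent agent) {
        State2d s = env.getCurrentState();       // agent begins the step at state s
        Action a = agent.chooseAction(s);        // examine Q values, pick an action
        double r = env.takeAction(a);            // act; get reward r...
        State2d sPrime = env.getCurrentState();  // ...and end up at s'
        agent.updateModel(new SARSTuple(s, a, r, sPrime));  // revise Q(s,a)
    }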
Why does it work?
  Won’t give a full explanation here
  Basic intuition: each step of experience “backs up” reward from the goal state toward the beginning
  [Diagram: a chain of states leading to a goal state; r = 0 everywhere except the goal, where r = +5. At each step, a chunk of r and Q is “backed up” to the previous state.]