RL: Algorithms time
Happy Full Moon!
Administrivia
Reminder: Midterm exam this Thurs (Oct 20)
Spec v0.98 released today (after class); check the class web page
Temporal perspective
Last Thursday: Fall break. Hope you had a good one!
Last Tuesday: The true meaning of static; the Singleton design pattern
Today: ✓ Astronomy, ✓ Administrivia; static, singletons, & enums (briefly); more RL (algorithms!)
Enums
Nothing magical about Java enums: "under the hood," they're just classes plus the singleton design pattern.
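As a rough sketch of that idea (the Direction/FWD/BACK names are illustrative, and javac's actual generated code differs), an enum behaves roughly like a final class with a private constructor and a fixed set of static final singleton instances:

public final class Direction {
    // The enum constants become public static final singleton instances.
    public static final Direction FWD  = new Direction("FWD", 0);
    public static final Direction BACK = new Direction("BACK", 1);

    private final String name;
    private final int ordinal;

    // Private constructor: no other code can create new instances.
    private Direction(String name, int ordinal) {
        this.name = name;
        this.ordinal = ordinal;
    }

    public String name()   { return name; }
    public int ordinal()   { return ordinal; }

    public static Direction[] values() {
        return new Direction[] { FWD, BACK };
    }
}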
More RL
Recall: The MDP
The entire RL environment is defined by a Markov decision process: M = ⟨S, A, T, R⟩
S: state space
A: action space
T: transition function
R: reward function
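As a rough Java sketch of those four pieces (the MarkovDecisionProcess name and method signatures here are assumptions for illustration, not part of the course framework):

import java.util.Set;

// Hypothetical sketch of the four MDP components; not the course's actual API.
public interface MarkovDecisionProcess<S, A> {
    Set<S> states();                        // S: state space
    Set<A> actions();                       // A: action space
    double transition(S s, A a, S sNext);   // T: probability of landing in sNext after taking a in s
    double reward(S s, A a);                // R: reward for taking action a in state s
}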
Policies
A plan of action is called a policy, π. A policy defines what action to take in every state of the system:
π(s₄₂) = FWD  ==  "When I find myself in state s₄₂, execute the action FORWARD"
This is the basis for learning: tweak π (indirectly) to make the agent better over time.
The goal of RL
The agent's goal is to find the best possible policy π*: the policy that maximizes V^π(s) for all s.
Q: What's the simplest Java implementation of a policy?
Explicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent implements RLAgent {
  private final Map<State2d, Action> _policy;

  public MyAgent() {
    _policy = new HashMap<State2d, Action>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      return _policy.get(here);
    }
    // generate a default, add it to _policy, and return it
    Action dflt = defaultAction();   // hypothetical helper; the slide leaves this unspecified
    _policy.put(here, dflt);
    return dflt;
  }
}
Implicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent2 implements RLAgent {
  private final Map<State2d, HashMap<Action, Double>> _policy;

  public MyAgent2() {
    _policy = new HashMap<State2d, HashMap<Action, Double>>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      // pick the action with the largest Q value at this state
      Action maxAct = null;
      double v = Double.NEGATIVE_INFINITY;   // not MIN_VALUE, which is the smallest *positive* double
      for (Action a : _policy.get(here).keySet()) {
        if (_policy.get(here).get(a) > v) {
          v = _policy.get(here).get(a);
          maxAct = a;
        }
      }
      return maxAct;
    }
    // handle default action case (left unspecified on the slide)
    return defaultAction();   // hypothetical helper, not defined here
  }
}
Q functions
The implicit policy uses the idea of a "Q function": Q : S × A → Reals.
For each action at each state, Q says how good or bad that action is.
If Q(sᵢ, a₁) > Q(sᵢ, a₂), then a₁ is a "better" action than a₂ at state sᵢ.
Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action.
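A tiny self-contained sketch of that per-state map (the state and action strings stand in for the course's State2d and Action types; the numbers are borrowed from the s₂₉ example that follows):

import java.util.HashMap;
import java.util.Map;

public class QTableDemo {
    public static void main(String[] args) {
        // One inner Map<action, Q> per state.
        Map<String, Map<String, Double>> q = new HashMap<>();
        q.put("s29", new HashMap<>());
        q.get("s29").put("FWD", 2.38);
        q.get("s29").put("TURNCLOCK", 3.49);

        // If Q(s, a1) > Q(s, a2), then a1 is the "better" action at s.
        String better = q.get("s29").get("TURNCLOCK") > q.get("s29").get("FWD")
                ? "TURNCLOCK" : "FWD";
        System.out.println("Better action at s29: " + better);   // prints TURNCLOCK
    }
}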
"Where should I go now?" (at state s₂₉)
Q(s₂₉, FWD) = 2.38
Q(s₂₉, BACK) = 1.79
Q(s₂₉, TURNCLOCK) = 3.49
Q(s₂₉, TURNCC) = 0.74
Q(s₂₉, NOOP) = 2.03
⇒ "Best thing to do is turn clockwise"
Q, cont'd
Now we have something that we can learn! For a given state, s, and action, a, adjust Q for that pair:
If a seems better than _policy currently has recorded, increase Q(s,a).
If a seems worse than _policy currently has recorded, decrease Q(s,a).
Q learning in math...
Let ⟨s, a, r, s′⟩ be an experience tuple (start state, action taken, reward received, next state).
Let a′ = argmax_g Q(s′, g), the "best" action at the next state, s′.
The Q learning rule says: update the current Q with a fraction of the next-state Q value:
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm.
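A quick worked example with made-up numbers (not from the slides): suppose α = 0.5, γ = 0.9, Q(s,a) = 2.0, r = 1.0, and Q(s′,a′) = 3.0. Then Q(s,a) ← 2.0 + 0.5·(1.0 + 0.9·3.0 − 2.0) = 2.0 + 0.5·1.7 = 2.85, so the estimate moves halfway toward the one-step target r + γQ(s′,a′) = 3.7.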
Q learning in code...
public class MyAgent implements Agent {
  // _policy, getAlpha(), and getGamma() are defined elsewhere in the class.
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();

    double Qnow  = _policy.get(start).get(act);
    double Qnext = _policy.get(end).findMaxQ();   // max Q over all actions at the next state;
                                                  // assumes the per-state object provides findMaxQ
                                                  // (cf. the Policy refactor on the next slide)
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);

    _policy.get(start).put(act, Qrevised);
  }
}
Refactoring policy
Could probably make the agent simpler by moving the policy out into a different object, e.g. an interface like:
public interface Policy {
  double getQvalue(State2d s, Action a);
  Action pickAction(State2d s);
  double getQMax(State2d s);
  void setQvalue(State2d s, Action a, double d);
}
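A minimal sketch of one way to back that interface with the nested maps from MyAgent2 (the 0.0 default for unseen state/action pairs and the null return from pickAction are assumptions, not from the slides):

import java.util.HashMap;
import java.util.Map;

public class TabularPolicy implements Policy {
  private final Map<State2d, Map<Action, Double>> _q = new HashMap<>();

  public double getQvalue(State2d s, Action a) {
    if (!_q.containsKey(s) || !_q.get(s).containsKey(a)) { return 0.0; }  // assumed default Q
    return _q.get(s).get(a);
  }

  public Action pickAction(State2d s) {
    Action best = null;
    double v = Double.NEGATIVE_INFINITY;
    if (_q.containsKey(s)) {
      for (Action a : _q.get(s).keySet()) {
        if (_q.get(s).get(a) > v) { v = _q.get(s).get(a); best = a; }
      }
    }
    return best;   // null if s has never been seen; a real agent still needs a default action
  }

  public double getQMax(State2d s) {
    Action best = pickAction(s);
    return (best == null) ? 0.0 : getQvalue(s, best);
  }

  public void setQvalue(State2d s, Action a, double d) {
    _q.putIfAbsent(s, new HashMap<Action, Double>());
    _q.get(s).put(a, d);
  }
}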