1
RL Rolling on...
2
Administrivia
- Reminder: Terran out of town Tues, Oct 11; Andree Jacobsen will substitute
- Reminder: Stefano Markidis out of town Oct 19
- Office hours: Mon, Oct 17, 8:30-10:30 AM
- Midterm: Oct 20 (Thu)
  - Java syntax/semantics (interfaces, iterators, generics, etc.)
  - Tools (JUnit, Javadoc, jar, packages, etc.)
  - Design problems
3
Administrivia II
- P1 Rollout graded
  - Everyone should have grade sheets back; if not, let us know ASAP
- P1M3: μ=73, σ=30
- P1 total: μ=67, σ=25 ➡ Improvement!
4
Today in history...
- Last time: RL, design exercise
- This time: design exercise, RL
5
Design Exercise: WorldSimulator & Friends
6
Design exercise
Q1: Design the act() method in WorldSimulator
- What objects does it need to access?
- How can it take different terrains/agents into account?
Q2: GridWorld2d could be really large
- Most of the terrain tiles are the same everywhere
- How can you avoid millions of copies of the same tile? (one possible approach is sketched below)
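One way to answer Q2 (not spelled out on the slide) is Flyweight-style sharing: keep a single immutable tile object per terrain type and have every grid cell reference it. A minimal sketch, with hypothetical Tile/TerrainType/TileFactory names that are not from the course code:

import java.util.EnumMap;
import java.util.Map;

// Flyweight-style tile sharing: one immutable Tile instance per terrain
// type, shared by every grid cell that uses it.
enum TerrainType { GRASS, WATER, ROCK }

final class Tile {
    private final TerrainType _type;
    Tile(TerrainType type) { _type = type; }
    TerrainType getType() { return _type; }
}

class TileFactory {
    private static final Map<TerrainType, Tile> _cache =
            new EnumMap<TerrainType, Tile>(TerrainType.class);

    static Tile getTile(TerrainType type) {
        Tile t = _cache.get(type);
        if (t == null) {
            t = new Tile(type);    // created at most once per terrain type
            _cache.put(type, t);
        }
        return t;
    }
}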
7
More RL
8
Recall: The MDP
The entire RL environment is defined by a Markov decision process: M = ⟨S, A, T, R⟩
- S : state space
- A : action space
- T : transition function
- R : reward function
(a Java sketch of these pieces follows)
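A minimal Java sketch of how those four pieces could be grouped; the interface and method names here are illustrative assumptions, not the course's actual types:

import java.util.Set;

// Sketch of an MDP M = <S, A, T, R>, parameterized by state type S and
// action type A. Names are illustrative only.
public interface MDP<S, A> {
    Set<S> states();                      // S : state space
    Set<A> actions(S s);                  // A : actions available in state s
    double transition(S s, A a, S next);  // T : Pr(next | s, a)
    double reward(S s, A a);              // R : expected immediate reward
}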
9
Policies
A plan of action is called a policy, π
A policy defines what action to take in every state of the system: π : S → A
10
The goal of RL
Agent's goal: find the best possible policy, π*
- Find the policy, π*, that maximizes V^π(s) for all s
Q: What's the simplest Java implementation of a policy?
11
Explicit policy
public class MyAgent implements RLAgent {
    // Explicit policy: map each state directly to the action to take there
    private final Map<State2d, Action> _policy;

    public MyAgent() {
        _policy = new HashMap<State2d, Action>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            return _policy.get(here);
        }
        // otherwise: generate a default action, add it to _policy, and return it
        return null;  // placeholder for the default-action case
    }
}
12
Implicit policy
public class MyAgent2 implements RLAgent {
    // Implicit policy: for each state, store a value (Q) for every action
    private final Map<State2d, Map<Action, Double>> _policy;

    public MyAgent2() {
        _policy = new HashMap<State2d, Map<Action, Double>>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            Action maxAct = null;
            double v = Double.NEGATIVE_INFINITY;  // not MIN_VALUE: Q values can be negative
            for (Action a : _policy.get(here).keySet()) {
                if (_policy.get(here).get(a) > v) {
                    v = _policy.get(here).get(a);
                    maxAct = a;
                }
            }
            return maxAct;
        }
        // handle default action case
        return null;  // placeholder for the default-action case
    }
}
13
Q functions
The implicit policy uses the idea of a "Q function": Q : S × A → Reals
- For each action at each state, Q says how good/bad that action is
- If Q(s,a₁) > Q(s,a₂), then a₁ is a "better" action than a₂ at that state
- Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action
14
Q, cont'd
Now we have something that we can learn!
For a given state, s, and action, a, adjust Q for that pair:
- If a seems better than _policy currently has recorded, increase Q(s,a)
- If a seems worse than _policy currently has recorded, decrease Q(s,a)
15
Q learning in math...
Let ⟨s, a, r, s′⟩ be an experience tuple
Let a′ = argmax_g Q(s′, g), the "best" action at the next state, s′
The Q learning rule says: update the current Q with a fraction of the next state's Q value:
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
(a worked numerical example follows)
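Worked example with illustrative numbers (not from the slides): suppose α=0.1, γ=0.9, r=5, Q(s,a)=2, and Q(s′,a′)=4. Then Q(s,a) ← 2 + 0.1·(5 + 0.9·4 − 2) = 2 + 0.1·6.6 = 2.66, so the estimate for (s,a) moves a little toward the observed reward plus the discounted value of the best next action.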
16
Q learning in code...
public class MyAgent implements Agent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();
        double Qnow = _policy.get(start).get(act);
        // findMaxQ(): the max Q value over all actions at the next state
        // (a helper like getQMax() in the Policy refactor on the next slide)
        double Qnext = _policy.get(end).findMaxQ();
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
        _policy.get(start).put(act, Qrevised);
    }
}
17
Refactoring policy
We could probably make the agent simpler by moving the policy out into a different object:
public interface Policy {   // an interface, so the signatures below need no bodies
    public double getQvalue(State2d s, Action a);
    public Action pickAction(State2d s);
    public double getQMax(State2d s);
    public void setQvalue(State2d s, Action a, double d);
}
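A sketch of one way such a Policy object might be backed, reusing the nested Map<State2d, Map<Action, Double>> idea from MyAgent2. The class name MapPolicy and the DEFAULT_Q starting value are assumptions; State2d and Action are the course's existing types:

import java.util.HashMap;
import java.util.Map;

public class MapPolicy {
    private static final double DEFAULT_Q = 0.0;  // assumed value for unseen (s,a) pairs
    private final Map<State2d, Map<Action, Double>> _q =
            new HashMap<State2d, Map<Action, Double>>();

    // Q value for a state/action pair, falling back to the default
    public double getQvalue(State2d s, Action a) {
        Map<Action, Double> row = _q.get(s);
        return (row != null && row.containsKey(a)) ? row.get(a) : DEFAULT_Q;
    }

    // Greedy action: the one with the largest stored Q at this state
    public Action pickAction(State2d s) {
        Map<Action, Double> row = _q.get(s);
        if (row == null || row.isEmpty()) { return null; }  // caller supplies a default
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : row.entrySet()) {
            if (e.getValue() > bestQ) { bestQ = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    // Largest Q at this state (what the Q learning update needs for the next state)
    public double getQMax(State2d s) {
        Action best = pickAction(s);
        return (best == null) ? DEFAULT_Q : getQvalue(s, best);
    }

    public void setQvalue(State2d s, Action a, double d) {
        Map<Action, Double> row = _q.get(s);
        if (row == null) { row = new HashMap<Action, Double>(); _q.put(s, row); }
        row.put(a, d);
    }
}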