RL Rolling on...
Administrivia Reminder: Terran out of town, Tues Oct 11 Andree Jacobsen substitute prof Reminder: Stefano Markidis out of town Oct 19 Office hours: Mon, Oct 17 8:30-10:30 AM Midterm: Oct 20 (Thu) Java syntax/semantics (interfaces, iterators, generics, etc.) Tools (JUnit, Javadoc, jar, packages, etc.) Design problems
Administrivia II P1 Rollout graded Everyone should have grade sheets back If not, let us know ASAP P1M3: μ=73, σ=30 P1 total: μ=67, σ=25 ➡ Improvement!
Today in history... Last time: RL Design exercise This time: Design exercise RL
Design Exercise: WorldSimulator & Friends
Design exercise Q1: Design the act() method in WorldSimulator What objects does it need to access? How can it take different terrains/agents into account? Q2: GridWorld2d could be really large Most of the terrain tiles are the same everywhere How can you avoid millions of copies of the same tile? (One common approach to Q2 is sketched below.)
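One well-known answer to Q2 is the Flyweight pattern: keep a single shared, immutable object per tile type and have every grid cell reference it. The sketch below uses illustrative names (TileType, TerrainTile, TileFactory) that are not part of the project code; it only shows the shape of the idea.

import java.util.EnumMap;
import java.util.Map;

// Illustrative names only; the real project classes may differ.
enum TileType { GRASS, WATER, ROCK }

// Immutable flyweight: safe to share among every grid cell of this type.
final class TerrainTile {
  private final TileType type;
  private final double moveCost;

  TerrainTile(TileType type, double moveCost) {
    this.type = type;
    this.moveCost = moveCost;
  }

  TileType getType() { return type; }
  double getMoveCost() { return moveCost; }
}

// Factory caches one instance per tile type instead of one per grid cell.
final class TileFactory {
  private static final Map<TileType, TerrainTile> CACHE =
      new EnumMap<TileType, TerrainTile>(TileType.class);

  static TerrainTile get(TileType type) {
    TerrainTile tile = CACHE.get(type);
    if (tile == null) {
      tile = new TerrainTile(type, defaultCostFor(type));
      CACHE.put(type, tile);
    }
    return tile;
  }

  private static double defaultCostFor(TileType type) {
    return type == TileType.WATER ? 5.0 : 1.0; // made-up costs
  }
}

A GridWorld2d then stores only references to shared TerrainTile instances (or even just TileType values), so a million grass cells point at one object. Anything that varies per cell (an agent standing there, items, etc.) stays outside the shared flyweight.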
More RL
Recall: The MDP Entire RL environment defined by a Markov decision process: M = ⟨S, A, T, R⟩ S : state space A : action space T : transition function R : reward function
Policies Plan of action is called a policy, π Policy defines what action to take in every state of the system: π : S → A
The goal of RL Agent’s goal: Find the best possible policy: π* Find policy, π*, that maximizes V^π(s) for all s Q: What’s the simplest Java implementation of a policy?
Explicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent implements RLAgent {
  // Explicit policy: store exactly one chosen Action per known state
  private final Map<State2d, Action> _policy;

  public MyAgent() {
    _policy = new HashMap<State2d, Action>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      return _policy.get(here);
    }
    // Unseen state: generate a default action and add it to _policy.
    // defaultActionFor() is a hypothetical helper; how a default Action is
    // built depends on the framework's Action API.
    Action fallback = defaultActionFor(here);
    _policy.put(here, fallback);
    return fallback;
  }
}
Implicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent2 implements RLAgent {
  // Implicit policy: for each state, a map from each Action to its Q value
  private final Map<State2d, HashMap<Action, Double>> _policy;

  public MyAgent2() {
    _policy = new HashMap<State2d, HashMap<Action, Double>>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      Action maxAct = null;
      // Note: Double.MIN_VALUE is the smallest *positive* double, so start
      // the max search from negative infinity instead
      double v = Double.NEGATIVE_INFINITY;
      for (Action a : _policy.get(here).keySet()) {
        double q = _policy.get(here).get(a);
        if (q > v) {
          maxAct = a;
          v = q;
        }
      }
      return maxAct;
    }
    // Handle the default action case (unseen state); defaultActionFor() is a
    // hypothetical helper, as in MyAgent above
    return defaultActionFor(here);
  }
}
Q functions Implicit policy uses the idea of a “Q function” Q : S × A → Reals For each action at each state, says how good/bad that action is If Q(s,a1) > Q(s,a2), then a1 is a “better” action than a2 at that state Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action
Q, cont’d Now we have something that we can learn! For a given state, s, and action, a, adjust Q for that pair If a seems better than _policy currently has recorded, increase Q(s,a) If a seems worse than _policy currently has recorded, decrease Q(s,a)
Q learning in math... Let ⟨s, a, r, s′⟩ be an experience tuple Let a′ = argmax_g Q(s′, g), the “best” action at the next state, s′ Q learning rule says: update the current Q with a fraction of the next-state Q value:
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
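A quick sanity check with made-up numbers: suppose α = 0.1, γ = 0.9, the current Q(s,a) = 2.0, the reward r = 1.0, and the best next-state value Q(s′,a′) = 3.0. Then

Q(s,a) ← 2.0 + 0.1 × (1.0 + 0.9 × 3.0 − 2.0) = 2.0 + 0.1 × 1.7 = 2.17

so Q(s,a) takes a small step (of size α) toward the one-step target r + γQ(s′,a′).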
Q learning in code...
public class MyAgent implements RLAgent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    double Qnow   = _policy.get(start).get(act);
    // findMaxQ() assumes a helper that returns the max Q over all actions
    // at that state (see the Policy refactoring on the next slide)
    double Qnext  = _policy.get(end).findMaxQ();
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.get(start).put(act, Qrevised);
  }
}
Refactoring policy Could probably make the agent simpler by moving the policy out to a different object:
public class Policy {  // method signatures only; bodies omitted
  public double getQvalue(State2d s, Action a);
  public Action pickAction(State2d s);
  public double getQMax(State2d s);
  public void setQvalue(State2d s, Action a, double d);
}
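With the policy pulled out, the Q-learning update from the previous slide might look like the sketch below. This assumes a concrete Policy implementation with the signatures above, plus the getAlpha()/getGamma() helpers from the earlier slide; it is one possible design, not the required one.

public class MyAgent implements RLAgent {
  private final Policy _policy = new Policy();

  public void updateModel(SARSTuple s) {
    double Qnow  = _policy.getQvalue(s.getInitState(), s.getAction());
    double Qnext = _policy.getQMax(s.getNextState());
    double Qrevised = Qnow
        + getAlpha() * (s.getReward() + getGamma() * Qnext - Qnow);
    _policy.setQvalue(s.getInitState(), s.getAction(), Qrevised);
  }

  public Action pickAction(State2d here) {
    // All of the argmax / default-action bookkeeping now lives inside Policy
    return _policy.pickAction(here);
  }
}

The agent shrinks to the learning rule itself; the map-of-maps bookkeeping, default actions, and max-over-actions search are hidden behind the Policy API.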