RL Rolling on.... Administrivia Reminder: Terran out of town, Tues Oct 11 Andree Jacobsen substitute prof Reminder: Stefano Markidis out of town Oct 19.


1 RL Rolling on...

2 Administrivia
Reminder: Terran out of town, Tues Oct 11; Andree Jacobsen substitute prof
Reminder: Stefano Markidis out of town Oct 19
Office hours: Mon, Oct 17 8:30-10:30 AM
Midterm: Oct 20 (Thu)
  Java syntax/semantics (interfaces, iterators, generics, etc.)
  Tools (JUnit, Javadoc, jar, packages, etc.)
  Design problems

3 Administrivia II P1 Rollout graded Everyone should have grade sheets back If not, let us know ASAP P1M3: μ=73, σ=30 P1 total: μ=67, σ=25 ➡ Improvement!

4 Today in history... Last time: RL Design exercise This time: Design exercise RL

5 Design Exercise: WorldSimulator & Friends

6 Design exercise
Q1: Design the act() method in WorldSimulator
  What objects does it need to access?
  How can it take different terrains/agents into account?
Q2: GridWorld2d could be really large
  Most of the terrain tiles are the same everywhere
  How can you avoid millions of copies of the same tile?
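One possible answer to Q2 is to share a single immutable tile object among all cells of the same terrain kind (the flyweight pattern). The following is only a sketch under stated assumptions: TerrainTile, TileFactory, and SharedGrid are hypothetical names invented for illustration and are not taken from the course code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable tile: safe to share because it holds no per-cell state.
final class TerrainTile {
    private final String kind;
    private final double moveCost;

    TerrainTile(String kind, double moveCost) {
        this.kind = kind;
        this.moveCost = moveCost;
    }

    String getKind() { return kind; }
    double getMoveCost() { return moveCost; }
}

// Hypothetical factory that hands out one shared instance per terrain kind.
final class TileFactory {
    private final Map<String, TerrainTile> cache = new HashMap<>();

    TerrainTile get(String kind, double moveCost) {
        // The tile is created only the first time a kind is requested;
        // later calls return the cached instance and ignore moveCost.
        return cache.computeIfAbsent(kind, k -> new TerrainTile(k, moveCost));
    }

    int distinctTiles() { return cache.size(); }
}

// Hypothetical grid that stores references to shared tiles, not copies.
final class SharedGrid {
    private final TerrainTile[][] cells;

    SharedGrid(int width, int height, TerrainTile fill) {
        cells = new TerrainTile[width][height];
        for (TerrainTile[] column : cells) {
            Arrays.fill(column, fill);   // every cell points at the same object
        }
    }

    TerrainTile get(int x, int y) { return cells[x][y]; }
    void set(int x, int y, TerrainTile t) { cells[x][y] = t; }
}
```

With this layout a million-cell grid costs a million references but only as many tile objects as there are distinct terrain kinds. The trade-off is that tiles must stay immutable; any per-cell state would have to live outside the tile.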

7 More RL

8 Recall: The MDP
Entire RL environment defined by a Markov decision process: M = ⟨S, A, T, R⟩
S: state space
A: action space
T: transition function
R: reward function
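To make the four components concrete, here is a minimal sketch of the tuple as a Java interface, plus a toy two-state instance. The names Mdp and TwoStateMdp are hypothetical. The slide does not say which arguments R takes, so making the reward depend on (s, a, next) is an assumption.

```java
import java.util.List;

// Hypothetical generic view of an MDP M = <S, A, T, R>.
interface Mdp<S, A> {
    List<S> states();                     // S: state space
    List<A> actions();                    // A: action space
    double transition(S s, A a, S next);  // T: probability of reaching next from s via a
    double reward(S s, A a, S next);      // R: assumed to depend on (s, a, next)
}

// Toy example: "GO" moves to the other state, "STAY" stays put.
final class TwoStateMdp implements Mdp<Integer, String> {
    public List<Integer> states() { return List.of(0, 1); }
    public List<String> actions() { return List.of("STAY", "GO"); }

    public double transition(Integer s, String a, Integer next) {
        int target = a.equals("GO") ? 1 - s : s;  // deterministic dynamics
        return next == target ? 1.0 : 0.0;
    }

    public double reward(Integer s, String a, Integer next) {
        return next == 1 ? 1.0 : 0.0;             // reward for landing in state 1
    }
}
```

For any fixed state and action, the transition probabilities over all next states should sum to 1, which is easy to check on this toy instance.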

9 Policies Plan of action is called a policy, π. A policy defines what action to take in every state of the system: π : S → A

10 The goal of RL Agent’s goal: find the best possible policy, π*. That is, find the policy π* that maximizes V^π(s) for all s. Q: What’s the simplest Java implementation of a policy?

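11 Explicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent implements RLAgent {
    // Explicit policy: stores the chosen Action for each state directly
    private final Map<State2d, Action> _policy;

    public MyAgent() {
        _policy = new HashMap<State2d, Action>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            return _policy.get(here);
        }
        // generate a default and add to _policy
    }
}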

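12 Implicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent2 implements RLAgent {
    // Implicit policy: stores a value for each (state, action) pair
    private final Map<State2d, HashMap<Action, Double>> _policy;

    public MyAgent2() {
        _policy = new HashMap<State2d, HashMap<Action, Double>>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            Action maxAct = null;
            // Start from negative infinity; Double.MIN_VALUE is the smallest
            // positive double, so it would hide actions with negative values
            double v = Double.NEGATIVE_INFINITY;
            for (Action a : _policy.get(here).keySet()) {
                double q = _policy.get(here).get(a);
                if (q > v) {
                    v = q;        // remember the best value seen so far
                    maxAct = a;
                }
            }
            return maxAct;        // return only after checking every action
        }
        // handle default action case
    }
}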

13 Q functions Implicit policy uses the idea of a “Q function”, Q : S × A → ℝ. For each action at each state, it says how good/bad that action is. If Q(s, a_1) > Q(s, a_2), then a_1 is a “better” action than a_2 at that state. Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action

14 Q, cont’d Now we have something that we can learn! For a given state, s, and action, a, adjust Q for that pair If a seems better than _policy currently has recorded, increase Q(s,a) If a seems worse than _policy currently has recorded, decrease Q(s,a)

15 Q learning in math... Let ⟨s, a, r, s’⟩ be an experience tuple. Let a’ = argmax_g {Q(s’, g)}, the “best” action at the next state, s’. The Q learning rule says: update the current Q with a fraction of the next state’s Q value: Q(s,a) ← Q(s,a) + α(r + γQ(s’,a’) − Q(s,a)). 0≤α<1 and 0≤γ<1 are constants that change the behavior of the algorithm
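A single application of the update rule can be worked through numerically. The sketch below uses made-up values (Q(s,a) = 2.0, r = 1.0, best next-state Q = 4.0, α = 0.5, γ = 0.9) purely for illustration; QUpdate and update are hypothetical names.

```java
final class QUpdate {
    // One Q-learning step: Q(s,a) <- Q(s,a) + alpha * (r + gamma * maxNextQ - Q(s,a))
    static double update(double qNow, double reward, double maxNextQ,
                         double alpha, double gamma) {
        return qNow + alpha * (reward + gamma * maxNextQ - qNow);
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not taken from any real run
        double revised = update(2.0, 1.0, 4.0, 0.5, 0.9);
        System.out.println(revised);
    }
}
```

By hand: the target is r + γQ(s’,a’) = 1.0 + 0.9 × 4.0 = 4.6, the error is 4.6 − 2.0 = 2.6, and moving half of the way (α = 0.5) gives a revised Q of about 3.3. With α = 0 the estimate never changes; as α approaches 1 the old estimate is almost entirely replaced by the target.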

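16 Q learning in code...
import java.util.Collections;

public class MyAgent implements Agent {
    // Assumes _policy maps State2d to a Map<Action, Double> of Q values and
    // already holds entries for both states and for the action taken
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();
        double Qnow = _policy.get(start).get(act);
        // Largest Q value available at the next state
        double Qnext = Collections.max(_policy.get(end).values());
        // Q(s,a) <- Q(s,a) + alpha * (r + gamma * Qnext - Q(s,a))
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
        _policy.get(start).put(act, Qrevised);
    }
}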

17 Refactoring policy Could probably make agent simpler by moving policy out to a different object:
public class Policy {
    // Method bodies omitted
    public double getQvalue(State2d s, Action a);
    public Action pickAction(State2d s);
    public double getQMax(State2d s);
    public void setQvalue(State2d s, Action a, double d);
}
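One way the Policy object could be filled in is with a nested map of Q values. This is a sketch, not the course's implementation: the Action enum and State2d class below are hypothetical stand-ins for the real course types, and defaulting unseen (state, action) pairs to Q = 0.0 is an assumption.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the course's Action and State2d types.
enum Action { NORTH, SOUTH, EAST, WEST }

final class State2d {
    final int x, y;
    State2d(int x, int y) { this.x = x; this.y = y; }

    // equals/hashCode are required so State2d works as a HashMap key
    @Override public boolean equals(Object o) {
        return o instanceof State2d && ((State2d) o).x == x && ((State2d) o).y == y;
    }
    @Override public int hashCode() { return 31 * x + y; }
}

// Table-backed policy; unseen (state, action) pairs default to Q = 0.0 (an assumption).
final class Policy {
    private final Map<State2d, Map<Action, Double>> q = new HashMap<>();

    double getQvalue(State2d s, Action a) {
        Map<Action, Double> row = q.get(s);
        return (row == null) ? 0.0 : row.getOrDefault(a, 0.0);
    }

    void setQvalue(State2d s, Action a, double d) {
        q.computeIfAbsent(s, k -> new EnumMap<Action, Double>(Action.class)).put(a, d);
    }

    // Greedy choice: the action with the largest Q value; ties go to the
    // action that comes first in enum order.
    Action pickAction(State2d s) {
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Action a : Action.values()) {
            double v = getQvalue(s, a);
            if (v > bestQ) { bestQ = v; best = a; }
        }
        return best;
    }

    double getQMax(State2d s) {
        return getQvalue(s, pickAction(s));
    }
}
```

With the table hidden behind these four methods, an agent's update step reduces to one getQvalue, one getQMax, and one setQvalue call, and the storage scheme can change without touching the agent.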

