1
RL Rolling on...
2
Administrivia
- Reminder: Terran out of town Tues, Oct 11; Andree Jacobsen will substitute
- Reminder: Stefano Markidis out of town Oct 19
- Office hours: Mon, Oct 17, 8:30-10:30 AM
- Midterm: Oct 20 (Thu)
  - Java syntax/semantics (interfaces, iterators, generics, etc.)
  - Tools (JUnit, Javadoc, jar, packages, etc.)
  - Design problems
3
Administrivia II
- P1 Rollout graded
  - Everyone should have grade sheets back; if not, let us know ASAP
- P1M3: μ=73, σ=30
- P1 total: μ=67, σ=25 ➡ Improvement!
4
Today in history...
- Last time: RL, design exercise
- This time: design exercise, RL
5
Design Exercise: WorldSimulator & Friends
6
Design exercise
Q1: Design the act() method in WorldSimulator
- What objects does it need to access?
- How can it take different terrains/agents into account?
Q2: GridWorld2d could be really large
- Most of the terrain tiles are the same everywhere
- How can you avoid millions of copies of the same tile? (one possible approach is sketched below)
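One way to answer Q2 (not spelled out on the slide) is Flyweight-style sharing: keep a single immutable tile object per terrain type and have every grid cell reference it. A minimal sketch, with hypothetical Tile/TerrainType/TileFactory names that are not from the course code:

import java.util.EnumMap;
import java.util.Map;

// Flyweight-style tile sharing: one immutable Tile instance per terrain
// type, shared by every grid cell that uses it.
enum TerrainType { GRASS, WATER, ROCK }

final class Tile {
    private final TerrainType _type;
    Tile(TerrainType type) { _type = type; }
    TerrainType getType() { return _type; }
}

class TileFactory {
    private static final Map<TerrainType, Tile> _cache =
            new EnumMap<TerrainType, Tile>(TerrainType.class);

    static Tile getTile(TerrainType type) {
        Tile t = _cache.get(type);
        if (t == null) {
            t = new Tile(type);    // created at most once per terrain type
            _cache.put(type, t);
        }
        return t;
    }
}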
7
More RL
8
Recall: The MDP
The entire RL environment is defined by a Markov decision process: M = ⟨S, A, T, R⟩
- S : state space
- A : action space
- T : transition function
- R : reward function
(a Java sketch of these pieces follows)
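A minimal Java sketch of how those four pieces could be grouped; the interface and method names here are illustrative assumptions, not the course's actual types:

import java.util.Set;

// Sketch of an MDP M = <S, A, T, R>, parameterized by state type S and
// action type A. Names are illustrative only.
public interface MDP<S, A> {
    Set<S> states();                      // S : state space
    Set<A> actions(S s);                  // A : actions available in state s
    double transition(S s, A a, S next);  // T : Pr(next | s, a)
    double reward(S s, A a);              // R : expected immediate reward
}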
9
Policies
A plan of action is called a policy, π
A policy defines what action to take in every state of the system: π : S → A
10
The goal of RL
Agent's goal: find the best possible policy, π*
- Find the policy, π*, that maximizes V^π(s) for all s
Q: What's the simplest Java implementation of a policy?
11
Explicit policy
public class MyAgent implements RLAgent {
    // Explicit policy: map each state directly to the action to take there
    private final Map<State2d, Action> _policy;

    public MyAgent() {
        _policy = new HashMap<State2d, Action>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            return _policy.get(here);
        }
        // otherwise: generate a default action, add it to _policy, and return it
        return null;  // placeholder for the default-action case
    }
}
12
Implicit policy
public class MyAgent2 implements RLAgent {
    // Implicit policy: for each state, store a value (Q) for every action
    private final Map<State2d, Map<Action, Double>> _policy;

    public MyAgent2() {
        _policy = new HashMap<State2d, Map<Action, Double>>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            Action maxAct = null;
            double v = Double.NEGATIVE_INFINITY;  // not MIN_VALUE: Q values can be negative
            for (Action a : _policy.get(here).keySet()) {
                if (_policy.get(here).get(a) > v) {
                    v = _policy.get(here).get(a);
                    maxAct = a;
                }
            }
            return maxAct;
        }
        // handle default action case
        return null;  // placeholder for the default-action case
    }
}
13
Q functions
The implicit policy uses the idea of a "Q function": Q : S × A → Reals
- For each action at each state, Q says how good/bad that action is
- If Q(s,a₁) > Q(s,a₂), then a₁ is a "better" action than a₂ at that state
- Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action
14
Q, cont'd
Now we have something that we can learn!
For a given state, s, and action, a, adjust Q for that pair:
- If a seems better than _policy currently has recorded, increase Q(s,a)
- If a seems worse than _policy currently has recorded, decrease Q(s,a)
15
Q learning in math...
Let ⟨s, a, r, s′⟩ be an experience tuple
Let a′ = argmax_g Q(s′, g), the "best" action at the next state, s′
The Q learning rule says: update the current Q with a fraction of the next state's Q value:
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
(a worked numerical example follows)
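Worked example with illustrative numbers (not from the slides): suppose α=0.1, γ=0.9, r=5, Q(s,a)=2, and Q(s′,a′)=4. Then Q(s,a) ← 2 + 0.1·(5 + 0.9·4 − 2) = 2 + 0.1·6.6 = 2.66, so the estimate for (s,a) moves a little toward the observed reward plus the discounted value of the best next action.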
16
Q learning in code...
public class MyAgent implements Agent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();
        double Qnow = _policy.get(start).get(act);
        // findMaxQ(): the max Q value over all actions at the next state
        // (a helper like getQMax() in the Policy refactor on the next slide)
        double Qnext = _policy.get(end).findMaxQ();
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
        _policy.get(start).put(act, Qrevised);
    }
}
17
Refactoring policy
We could probably make the agent simpler by moving the policy out into a different object:
public interface Policy {   // an interface, so the signatures below need no bodies
    public double getQvalue(State2d s, Action a);
    public Action pickAction(State2d s);
    public double getQMax(State2d s);
    public void setQvalue(State2d s, Action a, double d);
}
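A sketch of one way such a Policy object might be backed, reusing the nested Map<State2d, Map<Action, Double>> idea from MyAgent2. The class name MapPolicy and the DEFAULT_Q starting value are assumptions; State2d and Action are the course's existing types:

import java.util.HashMap;
import java.util.Map;

public class MapPolicy {
    private static final double DEFAULT_Q = 0.0;  // assumed value for unseen (s,a) pairs
    private final Map<State2d, Map<Action, Double>> _q =
            new HashMap<State2d, Map<Action, Double>>();

    // Q value for a state/action pair, falling back to the default
    public double getQvalue(State2d s, Action a) {
        Map<Action, Double> row = _q.get(s);
        return (row != null && row.containsKey(a)) ? row.get(a) : DEFAULT_Q;
    }

    // Greedy action: the one with the largest stored Q at this state
    public Action pickAction(State2d s) {
        Map<Action, Double> row = _q.get(s);
        if (row == null || row.isEmpty()) { return null; }  // caller supplies a default
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : row.entrySet()) {
            if (e.getValue() > bestQ) { bestQ = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    // Largest Q at this state (what the Q learning update needs for the next state)
    public double getQMax(State2d s) {
        Action best = pickAction(s);
        return (best == null) ? DEFAULT_Q : getQvalue(s, best);
    }

    public void setQvalue(State2d s, Action a, double d) {
        Map<Action, Double> row = _q.get(s);
        if (row == null) { row = new HashMap<Action, Double>(); _q.put(s, row); }
        row.put(a, d);
    }
}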