RL Rolling on...
Administrivia Reminder: Terran out of town, Tues Oct 11 Andree Jacobsen substitute prof Reminder: Stefano Markidis out of town Oct 19 Office hours: Mon, Oct 17 8:30-10:30 AM Midterm: Oct 20 (Thu) Java syntax/semantics (interfaces, iterators, generics, etc.) Tools (JUnit, Javadoc, jar, packages, etc.) Design problems
Administrivia II P1 Rollout graded Everyone should have grade sheets back If not, let us know ASAP P1M3: μ=73, σ=30 P1 total: μ=67, σ=25 ➡ Improvement!
Today in history... Last time: RL Design exercise This time: Design exercise RL
Design Exercise: WorldSimulator & Friends
Design exercise Q1: Design the act() method in WorldSimulator What objects does it need to access? How can it take different terrains/agents into account? Q2: GridWorld2d could be really large Most of the terrain tiles are the same everywhere How can you avoid millions of copies of the same tile? (One common approach to Q2 is sketched below.)
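One well-known answer to Q2 is the Flyweight pattern: keep a single shared, immutable object per tile type and have every grid cell reference it. The sketch below uses illustrative names (TileType, TerrainTile, TileFactory) that are not part of the project code; it only shows the shape of the idea.

import java.util.EnumMap;
import java.util.Map;

// Illustrative names only; the real project classes may differ.
enum TileType { GRASS, WATER, ROCK }

// Immutable flyweight: safe to share among every grid cell of this type.
final class TerrainTile {
  private final TileType type;
  private final double moveCost;

  TerrainTile(TileType type, double moveCost) {
    this.type = type;
    this.moveCost = moveCost;
  }

  TileType getType() { return type; }
  double getMoveCost() { return moveCost; }
}

// Factory caches one instance per tile type instead of one per grid cell.
final class TileFactory {
  private static final Map<TileType, TerrainTile> CACHE =
      new EnumMap<TileType, TerrainTile>(TileType.class);

  static TerrainTile get(TileType type) {
    TerrainTile tile = CACHE.get(type);
    if (tile == null) {
      tile = new TerrainTile(type, defaultCostFor(type));
      CACHE.put(type, tile);
    }
    return tile;
  }

  private static double defaultCostFor(TileType type) {
    return type == TileType.WATER ? 5.0 : 1.0; // made-up costs
  }
}

A GridWorld2d then stores only references to shared TerrainTile instances (or even just TileType values), so a million grass cells point at one object. Anything that varies per cell (an agent standing there, items, etc.) stays outside the shared flyweight.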
More RL
Recall: The MDP Entire RL environment defined by a Markov decision process: M = ⟨S, A, T, R⟩ S : state space A : action space T : transition function R : reward function
Policies Plan of action is called a policy, π Policy defines what action to take in every state of the system: π : S → A
The goal of RL Agent’s goal: Find the best possible policy: π* Find policy, π*, that maximizes V^π(s) for all s Q: What’s the simplest Java implementation of a policy?
Explicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent implements RLAgent {
  // Explicit policy: store exactly one chosen Action per known state
  private final Map<State2d, Action> _policy;

  public MyAgent() {
    _policy = new HashMap<State2d, Action>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      return _policy.get(here);
    }
    // Unseen state: generate a default action and add it to _policy.
    // defaultActionFor() is a hypothetical helper; how a default Action is
    // built depends on the framework's Action API.
    Action fallback = defaultActionFor(here);
    _policy.put(here, fallback);
    return fallback;
  }
}
Implicit policy
import java.util.HashMap;
import java.util.Map;

public class MyAgent2 implements RLAgent {
  // Implicit policy: for each state, a map from each Action to its Q value
  private final Map<State2d, HashMap<Action, Double>> _policy;

  public MyAgent2() {
    _policy = new HashMap<State2d, HashMap<Action, Double>>();
  }

  public Action pickAction(State2d here) {
    if (_policy.containsKey(here)) {
      Action maxAct = null;
      // Note: Double.MIN_VALUE is the smallest *positive* double, so start
      // the max search from negative infinity instead
      double v = Double.NEGATIVE_INFINITY;
      for (Action a : _policy.get(here).keySet()) {
        double q = _policy.get(here).get(a);
        if (q > v) {
          maxAct = a;
          v = q;
        }
      }
      return maxAct;
    }
    // Handle the default action case (unseen state); defaultActionFor() is a
    // hypothetical helper, as in MyAgent above
    return defaultActionFor(here);
  }
}
Q functions Implicit policy uses the idea of a “Q function” Q : S × A → Reals For each action at each state, says how good/bad that action is If Q(s,a1) > Q(s,a2), then a1 is a “better” action than a2 at that state Represented in code with Map<Action, Double>: a mapping from an Action to the value (Q) of that Action
Q, cont’d Now we have something that we can learn! For a given state, s, and action, a, adjust Q for that pair If a seems better than _policy currently has recorded, increase Q(s,a) If a seems worse than _policy currently has recorded, decrease Q(s,a)
Q learning in math... Let ⟨s, a, r, s′⟩ be an experience tuple Let a′ = argmax_g Q(s′, g), the “best” action at the next state, s′ Q learning rule says: update the current Q with a fraction of the next-state Q value:
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
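A quick sanity check with made-up numbers: suppose α = 0.1, γ = 0.9, the current Q(s,a) = 2.0, the reward r = 1.0, and the best next-state value Q(s′,a′) = 3.0. Then

Q(s,a) ← 2.0 + 0.1 × (1.0 + 0.9 × 3.0 − 2.0) = 2.0 + 0.1 × 1.7 = 2.17

so Q(s,a) takes a small step (of size α) toward the one-step target r + γQ(s′,a′).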
Q learning in code...
public class MyAgent implements RLAgent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    double Qnow   = _policy.get(start).get(act);
    // findMaxQ() assumes a helper that returns the max Q over all actions
    // at that state (see the Policy refactoring on the next slide)
    double Qnext  = _policy.get(end).findMaxQ();
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.get(start).put(act, Qrevised);
  }
}
Refactoring policy Could probably make the agent simpler by moving the policy out to a different object:
public class Policy {  // method signatures only; bodies omitted
  public double getQvalue(State2d s, Action a);
  public Action pickAction(State2d s);
  public double getQMax(State2d s);
  public void setQvalue(State2d s, Action a, double d);
}
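With the policy pulled out, the Q-learning update from the previous slide might look like the sketch below. This assumes a concrete Policy implementation with the signatures above, plus the getAlpha()/getGamma() helpers from the earlier slide; it is one possible design, not the required one.

public class MyAgent implements RLAgent {
  private final Policy _policy = new Policy();

  public void updateModel(SARSTuple s) {
    double Qnow  = _policy.getQvalue(s.getInitState(), s.getAction());
    double Qnext = _policy.getQMax(s.getNextState());
    double Qrevised = Qnow
        + getAlpha() * (s.getReward() + getGamma() * Qnext - Qnow);
    _policy.setQvalue(s.getInitState(), s.getAction(), Qrevised);
  }

  public Action pickAction(State2d here) {
    // All of the argmax / default-action bookkeeping now lives inside Policy
    return _policy.pickAction(here);
  }
}

The agent shrinks to the learning rule itself; the map-of-maps bookkeeping, default actions, and max-over-actions search are hidden behind the Policy API.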