Odds & Ends

Administrivia
- Reminder: Q3 Nov 10
- CS outreach: UNM SOE holding open house for HS seniors
  - Want CS dept participation
  - We want to show off the coolest things in CS
  - Come demo your P1 and P2 code!
  - Contact me or Lynne Jacobson

The bird of time...
- Last time:
  - Eligibility traces
  - The SARSA(λ) algorithm
  - Design exercise
- This time:
  - Tip o’ the day
  - Notes on exploration
  - Design exercise, cont’d

Tip o’ the day
- Micro-experiments
- Often, often, often when hacking:
  - “How the heck does that function work?”
  - “The docs don’t say what happens when you hand null to the constructor...”
  - “Uhhh... Will this work if I do it this way?”
  - “WTF does that mean?”
- Could spend a bunch of time in the docs
- Or... could just go and try it

Tip o’ the day
- Answer: micro-experiments
- Write a very small (<50 line) test program to make sure you understand what the thing does (see the sketch below)
  - Think: homework assignment from CS152
- Quick to write
- Answers the question better than the docs can
- Builds your intuition about what the machine is doing
- Using the debugger to watch is also good
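For instance, a micro-experiment for the second question above (“what happens when you hand null to the constructor?”) might look like the sketch below. This is only an illustration: ArrayList stands in for whatever class you are actually unsure about, and NullCtorExperiment is a made-up name, not anything from the course code.

import java.util.ArrayList;
import java.util.Collection;

// Micro-experiment: what does ArrayList's copy constructor do with a null Collection?
// Small enough that running it is faster than digging through the docs.
public class NullCtorExperiment {
    public static void main(String[] args) {
        Collection<String> nothing = null;
        try {
            ArrayList<String> list = new ArrayList<String>(nothing);
            System.out.println("No exception; size = " + list.size());
        } catch (Exception e) {
            System.out.println("Constructor threw: " + e);
        }
    }
}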

Action selection in RL

Q learning in code...

public class MyAgent implements Agent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end   = s.getNextState();
        Action act    = s.getAction();
        double r      = s.getReward();
        // Q-learning target: the greedy (best-valued) action at the next state
        Action nextAct = _policy.argmaxAct(end);
        double Qnow  = _policy.get(start, act);
        double Qnext = _policy.get(end, nextAct);
        // one-step update toward r + gamma * max_a' Q(s',a')
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
        _policy.set(start, act, Qrevised);
    }
}

The SARSA(λ) code

public class SARSAlAgent implements Agent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end   = s.getNextState();
        Action act    = s.getAction();
        double r      = s.getReward();
        // SARSA target: the action the agent will actually pick next (on-policy)
        Action nextAct = pickAction(end);
        double Qnow  = _policy.get(start, act);
        double Qnext = _policy.get(end, nextAct);
        double delta = r + _gamma * Qnext - Qnow;
        // bump the eligibility of the pair we just visited
        setElig(start, act, getElig(start, act) + 1.0);
        // spread the TD error over all eligible pairs, then decay their traces
        for (SAPair p : getEligiblePairs()) {
            double currQ = _policy.get(p.getS(), p.getA());
            _policy.set(p.getS(), p.getA(),
                        currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
            setElig(p.getS(), p.getA(),
                    getElig(p.getS(), p.getA()) * _gamma * _lambda);
        }
    }
}

Q & SARSA(λ): Key diffs
- Use of eligibility traces
  - Q updates a single step of history
  - SARSA(λ) keeps a record of visited state/action pairs: e(s,a)
    - Updates each Q(s,a) value in proportion to e(s,a)
    - Decays e(s,a) by γλ each step (e.g., with γ=0.9 and λ=0.8, a pair last visited three steps back still carries roughly (0.72)^3 ≈ 0.37 of the full update weight)

Q & SARSA(λ): Key diffs
- How the “next state” action is picked
  - Q: nextAct=_policy.argmaxAct(end)
    - Picks the “best” action at the next state
  - SARSA: nextAct=RLAgent.pickAction(end)
    - Picks the action the agent would actually take at the next state
- Huh? What’s the difference?

Exploration vs. exploitation
- Sometimes the agent wants to do something other than the “best currently known action”
- Why? If the agent never tries anything new, it may never discover that there’s a better answer out there...
- Called the “exploration vs. exploitation” tradeoff
  - Is it better to “explore” to find new stuff, or to “exploit” what you already know?

ε-Greedy exploration
- Answer: “Most of the time” do the best known thing
  - act = argmax_a( Q(s,a) )
- “Rarely” try something random
  - act = pickAtRandom(allActionSet)
- ε-greedy exploration policies:
  - “rarely” == prob ε
  - “most of the time” == prob 1-ε

ε-Greedy in code

public class eGreedyAgent implements RLAgent {
    // implements the e-greedy exploration policy
    public Action pickAction(State2d s) {
        final double rVal = _rand.nextDouble();
        if (rVal < _epsilon) {
            // explore: with probability epsilon, pick a uniformly random action
            return randPick(_ASet);
        }
        // exploit: otherwise take the greedy action under the current Q estimates
        return _policy.argmaxAct(s);
    }

    private final Set<Action> _ASet;
    private final double _epsilon;
}

Design Exercise: Experimental Rig

Design exercise
- For M4/Rollout, need to be able to:
  - Train the agent for many trials/steps per trial
  - Generate learning curves for the agent’s learning:
    - Run some trials w/ learning turned on
    - Freeze learning
    - Run some trials w/ learning turned off
    - Average steps-to-goal over those trials
    - Save the average as one point in the curve
- Design: objects/methods to support this learning framework (one possible sketch follows below)
- Support: different learning algs, different environments, different params, variable # of trials/steps, etc.
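One possible shape for that rig, sketched below purely as a starting point: an ExperimentRig that alternates training runs with frozen evaluation runs and records the averaged steps-to-goal as one point per cycle on a learning curve. The RigAgent interface, its methods (setLearning, startTrial, runTrial), and the ExperimentRig class are all hypothetical names, not part of the M4 spec; the learning algorithm and environment are assumed to live behind that interface so the same rig can drive different algorithms, environments, and parameter settings.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an experimental rig for M4/Rollout.
// RigAgent is an assumed interface; substitute the course's real Agent/Environment types.
interface RigAgent {
    void setLearning(boolean on);     // freeze / unfreeze learning
    void startTrial();                // reset per-trial state
    int runTrial(int maxSteps);       // run one trial, return steps taken to reach the goal
}

public class ExperimentRig {
    private final RigAgent agent;
    private final int trialsPerPoint;     // evaluation trials averaged into one curve point
    private final int maxStepsPerTrial;

    public ExperimentRig(RigAgent agent, int trialsPerPoint, int maxStepsPerTrial) {
        this.agent = agent;
        this.trialsPerPoint = trialsPerPoint;
        this.maxStepsPerTrial = maxStepsPerTrial;
    }

    // Train for trainTrials, then evaluate with learning frozen;
    // the average steps-to-goal over the evaluation trials is one curve point.
    public double onePoint(int trainTrials) {
        agent.setLearning(true);
        for (int i = 0; i < trainTrials; i++) {
            agent.startTrial();
            agent.runTrial(maxStepsPerTrial);
        }
        agent.setLearning(false);
        double total = 0.0;
        for (int i = 0; i < trialsPerPoint; i++) {
            agent.startTrial();
            total += agent.runTrial(maxStepsPerTrial);
        }
        return total / trialsPerPoint;
    }

    // Full learning curve: repeat the train/evaluate cycle numPoints times.
    public List<Double> learningCurve(int numPoints, int trainTrialsPerPoint) {
        List<Double> curve = new ArrayList<Double>();
        for (int i = 0; i < numPoints; i++) {
            curve.add(onePoint(trainTrialsPerPoint));
        }
        return curve;
    }
}

The main design choice in this sketch is that freezing and unfreezing learning is the agent's job, so the rig never needs to know which learning algorithm (or environment) it is measuring.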