Q learning cont’d & other stuff A day of miscellany.

Definition o’ the day
Method: A trick that you use more than once.

Administrivia
Midterm back today
P2M1 back tomorrow (fingers crossed)
P2M3 due on Thurs
P2 Rollout Nov 8
Q3 Nov 10: threads & synchronization. Know:
  What a thread is
  What synchronization is for / why you need it
  How you do synchronization in Java

Yesterday & today
Last time: Midterm exam
Before that: Q-learning algorithm
Today:
  Midterm back
  Midterm discussion/design
  Factory method design pattern
  Design principle o’ the day
  More notes on Q learning

Midterm
Graded and back: μ = 54, σ = 15, median = 58

Design principle o’ the day
Use polymorphism instead of tests.
“In an OO language, if you’re writing a switch statement, 80% of the time, you’re doing the wrong thing.”
The same goes for if: many (not all) if statements can be avoided through careful use of polymorphism.
Instead of testing data, make each data object know what to do with itself.

Polymorphism vs tests
Bad old procedural programming way:
    Vec2d v = compute();
    if (v.type == VT_BLUE) {
        // do blue thing
    } else {
        // do red thing
    }
Good, shiny new OO way:
    Vec2d v = processorObject.compute();
    v.doYourThing();
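A minimal, self-contained sketch of the “good, shiny new OO way”. The names BlueVec, RedVec and the String return value of doYourThing() are hypothetical, and Vec2d is modeled here as an interface rather than a class with a type field (an assumption). Each subtype carries its own behavior, so no if or switch on a type tag is needed at the call site.

```java
import java.util.List;

// Hypothetical: Vec2d as an interface; each implementation knows "what to do with itself".
interface Vec2d {
    String doYourThing();
}

class BlueVec implements Vec2d {
    public String doYourThing() { return "blue thing"; }
}

class RedVec implements Vec2d {
    public String doYourThing() { return "red thing"; }
}

public class PolymorphismDemo {
    public static void main(String[] args) {
        List<Vec2d> vecs = List.of(new BlueVec(), new RedVec());
        for (Vec2d v : vecs) {
            // Dynamic dispatch picks the right behavior; no test on v's type.
            System.out.println(v.doYourThing());
        }
    }
}
```

Adding, say, a green vector would mean adding one new class instead of editing every if/else that tests the type.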

Polymorphism case study
Current research code that I’m writing (similar to P2, except for continuous worlds). As of this writing:
  2,036 lines of code (incl. comments)
  40 classes
  4 packages
  0 switch statements
  40 occurrences of the string “if” (1.9%)

A closer look
Of those 40 occurrences of “if”:
  6 in comments
  4 in .equals()
  down casts: if (o instanceof TypeBlah)
  in a single method testing intersection of line segments: very tricky; lots of mathematical special cases
Beyond that... only 6 if statements in 1,906 lines of code == 0.3% of the code...
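The .equals() case is one place where a type test is hard to avoid: equals(Object) receives an arbitrary Object and must check its type before downcasting. A minimal sketch using a hypothetical GridPos value class (not from the slides):

```java
import java.util.Objects;

// Hypothetical value class: an integer grid position.
final class GridPos {
    private final int x;
    private final int y;

    GridPos(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;                // same object
        if (!(o instanceof GridPos)) return false; // type test before the downcast
        GridPos other = (GridPos) o;               // the down cast
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // keep hashCode consistent with equals
    }
}

public class EqualsDemo {
    public static void main(String[] args) {
        System.out.println(new GridPos(1, 2).equals(new GridPos(1, 2))); // true
        System.out.println(new GridPos(1, 2).equals("(1, 2)"));          // false
    }
}
```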

Q-learning, cont’d

Review: Q functions
Implicit policy uses the idea of a “Q function”: $Q : S \times A \to \mathbb{R}$
For each action at each state, Q says how good or bad that action is.
If $Q(s_i, a_1) > Q(s_i, a_2)$, then $a_1$ is a “better” action than $a_2$ at state $s_i$.
Represented in code with a Map: a mapping from an Action to the value (Q) of that Action.

“Where should I go now?”
At state $s_{29}$:
  $Q(s_{29}, \text{FWD}) = 2.38$
  $Q(s_{29}, \text{BACK}) = 1.79$
  $Q(s_{29}, \text{TURNCLOCK}) = 3.49$
  $Q(s_{29}, \text{TURNCC}) = 0.74$
  $Q(s_{29}, \text{NOOP}) = 2.03$
⇒ “Best thing to do is turn clockwise”
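Picking the best action is an argmax over one state’s row of the Q table. A minimal sketch, assuming a hypothetical Action enum named after the slide’s actions and a Map from Action to its Q value, loaded with the $s_{29}$ numbers:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical enum named after the slide's actions.
enum Action { FWD, BACK, TURNCLOCK, TURNCC, NOOP }

public class GreedyChoice {
    // Returns the action with the largest Q value in this row (argmax over a of Q(s, a)).
    // Returns null for an empty row.
    static Action bestAction(Map<Action, Double> qRow) {
        Action best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : qRow.entrySet()) {
            if (e.getValue() > bestQ) {
                bestQ = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Action, Double> s29 = new EnumMap<>(Action.class);
        s29.put(Action.FWD, 2.38);
        s29.put(Action.BACK, 1.79);
        s29.put(Action.TURNCLOCK, 3.49);
        s29.put(Action.TURNCC, 0.74);
        s29.put(Action.NOOP, 2.03);
        System.out.println(bestAction(s29)); // prints TURNCLOCK
    }
}
```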

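Q learning in math...
The Q-learning rule says: move the current Q value a fraction α of the way toward the reward plus the discounted Q value of the next state:
$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma Q(s',a') - Q(s,a)\right)$
where a’ is the best currently known action at s’.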
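A worked instance of the rule with made-up numbers: α = 0.5, γ = 0.9, current Q(s,a) = 2.0, reward r = 1, and Q(s’,a’) = 3.49 (borrowing the TURNCLOCK value from the $s_{29}$ example as if it were the best action at s’):

$$
\begin{aligned}
r + \gamma Q(s',a') &= 1 + 0.9 \times 3.49 = 4.141 \\
r + \gamma Q(s',a') - Q(s,a) &= 4.141 - 2.0 = 2.141 \\
Q(s,a) &\leftarrow 2.0 + 0.5 \times 2.141 = 3.0705
\end{aligned}
$$

With α = 0.5 the estimate moves exactly halfway from its old value (2.0) toward the one-step target (4.141).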

Q learning in code...
    public class MyAgent implements Agent {
        // _policy, getAlpha() and getGamma() are not shown on the slide
        public void updateModel(SARSTuple s) {
            State2d start = s.getInitState();   // state s
            State2d end = s.getNextState();     // state s'
            Action act = s.getAction();         // action a
            double r = s.getReward();           // reward r
            double Qnow = _policy.get(start).get(act);   // Q(s,a)
            double Qnext = _policy.get(end).findMaxQ();  // Q(s',a'), a' = best known action at s'
            double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
            _policy.get(start).put(act, Qrevised);
        }
    }
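To run the same update outside the course framework, here is a minimal, self-contained sketch. It is not the MyAgent code: it assumes String states and actions and a nested HashMap for the Q table (all names hypothetical), treats unseen (s, a) pairs as Q = 0, and uses the max over a’ at s’ the way findMaxQ() does.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal tabular Q-learning sketch. States and actions are Strings here
// (an assumption; the slide's State2d, Action and SARSTuple types are not reproduced).
public class TabularQ {
    private final Map<String, Map<String, Double>> q = new HashMap<>();
    private final double alpha; // learning rate
    private final double gamma; // discount factor

    public TabularQ(double alpha, double gamma) {
        this.alpha = alpha;
        this.gamma = gamma;
    }

    // Q(s, a); unseen pairs default to 0.0
    public double get(String s, String a) {
        Map<String, Double> row = q.get(s);
        return row == null ? 0.0 : row.getOrDefault(a, 0.0);
    }

    public void set(String s, String a, double value) {
        q.computeIfAbsent(s, k -> new HashMap<>()).put(a, value);
    }

    // max over a' of Q(s', a'); 0.0 if nothing is known about s' yet
    public double maxQ(String s) {
        Map<String, Double> row = q.get(s);
        if (row == null || row.isEmpty()) return 0.0;
        double best = Double.NEGATIVE_INFINITY;
        for (double v : row.values()) best = Math.max(best, v);
        return best;
    }

    // One Q-learning backup for the experience (s, a, r, s')
    public void update(String s, String a, double r, String sNext) {
        double qNow = get(s, a);
        double qNext = maxQ(sNext);
        set(s, a, qNow + alpha * (r + gamma * qNext - qNow));
    }

    public static void main(String[] args) {
        TabularQ agent = new TabularQ(0.5, 0.75);
        agent.set("s1", "FWD", 4.0);
        agent.update("s0", "FWD", 1.0, "s1");
        System.out.println(agent.get("s0", "FWD")); // 2.0
    }
}
```

In the usage in main, the target is 1 + 0.75 × 4 = 4, so Q(s0, FWD) moves halfway from 0 to 2.0.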

Advice on params
Q-learning requires 2 parameters:
  α: “learning rate”. Good range for α:
  γ: “discount factor”. Good range for γ:

What’s going on?
Agent begins one step at state s and examines the Q value for each action.
Agent takes action a, ends up at s’, and gets reward r.
It now wants to revise Q(s,a) at the start state.
For that it needs a Q value for some action at the end state, s’: pick the best currently known action at s’, call it a’.
Set $Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma Q(s',a') - Q(s,a)\right)$

Why does it work?
Won’t give a full explanation here.
Basic intuition: each step of experience “backs up” reward from the goal state toward the beginning.
[Figure: a chain of states ending at the goal state; r = 0 on ordinary steps, r = +5 on reaching the goal; each step “backs up” a chunk of r and Q to the previous state.]
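The “backing up” picture can be reproduced with a toy simulation. This sketch assumes a hypothetical deterministic chain: states 0..3 leading to a goal state 4, one action per state, r = 0 on ordinary steps and r = +5 on entering the goal, with α = 0.5 and γ = 0.9. After the first episode only state 3 has a nonzero Q value; each later episode pushes value one state further back toward state 0.

```java
import java.util.Arrays;

// Toy chain (hypothetical): states 0..3, goal state 4, a single "forward" action.
public class BackupDemo {
    static final int GOAL = 4;

    // One episode from state 0 to the goal, applying a Q-learning backup at each step.
    static void runEpisode(double[] q, double alpha, double gamma) {
        for (int s = 0; s < GOAL; s++) {
            int next = s + 1;
            double r = (next == GOAL) ? 5.0 : 0.0;        // +5 only on entering the goal
            double qNext = (next == GOAL) ? 0.0 : q[next]; // the goal is terminal: no future value
            q[s] += alpha * (r + gamma * qNext - q[s]);
        }
    }

    public static void main(String[] args) {
        double[] q = new double[GOAL]; // one action per state, so q[s] stands for Q(s, FWD)
        for (int ep = 1; ep <= 4; ep++) {
            runEpisode(q, 0.5, 0.9);
            System.out.println("after episode " + ep + ": " + Arrays.toString(q));
        }
    }
}
```

Over many episodes q[s] approaches 5·γ^(3−s): 5 at state 3 and 5 × 0.9³ = 3.645 at state 0.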