RL 2: It’s 2:00 AM. Do you know where your mouse is?

First up: Vote!
- Albuquerque Municipal Election today (Oct 4)
- Not all of you are eligible to vote, I know
- But if you are, you should. Educate yourself first!
  - Mayor
  - City councilors
  - Bonds (what will ABQ spend its money on?)
  - Propositions (election finance, min wage, voter ID)
- Polls close at 7:00 PM today...

Voting resources
- City of Albuquerque web site
- League of Women Voters web site

News o’ the day
- Wall Street Journal reports: “Microsoft Windows Officially Broken”
- In 2004, MS Longhorn (successor to XP) bogged down
- Whole code base had to be scrapped & started afresh ⇒ Vista
- Point: not MS bashing (much), but the importance of software process
- MS moved to a more agile process for Vista:
  - Test first
  - Rigorous regression testing
  - Better coding infrastructure

Administrivia
- Grading: P1 rollout grading finished; I will send grade reports this afternoon & tomorrow morning
- Prof Lane out of town Oct 11; Andree Jacobsen will cover
- Stefano Markidis out of town Oct 19
- Will announce new office hours presently

Your place in History
- Last time:
  - Q2
  - Introduction to Reinforcement Learning (RL)

Your place in History
- This time:
  - ✓ P2M1 due
  - ✓ Voting
  - ✓ News
  - ✓ Administrivia
  - ✓ Q&A
  - More on RL
  - Design exercise: WorldSimulator and Terrains

Recall: Mack & his maze
- Mack lives a hard life as a psychology test subject
- Has to run around mazes all day, finding food and avoiding electric shocks
- Needs to know how to find cheese quickly, while getting shocked as little as possible
- Q: How can Mack learn to find his way around?

Reward over time
[Figure: state-transition diagram over states s1 through s11, showing the paths an agent might take]

Reward over time
[Figure: the same state diagram, highlighting the trajectory s1 → s4 → s11 → s10 → ...]
V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...

Reward over time
[Figure: the same state diagram, highlighting a different trajectory s1 → s2 → s6 → ...]
V(s1) = R(s1) + R(s2) + R(s6) + ...

Where can you go?
- Definition: the complete set of all states the agent could be in is called the state space: S
- Could be discrete or continuous
- For Project 2: states are discrete
- Q: What is the state space for P2?
- Size of state space: |S|
- Q: How big is the state space for P2?

Where can you go?
- Definition: the complete set of actions an agent could take is called the action space: A
- Again, discrete or continuous
- Again, for P2: A is discrete
- Q: What is A for P2? Size?
- Again, size: |A|
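
To make |S| and |A| concrete, here is a minimal Java sketch of a discrete state space and action space for a small grid world. All names (Spaces, GridState, Move, stateSpace) are illustrative assumptions, not the actual P2 classes.

```java
import java.util.ArrayList;
import java.util.List;

public class Spaces {
    // Action space A: four compass moves, so |A| = 4.
    enum Move { NORTH, SOUTH, EAST, WEST }

    // One state: an (x, y) cell in a width-by-height grid.
    record GridState(int x, int y) {}

    // Enumerate the full state space S; |S| = width * height.
    static List<GridState> stateSpace(int width, int height) {
        List<GridState> s = new ArrayList<>();
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                s.add(new GridState(x, y));
        return s;
    }

    public static void main(String[] args) {
        System.out.println("|S| = " + stateSpace(10, 10).size()); // 100
        System.out.println("|A| = " + Move.values().length);      // 4
    }
}
```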

What is it worth to you?
- Idea of “good” and “bad” places to go
- Quantified as “rewards” (this is where the term “reinforcement learning” comes from; it originated in psychology)
- Formally: R : S → Reals
- R(s) == reward for getting to state s
- How good or bad it is to reach state s
- Larger (more positive) is better
- Agent “wants” to get more positive reward
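
A reward function is just a mapping from states to numbers. Below is a hedged sketch in the same toy grid world; the cheese/shock cells and their values are invented for illustration, not taken from P2.

```java
import java.util.function.ToDoubleFunction;

public class RewardDemo {
    record GridState(int x, int y) {}

    public static void main(String[] args) {
        // R : S -> reals. +10 for reaching the (made-up) cheese cell,
        // -5 for the shock cell, -1 per step otherwise, which nudges
        // the agent toward short paths.
        ToDoubleFunction<GridState> R = s ->
            s.equals(new GridState(9, 9)) ? 10.0 :
            s.equals(new GridState(3, 4)) ? -5.0 : -1.0;

        System.out.println(R.applyAsDouble(new GridState(9, 9))); // 10.0
    }
}
```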

How does it happen?
- Dynamics of the agent defined by the transition function T : S × A × S → [0,1]
- T(s,a,s′) == Pr[next state is s′ | curr state is s, act a]
- Examples from P2?
- In practice: don’t write T down explicitly; it is encoded by WorldSimulator and the Terrain/agent interactions
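
Because T is a probability distribution, a simulator typically samples a next state rather than tabulating T explicitly. Here is a sketch of that idea with an invented “slip” model (move as commanded with probability 0.8, otherwise in a uniformly random direction); this is an assumption for illustration, not how WorldSimulator is actually written.

```java
import java.util.Random;

public class TransitionDemo {
    record GridState(int x, int y) {}
    enum Move { NORTH, SOUTH, EAST, WEST }

    static final Random rng = new Random(42);

    // Sample s' ~ T(s, a, .): commanded move w.p. 0.8, otherwise a
    // uniformly random direction. (Grid boundary clamping omitted.)
    static GridState step(GridState s, Move a) {
        Move actual = rng.nextDouble() < 0.8
            ? a
            : Move.values()[rng.nextInt(4)];   // slip
        return switch (actual) {
            case NORTH -> new GridState(s.x(), s.y() + 1);
            case SOUTH -> new GridState(s.x(), s.y() - 1);
            case EAST  -> new GridState(s.x() + 1, s.y());
            case WEST  -> new GridState(s.x() - 1, s.y());
        };
    }

    public static void main(String[] args) {
        System.out.println(step(new GridState(5, 5), Move.NORTH));
    }
}
```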

The MDP
- Entire RL environment defined by a Markov decision process: M = 〈S, A, T, R〉
- S: state space
- A: action space
- T: transition function
- R: reward function
- Q: What modules represent these in P2?
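
One plausible way to group the four pieces of M in code is a single interface; this is a sketch under assumed names, not P2’s actual module layout.

```java
import java.util.List;

// M = <S, A, T, R> packaged as one interface (illustrative only).
public interface MDP<S, A> {
    List<S> states();          // S: the state space
    List<A> actions(S state);  // A: actions available in a state
    S sampleNext(S s, A a);    // T: draw s' ~ T(s, a, .)
    double reward(S s);        // R: reward for reaching s
}
```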

Policies
- Total accumulated reward (value, V) depends on:
  - Where agent starts
  - What agent does at each step (duh)
- Plan of action is called a policy, π
- Policy defines what action to take in every state of the system: π : S → A
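
Since π : S → A is just a function from states to actions, it can be represented directly as one. A toy example (hypothetical names again, not P2 code):

```java
import java.util.function.Function;

public class PolicyDemo {
    record GridState(int x, int y) {}
    enum Move { NORTH, SOUTH, EAST, WEST }

    public static void main(String[] args) {
        // pi: always head toward the top-right corner of a 10x10 grid.
        Function<GridState, Move> pi = s ->
            s.x() < 9 ? Move.EAST : Move.NORTH;

        System.out.println(pi.apply(new GridState(2, 7))); // EAST
    }
}
```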

Experience & histories
- Fundamental unit of experience in RL: at time t in some state s_i, take action a_j, get reward r_t, end up in state s_k
- Called an experience tuple or SARSA tuple
- Set of all experience during a single episode up to time T is a history or trajectory: h = 〈s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T〉
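
An experience tuple and a history map naturally onto a small record and a list of records. A sketch with made-up state/action names:

```java
import java.util.ArrayList;
import java.util.List;

public class ExperienceDemo {
    // One unit of experience: in state s, took action a, got reward r,
    // ended up in state sNext.
    record Experience<S, A>(S s, A a, double r, S sNext) {}

    public static void main(String[] args) {
        // A history/trajectory is just the episode's tuples in order.
        List<Experience<String, String>> history = new ArrayList<>();
        history.add(new Experience<>("s1", "east",  -1.0, "s4"));
        history.add(new Experience<>("s4", "north", 10.0, "s11"));
        System.out.println(history);
    }
}
```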

How good is a policy?
- Value is a function of start state and policy: V^π(s_1) = E[ R(s_1) + R(s_2) + R(s_3) + ... ], following π from s_1
- Value measures: how good is policy π, averaged over all time, if the agent starts at state s_1 and runs forever?
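
Because T is stochastic, V^π(s) is usually estimated by averaging returns over many simulated episodes (Monte Carlo rollouts). Below is a self-contained sketch on an invented 1-D corridor world; note it adds a discount factor gamma, which these slides have not introduced yet, as a standard way to keep long-horizon sums bounded.

```java
import java.util.Random;

public class ValueDemo {
    static final Random rng = new Random(0);

    // Toy world: states 0..10; pi always moves right; the move
    // succeeds w.p. 0.8, otherwise the agent stays put.
    static int step(int s) {
        return (rng.nextDouble() < 0.8) ? Math.min(s + 1, 10) : s;
    }

    // +10 for reaching the goal state 10, -1 per step otherwise.
    static double reward(int s) { return s == 10 ? 10.0 : -1.0; }

    // Monte Carlo estimate of V^pi(start): average discounted return
    // over many rollouts.
    static double estimateV(int start, double gamma, int rollouts) {
        double total = 0.0;
        for (int i = 0; i < rollouts; i++) {
            int s = start;
            double g = 1.0, ret = 0.0;
            while (s != 10) {          // episode ends at the goal
                s = step(s);
                ret += g * reward(s);
                g *= gamma;
            }
            total += ret;
        }
        return total / rollouts;
    }

    public static void main(String[] args) {
        System.out.println("V^pi(0) ~= " + estimateV(0, 0.95, 10_000));
    }
}
```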

The goal of RL
- Agent’s goal: find the best possible policy, π*
- Find the policy π* that maximizes V^π(s) for all s: π* = argmax_π V^π(s), ∀s

Design Exercise: WorldSimulator & Friends

Design exercise
- Q1: Design the act() method in WorldSimulator
  - What objects does it need to access?
  - How can it take different terrains/agents into account?
- Q2: GridWorld2d could be really large, yet most of the terrain tiles are the same everywhere
  - How can you avoid millions of copies of the same tile? (One possible approach is sketched below.)
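
For Q2, one classic answer (offered as a suggestion, not the official solution) is the Flyweight pattern: keep a single immutable object per terrain type and let every grid cell reference it. The class names below are hypothetical, not P2’s.

```java
import java.util.HashMap;
import java.util.Map;

public class TerrainFactory {
    // Immutable, so one instance can safely back millions of cells.
    record Terrain(String name, double moveCost) {}

    // Cache keyed by terrain-type name; at most one Terrain per type.
    private static final Map<String, Terrain> cache = new HashMap<>();

    static Terrain get(String name, double moveCost) {
        return cache.computeIfAbsent(name, n -> new Terrain(n, moveCost));
    }

    public static void main(String[] args) {
        Terrain a = get("grass", 1.0);
        Terrain b = get("grass", 1.0);
        System.out.println(a == b);  // true: the same shared instance
    }
}
```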