
Reinforcement Learning (for real this time)

Administrivia
No noose is good noose (Crazy weather?)

Historical perspective
Last time: P2 handout; discussion of P2
This time: questions/answers -- generics, etc.; Q2 (for real this time!); intro to RL

Meet Mack the Mouse*
Mack lives a hard life as a psychology test subject
He has to run around mazes all day, finding food and avoiding electric shocks
He needs to know how to find cheese quickly, while getting shocked as little as possible
Q: How can Mack learn to find his way around?
* Mickey is still copyrighted

Start with an easy case
A very simple maze:
Whenever Mack goes left, he gets cheese
Whenever he goes right, he gets shocked
After the reward/punishment, he's reset back to the start of the maze
Q: How can Mack learn to act well in this world?
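To make the setup concrete, here is a minimal sketch of the one-shot maze as code. The class name SimpleMaze and its step method are invented for illustration and do not come from the slides; the only behavior assumed is the one described above (left gives cheese, right gives a shock, then reset).

```python
# A minimal sketch of the one-shot maze: one action per episode, then a reset.
# SimpleMaze and step() are made-up names for illustration only.
class SimpleMaze:
    def step(self, action):
        """Return the outcome of a single action; the maze then resets to the start."""
        if action == "left":
            return "cheese"
        if action == "right":
            return "shock"
        raise ValueError(f"unknown action: {action!r}")
```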

Learning in the easy case
Say there are two labels: "cheese" and "shock"
Mack tries a bunch of trials in the world -- that generates a bunch of experiences
Now what?
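One plausible way to represent those experiences is as a tally of how often each action led to each outcome. The sketch below is hypothetical (collect_experience is an invented helper, and actions are chosen at random just to generate data); it runs a batch of trials against the SimpleMaze sketch above.

```python
import random
from collections import Counter, defaultdict

def collect_experience(env, n_trials=100):
    """Run random trials and count how often each action produced each outcome."""
    counts = defaultdict(Counter)
    for _ in range(n_trials):
        action = random.choice(["left", "right"])
        outcome = env.step(action)
        counts[action][outcome] += 1
    return counts

# After ~100 trials: counts["left"] is roughly Counter({"cheese": 50})
#                    counts["right"] is roughly Counter({"shock": 50})
```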

But what to do?
So we know that Mack can learn a mapping from actions to outcomes
But what should Mack do in any given situation? What action should he take at any given time?
Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?

Reward functions
In general, we think of a reward function: R() tells us whether Mack thinks a particular outcome is good or bad
Mack before drugs: R(cheese) = +1, R(shock) = -1
Mack after drugs: R(cheese) = -1, R(shock) = +1
Behavior always depends on rewards (utilities)
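In code, a reward function this small can just be a lookup table. The numeric values ±1 are the ones from the slide; the variable and helper names are invented for this sketch.

```python
# Reward tables for Mack before and after the drug study (values from the slide).
R_before_drugs = {"cheese": +1, "shock": -1}
R_after_drugs  = {"cheese": -1, "shock": +1}

def reward(outcome, R):
    """Score a single outcome under a given reward function."""
    return R[outcome]
```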

Maximizing reward
So Mack wants to get the maximum possible reward (whatever that means to him)
For a one-shot case like this, this is fairly easy
Now what about a harder case?
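In the one-shot case, "maximizing reward" just means picking the action whose observed outcomes score best under R. Here is a self-contained sketch; the helper name best_action and the hand-made tallies (standing in for the trial data) are assumptions, not from the slides.

```python
from collections import Counter

def best_action(counts, R):
    """Pick the action whose observed outcomes have the highest average reward."""
    def expected_reward(action):
        outcomes = counts[action]
        total = sum(outcomes.values())
        return sum(R[o] * n for o, n in outcomes.items()) / total
    return max(counts, key=expected_reward)

# Hand-made tallies, roughly what 100 random trials in the maze would produce:
counts = {"left": Counter({"cheese": 48}), "right": Counter({"shock": 52})}
print(best_action(counts, {"cheese": +1, "shock": -1}))  # 'left'  (before drugs)
print(best_action(counts, {"cheese": -1, "shock": +1}))  # 'right' (after drugs)
```

Note that the learned action-to-outcome model is identical in both calls; only the reward function changes, and that alone flips the chosen behavior.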

Reward over time
In general: the agent can be in a state s_i at any time t
It can choose an action a_j to take in that state
A reward can be associated with a state, R(s_i), or with a state/action transition, R(s_i, a_j)
A series of actions leads to a series of rewards:
(s_1, a_1) → s_3: R(s_3); (s_3, a_7) → s_14: R(s_14); ...

Reward over time
[Figure: tree of possible trajectories from state s_1, branching through states s_2 ... s_11]

Reward over time
[Figure: the same trajectory tree, following the path s_1 → s_4 → s_11 → s_10]
V(s_1) = R(s_1) + R(s_4) + R(s_11) + R(s_10) + ...

Reward over time
[Figure: the same trajectory tree, following a different path s_1 → s_2 → s_6 → ...]
V(s_1) = R(s_1) + R(s_2) + R(s_6) + ...
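The two V(s_1) equations above are just sums of rewards along whichever trajectory the agent actually follows. A tiny worked sketch, with state names and reward values invented purely for illustration and no discounting, as on the slides:

```python
def trajectory_value(states, R):
    """Undiscounted value of a finite trajectory: the sum of rewards along it."""
    return sum(R[s] for s in states)

# Invented rewards: small step costs, a big payoff in s_10, a shock in s_6.
R = {"s1": 0.0, "s2": -0.04, "s4": -0.04, "s6": -1.0, "s10": +1.0, "s11": -0.04}
print(trajectory_value(["s1", "s4", "s11", "s10"], R))  # ~0.92
print(trajectory_value(["s1", "s2", "s6"], R))          # ~-1.04
```

Different action choices from s_1 lead to different trajectories and therefore different totals, which is exactly why the agent needs to reason about which actions to take, not just which single outcome it prefers.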