RL at Last! Q-learning and buddies

Administrivia: R3 due today; class discussion. Project proposals back (mostly): only if you gave me paper; e-copies yet to be read (I warned you).

Proposal analysis: overall, excellent job! Congrats to all! In general, better than previous semesters. You do not need to revise them, but do pay attention to my comments; I'm available for questions. Overall scores (to date): Writing & Clarity (W&C): 7/10 ± 1.3; Background & Context (B&C): 7.9/10 ± 1.1; Research Plan: 7.9/10 ± 0.6.

Reminders: an agent acting in a Markov decision process (MDP), M = 〈S, A, T, R〉; e.g., a robot in a maze, an airplane, etc. Fully observable, finite state and action spaces, finite history, bounded rewards. Last time: planning given a known M. Policy evaluation: find the value V^π of a fixed policy π. Policy iteration: find the best policy, π*.
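To make the notation concrete, here is a minimal sketch (not from the slides) of a tabular MDP M = 〈S, A, T, R〉; the field names and data layout are illustrative choices, nothing more.

```python
# Hypothetical tabular representation of an MDP M = <S, A, T, R>.
# States and actions are integer indices; T[s][a] maps successor states to
# probabilities; R[s][a] is the expected immediate reward; gamma is the discount.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TabularMDP:
    n_states: int                            # |S|
    n_actions: int                           # |A|
    T: List[List[Dict[int, float]]]          # T[s][a][s'] = Pr(s' | s, a)
    R: List[List[float]]                     # R[s][a] = expected immediate reward
    gamma: float                             # discount factor, 0 <= gamma < 1
```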

Q: a key operative. The critical step in policy iteration is π'(s_i) = argmax_{a∈A} Σ_j T(s_i, a, s_j) V(s_j). It asks: "What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?" This is such an often-used operation that it gets a special name. Definition: the Q function is Q^π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V^π(s'). Policy iteration says: "Figure out Q, act greedily according to Q, then update Q and repeat, until you can't do any better..."
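A small sketch of that one-step lookahead in Python, using the tabular T/R layout from the MDP sketch above; the function names are hypothetical, and the lookahead includes the immediate reward and discount, matching the Q definition.

```python
from typing import Dict, List

# Sketch of the one-step lookahead that defines Q:
#   Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
def q_from_value(T: List[List[Dict[int, float]]], R: List[List[float]],
                 gamma: float, V: List[float], s: int, a: int) -> float:
    expected_next = sum(p * V[s2] for s2, p in T[s][a].items())
    return R[s][a] + gamma * expected_next

# Greedy improvement step used by policy iteration: pi'(s) = argmax_a Q(s, a).
def greedy_improvement(T, R, gamma: float, V: List[float], n_actions: int) -> List[int]:
    return [max(range(n_actions), key=lambda a: q_from_value(T, R, gamma, V, s, a))
            for s in range(len(V))]
```

Policy iteration then alternates between evaluating the current policy to get V and calling the greedy improvement step until the policy stops changing.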

What to do with Q. Can think of Q as a big table: one entry for each state/action pair. "If I'm in state s and take action a, this is my expected discounted reward..." A "one-step" exploration: "In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?" Can get V and π from Q: V(s) = max_a Q(s, a) and π(s) = argmax_a Q(s, a).
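As a sketch, assuming the Q table is stored as an (n_states × n_actions) NumPy array (an assumption; the slides just say "a big table"):

```python
import numpy as np

# Recover the value function and the greedy policy from a tabular Q.
def value_from_q(Q: np.ndarray) -> np.ndarray:
    return Q.max(axis=1)        # V(s) = max_a Q(s, a)

def policy_from_q(Q: np.ndarray) -> np.ndarray:
    return Q.argmax(axis=1)     # pi(s) = argmax_a Q(s, a)
```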

Learning with Q. Q and the notion of policy evaluation give us a nice way to do actual learning: use the Q table to represent the policy, and update Q through experience. Every time you see an (s, a, r, s') tuple, update Q. Each (s, a, r, s') example is a sample from T(s, a, s') and from R. With enough samples, you can get a good idea of how the world works, where reward is, etc. Note: we never actually learn T or R; Q encodes everything you need to know about the world.

The Q-learning algorithm

Algorithm: Q_learn
Inputs: state space S; action space A;
        discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
    s = get_current_world_state()
    a = pick_next_action(Q, s)
    (r, s') = act_in_world(a)
    Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
} Until (bored)
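A minimal runnable Python sketch of the same loop. The environment interface (get_current_world_state, act_in_world returning a done flag), the epsilon-greedy pick_next_action, and the parameter defaults are assumptions for illustration, not the lecture's code; the "Until (bored)" condition becomes a fixed episode budget here.

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning. `env` is assumed to expose get_current_world_state()
# and act_in_world(a) -> (reward, next_state, done); `actions` is a list of actions.
def q_learn(env, actions, gamma=0.9, alpha=0.65, epsilon=0.1, n_episodes=400):
    Q = defaultdict(float)                      # Q[(s, a)], initialized to 0

    def pick_next_action(s):
        # epsilon-greedy: one common way to fill in the slide's pick_next_action
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        env.reset()
        done = False
        while not done:
            s = env.get_current_world_state()
            a = pick_next_action(s)
            r, s2, done = env.act_in_world(a)
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the key update step
    return Q
```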

Q-learning in action: 15x15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65.
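A toy environment in that spirit (15x15 grid, reward 1 at the goal, 0 elsewhere); it is a simplification with no interior walls, and its method names match the interface assumed in the q_learn sketch above.

```python
import random

# Toy 15x15 gridworld: reward 1 at the goal, 0 elsewhere, no interior walls.
class GridWorld:
    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self, size=15, goal=(14, 14)):
        self.size, self.goal = size, goal
        self.reset()

    def reset(self):
        # Start each episode at a random cell.
        self.pos = (random.randrange(self.size), random.randrange(self.size))
        return self.pos

    def get_current_world_state(self):
        return self.pos

    def act_in_world(self, action):
        x, y = self.pos
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(x + dx, 0), self.size - 1)   # bump against the boundary
        y = min(max(y + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        return (1.0 if done else 0.0), self.pos, done
```

With these sketches, `Q = q_learn(GridWorld(), GridWorld.ACTIONS, gamma=0.9, alpha=0.65)` mirrors the settings above.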

Q -learning in action Initial policy

Q -learning in action After 20 episodes

Q -learning in action After 30 episodes

Q -learning in action After 100 episodes

Q -learning in action After 150 episodes

Q -learning in action After 200 episodes

Q -learning in action After 250 episodes

Q -learning in action After 300 episodes

Q -learning in action After 350 episodes

Q -learning in action After 400 episodes

Well, it looks good, anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)? Every 10 episodes, "freeze" the policy (turn off learning), measure the average time to goal from a number of starting states, and average over a number of test episodes to iron out noise. Plot the learning curve: #episodes of learning vs. average performance.
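A sketch of that freeze-and-measure protocol, assuming the GridWorld and defaultdict Q-table conventions of the earlier sketches; the test-episode count and step cap are illustrative choices.

```python
# Evaluate the frozen (greedy, non-learning) policy: run it from several random
# starts and report the average number of steps needed to reach the goal.
def average_steps_to_goal(env, Q, actions, n_test_episodes=50, max_steps=1000):
    total = 0
    for _ in range(n_test_episodes):
        env.reset()
        for step in range(1, max_steps + 1):
            s = env.get_current_world_state()
            a = max(actions, key=lambda act: Q[(s, act)])   # greedy, no updates
            _, _, done = env.act_in_world(a)
            if done:
                break
        total += step
    return total / n_test_episodes
```

Calling this every 10 learning episodes and recording the results gives the learning curve.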

Learning performance

Notes on learning performance: after 400 learning episodes, it still hasn't asymptoted. Note: that's ~700,000 steps of experience!!! Q-learning is really, really slow!!! The same holds for many RL methods (sadly). Fixing this is a good research topic... ;-)

Why does this work? There are multiple ways to think of it. The (more nearly) intuitive way: look at the key update step in the Q-learning algorithm, Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a)). I.e., a weighted average between the current Q(s, a) and the sampled one-step estimate r + γ*max_a' Q(s', a').
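Writing the update out makes the weighted-average reading explicit (this rearrangement is implied by the update rule rather than shown in the transcript):

```latex
\begin{align*}
Q(s,a) &\leftarrow Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr) \\
       &= (1-\alpha)\,Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)
\end{align*}
```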

Why does this work? Still... why should that weighted average be the right thing? Compare the update equation with the Bellman equation:
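The equations being compared appeared as images on the original slides; a standard reconstruction of the comparison, using the Bellman optimality equation for Q:

```latex
\begin{align*}
\text{Q-learning update:}\quad & Q(s,a) \leftarrow (1-\alpha)\,Q(s,a)
    + \alpha\bigl(r + \gamma \max_{a'} Q(s',a')\bigr) \\
\text{Bellman equation:}\quad & Q^{*}(s,a) = R(s,a)
    + \gamma \sum_{s'} T(s,a,s')\,\max_{a'} Q^{*}(s',a')
\end{align*}
```

The update replaces the expectation over T with a single sampled next state s', stepped toward with weight α; averaging many such samples pulls Q toward the Bellman fixed point (under the usual step-size and exploration conditions).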
