Model-Free vs. Model-Based RL: Q, SARSA, & E³

Administrivia
Reminder: Office hours tomorrow truncated to 9:00-10:15 AM; can schedule other times if necessary.
Final projects: final presentations Dec 2, 7, 9; 20 min (max) presentations, 3 or 4 per day. Sign up for presentation slots today!

The Q-learning algorithm

Algorithm: Q_learn
Inputs: state space S; action space A;
        discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α * (r + γ * max_a'(Q(s', a')) - Q(s, a))
} Until (bored)
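A minimal Python sketch of this loop. The environment interface (get_current_world_state, act_in_world, a set of actions) mirrors the pseudocode's names but is itself an assumption, as is the choice of pick_next_action (e.g., the ε-greedy sketch shown after the SARSA vs. Q slide below):

```python
from collections import defaultdict

def q_learn(env, pick_next_action, gamma=0.9, alpha=0.1, n_steps=10_000):
    """Tabular Q-learning sketch; Q is a dict keyed by (state, action)."""
    Q = defaultdict(float)                       # Q(s, a), implicitly 0 everywhere
    s = env.get_current_world_state()
    for _ in range(n_steps):                     # "Repeat ... Until (bored)"
        a = pick_next_action(Q, s, env.actions)  # e.g., epsilon-greedy
        r, s_next = env.act_in_world(a)
        best_next = max(Q[(s_next, a2)] for a2 in env.actions)  # max_a' Q(s', a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q
```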

The SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: state space S; action space A;
        discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  Q(s, a) = Q(s, a) + α * (r + γ * Q(s', a') - Q(s, a))
  a = a'; s = s'
} Until (bored)
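The corresponding SARSA sketch, under the same hypothetical interface as the Q-learning sketch; the only substantive change is that the bootstrap term uses the action actually chosen at s' rather than the max:

```python
from collections import defaultdict

def sarsa_learn(env, pick_next_action, gamma=0.9, alpha=0.1, n_steps=10_000):
    """Tabular SARSA sketch (on-policy TD control)."""
    Q = defaultdict(float)
    s = env.get_current_world_state()
    a = pick_next_action(Q, s, env.actions)
    for _ in range(n_steps):
        r, s_next = env.act_in_world(a)
        a_next = pick_next_action(Q, s_next, env.actions)  # action actually taken next
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q
```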

SARSA vs. Q
SARSA and Q-learning are very similar.
SARSA updates Q(s,a) for the policy it's actually executing: it lets the pick_next_action() function choose the action used in the update.
Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q: it uses max_a' to pick the action used in the update, which might differ from the action it actually executes at s'.
In practice: Q-learning will learn the "true" π*, but SARSA will learn about what it's actually doing.
Exploration can get Q-learning in trouble...
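One common choice of pick_next_action for both loops is ε-greedy selection; this is a sketch, not something specified on the slides:

```python
import random

def pick_next_action(Q, s, actions, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else act greedily w.r.t. Q."""
    actions = list(actions)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

With an ε-greedy pick_next_action, SARSA learns the value of the ε-greedy policy it is actually following, while Q-learning still bootstraps from the greedy action, which is exactly the distinction above.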

Radioactive breadcrumbs
We can now define eligibility traces for SARSA. In addition to the Q(s,a) table, keep an e(s,a) table that records the "eligibility" (a real number) of each state/action pair.
At every step (each (s, a, r, s', a') tuple):
  Increment e(s,a) for the current (s,a) pair by 1
  Update all Q(s'',a'') values in proportion to their e(s'',a'')
  Decay all e(s'',a'') by a factor of λγ
Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.

The SARSA(λ)-learning algorithm

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0 <= γ < 1), α (0 <= α < 1), λ (0 <= λ < 1)
Outputs: Q
e(s, a) = 0   // for all s, a
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  δ = r + γ * Q(s', a') - Q(s, a)
  e(s, a) += 1
  foreach (s'', a'') pair in (S × A) {
    Q(s'', a'') = Q(s'', a'') + α * e(s'', a'') * δ
    e(s'', a'') *= λγ
  }
  a = a'; s = s'
} Until (bored)
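A Python sketch of SARSA(λ) with accumulating traces, using the same hypothetical environment interface as the earlier sketches:

```python
from collections import defaultdict

def sarsa_lambda_learn(env, pick_next_action, gamma=0.9, alpha=0.1, lam=0.8,
                       n_steps=10_000):
    """Tabular SARSA(lambda) sketch with accumulating eligibility traces."""
    Q = defaultdict(float)
    e = defaultdict(float)                       # eligibility e(s, a)
    s = env.get_current_world_state()
    a = pick_next_action(Q, s, env.actions)
    for _ in range(n_steps):
        r, s_next = env.act_in_world(a)
        a_next = pick_next_action(Q, s_next, env.actions)
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                         # accumulate on the visited pair
        for sa in list(e):                       # pairs with nonzero eligibility
            Q[sa] += alpha * e[sa] * delta       # credit in proportion to eligibility
            e[sa] *= lam * gamma                 # decay every trace by lambda*gamma
        s, a = s_next, a_next
    return Q
```

Looping only over pairs with nonzero trace is equivalent to the full S × A loop in the pseudocode, since pairs with e = 0 receive a zero update.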

The trail of crumbs
[Figure: eligibility trace left along the agent's trajectory; Sutton & Barto, Sec 7.5]

The trail of crumbs, λ = 0
[Figure: with λ = 0, only the most recently visited state/action pair is eligible; Sutton & Barto, Sec 7.5]

The trail of crumbs
[Figure: Sutton & Barto, Sec 7.5]

Eligibility for a single state/action pair
[Figure: e(s_i, a_j) jumps at the 1st visit, 2nd visit, ..., decaying between visits; Sutton & Barto, Sec 7.5]

Eligibility trace followup
The eligibility trace allows:
  Tracking where the agent has been
  Backup of rewards over longer periods
  Credit assignment: state/action pairs are rewarded for having contributed to getting to the reward
Why does it work?

The “forward view” of eligibility
Original SARSA did a “one-step” backup: the update target for Q(s, a) is r_t + γ·Q(s_{t+1}, a_{t+1}).
[Figure: only r_t and Q(s_{t+1}, a_{t+1}) are backed up into Q(s, a); the rest of the trajectory is ignored]

The “forward view” of eligibility
Original SARSA did a “one-step” backup; we could also do a “two-step” backup, with target r_t + γ·r_{t+1} + γ²·Q(s_{t+2}, a_{t+2}).
[Figure: r_t, r_{t+1}, and Q(s_{t+2}, a_{t+2}) are backed up into Q(s, a)]

The “forward view” of eligibility
Original SARSA did a “one-step” backup; we could also do a “two-step” backup, or even an “n-step” backup, with target
  R_t^(n) = r_t + γ·r_{t+1} + ... + γ^(n-1)·r_{t+n-1} + γ^n·Q(s_{t+n}, a_{t+n})

The “forward view” of eligibility
Small-step backups (n = 1, n = 2, etc.) are slow and nearsighted.
Large-step backups (n = 100, n = 1000, n = ∞) are expensive and may miss near-term effects.
We want a way to combine them: we can take a weighted average of different backups, e.g.:

The “forward view” of eligibility
[Figure: example of a mixed backup, e.g. weighting one n-step return by 1/3 and another by 2/3]

How do you know which number of steps to average over, and what the weights should be? Accumulating eligibility traces are just a clever way to easily average over all n:

The “forward view” of eligibility
[Figure: successive n-step backups weighted geometrically by λ^0, λ^1, λ^2, ..., λ^(n-1); normalized by (1 - λ), this gives the λ-return R_t^λ = (1 - λ) Σ_{n≥1} λ^(n-1) R_t^(n)]
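As an illustration, a small sketch of that weighted average, assuming the n-step returns for a single (s, a) have already been computed and that the last entry is the full return, which receives the leftover weight λ^(N-1):

```python
def lambda_return(n_step_returns, lam):
    """Weighted avg of n-step returns: (1 - lam) * sum_n lam^(n-1) * R^(n),
    with the leftover weight lam^(N-1) on the last (full) return."""
    *partial, full = n_step_returns              # returns for n = 1 .. N
    weighted = sum((1 - lam) * lam**k * r for k, r in enumerate(partial))
    return weighted + lam**len(partial) * full
```

The weights sum to 1, so if every n-step return were identical, the λ-return would equal that common value.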

Replacing traces
The kind just described are accumulating eligibility traces: every time you revisit a state/action pair, you add extra eligibility (+1).
There are also replacing eligibility traces: every time you revisit a state/action pair, you reset e(s,a) to 1.
Replacing traces sometimes work better. Sutton & Barto, Sec 7.8
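The difference between the two kinds of trace is a one-line change in the trace update (sketch; e is a defaultdict(float) as in the SARSA(λ) sketch above):

```python
def bump_trace(e, s, a, replacing=False):
    """Update the eligibility of (s, a) on a visit."""
    if replacing:
        e[(s, a)] = 1.0      # replacing trace: reset to 1 on every visit
    else:
        e[(s, a)] += 1.0     # accumulating trace: add 1 on every visit
```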

Model-free vs. Model-based

What do you know?
Both Q-learning and SARSA(λ) are model-free methods (a.k.a. value-based methods):
  They learn a Q function.
  They never learn T or R explicitly.
  At the end of learning, the agent knows how to act, but doesn't explicitly know anything about the environment.
  Also, there are no guarantees about the explore/exploit tradeoff.
Sometimes we want one or both of the above.

Model-based methods
Model-based methods, on the other hand, do explicitly learn T and R.
At the end of learning, they have the entire model M = 〈S, A, T, R〉, and also have π*.
At least one model-based method also guarantees explore/exploit tradeoff properties.

E³: the Efficient Explore & Exploit algorithm
Kearns & Singh, Machine Learning 49, 2002
Explicitly keeps a T matrix and an R table.
Plans (policy iteration) with the current T & R -> current π.
Every state/action entry in T and R:
  Can be marked known or unknown
  Has a #visits counter, nv(s,a)
After every 〈s, a, r, s'〉 tuple, update T & R (running average).
When nv(s,a) > NVthresh, mark the cell as known & re-plan.
When all states are known, learning is done and we have π*.

The E³ algorithm

Algorithm: E3_learn_sketch   // only an overview
Inputs: S, A, γ (0 <= γ < 1), NVthresh, R_max, Var_max
Outputs: T, R, π*
Initialization:
  R(s) = R_max                     // for all s
  T(s, a, s') = 1/|S|              // for all s, a, s'
  known(s, a) = 0; nv(s, a) = 0    // for all s, a
  π = policy_iter(S, A, T, R)

The E³ algorithm (continued)

Algorithm: E3_learn_sketch   // con't
Repeat {
  s = get_current_world_state()
  a = π(s)
  (r, s') = act_in_world(a)
  T(s, a, s') = (1 + T(s, a, s') * nv(s, a)) / (nv(s, a) + 1)
      // running average of the empirical transition probabilities; the other
      // T(s, a, s'') entries are scaled by nv(s, a)/(nv(s, a)+1) so that
      // T(s, a, ·) stays a distribution, and R(s, a) is updated as a
      // running average of the observed r (omitted here for brevity)
  nv(s, a)++
  if (nv(s, a) > NVthresh) {
    known(s, a) = 1
    π = policy_iter(S, A, T, R)
  }
} Until (all (s, a) known)
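A sketch of the model-update bookkeeping in Python, using counts rather than the incremental running average (the two give the same empirical estimates); policy_iter, NVthresh, and the rest of E³'s known-state and balanced-wandering machinery from Kearns & Singh are assumed as in the pseudocode and not reproduced here:

```python
from collections import defaultdict

def make_model_tables():
    """Transition counts, reward sums, and visit counts, all initially zero."""
    return defaultdict(int), defaultdict(float), defaultdict(int)

def e3_model_update(trans_counts, reward_sums, nv, s, a, r, s_next):
    """Record one <s, a, r, s'> tuple in the empirical model."""
    trans_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += r
    nv[(s, a)] += 1

def t_hat(trans_counts, nv, s, a, s_next):
    """Empirical T(s, a, s') = count(s, a, s') / count(s, a)."""
    n = nv[(s, a)]
    return trans_counts[(s, a, s_next)] / n if n else 0.0

def r_hat(reward_sums, nv, s, a):
    """Empirical R(s, a) = average observed reward for (s, a)."""
    n = nv[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```

Once nv[(s, a)] exceeds NVthresh, the pair would be marked known and policy iteration re-run on the current empirical T and R, as in the loop above.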