Policies and exploration and eligibility, oh my!

Administrivia
Reminder: R3 due Thurs. Anybody not have a group?
Reminder: final project
Written report: due Thu, May 11, by noon
Oral reports: Apr 27, May 2, May 4
15 people registered ⇒ 5 pres/session ⇒ 15 min/pres
Volunteers?

River of time
Last time:
The Q function
The Q-learning algorithm
Q-learning in action
Today:
Notes on writing & presenting
Action selection & exploration
The off-policy property
Use of experience; eligibility traces
Radioactive breadcrumbs

Final project writing FAQ
Q: How formal a document should this be?
A: Very formal. This should be as close in style to the papers we have read as possible.
Pay attention to the sections that they have -- introduction, background, approach, experiments, etc.
Try to establish a narrative -- “tell the story.”
As always, use correct grammar, spelling, etc.

Final project writing FAQ
Q: How long should the final report be?
A: As long as necessary, but no longer.
A’: I would guess that it would take ~10-15 pages to describe your work well.

Final project writing FAQ
Q: Any particular document format?
A: I prefer:
8.5”x11” paper
1” margins
12pt font
double-spaced (in LaTeX: \renewcommand{\baselinestretch}{1.6})
Stapled!
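
For reference, here is one minimal LaTeX preamble that matches this format; the use of the geometry package and the exact stretch value are illustrative choices, not requirements taken from the slide:

    \documentclass[12pt]{article}
    \usepackage[letterpaper,margin=1in]{geometry} % 8.5"x11" paper, 1" margins
    \renewcommand{\baselinestretch}{1.6}          % roughly double-spaced
    \begin{document}
    % report body goes here
    \end{document}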

Final project writing FAQ
Q: Any other tips?
A: Yes: DON’T BE VAGUE -- be as specific and concrete as possible about what you did/what other people did/etc.

The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Act. space A; Discount γ (0 <= γ < 1); Learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
} Until (bored)
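
As a concrete illustration, here is a minimal tabular Python sketch of this loop. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the ε-greedy choice inside pick_next_action are assumptions made for the example, not details specified on the slide:

    import random
    from collections import defaultdict

    def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
        """Tabular Q-learning sketch. env is assumed to expose reset() and
        step(a) -> (next_state, reward, done); these names are illustrative."""
        Q = defaultdict(float)  # Q[(s, a)], defaults to 0

        def pick_next_action(s):
            # epsilon-greedy: explore with prob epsilon, else act greedily on Q
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = pick_next_action(s)
                s2, r, done = env.step(a)
                # sampled target r + gamma * max_a' Q(s', a'); no future value if terminal
                target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q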

Why does this work?
Still... Why should that weighted avg be the right thing? Compare w/ the Bellman eqn (reconstructed below).
I.e., the update is based on a sample of the true distribution, T, rather than the full expectation that is used in the Bellman eqn/policy iteration alg.
The first time the agent finds a rewarding state, s_r, a fraction of that reward will be propagated back by one step via the Q update to s_{r-1}, a state one step away from s_r.
Next time, the state two away from s_r will be updated, and so on...
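
The equations on the original slide were images and did not survive the transcript; in one standard form (reconstructed here rather than quoted from the slide), the comparison is between the Bellman optimality equation, which takes a full expectation over T,

    Q^*(s, a) \;=\; \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \bigr]

and the Q-learning update, which nudges the estimate toward a single sampled target:

    Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]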

Picking the action
One critical step is underspecified in the Q_learn alg: a = pick_next_action(Q, s)
How should you pick an action at each step?
Could pick greedily according to Q. Might tend to keep doing the same thing and not explore at all; need to force exploration.
Could pick an action at random. Ignores everything you’ve learned about Q so far. Would you still converge?

Off-policy learning
Exploit a critical property of the Q_learn alg:
Lemma (w/o proof): The Q-learning algorithm will converge to the correct Q* independently of the policy being executed, so long as:
Every (s, a) pair is visited infinitely often in the infinite limit
α is chosen to be small enough (usually decayed)
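
The “small enough (usually decayed)” condition is usually stated precisely as the standard stochastic-approximation requirements on the per-pair learning rates (this is the form used in Watkins & Dayan’s convergence proof, not text from the slide):

    \sum_{t} \alpha_t(s, a) = \infty
    \qquad \text{and} \qquad
    \sum_{t} \alpha_t(s, a)^2 < \infty
    \quad \text{for every pair } (s, a).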

Off-policy learning
I.e., Q-learning doesn’t care what policy is being executed -- it will still converge.
Called an off-policy method: the policy being learned can be diff than the policy being executed.
The off-policy property tells us we’re free to pick any policy we like to explore, so long as we guarantee infinite visits to each (s, a) pair.
Might as well choose one that does (mostly) as well as we know how to do at each step.

“Almost greedy” exploring
Can’t be just greedy w.r.t. Q (why?)
Typical answers:
ε-greedy: execute argmax_a {Q(s, a)} w/ prob (1 - ε) and a random action w/ prob ε
Boltzmann exploration: pick action a w/ prob proportional to the exponential of its Q value (the standard form is reconstructed below)
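
The formula on the original slide was an image and did not survive; the usual Boltzmann (softmax) rule, with a temperature parameter τ controlling how greedy the choice is, has the form

    P(a \mid s) \;=\; \frac{\exp\bigl(Q(s, a)/\tau\bigr)}{\sum_{a'} \exp\bigl(Q(s, a')/\tau\bigr)}

(high τ ≈ nearly uniform random, low τ ≈ nearly greedy). A small Python sketch of both strategies, assuming Q is the same (state, action)-keyed table used in the Q-learning sketch above:

    import math
    import random

    def pick_next_action(Q, s, actions, epsilon=0.1, tau=None):
        """epsilon-greedy by default; Boltzmann (softmax) if a temperature tau is given."""
        if tau is None:
            # epsilon-greedy: random action with prob epsilon, else greedy w.r.t. Q
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])
        # Boltzmann exploration: P(a|s) proportional to exp(Q(s,a)/tau)
        weights = [math.exp(Q[(s, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]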

The value of experience
We observed that Q-learning converges slooooooowly... The same is true of many other RL algs.
But we can do better (sometimes by orders of magnitude).
What’re the biggest hurdles to Q convergence? Well, there are many, but the big one is poor use of experience:
Each timestep only changes one Q(s, a) value
Takes many steps to “back up” experience very far

That eligible state
Basic problem: every step, Q only does a one-step backup. It forgot where it was before that -- no sense of the sequence of states/actions that got it where it is now.
Want to have a long-term memory of where the agent has been; update the Q values for all of them.

That eligible state
Want to have a long-term memory of where the agent has been; update the Q values for all of them.
Idea called eligibility traces:
Have a memory cell for each state/action pair
Set the memory when you visit that state/action
Each step, update all eligible states
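
Written out, the trace bookkeeping used later in the SARSA(λ) algorithm is the accumulating-trace form from Sutton & Barto (reconstructed here; the slide itself gives only the verbal description):

    e_t(s, a) \;=\;
    \begin{cases}
      \gamma \lambda \, e_{t-1}(s, a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\
      \gamma \lambda \, e_{t-1}(s, a)     & \text{otherwise}
    \end{cases}

so every pair the agent visited recently stays “eligible” for a share of each new update, with that share fading geometrically.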

Retrenching from Q
Can integrate eligibility traces w/ Q-learning, but it’s a bit of a pain: need to track when the agent is “on policy” or “off policy”, etc.
Good discussion in Sutton & Barto.

Retrenching from Q
We’ll focus on a (slightly) simpler learning alg: SARSA learning.
V. similar to Q-learning, but strictly on-policy: it only learns about the policy it’s actually executing.
E.g., it learns Q^π (the value of the policy being executed) instead of Q*.

The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Act. space A; Discount γ (0 <= γ < 1); Learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
} Until (bored)

SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; Act. space A; Discount γ (0 <= γ < 1); Learning rate α (0 <= α < 1)
Outputs: Q
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  Q(s, a) = Q(s, a) + α*(r + γ*Q(s', a') - Q(s, a))
  a = a'; s = s';
} Until (bored)
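
A matching Python sketch, using the same illustrative environment interface and ε-greedy pick_next_action as the Q-learning sketch above; only the update target changes:

    import random
    from collections import defaultdict

    def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
        """Tabular SARSA sketch; env.reset()/env.step(a) are assumed as before."""
        Q = defaultdict(float)

        def pick_next_action(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s = env.reset()
            a = pick_next_action(s)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = pick_next_action(s2)
                # on-policy target: uses the action actually chosen at s', not max_a'
                target = r if done else r + gamma * Q[(s2, a2)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s2, a2
        return Q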

SARSA vs. Q
SARSA and Q-learning are very similar.
SARSA updates Q(s, a) for the policy it’s actually executing: it lets the pick_next_action() function pick the action used in the update.
Q-learning updates Q(s, a) for the greedy policy w.r.t. the current Q: it uses max_a' to pick the action used in the update, which might be diff than the action it actually executes at s'.
In practice: Q-learning will learn the “true” π*, but SARSA will learn about what it’s actually doing.
Exploration can get Q-learning in trouble...

Getting Q in trouble... “Cliff walking” example (Sutton & Barto, Sec 6.5)

Radioactive breadcrumbs
Can now define eligibility traces for SARSA.
In addition to the Q(s, a) table, keep an e(s, a) table recording the “eligibility” (a real number) of each state/action pair.
At every step (i.e., each (s, a, r, s', a') tuple):
Increment e(s, a) for the current (s, a) pair by 1
Update all Q(s'', a'') vals in proportion to their e(s'', a'')
Decay all e(s'', a'') by a factor of γλ
Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL.

SARSA(λ)-learning alg.
Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0 <= γ < 1), α (0 <= α < 1), λ (0 <= λ < 1)
Outputs: Q
e(s, a) = 0  // for all s, a
s = get_curr_world_st(); a = pick_nxt_act(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  δ = r + γ*Q(s', a') - Q(s, a)
  e(s, a) += 1
  foreach (s'', a'') pair in (S X A) {
    Q(s'', a'') = Q(s'', a'') + α*e(s'', a'')*δ
    e(s'', a'') *= γ*λ
  }
  a = a'; s = s';
} Until (bored)
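
A runnable Python sketch of this “breadcrumbs” loop, with the same illustrative environment interface as the earlier sketches. Two details are implementation choices rather than slide content: traces are reset at the start of each episode, and the inner update loops only over pairs whose trace is nonzero (equivalent to looping over all of S X A, since the other updates are zero):

    import random
    from collections import defaultdict

    def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.9,
                           epsilon=0.1, episodes=500):
        """Tabular SARSA(lambda) sketch with accumulating eligibility traces."""
        Q = defaultdict(float)

        def pick_next_action(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            e = defaultdict(float)      # eligibility traces for this episode
            s = env.reset()
            a = pick_next_action(s)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = pick_next_action(s2)
                # TD error for the transition just experienced
                delta = (r - Q[(s, a)]) if done else (r + gamma * Q[(s2, a2)] - Q[(s, a)])
                e[(s, a)] += 1.0        # drop a breadcrumb on the current pair
                for key in list(e):     # update every eligible pair, then decay its trace
                    Q[key] += alpha * e[key] * delta
                    e[key] *= gamma * lam
                s, a = s2, a2
        return Q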