R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction
Chapter 2: Evaluative Feedback (presentation transcript)

Slide 1: Chapter 2: Evaluative Feedback
- Evaluating actions vs. instructing by giving correct actions.
- Pure evaluative feedback depends totally on the action taken; pure instructive feedback depends not at all on the action taken.
- Supervised learning is instructive; optimization is evaluative.
- Associative vs. nonassociative:
  - Associative: inputs are mapped to outputs; learn the best output for each input.
  - Nonassociative: "learn" (find) one best output.
- The n-armed bandit (at least as we treat it here) is nonassociative and gives evaluative feedback.

Slide 2: The n-Armed Bandit Problem
- Choose repeatedly from one of n actions; each choice is called a play.
- After each play a_t you receive a reward r_t, where
    E[r_t | a_t] = Q*(a_t).
  These Q*(a) are the unknown action values; the distribution of r_t depends only on a_t.
- The objective is to maximize the reward in the long term, e.g., over 1000 plays.
- To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.
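As a concrete illustration of this setup (not part of the slides), here is a minimal Python sketch of a stationary n-armed bandit; the class name `Bandit` and its parameters are assumptions made for this example.

```python
import numpy as np

class Bandit:
    """A stationary n-armed bandit: each arm pays a noisy reward around its true value Q*(a)."""

    def __init__(self, n_arms=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values Q*(a), unknown to the learner.
        self.q_star = self.rng.normal(0.0, 1.0, size=n_arms)

    def play(self, action):
        # The reward distribution depends only on the chosen action.
        return self.rng.normal(self.q_star[action], 1.0)

bandit = Bandit()
print(bandit.play(3))   # one noisy reward from arm 3
```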

Slide 3: The Exploration/Exploitation Dilemma
- Suppose you form action-value estimates Q_t(a) ≈ Q*(a).
- The greedy action at time t is
    a_t* = argmax_a Q_t(a).
  Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.
- You can't exploit all the time; you can't explore all the time.
- You can never stop exploring, but you should always reduce exploring.

Slide 4: Action-Value Methods
- Methods that adapt action-value estimates and nothing else.
- For example, suppose that by the t-th play action a has been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}. The "sample average" estimate is then
    Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a.
- Here Q*(a) is the true value of action a and Q_t(a) is its estimate at time t; as k_a grows, Q_t(a) approaches Q*(a).
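A rough sketch of how the sample-average estimates might be maintained in code (variable names such as `counts` and `reward_sums` are illustrative, not from the slides):

```python
import numpy as np

n_arms = 10
counts = np.zeros(n_arms, dtype=int)   # k_a: how many times each action was chosen
reward_sums = np.zeros(n_arms)         # running sum of rewards per action
Q = np.zeros(n_arms)                   # Q_t(a): sample-average estimates

def update_sample_average(action, reward):
    """Recompute Q_t(a) = (r_1 + ... + r_{k_a}) / k_a after one more play of `action`."""
    counts[action] += 1
    reward_sums[action] += reward
    Q[action] = reward_sums[action] / counts[action]

update_sample_average(2, 1.5)
update_sample_average(2, 0.5)
print(Q[2])   # 1.0, the average of the two observed rewards
```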

Slide 5: ε-Greedy Action Selection
- Greedy action selection:
    a_t = a_t* = argmax_a Q_t(a).
- ε-greedy: with probability 1 - ε choose the greedy action a_t*; with probability ε choose an action at random (uniformly).
- This is the simplest way to try to balance exploration and exploitation.
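A minimal sketch of ε-greedy selection, assuming the estimates are stored in a NumPy array `Q` (the function name and defaults are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise pick a greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))     # explore
    return int(np.argmax(Q))                 # exploit (ties broken by first index here)

Q = np.array([0.2, 1.1, -0.4])
print(epsilon_greedy(Q, epsilon=0.1))        # usually 1, occasionally a random arm
```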

Slide 6: The 10-Armed Testbed
- n = 10 possible actions.
- Each true value Q*(a) is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1).
- Each reward is also normal: r_t ~ N(Q*(a_t), 1).
- 1000 plays per run.
- Repeat the whole thing for 2000 runs and average the results.
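Putting the pieces together, a hedged sketch of the testbed experiment might look like the following; the function name and run counts are parameters of this example, and the incremental-average update used inside the loop is explained on a later slide.

```python
import numpy as np

def run_testbed(epsilon, n_runs=2000, n_plays=1000, n_arms=10, seed=0):
    """Average reward per play for an epsilon-greedy learner on the 10-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_plays)
    for _ in range(n_runs):
        q_star = rng.normal(0.0, 1.0, n_arms)      # Q*(a) ~ N(0, 1)
        Q = np.zeros(n_arms)
        counts = np.zeros(n_arms)
        for t in range(n_plays):
            if rng.random() < epsilon:
                a = int(rng.integers(n_arms))
            else:
                a = int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)         # r_t ~ N(Q*(a_t), 1)
            counts[a] += 1
            Q[a] += (r - Q[a]) / counts[a]         # incremental sample average
            avg_reward[t] += r
    return avg_reward / n_runs

# Quick check with fewer runs (2000 runs is slow in pure Python):
curve = run_testbed(epsilon=0.1, n_runs=200)
print(curve[-1])    # average reward near the end of learning
```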

Slide 7: ε-Greedy Methods on the 10-Armed Testbed
(figure: average reward and % optimal action over the 1000 plays, comparing greedy and ε-greedy methods)

Slide 8: Softmax Action Selection
- Softmax action selection methods grade action probabilities by their estimated values.
- The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability
    e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},
  where τ is the "computational temperature".
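A small sketch of softmax (Gibbs/Boltzmann) selection, assuming estimates in an array `Q` and temperature `tau`; subtracting the maximum preference is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(Q, tau=0.5):
    """Sample an action with probability proportional to exp(Q_t(a) / tau)."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                 # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

action, probs = softmax_action([0.2, 1.1, -0.4], tau=0.5)
print(action, probs)                     # higher-valued arms get higher probability
```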

Slide 9: Binary Bandit Tasks
- Suppose you have just two actions (a_t = 1 or a_t = 2) and just two rewards (r_t = success or r_t = failure).
- Then you might infer a target or desired action d_t: the action just taken if it produced success, the other action if it produced failure.
- Then always play the action that has most often been the target. Call this the supervised algorithm.
- It works fine on deterministic tasks...
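One possible reading of the supervised algorithm in code, for a two-action binary bandit; the tie-breaking rule and the names used here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
target_counts = np.zeros(2)   # how often each action has been the inferred target

def supervised_play(success_probs):
    """One play of the supervised algorithm on a two-action binary bandit."""
    if target_counts[0] == target_counts[1]:
        a = int(rng.integers(2))                  # break ties randomly
    else:
        a = int(np.argmax(target_counts))         # play the most frequent target
    success = rng.random() < success_probs[a]
    target = a if success else 1 - a              # infer the desired action
    target_counts[target] += 1
    return a, success

for _ in range(100):
    supervised_play([0.1, 0.2])
print(target_counts)   # on this low-success task the inferred target keeps flip-flopping
```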

Slide 10: Contingency Space
- The space of all possible binary bandit tasks is described by the success probabilities of the two actions.
- If the probability of success is 0.1 for action 1 and 0.2 for action 2, then the supervised algorithm will fail: in almost all cases a play ends in failure, so the algorithm switches to the other action.

Slide 11: Linear Learning Automata
- Let π_t(a) denote the probability of selecting action a at play t.
- L_{R-I} (linear, reward-inaction): on success, move π_t(a_t) toward 1:
    π_{t+1}(a_t) = π_t(a_t) + α [1 - π_t(a_t)];
  on failure, make no change.
- L_{R-P} (linear, reward-penalty): same as L_{R-I} on success; on failure, move π_t(a_t) toward 0:
    π_{t+1}(a_t) = π_t(a_t) + α [0 - π_t(a_t)].
- In both cases the probability of the other action is adjusted so the probabilities still sum to 1.
- Both represent a stochastic, incremental version of the supervised algorithm.
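A tentative sketch of these updates for a two-action automaton; the exact form of the L_{R-P} and L_{R-I} rules above is reconstructed from the surrounding text, so treat this as illustrative rather than as the slides' own algorithm.

```python
def automaton_update(pi0, action, success, alpha=0.1, scheme="L_RP"):
    """One update of a two-action linear learning automaton.

    pi0 is the probability of selecting action 0 (action 1 has probability 1 - pi0).
    """
    if success:
        target = 1.0 if action == 0 else 0.0    # move pi0 toward the successful action
    elif scheme == "L_RP":
        target = 0.0 if action == 0 else 1.0    # reward-penalty: move away on failure
    else:
        return pi0                              # reward-inaction: no change on failure
    return pi0 + alpha * (target - pi0)

pi0 = 0.5
pi0 = automaton_update(pi0, action=0, success=True)    # pi0 rises toward 1
pi0 = automaton_update(pi0, action=0, success=False)   # L_RP: pi0 falls back toward 0
print(pi0)
```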

Slide 12: Performance on Binary Bandit Tasks A and B
(figure: performance of the preceding methods on two example binary bandit tasks, A and B)

Slide 13: Incremental Implementation
- Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on the action a):
    Q_k = (r_1 + r_2 + ... + r_k) / k.
- Can we do this incrementally, without storing all the rewards? We could keep a running sum and count, or, equivalently:
    Q_{k+1} = Q_k + (1 / (k + 1)) [r_{k+1} - Q_k].
- This is a common form for update rules:
    NewEstimate = OldEstimate + StepSize [Target - OldEstimate].
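The incremental form is easy to verify in code; this small example (not from the slides) checks that it reproduces the plain average.

```python
def incremental_average(Q_k, k, reward):
    """Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k), without storing past rewards."""
    return Q_k + (reward - Q_k) / (k + 1)

rewards = [1.0, 0.0, 2.0, 1.0]
Q, k = 0.0, 0
for r in rewards:
    Q = incremental_average(Q, k, r)
    k += 1
print(Q, sum(rewards) / len(rewards))   # both are 1.0: the incremental form matches
```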

Slide 14: Tracking a Nonstationary Problem
- Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.
- Better in the nonstationary case is a constant step size, 0 < α ≤ 1:
    Q_{k+1} = Q_k + α [r_{k+1} - Q_k],
  which gives
    Q_k = (1 - α)^k Q_0 + Σ_{i=1}^{k} α (1 - α)^{k-i} r_i,
  an exponential, recency-weighted average.
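A minimal sketch of the constant-step-size update; the rewards and α used below are arbitrary example values.

```python
def constant_step_update(Q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (r - Q)."""
    return Q + alpha * (reward - Q)

Q = 0.0
for r in [1.0, 1.0, 1.0, 5.0]:          # the reward level jumps at the end
    Q = constant_step_update(Q, r, alpha=0.5)
print(Q)                                 # recent rewards dominate the estimate
```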

Slide 15: Optimistic Initial Values
- All methods so far depend on the initial estimates Q_0(a), i.e., they are biased.
- Suppose instead we initialize the action values optimistically, e.g., on the 10-armed testbed use Q_0(a) = +5 for all a.
- Since Q_0(a) is larger than the mean reward of any arm, every action initially looks disappointing, which encourages exploration.
- This approach may not work well in nonstationary problems, since its drive to explore dies out after the initial exploration.
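A sketch of optimistic initialization with purely greedy selection on a testbed-like problem; the value +5 follows the slide, everything else here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, 10)     # hidden true action values
Q = np.full(10, 5.0)                  # optimistic initial estimates, Q_0(a) = +5
counts = np.zeros(10)

for t in range(1000):
    a = int(np.argmax(Q))                      # purely greedy selection
    r = rng.normal(q_star[a], 1.0)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]             # each arm "disappoints", forcing rotation

print(int(np.argmax(q_star)), int(np.argmax(counts)))   # the best arm is usually played most
```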

Slide 16: Reinforcement Comparison
- Compare rewards to a reference reward r̄_t, e.g., an average of observed rewards.
- Strengthen or weaken the action taken depending on r_t - r̄_t.
- Let p_t(a) denote the preference for action a.
- Preferences determine action probabilities, e.g., by a Gibbs distribution:
    π_t(a) = e^{p_t(a)} / Σ_b e^{p_t(b)}.
- Then update the preference of the action taken and the reference reward:
    p_{t+1}(a_t) = p_t(a_t) + β [r_t - r̄_t],
    r̄_{t+1} = r̄_t + α [r_t - r̄_t].
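A possible implementation sketch of reinforcement comparison, assuming the update rules reconstructed above; the step sizes α and β and the testbed setup are example choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 10
q_star = rng.normal(0.0, 1.0, n_arms)   # hidden true values
p = np.zeros(n_arms)                     # action preferences p_t(a)
r_bar = 0.0                              # reference reward

alpha, beta = 0.1, 0.1
for t in range(1000):
    probs = np.exp(p - p.max())
    probs /= probs.sum()                          # Gibbs / softmax over preferences
    a = int(rng.choice(n_arms, p=probs))
    r = rng.normal(q_star[a], 1.0)
    p[a] += beta * (r - r_bar)                    # strengthen or weaken the action taken
    r_bar += alpha * (r - r_bar)                  # track the reference reward

print(int(np.argmax(p)), int(np.argmax(q_star)))  # preferences usually single out the best arm
```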

Slide 17: Performance of a Reinforcement Comparison Method
(figure: performance of a reinforcement comparison method on the 10-armed testbed)

Slide 18: Pursuit Methods
- Maintain both action-value estimates and action preferences (selection probabilities π_t(a)).
- Always "pursue" the greedy action, i.e., make the greedy action more likely to be selected.
- After the t-th play, update the action values to get Q_{t+1}(a).
- The new greedy action is
    a*_{t+1} = argmax_a Q_{t+1}(a).
- Then increment the probability of selecting the greedy action:
    π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [1 - π_t(a*_{t+1})],
  and decrement the probabilities of the other actions to maintain a sum of 1.
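A sketch of a pursuit method under the same testbed assumptions as before; β = 0.01 is an example value, and sample averages are used for the value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 10
q_star = rng.normal(0.0, 1.0, n_arms)   # hidden true values
Q = np.zeros(n_arms)                     # action-value estimates
counts = np.zeros(n_arms)
pi = np.full(n_arms, 1.0 / n_arms)       # action selection probabilities

beta = 0.01
for t in range(1000):
    a = int(rng.choice(n_arms, p=pi))
    r = rng.normal(q_star[a], 1.0)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]        # update the action-value estimates
    greedy = int(np.argmax(Q))            # the new greedy action
    pi *= (1.0 - beta)                    # decrement all probabilities...
    pi[greedy] += beta                    # ...and push the greedy one toward 1 (sum stays 1)

print(int(np.argmax(pi)), int(np.argmax(q_star)))
```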

Slide 19: Performance of a Pursuit Method
(figure: performance of a pursuit method on the 10-armed testbed)

Slide 20: Associative Search
- Imagine switching bandits at each play (figure: several bandits, each announced by a clue signal).
- Associate the policy with the clue signal (red, yellow, green): learn the best action for each signal.
- Associative search is similar to the n-armed bandit methods in that it learns from immediate reward, but also similar to the full reinforcement learning problem in that it learns a policy.

Slide 21: Conclusions
- These are all very simple methods, but they are complicated enough; we will build on them.
- Ideas for improvements:
  - estimating uncertainties... interval estimation
  - approximating Bayes-optimal solutions
  - Gittins indices (a specialized algorithm)
- The full RL problem offers some ideas for solution...