Chapter 2: Evaluative Feedback

Presentation transcript:

Chapter 2: Evaluative Feedback

Evaluating actions vs. instructing by giving correct actions:
Pure evaluative feedback depends entirely on the action taken; pure instructive feedback does not depend on the action taken at all.
Supervised learning is instructive; optimization is evaluative.

Associative vs. nonassociative:
Associative: inputs are mapped to outputs; learn the best output for each input.
Nonassociative: "learn" (find) one best output.

The n-armed bandit (at least as we treat it here) is nonassociative, with evaluative feedback.

(Slides after R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.)

The n-Armed Bandit Problem

Choose repeatedly from one of $n$ actions; each choice is called a play.
After each play $a_t$ you get a reward $r_t$, where $E[r_t \mid a_t] = Q^*(a_t)$.
These $Q^*(a)$ are the unknown action values; the distribution of $r_t$ depends only on $a_t$.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.
To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.
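
For concreteness, here is a minimal Python sketch (not from the slides) of such a bandit; the class name Bandit and the Gaussian reward model anticipate the 10-armed testbed described later.

```python
import numpy as np

class Bandit:
    """A stationary n-armed bandit with Gaussian rewards (illustrative sketch)."""

    def __init__(self, n=10, rng=None):
        self.rng = rng or np.random.default_rng(0)
        # True action values Q*(a), unknown to the learner.
        self.q_star = self.rng.normal(0.0, 1.0, size=n)

    def play(self, a):
        # The reward distribution depends only on the chosen action a.
        return self.rng.normal(self.q_star[a], 1.0)
```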

The Exploration/Exploitation Dilemma

Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$.
The greedy action at $t$ is $a_t^* = \arg\max_a Q_t(a)$; choosing $a_t = a_t^*$ is exploitation, choosing $a_t \ne a_t^*$ is exploration.
You can't exploit all the time; you can't explore all the time.
You can never stop exploring, but you should always reduce exploring.

Action-Value Methods

Methods that adapt action-value estimates and nothing else. For example, suppose that by the $t$-th play action $a$ had been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$; then the "sample average" estimate is
$Q_t(a) = \dfrac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}.$
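
A minimal sketch of the sample-average estimate in Python; the names rewards_per_action, record, and sample_average are illustrative, not from the slides.

```python
import numpy as np

n = 10
rewards_per_action = [[] for _ in range(n)]   # all rewards observed for each action

def record(a, r):
    """Store the reward received after choosing action a."""
    rewards_per_action[a].append(r)

def sample_average(a):
    """Q_t(a): mean of the rewards received so far for action a (0 if never chosen)."""
    rs = rewards_per_action[a]
    return float(np.mean(rs)) if rs else 0.0
```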

$\varepsilon$-Greedy Action Selection

Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$.
$\varepsilon$-greedy: with probability $1 - \varepsilon$ choose the greedy action $a_t^*$; with probability $\varepsilon$ choose an action at random.
... the simplest way to try to balance exploration and exploitation.
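
A sketch of $\varepsilon$-greedy selection, assuming a NumPy array q_est of current estimates and a NumPy random Generator; ties in the argmax are broken by lowest index here for simplicity.

```python
import numpy as np

def epsilon_greedy(q_est, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_est)))   # explore
    return int(np.argmax(q_est))               # exploit (ties -> lowest index)
```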

10-Armed Testbed

$n = 10$ possible actions.
Each $Q^*(a)$ is chosen randomly from a normal distribution: $N(0, 1)$.
Each reward $r_t$ is also normal: $N(Q^*(a_t), 1)$.
1000 plays per run; repeat the whole thing 2000 times and average the results.
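
A hedged sketch of a single testbed run that combines the pieces above; the averaging over 2000 randomly generated tasks is noted but omitted for brevity.

```python
import numpy as np

def run_testbed_once(epsilon=0.1, n=10, plays=1000, seed=0):
    """One bandit task: Q*(a) ~ N(0,1), rewards ~ N(Q*(a), 1), epsilon-greedy play."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=n)      # true action values for this task
    q_est = np.zeros(n)                        # sample-average estimates
    counts = np.zeros(n, dtype=int)
    rewards = np.empty(plays)
    for t in range(plays):
        if rng.random() < epsilon:
            a = int(rng.integers(n))
        else:
            a = int(np.argmax(q_est))
        r = rng.normal(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        rewards[t] = r
    return rewards

# The performance curves in the next slide average such runs over 2000 tasks.
```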

$\varepsilon$-Greedy Methods on the 10-Armed Testbed (figure)

Softmax Action Selection

Softmax action selection methods grade action probabilities by estimated values.
The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action $a$ on play $t$ with probability
$\dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}},$
where $\tau$ is the "computational temperature".
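
A sketch of Boltzmann selection; subtracting the maximum preference before exponentiating is a standard numerical-stability trick, not something the slide specifies.

```python
import numpy as np

def softmax_action(q_est, tau, rng):
    """Gibbs/Boltzmann selection: P(a) proportional to exp(Q_t(a)/tau)."""
    prefs = np.asarray(q_est, dtype=float) / tau
    prefs -= prefs.max()            # numerical stability; does not change the probabilities
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```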

Binary Bandit Tasks

Suppose you have just two actions, $a_t = 1$ or $a_t = 2$, and just two rewards, $r_t =$ success or $r_t =$ failure.
Then you might infer a target or desired action $d_t$: the action taken if it produced success, the other action if it produced failure.
Then always play the action that was most often the target. Call this the supervised algorithm.
It works fine on deterministic tasks…

Contingency Space

The space of all possible binary bandit tasks (figure).

Linear Learning Automata

For two actions, a stochastic, incremental version of the supervised algorithm: the action probabilities themselves are updated, moving the probability of the inferred target action a fraction of the way toward 1 on each play.
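
The transcript does not preserve the slide's update rule, so the sketch below shows one standard linear reward-penalty style update for two actions as an assumed stand-in; the function name and step size alpha are illustrative.

```python
import numpy as np

def linear_automaton_update(pi, target, alpha=0.1):
    """Move the probability of the inferred target action a step toward 1.

    pi:     length-2 NumPy array of action probabilities (sums to 1).
    target: index (0 or 1) of the action inferred to be desired on this play.
    """
    pi = pi.copy()
    pi[target] += alpha * (1.0 - pi[target])
    pi[1 - target] = 1.0 - pi[target]   # keep the two probabilities summing to 1
    return pi
```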

Performance on Binary Bandit Tasks A and B (figure)

Incremental Implementation

Recall the sample-average estimation method. The average of the first $k$ rewards is (dropping the dependence on $a$):
$Q_k = \dfrac{r_1 + r_2 + \cdots + r_k}{k}.$
Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
$Q_{k+1} = Q_k + \dfrac{1}{k+1}\left[r_{k+1} - Q_k\right].$
This is a common form for update rules:
NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
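
A direct transcription of the update rule as a sketch; incremental_update is an illustrative name.

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: after the (k+1)-th reward for an action,
#   Q_{k+1} = incremental_update(Q_k, r_{k+1}, 1.0 / (k + 1))
# equals the mean of all k+1 rewards without storing any of them.
```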

Tracking a Nonstationary Problem

Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time.
But not in a nonstationary problem. Better in the nonstationary case is a constant step size $\alpha$, $0 < \alpha \le 1$:
$Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right] = (1-\alpha)^{k+1} Q_0 + \sum_{i=1}^{k+1} \alpha (1-\alpha)^{k+1-i} r_i,$
an exponential, recency-weighted average.
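
A sketch contrasting the two estimators on a single drifting action value; the random-walk drift and its standard deviation of 0.01 are illustrative assumptions, not from the slides.

```python
import numpy as np

def track_nonstationary(plays=10_000, alpha=0.1, seed=0):
    """Constant step size alpha gives an exponential, recency-weighted average."""
    rng = np.random.default_rng(seed)
    q_true = 0.0          # true value of a single action, drifting over time
    q_const = 0.0         # constant-alpha estimate
    q_avg, k = 0.0, 0     # sample-average estimate
    for _ in range(plays):
        q_true += rng.normal(0.0, 0.01)      # random-walk nonstationarity
        r = rng.normal(q_true, 1.0)
        k += 1
        q_avg += (r - q_avg) / k             # weights all rewards equally
        q_const += alpha * (r - q_const)     # weights recent rewards more heavily
    return q_true, q_avg, q_const
```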

Optimistic Initial Values

All methods so far depend on the initial estimates $Q_0(a)$, i.e., they are biased.
Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$.
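
A sketch of purely greedy play with optimistic initialization; the constant step size alpha = 0.1 is an assumption (it matches the constant-step setting above), and q0 = 5 mirrors the optimistic value mentioned for the testbed.

```python
import numpy as np

def greedy_with_optimistic_init(q_star, plays=1000, q0=5.0, alpha=0.1, seed=0):
    """Purely greedy play still explores early because every Q_0(a) is optimistic."""
    rng = np.random.default_rng(seed)
    q_est = np.full(len(q_star), q0)          # optimistic initial estimates
    for _ in range(plays):
        a = int(np.argmax(q_est))             # greedy: no epsilon used
        r = rng.normal(q_star[a], 1.0)
        q_est[a] += alpha * (r - q_est[a])    # constant-step update (assumed)
        # Each reward is "disappointing" relative to q0, so the learner keeps switching.
    return q_est
```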

Reinforcement Comparison

Compare rewards to a reference reward, $\bar{r}_t$, e.g., an average of observed rewards.
Strengthen or weaken the action taken depending on $r_t - \bar{r}_t$.
Let $p_t(a)$ denote the preference for action $a$.
Preferences determine action probabilities, e.g., by a Gibbs distribution:
$\pi_t(a) = \dfrac{e^{p_t(a)}}{\sum_b e^{p_t(b)}}.$
Then:
$p_{t+1}(a_t) = p_t(a_t) + \beta\left[r_t - \bar{r}_t\right]$ and $\bar{r}_{t+1} = \bar{r}_t + \alpha\left[r_t - \bar{r}_t\right].$
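
A sketch of one reinforcement-comparison update; the function signature and the default alpha and beta values are illustrative.

```python
import numpy as np

def reinforcement_comparison_step(p, r_bar, a, r, alpha=0.1, beta=0.1):
    """One play of a reinforcement-comparison learner (illustrative sketch).

    p:     array of action preferences p_t(a)
    r_bar: reference reward (running average of observed rewards)
    a, r:  the action just taken and the reward received
    """
    p = p.copy()
    p[a] += beta * (r - r_bar)        # strengthen/weaken the action taken
    r_bar += alpha * (r - r_bar)      # update the reference reward
    probs = np.exp(p - p.max())       # Gibbs distribution over actions
    probs /= probs.sum()
    return p, r_bar, probs
```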

Performance of a Reinforcement Comparison Method (figure)

Pursuit Methods

Maintain both action-value estimates and action preferences.
Always "pursue" the greedy action, i.e., make the greedy action more likely to be selected.
After the $t$-th play, update the action values to get $Q_{t+1}$. The new greedy action is $a_{t+1}^* = \arg\max_a Q_{t+1}(a)$. Then:
$\pi_{t+1}(a_{t+1}^*) = \pi_t(a_{t+1}^*) + \beta\left[1 - \pi_t(a_{t+1}^*)\right],$
and the probabilities of the other actions are decremented to maintain a sum of 1:
$\pi_{t+1}(a) = \pi_t(a) + \beta\left[0 - \pi_t(a)\right] \quad \text{for } a \ne a_{t+1}^*.$
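
A sketch of the pursuit update; note that the increment to the greedy action and the decrements to the others cancel, so the probabilities continue to sum to 1.

```python
import numpy as np

def pursuit_step(pi, q_est, beta=0.01):
    """Move the action probabilities a fraction beta toward the greedy action."""
    greedy = int(np.argmax(q_est))             # new greedy action a*_{t+1}
    pi = pi.copy()
    pi[greedy] += beta * (1.0 - pi[greedy])    # pursue the greedy action
    for a in range(len(pi)):
        if a != greedy:
            pi[a] += beta * (0.0 - pi[a])      # decrement the others; sum stays 1
    return pi
```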

Performance of a Pursuit Method (figure)

Associative Search

Imagine switching bandits at each play: the learner must associate each situation (which bandit it currently faces) with the action that is best in that situation. (Figure: an example bandit with 3 actions.)
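
One way to realize this, sketched below, is to keep a separate table of estimates per situation (a simple contextual bandit); the class name and the use of $\varepsilon$-greedy selection are illustrative assumptions, not from the slides.

```python
import numpy as np

class AssociativeBandit:
    """Separate action-value estimates for each situation (which bandit we face)."""

    def __init__(self, n_situations, n_actions, epsilon=0.1, seed=0):
        self.q = np.zeros((n_situations, n_actions))
        self.counts = np.zeros((n_situations, n_actions), dtype=int)
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, situation):
        # Epsilon-greedy within the row of estimates for the current situation.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[situation]))

    def learn(self, situation, action, reward):
        # Sample-average update for the (situation, action) pair.
        self.counts[situation, action] += 1
        self.q[situation, action] += (
            reward - self.q[situation, action]
        ) / self.counts[situation, action]
```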

Conclusions

These are all very simple methods, but they are complicated enough; we will build on them.
Ideas for improvements: estimating uncertainties (interval estimation), approximating Bayes-optimal solutions, Gittins indices.
The full RL problem offers some ideas for solution...