ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 2: Evaluative Feedback (Exploration vs. Exploitation)
Dr. Itamar Arel
College of Engineering, Electrical Engineering & Computer Science Department
The University of Tennessee
Fall 2009, August 24, 2009

Changes to the Schedule
- Please note the changes to the schedule (on the course webpage)
- Key: the mid-term has been moved to Oct. 5 (one week earlier)

Outline
- Recap
- What is evaluative feedback?
- n-Armed Bandit Problem (test case)
- Action-Value Methods
- Softmax Action Selection
- Incremental Implementation
- Tracking nonstationary rewards
- Optimistic Initial Values
- Reinforcement Comparison
- Pursuit methods
- Associative search

Recap
- RL revolves around learning from experience by interacting with the environment
- An unsupervised learning discipline
- Trial-and-error based
- Delayed reward is the main concept (value functions, etc.)
- A policy maps from situations to actions
- Exploitation vs. exploration is the key challenge
- We looked at the Tic-Tac-Toe example, where: V(s) ← V(s) + α[V(s') - V(s)]

What is Evaluative Feedback?
- RL uses training information that evaluates the actions taken, rather than instructing by giving the correct actions
  - Necessitates trial-and-error search for good behavior
  - Creates the need for active exploration
- Pure evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible
- Pure instructive feedback, on the other hand, indicates the correct action to take, independent of the action actually taken
  - Corresponds to supervised learning (e.g. artificial neural networks)

n-Armed Bandit Problem
- Let's look at a simple version of the n-armed bandit problem
  - A first step toward understanding the full RL problem
- Problem description:
  - An agent is repeatedly faced with choosing one of n actions
  - After each step a reward is provided, drawn from a stationary probability distribution that depends on the action selected
  - The agent's objective is to maximize the expected total reward over time
  - Each action selection is called a play or iteration
- An extension of the classic slot machine ("one-armed bandit")

n-Armed Bandit Problem (cont.)
- Each action has a value: an expected or mean reward given that the action is selected
  - If the agent knew the value of each action, the problem would be trivial
- The agent maintains estimates of the values and chooses the action with the highest estimate
  - This is the greedy algorithm
  - Directly associated with (policy) exploitation
- If the agent chooses non-greedily, we say it explores
  - Under uncertainty the agent must explore
  - A balance must be found between exploration and exploitation
- Initial condition: all levers are assumed to yield reward = 0
- We'll see several simple balancing methods and show that they work much better than methods that always exploit
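As an illustration of the setup above (not from the original slides), here is a minimal Python sketch of a stationary n-armed bandit; the class name NArmedBandit and its interface are assumptions made purely for the sketch.

import random

class NArmedBandit:
    """A stationary n-armed bandit testbed (illustrative sketch).

    Each arm a has a true mean reward q_star[a] drawn from N(0, 1);
    pulling an arm returns that mean plus N(0, 1) noise, matching the
    reward model described on the following slides.
    """
    def __init__(self, n=10):
        self.q_star = [random.gauss(0.0, 1.0) for _ in range(n)]

    def pull(self, action):
        # The reward is drawn from a stationary distribution that
        # depends only on the selected action.
        return random.gauss(self.q_star[action], 1.0)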

Action-Value Methods
- We'll look at simple methods for estimating the values of actions
- Let Q*(a) denote the true (actual) value of action a, and Q_t(a) its estimate at time t
  - The true value equals the mean reward for that action
- Assume that by iteration (play) t, action a has been taken k_a times, yielding rewards r_1, r_2, ..., r_{k_a}; then we may use the sample average
  Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a
- The greedy policy selects the action with the highest sample average, i.e. a_t* = argmax_a Q_t(a)

Action-Value Methods (cont.)
- A simple alternative is to behave greedily most of the time but, every once in a while, say with small probability ε, select an action at random instead
  - This is called an ε-greedy method
- We simulate the 10-armed bandit problem, where:
  - r_a ~ N(Q*(a), 1)  (noisy readings of the rewards)
  - Q*(a) ~ N(0, 1)  (actual, true mean reward for action a)
- We further assume there are 2000 machines (tasks), each with 10 levers
  - The reward distributions are drawn independently for each machine
- On each iteration, choose a lever on each machine and average the reward over all 2000 machines
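Continuing the illustrative sketch (the NArmedBandit class and the function below are assumptions, not material from the slides), an ε-greedy agent with sample-average estimates might look as follows. Note that it stores the full reward history per arm, which is exactly the memory cost the Incremental Implementation slide removes.

import random

def epsilon_greedy_run(bandit, epsilon=0.1, plays=1000):
    """Run one epsilon-greedy agent with sample-average estimates (sketch)."""
    n = len(bandit.q_star)
    Q = [0.0] * n                       # estimates; all levers start at 0
    history = [[] for _ in range(n)]    # observed rewards per arm
    rewards = []
    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)                 # explore uniformly
        else:
            a = max(range(n), key=lambda i: Q[i])   # exploit (greedy)
        r = bandit.pull(a)
        history[a].append(r)
        Q[a] = sum(history[a]) / len(history[a])    # sample average
        rewards.append(r)
    return rewards

Averaging the returned reward sequences over 2000 independently drawn bandits reproduces the testbed experiment described above.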

Action-Value Methods (cont.)
(figure not reproduced in the transcript)

Side note: the optimal average reward
(figure not reproduced in the transcript)

Action-Value Methods (cont.)
- The advantage of ε-greedy methods depends on the task
  - If the rewards have high variance, ε-greedy has a stronger advantage
  - If the rewards had zero variance, the greedy algorithm would suffice
  - If the problem were nonstationary (the true reward values changing slowly over time), ε-greedy would be a must
- Q: Perhaps better methods exist?

Softmax Action Selection
- So far we assumed that while exploring (using ε-greedy) we choose equally among the alternatives
  - This means we could choose a really bad action, as opposed (for example) to choosing the next-best action
- The obvious solution is to rank the alternatives:
  - Generate a probability density/mass function over the actions from their estimated rewards
  - All actions are ranked/weighted
  - Typically the Boltzmann distribution is used, i.e. choose action a on iteration t with probability
    P_t(a) = exp(Q_t(a)/τ) / Σ_b exp(Q_t(b)/τ)
    where τ is a temperature parameter
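A minimal sketch of Boltzmann (softmax) selection under the formula above; the function name and the default temperature value are assumptions for illustration.

import math
import random

def softmax_action(Q, tau=0.5):
    """Boltzmann (softmax) action selection (illustrative sketch).

    Q is a list of current action-value estimates; tau is the temperature:
    a high tau gives a nearly uniform choice, a low tau is nearly greedy.
    """
    # Subtract the max for numerical stability before exponentiating.
    m = max(Q)
    prefs = [math.exp((q - m) / tau) for q in Q]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(Q)), weights=probs, k=1)[0]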

Softmax Action Selection (cont.)
(slide content not reproduced in the transcript)

Incremental Implementation
- Sample-average methods require linearly increasing memory (storage of the full reward history)
- We need a more memory-efficient method: the sample average can be computed incrementally, using only the previous estimate and the new reward:
  Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} - Q_k]
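A one-function sketch of the incremental update above (names are illustrative):

def incremental_update(q_k, r_next, k):
    """Incremental sample-average update (illustrative sketch).

    Equivalent to averaging all k+1 rewards, but needs only the
    previous estimate q_k, the new reward r_next, and the count k.
    """
    return q_k + (r_next - q_k) / (k + 1)

Substituting this update for the stored reward histories in the earlier ε-greedy sketch removes the linearly growing memory.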

Recurring Theme in RL
- The previous result is consistent with a recurring theme in RL:
  New_Estimate ← Old_Estimate + StepSize [Target - Old_Estimate]
- The StepSize may be fixed or adaptive (in accordance with the specific application)

Tracking a Nonstationary Problem
- So far we have considered stationary problems
- In reality, many problems are effectively nonstationary
- A popular approach is to weight recent rewards more heavily than older ones
- One such technique uses a fixed step size:
  Q_{k+1} = Q_k + α [r_{k+1} - Q_k], with constant 0 < α ≤ 1
- This yields a weighted average in which the weight on past rewards decreases exponentially
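An illustrative sketch of tracking a single, slowly drifting arm with the fixed step-size rule; the drift rate, step size, and function name are assumptions.

import random

def track_nonstationary(plays=10000, alpha=0.1):
    """Track one drifting arm with a fixed step size (illustrative sketch).

    The true value q_star takes a small random walk each play, so the
    exponential recency-weighted average (constant alpha) keeps up,
    whereas a plain sample average would lag further and further behind.
    """
    q_star, Q = 0.0, 0.0
    for _ in range(plays):
        q_star += random.gauss(0.0, 0.01)       # slow drift of the true value
        r = random.gauss(q_star, 1.0)           # noisy reward
        Q += alpha * (r - Q)                    # Q_{k+1} = Q_k + alpha [r - Q_k]
    return Q, q_star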

Optimistic Initial Values
- All methods discussed so far depend, to some extent, on the initial action-value estimates, Q_0(a)
  - For sample-average methods, this bias disappears once every action has been selected at least once
  - For fixed step-size methods, the bias disappears with time (geometrically decreasing)
- In the 10-armed bandit example with α = 0.1:
  - If we set all initial reward guesses to +5 (instead of zero), exploration is guaranteed, since the true values are ~ N(0, 1)
  - Every reward falls short of the optimistic estimate, so a greedy learner keeps switching actions until the estimates settle
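A minimal sketch of a purely greedy agent with optimistic initial estimates, reusing the hypothetical NArmedBandit from the earlier sketch; Q_0 = +5 and α = 0.1 follow the slide, while the function name and the rest of the setup are illustrative.

def optimistic_greedy_run(bandit, q0=5.0, alpha=0.1, plays=1000):
    """Purely greedy agent with optimistic initial estimates (sketch).

    Every arm starts at the optimistic value q0; early rewards fall far
    below q0, so the greedy choice keeps switching arms and the agent
    explores without any explicit epsilon.
    """
    n = len(bandit.q_star)
    Q = [q0] * n
    rewards = []
    for _ in range(plays):
        a = max(range(n), key=lambda i: Q[i])   # always greedy
        r = bandit.pull(a)
        Q[a] += alpha * (r - Q[a])              # fixed step-size update
        rewards.append(r)
    return rewards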

Reinforcement Comparison
- An intuitive principle in RL:
  - higher rewards → made more likely to occur
  - lower rewards → made less likely to occur
- How is the learner to know what constitutes a high or a low reward?
  - To make this judgment, one must compare the reward to a reference reward, r̄_t
  - A natural choice is the average of previously received rewards
  - Such methods are called reinforcement comparison methods
- The agent maintains an action preference, p_t(a), for each action a
- The preferences can be used to select an action via a softmax relationship

Reinforcement Comparison (cont.)
- The reinforcement comparison idea is used to update the action preferences:
  p_{t+1}(a_t) = p_t(a_t) + β [r_t - r̄_t]
  - A high reward increases the probability of selecting that action, and vice versa
- Following the action-preference update, the agent updates the reference reward:
  r̄_{t+1} = r̄_t + α [r_t - r̄_t]
- Using separate step sizes (β for the preferences, α for the reference reward) allows different rates for p_t and r̄_t
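A minimal sketch of the two updates above together with softmax selection from the preferences; the function names and step-size values are assumptions for illustration.

import math

def reinforcement_comparison_step(p, r_bar, action, reward, alpha=0.1, beta=0.1):
    """One reinforcement-comparison update (illustrative sketch).

    p is the list of action preferences, r_bar the reference reward.
    The selected action's preference moves by beta * (reward - r_bar),
    then the reference reward itself is updated with step size alpha.
    """
    p = list(p)
    p[action] += beta * (reward - r_bar)      # preference update
    r_bar += alpha * (reward - r_bar)         # reference-reward update
    return p, r_bar

def preference_softmax(p):
    """Softmax action-selection probabilities from the preferences."""
    m = max(p)
    e = [math.exp(x - m) for x in p]
    s = sum(e)
    return [x / s for x in e]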

Pursuit Methods
- Another class of effective learning methods is pursuit methods
- They maintain both action-value estimates and action preferences (selection probabilities)
- The preferences continually "pursue" the greedy action
- Letting a*_{t+1} = argmax_a Q_{t+1}(a) denote the greedy action, the update rules for the selection probabilities π_t(a) are:
  π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [1 - π_t(a*_{t+1})]
  π_{t+1}(a) = π_t(a) + β [0 - π_t(a)]  for all a ≠ a*_{t+1}
- The action-value estimates, Q_{t+1}(a), are updated in one of the ways described earlier (e.g. sample averages of observed rewards)
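A minimal sketch of the pursuit probability update above; the function name and the β value are assumptions for illustration.

def pursuit_update(pi, Q, beta=0.01):
    """One pursuit-method probability update (illustrative sketch).

    The selection probability of the current greedy action (highest Q)
    is moved toward 1 and all other probabilities toward 0, each with
    step size beta; the probabilities remain a valid distribution.
    """
    greedy = max(range(len(Q)), key=lambda i: Q[i])
    new_pi = []
    for a, p in enumerate(pi):
        target = 1.0 if a == greedy else 0.0
        new_pi.append(p + beta * (target - p))
    return new_pi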

Pursuit Methods (cont.)
(slide content not reproduced in the transcript)

Associative Search
- So far we've considered nonassociative tasks, in which there is no association of actions with states
  - Find the single best action when the task is stationary, or
  - Track the best action as it changes over time
- However, in RL the goal is to learn a policy (i.e. a mapping from states to actions)
- A natural extension of the n-armed bandit:
  - Assume there are K machines, but only one is played at a time
  - The agent maps the state (i.e. which machine is currently played) to an action
- This is called an associative search task
  - It is like the full RL problem in that it involves a policy
  - However, it lacks the long-term reward prospect of the full RL problem

Summary
- We've looked at various action-selection schemes for balancing exploration vs. exploitation:
  - ε-greedy
  - Softmax techniques
- There is no single best solution for all problems
- We'll see more of this issue later