CS 5751 Machine Learning, Chapter 13: Reinforcement Learning (presentation transcript)

Slide 1: Reinforcement Learning
– Control learning
– Control policies that choose optimal actions
– Q learning
– Convergence

Slide 2: Control Learning
Consider learning to choose actions, e.g.,
– a robot learning to dock on its battery charger
– learning to choose actions to optimize factory output
– learning to play Backgammon
Note several problem characteristics:
– delayed reward
– opportunity for active exploration
– possibility that the state is only partially observable
– possible need to learn multiple tasks with the same sensors/effectors

Slide 3: One Example: TD-Gammon (Tesauro, 1995)
Learn to play Backgammon. Immediate reward:
– +100 if win
– -100 if lose
– 0 for all other states
Trained by playing 1.5 million games against itself; now approximately equal to the best human player.

Slide 4: Reinforcement Learning Problem
[Diagram: the agent interacts with the environment in a loop; at each step it observes state $s_t$, takes action $a_t$, and receives reward $r_t$, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$.

Slide 5: Markov Decision Process
Assume:
– a finite set of states $S$ and a set of actions $A$
– at each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
– it then receives immediate reward $r_t$, and the state changes to $s_{t+1}$
Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
– i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
– the functions $\delta$ and $r$ may be nondeterministic
– the functions $\delta$ and $r$ are not necessarily known to the agent
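To make these pieces concrete, here is a minimal sketch of a deterministic MDP as a small Python grid world. The class name, the 2x3 layout, the goal location, and the +100 goal reward are illustrative assumptions, not something specified on the slides; only the roles of $\delta$ and $r$ come from the slide above.

```python
# A minimal deterministic MDP sketch: states are grid cells, actions move the
# agent, and delta/r play the roles of the transition and reward functions.
# The 2x3 grid, goal cell, and +100 goal reward are illustrative assumptions.
class GridMDP:
    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self, rows=2, cols=3, goal=(0, 2)):
        self.rows, self.cols, self.goal = rows, cols, goal
        self.states = [(i, j) for i in range(rows) for j in range(cols)]

    def delta(self, s, a):
        """Deterministic transition function delta(s, a) -> s'."""
        if s == self.goal:                      # absorbing goal state
            return s
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        di, dj = moves[a]
        ni, nj = s[0] + di, s[1] + dj
        if 0 <= ni < self.rows and 0 <= nj < self.cols:
            return (ni, nj)
        return s                                # bumping a wall leaves the state unchanged

    def r(self, s, a):
        """Immediate reward r(s, a): +100 for entering the goal, 0 otherwise."""
        return 100 if self.delta(s, a) == self.goal and s != self.goal else 0
```

Later sketches in this transcript reuse this hypothetical environment through its `states`, `ACTIONS`, `delta`, and `r` members.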

Slide 6: Agent's Learning Task
Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes
    $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
from any starting state in $S$; here $0 \le \gamma < 1$ is the discount factor for future rewards.
Note something new: the target function is $\pi : S \to A$, but we have no training examples of the form $\langle s, a \rangle$ (a state paired with its correct action); training examples are of the form $\langle \langle s, a \rangle, r \rangle$.

Slide 7: Value Function
To begin, consider deterministic worlds…
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states,
    $V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$,
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.
Restated, the task is to learn the optimal policy $\pi^*$:
    $\pi^* \equiv \operatorname{argmax}_{\pi} V^{\pi}(s), \;\; (\forall s)$.
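A small sketch of what $V^{\pi}(s)$ means operationally in a deterministic world: roll the policy forward and sum discounted rewards. The truncation horizon is an assumption (the sum converges because $\gamma < 1$), and `policy`, `delta`, and `r` are assumed callables in the spirit of the earlier slides.

```python
# Estimate V^pi(s) on a deterministic MDP by following the policy and summing
# discounted rewards, truncated after `horizon` steps (a good approximation
# since gamma < 1). policy, delta, and r are assumed callables.
def v_pi(s, policy, delta, r, gamma=0.9, horizon=100):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                    # action chosen by the policy
        total += discount * r(s, a)      # accumulate gamma^i * r_{t+i}
        s = delta(s, a)                  # deterministic next state
        discount *= gamma
    return total
```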

Slide 8: [Figure: a simple grid world with absorbing goal state G, showing the immediate reward values r(s,a), the Q(s,a) values, the V*(s) values for each state, and one optimal policy.]

Slide 9: What to Learn
We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
We could then do a one-step lookahead search to choose the best action from any state $s$, because
    $\pi^*(s) = \operatorname{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$.
A problem: this works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \mathbb{R}$.
But when it doesn't, we can't choose actions this way.

Slide 10: Q Function
Define a new function very similar to $V^*$:
    $Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$.
If the agent learns $Q$, it can choose the optimal action even without knowing $\delta$:
    $\pi^*(s) = \operatorname{argmax}_{a} Q(s,a)$.
Q is the evaluation function the agent will learn.
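A short sketch contrasting the two action-selection rules on slides 9 and 10. The helper names and dictionary-based tables are illustrative assumptions; the point is that the $V^*$-based rule needs the model functions $\delta$ and $r$, while the $Q$-based rule does not.

```python
# Contrast of the two action-selection rules. V is a dict keyed by state,
# Q is a dict keyed by (state, action); both tables and names are assumptions.
def act_with_model(s, actions, r, delta, V, gamma=0.9):
    """Slide 9: one-step lookahead through the (known) model delta and r."""
    return max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])

def act_with_q(s, actions, Q):
    """Slide 10: model-free greedy choice from the learned Q-table."""
    return max(actions, key=lambda a: Q[(s, a)])
```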

Slide 11: Training Rule to Learn Q
Note that Q and V* are closely related:
    $V^*(s) = \max_{a'} Q(s, a')$,
which allows us to write Q recursively as
    $Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(\delta(s_t, a_t), a')$.
Let $\hat{Q}$ denote the learner's current approximation to Q. Consider the training rule
    $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$,
where $s'$ is the state resulting from applying action $a$ in state $s$.

Slide 12: Q Learning for Deterministic Worlds
For each $s, a$ initialize the table entry $\hat{Q}(s,a) \leftarrow 0$.
Observe the current state $s$.
Do forever:
– select an action $a$ and execute it
– receive immediate reward $r$
– observe the new state $s'$
– update the table entry for $\hat{Q}(s,a)$: $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
– $s \leftarrow s'$
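The algorithm above as a runnable Python sketch. The episodic loop, the purely random action selection, and the default $\gamma = 0.9$ are assumptions for illustration (the slide's loop runs forever and does not say how actions are selected); the update line is the slide's rule.

```python
import random

# Deterministic-world Q-learning as on slide 12. Episodes, random action
# selection, and gamma=0.9 are illustrative assumptions; the update is
# Q(s,a) <- r + gamma * max_a' Q(s',a').
def q_learning(states, actions, delta, r, goal, gamma=0.9, episodes=1000):
    Q = {(s, a): 0.0 for s in states for a in actions}   # initialize table to 0
    for _ in range(episodes):
        s = random.choice(states)                        # arbitrary start state
        while s != goal:                                 # run one episode to the goal
            a = random.choice(actions)                   # exploration policy unspecified on the slide
            reward, s_next = r(s, a), delta(s, a)
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```

Paired with the hypothetical GridMDP from the earlier sketch, this could be called as `q_learning(env.states, GridMDP.ACTIONS, env.delta, env.r, env.goal)`.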

Slide 13: Updating $\hat{Q}$
[Figure: the agent executes action $a_{right}$, moving from the initial state $s_1$ to the next state $s_2$; the table entry $\hat{Q}(s_1, a_{right})$ is then updated from the reward received and the largest $\hat{Q}$ value (100 in the figure) leaving $s_2$.]
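A worked instance of this update, assuming the immediate reward is 0 and $\gamma = 0.9$ (values consistent with the running example but not all legible on this slide):
    $\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \times 100 = 90$.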

Slide 14: Convergence
Theorem: $\hat{Q}$ converges to Q. Consider the case of a deterministic world where each $\langle s, a \rangle$ pair is visited infinitely often.
Proof: define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is,
    $\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$.

Slide 15: Convergence (cont)
For any table entry $\hat{Q}_n(s,a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is bounded by $\gamma \Delta_n$, as the derivation below shows.
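The error-reduction chain that this slide's (missing) equations spell out, following the standard argument for deterministic Q-learning and using the fact that $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$:

$$
\begin{aligned}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \big|\,(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))\,\big| \\
  &= \gamma\, \big|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\big| \\
  &\le \gamma\, \max_{a'} \big|\hat{Q}_n(s',a') - Q(s',a')\big| \\
  &\le \gamma\, \max_{s'',a'} \big|\hat{Q}_n(s'',a') - Q(s'',a')\big| \;=\; \gamma \Delta_n .
\end{aligned}
$$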

Slide 16: Nondeterministic Case
What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values:
    $V^{\pi}(s) \equiv E\left[ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \right]$
    $Q(s,a) \equiv E\left[ r(s,a) + \gamma V^{*}(\delta(s,a)) \right]$

Slide 17: Nondeterministic Case
Q learning generalizes to nondeterministic worlds. Alter the training rule to
    $\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s,a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \right]$,
where
    $\alpha_n = \dfrac{1}{1 + \mathit{visits}_n(s,a)}$.
Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992].
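A sketch of that $\alpha$-weighted update as code. The dictionary-based tables and the visit counter are illustrative assumptions; the decaying step size matches the $\alpha_n$ formula above.

```python
# Nondeterministic Q-learning update (slide 17): move the table entry part of
# the way toward the sampled estimate, with step size alpha = 1/(1 + visits).
# Dict-based tables Q[(s, a)] and visits[(s, a)] are illustrative assumptions.
def q_update_nondeterministic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    sample = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q[(s, a)]
```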

Slide 18: Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
    $Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$.
Why not two steps?
    $Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$.
Or n?
    $Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a} \hat{Q}(s_{t+n}, a)$.
Blend all of these:
    $Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$.

Slide 19: Temporal Difference Learning
Equivalent recursive expression:
    $Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$.
The TD($\lambda$) algorithm uses the above training rule.
– Sometimes converges faster than Q learning.
– Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992).
– Tesauro's TD-Gammon uses this algorithm.
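A small forward-view sketch of the $Q^{\lambda}$ blend computed from a recorded episode. Truncating the blend at the end of the episode, the (state, action, reward) tuple layout, the nested Q-table, and the default $\lambda$ are assumptions made for illustration.

```python
# Forward-view blend of the n-step backed-up estimates for (s_t, a_t).
# episode is a list of (state, action, reward) tuples; Q is a nested dict
# Q[state][action] -> value. The blend is truncated at the episode's end,
# so it is an approximation of the infinite Q^lambda sum.
def q_lambda_estimate(episode, Q, t, gamma=0.9, lam=0.5):
    T = len(episode)
    blended, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):            # each n-step lookahead that fits in the episode
        # n-step estimate: r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n max_a Q(s_{t+n}, a)
        g = sum(gamma**i * episode[t + i][2] for i in range(n))
        s_n = episode[t + n][0]
        g += gamma**n * max(Q[s_n].values())
        blended += weight * g            # weight (1-lam) * lam^(n-1) for the n-step term
        weight *= lam
    return blended
```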

Slide 20: Subtleties and Ongoing Research
– Replace the $\hat{Q}$ table with a neural network or other generalizer.
– Handle the case where the state is only partially observable.
– Design optimal exploration strategies.
– Extend to continuous actions and states.
– Learn and use $\hat{\delta} : S \times A \to S$, an approximation to $\delta$.
– Relationship to dynamic programming.

Slide 21: RL Summary
Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states/actions is unknown
Temporal Difference Learning
– learns by reducing discrepancies between successive estimates
– used in TD-Gammon
V(s) - state value function
– needs known reward/state-transition functions

Slide 22: RL Summary
Q(s,a) - state/action value function
– related to V
– does not need reward/state-transition functions
– training rule related to dynamic programming
– measure the actual reward received for the action and the future value using the current Q function
– deterministic case: replace the existing estimate
– nondeterministic case: move the table estimate toward the measured estimate
– convergence: can be shown in both cases