Reinforcement Learning Guest Lecturer: Chengxiang Zhai 15-681 Machine Learning December 6, 2001

Outline For Today: The Reinforcement Learning Problem, Markov Decision Process, Q-Learning, Summary

The Checker Problem Revisited Goal: To win every game! What to learn: Given any board position, choose a “good” move But, what is a “good” move? –A move that helps win a game –A move that will lead to a “better” board position So, what is a “better” board position? –A position where a “good” next move exists!

Structure of the Checker Problem You are interacting/experimenting with an environment (board) You see the state of the environment (board position) And, you take an action (move), which will –change the state of the environment –result in an immediate reward Immediate reward = 0 unless you win (+100) or lose (-100) the game You want to learn to “control” the environment (board) so as to maximize your long term reward (win the game)

Reinforcement Learning Problem [Figure: agent-environment interaction loop. At each step t the agent observes state s_t and reward r_t and sends action a_t to the environment, producing the trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, ...] Goal: maximize r_1 + γ r_2 + γ² r_3 + … (γ: discount factor)

Three Elements in RL [Figure: agent-environment loop labeled with the three elements: state s, action a, reward r] (Slide from Prof. Sebastian Thrun's lecture)

Example 1: Slot Machine State: configuration of slots Action: stopping time Reward: $$$ (Slide from Prof. Sebastian Thrun's lecture)

Example 2: Mobile Robot State: location of robot, people, etc. Action: motion Reward: the number of happy faces (Slide from Prof. Sebastian Thrun's lecture)

Example 3: Backgammon State: board position Action: move Reward: –win (+100) –lose (-100) TD-Gammon ≈ best human players in the world

What Are We Learning Exactly? A decision function/policy –Given the state, choose an action Formally, –States: S = {s_1, …, s_n} –Actions: A = {a_1, …, a_m} –Reward: R –Find π : S → A that maximizes R (cumulative reward over time)

So, What's Special About Reinforcement Learning? Find π : S → A … isn't this just function approximation?

Reinforcement Learning Problem (recap of the earlier slide: the agent-environment interaction loop, with the goal of maximizing r_1 + γ r_2 + γ² r_3 + …)

What's So Special About RL? (Answers from "the book") Delayed reward Exploration Partially observable states Life-long learning

Now that we know the problem, how do we solve it? ==> Markov Decision Process (MDP)

Markov Decision Process (MDP) Finite set of states S Finite set of actions A At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t) Then receives immediate reward r_{t+1} = r(s_t, a_t) And the state changes to s_{t+1} = δ(s_t, a_t) Markov assumption: s_{t+1} = δ(s_t, a_t) and r_{t+1} = r(s_t, a_t) –Next reward and state depend only on the current state s_t and action a_t –Functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic –Functions δ(s_t, a_t) and r(s_t, a_t) are not necessarily known to the agent
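To make the pieces concrete, here is a minimal sketch (mine, not from the lecture) of a deterministic MDP stored as plain Python dictionaries; the tiny three-state chain, its rewards, and the discount factor are all hypothetical choices for illustration.

```python
# A toy deterministic MDP kept as plain dictionaries.
# The states, actions, delta, and r below are made-up illustrations.
STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

# Transition function: delta[(state, action)] -> next state
delta = {
    ("s0", "right"): "s1",     ("s0", "left"): "s0",
    ("s1", "right"): "goal",   ("s1", "left"): "s0",
    ("goal", "right"): "goal", ("goal", "left"): "goal",
}

# Reward function: r[(state, action)] -> immediate reward
# (0 everywhere except the move that enters the goal)
r = {key: 0.0 for key in delta}
r[("s1", "right")] = 100.0

GAMMA = 0.9  # discount factor
```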

Learning A Policy A policy tells us how to choose an action given a state An optimal policy is one that gives the best cumulative reward from any initial state We define a cumulative value function for a policy π: V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^∞ γ^i r_{t+i}, where r_t, r_{t+1}, … are generated by following policy π from start state s Task: learn the optimal policy π* that maximizes V^π(s) for every s: π* = argmax_π V^π(s), ∀s Define the optimal value function V*(s) = V^{π*}(s)
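As a quick illustration (my own sketch, reusing the hypothetical toy MDP dictionaries above), the discounted cumulative reward of a fixed policy can be estimated simply by rolling the policy out; the policy dictionary and horizon are assumptions.

```python
def rollout_return(policy, start, delta, r, gamma=0.9, horizon=50):
    """Discounted return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... obtained
    by following `policy` (a dict: state -> action) from `start`."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * r[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total

# Example: a policy that always moves right reaches the goal from s0.
always_right = {s: "right" for s in STATES}
print(rollout_return(always_right, "s0", delta, r))  # 0 + 0.9*100 = 90.0
```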

Idea 1: Enumerating Policies For each policy π : S → A and each state s, compute the evaluation function V^π(s) Pick the π that has the largest V^π(s) What's the problem? Complexity! How do we get around this? Observation: if we know V*(s) = V^{π*}(s), we can find π*: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
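Here is a minimal sketch (mine) of that observation, extracting the greedy policy from a given value function on the hypothetical toy MDP above; the `v_star` numbers are supplied by hand rather than computed.

```python
def greedy_policy(v_star, delta, r, actions, gamma=0.9):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ]."""
    states = {s for (s, _) in delta}
    return {s: max(actions,
                   key=lambda a: r[(s, a)] + gamma * v_star[delta[(s, a)]])
            for s in states}

# Hand-supplied optimal values for the toy MDP (goal is absorbing, value 0).
v_star = {"s0": 90.0, "s1": 100.0, "goal": 0.0}
print(greedy_policy(v_star, delta, r, ACTIONS))  # 'right' in s0 and s1
```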

Idea 2: Learn V*(s) For each state, compute V*(s) (less complexity) Given the current state s, choose the action according to π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] What's the problem this time? This works, but only if we know r(s,a) and δ(s,a) How can we evaluate an action without knowing r(s,a) and δ(s,a)? Observation: it seems that all we need is some function like Q(s,a) … [π*(s) = argmax_a Q(s,a)]

Idea 3: Learn Q(s,a) Because we know π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))], if we want π*(s) = argmax_a Q(s,a), then we must have Q(s,a) = r(s,a) + γ V*(δ(s,a)) We can express V* in terms of Q! V*(s) = max_a Q(s,a) So, we have THE RULE FOR Q-LEARNING: Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a') (value of a in s = immediate reward of a in s + discounted best value of any action in the next state)
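Worked check on the hypothetical toy MDP from the sketches above (my own numbers, with γ = 0.9): Q(goal, ·) = 0 since the goal is absorbing with zero reward, so Q(s1, right) = 100 + 0.9 · 0 = 100 and Q(s0, right) = 0 + 0.9 · 100 = 90, while Q(s1, left) = Q(s0, left) = 81; taking argmax_a Q(s, a) therefore picks "right" in both s0 and s1, matching the greedy policy extracted earlier.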

Q-Learning for Deterministic Worlds For each s, a, initialize the table entry Q(s,a) = 0 Observe the current state s Do forever: –Select an action a and execute it –Receive immediate reward r –Observe the new state s' –Update the entry for Q(s,a) as follows: Q(s,a) = r + γ max_{a'} Q(s',a') –Change to state s'
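Below is a minimal runnable sketch of this loop (my code, not the lecture's), reusing the hypothetical toy MDP dictionaries from the earlier sketch; the episode structure, purely random action selection, and step limits are illustrative assumptions.

```python
import random

def q_learning_deterministic(delta, r, actions, gamma=0.9,
                             episodes=200, steps_per_episode=20, start="s0"):
    """Tabular Q-learning for a deterministic world:
    Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = {key: 0.0 for key in delta}          # initialize every Q(s,a) to 0
    for _ in range(episodes):
        s = start
        for _ in range(steps_per_episode):
            a = random.choice(actions)       # exploration: act at random
            reward, s_next = r[(s, a)], delta[(s, a)]
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

Q = q_learning_deterministic(delta, r, ACTIONS)
print(Q[("s0", "right")], Q[("s1", "right")])  # converges to 90.0 and 100.0
```

Taking argmax_a Q(s, a) over the learned table recovers the same "always right" policy as before.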

Why Does Q-learning Work? Q-learning converges! Intuitively, for non-negative rewards: estimated Q values never decrease and never exceed the true Q values, and the maximum error shrinks by a factor of γ each time every state-action pair has been updated. Q_{n+1}(s,a) = r + γ max_{a'} Q_n(s',a') → true Q(s,a)

Nondeterministic Case Both r(s,a) and δ(s,a) may have probabilistic outcomes Solution: just add expectations! Q(s,a) = E[r(s,a)] + γ E_{p(s'|s,a)}[max_{a'} Q(s',a')] The update rule is slightly different (partial updates whose step size shrinks as a state-action pair is visited more often, so the estimates settle down after enough visits) It also converges
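A sketch (mine, not from the slides) of the corresponding update step: the table entry is nudged toward the sampled target with a learning rate α; the decay schedule α = 1/(1 + visits) shown here is one common choice, not the only one.

```python
def q_update_stochastic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
    """One Q-learning step for a nondeterministic world:
    Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a')),
    with alpha decaying as (s,a) is visited more often."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```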

Extensions of Q-Learning How can we accelerate Q-learning? –Choose actions wisely: favor actions with high Q(s,a) while still exploring (exploration vs. exploitation; see the sketch below) –Choose better updating sequences (e.g., update backwards along an episode) –Store past state-action transitions and replay them –Exploit knowledge of the transition and reward functions (simulation) What if we can't store all the entries? –Function approximation (neural networks, etc.)
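One standard way to balance exploration and exploitation, referenced in the first bullet above, is ε-greedy action selection; this is a generic sketch of mine (the slide does not name ε-greedy explicitly), parameterized by a Q table like the one learned above.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    the action with the highest current estimate Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```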

Temporal Difference (TD) Learning Learn by reducing discrepancies between estimates made at different times Q-learning is a special case with one-step lookahead. Why not more than one step? TD(λ): blend one-step, two-step, …, n-step lookahead with coefficients depending on λ When λ = 0, we get one-step Q-learning When λ = 1, only the observed r values are considered Q^λ(s_t, a_t) = r_t + γ[(1 - λ) max_a Q(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
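For concreteness, here is a small sketch of mine that evaluates this recursive blended target for the first state-action pair of a recorded trajectory; the trajectory format and the assumption that the trajectory ends in a terminal state are mine, not the lecture's.

```python
def lambda_target(trajectory, Q, actions, gamma=0.9, lam=0.5):
    """Blended TD(lambda) target Q^lambda(s_0, a_0) for a recorded trajectory,
    a list of (s, a, r, s_next) transitions whose final s_next is terminal."""
    target = 0.0
    for i, (s, a, r, s_next) in reversed(list(enumerate(trajectory))):
        if i == len(trajectory) - 1:
            bootstrap = 0.0              # terminal state: no future value
        else:
            one_step = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            bootstrap = (1 - lam) * one_step + lam * target
        target = r + gamma * bootstrap
    return target
```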

What You Should Know All the basic concepts of RL (state, action, reward, policy, value functions, discounted cumulative reward, …) The mathematical foundation of RL is MDPs and dynamic programming The details of Q-learning, including its limitations (you should be able to implement it!) Q-learning is a member of the family of temporal difference algorithms