Reinforcement learning

Presentation transcript:

Reinforcement learning, 02/05/2017. Sources: Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004); Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005); Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)

Reinforcement learning http://www.youtube.com/watch?v=mRpX9DFCdwI http://www.youtube.com/watch?v=VCdxqn0fcnE

Reinforcement learning demo: Pavlov, a Nomad 200 robot and the Nomad 200 simulator (Sridhar Mahadevan, UMass)

Reinforcement learning: control tasks, planning of sequences of actions, learning from interaction. Objective: maximising reward (which is task-specific). (Diagram: a trajectory of states s1, s2, …, s9 with actions a1, a2, …, a9 and rewards such as +3, -1, +50.)

Supervised vs. Reinforcement learning: both are machine learning. Supervised: prompt supervision; passive learning (a training dataset is given). Reinforcement: late, indirect reinforcement; active learning (the system takes actions, which are then reinforced).

Reinforcement learning, formally: time t = 0, 1, 2, …; states s_t ∈ S; actions a_t ∈ A; rewards r_t ∈ ℝ; policy (strategy) π: deterministic, a_t = π(s_t), or stochastic, where π(s,a) is the likelihood that we choose action a in state s (infinite horizon).

Process: s_0, a_0, r_1, s_1, a_1, r_2, s_2, … Model of the environment: transition probabilities P(s' | s, a) and rewards R(s, a). Objective: find a policy which maximises the expected value of the total reward.
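As a concrete illustration, a finite MDP like this can be stored as plain transition tables. A minimal Python sketch, assuming a made-up two-state, two-action MDP (the names s0, s1, a0, a1 are hypothetical, not from the slides):

```python
# A minimal tabular MDP: P[s][a] is a list of (probability, next_state, reward)
# triples. The two-state, two-action MDP below is made up for illustration.
P = {
    "s0": {
        "a0": [(0.9, "s0", 0.0), (0.1, "s1", 0.0)],
        "a1": [(1.0, "s1", 2.0)],
    },
    "s1": {
        "a0": [(1.0, "s0", 0.0)],
        "a1": [(0.5, "s0", 1.0), (0.5, "s1", 1.0)],
    },
}

def expected_reward(state, action):
    """Expected immediate reward R(s, a) = sum over s' of P(s'|s,a) * r."""
    return sum(p * r for p, _, r in P[state][action])

print(expected_reward("s0", "a1"))  # 2.0
```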

Markov assumption → the dynamics of the system can be given by: P(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} = s' | s_t, a_t).

Markov Decision Processes (MDPs): stochastic transitions. (Diagram: two states, 1 and 2, with action a1 yielding reward r = 0 and action a2 yielding reward r = 2.)

The exploration vs. exploitation dilemma: the k-armed bandit. The agent repeatedly picks one of k arms and receives a reward for each pull. Example: arm 1 pays 0, 0, 5, 10, 35 (average reward 10); arm 2 pays 5, 10, -15, -15, -10 (average reward -5); arm 3 pays -20, 0, 50 (average reward 10). To maximise the reward in the long term we first have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
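A minimal epsilon-greedy sketch of this explore/exploit trade-off. The arm reward distributions below are invented for illustration (their means loosely echo the average rewards above):

```python
import random

# Epsilon-greedy agent for a k-armed bandit. The true arm means are hidden
# from the agent and are invented numbers for this illustration.
arm_means = [10.0, -5.0, 10.0]
k = len(arm_means)
estimates = [0.0] * k        # running average reward per arm (exploitation target)
counts = [0] * k
epsilon = 0.1                # probability of exploring a random arm

def pull(arm):
    """Sample a noisy reward around the arm's true mean."""
    return random.gauss(arm_means[arm], 5.0)

total_reward = 0.0
for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(k)                           # explore
    else:
        arm = max(range(k), key=lambda a: estimates[a])     # exploit
    r = pull(arm)
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]    # incremental mean
    total_reward += r

print(estimates, total_reward)
```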

Discounting: with an infinite horizon, the sum of the r_t can be infinite! Solution: discounting. Instead of r_t we use γ^t r_t with γ < 1, so the discounted sum Σ_t γ^t r_t is always finite.
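A tiny sketch of why discounting keeps the return finite; the reward sequences are arbitrary illustrative numbers:

```python
# Discounted return: the sum of gamma^t * r_t stays finite for gamma < 1,
# even over very long horizons. The reward values are illustrative.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([3, -1, 0, 0, 50]))   # later rewards are weighted less
print(discounted_return([1.0] * 10_000))      # close to 1 / (1 - 0.9) = 10
```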

Markov Decision Process: the environment changes according to P and R; the agent takes an action a_t = π(s_t). We are looking for the optimal policy π* which maximises the expected discounted total reward E[Σ_t γ^t r_t].

Long-term reward: the policy π of the agent is fixed. R_t is the total discounted reward (return) after step t: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}.

Value = expected total reward. The expected value of R_t depends on π: V^π(s) = E_π[R_t | s_t = s] is the value function. Task: find the optimal policy π* which maximises the expected return in every state.

We optimise (search for π*) for the long-term reward instead of immediate (greedy) rewards. (Diagram: the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ….)

Bellman equation: based on the Markov assumption, a recursive formula can be derived for the expected return: V^π(s) = Σ_a π(s,a) [ R(s,a) + γ Σ_{s'} P(s' | s,a) V^π(s') ]. (Diagram: a backup from state s through π(s) to its successor states.)

Preference relation among policies: π1 ≥ π2 iff V^π1(s) ≥ V^π2(s) for every state s (a partial ordering). π* is optimal if π* ≥ π for every policy π. An optimal policy exists for every problem.

Example MDP: 4 states (A, B, C, D), 2 actions (1 and 2). Rewards of -10 and +100 appear on certain transitions; D is the goal (objective). There is a 10% chance of taking the non-selected action.

Two example policies on this MDP; the first always chooses action 1: π1(A,1) = 1, π1(A,2) = 0; π1(B,1) = 1, π1(B,2) = 0; π1(C,1) = 1, π1(C,2) = 0; π1(D,1) = 1, π1(D,2) = 0.

solution: solution for 2 :

A third policy: π3(A,1) = 0.4, π3(A,2) = 0.6; π3(B,1) = 1, π3(B,2) = 0; π3(C,1) = 0, π3(C,2) = 1; π3(D,1) = 1, π3(D,2) = 0.

Solution for π3: solve the Bellman equations for V^π3.

Comparison of the 3 policies (state values): V^π1(A) = 75.61, V^π3(A) = 77.78; V^π1(B) = 87.56, V^π2(B) = 68.05, V^π3(B) = 87.78; V(D) = 100. π1 ≤ π3 and π2 ≤ π3, so π3 is optimal. There can be many optimal policies! The optimal value function (V*) is unique.
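Value columns like these can be computed by solving the linear Bellman system V^π = R^π + γ P^π V^π for each fixed policy. A numpy sketch, assuming an illustrative 4-state chain (the transition probabilities and rewards below are made up, not the exact MDP of this example):

```python
import numpy as np

# Evaluate a fixed policy pi by solving the linear Bellman system
#   V = R_pi + gamma * P_pi @ V   <=>   (I - gamma * P_pi) V = R_pi,
# where P_pi[s, s'] and R_pi[s] already have the policy's action choices
# folded in. The numbers model a hypothetical chain A -> B -> C -> D.
gamma = 0.9
P_pi = np.array([
    [0.1, 0.9, 0.0, 0.0],   # A
    [0.1, 0.0, 0.9, 0.0],   # B
    [0.0, 0.1, 0.0, 0.9],   # C
    [0.0, 0.0, 0.0, 1.0],   # D (absorbing goal state)
])
R_pi = np.array([0.0, 0.0, 90.0, 10.0])   # expected immediate reward per state

V = np.linalg.solve(np.eye(4) - gamma * P_pi, R_pi)
print(dict(zip("ABCD", V.round(2))))      # e.g. V(D) = 10 / (1 - 0.9) = 100
```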

Optimal policies and the Bellman equation. Q^π(s,a) = E_π[R_t | s_t = s, a_t = a] is the action-value function. The optimal policies all share the same value function: V*(s) = max_π V^π(s) and Q*(s,a) = max_π Q^π(s,a). Greedy policy: π(s) = argmax_a Q*(s,a). The greedy policy with respect to Q* is optimal!

Optimal policies and the Bellman optimality equation: V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s,a) V*(s') ]. This equation is non-linear, has a unique solution, and solving it solves the long-term planning problem.
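A minimal value-iteration sketch that repeatedly applies this Bellman optimality backup until the values converge, then reads off the greedy (optimal) policy. The small 3-state MDP used here is made up for illustration:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_s' P(s'|s,a) * (r + gamma * V(s'))
# until the value function stops changing, then act greedily.
# P[s][a] lists (prob, next_state, reward) triples for a made-up 3-state MDP.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(0.9, 1, 0.0), (0.1, 0, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(0.9, 2, 10.0), (0.1, 1, 0.0)]},
    2: {"left": [(1.0, 1, 0.0)], "right": [(1.0, 2, 0.0)]},
}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        backup = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(backup - V[s]))
        V[s] = backup
    if delta < theta:
        break

# The greedy policy with respect to the converged V is optimal.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print({s: round(v, 2) for s, v in V.items()}, policy)
```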

Dynamic programming for MDPs: DP assumes that P and R are known. Searching for the optimal policy π*: policy iteration and value iteration.

Policy iteration
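A minimal sketch of the policy-iteration loop, alternating iterative evaluation of the current policy with greedy improvement until the policy is stable. The two-state MDP and its rewards are invented for illustration:

```python
# Policy iteration: alternate (1) evaluation of the current policy and
# (2) greedy improvement, until the policy no longer changes.
# P[s][a] lists (prob, next_state, reward) triples for a made-up 2-state MDP.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "go": [(0.8, 0, 5.0), (0.2, 1, 1.0)]},
}
gamma = 0.9

def q(s, a, V):
    """Action value of (s, a) under the current estimate V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

policy = {s: "stay" for s in P}          # arbitrary initial policy
while True:
    # 1) iterative policy evaluation (to a small tolerance)
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = q(s, policy[s], V)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < 1e-8:
            break
    # 2) policy improvement: act greedily with respect to V
    improved = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    if improved == policy:               # stable policy -> optimal
        break
    policy = improved

print(policy, {s: round(v, 2) for s, v in V.items()})
```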

Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
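A sketch of how this problem could be set up in code: the state space, the feasible overnight moves, and the expected rental income under Poisson demand. It follows the parameters stated above, but the helper names and the truncation of the Poisson tail are implementation choices of this sketch, not part of the original formulation, and it is not a full DP solver:

```python
from math import exp, factorial

# Sketch of the state space, feasible actions, and expected rental income
# for Jack's Car Rental. State: (cars at location 1, cars at location 2)
# at the end of the day; action: net cars moved overnight from loc. 1 to loc. 2.
MAX_CARS, MAX_MOVE = 20, 5
RENT, MOVE_COST, GAMMA = 10, 2, 0.9      # GAMMA is the stated discount rate
LAMBDA_REQ = (3, 4)                      # Poisson rental-request rates
LAMBDA_RET = (3, 2)                      # Poisson return rates (for the full model)

def poisson(n, lam):
    return lam ** n * exp(-lam) / factorial(n)

states = [(c1, c2) for c1 in range(MAX_CARS + 1) for c2 in range(MAX_CARS + 1)]

def actions(state):
    """Feasible net moves: cannot move more cars than a location holds."""
    c1, c2 = state
    return range(-min(MAX_MOVE, c2), min(MAX_MOVE, c1) + 1)

def expected_rental_income(cars, lam):
    """E[RENT * min(demand, cars)] with Poisson demand, tail truncated at `cars`."""
    below = sum(poisson(n, lam) * RENT * n for n in range(cars))
    tail = 1.0 - sum(poisson(n, lam) for n in range(cars))
    return below + tail * RENT * cars

def expected_immediate_reward(state, a):
    """Expected rental income after moving `a` cars overnight, minus the move cost."""
    c1 = min(state[0] - a, MAX_CARS)
    c2 = min(state[1] + a, MAX_CARS)
    return (expected_rental_income(c1, LAMBDA_REQ[0])
            + expected_rental_income(c2, LAMBDA_REQ[1])
            - MOVE_COST * abs(a))

print(len(states))                            # 441 states
print(expected_immediate_reward((10, 10), 2))
```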

If P and R are NOT known, we estimate V^π directly. R(s): the return starting from state s (a random variable), with V^π(s) = E[R(s)].

estimation of V(s) by Monte Carlo methods, MC estimating R(s) by simulation (remember we don’t know P and R) take N episodes starting from s according to 

Monte Carlo policy evaluation
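A first-visit Monte Carlo policy evaluation sketch along these lines: simulate N episodes under a fixed policy, compute the discounted return after the first visit to each state, and average. The toy environment (a small chain with noisy moves) and the uniform-random policy are made up for illustration:

```python
import random
from collections import defaultdict

# First-visit Monte Carlo policy evaluation: V(s) is estimated as the average
# return observed after the first visit to s in each simulated episode.
GAMMA = 0.9
N_STATES = 5                      # states 0..4; state 4 is terminal

def step(s, a):
    """Hypothetical dynamics: the intended move succeeds with probability 0.8."""
    move = 1 if a == "right" else -1
    if random.random() > 0.8:
        move = -move
    s2 = min(max(s + move, 0), N_STATES - 1)
    reward = 10.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

def policy(s):
    return random.choice(["left", "right"])    # the fixed policy being evaluated

returns = defaultdict(list)
for _ in range(5000):                          # N simulated episodes from s = 0
    s, episode, done = 0, [], False
    while not done:
        a = policy(s)
        s2, r, done = step(s, a)
        episode.append((s, r))                 # (state, reward received afterwards)
        s = s2
    first_visit = {}
    for t, (st, _) in enumerate(episode):
        first_visit.setdefault(st, t)
    G = 0.0
    for t in reversed(range(len(episode))):
        st, r = episode[t]
        G = r + GAMMA * G                      # discounted return from step t
        if first_visit[st] == t:
            returns[st].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print({s: round(v, 2) for s, v in sorted(V.items())})
```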