Reinforcement Learning. Presenter: 虞台文. Content: Introduction, Main Elements, Markov Decision Process (MDP), Value Functions.

Presentation transcript:

Reinforcement Learning Presenter: 虞台文

Content: Introduction; Main Elements; Markov Decision Process (MDP); Value Functions

Reinforcement Learning Introduction

Reinforcement Learning – Learning from interaction (with the environment) – Goal-directed learning – Learning what to do and its effect – Trial-and-error search and delayed reward: the two most important distinguishing features of reinforcement learning

Exploration and Exploitation The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
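One common way to balance the two is an epsilon-greedy rule, which the slide does not name; the sketch below is an illustration under that assumption. The function names `epsilon_greedy` and `run_bandit`, the Gaussian rewards and the sample-average update are all choices made for this example.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(estimates))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a Gaussian bandit (assumed setup) and return (estimates, average reward)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental sample-average update of the estimate for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, total / steps
```

With `epsilon = 0` the agent only exploits and can lock onto a poor action; with `epsilon = 1` it only explores and never uses what it has learned. Intermediate values trade the two off.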

Supervised Learning System Inputs Outputs Training Info = desired (target) outputs Error = (target output – actual output)

Reinforcement Learning RL System Inputs Outputs (“actions”) Training Info = evaluations (“rewards” / “penalties”) Objective: get as much reward as possible

Reinforcement Learning Main Elements

[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a state and a reward. The agent acts to maximize value.]

Main Elements [Figure: the same agent-environment loop.] Immediate reward (short term) versus total reward (long term)

Example (Bioreactor) State – current temperature and other sensory readings, composition, target chemical Actions – how much heating, stirring are required? – what ingredients need to be added? Reward – moment-by-moment production of desired chemical

Example (Pick-and-Place Robot) State – current positions and velocities of joints Actions – voltages to apply to motors Reward – reach end-position successfully, speed, smoothness of trajectory

Example (Recycling Robot) State – charge level of battery Actions – look for cans, wait for can, go recharge Reward – positive for finding cans, negative for running out of battery

Main Elements Environment – Its state is perceivable Reinforcement Function – To generate reward – A function of states (or state/action pairs) Value Function – The potential to reach the goal (with maximum total reward) – To determine the policy – A function of state

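The Agent-Environment Interface [Figure: the agent receives state $s_t$ and reward $r_t$ from the environment and emits action $a_t$; the environment then produces $s_{t+1}$ and $r_{t+1}$. A second panel unrolls the interaction over time with states $s_t, s_{t+1}, s_{t+2}, s_{t+3}$, rewards $r_t, r_{t+1}, r_{t+2}, r_{t+3}$ and actions $a_t, a_{t+1}, a_{t+2}, a_{t+3}, \dots$] Frequently, we model the environment as a Markov Decision Process (MDP).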
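The interaction loop can be sketched in a few lines. The environment below, `CoinFlipEnv`, is a hypothetical toy invented for this illustration, and the `step` method returning `(next_state, reward)` is an assumed interface, not one taken from the slides.

```python
import random

class CoinFlipEnv:
    """Hypothetical two-state environment used only to illustrate the loop."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Reward 1 when the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        # The next state is drawn at random, independent of the action.
        self.state = self.rng.randrange(2)
        return self.state, reward

def run_episode(env, policy, steps):
    """Roll out (s_t, a_t, r_{t+1}, s_{t+1}) tuples for a fixed number of steps."""
    trajectory = []
    state = env.state
    for _ in range(steps):
        action = policy(state)                 # agent picks a_t from s_t
        next_state, reward = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

For example, `run_episode(CoinFlipEnv(), lambda s: s, 10)` runs a policy that always matches the observed state and so collects a reward on every step.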

Reward Function A reward function is closely related to the goal in reinforcement learning. – It maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state: $r: S \to \mathbb{R}$ or $r: S \times A \to \mathbb{R}$, where S is a set of states and A is a set of actions.

Goals and Rewards The agent's goal is to maximize the total amount of reward it receives. This means maximizing not just immediate reward, but cumulative reward in the long run.

Goals and Rewards Reward = 0 Reward = 1 Can you design another reward function?

Goals and Rewards
state | reward
Win | +1
Loss | -1
Draw or non-terminal | 0

Goals and Rewards The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved. [Figure: example with reward labels 0 and -1.]

Reinforcement Learning Markov Decision Processes

Definition An MDP consists of: – A set of states S, and a set of actions A – A transition distribution – Expected next rewards
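The two formulas are not shown on this slide. As a sketch, assuming the notation of Sutton and Barto's first edition (which the later slides appear to follow), the transition distribution and the expected next rewards would be written as:

```latex
\begin{align*}
P^{a}_{ss'} &= \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\} \\
R^{a}_{ss'} &= E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
\end{align*}
```

Here $P^{a}_{ss'}$ and $R^{a}_{ss'}$ are assumed symbol names; both quantities are defined for all $s, s' \in S$ and $a \in A(s)$.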

Example (Recycling Robot) [Figure: transition graph over the battery states High and Low, with edges for the actions search, wait and recharge.]
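A possible encoding of this MDP as a plain dictionary is sketched below. The transition structure follows the recycling-robot example in Sutton and Barto's textbook; the numeric values of `ALPHA`, `BETA`, `R_SEARCH`, `R_WAIT` and `R_RESCUE` are assumptions picked for illustration, as are all identifier names.

```python
# mdp[state][action] -> list of (probability, next_state, reward) triples.
ALPHA, BETA = 0.9, 0.6       # assumed probabilities of the battery level staying put
R_SEARCH, R_WAIT = 2.0, 1.0  # assumed expected rewards; searching pays more than waiting
R_RESCUE = -3.0              # assumed penalty when the battery dies while searching

def recycling_robot():
    """Build the two-state recycling-robot MDP as nested dictionaries."""
    return {
        "high": {
            "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
            "wait": [(1.0, "high", R_WAIT)],
        },
        "low": {
            "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
            "wait": [(1.0, "low", R_WAIT)],
            "recharge": [(1.0, "high", 0.0)],
        },
    }

def expected_reward(mdp, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in mdp[state][action])
```

Note that recharge is only available in the low state, so the action set A(s) depends on the state.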

Decision Making Many stochastic processes can be modeled within the MDP framework. The process is controlled by choosing actions in each state trying to attain the maximum long-term reward. How to find the optimal policy?

Reinforcement Learning Value Functions

Value functions estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Value functions are defined with respect to particular policies.

Returns Episodic Tasks – finite-horizon tasks: terminate after a fixed number of time steps – indefinite-horizon tasks: can last arbitrarily long but eventually terminate Continual Tasks – infinite-horizon tasks: never terminate

Finite-Horizon Tasks Return at time t Expected return at time t Example: k-armed bandit problem

Indefinite-Horizon Tasks Return at time t Expected return at time t Example: playing chess

Continual Tasks Return at time t Expected return at time t Example: control

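Unified Notation Reformulation of episodic tasks [Figure: chain of states $s_0, s_1, s_2$ with rewards $r_1, r_2, r_3$, followed by $r_4 = 0$, $r_5 = \ldots$] Discounted return at time t; $\gamma$: discounting factor, with the cases $\gamma = 0$, $\gamma = 1$ and $\gamma < 1$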
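The standard discounted return is $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$. The sketch below computes it for a finite reward list; the function name and the list convention (`rewards[0]` stands for $r_{t+1}$) are assumptions for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[0] plays the role of r_{t+1}."""
    total = 0.0
    # Accumulate from the last reward backwards: total <- reward + gamma * total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

Appending zero rewards to the list leaves the result unchanged, which is why an episodic task can be treated as a continuing one that enters a zero-reward absorbing state. With `gamma = 0` only the immediate reward counts, and with `gamma = 1` the return is the plain sum.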

Policies A policy, $\pi$, is a mapping from states, $s \in S$, and actions, $a \in A(s)$, to the probability $\pi(s, a)$ of taking action a when in state s.

Value Functions under a Policy State-Value Function Action-Value Function
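The definitions themselves are not shown on this slide. A sketch in the notation of Sutton and Barto's first edition (an assumption about what the slide displayed), with $R_t$ the discounted return and $E_{\pi}$ the expectation when following policy $\pi$:

```latex
\begin{align*}
V^{\pi}(s) &= E_{\pi}\{ R_t \mid s_t = s \}
            = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\} \\
Q^{\pi}(s, a) &= E_{\pi}\{ R_t \mid s_t = s,\ a_t = a \}
               = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a \Big\}
\end{align*}
```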

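Bellman Equation for a Policy $\pi$: State-Value Function

Backup Diagram for $\pi$: State-Value Function [Figure: backup diagram with nodes labelled s, a and r.]

Bellman Equation for a Policy $\pi$: Action-Value Function

Backup Diagram for $\pi$: Action-Value Function [Figure: backup diagram with nodes labelled (s, a), s' and a'.]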

Bellman Equation for a Policy $\pi$ This is a set of equations (in fact, linear), one for each state. – It specifies the consistency condition between values of states and successor states, and rewards. Its unique solution is the value function for $\pi$.
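For reference, a sketch of the two Bellman equations in their standard form. The symbols $P^{a}_{ss'}$ (transition probability) and $R^{a}_{ss'}$ (expected next reward) are assumed names following Sutton and Barto's first edition:

```latex
\begin{align*}
V^{\pi}(s) &= \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}
              \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \\
Q^{\pi}(s, a) &= \sum_{s'} P^{a}_{ss'}
              \Big[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a') \Big]
\end{align*}
```

The first line is linear in the unknowns $V^{\pi}(s)$: with $|S|$ states it gives $|S|$ equations in $|S|$ unknowns, which is why the solution can be found by solving a linear system when $\gamma < 1$.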

Example (Grid World) State: position. Actions: north, south, east, west; the resulting state is deterministic. Reward: if an action would take the agent off the grid, the agent does not move and receives reward = -1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown. [Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$.]
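The state values for the random policy can be approximated by iterative policy evaluation. The sketch below assumes the layout of the textbook version of this example (a 5 x 5 grid where every action in A jumps to A' with reward +10 and every action in B jumps to B' with reward +5); the slide text does not give these positions, and the identifier names are hypothetical.

```python
GAMMA = 0.9
SIZE = 5
# Assumed layout: A at (0, 1) jumps to A' at (4, 1) for +10; B at (0, 3) jumps to B' at (2, 3) for +5.
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

def step(state, move):
    """Deterministic transition: returns (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # off-grid: stay put, reward -1

def evaluate_random_policy(tol=1e-6):
    """Iterative policy evaluation for the equiprobable random policy."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for state in values:
            # Bellman backup: average over the four equally likely actions.
            new_value = 0.0
            for move in MOVES:
                next_state, reward = step(state, move)
                new_value += 0.25 * (reward + GAMMA * values[next_state])
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value  # in-place update
        if delta < tol:
            return values
```

Under these assumptions the value of A comes out near 8.8, below its immediate reward of 10, because A' sits on the bottom edge where the random policy often bumps into the wall.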

Optimal Policy ($\pi^*$) Optimal State-Value Function Optimal Action-Value Function What is the relation between them?

Optimal Value Functions Bellman Optimality Equations: How to apply the value function to determine the action to be taken in each state? How to compute it? How to store it?
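The equations are not shown on this slide. In their standard form, again assuming the symbols $P^{a}_{ss'}$ and $R^{a}_{ss'}$ for the transition probabilities and expected next rewards, they would read:

```latex
\begin{align*}
V^{*}(s) &= \max_{a} Q^{*}(s, a)
          = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right] \\
Q^{*}(s, a) &= \sum_{s'} P^{a}_{ss'}
          \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \right] \\
\pi^{*}(s) &\in \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]
\end{align*}
```

The last line is one answer to the first question: given $V^{*}$, a one-step lookahead that is greedy with respect to $V^{*}$ yields an optimal action, while with $Q^{*}$ stored no lookahead through the model is needed.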

Example (Grid World) [Figure: grid world panels showing the random policy, the optimal value function $V^*$ and the optimal policy $\pi^*$.]
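One way to obtain $V^*$ and an optimal policy for this grid is value iteration, which the slide does not name; the sketch below uses it as an illustration. It assumes the textbook layout (5 x 5 grid, A to A' with reward +10, B to B' with reward +5, discount 0.9), and all identifier names are hypothetical.

```python
GAMMA = 0.9
SIZE = 5
# Assumed layout: A at (0, 1) jumps to A' at (4, 1) for +10; B at (0, 3) jumps to B' at (2, 3) for +5.
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, move):
    """Deterministic transition: returns (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # off-grid: stay put, reward -1

def backup(values, state, move):
    """One-step lookahead value of taking `move` in `state`."""
    next_state, reward = step(state, move)
    return reward + GAMMA * values[next_state]

def value_iteration(tol=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for state in values:
            best = max(backup(values, state, m) for m in MOVES.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values

def greedy_actions(values, state, eps=1e-4):
    """All actions whose one-step backup is within eps of the best one."""
    scores = {name: backup(values, state, m) for name, m in MOVES.items()}
    best = max(scores.values())
    return [name for name, score in scores.items() if best - score < eps]
```

Calling `greedy_actions(value_iteration(), (0, 0))` reads the optimal action off the converged values by one-step lookahead. Several actions can tie, so the optimal policy need not be unique.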

Finding Optimal Solution via Bellman Finding an optimal policy by solving the Bellman Optimality Equation requires the following: – accurate knowledge of environment dynamics; – enough space and time for computation; – the Markov Property.

Example (Recycling Robot) [Figure: transition graph over the battery states High and Low, with edges for the actions search, wait and recharge.]

Optimality and Approximation How much space and time do we need? – Polynomial in the number of states (via dynamic programming methods). – BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states). We usually have to settle for approximations. – Many RL methods can be understood as approximately solving the Bellman Optimality Equation.