Reinforcement Learning. Lecturer: 虞台文, Graduate Institute of Computer Science and Engineering, Tatung University; Intelligent Multimedia Lab.

Content
- Introduction
- Main Elements
- Markov Decision Process (MDP)
- Value Functions

Reinforcement Learning: Introduction. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.

Reinforcement Learning
- Learning from interaction (with the environment)
- Goal-directed learning
- Learning what to do and its effect
- Trial-and-error search and delayed reward: the two most important distinguishing features of reinforcement learning

Exploration and Exploitation: The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
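
As a concrete illustration of the dilemma, here is a minimal sketch of ε-greedy action selection in Python. This example is not from the slides; Q, actions, and epsilon are illustrative names.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit current value estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

A larger epsilon means more exploration; in practice epsilon is often decayed over time so the agent exploits more as its value estimates improve.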

Supervised Learning: Inputs → Supervised Learning System → Outputs. Training info = desired (target) outputs. Error = (target output - actual output).

Reinforcement Learning: Inputs → RL System → Outputs ("actions"). Training info = evaluations ("rewards" / "penalties"). Objective: get as much reward as possible.

Reinforcement Learning: Main Elements. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.

Main Elements: the agent interacts with an environment; the agent takes an action, and the environment returns a reward and a new state. The agent's aim is to maximize value.

Example (Bioreactor)
- State: current temperature and other sensory readings, composition, target chemical
- Actions: how much heating, how much stirring, what ingredients to add
- Reward: moment-by-moment production of the desired chemical

Example (Pick-and-Place Robot)
- State: current positions and velocities of the joints
- Actions: voltages to apply to the motors
- Reward: reaching the end-position successfully, speed, smoothness of the trajectory

Example (Recycling Robot)
- State: charge level of the battery
- Actions: look for cans, wait for a can, go recharge
- Reward: positive for finding cans, negative for running out of battery

Main Elements
- Environment: its state is perceivable
- Reinforcement function: generates the reward; a function of states (or state/action pairs)
- Value function: the potential to reach the goal (with maximum total reward); used to determine the policy; a function of state

The Agent-Environment Interface: at each time step t the agent observes state s_t and reward r_t, takes action a_t, and the environment responds with reward r_{t+1} and next state s_{t+1}, yielding the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, ... Frequently, we model the environment as a Markov Decision Process (MDP).
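
Read as pseudocode, this interface is a simple loop. A minimal sketch in Python, assuming a hypothetical environment with reset()/step() methods and an agent with an act() method (these names are assumptions, not part of the slides):

```python
def run_episode(env, agent, max_steps=1000):
    """Generate the trajectory s_t, a_t, r_{t+1}, s_{t+1}, ... and return the total reward."""
    state = env.reset()                              # s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                    # a_t
        next_state, reward, done = env.step(action)  # s_{t+1}, r_{t+1}, episode finished?
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```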

Reward Function: A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state. Here S denotes the set of states and A the set of actions.
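
Written out in the notation of Sutton and Barto, on which this deck is based (a reconstruction; the original formula images are not in the transcript), the reward function takes one of the forms

```latex
r : S \rightarrow \mathbb{R}
\qquad \text{or} \qquad
r : S \times A \rightarrow \mathbb{R}.
```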

Goals and Rewards: The agent's goal is to maximize the total amount of reward it receives. This means maximizing not just immediate reward, but cumulative reward in the long run.

Goals and Rewards (figure omitted): an example task in which the labels "Reward = 0" and "Reward = 1" mark the two possible rewards. Can you design another reward function?

Goals and Rewards: for a game-playing task, the reward per state can be
- Win: +1
- Loss: -1
- Draw or non-terminal state: 0

Goals and Rewards: The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved.

Reinforcement Learning: Markov Decision Processes. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.

Definition: An MDP consists of
- a set of states S and a set of actions A,
- a transition distribution, and
- expected next rewards.
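
In the standard notation (a reconstruction of the missing formula images), the transition distribution and the expected next rewards are

```latex
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\; a_t = a \,\},
\qquad
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\; a_t = a,\; s_{t+1} = s' \,\}.
```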

Decision Making: Many stochastic processes can be modeled within the MDP framework. The process is controlled by choosing actions in each state so as to attain the maximum long-term reward. How do we find the optimal policy?

Example (Recycling Robot): the transition graph has two states, High and Low (battery charge level), with actions search, wait, and, in the Low state, recharge; the figure with the transition probabilities and rewards is omitted.

Reinforcement Learning: Value Functions. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.

Value Functions: estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Value functions are defined with respect to particular policies.

Returns
- Episodic tasks: finite-horizon tasks, indefinite-horizon tasks
- Continuing tasks: infinite-horizon tasks

Finite-Horizon Tasks: the return at time t and the expected return at time t. Example: the k-armed bandit problem, where only the immediate reward matters.
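
A reconstruction of the standard definitions (the formula images are not in the transcript): for a fixed horizon of K steps,

```latex
R_t = r_{t+1} + r_{t+2} + \cdots + r_{t+K},
\qquad
E\{R_t \mid s_t = s\} = E\{\, r_{t+1} + \cdots + r_{t+K} \mid s_t = s \,\}.
```

With K = 1 only the immediate reward matters, which is exactly the k-armed bandit setting.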

Indefinite-Horizon Tasks: the return at time t and the expected return at time t. Example: playing chess, where each episode ends but its length is not fixed in advance.

Infinite-Horizon Tasks: the return at time t and the expected return at time t. Example: an ongoing control task.
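
For an infinite horizon the undiscounted sum may diverge, so future rewards are discounted. The standard definition (reconstructed):

```latex
R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad 0 \le \gamma < 1.
```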

Unified Notation: episodic tasks can be reformulated so that the terminal state is an absorbing state that transitions only to itself with reward 0 (e.g., after s_0, s_1, s_2 with rewards r_1, r_2, r_3, all later rewards r_4, r_5, ... are 0). The discounted return at time t, with discounting factor γ, then covers both cases: γ < 1 for continuing tasks, while γ = 1 is allowed for episodic tasks.

Policies: A policy, π, is a mapping from states s ∈ S and actions a ∈ A(s) to the probability π(s, a) of taking action a when in state s.

Value Functions under a Policy: the state-value function and the action-value function.
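
In the standard notation (a reconstruction of the missing formulas):

```latex
V^{\pi}(s) = E_{\pi}\{\, R_t \mid s_t = s \,\}
           = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\Big|\; s_t = s \Big\},
\qquad
Q^{\pi}(s,a) = E_{\pi}\{\, R_t \mid s_t = s,\; a_t = a \,\}.
```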

Bellman Equation for a Policy π: state-value function.
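
The equation itself, reconstructed in the standard form:

```latex
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
             \Big[ R^{a}_{ss'} + \gamma\, V^{\pi}(s') \Big].
```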

Backup Diagram for the state-value function: the root node is a state s; branches pass through each action a and reward r to the successor states (figure omitted).

Bellman Equation for a Policy π: action-value function.
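
Reconstructed in the standard form:

```latex
Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'}
               \Big[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a') \Big].
```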

Backup Diagram for the action-value function: the root node is a state-action pair (s, a); branches pass through successor states s' to the next actions a' (figure omitted).

Bellman Equation for a Policy π: This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. It can be regarded as a consistency condition between the values of states and their successor states, and the rewards.
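
To make the linearity concrete, here is a minimal Python sketch that computes V^π for a small, fully known tabular MDP by solving the linear system directly. The array layout P[a, s, s'], R[a, s, s'], pi[s, a] is an assumption for illustration, not the slides' notation.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Solve V = R_pi + gamma * P_pi V for a fixed policy pi.

    P[a, s, s']: transition probabilities, R[a, s, s']: expected rewards,
    pi[s, a]: probability of taking action a in state s.
    """
    n_actions, n_states, _ = P.shape
    P_pi = np.einsum('sa,ast->st', pi, P)        # state-to-state transitions under pi
    R_pi = np.einsum('sa,ast,ast->s', pi, P, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```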

Example (Grid World): the state is the agent's position; the actions are north, south, east, and west, and they are deterministic. An action that would take the agent off the grid leaves its position unchanged and gives reward -1; all other actions give reward 0, except the actions that move the agent out of the special states A and B, as shown in the figure. The figure also shows the state-value function for the equiprobable random policy with γ = 0.9.

Optimal Policy (π*): the optimal state-value function and the optimal action-value function. What is the relation between them?
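
Reconstructed in standard notation, the two are related as follows:

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s),
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a),
\qquad
V^{*}(s) = \max_{a} Q^{*}(s,a),
\qquad
Q^{*}(s,a) = E\{\, r_{t+1} + \gamma\, V^{*}(s_{t+1}) \mid s_t = s,\; a_t = a \,\}.
```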

Optimal Value Functions: Bellman optimality equations.
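
Reconstructed in the standard form (the formula images are missing):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \Big[ R^{a}_{ss'} + \gamma\, V^{*}(s') \Big],
\qquad
Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \Big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \Big].
```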

Optimal Value Functions: Bellman optimality equations. How do we apply the value function to determine the action to take in each state? How do we compute it? How do we store it?
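
One standard answer to "how to compute" is value iteration: repeatedly apply the Bellman optimality backup until the values converge, then act greedily. A sketch under the same assumed tabular layout as the policy-evaluation example above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration via the Bellman optimality backup.

    P[a, s, s']: transition probabilities, R[a, s, s']: expected rewards.
    Returns the optimal state values and a greedy (optimal) policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=0)  # greedy action in each state
    return V, policy
```

In this tabular setting the answers to "how to store" and "how to apply" are simple: store one value per state (or per state-action pair), and let the greedy argmax pick the action.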

Example (Grid World): the figure compares V^π under the equiprobable random policy with the optimal value function V* and an optimal policy π* (figure omitted).

Finding the Optimal Solution via Bellman: Finding an optimal policy by solving the Bellman optimality equation requires
- accurate knowledge of the environment dynamics,
- enough space and time to do the computation, and
- the Markov property.

Optimality and Approximation: How much space and time do we need? The computation is polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge (e.g., backgammon has about 10^20 states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.