Reinforcement Learning. Speaker: 虞台文, Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.


Content
– Introduction
– Main Elements
– Markov Decision Process (MDP)
– Value Functions

Reinforcement Learning: Introduction

Reinforcement Learning
– Learning from interaction (with the environment)
– Goal-directed learning
– Learning what to do and its effect
– Trial-and-error search and delayed reward: the two most important distinguishing features of reinforcement learning

Exploration and Exploitation
– The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
– Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
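One common way to balance the two is epsilon-greedy action selection, sketched here on a k-armed bandit. This is only an illustration: the function names, the Gaussian reward model, and the parameter values are assumptions, not part of the slides.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(estimates))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Sample-average value estimates on a bandit with Gaussian rewards (assumed model)."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # current estimate of each arm's value
    counts = [0] * k        # how often each arm was pulled
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        # Incremental update of the running average for arm a.
        estimates[a] += (reward - estimates[a]) / counts[a]
        total += reward
    return estimates, counts, total / steps

# Hypothetical three-armed bandit.
estimates, counts, average_reward = run_bandit([0.2, 0.5, 1.5])
```

With epsilon = 0 the agent only exploits and can lock onto a poor arm; with epsilon = 1 it only explores and never uses what it has learned.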

Supervised Learning
– Inputs → Supervised Learning System → Outputs
– Training info = desired (target) outputs
– Error = (target output − actual output)

Reinforcement Learning
– Inputs → RL System → Outputs ("actions")
– Training info = evaluations ("rewards" / "penalties")
– Objective: get as much reward as possible

Reinforcement Learning: Main Elements

Main Elements
[Figure: the agent sends an action to the environment; the environment returns a state and a reward. The agent acts to maximize value.]

Example (Bioreactor)
– State: current temperature and other sensory readings, composition, target chemical
– Actions: how much heating and stirring, what ingredients to add
– Reward: moment-by-moment production of the desired chemical

Example (Pick-and-Place Robot)
– State: current positions and velocities of the joints
– Actions: voltages to apply to the motors
– Reward: reaching the end position successfully, speed, smoothness of trajectory

Example (Recycling Robot)
– State: charge level of the battery
– Actions: look for cans, wait for a can, go recharge
– Reward: positive for finding cans, negative for running out of battery

Main Elements
– Environment
  – Its state is perceivable
– Reinforcement function
  – Generates the reward
  – A function of states (or state-action pairs)
– Value function
  – The potential to reach the goal (with maximum total reward)
  – Determines the policy
  – A function of state

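The Agent-Environment Interface
– At time t the agent observes state $s_t$ and reward $r_t$ and takes action $a_t$; the environment responds with reward $r_{t+1}$ and next state $s_{t+1}$.
– Trajectory: $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
– Frequently, we model the environment as a Markov Decision Process (MDP).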
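The interaction loop can be sketched in a few lines. The `CorridorEnv` class, its `reset`/`step` methods, and the reward of 1 at the goal are hypothetical and invented for this illustration; they do not come from the slides or from any particular library.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 or +1); return (next_state, reward, done)."""
        # Moves past either end of the corridor are clipped.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0   # assumed reward: 1 at the goal, else 0
        return self.state, reward, done

def run_episode(env, policy, max_steps=1000):
    """Record the trajectory as (s_t, a_t, r_{t+1}) triples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent chooses a_t from s_t
        next_state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

rng = random.Random(0)
episode = run_episode(CorridorEnv(), lambda s: rng.choice([-1, +1]))
```

The agent sees only states and rewards; the transition rule stays hidden inside `step`.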

Reward Function
– A reward function defines the goal in a reinforcement learning problem.
– Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state:
  $r: S \to \mathbb{R}$ or $r: S \times A \to \mathbb{R}$
  where S is a set of states and A is a set of actions.

Goals and Rewards
– The agent's goal is to maximize the total amount of reward it receives.
– This means maximizing not just immediate reward, but cumulative reward in the long run.

Goals and Rewards
[Figure: an example task with states marked Reward = 0 and Reward = 1.]
Can you design another reward function?

Goals and Rewards
State                    Reward
Win                      +1
Loss                     −1
Draw or non-terminal     0

Goals and Rewards
The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved.
[Figure: an example reward assignment with one reward of 0 and the remaining rewards −1.]

Reinforcement Learning: Markov Decision Processes

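Definition
An MDP consists of:
– A set of states S and a set of actions A
– A transition distribution: $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
– Expected next rewards: $R^a_{ss'} = E[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$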

Decision Making
– Many stochastic processes can be modeled within the MDP framework.
– The process is controlled by choosing an action in each state, trying to attain the maximum long-term reward.
– How do we find the optimal policy?

Example (Recycling Robot)
[Figure: transition graph with battery states High and Low and actions search, wait, and recharge.]
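A finite MDP like this one can be written down as a table from (state, action) pairs to possible outcomes. The sketch below assumes the transition structure of the textbook version of this example (Sutton and Barto). The numeric values of `ALPHA`, `BETA`, and the rewards are made up for illustration.

```python
# Assumed parameters (not given in the slides).
ALPHA = 0.9      # P(battery stays high | high, search)
BETA = 0.6       # P(battery stays low  | low, search)
R_SEARCH = 2.0   # expected cans found while searching
R_WAIT = 1.0     # expected cans found while waiting
R_RESCUE = -3.0  # penalty when the battery runs out and the robot is rescued

def recycling_robot_mdp():
    """Map (state, action) to a list of (probability, next_state, expected_reward)."""
    return {
        ("high", "search"): [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        ("high", "wait"): [(1.0, "high", R_WAIT)],
        ("low", "search"): [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        ("low", "wait"): [(1.0, "low", R_WAIT)],
        ("low", "recharge"): [(1.0, "high", 0.0)],
    }

def available_actions(mdp, state):
    """Actions A(s) that are defined for a state."""
    return sorted(a for (s, a) in mdp if s == state)

mdp = recycling_robot_mdp()
for (state, action), outcomes in mdp.items():
    # Each row of the transition distribution must sum to one.
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12
```

Storing the dynamics this way keeps the transition distribution and the expected next rewards side by side, which is the form the Bellman equations consume.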

Reinforcement Learning: Value Functions

Value Functions
– $V^\pi(s)$ or $Q^\pi(s, a)$: an estimate of how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
– The notion of "how good" here is defined in terms of future rewards that can be expected or, to be precise, in terms of expected return.
– Value functions are defined with respect to particular policies.

Returns
– Episodic tasks
  – finite-horizon tasks
  – indefinite-horizon tasks
– Continuing tasks
  – infinite-horizon tasks

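Finite-Horizon Tasks
– Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, with a fixed final time step T
– Expected return at time t: $E[R_t]$
– Example: k-armed bandit problem

Indefinite-Horizon Tasks
– Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where the final time step T is not known in advance
– Expected return at time t: $E[R_t]$
– Example: playing chess

Infinite-Horizon Tasks
– Return at time t: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
– Expected return at time t: $E[R_t]$
– Example: control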

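Unified Notation
– Reformulation of episodic tasks: after termination the process stays in an absorbing state that yields zero reward, e.g. $s_0 \to s_1 \to s_2 \to \cdots$ with rewards $r_1, r_2, r_3, r_4 = 0, r_5 = 0, \ldots$
– Discounted return at time t: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
– $\gamma$: discounting factor. $\gamma = 0$ counts only the immediate reward, $\gamma = 1$ gives the undiscounted sum, and $\gamma < 1$ keeps the infinite sum finite for bounded rewards.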
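The discounted return of a recorded reward sequence can be computed with a single backward pass, using the recursion $R_t = r_{t+1} + \gamma R_{t+1}$. A minimal sketch; the function name and the sample rewards are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards = [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    # Work backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 2.0, 3.0]
myopic = discounted_return(rewards, 0.0)        # only the first reward counts
undiscounted = discounted_return(rewards, 1.0)  # plain sum of rewards
# Padding an episode with zero rewards (the absorbing state) leaves the return unchanged.
padded = discounted_return(rewards + [0.0, 0.0], 0.9)
```

This is why episodic and continuing tasks can share one formula: an episode is treated as an infinite sequence whose tail is all zeros.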

Policies
A policy, $\pi$, is a mapping from states, $s \in S$, and actions, $a \in A(s)$, to the probability $\pi(s, a)$ of taking action a when in state s.

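Value Functions under a Policy
– State-value function: $V^\pi(s) = E_\pi[R_t \mid s_t = s]$
– Action-value function: $Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a]$
where $R_t$ is the return at time t.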

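Bellman Equation for a Policy π
– State-value function: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
where $P^a_{ss'}$ is the probability of moving from s to s' under action a, and $R^a_{ss'}$ is the expected reward on that transition.

Backup Diagram
– State-value function: [Figure: tree rooted at state s, branching over actions a, then over rewards r and successor states s'.]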

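Bellman Equation for a Policy π
– Action-value function: $Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a') \right]$
where $P^a_{ss'}$ is the probability of moving from s to s' under action a, and $R^a_{ss'}$ is the expected reward on that transition.

Backup Diagram
– Action-value function: [Figure: tree rooted at the pair (s, a), branching over successor states s', then over next actions a'.]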

Bellman Equation for a Policy π
– This is a set of equations (in fact, linear), one for each state.
– The value function for π is its unique solution.
– It can be regarded as a consistency condition between the values of states, the values of their successor states, and rewards.
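Because the equations are linear, a small problem can be solved directly. The sketch below assumes the policy has already been folded into the dynamics, giving a state-to-state matrix `p_pi` and an expected one-step reward vector `r_pi`, and then solves $(I - \gamma P_\pi) V = R_\pi$ by Gauss-Jordan elimination. The two-state numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def evaluate_policy(p_pi, r_pi, gamma):
    """Return V with V[s] = r_pi[s] + gamma * sum_t p_pi[s][t] * V[t]."""
    n = len(r_pi)
    a = [[(1.0 if s == t else 0.0) - gamma * p_pi[s][t] for t in range(n)]
         for s in range(n)]
    return solve_linear(a, r_pi)

# Hypothetical two-state example.
P_PI = [[0.5, 0.5], [0.2, 0.8]]
R_PI = [1.0, 0.0]
values = evaluate_policy(P_PI, R_PI, 0.9)

# Consistency check: every state satisfies its own Bellman equation.
for s in range(2):
    backup = R_PI[s] + 0.9 * sum(P_PI[s][t] * values[t] for t in range(2))
    assert abs(values[s] - backup) < 1e-9
```

A direct solve costs roughly cubic time in the number of states, so it is practical only for small state spaces; iterative methods are the usual alternative.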

Example (Grid World)
– State: position
– Actions: north, south, east, west; deterministic.
– Reward: an action that would take the agent off the grid leaves its position unchanged and gives reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B (marked in the original figure).
– [Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$.]
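The random-policy values can be reproduced by iterative policy evaluation. The figure with the grid is not preserved here, so the sketch assumes the layout of the textbook version of this example (Sutton and Barto): a 5×5 grid where every action in A gives +10 and moves the agent to A', and every action in B gives +5 and moves it to B'. The coordinates below are that assumption.

```python
SIZE, GAMMA = 5, 0.9
# Assumed layout (row, column), following the textbook gridworld.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Deterministic dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0   # assumed reward for leaving A
    if state == B:
        return B_PRIME, 5.0    # assumed reward for leaving B
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0         # off-grid move: stay put, reward -1

def evaluate_random_policy(tol=1e-8):
    """Sweep the Bellman equation for the equiprobable policy until values settle."""
    v = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in v:
            outcomes = [step(s, a) for a in MOVES]
            new = sum(0.25 * (reward + GAMMA * v[s2]) for s2, reward in outcomes)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

values = evaluate_random_policy()
for r in range(SIZE):
    print(" ".join(f"{values[(r, c)]:5.1f}" for c in range(SIZE)))
```

Under these assumptions the value of A comes out near 8.8, below its immediate reward of 10, because A' sits on the bottom edge where the random policy often bumps into the wall. The value of B comes out near 5.3, slightly above its immediate reward of 5.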

Optimal Policy (π*)
– Optimal state-value function: $V^*(s) = \max_\pi V^\pi(s)$
– Optimal action-value function: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$
– What is the relation between them?

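Optimal Value Functions
Bellman Optimality Equations:
– $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
– $Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$
Questions:
– How do we apply the value function to determine the action to take in each state?
– How do we compute it?
– How do we store it?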

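Example (Grid World)
[Figure: the optimal value function V* and an optimal policy π* for the grid world, contrasting the random policy with the optimal policy.]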
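One way to compute V* and read off a policy is value iteration followed by a greedy one-step lookahead. This is a sketch of that standard technique, not necessarily the method used to produce the figure. The 5×5 layout, the positions of A, A', B, B', and the +10 and +5 rewards are assumptions taken from the textbook version of the grid world.

```python
SIZE, GAMMA = 5, 0.9
# Assumed layout (row, column) and rewards, following the textbook gridworld.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Deterministic dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0

def lookahead(v, state, action):
    """One-step backup: reward plus discounted value of the successor."""
    next_state, reward = step(state, action)
    return reward + GAMMA * v[next_state]

def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    v = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in v:
            new = max(lookahead(v, s, a) for a in MOVES)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def greedy_action(v, state):
    """Pick an action that maximizes the one-step lookahead (ties broken arbitrarily)."""
    return max(MOVES, key=lambda a: lookahead(v, state, a))

v_star = value_iteration()
policy = {s: greedy_action(v_star, s) for s in v_star}
```

Here the value function is stored as a table with one entry per state, and acting optimally needs only a one-step lookahead. Both rely on a known model and a small state space.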

Finding Optimal Solution via Bellman
Finding an optimal policy by solving the Bellman Optimality Equation requires:
– accurate knowledge of the environment dynamics;
– enough space and time to do the computation;
– the Markov property.

Optimality and Approximation
– How much space and time do we need?
  – Polynomial in the number of states (via dynamic programming methods)
  – BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
– We usually have to settle for approximations.
– Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
