Published by Vanessa Hancock. Modified over 8 years ago.
2
Reinforcement Learning
Presenter: 虞台文
3
Content
– Introduction
– Main Elements
– Markov Decision Process (MDP)
– Value Functions
4
Reinforcement Learning Introduction
5
Reinforcement Learning
– Learning from interaction (with environment)
– Goal-directed learning
– Learning what to do and its effect
– Trial-and-error search and delayed reward: the two most important distinguishing features of reinforcement learning
6
Exploration and Exploitation
The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
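One common way to balance the two is epsilon-greedy action selection. The slides do not name this method; the sketch below is an illustration, and the function name `epsilon_greedy` and the value estimates are hypothetical.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(action_values))
    # Exploit: choose the action with the highest current estimate
    # (ties go to the lowest index).
    return max(range(len(action_values)), key=lambda a: action_values[a])

estimates = [0.2, 0.8, 0.5]                              # hypothetical value estimates
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)   # never explores
random_choice = epsilon_greedy(estimates, epsilon=1.0)   # always explores
```

With `epsilon=0.0` the agent only exploits and with `epsilon=1.0` it only explores; intermediate values trade one off against the other.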
7
Supervised Learning
[Diagram: Inputs → Supervised Learning System → Outputs]
Training info = desired (target) outputs
Error = (target output – actual output)
8
Reinforcement Learning
[Diagram: Inputs → RL System → Outputs ("actions")]
Training info = evaluations ("rewards" / "penalties")
Objective: get as much reward as possible
9
Reinforcement Learning Main Elements
10
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]
The agent acts to maximize value.
11
Main Elements
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]
The agent acts to maximize value:
– Immediate reward (short term)
– Total reward (long term)
12
Example (Bioreactor)
State
– current temperature and other sensory readings, composition, target chemical
Actions
– how much heating, stirring are required?
– what ingredients need to be added?
Reward
– moment-by-moment production of desired chemical
13
Example (Pick-and-Place Robot)
State
– current positions and velocities of joints
Actions
– voltages to apply to motors
Reward
– reach end-position successfully, speed, smoothness of trajectory
14
Example (Recycling Robot)
State
– charge level of battery
Actions
– look for cans, wait for can, go recharge
Reward
– positive for finding cans, negative for running out of battery
15
Main Elements
Environment
– Its state is perceivable
Reinforcement Function
– To generate reward
– A function of states (or state/action pairs)
Value Function
– The potential to reach the goal (with maximum total reward)
– To determine the policy
– A function of state
16
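The Agent-Environment Interface
[Diagram: at step $t$ the agent observes state $s_t$ and reward $r_t$ and sends action $a_t$ to the environment; the environment responds with reward $r_{t+1}$ and state $s_{t+1}$.]
Trajectory: $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
Frequently, we model the environment as a Markov Decision Process (MDP).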
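The interaction loop can be sketched in a few lines. This is a minimal illustration only: `TwoStateEnv`, its reward rule, and the fixed policy are hypothetical and do not come from the slides.

```python
class TwoStateEnv:
    """Hypothetical toy environment: states 0 and 1, actions 0 (stay) and 1 (switch)."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Switching flips the state; being in state 1 afterwards pays reward 1, else 0.
        if action == 1:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

def run(env, policy, steps):
    """Collect tuples (s_t, a_t, r_{t+1}, s_{t+1}) for a fixed number of steps."""
    trajectory = []
    s = env.state
    for _ in range(steps):
        a = policy(s)               # agent chooses a_t from s_t
        s_next, r = env.step(a)     # environment returns s_{t+1} and r_{t+1}
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

# Policy: switch when in state 0, stay when in state 1.
trajectory = run(TwoStateEnv(), policy=lambda s: 1 if s == 0 else 0, steps=3)
```

Each tuple lines up with one step of the trajectory: the reward and next state carry the index $t+1$ because they are produced in response to $a_t$.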
17
Reward Function
A reward function is closely related to the goal in reinforcement learning.
– It maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state.
– $r: S \to \mathbb{R}$ or $r: S \times A \to \mathbb{R}$, where $S$ is a set of states and $A$ is a set of actions.
18
Goals and Rewards The agent's goal is to maximize the total amount of reward it receives. This means maximizing not just immediate reward, but cumulative reward in the long run.
19
Goals and Rewards Reward = 0 Reward = 1 Can you design another reward function?
20
Goals and Rewards
State: reward
– Win: +1
– Loss: -1
– Draw or non-terminal: 0
21
Goals and Rewards
The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved.
[Figure: example with rewards of 0 and -1]
22
Reinforcement Learning Markov Decision Processes
23
Definition
An MDP consists of:
– A set of states S, and a set of actions A
– A transition distribution
– Expected next rewards
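One standard way to write the last two items is shown below. The symbols $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$ are an assumed notation (common in textbook treatments), combined with the $s_t$, $a_t$, $r_{t+1}$ indexing of the agent-environment interface.

```latex
\begin{align*}
\mathcal{P}^{a}_{ss'} &= \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\} \\
\mathcal{R}^{a}_{ss'} &= E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
\end{align*}
```

For every state $s$ and action $a$, the transition probabilities sum to one over the successor states $s'$.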
24
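Example (Recycling Robot)
[Figure: transition graph with battery states High and Low; actions wait and search in High; actions wait, search, and recharge in Low]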
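The recycling robot can be written down as a small table-driven MDP. The structure below follows the usual textbook form of this example; the numeric values of `ALPHA`, `BETA`, `R_SEARCH`, and `R_WAIT` are hypothetical placeholders, and the rescue penalty of -3 is the value commonly used for this example rather than one read from the slides.

```python
# Hypothetical parameter values; the example is normally stated symbolically.
ALPHA, BETA = 0.9, 0.6        # chance the battery level is unchanged after searching
R_SEARCH, R_WAIT = 2.0, 1.0   # expected reward while searching / waiting
R_RESCUE = -3.0               # penalty when the battery runs flat and the robot is rescued

# mdp[state][action] -> list of (probability, next_state, expected_reward)
mdp = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search":   [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait":     [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

def check_distributions(mdp, tol=1e-9):
    """Return True if every (state, action) pair has probabilities summing to 1."""
    return all(
        abs(sum(p for p, _, _ in outcomes) - 1.0) < tol
        for actions in mdp.values()
        for outcomes in actions.values()
    )

valid = check_distributions(mdp)
```

Storing transitions as `(probability, next_state, expected_reward)` triples keeps the transition distribution and the expected next rewards in one place.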
26
Decision Making
Many stochastic processes can be modeled within the MDP framework.
The process is controlled by choosing an action in each state, with the aim of attaining the maximum long-term reward.
How to find the optimal policy?
27
Reinforcement Learning Value Functions
28
Value functions estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
Value functions are defined with respect to particular policies.
29
Returns
Episodic Tasks
– finite-horizon tasks: terminate after a fixed number of time steps
– indefinite-horizon tasks: can last arbitrarily long but eventually terminate
Continual Tasks
– infinite-horizon tasks
30
Finite-Horizon Tasks
Return at time t; expected return at time t
Example: k-armed bandit problem
31
Indefinite-Horizon Tasks
Return at time t; expected return at time t
Example: playing chess
32
Continual Tasks
Return at time t; expected return at time t
Example: control
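One plausible way to write the returns for the three task types is sketched below. It assumes the $r_{t+1}$ indexing of the agent-environment interface, a final time step $T$ for episodic tasks, and a discounting factor $\gamma$ for continual tasks; the exact form used in the lecture may differ.

```latex
\begin{align*}
\text{finite horizon (fixed } T\text{):} \quad & R_t = r_{t+1} + r_{t+2} + \cdots + r_{T} \\
\text{indefinite horizon (random } T\text{):} \quad & R_t = r_{t+1} + r_{t+2} + \cdots + r_{T} \\
\text{continual:} \quad & R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1 \\
\text{expected return:} \quad & E\{R_t\}
\end{align*}
```

In the continual case the discount keeps the infinite sum finite whenever the rewards are bounded.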
33
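Unified Notation
Reformulation of episodic tasks: states $s_0, s_1, s_2, \ldots$ with rewards $r_1, r_2, r_3$, after which $r_4 = 0, r_5 = 0, \ldots$
Discounted return at time t: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
$\gamma$: discounting factor, with the cases $\gamma = 0$, $\gamma = 1$, and $\gamma < 1$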
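A short sketch of the discounted return, assuming the reward list holds the rewards received after time t in order. The function name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k], where rewards[k] stands for r_{t+k+1}."""
    g = 0.0
    # Accumulate backwards using G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, gamma=0.0)         # only the first reward counts
undiscounted = discounted_return(rewards, gamma=1.0)   # plain sum of rewards
# Padding a finished episode with zero rewards leaves the return unchanged.
padded = discounted_return(rewards + [0.0, 0.0], gamma=0.5)
unpadded = discounted_return(rewards, gamma=0.5)
```

The padding check illustrates why an episodic task can be treated as a continual one: appending zero rewards after termination does not alter the return.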
34
Policies
A policy, $\pi$, is a mapping from states, $s \in S$, and actions, $a \in A(s)$, to the probability $\pi(s, a)$ of taking action $a$ when in state $s$.
35
Value Functions under a Policy State-Value Function Action-Value Function
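The two value functions can be written as expected discounted returns. The symbols $V^{\pi}$ and $Q^{\pi}$ and the expectation $E_{\pi}$ (taken while following policy $\pi$) are assumed notation, chosen to match the discounted return $R_t = \sum_{k} \gamma^{k} r_{t+k+1}$.

```latex
\begin{align*}
V^{\pi}(s) &= E_{\pi}\{ R_t \mid s_t = s \}
            = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\} \\
Q^{\pi}(s, a) &= E_{\pi}\{ R_t \mid s_t = s,\ a_t = a \}
            = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a \Big\}
\end{align*}
```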
36
Bellman Equation for a Policy State-Value Function
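A standard form of this equation is given below. It assumes $\mathcal{P}^{a}_{ss'}$ denotes the probability of moving from $s$ to $s'$ under action $a$, $\mathcal{R}^{a}_{ss'}$ the expected reward on that transition, and $\pi(s, a)$ the probability of taking $a$ in $s$.

```latex
\begin{align*}
V^{\pi}(s) &= E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \} \\
           &= \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{ss'}
              \big[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \big]
\end{align*}
```

The first line splits the return into the next reward plus the discounted return from the successor state; the second averages over the policy and the transition distribution.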
37
Backup Diagram (State-Value Function)
[Diagram: root state $s$, branching over actions $a$, then over rewards $r$ to successor states]
38
Bellman Equation for a Policy Action-Value Function
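A standard form of the action-value version is given below, under the same assumed notation: $\mathcal{P}^{a}_{ss'}$ for transition probabilities, $\mathcal{R}^{a}_{ss'}$ for expected rewards, and $\pi(s', a')$ for the probability of taking $a'$ in $s'$.

```latex
\begin{align*}
Q^{\pi}(s, a) &= \sum_{s'} \mathcal{P}^{a}_{ss'}
   \Big[ \mathcal{R}^{a}_{ss'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a') \Big]
\end{align*}
```

Here the first action is fixed, so the average over the policy appears only at the successor state.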
39
Backup Diagram (Action-Value Function)
[Diagram: root pair $(s, a)$, branching to successor states $s'$, then to next actions $a'$]
40
Bellman Equation for a Policy
This is a set of equations (in fact, linear), one for each state.
– It specifies the consistency condition between values of states and successor states, and rewards.
Its unique solution is the value function for $\pi$.
41
Example (Grid World)
State: position
Actions: north, south, east, west; the resulting state is deterministic.
Reward: if an action would take the agent off the grid, there is no move and the reward is –1. Other actions produce reward 0, except actions that move the agent out of the special states A and B as shown.
[Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$]
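The value function for the random policy can be computed by iterative policy evaluation, as sketched below. The grid size and the special-state details are assumptions taken from the usual textbook version of this example (5×5 grid; every action in A gives +10 and moves to A'; every action in B gives +5 and moves to B'); they are not stated in the slide text.

```python
GAMMA = 0.9
N = 5                                          # assumed 5x5 grid
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # north, south, east, west
# Assumed special states: (row, col) -> (destination, reward).
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}

def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0                         # off-grid move: stay put, reward -1

def evaluate_random_policy(theta=1e-10):
    """Iterative policy evaluation for the equiprobable random policy."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            # Bellman backup: average over the four equally likely actions.
            new_v = 0.0
            for a in ACTIONS:
                s_next, reward = step(s, a)
                new_v += 0.25 * (reward + GAMMA * V[s_next])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

V = evaluate_random_policy()
```

Under these assumptions the value of A ends up below its immediate reward of 10, because A' lies on the bottom edge where the random policy often bumps into the wall, while the value of B ends up above 5.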
42
Optimal Policy ($\pi^*$)
– Optimal State-Value Function
– Optimal Action-Value Function
What is the relation between them?
43
Optimal Value Functions Bellman Optimality Equations:
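A standard statement of the optimal value functions and the Bellman optimality equations is given below. It assumes $\mathcal{P}^{a}_{ss'}$ for transition probabilities, $\mathcal{R}^{a}_{ss'}$ for expected rewards, and $\gamma$ for the discounting factor.

```latex
\begin{align*}
V^{*}(s) &= \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \\
V^{*}(s) &= \max_{a} Q^{*}(s, a)
          = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}
            \big[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \big] \\
Q^{*}(s, a) &= \sum_{s'} \mathcal{P}^{a}_{ss'}
            \big[ \mathcal{R}^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \big]
\end{align*}
```

Unlike the equations for a fixed policy, these are nonlinear because of the max. An action that attains the max in a state is an optimal action for that state.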
44
Optimal Value Functions
Bellman Optimality Equations:
– How to apply the value function to determine the action to be taken in each state?
– How to compute?
– How to store?
45
Example (Grid World)
[Figure: state values under the random policy, compared with $V^*$ and $\pi^*$ for the optimal policy]
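One way to obtain $V^*$ and a greedy policy for the grid world is value iteration, sketched below. Value iteration is a swapped-in method for illustration, and the grid details are assumptions from the usual textbook version of the example (5×5 grid, A gives +10 and jumps to A', B gives +5 and jumps to B', $\gamma = 0.9$).

```python
GAMMA = 0.9
N = 5                                          # assumed 5x5 grid
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
# Assumed special states: (row, col) -> (destination, reward).
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}

def step(state, move):
    """Deterministic transition: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0                         # off-grid move: stay put, reward -1

def backup(V, state, move):
    """One-step lookahead value: reward plus discounted successor value."""
    s_next, reward = step(state, move)
    return reward + GAMMA * V[s_next]

def value_iteration(theta=1e-10):
    """Repeatedly apply the Bellman optimality backup until values settle."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            best = max(backup(V, s, m) for m in ACTIONS.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

def greedy_actions(V, state, tol=1e-6):
    """All actions whose one-step lookahead value is (nearly) maximal."""
    values = {name: backup(V, state, m) for name, m in ACTIONS.items()}
    best = max(values.values())
    return [name for name, v in values.items() if best - v < tol]

V_star = value_iteration()
policy_at_A_prime = greedy_actions(V_star, (4, 1))
```

Under these assumptions the best behaviour from A is the five-step cycle A → A' → back up to A, so $V^*(A) = 10 / (1 - \gamma^5)$, and the greedy action at A' is north. Extracting the policy needs only a one-step lookahead on $V^*$.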
46
Finding Optimal Solution via Bellman Finding an optimal policy by solving the Bellman Optimality Equation requires the following: – accurate knowledge of environment dynamics; – enough space and time for computation; – the Markov Property.
47
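Example (Recycling Robot)
[Figure: transition graph with battery states High and Low; actions wait and search in High; actions wait, search, and recharge in Low]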
49
Optimality and Approximation
How much space and time do we need?
– polynomial in the number of states (via dynamic programming methods)
– BUT, the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
We usually have to settle for approximations.
– Many RL methods can be understood as approximately solving the Bellman Optimality Equation.