Reinforcement Learning Lecturer: 虞台文
Content: Introduction; Main Elements; Markov Decision Process (MDP); Value Functions
Reinforcement Learning Introduction
Reinforcement Learning Learning from interaction (with environment) Goal-directed learning Learning what to do and its effect Trial-and-error search and delayed reward – The two most important distinguishing features of reinforcement learning
Exploration and Exploitation The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
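One common way to balance the two is epsilon-greedy action selection, sketched below on a k-armed bandit. This is an illustrative sketch, not a method taken from the slides: the function names `epsilon_greedy` and `run_bandit`, the Gaussian reward model, and the parameter values are all assumptions.

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # explore: uniform random action
    best = max(q_estimates)
    # exploit: break ties among the best-looking actions at random
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a bandit whose arm a pays Gaussian reward with mean true_means[a] (assumed model)."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)   # estimated value of each arm
    n = [0] * len(true_means)     # number of pulls of each arm
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental sample average
        total += r
    return q, n, total
```

With `epsilon = 0` the agent only exploits and can lock onto a poor arm; with `epsilon = 1` it only explores and never uses what it has learned.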
Supervised Learning System Inputs Outputs Training Info = desired (target) outputs Error = (target output – actual output)
Reinforcement Learning RL System Inputs Outputs (“actions”) Training Info = evaluations (“rewards” / “penalties”) Objective: get as much reward as possible
Reinforcement Learning Main Elements
Main Elements [Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a state and a reward.] The agent acts to maximize value: immediate reward (short term) versus total reward (long term).
Example (Bioreactor) State – current temperature and other sensory readings, composition, target chemical Actions – how much heating, stirring are required? – what ingredients need to be added? Reward – moment-by-moment production of desired chemical
Example (Pick-and-Place Robot) State – current positions and velocities of joints Actions – voltages to apply to motors Reward – reach end-position successfully, speed, smoothness of trajectory
Example (Recycling Robot) State – charge level of battery Actions – look for cans, wait for can, go recharge Reward – positive for finding cans, negative for running out of battery
Main Elements Environment – Its state is perceivable Reinforcement Function – To generate reward – A function of states (or state/action pairs) Value Function – The potential to reach the goal (with maximum total reward) – To determine the policy – A function of state
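The Agent-Environment Interface [Figure: at time $t$ the agent receives state $s_t$ and reward $r_t$ from the environment and emits action $a_t$; the environment responds with $s_{t+1}$ and $r_{t+1}$.] Trajectory: states $s_t, s_{t+1}, s_{t+2}, s_{t+3}, \dots$; rewards $r_t, r_{t+1}, r_{t+2}, r_{t+3}, \dots$; actions $a_t, a_{t+1}, a_{t+2}, a_{t+3}, \dots$ Frequently, we model the environment as a Markov Decision Process (MDP).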
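The interaction loop can be sketched in a few lines. The environment below (`CoinFlipEnv`) and its `reset`/`step` interface are hypothetical, chosen only to make the loop runnable; they do not come from the slides.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the state is 0 or 1; an action equal to the state earns reward 1."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)  # next state does not depend on the action here
        return self.state, reward

def run(env, policy, steps):
    """Agent-environment loop: observe s_t, choose a_t, receive r_{t+1} and s_{t+1}."""
    s = env.reset()
    trajectory = []
    for _ in range(steps):
        a = policy(s)
        s_next, r = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```

For example, `run(CoinFlipEnv(), lambda s: s, 5)` returns five `(state, action, reward)` triples, each with reward 1.0, because the policy always matches the state.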
Reward Function A reward function is closely related to the goal in reinforcement learning. – It maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state: $r: S \to \mathbb{R}$ or $r: S \times A \to \mathbb{R}$, where $S$ is a set of states and $A$ is a set of actions.
Goals and Rewards The agent's goal is to maximize the total amount of reward it receives. This means maximizing not just immediate reward, but cumulative reward in the long run.
Goals and Rewards Reward = 0 Reward = 1 Can you design another reward function?
Goals and Rewards
State                   Reward
Win                     +1
Loss                    -1
Draw or non-terminal     0
Goals and Rewards The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved. [Figure: example with rewards 0, -1, -1, -1, -1.]
Reinforcement Learning Markov Decision Processes
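Definition An MDP consists of: – A set of states $S$, and a set of actions $A$ – A transition distribution $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ – Expected next rewards $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$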
Example (Recycling Robot) [Figure: transition graph with two states, High and Low (battery charge level), and actions search, wait, and recharge.]
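An MDP of this kind can be stored as a nested table of transitions. The sketch below follows the usual textbook formulation of the recycling robot; the parameter names (`alpha`, `beta`, `r_search`, `r_wait`, `r_rescue`) and their default values are assumptions, since the figure's numbers did not survive in the text.

```python
def recycling_robot(alpha=0.9, beta=0.4, r_search=2.0, r_wait=1.0, r_rescue=-3.0):
    """Return {state: {action: [(prob, next_state, reward), ...]}}.

    All numeric values are illustrative assumptions, not taken from the slides.
    """
    return {
        "high": {
            "search": [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
            "wait": [(1.0, "high", r_wait)],
        },
        "low": {
            # with probability 1 - beta the battery runs out and the robot is rescued
            "search": [(beta, "low", r_search), (1 - beta, "high", r_rescue)],
            "wait": [(1.0, "low", r_wait)],
            "recharge": [(1.0, "high", 0.0)],
        },
    }

def expected_reward(mdp, s, a):
    """Expected immediate reward for taking action a in state s."""
    return sum(p * r for p, _, r in mdp[s][a])
```

Each list of outcomes plays the role of the transition distribution for one state-action pair, so its probabilities must sum to 1.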
Decision Making Many stochastic processes can be modeled within the MDP framework. The process is controlled by choosing actions in each state trying to attain the maximum long-term reward. How to find the optimal policy?
Reinforcement Learning Value Functions
Value Functions To estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Value functions are defined with respect to particular policies.
Returns Episodic Tasks – finite-horizon tasks: terminate after a fixed number of time steps – indefinite-horizon tasks: can last arbitrarily long but eventually terminate Continual Tasks – infinite-horizon tasks
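Finite-Horizon Tasks Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where the final time step $T$ is fixed. Expected return at time t: $E\{R_t\}$. Example: k-armed bandit problem
Indefinite-Horizon Tasks Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where the final time step $T$ is not known in advance. Expected return at time t: $E\{R_t\}$. Example: playing chess
Continual Tasks Return at time t: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with discounting factor $\gamma < 1$. Expected return at time t: $E\{R_t\}$. Example: control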
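Unified Notation Reformulation of episodic tasks: the terminal state is treated as an absorbing state that yields only zero rewards, e.g. $s_0 \xrightarrow{r_1} s_1 \xrightarrow{r_2} s_2 \xrightarrow{r_3}$ (absorbing state), with $r_4 = 0, r_5 = 0, \dots$ Discounted return at time t: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$ is the discounting factor, $0 \le \gamma \le 1$; $\gamma = 1$ is allowed only for episodic tasks, and continual tasks need $\gamma < 1$.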
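As a small worked example, the discounted return of a finite reward sequence can be computed by folding from the last reward backwards. The function name `discounted_return` and the sample rewards are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for the rewards that follow time t."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # uses R_t = r_{t+1} + gamma * R_{t+1}
    return g
```

For the rewards `[1, 1, 1, 0, 0]` (three rewards of 1 followed by zeros from the absorbing state), `gamma = 1` gives the plain sum 3, `gamma = 0` gives only the immediate reward 1, and `gamma = 0.9` gives 1 + 0.9 + 0.81 = 2.71.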
Policies A policy, $\pi$, is a mapping from states, $s \in S$, and actions, $a \in A(s)$, to the probability $\pi(s, a)$ of taking action $a$ when in state $s$.
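Value Functions under a Policy State-Value Function: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$ Action-Value Function: $Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$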
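Bellman Equation for a Policy State-Value Function: $V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$, where $P^a_{ss'}$ is the transition probability and $R^a_{ss'}$ the expected next reward.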
Backup Diagram State-Value Function [Figure: backup diagram from state $s$ through action $a$ and reward $r$ to the successor state.]
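Bellman Equation for a Policy Action-Value Function: $Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a') \right]$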
Backup Diagram Action-Value Function [Figure: backup diagram from the pair $(s, a)$ through successor state $s'$ to the next action $a'$.]
Bellman Equation for a Policy This is a set of equations (in fact, linear), one for each state. – It specifies the consistency condition between the values of states, the values of their successor states, and the rewards. Its unique solution is the value function for $\pi$.
Example (Grid World) State: position Actions: north, south, east, west; the resulting state is deterministic. Reward: if an action would take the agent off the grid: no move, but reward = -1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown. State-value function for the equiprobable random policy; $\gamma = 0.9$
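The state-value function of the random policy can be computed by sweeping the Bellman equation until the values stop changing (iterative policy evaluation). The sketch below assumes the standard textbook layout, which the text above does not spell out: a 5x5 grid, A at row 0, column 1 (every action gives +10 and jumps to row 4, column 1), and B at row 0, column 3 (every action gives +5 and jumps to row 2, column 3).

```python
N = 5
GAMMA = 0.9
# Assumed special states (textbook layout; not given in the text above):
SPECIAL = {(0, 1): ((4, 1), 10.0),   # A -> A', reward +10
           (0, 3): ((2, 3), 5.0)}    # B -> B', reward +5
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

def step(state, move):
    """Deterministic dynamics: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # would leave the grid: no move, reward -1

def evaluate_random_policy(tol=1e-10):
    """Sweep V(s) <- sum_a 0.25 * [r + gamma * V(s')] until convergence."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            v = sum(0.25 * (rew + GAMMA * V[s2])
                    for s2, rew in (step(s, m) for m in MOVES))
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place update
        if delta < tol:
            return V
```

Under these assumptions the value of A comes out near 8.8, below its immediate reward of 10, because A' sits on the bottom edge where the random policy often bumps into the wall.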
Optimal Policy ($\pi^*$) Optimal State-Value Function: $V^*(s) = \max_\pi V^\pi(s)$ Optimal Action-Value Function: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ What is the relation between them?
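Optimal Value Functions Bellman Optimality Equations: $V^*(s) = \max_{a} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$ and $Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$, with transition probabilities $P^a_{ss'}$ and expected next rewards $R^a_{ss'}$.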
Optimal Value Functions Given the Bellman Optimality Equations: How to apply the value function to determine the action to be taken in each state? How to compute it? How to store it?
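One possible answer to the first and last questions, as a sketch: store the values in a table and act greedily. With a state-value table, a one-step lookahead through the model is needed; with an action-value table, a plain argmax suffices. The function names and the `model[s][a] = [(prob, next_state, reward), ...]` layout are assumptions for illustration.

```python
def greedy_from_v(model, V, s, gamma):
    """Pick argmax_a sum_{s'} P [R + gamma * V(s')]; requires the model."""
    def backup(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
    return max(model[s], key=backup)

def greedy_from_q(Q, s):
    """Pick argmax_a Q(s, a); no model needed, but the table is larger."""
    return max(Q[s], key=Q[s].get)
```

The trade-off is storage versus knowledge: a Q table has one entry per state-action pair, while a V table has one entry per state but can only be used when the transition distribution and rewards are known.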
Example (Grid World) [Figure: optimal state-value function $V^*$ and optimal policy $\pi^*$ for the grid world; random policy versus optimal policy.]
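A sketch of how $V^*$ and a greedy policy for the grid world could be computed with value iteration, that is, by repeatedly applying the Bellman optimality equation as an update. As before, the layout is an assumption based on the standard textbook example (5x5 grid, A at row 0, column 1 with reward +10, B at row 0, column 3 with reward +5), and the function names are hypothetical.

```python
N, GAMMA = 5, 0.9
# Assumed special states (textbook layout; not given in the text above):
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, move):
    """Deterministic dynamics: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0

def value_iteration(tol=1e-10):
    """Sweep V(s) <- max_a [r + gamma * V(s')] until convergence."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            v = max(rew + GAMMA * V[s2]
                    for s2, rew in (step(s, m) for m in MOVES.values()))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def greedy_policy(V):
    """For each state, list every action whose one-step backup is (near) maximal."""
    policy = {}
    for s in V:
        q = {}
        for name, move in MOVES.items():
            s2, rew = step(s, move)
            q[name] = rew + GAMMA * V[s2]
        best = max(q.values())
        policy[s] = [name for name, val in q.items() if best - val < 1e-6]
    return policy
```

Under these assumptions $V^*(A) = 10 / (1 - 0.9^5) \approx 24.4$: the optimal agent collects +10 at A, walks four steps back up from A', and repeats.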
Finding Optimal Solution via Bellman Finding an optimal policy by solving the Bellman Optimality Equation requires the following: – accurate knowledge of environment dynamics; – enough space and time for computation; – the Markov Property.
Example (Recycling Robot) [Figure: transition graph with two states, High and Low (battery charge level), and actions search, wait, and recharge.]
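For this example the Bellman optimality equations reduce to two coupled equations, one per state. The version below follows the usual textbook parameterization, which is an assumption here because the figure's labels did not survive: $\alpha$ and $\beta$ are the probabilities that the battery stays High or Low after searching, $r_{\text{search}}$ and $r_{\text{wait}}$ are the expected rewards for searching and waiting, $-3$ is the penalty for having to be rescued, and recharging yields reward 0.

```latex
\begin{aligned}
V^*(\mathrm{h}) &= \max\Big\{\, r_{\text{search}} + \gamma\big[\alpha V^*(\mathrm{h}) + (1-\alpha) V^*(\mathrm{l})\big],\;\;
                   r_{\text{wait}} + \gamma V^*(\mathrm{h}) \Big\} \\
V^*(\mathrm{l}) &= \max\Big\{\, \beta r_{\text{search}} - 3(1-\beta) + \gamma\big[(1-\beta) V^*(\mathrm{h}) + \beta V^*(\mathrm{l})\big],\;\;
                   r_{\text{wait}} + \gamma V^*(\mathrm{l}),\;\;
                   \gamma V^*(\mathrm{h}) \Big\}
\end{aligned}
```

The terms inside each max correspond to the actions available in that state: search and wait in High; search, wait, and recharge in Low.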
Optimality and Approximation How much space and time do we need? – polynomial in the number of states (via dynamic programming methods) – BUT, the number of states is often huge (e.g., backgammon has about $10^{20}$ states). We usually have to settle for approximations. – Many RL methods can be understood as approximately solving the Bellman Optimality Equation.