1
Reinforcement Learning: Learning Algorithms
Yishay Mansour
Tel-Aviv University
2
Outline
Last week:
– Goal of Reinforcement Learning
– Mathematical Model (MDP)
– Planning: Value Iteration, Policy Iteration
This week: Learning Algorithms
– Model based
– Model free
3
Planning - Basic Problems
Policy evaluation: given a policy π, estimate its return.
Optimal control: find an optimal policy π* (maximizes the return from any start state).
Both problems assume a complete MDP model is given.
4
Planning - Value Functions
V^π(s): the expected value starting at state s and following π.
Q^π(s,a): the expected value starting at state s, performing action a, and then following π.
V^*(s) and Q^*(s,a) are defined using an optimal policy π*:
V^*(s) = max_π V^π(s)
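For concreteness, the standard discounted-return definitions behind these quantities are written out below; the discount factor γ is an assumption here, since this slide does not fix the return criterion (the discounted return appears later in the deck).

V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ \pi\right], \qquad Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]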
5
Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a {Q^π(s,a)} (Bellman Eq.)
The greedy policy with respect to Q(s,a) is π(s) = argmax_a {Q(s,a)}
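As an illustration of the greedy policy, here is a minimal sketch (not from the slides) that reads π(s) = argmax_a Q(s,a) off a tabular Q; the NumPy array layout Q[s, a] is an assumption.

import numpy as np

def greedy_policy(Q):
    # Q: assumed array of shape (num_states, num_actions) holding Q(s, a).
    # The greedy policy picks, in each state, an action with the largest Q-value.
    return np.argmax(Q, axis=1)   # policy[s] = argmax_a Q[s, a]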
6
MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
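A minimal value-iteration sketch, assuming the MDP is given as a transition tensor P[s, a, s'] and a reward matrix R[s, a] with discount gamma; these names, shapes, and the stopping tolerance are assumptions for illustration.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P: assumed shape (S, A, S), P[s, a, s'] = transition probability.
    # R: assumed shape (S, A), R[s, a] = expected immediate reward.
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
        Q = R + gamma * (P @ V)             # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # approximate V* and a greedy policy
        V = V_new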
7
Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.
8
Example - Elevator Control
Planning (alone): given the arrival model, build a schedule.
Learning (alone): estimate the arrival model well.
Real objective: construct a schedule while updating the model.
9
Learning Algorithms
Given only the ability to perform actions (no model), the tasks are:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).
10
Learning - Model Based
Estimate the model from the observations (both transition probabilities and rewards).
Use the estimated model as the true model, and find an optimal policy.
If we have a “good” estimated model, we should get a “good” estimate (of values and of the optimal policy).
11
Learning - Model Based: Off Policy
Let the policy run for a “long” time.
– What is “long”?!
Build an “observed model”:
– Transition probabilities
– Rewards
Use the “observed model” to estimate the value of the policy. (See the sketch below.)
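A minimal sketch of building the “observed model” from logged transitions; the (s, a, r, s') tuple format and the count-based estimator are assumptions, and unvisited state-action pairs are simply left at zero.

import numpy as np

def estimate_model(transitions, num_states, num_actions):
    # transitions: assumed list of (s, a, r, s_next) tuples observed while running a policy.
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sum = np.zeros((num_states, num_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)                          # times each (s, a) was tried
    P_hat = counts / np.maximum(visits, 1)[:, :, None]   # empirical transition probabilities
    R_hat = reward_sum / np.maximum(visits, 1)           # empirical mean rewards
    return P_hat, R_hat   # plug into any planning method, e.g. value iteration above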
12
Learning - Model Based: Sample Size
Sample size (optimal policy):
Naive: O(|S|² |A| log(|S| |A|)) samples (approximates each transition (s,a,s’) well).
Better: O(|S| |A| log(|S| |A|)) samples (sufficient to approximate an optimal policy). [KS, NIPS’98]
13
Learning - Model Based: On Policy
The learner has control over the actions.
– The immediate goal is to learn a model.
As before:
– Build an “observed model”: transition probabilities and rewards.
– Use the “observed model” to estimate the value of the policy.
Accelerating the learning:
– How to reach “new” places?!
14
Learning - Model Based: On Policy
[Figure: the state space split into well-sampled nodes and relatively unknown nodes]
15
Learning: Policy Improvement
Assume that we can perform the following: given a policy π, compute its V^π and Q^π functions.
Then we can run policy improvement:
– π' = Greedy(Q^π)
The process converges if the estimations are accurate.
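A sketch of this improvement loop, under the assumption that some routine evaluate_q(policy) returns an estimate of Q^π; evaluate_q is a hypothetical placeholder (e.g., Monte Carlo or TD evaluation as described in the following slides).

import numpy as np

def policy_iteration_with_estimates(evaluate_q, num_states, max_iters=100):
    # evaluate_q: hypothetical callable mapping a policy (array of actions) to an estimated Q-table.
    policy = np.zeros(num_states, dtype=int)
    for _ in range(max_iters):
        Q = evaluate_q(policy)                  # estimated Q^pi
        new_policy = np.argmax(Q, axis=1)       # greedy improvement step
        if np.array_equal(new_policy, policy):  # stop once the greedy policy is unchanged
            break
        policy = new_policy
    return policy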
16
Learning: Monte Carlo Methods
Assume we can run in episodes:
– Terminating MDP
– Discounted return
Simplest: sample the return of state s:
– Wait to reach state s,
– Compute the return from s,
– Average all the returns.
17
Learning: Monte Carlo Methods
First visit:
– For each state in the episode,
– compute the return from its first occurrence,
– average the returns.
Every visit:
– Might be biased!
Computing an optimal policy:
– Run policy iteration.
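A first-visit Monte Carlo evaluation sketch; the episode format (a list of (state, reward) pairs) and the discount factor gamma are assumptions.

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    # episodes: assumed list of episodes, each a list of (state, reward) pairs.
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        return_from_first_visit = {}
        # Walk backwards so G holds the discounted return from each step onward;
        # repeated states get overwritten until the earliest occurrence wins.
        for s, r in reversed(episode):
            G = r + gamma * G
            return_from_first_visit[s] = G
        for s, G in return_from_first_visit.items():
            returns[s].append(G)
    # Average, per state, the returns observed from its first occurrences.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}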
18
Learning - Model Free, Policy Evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}.
Our “estimation error” is A_t = r_t + γ V(s_{t+1}) - V(s_t).
The update: V_{t+1}(s_t) = V_t(s_t) + α A_t
Note that for the correct value function we have: E[r + γ V(s') - V(s)] = 0
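A direct transcription of this update as code; the step size alpha, discount gamma, and tabular storage of V (dict or array indexed by state) are assumptions.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Estimation error: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    delta = r + gamma * V[s_next] - V[s]
    # Online update: V(s_t) <- V(s_t) + alpha * A_t
    V[s] = V[s] + alpha * delta
    return V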
19
Learning - Model Free, Optimal Control: Off-Policy
Learn the Q function online:
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α_t [r_t + γ V_t(s_{t+1}) - Q_t(s_t,a_t)], where V_t(s) = max_a Q_t(s,a)
OFF-POLICY: Q-Learning
– Any underlying policy selects the actions.
– Assumes every state-action pair is performed infinitely often.
– Learning rate dependency.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
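A one-step Q-learning update sketch matching the formula above; the tabular NumPy layout Q[s, a] and the fixed step size alpha are assumptions.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: bootstrap with the greedy value max_a' Q(s_{t+1}, a'),
    # regardless of which action the behavior policy actually takes next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q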
20
Learning - Model Free, Optimal Control: On-Policy
Learn the Q function online:
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α_t [r_t + γ Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)]
ON-POLICY: SARSA
– a_{t+1} is chosen by the ε-greedy policy for Q_t.
– The policy itself selects the actions!
– Need to balance exploration and exploitation.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
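A matching SARSA sketch, where the bootstrap uses the action actually chosen by an ε-greedy policy in s_{t+1}; the Q[s, a] layout, alpha, and epsilon are assumptions.

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise exploit the current Q estimates.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: bootstrap with Q(s_{t+1}, a_{t+1}) for the action the policy took.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q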
21
Learning - Model Free, Policy Evaluation: TD(λ)
Again: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}.
Our “estimation error” is A_t = r_t + γ V(s_{t+1}) - V(s_t).
Update every state s: V_{t+1}(s) = V_t(s) + α A_t e(s)
Update of the eligibility trace e(s):
– When visiting s, it is incremented by 1: e(s) = e(s) + 1
– For all s, it is decayed every step: e(s) = γλ e(s)
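A single TD(λ) step with accumulating eligibility traces; storing V and e as NumPy arrays, the step size alpha, and accumulating (rather than replacing) traces are assumptions.

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    # V, e: assumed 1-D NumPy arrays indexed by state (values and eligibility traces).
    delta = r + gamma * V[s_next] - V[s]   # estimation error A_t for this transition
    e[s] += 1.0                            # accumulate the trace of the visited state
    V += alpha * delta * e                 # update every state in proportion to its trace
    e *= gamma * lam                       # decay all traces
    return V, e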
22
Summary
Markov Decision Process: mathematical model.
Planning algorithms.
Learning algorithms:
– Model Based
– Monte Carlo
– TD(0)
– Q-Learning
– SARSA
– TD(λ)