Published by Maximilian Hudson. Modified over 9 years ago.
1
QUIZ!!
- T/F: Optimal policies can be defined from an optimal value function. TRUE
- T/F: "Pick the MEU action first, then follow optimal policy" is optimal. TRUE
- T/F: $\pi^*(s) = \max_{s'} V^*(s')$. FALSE
- T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE
- T/F: Value Iteration: The policy cannot converge before the value function. FALSE
- Explain the difference between Policy Iteration and Value Iteration. Why can Policy Iteration be faster than Value Iteration?
2
CS 511a: Artificial Intelligence, Spring 2013
Lecture 11: MDPs / Reinforcement Learning
Feb 25, 2013
Robert Pless. Course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore.
3
Announcements
- Project 2 due Thursday night.
- HW 1 due Friday 5pm*
* Accepted with no penalty or late-day charge until Monday 10am.
4
Policy Iteration
Why do we compute V* or Q*, if all we care about is the best policy $\pi^*$?
5
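Utilities for Fixed Policies
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.
- Define the utility of a state s under a fixed policy $\pi$: $V^\pi(s)$ = expected total discounted rewards (return) starting in s and following $\pi$.
- Recursive relation (one-step look-ahead / Bellman equation):
$$V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$$
[Diagram: one-step look-ahead tree from state s, through the q-state (s, a) with a = $\pi(s)$, to successor s'; edges labeled T(s,a,s') and R(s,a,s').]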
6
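Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: modify the Bellman updates (no max over actions, since the policy fixes the action):
$$V_0^\pi(s) = 0, \qquad V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_i^\pi(s') \right]$$
- Idea two: the solution is a stationary point of this update (it holds with equality). Without the max it is just a linear system in the unknowns $V^\pi(s)$; solve with Matlab (or whatever).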
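As an illustration of "idea one", here is a minimal Python sketch of iterative policy evaluation. The two-state MDP, its state and action names, and the tolerance are invented for this example and are not from the lecture.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 0.0)], "go": [("B", 0.8, 1.0), ("A", 0.2, 0.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.9


def evaluate_policy(policy, T, gamma, tol=1e-10):
    """Repeat the fixed-policy Bellman update until values stop changing."""
    V = {s: 0.0 for s in T}  # V_0(s) = 0
    while True:
        # One sweep: no max over actions, the policy picks the action.
        new_V = {
            s: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][policy[s]])
            for s in T
        }
        if max(abs(new_V[s] - V[s]) for s in T) < tol:
            return new_V
        V = new_V


values = evaluate_policy({"A": "go", "B": "stay"}, T, GAMMA)
print(values)
```

For this toy policy the fixed point can be checked by hand: V(B) = 2 / (1 - 0.9) = 20, and V(A) solves V(A) = 0.8 (1 + 0.9 * 20) + 0.2 * 0.9 V(A).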
7
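Policy Iteration
- Policy evaluation: with the current policy $\pi_k$ fixed, find values with simplified Bellman updates:
$$V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s') \right]$$
  - Iterate until values converge
- Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:
$$\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$$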
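The two alternating steps can be sketched in Python as follows. This is only a sketch under stated assumptions: the two-state MDP, the fixed number of evaluation sweeps, and the arbitrary starting policy are all choices made for this example, not part of the lecture.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 0.0)], "go": [("B", 0.8, 1.0), ("A", 0.2, 0.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.9


def q_value(s, a, V):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])


def policy_iteration(eval_sweeps=1000):
    policy = {s: sorted(T[s])[0] for s in T}  # arbitrary starting policy
    while True:
        # Policy evaluation: Bellman updates with the policy frozen.
        V = {s: 0.0 for s in T}
        for _ in range(eval_sweeps):
            V = {s: q_value(s, policy[s], V) for s in T}
        # Policy improvement: greedy one-step look-ahead on fixed values.
        new_policy = {s: max(T[s], key=lambda a: q_value(s, a, V)) for s in T}
        if new_policy == policy:  # policy is stable, so we are done
            return policy, V
        policy = new_policy


best_policy, best_values = policy_iteration()
print(best_policy, best_values)
```

On this toy problem the policy stabilizes after a single improvement step, which is the situation in which policy iteration saves work relative to value iteration.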
8
Comparison
- In value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
  - Policy might not change between updates (wastes computation)
- In policy iteration:
  - Several passes to update utilities with frozen policy
  - Occasional passes to update policies
  - Value update can be solved as a linear system
  - Can be faster, if the policy changes infrequently
- Hybrid approaches (asynchronous policy iteration):
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
9
Asynchronous Value Iteration
- In value iteration, we update every state in each iteration
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often
- In fact, we can update the policy as seldom or as often as we like, and we will still converge
- Idea: update states whose value we expect to change: if $|V_{i+1}(s) - V_i(s)|$ is large, then update the predecessors of s
10
Reinforcement Learning
11
Reinforcement Learning
- Reinforcement learning: still have an MDP:
  - A set of states $s \in S$
  - A set of actions (per state) $A$
  - A model $T(s,a,s')$
  - A reward function $R(s,a,s')$
- Still looking for a policy $\pi(s)$
- New twist: don't know T or R
  - I.e. don't know which states are good or what the actions do
  - Must actually try actions and states out to learn
[Demo]
12
Example: Animal Learning
- RL studied experimentally for more than 60 years in psychology
  - Rewards: food, pain, hunger, drugs, etc.
  - Mechanisms and sophistication debated
- Example: foraging
  - Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
  - Bees have a direct neural connection from nectar intake measurement to motor planning area
13
Passive Learning
- Simplified task
  - You don't know the transitions $T(s,a,s')$
  - You don't know the rewards $R(s,a,s')$
  - You are given a policy $\pi(s)$
  - Goal: learn the state values (what policy evaluation did)
- In this case:
  - Learner is "along for the ride"
  - No choice about what actions to take
  - Just execute the policy and learn from experience
  - We'll get to the active case soon
  - This is NOT offline planning! You actually take actions in the world and see what happens.
14
Passive Model-Based Learning
- Idea:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Simple empirical model learning
  - Count outcomes for each s, a
  - Normalize to give an estimate of $T(s,a,s')$
  - Discover $R(s,a,s')$ when we experience (s, a, s')
- Solving the MDP with the learned model
  - Iterative policy evaluation, for example
[Diagram: look-ahead tree from state s, through (s, $\pi(s)$), to successor s'.]
15
Example: Model-Based Learning
[Figure: grid world with x and y axes; exit rewards +100 and -100.]
$\gamma = 1$
Episodes:
- Episode 1: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
- Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
Estimated transitions:
- T(<3,3>, right, <4,3>) = 1 / 3
- T(<2,3>, right, <3,3>) = 2 / 2
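The count-and-normalize step can be sketched in Python on the two episodes above. The data layout (lists of (state, action, reward) steps) and the "done" marker for the terminal successor are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# The two episodes from the slide, as (state, action, reward) steps.
episodes = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]


def estimate_model(episodes):
    """Count outcomes per (s, a), then normalize to estimate T(s, a, s')."""
    counts = defaultdict(Counter)
    R_hat = {}
    for episode in episodes:
        for i, (s, a, r) in enumerate(episode):
            # The successor is the state of the next step, or "done" at the end.
            s2 = episode[i + 1][0] if i + 1 < len(episode) else "done"
            counts[(s, a)][s2] += 1
            R_hat[(s, a, s2)] = r  # reward is observed when (s, a, s') occurs
    T_hat = {
        (s, a, s2): n / sum(outcomes.values())
        for (s, a), outcomes in counts.items()
        for s2, n in outcomes.items()
    }
    return T_hat, R_hat


T_hat, R_hat = estimate_model(episodes)
print(T_hat[((3, 3), "right", (4, 3))])  # 1 of 3 observed outcomes
print(T_hat[((2, 3), "right", (3, 3))])  # 2 of 2 observed outcomes
```

With `T_hat` and `R_hat` in hand, the values can then be solved for as if the learned model were correct, for instance by iterative policy evaluation.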
16
Passive Model-Free Learning
Big idea: why bother learning T?
1. Direct Estimation: Average the observed returns from each state to estimate its expected discounted reward V(s) directly. No need to compute T or R.
[Diagram: look-ahead tree from state s, through (s, $\pi(s)$), to successor s'.]
17
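Model-Free Learning
- Want to compute an expectation weighted by P(x):
$$E[f(x)] = \sum_x P(x) f(x)$$
- Model-based: estimate P(x) from samples $x_1, \dots, x_N$, then compute the expectation:
$$\hat{P}(x) = \frac{\text{count}(x)}{N}, \qquad E[f(x)] \approx \sum_x \hat{P}(x) f(x)$$
- Model-free: estimate the expectation directly from samples:
$$E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
- Why does this work? Because samples appear with the right frequencies!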
18
Example: Model-Free Estimation
[Figure: grid world with x and y axes; exit rewards +100 and -100.]
$\gamma = 1$, R = -1
Episodes:
- Episode 1: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
- Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
Estimates:
- V(2,3) ~ (96 + -103) / 2 = -3.5
- V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
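The two estimates above can be reproduced with a short Python sketch of direct estimation. It assumes every visit to a state contributes one return (which is what the three samples for (3,3) imply); the data layout is an assumption made for this sketch.

```python
from collections import defaultdict

# The two episodes from the slide, as (state, action, reward) steps.
episodes = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]


def direct_estimate(episodes, gamma=1.0):
    """Average the observed discounted return from every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so the return from each step is built incrementally.
        for s, _a, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


V = direct_estimate(episodes)
print(V[(2, 3)], V[(3, 3)])
```

Note that no transition or reward model is built at any point; only sampled returns are averaged.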
19
Sample-Based Policy Evaluation?
Update V without building T or R.
[Diagram: from state s under action $\pi(s)$, sampled successors $s'_1$, $s'_2$, $s'_3$ stand in for the full set of outcomes s'.]
20
Passive Model-Free Learning
Big idea: why bother learning T?
1. Direct Estimation: Average the observed returns from each state to estimate its expected discounted reward V(s) directly. No need to compute T or R.
2. Temporal-Difference Learning: Update the value function towards whatever successor occurs, maintaining a running average.
[Diagram: look-ahead tree from state s, through (s, $\pi(s)$), to successor s'.]
21
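Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience (s, a, s', r)
  - Likely s' will contribute updates more often
- Temporal difference learning
  - Policy still fixed!
  - Move values toward the value of whatever successor occurs: running average!
- Sample of V(s):
$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$
- Update to V(s):
$$V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}$$
- Same update:
$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( \text{sample} - V^\pi(s) \right)$$
[Diagram: look-ahead tree from state s, through (s, $\pi(s)$), to successor s'.]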
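A minimal Python sketch of the TD value update, applied to a few hand-made transitions. The state names, rewards, learning rate, and the convention that unseen states (including the terminal "done") have value 0 are assumptions for illustration.

```python
def td_update(V, s, r, s2, alpha, gamma):
    """Move V(s) toward the one-step sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s2, 0.0)  # unseen states default to 0
    old = V.get(s, 0.0)
    V[s] = old + alpha * (sample - old)  # same as (1 - alpha) * old + alpha * sample


# Hypothetical experience gathered while following a fixed policy:
# (state, reward, next_state) triples.
experience = [("A", -1, "B"), ("B", 10, "done"), ("A", -1, "B")]

V = {}
for s, r, s2 in experience:
    td_update(V, s, r, s2, alpha=0.5, gamma=1.0)

print(V)
```

The second visit to A pulls V(A) upward because V(B) has already absorbed the +10 reward, which is how value information propagates backwards through successors.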
22
Exponential Moving Average
- Exponential moving average (running interpolation update):
$$\bar{x}_n = (1 - \alpha) \cdot \bar{x}_{n-1} + \alpha \cdot x_n$$
  - Makes recent samples more important
  - Forgets about the past (distant past values were wrong anyway)
  - Easy to compute from the running average
- A decreasing learning rate $\alpha$ can give converging averages
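To see why recent samples matter more, the running-average update can be unrolled. This short derivation assumes a constant learning rate $\alpha$ and an initial estimate $\bar{x}_0$ (both are assumptions of this sketch):

```latex
\begin{aligned}
\bar{x}_n &= (1-\alpha)\,\bar{x}_{n-1} + \alpha\,x_n \\
          &= \alpha\,x_n + \alpha(1-\alpha)\,x_{n-1} + (1-\alpha)^2\,\bar{x}_{n-2} \\
          &= \alpha \sum_{k=0}^{n-1} (1-\alpha)^k\,x_{n-k} \;+\; (1-\alpha)^n\,\bar{x}_0
\end{aligned}
```

A sample that is k steps old carries weight $\alpha(1-\alpha)^k$, so its influence decays geometrically, and the initial estimate is forgotten at rate $(1-\alpha)^n$.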
23
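Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation
- However, if we want to turn values into a (new) policy, we're sunk, because the one-step look-ahead needs T and R:
$$\pi(s) = \arg\max_a Q(s,a), \qquad Q(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$
- Idea: learn Q-values directly
- Makes action selection model-free too!
[Diagram: look-ahead tree from state s, through q-state (s, a) and transition (s, a, s'), to successor s'.]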
24
Active Learning
- Full reinforcement learning
  - You don't know the transitions $T(s,a,s')$
  - You don't know the rewards $R(s,a,s')$
  - You can choose any actions you like
  - Goal: learn the optimal policy (what value iteration did!)
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens.
25
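Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with $V_0^*(s) = 0$, which we know is right (why?)
  - Given $V_i^*$, calculate the values for all states for depth i+1:
$$V_{i+1}^*(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_i^*(s') \right]$$
- But Q-values are more useful!
  - Start with $Q_0^*(s,a) = 0$, which we know is right (why?)
  - Given $Q_i^*$, calculate the q-values for all q-states for depth i+1:
$$Q_{i+1}^*(s,a) \leftarrow \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_i^*(s',a') \right]$$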
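Q-value iteration can be sketched in Python as below. The two-state MDP and the fixed iteration count are invented for this example; this is a sketch of the depth-by-depth update with a known model, not the lecture's own code.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 0.0)], "go": [("B", 0.8, 1.0), ("A", 0.2, 0.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.9


def q_value_iteration(T, gamma, iterations=500):
    """Compute Q_{i+1} from Q_i for all q-states, starting from Q_0 = 0."""
    Q = {(s, a): 0.0 for s in T for a in T[s]}
    for _ in range(iterations):
        Q = {
            (s, a): sum(
                p * (r + gamma * max(Q[(s2, a2)] for a2 in T[s2]))
                for s2, p, r in T[s][a]
            )
            for s in T
            for a in T[s]
        }
    return Q


Q = q_value_iteration(T, GAMMA)
# Action selection needs only the Q-values, not T or R.
greedy = {s: max(T[s], key=lambda a: Q[(s, a)]) for s in T}
print(Q, greedy)
```

The last line shows why Q-values are more useful than state values: the greedy policy is read off with an argmax over actions, with no further look-ahead through the model.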
26
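Q-Learning
- Q-Learning: sample-based Q-value iteration
- Learn Q*(s,a) values
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: $Q(s,a)$
  - Consider your new sample estimate:
$$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$$
  - Incorporate the new estimate into a running average:
$$Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha \cdot \text{sample}$$
[Demo: Grid Q's]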
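A minimal Python sketch of the Q-learning update on two hand-made samples. The state and action names, the learning rate, and the convention of passing an empty action list for a terminal successor are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s2, next_actions, alpha, gamma):
    """Blend the old estimate Q(s, a) with the new one-step sample."""
    # A terminal successor has no actions, so its best future value is 0.
    best_next = max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample


Q = defaultdict(float)  # all Q-values start at 0

# Hypothetical samples (s, a, r, s', actions available in s').
q_learning_update(Q, "B", "exit", 10, "done", [], alpha=0.5, gamma=1.0)
q_learning_update(Q, "A", "right", -1, "B", ["exit"], alpha=0.5, gamma=1.0)

print(dict(Q))
```

The update uses the max over next actions regardless of which action the learner actually takes next, and it never touches T or R.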
28
Examples
Tom Erez, Hopper: http://www.youtube.com/watch?feature=player_embedded&v=kUfmnoobTHQ