Q
Administrivia. Final project proposals returned today (with comments). Evaluated on four axes: W&C = Writing & Clarity; M&P = Motivation & Problem statement; B&R = Background & Related work; RP = Research Plan.
Reminders... Last time: the Bellman equation; examples (pictures); solving the planning problem with policy iteration. Today: Q functions; the Q-learning algorithm; discussion of R2.
The policy iteration algorithm

Function: policy_iteration
Input: MDP M = 〈S, A, T, R〉; discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
    V_i = eval_policy(M, π_i, γ)    // from the Bellman equation
    π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, V)
for i = 1..|S| {
    π'(s_i) = argmax_{a∈A} ( Σ_j T(s_i, a, s_j) · V(s_j) )
}
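As a concrete illustration of this loop, here is a minimal Python sketch of policy iteration. It assumes T is stored as a |S|×|A|×|S| numpy array and R as a per-state reward vector (representation choices the slide does not commit to), and it follows the slide's convention of improving greedily on Σ_j T(s_i, a, s_j)·V(s_j), which is equivalent to the full improvement step when reward depends only on the state.

import numpy as np

def eval_policy(T, R, pi, gamma):
    # Solve the Bellman equation V = R + gamma * T_pi V for the fixed policy pi.
    n_states = T.shape[0]
    T_pi = T[np.arange(n_states), pi]          # T_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

def local_update_policy(T, V):
    # Greedy one-step improvement: pi'(s) = argmax_a sum_s' T(s, a, s') V(s').
    return np.argmax(T @ V, axis=1)            # (T @ V)[s, a] = sum_s' T(s,a,s') V(s')

def policy_iteration(T, R, gamma):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)         # arbitrary initial policy
    while True:
        V = eval_policy(T, R, pi, gamma)
        new_pi = local_update_policy(T, V)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi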
Q: A key operative. The critical step in policy iteration is π'(s_i) = argmax_{a∈A} ( Σ_j T(s_i, a, s_j) · V(s_j) ). It asks: "What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?" Put differently: regardless of the current π, what is the best action a I could greedily pick for the next timestep?
Q: A key operative. This is such a commonly used operation that it gets a special name. Definition: the Q function (see the reconstruction below). Policy iteration says: "Figure out Q, act greedily according to Q, then update Q and repeat, until you can't do any better..."
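The definition itself was an equation image on the slide and did not survive the transcript. A standard reconstruction, consistent with the Bellman equation and with the update rule used later in the deck (written here with a reward that may depend on both state and action, which is an assumption), is:

$$ Q^{\pi}(s,a) \;=\; R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{\pi}(s') $$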
What to do with Q. You can think of Q as a big table with one entry for each state/action pair: "If I'm in state s and take action a, this is my expected discounted reward..." It captures a one-step deviation: "In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?" You can also recover V and π from Q (see below):
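The recovery formulas were likewise an image on the slide; the usual relations for the greedy case (a plausible reconstruction, not a verbatim copy of the slide) are:

$$ V(s) = \max_{a} Q(s,a), \qquad \pi(s) = \operatorname*{arg\,max}_{a} Q(s,a) $$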
Policy iteration, restated

Function: policy_iteration
Input: MDP M = 〈S, A, T, R〉; discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
    Q_i = eval_policy(M, π_i, γ)    // from the Bellman equation
    π_{i+1} = local_update_policy(π_i, Q_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, Q)
for i = 1..|S| {
    π'(s_i) = argmax_{a∈A} Q(s_i, a)
}
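With Q stored as a |S|×|A| table (a representation assumption, not stated on the slide), the restated local_update_policy collapses to a row-wise argmax. A minimal Python sketch:

import numpy as np

def local_update_policy(Q):
    # pi'(s_i) = argmax_a Q(s_i, a), computed for every state at once;
    # Q is assumed to be a |S| x |A| numpy array
    return np.argmax(Q, axis=1)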
Learning with Q. Q and the notion of policy evaluation give us a nice way to do actual learning: use the Q table to represent the policy, and update Q through experience. Every time you see an (s, a, r, s') tuple, update Q.
Learning with Q. Each (s, a, r, s') example is a sample from T(s, a, s') and from R. With enough samples you can get a good idea of how the world works, where the reward is, etc. Note: you never actually learn T or R; Q encodes everything you need to know about the world.
The Q-learning algorithm

Algorithm: Q_learn
Inputs: state space S; action space A; discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
    s = get_current_world_state()
    a = pick_next_action(Q, s)
    (r, s') = act_in_world(a)
    Q(s, a) = Q(s, a) + α·( r + γ·max_{a'} Q(s', a') − Q(s, a) )
} Until (bored)
Q-learning in action. 15×15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65.
Q-learning in action: a sequence of maze-policy figures (not preserved in this transcript) showing the initial policy and the policy after 20, 30, 100, 150, 200, 250, 300, 350, and 400 learning episodes.
Well, it looks good anyway. But are we sure it's actually learning? How do we measure whether it's actually getting any better at the task (finding the goal state)? Every 10 episodes, "freeze" the policy (turn off learning), measure the average time to goal from a number of starting states, and average over a number of test episodes to iron out noise. Then plot the learning curve: number of learning episodes vs. average performance.
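A sketch of that evaluation protocol, reusing the hypothetical env interface from the Q-learning sketch above and acting greedily with respect to the frozen Q table:

import numpy as np

def evaluate_policy(env, Q, start_states, n_test_episodes=20, max_steps=1000):
    # Average number of steps to reach the goal under the frozen greedy policy
    # (no exploration, no learning). env.reset(s0), starting from a chosen
    # state, is an assumption about the interface.
    steps = []
    for _ in range(n_test_episodes):
        for s0 in start_states:
            s = env.reset(s0)
            for t in range(max_steps):
                s, r, done = env.step(int(np.argmax(Q[s])))
                if done:
                    break
            steps.append(t + 1)
    return float(np.mean(steps))

Interleaving ten learning episodes of q_learn with one call to evaluate_policy, and plotting the averages against the episode count, produces the learning curve shown next.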
Learning performance: learning-curve plot (episodes of learning vs. average time to goal; figure not preserved in this transcript).
Notes on learning performance. After 400 learning episodes, performance still hasn't asymptoted. Note: that's roughly 700,000 steps of experience! Q-learning is really, really slow, and (sadly) the same holds for many RL methods. Fixing this is a good research topic... ;-)