Reinforcement Learning: An Introduction, part 4
Ann Nowé, Computational Modeling Lab
Wednesday 18 June 2003
Based on Sutton and Barto
Backup diagrams in DP
[Figure: backup diagrams. For the state-value function V(s) of a policy, the value of a state backs up from the values of its successor states, V(s1), V(s1'), V(s2), V(s2'), V(s3), V(s3'). For the action-value function Q(s,a), the value of a state-action pair backs up from successor pairs, e.g. Q(s1,a1), Q(s1,a2), Q(s2,a1), Q(s2,a2).]
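For reference, the two functions the diagrams depict, in their standard Sutton-and-Barto form:

  V^pi(s)   = E_pi[ sum_{k>=0} gamma^k r_{t+k+1} | s_t = s ]
  Q^pi(s,a) = E_pi[ sum_{k>=0} gamma^k r_{t+k+1} | s_t = s, a_t = a ]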
Dynamic Programming, model based
[Figure: full backup diagram down to terminal states T. With a model, DP backs up over all possible successor states and rewards at every step.]
Recall Value Iteration in DP
Q(s,a) <- sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma max_{a'} Q(s',a') ]
Sweep over all state-action pairs until the values converge.
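A minimal sketch of this backup in code (illustrative only, not the slides' code; assumes the model is given as a dict P[s][a] of (probability, next state, reward) triples):

    # Tabular value iteration over Q (a sketch under the assumptions above).
    def value_iteration(P, gamma=0.9, theta=1e-6):
        Q = {(s, a): 0.0 for s in P for a in P[s]}
        while True:
            delta = 0.0
            for s in P:
                for a in P[s]:
                    # Bellman optimality backup for one (s, a) pair.
                    q_new = sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in P[s2]))
                                for p, s2, r in P[s][a])
                    delta = max(delta, abs(q_new - Q[(s, a)]))
                    Q[(s, a)] = q_new
            if delta < theta:   # stop when no value changed by more than theta
                return Q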
RL, model free
[Figure: sample backup diagram down to terminal state T. Without a model, RL backs up along a single sampled trajectory instead of over all successors.]
Q-Learning, a value iteration approach
Q(s,a) <- Q(s,a) + alpha [ r + gamma max_{a'} Q(s',a') - Q(s,a) ]
Q-learning is off-policy: the target uses the greedy action in s', whatever action the behavior policy actually takes.
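As a code sketch (the Q table, step size alpha and discount gamma are assumptions, not from the slides):

    from collections import defaultdict

    Q = defaultdict(float)          # Q[(state, action)] -> estimate, init 0
    alpha, gamma = 0.1, 0.9

    def q_learning_update(s, a, r, s_next, actions):
        # Off-policy target: bootstrap from the greedy action in s_next,
        # independent of the action actually taken there.
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])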
Example
[Figure: small task with states a, b, c, d and transitions labeled with rewards R=4, R=5, R=2, R=1, R=10. Observed epochs (episodes), as sequences of transition numbers: Epoch 1: 1,2,4; Epoch 2: 1,6; Epoch 3: 1,3; Epoch 4: 1,2,5; Epoch 6: 2,5.]
Some convergence issues
Q-learning is guaranteed to converge in a Markovian setting, provided every state-action pair is updated infinitely often and the learning rates decay appropriately.
Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16, 1994.
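The usual learning-rate conditions behind "decay appropriately" are the Robbins-Monro conditions (standard in such proofs, not spelled out on the slide):

  sum_t alpha_t(s,a) = infinity   and   sum_t alpha_t(s,a)^2 < infinity,   for every pair (s,a).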
Proof by Tsitsiklis: On the convergence of Q-learning
The Q-learning rule is written as an asynchronous stochastic approximation acting on the q vector of Q(s,a) values: each update combines a "learning factor" (step size), a contraction mapping applied to the vector, but with possibly outdated components, and a noise term.
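Spelled out, the annotated update reads (reconstructed from the slide's labels; standard form):

  Q_{t+1}(s,a) = (1 - alpha_t(s,a)) Q_t(s,a) + alpha_t(s,a) [ r_t + gamma max_{a'} Q_t(s',a') ]

where alpha_t(s,a) is the learning factor and the bracketed term is a noisy, possibly outdated sample of a contraction mapping applied to Q_t.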
Proof by Tsitsiklis, cont.
Stochastic approximation, as a vector.
[Figure: over time t, each component q_i of the vector is driven toward F_i(q) + noise, while the other components q_j are updated asynchronously.]
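The generic asynchronous iteration the figure depicts (Tsitsiklis's framework; notation assumed):

  q_i(t+1) = (1 - alpha_i(t)) q_i(t) + alpha_i(t) ( F_i(q(t)) + noise_i(t) )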
Proof by Tsitsiklis, cont.
Relating Q-learning to stochastic approximation: each pair (s,a) is one component (the i-th) of the vector; the contraction mapping F is the Bellman operator; the sampled reward and next state supply the noise term; and the learning factor can vary in time.
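The Bellman (optimality) operator in question, a contraction in the max norm for gamma < 1:

  (F Q)(s,a) = sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma max_{a'} Q(s',a') ]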
Sarsa: On-Policy TD Control
Q(s,a) <- Q(s,a) + alpha [ r + gamma Q(s',a') - Q(s,a) ], where a' is the action actually taken in s'.
When is Sarsa = Q-learning? (When the behavior policy is greedy, so the sampled a' is the argmax.)
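For contrast with the Q-learning sketch above (reusing the same assumed Q table, alpha and gamma):

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy target: bootstrap from the action a_next the policy
        # actually selected in s_next (e.g. epsilon-greedily).
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])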
Q-Learning versus Sarsa
Q-learning is off-policy: its target max_{a'} Q(s',a') ignores the policy being followed.
Sarsa is on-policy: its target Q(s',a') uses the action the policy actually takes.
Cliff Walking example
Actions: up, down, left, right.
Reward: cliff -100, goal 0, default -1.
Action selection: epsilon-greedy, with epsilon = 0.1.
Sarsa takes exploration into account: it learns the longer, safer path, while Q-learning learns the cliff-edge path and therefore falls more often under epsilon-greedy behavior.
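A minimal epsilon-greedy selection sketch, reusing the Q table from the Q-learning sketch above:

    import random

    def epsilon_greedy(s, actions, epsilon=0.1):
        # Explore uniformly with probability epsilon, otherwise act greedily.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])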
Q-learning for CAC
[Figure: call admission control. The state is the number of ongoing calls per class, e.g. s1 = (2,4), s2 = (3,4), s3 = (3,3); for an arriving call of class 1 or class 2 the actions are Accept or Reject, with values Q(s1,A1), Q(s1,R1), Q(s3,A2), Q(s3,R2).]
Acceptance criterion: maximize network revenue.
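The acceptance criterion as a decision rule (a hypothetical encoding; the slide gives no code):

    def admit(s, call_class):
        # Accept an arriving call iff accepting has the higher estimated
        # long-run network revenue (action names are illustrative).
        return Q[(s, ('accept', call_class))] >= Q[(s, ('reject', call_class))]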
Continuous Time Q-learning for CAC [Bratke]
[Figure: event timeline. Call arrivals and departures occur at times t0 = 0, t1, t2, ..., tn; the system state (x, then y, ...) changes only at these events.]
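In this semi-Markov setting, the usual continuous-time form of the Q-learning target discounts by the random inter-event time tau (standard SMDP Q-learning, not taken from the slide):

  Q(x,a) <- Q(x,a) + alpha [ r + e^{-beta tau} max_{a'} Q(y,a') - Q(x,a) ]

where beta > 0 is the continuous-time discount rate and r is the reward accumulated over the interval.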