
1 Reinforcement Learning: Learning Algorithms. Csaba Szepesvári, University of Alberta. Kioloa, MLSS’08. Slides:

2 Contents
- Defining the problem(s)
- Learning optimally
- Learning a good policy: Monte-Carlo; Temporal Difference (bootstrapping); Batch – fitted value iteration and relatives

3 The Learning Problem
- The MDP is unknown, but the agent can interact with the system
- Goals:
  - Learn an optimal policy. Where do the samples come from? Either samples are generated externally, or the agent interacts with the system to get the samples (“active learning”). Performance measure: what is the performance of the policy obtained?
  - Learn optimally: minimize regret while interacting with the system. Performance measure: loss in rewards due to not using the optimal policy from the beginning. Exploration vs. exploitation.

4 Learning from Feedback
- A protocol for prediction problems:
  - x_t – situation (observed by the agent)
  - y_t ∈ Y – value to be predicted
  - p_t ∈ Y – predicted value (can depend on all past values ⇒ learning!)
  - r_t(x_t, y_t, y) – value of predicting y; loss of learner: ℓ_t = r_t(x_t, y_t, y_t) − r_t(x_t, y_t, p_t)
- Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·). Regression: r_t(x_t, y_t, y) = −(y − y_t)² ⇒ ℓ_t = (y_t − p_t)²
- Full-information prediction problem: ∀ y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
- Bandit (partial-information) problem: only r_t(x_t, p_t) is communicated to the agent

5 Learning Optimally  Explore or exploit?  Bandit problems Simple schemes Optimism in the face of uncertainty (OFU)  UCB  Learning optimally in MDPs with the OFU principle

6 Learning Optimally: Exploration vs. Exploitation  Two treatments  Unknown success probabilities  Goal: find the best treatment while losing few patients  Explore or exploit?

7 Exploration vs. Exploitation: Some Applications  Simple processes: Clinical trials Job shop scheduling (random jobs) What ad to put on a web-page  More complex processes (memory): Optimizing production Controlling an inventory Optimal investment Poker..

8 Bernoulli Bandits  Payoff is 0 or 1  Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …  Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …

9 Some definitions  Payoff is 0 or 1  Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …  Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …  Now: t = 9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, …

10 The Exploration/Exploitation Dilemma  Action values: Q*(a) = E[R_t(a)]  Suppose you form estimates Q_t(a) ≈ Q*(a)  The greedy action at t is A_t* = argmax_a Q_t(a)  Exploitation: when the agent chooses to follow A_t*  Exploration: when the agent chooses to do something else  You can’t exploit all the time; you can’t explore all the time  You can never stop exploring; but you should always reduce exploring. Maybe.

11 Action-Value Methods  Methods that adapt action-value estimates and nothing else  How to estimate action-values?  Sample average: Q_t(a) = (R_1(a) + … + R_{n_t(a)}(a)) / n_t(a)  Claim: Q_t(a) → Q*(a) if n_t(a) → ∞  Why??

12 ε-Greedy Action Selection  Greedy action selection: A_t = A_t* = argmax_a Q_t(a)  ε-greedy: with probability 1 − ε choose the greedy action, with probability ε choose an action uniformly at random... the simplest way to “balance” exploration and exploitation
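The following is a minimal Python sketch of ε-greedy selection over a vector of action-value estimates; the array Q, the value of ε, and the random-generator setup are illustrative, not part of the slides.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the greedy action argmax_a Q[a] (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.2, 0.5, 0.1])   # current action-value estimates (made up)
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```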

13 10-Armed Testbed  n = 10 possible actions  Repeat 2000 times: draw Q*(a) ~ N(0,1), play 1000 rounds with R_t(a) ~ N(Q*(a), 1)

14 ε-Greedy Methods on the 10-Armed Testbed

15 Softmax Action Selection  Problem with ε-greedy: neglects action values  Softmax idea: grade action probabilities by estimated values  Gibbs, or Boltzmann, action selection (exponential weights): π_t(a) = exp(Q_t(a)/τ) / Σ_b exp(Q_t(b)/τ), where τ = τ_t is the “computational temperature”
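A small sketch of Boltzmann (softmax) selection over the same kind of action-value estimates; the temperature and the action values are made up.

```python
import numpy as np

def softmax_action(Q, tau, rng):
    """Gibbs/Boltzmann action selection: P(a) proportional to exp(Q[a]/tau).
    tau is the computational temperature: small tau approaches greedy,
    large tau approaches uniform random selection."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()               # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
a = softmax_action([0.2, 0.5, 0.1], tau=0.5, rng=rng)
```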

16 Incremental Implementation  Sample average: Q_k = (R_1 + … + R_k) / k  Incremental computation: Q_{k+1} = Q_k + (1/(k+1)) (R_{k+1} − Q_k)  Common update rule form: NewEstimate = OldEstimate + StepSize · (Target – OldEstimate)
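The update rule on this slide in code form; a sketch only, with a made-up reward sequence. A step size of 1/k reproduces the sample average exactly; a constant step size instead tracks nonstationary targets.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rewards = [1.0, 0.0, 1.0, 1.0]         # illustrative reward sequence
Q, k = 0.0, 0
for r in rewards:
    k += 1
    Q = update(Q, r, 1.0 / k)          # 1/k step size -> Q is the sample average (0.75 here)
```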

17 UCB: Upper Confidence Bounds  Principle: optimism in the face of uncertainty  Works when the environment is not adversarial  Assume rewards are in [0,1]; choose the arm maximizing Q_t(a) + sqrt( p ln t / (2 n_t(a)) ), with p > 2  For a stationary environment with iid rewards this algorithm is hard to beat!  Formally: regret in T steps is O(log T)  Improvement: estimate the variance and use it in place of p [AuSzeMu ’07]  This principle can be used for achieving small regret in the full RL problem! [Auer et al. ’02]
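A sketch of the UCB index in code, assuming the bonus form quoted above (sqrt(p ln t / (2 n_t(a))) with p > 2); the two-armed Bernoulli test bandit and the value p = 2.1 are illustrative.

```python
import numpy as np

def ucb_action(Q, N, t, p=2.1):
    """Optimism in the face of uncertainty: pull each arm once, then choose
    argmax_a Q[a] + sqrt(p * ln(t) / (2 * N[a]))."""
    for a in range(len(Q)):
        if N[a] == 0:
            return a                               # initial round-robin
    return int(np.argmax(Q + np.sqrt(p * np.log(t) / (2 * N))))

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.6])                  # made-up Bernoulli bandit
Q, N = np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    a = ucb_action(Q, N, t)
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
```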

18 UCRL2: UCB Applied to RL  [Auer, Jaksch & Ortner ’07]  Algorithm UCRL2(δ):
- Phase initialization:
  - Estimate the mean model p̂_0 using maximum likelihood (counts)
  - C := { p : ||p(·|x,a) − p̂_0(·|x,a)||_1 ≤ c sqrt( |X| log(|A|T/δ) / N(x,a) ) }
  - p’ := argmax_{p ∈ C} ρ*(p), π := π*(p’)
  - N_0(x,a) := N(x,a), ∀(x,a) ∈ X × A
- Execution:
  - Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase

19 UCRL2 Results  Def: diameter of an MDP M: D(M) = max_{x,y} min_π E[ T(x → y; π) ]  Regret bounds:
- Lower bound: E[L_T] = Ω( (D |X| |A| T)^{1/2} )
- Upper bounds:
  - w.p. 1 − δ/T, L_T ≤ O( D |X| ( |A| T log(|A|T/δ) )^{1/2} )
  - w.p. 1 − δ, L_T ≤ O( D² |X|² |A| log(|A|T/δ) / Δ ), where Δ is the performance gap between the best and the second-best policy

20 Learning a Good Policy  Monte-Carlo methods  Temporal Difference methods Tabular case Function approximation  Batch learning

21 Learning a good policy  Model-based learning Learn p,r “Solve” the resulting MDP  Model-free learning Learn the optimal action-value function and (then) act greedily Actor-critic learning Policy gradient methods  Hybrid Learn a model and mix planning and a model-free method; e.g. Dyna

22 Monte-Carlo Methods  Episodic MDPs!  Goal: learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]  (X_t, A_t, R_t): trajectory of π
- Visits to a state: first visit f(x) = min { t : X_t = x }; every visit E(x) = { t : X_t = x }
- Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + …
- K independent trajectories: S^(k), E^(k), f^(k), k = 1..K
- First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
- Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
- Claim: both converge to V^π(·)
- From now on, S_t = S(t)  [Singh & Sutton ’96]
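A compact sketch of first-visit / every-visit Monte-Carlo evaluation; the episode encoding (a list of (state, reward) pairs per episode) is an assumption made for the example.

```python
from collections import defaultdict

def mc_value_estimates(episodes, gamma, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.
    episodes: list of episodes, each a list of (x_t, R_t) pairs.
    Returns V(x) as the average return observed from visits to x."""
    returns = defaultdict(list)
    for episode in episodes:
        G, G_from = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):       # returns computed backwards
            G = episode[t][1] + gamma * G
            G_from[t] = G
        seen = set()
        for t, (x, _) in enumerate(episode):
            if first_visit and x in seen:
                continue                              # first-visit: count x only once
            seen.add(x)
            returns[x].append(G_from[t])
    return {x: sum(g) / len(g) for x, g in returns.items()}

V = mc_value_estimates([[("A", 0.0), ("B", 1.0)]], gamma=1.0)   # {"A": 1.0, "B": 1.0}
```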

23 Learning to Control with MC  Goal: learn to behave optimally  Method: learn Q^π(x,a), to be used in an approximate policy iteration (PI) algorithm  Idea/algorithm:
- Add randomness, so that all actions are sampled eventually infinitely often (e.g., ε-greedy or exploring starts)
- Use the first-visit or the every-visit method to estimate Q^π(x,a)
- Update the policy: either once the values have converged, or always at the states visited

24 Monte-Carlo: Evaluation  Convergence rate: Var(S(0)|X=x)/N  Advantages over DP: Learn from interaction with environment No need for full models No need to learn about ALL states Less harm by Markovian violations (no bootstrapping)  Issue: maintaining sufficient exploration exploring starts, soft policies

25 Temporal Difference Methods
- Every-visit Monte-Carlo: V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) )
- Bootstrapping: S_t = R_t + γ S_{t+1}; S_t’ = R_t + γ V(X_{t+1})
- TD(0): V(X_t) ← V(X_t) + α_t(X_t) ( S_t’ − V(X_t) )
- Value iteration: V(X_t) ← E[ S_t’ | X_t ]
- Theorem: Let V_t be the sequence of functions generated by TD(0). Assume that ∀x, w.p.1, Σ_t α_t(x) = ∞ and Σ_t α_t²(x) < +∞. Then V_t → V^π w.p.1. [Jaakkola et al. ’94, Tsitsiklis ’94, SzeLi99]
- Proof: stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π
[Samuel ’59], [Holland ’75], [Sutton ’88]
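The tabular TD(0) update from this slide as a runnable sketch; the transition encoding and the constants are illustrative.

```python
from collections import defaultdict

def td0_episode(transitions, V, alpha, gamma):
    """Tabular TD(0): V(X_t) <- V(X_t) + alpha * (R_t + gamma * V(X_{t+1}) - V(X_t)).
    transitions: iterable of (x, r, x_next, done) tuples from one trajectory."""
    for x, r, x_next, done in transitions:
        target = r + (0.0 if done else gamma * V[x_next])   # bootstrapped target S_t'
        V[x] += alpha * (target - V[x])
    return V

V = td0_episode([("A", 0.0, "B", False), ("B", 1.0, "T", True)],
                defaultdict(float), alpha=0.5, gamma=0.9)
```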

26 TD or MC?  TD advantages: can be fully incremental, i.e., it can learn before knowing the final outcome (less memory, less peak computation) and can learn without the final outcome, from incomplete sequences  MC advantage: less harm from Markovian violations  Convergence rate? Var(S(0)|X=x) decides!

27 Learning to Control with TD
- Q-learning [Watkins ’90]: Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ max_a Q(X_{t+1},a) − Q(X_t,A_t) }
- Theorem: converges to Q* [JJS’94, Tsi’94, SzeLi99]
- SARSA [Rummery & Niranjan ’94]: A_t ~ Greedy_ε(Q, X_t); Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ Q(X_{t+1},A_{t+1}) − Q(X_t,A_t) }
- Off-policy (Q-learning) vs. on-policy (SARSA)
- Expecti-SARSA
- Actor-Critic [Witten ’77, Barto, Sutton & Anderson ’83, Sutton ’84]
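Both updates in code, as a sketch over a dictionary-backed tabular Q; the state and action names are made up.

```python
from collections import defaultdict

def q_learning_update(Q, x, a, r, x_next, actions, alpha, gamma):
    """Off-policy: bootstrap with max_b Q(x_next, b), whatever action is taken next."""
    target = r + gamma * max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] += alpha * (target - Q[(x, a)])

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy: bootstrap with Q(x_next, a_next), the action actually chosen
    (e.g. epsilon-greedily) in x_next."""
    target = r + gamma * Q[(x_next, a_next)]
    Q[(x, a)] += alpha * (target - Q[(x, a)])

Q = defaultdict(float)
q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.1, gamma=0.9)
```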

28 Cliffwalking  ε-greedy, ε = 0.1

29 N-step TD Prediction  Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)

30 N-step TD Prediction
- Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T
- TD: S_t^(1) = R_t + γ V(X_{t+1}); use V to estimate the remaining return
- n-step TD:
  - 2-step return: S_t^(2) = R_t + γ R_{t+1} + γ² V(X_{t+2})
  - n-step return: S_t^(n) = R_t + γ R_{t+1} + … + γ^{n−1} R_{t+n−1} + γ^n V(X_{t+n})
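The n-step return as a helper function; a sketch assuming the rewards of the trajectory are stored in a list and V is a dict of current value estimates.

```python
def n_step_return(rewards, V, x_after_n, t, n, gamma):
    """S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^(n-1)*R_{t+n-1} + gamma^n * V(X_{t+n}).
    rewards[t], ..., rewards[t+n-1] must exist; x_after_n is the state reached after n steps.
    n = 1 gives the TD(0) target; n = T - t (with V(terminal) = 0) gives the Monte-Carlo return."""
    G = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return G + (gamma ** n) * V[x_after_n]
```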

31 Learning with n-step Backups  Learning with n-step backups: V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) )  n controls how much to bootstrap

32 Random Walk Examples  How does 2-step TD work here?  How about 3-step TD?

33 A Larger Example  Task: 19 state random walk  Do you think there is an optimal n? for everything?

34 Averaging N-step Returns  Idea: backup an average of several returns, e.g. backup half of the 2-step and half of the 4-step return: S_t^{avg} = ½ S_t^(2) + ½ S_t^(4)  This is called a “complex backup”

35 Forward View of TD(λ)
- Idea: average over multiple backups
- λ-return: S_t^(λ) = (1 − λ) Σ_{n≥0} λ^n S_t^(n+1)
- TD(λ): ΔV(X_t) = α_t ( S_t^(λ) − V(X_t) )
- Relation to TD(0) and MC: λ = 0 ⇒ TD(0); λ = 1 ⇒ MC
[Sutton ’88]

36 λ-return on the Random Walk  Same 19-state random walk as before  Why are intermediate values of λ best?

37 Backward View of TD(λ)
- δ_t = R_t + γ V(X_{t+1}) − V(X_t)
- V(x) ← V(x) + α_t δ_t e(x)
- e(x) ← γ λ e(x) + I(x = X_t)
- Off-line updates: same as the forward-view TD(λ)
- e(x): eligibility trace (accumulating trace)
- Replacing traces speed up convergence: e(x) ← max( γ λ e(x), I(x = X_t) )
[Sutton ’88, Singh & Sutton ’96]
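A sketch of the backward view with accumulating or replacing traces; the trace is kept in a dictionary and transitions use the same (x, r, x_next, done) encoding as the earlier TD(0) sketch.

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha, gamma, lam, replacing=False):
    """Backward-view TD(lambda) with eligibility traces over one trajectory."""
    e = defaultdict(float)
    for x, r, x_next, done in transitions:
        delta = r + (0.0 if done else gamma * V[x_next]) - V[x]
        for s in list(e):
            e[s] *= gamma * lam                 # decay all traces
        e[x] = max(e[x], 1.0) if replacing else e[x] + 1.0
        for s, es in e.items():
            V[s] += alpha * delta * es          # credit states in proportion to their trace
    return V

V = td_lambda_episode([("A", 0.0, "B", False), ("B", 1.0, "T", True)],
                      defaultdict(float), alpha=0.5, gamma=0.9, lam=0.8)
```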

38 Function Approximation with TD

39 Gradient Descent Methods  Parameter vector θ_t = (θ_t(1), …, θ_t(n))^⊤ (⊤ denotes transpose)  Assume V_t is a differentiable function of θ: V_t(x) = V(x; θ_t)  Assume, for now, training examples of the form { (X_t, V^π(X_t)) }

40 Performance Measures  Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P: MSE(θ) = Σ_x P(x) ( V^π(x) − V(x; θ) )²  Why P? Why minimize MSE?  Let us assume that P is always the distribution of states at which backups are done  The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

41 Gradient Descent  Let L be any function of the parameters. Its gradient at any point θ in this space is ∇_θ L(θ) = ( ∂L(θ)/∂θ(1), …, ∂L(θ)/∂θ(n) )^⊤  Iteratively move down the gradient: θ_{t+1} = θ_t − α_t ∇_θ L(θ_t)

42 Gradient Descent in RL
- Function to descend on: L(θ) = Σ_x P(x) ( V^π(x) − V(x; θ) )²
- Gradient: ∇_θ L(θ) = −2 Σ_x P(x) ( V^π(x) − V(x; θ) ) ∇_θ V(x; θ)
- Gradient descent procedure: θ_{t+1} = θ_t + α_t ( V^π(X_t) − V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)
- Bootstrapping with S_t’: replace V^π(X_t) by S_t’ = R_t + γ V(X_{t+1}; θ_t)
- TD(λ) (forward view): replace V^π(X_t) by the λ-return S_t^(λ)

43 Linear Methods
- Linear FAPP: V(x; θ) = θ^⊤ φ(x), so ∇_θ V(x; θ) = φ(x)
- Tabular representation: φ(x)_y = I(x = y)
- Backward view: δ_t = R_t + γ V(X_{t+1}) − V(X_t); θ ← θ + α_t δ_t e; e ← γ λ e + ∇_θ V(X_t; θ)
- Theorem [TsiVaR’97]: V_t converges to a V such that ||V − V^π||_{D,2} ≤ ||V^π − Π V^π||_{D,2} / (1 − γ)
[Sutton ’84, ’88, Tsitsiklis & Van Roy ’97]
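The same backward view with a linear value function, as a single-step sketch; the feature vectors and constants are illustrative.

```python
import numpy as np

def linear_td_lambda_step(theta, e, phi_x, phi_x_next, r, alpha, gamma, lam, done=False):
    """One backward-view TD(lambda) update for V(x; theta) = theta @ phi(x),
    so grad_theta V(x; theta) = phi(x)."""
    v_next = 0.0 if done else theta @ phi_x_next
    delta = r + gamma * v_next - theta @ phi_x
    e = gamma * lam * e + phi_x                 # eligibility trace in feature space
    theta = theta + alpha * delta * e
    return theta, e

d = 4                                           # number of features (illustrative)
theta, e = np.zeros(d), np.zeros(d)
phi_a, phi_b = np.eye(d)[0], np.eye(d)[1]       # tabular case: indicator features
theta, e = linear_td_lambda_step(theta, e, phi_a, phi_b, r=1.0,
                                 alpha=0.1, gamma=0.9, lam=0.8)
```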

44 Control with FA  Learning state-action values; training examples of the form { ((X_t, A_t), Q^π(X_t, A_t)) }  The general gradient-descent rule: θ_{t+1} = θ_t + α_t ( S_t − Q(X_t, A_t; θ_t) ) ∇_θ Q(X_t, A_t; θ_t)  Gradient-descent Sarsa(λ) [Rummery & Niranjan ’94]

45 Mountain-Car Task [Sutton ’96], [Singh & Sutton ’96]

46 Mountain-Car Results

47 Baird’s Counterexample: Off-policy Updates Can Diverge [Baird ’95]

48 Baird’s Counterexample Cont.

49 Should We Bootstrap?

50 Batch Reinforcement Learning

51 Batch RL
- Goal: given a trajectory of the behavior policy π_b, X_1, A_1, R_1, …, X_t, A_t, R_t, …, X_N, compute a good policy!
- “Batch learning”
- Properties: data collection is not influenced; the emphasis is on the quality of the solution; computational complexity plays a secondary role
- Performance measures:
  - ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)| = sup_x ( V*(x) − V^π(x) )
  - ||V* − V^π||_2 = sqrt( ∫ ( V*(x) − V^π(x) )² dμ(x) )

52 Solution methods
- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration ⇒ fitted Q-iteration
  - using policy iteration, where the policy is evaluated by approximate value iteration, by Bellman-residual minimization (BRM), or by least-squares temporal difference learning (LSTD) ⇒ LSPI
- Policy search
[Bradtke, Barto ’96], [Lagoudakis, Parr ’03], [AnSzeMu ’07]

53 Evaluating a policy: Fitted value iteration
- Choose a function space F
- Solve for m = 1, 2, …, M the least-squares (regression) problems: Q_{m+1} = argmin_{Q ∈ F} Σ_t ( R_t + γ Q_m(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t) )²
- Counterexamples?!?!? [Baird ’95, Tsitsiklis and Van Roy ’96]
- When does this work??
- Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π
- We have to make some assumptions on F
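A sketch of the iteration over a linear function class, with a numpy least-squares solve standing in for the regression step; the data encoding, the featurize helper and the linear class are assumptions made for the example, and any regression method could replace the solve.

```python
import numpy as np

def fitted_policy_evaluation(data, policy, featurize, gamma, M):
    """Fitted value iteration for evaluating a fixed policy from batch data.
    data: list of (x, a, r, x_next) transitions; featurize(x, a) -> feature vector.
    Each round fits Q_{m+1}(x, a) ~ r + gamma * Q_m(x_next, policy(x_next))
    by least squares over the linear class Q(x, a) = w @ featurize(x, a)."""
    Phi = np.array([featurize(x, a) for x, a, _, _ in data])
    w = np.zeros(Phi.shape[1])                      # Q_0 = 0
    for _ in range(M):
        targets = np.array([r + gamma * (featurize(x_next, policy(x_next)) @ w)
                            for _, _, r, x_next in data])
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w
```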

54 Least-squares vs. gradient
- Linear least squares (ordinary regression): y_t = w_*^⊤ x_t + ε_t, with (x_t, y_t) jointly distributed r.v.s, iid, E[ε_t | x_t] = 0
- Seeing (x_t, y_t), t = 1, …, T, find out w_*
- Loss function: L(w) = E[ (y_1 − w^⊤ x_1)² ]
- Least-squares approach: ŵ_T = argmin_w Σ_{t=1}^T (y_t − w^⊤ x_t)²
- Stochastic gradient method: w_{t+1} = w_t + α_t (y_t − w_t^⊤ x_t) x_t
- Tradeoffs: sample complexity (how good is the estimate?) vs. computational complexity (how expensive is the computation?)
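The two estimators side by side on synthetic data; a sketch, with the data-generating parameters and the step-size schedule made up.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)            # y_t = w*^T x_t + eps_t

# Batch least squares: argmin_w sum_t (y_t - w^T x_t)^2, solved in one shot.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: w_{t+1} = w_t + alpha_t * (y_t - w_t^T x_t) * x_t, one pass.
w_sgd = np.zeros(d)
for t in range(T):
    alpha_t = 1.0 / (t + 10)                          # illustrative step-size schedule
    w_sgd += alpha_t * (y[t] - w_sgd @ X[t]) * X[t]

# Least squares is more sample-efficient per data point; SGD costs only O(d)
# per update and handles streaming data.
```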

55 Fitted value iteration: Analysis  Goal: bound ||Q_M − Q^π||_{μ,2} in terms of max_m ||ε_m||_{ν,2}, where ||ε_m||²_{ν,2} = ∫ ε_m²(x,a) ν(dx,da), Q_{m+1} = T^π Q_m + ε_m, ε_{−1} = Q_0 − Q^π, and U_m = Q_m − Q^π.  After [AnSzeMu ’07]

56 Analysis/2

57 Summary  If the regression errors are all small and the system is noisy (∀π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small  How to make the regression errors small?  Regression error decomposition: approximation error + estimation error

58 Controlling the approximation error

59 Controlling the approximation error

60 Controlling the approximation error

61 Controlling the approximation error  Assume smoothness!

62 Learning with (lots of) historical data  Data: a long trajectory of some exploration policy  Goal: an efficient algorithm to learn a policy  Idea: use fitted action-values  Algorithms: Bellman-residual minimization (BRM), FQI [AnSzeMu ’07]; LSPI [Lagoudakis, Parr ’03]  Bounds: oracle inequalities (BRM, FQI and LSPI) ⇒ consistency

63 BRM insight  TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)  Bellman error: E[ E[δ_t | X_t, A_t]² ]  What we can compute/estimate: E[ E[δ_t² | X_t, A_t] ]  They are different!  However: [AnSzeMu ’07]

64 Loss function

65 Algorithm (BRM++)

66 Do we need to reweight or throw away data?  NO! WHY? Intuition from regression: m(x) = E[Y|X = x] can be learnt no matter what p(x) is!  π*(a|x): the same should be possible!  BUT... performance might be poor! ⇒ YES! Like in supervised learning when the training and test distributions are different

67 Bound

68 The concentration coefficients  Lyapunov exponents  Our case: y_t is infinite dimensional; P_t depends on the policy chosen  If the top Lyapunov exponent is ≤ 0, we are good

69 Open question  Abstraction:  Let  True?

70 Relation to LSTD  LSTD: Linear function space Bootstrap the “normal equation” [AnSzeMu ’07]

71 Open issues Adaptive algorithms to take advantage of regularity when present to address the “curse of dimensionality”  Penalized least-squares/aggregation?  Feature relevance  Factorization  Manifold estimation Abstraction – build automatically Active learning Optimal on-line learning for infinite problems

72 References
- [Auer et al. ’02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite time analysis of the multiarmed bandit problem, Machine Learning, 47:235—256.
- [AuSzeMu ’07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments, ALT.
- [Auer, Jaksch & Ortner ’07] P. Auer, T. Jaksch and R. Ortner: Near-optimal Regret Bounds for Reinforcement Learning, (2007), available at
- [Singh & Sutton ’96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123—158.
- [Sutton ’88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9—44.
- [Jaakkola et al. ’94] T. Jaakkola, M.I. Jordan, and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185—1201.
- [Tsitsiklis ’94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185—202.
- [SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, Neural Computation, 11:2017—2059.
- [Watkins ’90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD Thesis.
- [Rummery and Niranjan ’94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
- [Sutton ’84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.
- [Tsitsiklis & Van Roy ’97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674—690.
- [Sutton ’96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS.
- [Baird ’95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation, ICML.
- [Bradtke, Barto ’96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33—57.
- [Lagoudakis, Parr ’03] M. Lagoudakis and R. Parr: Least-squares policy iteration, Journal of Machine Learning Research, 4:1107—1149.
- [AnSzeMu ’07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning Journal, 2007.