1 Reinforcement Learning: Learning Algorithms. Csaba Szepesvári, University of Alberta. Kioloa, MLSS'08. Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
2 Contents Defining the problem(s) Learning optimally Learning a good policy Monte-Carlo Temporal Difference (bootstrapping) Batch – fitted value iteration and relatives
3 The Learning Problem The MDP is unknown but the agent can interact with the system Goals: Learn an optimal policy Where do the samples come from? Samples are generated externally The agent interacts with the system to get the samples (“active learning”) Performance measure: What is the performance of the policy obtained? Learn optimally: Minimize regret while interacting with the system Performance measure: loss in rewards due to not using the optimal policy from the beginning Exploration vs. exploitation
4 Learning from Feedback A protocol for prediction problems: x_t – situation (observed by the agent); y_t ∈ Y – value to be predicted; p_t ∈ Y – predicted value (can depend on all past values ⇒ learning!); r_t(x_t,y_t,y) – value of predicting y; loss of the learner: λ_t = r_t(x_t,y_t,y_t) − r_t(x_t,y_t,p_t). Supervised learning: the agent is told y_t and r_t(x_t,y_t,·). Regression: r_t(x_t,y_t,y) = −(y−y_t)², so λ_t = (y_t−p_t)². Full information prediction problem: ∀y ∈ Y, r_t(x_t,y) is communicated to the agent, but not y_t. Bandit (partial information) problem: only r_t(x_t,p_t) is communicated to the agent.
5 Learning Optimally Explore or exploit? Bandit problems Simple schemes Optimism in the face of uncertainty (OFU) UCB Learning optimally in MDPs with the OFU principle
6 Learning Optimally: Exploration vs. Exploitation Two treatments Unknown success probabilities Goal: find the best treatment while losing few patients Explore or exploit?
7 Exploration vs. Exploitation: Some Applications Simple processes: Clinical trials Job shop scheduling (random jobs) What ad to put on a web-page More complex processes (memory): Optimizing production Controlling an inventory Optimal investment Poker..
8 Bernoulli Bandits Payoff is 0 or 1 Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), … Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), … [figure: a sample sequence of 0/1 payoffs from the two arms]
9 Some definitions Payoff is 0 or 1 Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), … Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), … Now: t=9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, … [figure: the same sample payoff sequence, annotated with the plays of each arm]
10 The Exploration/Exploitation Dilemma Action values: Q*(a) = E[R_t(a)] Suppose you form estimates Q_t(a) ≈ Q*(a). The greedy action at t is A_t* = argmax_a Q_t(a). Exploitation: when the agent chooses to follow A_t*. Exploration: when the agent chooses to do something else. You can't exploit all the time; you can't explore all the time. You can never stop exploring; but you should always reduce exploring. Maybe.
11 Action-Value Methods Methods that adapt action-value estimates and nothing else How to estimate action-values? Sample average: Q_t(a) = the mean of the rewards received on the n_t(a) plays of arm a so far. Claim: Q_t(a) → Q*(a) if n_t(a) → ∞. Why??
12 ε-Greedy Action Selection Greedy action selection: A_t = A_t* = argmax_a Q_t(a). ε-Greedy: with probability 1−ε take the greedy action, with probability ε pick an action uniformly at random... the simplest way to "balance" exploration and exploitation
13 10-Armed Testbed n = 10 possible actions Repeat 2000 times: Q*(a) ~ N(0,1) Play 1000 rounds, R_t(a) ~ N(Q*(a),1)
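To make slides 11-14 concrete, here is a minimal Python sketch of the 10-armed testbed with sample-average estimates and ε-greedy selection; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def run_bandit(n_arms=10, n_steps=1000, epsilon=0.1, rng=None):
    """One run of the 10-armed testbed: Q*(a) ~ N(0,1), R_t(a) ~ N(Q*(a),1)."""
    rng = rng or np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, n_arms)   # true action values
    Q = np.zeros(n_arms)                    # sample-average estimates
    N = np.zeros(n_arms)                    # play counts T_a(t)
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:          # explore
            a = int(rng.integers(n_arms))
        else:                               # exploit: greedy action
            a = int(np.argmax(Q))
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample average
        rewards[t] = r
    return rewards

# Average over 2000 independent runs, as on the slide.
avg_reward = np.mean([run_bandit(epsilon=0.1) for _ in range(2000)], axis=0)
```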
14 ε-Greedy Methods on the 10-Armed Testbed
15 Softmax Action Selection Problem with ε-greedy: neglects action values Softmax idea: grade action probabilities by estimated values. Gibbs, or Boltzmann action selection, or exponential weights: π_t(a) = e^{Q_t(a)/τ_t} / Σ_b e^{Q_t(b)/τ_t}, where τ_t is the "computational temperature"
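A small sketch of Boltzmann (softmax) action selection as described above; subtracting the maximum preference before exponentiating is an implementation detail for numerical stability, not part of the slide.

```python
import numpy as np

def boltzmann_action(Q, tau, rng=None):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                 # stabilize the exponentials
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# tau -> 0 approaches greedy selection; tau -> infinity approaches uniform random.
a = boltzmann_action([0.2, 1.0, -0.5], tau=0.5)
```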
16 Incremental Implementation Sample average: Q_n = (R_1 + … + R_n)/n Incremental computation: Q_{n+1} = Q_n + (1/(n+1)) (R_{n+1} − Q_n) Common update rule form: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
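The update-rule form on this slide translates directly into code; the snippet below is a sketch showing that step size 1/n reproduces the sample average, while a constant step size gives a recency-weighted average (useful for nonstationary problems).

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Step size 1/n reproduces the mean of the first n targets.
Q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:
    n += 1
    Q = incremental_update(Q, r, 1.0 / n)   # Q == 0.75 after the loop

# A constant step size instead yields an exponentially weighted average.
Q_const = 0.0
for r in [1.0, 0.0, 1.0, 1.0]:
    Q_const = incremental_update(Q_const, r, 0.1)
```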
17 UCB: Upper Confidence Bounds Principle: Optimism in the face of uncertainty Works when the environment is not adversarial Assume rewards are in [0,1]. Choose at time t the arm with the largest upper confidence index, e.g. Q_t(a) + √( p log t / (2 T_a(t)) ) with p > 2. For a stationary environment with iid rewards this algorithm is hard to beat! Formally: regret in T steps is O(log T) [Auer et al. '02] Improvement: estimate the variance and use it in place of p [AuSzeMu '07] This principle can also be used for achieving small regret in the full RL problem (next slide)!
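A sketch of a UCB-style index policy in Python. The bonus √(p log t / (2 N(a))) follows the slide's setting of rewards in [0,1] with exploration constant p > 2; treat the exact constants as an assumption of this example.

```python
import numpy as np

def ucb_bandit(reward_fn, n_arms, n_steps, p=2.5, rng=None):
    """UCB: play the arm maximizing Q(a) + sqrt(p * ln(t) / (2 * N(a))).

    reward_fn(a) must return a reward in [0, 1]; p > 2 as on the slide.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros(n_arms)
    N = np.zeros(n_arms)
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            a = t - 1                                  # play each arm once first
        else:
            bonus = np.sqrt(p * np.log(t) / (2.0 * N))
            a = int(np.argmax(Q + bonus))
        r = reward_fn(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
    return Q, N

# Example: two Bernoulli arms with success probabilities 0.4 and 0.6.
probs = [0.4, 0.6]
Q, N = ucb_bandit(lambda a: float(np.random.random() < probs[a]), n_arms=2, n_steps=5000)
```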
18 UCRL2: UCB Applied to RL [Auer, Jaksch & Ortner '07] Algorithm UCRL2(δ): Phase initialization: Estimate the mean model p̂_0 using maximum likelihood (counts); C := { p : ||p(·|x,a) − p̂_0(·|x,a)||_1 ≤ c √( |X| log(|A|T/δ) / N(x,a) ) }; p' := argmax_{p∈C} ρ*(p), π := π*(p'); N_0(x,a) := N(x,a), ∀(x,a) ∈ X × A. Execution: execute π until some (x,a) has been visited at least N_0(x,a) times in this phase.
19 UCRL2 Results Def: Diameter of an MDP M: D(M) = max_{x,y} min_π E[ T(x→y; π) ] Regret bounds: Lower bound: E[L_T] = Ω( ( D |X| |A| T )^{1/2} ) Upper bounds: w.p. 1−δ/T, L_T ≤ O( D |X| ( |A| T log(|A|T/δ) )^{1/2} ); w.p. 1−δ, L_T ≤ O( D² |X|² |A| log(|A|T/δ) / Δ ), where Δ = the performance gap between the best and the second-best policy.
20 Learning a Good Policy Monte-Carlo methods Temporal Difference methods Tabular case Function approximation Batch learning
21 Learning a good policy Model-based learning Learn p,r “Solve” the resulting MDP Model-free learning Learn the optimal action-value function and (then) act greedily Actor-critic learning Policy gradient methods Hybrid Learn a model and mix planning and a model-free method; e.g. Dyna
22 Monte-Carlo Methods Episodic MDPs! Goal: Learn V^π(·), V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ] (X_t,A_t,R_t): trajectory of π Visits to a state: f(x) = min { t | X_t = x } (first visit), E(x) = { t | X_t = x } (every visit) Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + … K independent trajectories: S^(k), E^(k), f^(k), k=1..K First-visit MC: average over { S^(k)( f^(k)(x) ) : k=1..K } Every-visit MC: average over { S^(k)(t) : k=1..K, t ∈ E^(k)(x) } Claim: Both converge to V^π(·). From now on S_t = S(t). [Singh & Sutton '96]
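A minimal first-visit Monte-Carlo evaluator as a sketch, assuming episodes are given as (state, reward) sequences generated by π; dropping the first-visit test yields the every-visit variant.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V^pi from K independent episodes of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        # Return from each time step: S_t = R_t + gamma*R_{t+1} + ...
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:          # first visit f(x); remove this test for every-visit MC
                seen.add(x)
                returns[x].append(G_at[t])
    return {x: sum(g) / len(g) for x, g in returns.items()}
```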
23 Learning to Control with MC Goal: Learn to behave optimally Method: Learn Q^π(x,a) .. to be used in an approximate policy iteration (PI) algorithm Idea/algorithm: Add randomness Goal: all actions are sampled eventually infinitely often e.g., ε-greedy or exploring starts Use the first-visit or the every-visit method to estimate Q^π(x,a) Update the policy: once values have converged.. or.. always, at the states visited
24 Monte-Carlo: Evaluation Convergence rate: Var(S(0)|X=x)/N Advantages over DP: Learn from interaction with environment No need for full models No need to learn about ALL states Less harm by Markovian violations (no bootstrapping) Issue: maintaining sufficient exploration exploring starts, soft policies
25 Temporal Difference Methods Every-visit Monte-Carlo: V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) ) Bootstrapping: S_t = R_t + γ S_{t+1}; S_t' = R_t + γ V(X_{t+1}) TD(0): V(X_t) ← V(X_t) + α_t(X_t) ( S_t' − V(X_t) ) Value iteration: V(X_t) ← E[ S_t' | X_t ] Theorem: Let V_t be the sequence of functions generated by TD(0). Assume ∀x, w.p.1, Σ_t α_t(x) = ∞ and Σ_t α_t²(x) < +∞. Then V_t → V^π w.p.1. Proof: Stochastic approximation: V_{t+1} = T_t(V_t,V_t), U_{t+1} = T_t(U_t,V^π) → TV^π. [Jaakkola et al. '94, Tsitsiklis '94, SzeLi99] [Samuel '59], [Holland '75], [Sutton '88]
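A tabular TD(0) sketch matching the update above; the transition format (x, r, x_next) with x_next = None at the end of an episode is an assumption of this example.

```python
from collections import defaultdict

def td0_evaluate(episodes, gamma, alpha):
    """Tabular TD(0): V(X_t) <- V(X_t) + alpha * (R_t + gamma*V(X_{t+1}) - V(X_t))."""
    V = defaultdict(float)
    for episode in episodes:
        for x, r, x_next in episode:
            target = r + (gamma * V[x_next] if x_next is not None else 0.0)
            V[x] += alpha * (target - V[x])
    return V
```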
26 TD or MC? TD advantages: can be fully incremental, i.e., learn before knowing the final outcome Less memory Less peak computation learn without the final outcome From incomplete sequences MC advantage: Less harm by Markovian violations Convergence rate? Var(S(0)|X=x) decides!
27 Learning to Control with TD Q-learning [Watkins '90]: Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ max_a Q(X_{t+1},a) − Q(X_t,A_t) } Theorem: Converges to Q* [JJS'94, Tsi'94, SzeLi99] SARSA [Rummery & Niranjan '94]: A_t ~ Greedy_ε(Q,X_t); Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ Q(X_{t+1},A_{t+1}) − Q(X_t,A_t) } Off-policy (Q-learning) vs. on-policy (SARSA) Expecti-SARSA Actor-Critic [Witten '77, Barto, Sutton & Anderson '83, Sutton '84]
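The two updates side by side as code; representing Q as a dictionary keyed by (state, action) is an implementation choice for this sketch, not something prescribed by the slide.

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular action values, keyed by (state, action)

def q_learning_update(Q, x, a, r, x_next, alpha, gamma, actions):
    """Off-policy: bootstrap with max over next actions."""
    target = r + gamma * max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] += alpha * (target - Q[(x, a)])

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy: bootstrap with the action actually taken next."""
    target = r + gamma * Q[(x_next, a_next)]
    Q[(x, a)] += alpha * (target - Q[(x, a)])
```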
28 Cliffwalking ε-greedy, ε = 0.1
29 N-step TD Prediction Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
30 N-step TD Prediction Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T TD: S_t^{(1)} = R_t + γ V(X_{t+1}) Use V to estimate the remaining return n-step TD: 2-step return: S_t^{(2)} = R_t + γ R_{t+1} + γ² V(X_{t+2}) n-step return: S_t^{(n)} = R_t + γ R_{t+1} + … + γ^{n−1} R_{t+n−1} + γ^n V(X_{t+n})
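A sketch of computing the n-step return S_t^{(n)} from stored rewards and the current value estimates; when the episode ends before step t+n the return falls back to the Monte-Carlo return. The list-based data layout is an assumption of this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^(n-1)*R_{t+n-1} + gamma^n * V(X_{t+n}).

    rewards[k] is R_k and values[k] is the current estimate V(X_k); the episode
    is truncated at len(rewards), beyond which no bootstrapping is done.
    """
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:
        G += discount * values[t + n]   # bootstrap from V at step t+n
    return G
```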
31 Learning with n-step Backups Learning with n-step backups: V(X_t) ← V(X_t) + α_t ( S_t^{(n)} − V(X_t) ) n controls how much to bootstrap
32 Random Walk Examples How does 2-step TD work here? How about 3-step TD?
33 A Larger Example Task: 19 state random walk Do you think there is an optimal n? for everything?
34 Averaging N-step Returns Idea: backup an average of several returns e.g. backup half of 2-step and half of 4-step: “complex backup” One backup
35 Forward View of TD(λ) Idea: Average over multiple backups λ-return: S_t^{(λ)} = (1−λ) Σ_{n≥0} λ^n S_t^{(n+1)} TD(λ): ΔV(X_t) = α_t ( S_t^{(λ)} − V(X_t) ) Relation to TD(0) and MC: λ=0 gives TD(0), λ=1 gives MC [Sutton '88]
36 λ-return on the Random Walk Same 19-state random walk as before Why are intermediate values of λ best?
37 Backward View of TD(λ) δ_t = R_t + γ V(X_{t+1}) − V(X_t) V(x) ← V(x) + α_t δ_t e(x) e(x) ← γλ e(x) + I(x=X_t) Off-line updates: same as forward-view TD(λ) e(x): eligibility trace (accumulating trace) Replacing traces speed up convergence: e(x) ← max( γλ e(x), I(x=X_t) ) [Sutton '88, Singh & Sutton '96]
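A sketch of the backward view for one episode, with accumulating traces by default and the replacing-trace variant behind a flag; the transition format and the use of defaultdicts are assumptions of this example.

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, gamma, lam, alpha, replacing=False):
    """Backward-view TD(lambda) with accumulating (or replacing) eligibility traces.

    transitions is a list of (x, r, x_next) with x_next=None at the terminal step;
    V is a defaultdict(float) of state values, updated in place.
    """
    e = defaultdict(float)                      # eligibility traces e(x)
    for x, r, x_next in transitions:
        delta = r + (gamma * V[x_next] if x_next is not None else 0.0) - V[x]
        for s in list(e):                       # decay all traces: e(s) <- gamma*lambda*e(s)
            e[s] *= gamma * lam
        e[x] = max(e[x], 1.0) if replacing else e[x] + 1.0   # add I(x = X_t)
        for s, trace in e.items():              # update every state in proportion to its trace
            V[s] += alpha * delta * trace
    return V
```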
38 Function Approximation with TD
39 Gradient Descent Methods Assume V_t is a differentiable function of the parameter vector θ = (θ_1, …, θ_n)^T: V_t(x) = V(x;θ). Assume, for now, training examples of the form { (X_t, V^π(X_t)) }
40 Performance Measures Many are applicable but… a common and simple one is the mean-squared error (MSE) over a distribution P: Why P? Why minimize MSE? Let us assume that P is always the distribution of states at which backups are done. The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
41 Gradient Descent Let L be any function of the parameters. Its gradient at any point θ in this space is ∇_θ L(θ) = ( ∂L/∂θ_1, …, ∂L/∂θ_n )^T. Iteratively move down the gradient: θ_{t+1} = θ_t − α_t ∇_θ L(θ_t)
42 Gradient Descent in RL Function to descend on: the MSE L(θ) = Σ_x P(x) ( V^π(x) − V(x;θ) )² Gradient: ∇_θ L(θ) = −2 Σ_x P(x) ( V^π(x) − V(x;θ) ) ∇_θ V(x;θ) Gradient descent procedure: θ ← θ + α_t ( V^π(X_t) − V(X_t;θ) ) ∇_θ V(X_t;θ) Bootstrapping with S_t': θ ← θ + α_t ( S_t' − V(X_t;θ) ) ∇_θ V(X_t;θ) TD(λ) (forward view): θ ← θ + α_t ( S_t^{(λ)} − V(X_t;θ) ) ∇_θ V(X_t;θ)
43 Linear Methods Linear FAPP: V(x;θ) = θ^T φ(x), ∇_θ V(x;θ) = φ(x) Tabular representation: φ(x)_y = I(x=y) Backward view: δ_t = R_t + γ V(X_{t+1}) − V(X_t); θ ← θ + α_t δ_t e; e ← γλ e + ∇_θ V(X_t;θ) Theorem [TsiVaR'97]: V_t converges to some V s.t. ||V − V^π||_{D,2} ≤ ||V^π − ΠV^π||_{D,2} / (1−γ). [Sutton '84, '88, Tsitsiklis & Van Roy '97]
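The backward view with a linear approximator, as a sketch: the eligibility trace becomes a vector and ∇_θ V(x;θ) = φ(x). With indicator features φ(x)_y = I(x=y) this reduces to the tabular algorithm above. The feature-function interface is an assumption of this example.

```python
import numpy as np

def linear_td_lambda(episodes, phi, dim, gamma, lam, alpha):
    """TD(lambda) with a linear approximator V(x; theta) = theta^T phi(x).

    phi(x) returns a NumPy feature vector of length dim; each episode is a list
    of (x, r, x_next) transitions with x_next=None at the terminal step.
    """
    theta = np.zeros(dim)
    for episode in episodes:
        e = np.zeros(dim)                          # eligibility trace vector
        for x, r, x_next in episode:
            v = theta @ phi(x)
            v_next = theta @ phi(x_next) if x_next is not None else 0.0
            delta = r + gamma * v_next - v
            e = gamma * lam * e + phi(x)           # e <- gamma*lambda*e + grad V(x;theta)
            theta += alpha * delta * e
    return theta
```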
44 Control with FA Learning state-action values Training examples of the form { ((X_t,A_t), target_t) } The general gradient-descent rule: θ ← θ + α_t ( target_t − Q(X_t,A_t;θ) ) ∇_θ Q(X_t,A_t;θ) Gradient-descent Sarsa(λ) [Rummery & Niranjan '94]
45 Mountain-Car Task [Sutton ’96], [Singh & Sutton ’96]
46 Mountain-Car Results
47 Baird’s Counterexample: Off-policy Updates Can Diverge [Baird ’95]
48 Baird’s Counterexample Cont.
49 Should We Bootstrap?
50 Batch Reinforcement Learning
51 Batch RL Goal: Given the trajectory of the behavior policy π_b, X_1,A_1,R_1, …, X_t,A_t,R_t, …, X_N, compute a good policy! "Batch learning" Properties: Data collection is not influenced; emphasis is on the quality of the solution; computational complexity plays a secondary role Performance measures: ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)| = sup_x ( V*(x) − V^π(x) ) ||V* − V^π||_2 = ( ∫ (V*(x) − V^π(x))² dμ(x) )^{1/2}
52 Solution methods Build a model Do not build a model, but find an approximation to Q*: using value iteration => fitted Q-iteration; using policy iteration => policy evaluated by approximate value iteration, by Bellman-residual minimization (BRM), or by least-squares temporal difference learning (LSTD) => LSPI Policy search [Bradtke, Barto '96], [Lagoudakis, Parr '03], [AnSzeMu '07]
53 Evaluating a policy: Fitted value iteration Choose a function space F. Solve for m=1,2,…,M the LS (regression) problems: Q_{m+1} = argmin_{f∈F} Σ_t ( f(X_t,A_t) − [ R_t + γ Q_m(X_{t+1}, π(X_{t+1})) ] )² Counterexamples?!?!? [Baird '95, Tsitsiklis and Van Roy '96] When does this work?? Requirement: If M is big enough and the number of samples is big enough, Q_M should be close to Q^π We have to make some assumptions on F
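A sketch of fitted policy evaluation as described on the slide, assuming a generic least-squares regressor with fit/predict methods standing in for the function space F; the (x,a) feature encoding and the regressor interface are assumptions of this example, not part of the slides.

```python
import numpy as np

def fitted_policy_evaluation(data, pi, make_regressor, gamma, M):
    """Approximate Q^pi by M rounds of regression onto targets R + gamma * Q_m(X', pi(X')).

    data: list of transitions (x, a, r, x_next), with x a 1-D array and a a number;
    pi(x) returns the action of the evaluated policy; make_regressor() returns a
    fresh regressor from F with fit(X, y) / predict(X) methods (assumed interface).
    """
    X = np.array([np.append(x, a) for x, a, _, _ in data])   # features are (x, a) pairs
    Q = None
    for _ in range(M):
        targets = []
        for x, a, r, x_next in data:
            if Q is None:
                targets.append(r)                             # Q_0 = 0, so target is just R_t
            else:
                a_next = pi(x_next)
                q_next = Q.predict(np.append(x_next, a_next).reshape(1, -1))[0]
                targets.append(r + gamma * q_next)
        Q = make_regressor()
        Q.fit(X, np.array(targets))                           # least-squares fit within F
    return Q
```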
54 Least-squares vs. gradient Linear least squares (ordinary regression): y_t = w_*^T x_t + ε_t, (x_t,y_t) jointly distributed r.v.s, iid, E[ε_t|x_t] = 0. Seeing (x_t,y_t), t=1,…,T, find out w_*. Loss function: L(w) = E[ (y_1 − w^T x_1)² ]. Least-squares approach: w_T = argmin_w Σ_{t=1}^T (y_t − w^T x_t)² Stochastic gradient method: w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t Tradeoffs: Sample complexity: how good is the estimate? Computational complexity: how expensive is the computation?
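A small numerical sketch of the two approaches on synthetic data; the decaying step size used for the stochastic gradient is one common choice, not one taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)      # y_t = w*^T x_t + eps_t

# Batch least squares: solve argmin_w sum_t (y_t - w^T x_t)^2 in one shot.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: one cheap update per sample, w <- w + alpha_t*(y_t - w^T x_t)*x_t.
w_sgd = np.zeros(d)
for t in range(T):
    alpha = 1.0 / (10.0 + t)                   # decaying step size (Robbins-Monro conditions)
    w_sgd += alpha * (y[t] - w_sgd @ X[t]) * X[t]
```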
55 Fitted value iteration: Analysis Goal: Bound ||Q_M − Q^π||²_μ in terms of max_m ||ε_m||²_ν, where ||ε_m||²_ν = ∫ ε_m²(x,a) ν(dx,da), Q_{m+1} = T^π Q_m + ε_m, ε_{−1} = Q_0 − Q^π, U_m = Q_m − Q^π After [AnSzeMu '07]
56 Analysis/2
57 Summary If the regression errors are all small and the system is noisy (∀π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small. How to make the regression errors small? Regression error decomposition: Approximation error Estimation error
58 Controlling the approximation error
59 Controlling the approximation error
60 Controlling the approximation error
61 Controlling the approximation error Assume smoothness!
62 Learning with (lots of) historical data Data: A long trajectory of some exploration policy Goal: Efficient algorithm to learn a policy Idea: Use fitted action-values Algorithms: Bellman residual minimization, FQI [AnSzeMu '07]; LSPI [Lagoudakis, Parr '03] Bounds: Oracle inequalities (BRM, FQI and LSPI) ⇒ consistency
63 BRM insight TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t,A_t) Bellman error: E[ E[δ_t | X_t,A_t]² ] What we can compute/estimate: E[ E[δ_t² | X_t,A_t] ] They are different! However: [AnSzeMu '07]
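The gap between the two quantities is the usual conditional bias-variance identity; the sketch below states it with the slide's notation (the auxiliary-function correction used in [AnSzeMu '07] is only alluded to here):

```latex
\mathbb{E}\!\left[\delta_t^2 \mid X_t, A_t\right]
  = \underbrace{\bigl(\mathbb{E}[\delta_t \mid X_t, A_t]\bigr)^2}_{\text{squared Bellman error}}
  + \underbrace{\operatorname{Var}\!\left(\delta_t \mid X_t, A_t\right)}_{\text{conditional variance of the TD error}}
```

So minimizing the empirical average of δ_t² also penalizes the conditional variance of the TD error, which is why the naive squared TD error is a biased estimate of the Bellman error and why the loss has to be modified.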
64 Loss function
65 Algorithm (BRM++)
66 Do we need to reweight or throw away data? NO! WHY? Intuition from regression: m(x) = E[Y|X=x] can be learnt no matter what p(x) is! π*(a|x): the same should be possible! BUT.. performance might be poor! => YES! Like in supervised learning when training and test distributions are different
67 Bound
68 The concentration coefficients Lyapunov exponents Our case: y_t is infinite dimensional P_t depends on the policy chosen If the top Lyapunov exponent ≤ 0, we are good
69 Open question Abstraction: Let True?
70 Relation to LSTD LSTD: Linear function space Bootstrap the “normal equation” [AnSzeMu ’07]
71 Open issues Adaptive algorithms to take advantage of regularity when present to address the “curse of dimensionality” Penalized least-squares/aggregation? Feature relevance Factorization Manifold estimation Abstraction – build automatically Active learning Optimal on-line learning for infinite problems
72 References
[Auer et al. '02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[AuSzeMu '07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
[Auer, Jaksch & Ortner '07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning. 2007, available at http://www.unileoben.ac.at/~infotech/publications/ucrlrevised.pdf
[Singh & Sutton '96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[Sutton '88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.
[Jaakkola et al. '94] T. Jaakkola, M.I. Jordan, and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
[Tsitsiklis '94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017–2059, 1999.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
[Rummery & Niranjan '94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[Sutton '84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[Tsitsiklis & Van Roy '97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
[Sutton '96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
[Baird '95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
[Bradtke, Barto '96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[Lagoudakis, Parr '03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[AnSzeMu '07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 2007.