
1 Reinforcement Learning: Learning Algorithms. Csaba Szepesvári, University of Alberta. Kioloa, MLSS’08. Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

2 Contents
- Defining the problem(s)
- Learning optimally
- Learning a good policy
  - Monte-Carlo
  - Temporal Difference (bootstrapping)
  - Batch: fitted value iteration and relatives

3 The Learning Problem
- The MDP is unknown, but the agent can interact with the system
- Goals:
  - Learn an optimal policy
    - Where do the samples come from?
      - Samples are generated externally
      - The agent interacts with the system to get the samples ("active learning")
    - Performance measure: what is the performance of the policy obtained?
  - Learn optimally: minimize regret while interacting with the system
    - Performance measure: loss in rewards due to not using the optimal policy from the beginning
    - Exploration vs. exploitation

4 Learning from Feedback
- A protocol for prediction problems:
  - x_t: situation (observed by the agent)
  - y_t ∈ Y: value to be predicted
  - p_t ∈ Y: predicted value (can depend on all past values => learning!)
  - r_t(x_t, y_t, y): value of predicting y
  - loss of the learner: ℓ_t = r_t(x_t, y_t, y_t) - r_t(x_t, y_t, p_t)
- Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·)
  - Regression: r_t(x_t, y_t, y) = -(y - y_t)^2, so ℓ_t = (y_t - p_t)^2
- Full-information prediction problem: for all y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
- Bandit (partial-information) problem: only r_t(x_t, p_t) is communicated to the agent

5 Learning Optimally
- Explore or exploit?
- Bandit problems
  - Simple schemes
  - Optimism in the face of uncertainty (OFU)
    - UCB
- Learning optimally in MDPs with the OFU principle

6 Learning Optimally: Exploration vs. Exploitation
- Two treatments
- Unknown success probabilities
- Goal: find the best treatment while losing few patients
- Explore or exploit?

7 Exploration vs. Exploitation: Some Applications
- Simple processes:
  - Clinical trials
  - Job shop scheduling (random jobs)
  - What ad to put on a web page
- More complex processes (memory):
  - Optimizing production
  - Controlling an inventory
  - Optimal investment
  - Poker, ...

8 Bernoulli Bandits
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), ...
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), ...
- [Figure: example 0/1 payoff sequences for the two arms]

9 Some Definitions
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), ...
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), ...
- Now: t = 9, T_1(t-1) = 4, T_2(t-1) = 4, A_1 = 1, A_2 = 2, ...
  (T_k(t) = number of pulls of arm k up to time t; A_t = arm chosen at time t)

10 The Exploration/Exploitation Dilemma
- Action values: Q*(a) = E[R_t(a)]
- Suppose you form estimates Q_t(a) ≈ Q*(a)
- The greedy action at time t is: A_t* = argmax_a Q_t(a)
- Exploitation: the agent chooses to follow A_t*
- Exploration: the agent chooses to do something else
- You can’t exploit all the time; you can’t explore all the time
- You can never stop exploring; but you should always reduce exploring. Maybe.

11 Action-Value Methods
- Methods that adapt action-value estimates and nothing else
- How to estimate action values?
- Sample average: Q_t(a) = ( R_1(a) + ... + R_{N_t(a)}(a) ) / N_t(a)
- Claim: Q_t(a) → Q*(a) if N_t(a) → ∞
- Why?

12 ε-Greedy Action Selection
- Greedy action selection: A_t = A_t* = argmax_a Q_t(a)
- ε-greedy: A_t = A_t* with probability 1 - ε, and a uniformly random action with probability ε
- ... the simplest way to "balance" exploration and exploitation
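
A minimal sketch of ε-greedy selection with sample-average estimates (not from the slides; the array names and the bandit-style usage are illustrative):

```python
import numpy as np

def epsilon_greedy(q_est, epsilon, rng):
    """Pick a uniformly random action w.p. epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_est)))
    return int(np.argmax(q_est))

def update_sample_average(q_est, counts, a, reward):
    """Maintain the sample-average estimate Q_t(a) for the chosen arm."""
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]

# usage sketch
rng = np.random.default_rng(0)
q_est, counts = np.zeros(10), np.zeros(10)
a = epsilon_greedy(q_est, 0.1, rng)
```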

13 10-Armed Testbed
- n = 10 possible actions
- Repeat 2000 times:
  - Q*(a) ~ N(0,1)
  - Play 1000 rounds
  - R_t(a) ~ N(Q*(a), 1)
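
A sketch of the testbed experiment as described on the slide (Q*(a) ~ N(0,1), R_t(a) ~ N(Q*(a),1)); the function name and the choice of reporting average reward per step are illustrative, and fewer runs can be used to keep it fast:

```python
import numpy as np

def run_testbed(epsilon, n_arms=10, n_runs=2000, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, n_arms)        # Q*(a) ~ N(0,1)
        q_est, counts = np.zeros(n_arms), np.zeros(n_arms)
        for t in range(n_steps):
            if rng.random() < epsilon:
                a = int(rng.integers(n_arms))
            else:
                a = int(np.argmax(q_est))
            r = rng.normal(q_true[a], 1.0)           # R_t(a) ~ N(Q*(a),1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # sample-average update
            avg_reward[t] += r
    return avg_reward / n_runs

# e.g. compare run_testbed(0.0), run_testbed(0.01), run_testbed(0.1)
```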

14 ε-Greedy Methods on the 10-Armed Testbed

15 Softmax Action Selection
- Problem with ε-greedy: it neglects the action values (all non-greedy actions are explored equally)
- Softmax idea: grade action probabilities by the estimated values
- Gibbs (Boltzmann) action selection, or exponential weights:
  π_t(a) = exp( Q_t(a)/τ_t ) / Σ_b exp( Q_t(b)/τ_t )
  where τ_t is the "computational temperature"
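
A minimal softmax (Boltzmann) selection sketch; the temperature argument tau corresponds to τ_t on the slide, everything else is illustrative:

```python
import numpy as np

def softmax_action(q_est, tau, rng):
    """Sample an action with probability proportional to exp(Q_t(a)/tau)."""
    prefs = np.asarray(q_est, dtype=float) / tau
    prefs -= prefs.max()                 # stabilize the exponentials
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```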

16 Incremental Implementation
- Sample average: Q_n = ( R_1 + ... + R_n ) / n
- Incremental computation: Q_{n+1} = Q_n + (1/(n+1)) ( R_{n+1} - Q_n )
- Common update rule form: NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
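
A small numerical check (illustrative, not from the slides) that the incremental rule with step size 1/(n+1) reproduces the batch sample average:

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(0.5, 1.0, size=100)

estimate = 0.0
for n, target in enumerate(rewards):
    step_size = 1.0 / (n + 1)
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    estimate += step_size * (target - estimate)

assert np.isclose(estimate, rewards.mean())
```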

17 UCB: Upper Confidence Bounds
- Principle: optimism in the face of uncertainty
- Works when the environment is not adversarial
- Assume rewards are in [0,1]. With p > 2, choose
  A_t = argmax_a [ Q_t(a) + ( p log t / (2 T_a(t)) )^{1/2} ]
- For a stationary environment with iid rewards this algorithm is hard to beat!
- Formally: the regret in T steps is O(log T) [Auer et al. ’02]
- Improvement: estimate the variance and use it in place of p [AuSzeMu ’07]
- This principle can be used for achieving small regret in the full RL problem!
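
A sketch of index-based UCB play on Bernoulli arms using the bonus sqrt(p log t / (2 T_a(t))) with p > 2 as reconstructed above; the bandit instance, horizon and value of p are made up for illustration:

```python
import numpy as np

def ucb(means, horizon=10_000, p=2.5, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts, q_est = np.zeros(k), np.zeros(k)
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1                              # play each arm once first
        else:
            bonus = np.sqrt(p * np.log(t) / (2.0 * counts))
            a = int(np.argmax(q_est + bonus))      # optimism in the face of uncertainty
        r = float(rng.random() < means[a])         # Bernoulli reward in {0,1}
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]
    return counts, q_est

counts, q_est = ucb([0.4, 0.6])   # the suboptimal arm ends up pulled only O(log T) times
```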

18 UCRL2: UCB Applied to RL [Auer, Jaksch & Ortner ’07]
- Algorithm UCRL2(δ):
- Phase initialization:
  - Estimate the mean model p_0 using maximum likelihood (counts)
  - Confidence set: C := { p : ||p(·|x,a) - p_0(·|x,a)|| <= c ( |X| log(|A|T/δ) / N(x,a) )^{1/2} }
  - p' := argmax_{p ∈ C} ρ*(p), π := π*(p')  (optimistic model and its optimal policy)
  - N_0(x,a) := N(x,a) for all (x,a) ∈ X × A
- Execution:
  - Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase

19 UCRL2 Results
- Def: diameter of an MDP M: D(M) = max_{x,y} min_π E[ T(x → y; π) ]
- Regret bounds:
  - Lower bound: E[L_T] = Ω( ( D |X| |A| T )^{1/2} )
  - Upper bounds:
    - w.p. 1 - δ/T: L_T <= O( D |X| ( |A| T log(|A|T/δ) )^{1/2} )
    - w.p. 1 - δ: L_T <= O( D^2 |X|^2 |A| log(|A|T/δ) / Δ ), where Δ is the performance gap between the best and the second-best policy

20 Learning a Good Policy
- Monte-Carlo methods
- Temporal Difference methods
  - Tabular case
  - Function approximation
- Batch learning

21 Learning a Good Policy
- Model-based learning
  - Learn p, r
  - "Solve" the resulting MDP
- Model-free learning
  - Learn the optimal action-value function and (then) act greedily
  - Actor-critic learning
  - Policy gradient methods
- Hybrid
  - Learn a model and mix planning and a model-free method; e.g. Dyna

22 Monte-Carlo Methods
- Episodic MDPs!
- Goal: learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]
- (X_t, A_t, R_t): trajectory of π
- Visits to a state x:
  - f(x) = min { t : X_t = x }   (first visit)
  - E(x) = { t : X_t = x }       (every visit)
- Return from time t: S(t) = γ^0 R_t + γ^1 R_{t+1} + ...
- K independent trajectories: S^(k), E^(k), f^(k), k = 1..K
- First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
- Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
- Claim: both converge to V^π(·)
- From now on S_t = S(t)
[Singh & Sutton ’96]
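
A first-visit Monte-Carlo evaluation sketch; the episode format (a list of (state, reward) pairs per trajectory) is an assumption made for illustration:

```python
import numpy as np
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V^pi(x) by averaging first-visit returns over episodes."""
    returns = defaultdict(list)
    for episode in episodes:                       # episode = [(x_0, R_0), (x_1, R_1), ...]
        g, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):    # S(t) = R_t + gamma*R_{t+1} + ...
            g = episode[t][1] + gamma * g
            rets[t] = g
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:                      # first visit only
                seen.add(x)
                returns[x].append(rets[t])
    return {x: float(np.mean(v)) for x, v in returns.items()}
```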

23 Learning to Control with MC
- Goal: learn to behave optimally
- Method: learn Q^π(x,a) ... to be used in an approximate policy iteration (PI) algorithm
- Idea/algorithm:
  - Add randomness
    - Goal: all actions are sampled eventually infinitely often
    - e.g., ε-greedy or exploring starts
  - Use the first-visit or the every-visit method to estimate Q^π(x,a)
  - Update the policy
    - once the values have converged, or
    - always, at the states visited

24 Monte-Carlo: Evaluation
- Convergence rate: Var( S(0) | X = x ) / N
- Advantages over DP:
  - Learn from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm from violations of the Markov property (no bootstrapping)
- Issue: maintaining sufficient exploration
  - exploring starts, soft policies

25 Temporal Difference Methods
- Every-visit Monte-Carlo: V(X_t) <- V(X_t) + α_t(X_t) ( S_t - V(X_t) )
- Bootstrapping:
  - S_t = R_t + γ S_{t+1}
  - S_t' = R_t + γ V(X_{t+1})
- TD(0): V(X_t) <- V(X_t) + α_t(X_t) ( S_t' - V(X_t) )
- Value iteration: V(X_t) <- E[ S_t' | X_t ]
- Theorem: let V_t be the sequence of functions generated by TD(0). Assume that for all x, w.p.1, Σ_t α_t(x) = ∞ and Σ_t α_t^2(x) < ∞. Then V_t → V^π w.p.1. [Jaakkola et al. ’94], [Tsitsiklis ’94], [SzeLi99]
- Proof idea: stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π
[Samuel ’59], [Holland ’75], [Sutton ’88]
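
A tabular TD(0) sketch for evaluating a fixed policy; the env/policy interfaces (env.reset() -> x, env.step(a) -> (x', r, done), policy(x, rng) -> a) and the constant step size are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def td0(env, policy, gamma, alpha=0.1, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    V = defaultdict(float)
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            a = policy(x, rng)
            x_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[x_next])   # S_t' = R_t + gamma*V(X_{t+1})
            V[x] += alpha * (target - V[x])                     # TD(0) update
            x = x_next
    return V
```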

26 TD or MC?
- TD advantages:
  - Can be fully incremental, i.e., learn before knowing the final outcome
    - less memory, less peak computation
  - Learn without the final outcome
    - from incomplete sequences
- MC advantage:
  - Less harm from violations of the Markov property
- Convergence rate? Var( S(0) | X = x ) decides!

27 Learning to Control with TD
- Q-learning [Watkins ’90]:
  Q(X_t,A_t) <- Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ max_a Q(X_{t+1},a) - Q(X_t,A_t) }
- Theorem: converges to Q* [JJS’94, Tsi’94, SzeLi99]
- SARSA [Rummery & Niranjan ’94]:
  A_t ~ Greedy_ε(Q, X_t)
  Q(X_t,A_t) <- Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ Q(X_{t+1},A_{t+1}) - Q(X_t,A_t) }
- Off-policy (Q-learning) vs. on-policy (SARSA)
- Expected SARSA
- Actor-critic [Witten ’77], [Barto, Sutton & Anderson ’83], [Sutton ’84]
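
A tabular Q-learning sketch with ε-greedy exploration; SARSA would differ only in that the target bootstraps on Q[x_next][a_next] for the action actually taken next. The env interface and hyperparameters are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, gamma, alpha=0.1, epsilon=0.1,
               n_episodes=5000, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[x]))
            x_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[x_next]))  # off-policy (max) backup
            Q[x][a] += alpha * (target - Q[x][a])
            x = x_next
    return Q
```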

28 Cliffwalking
- ε-greedy, ε = 0.1

29 N-step TD Prediction
- Idea: look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)

30 N-step TD Prediction
- Monte Carlo: S_t = R_t + γ R_{t+1} + ... + γ^{T-t} R_T
- TD: S_t^(1) = R_t + γ V(X_{t+1})
  - uses V to estimate the remaining return
- n-step TD:
  - 2-step return: S_t^(2) = R_t + γ R_{t+1} + γ^2 V(X_{t+2})
  - n-step return: S_t^(n) = R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n V(X_{t+n})
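
A small helper computing the n-step return S_t^(n) from a recorded episode; the data layout (states of length T+1 including the terminal state, rewards of length T, a dict V) is an illustrative assumption:

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^(n-1)*R_{t+n-1} + gamma^n*V(X_{t+n})."""
    T = len(rewards)
    g = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:                        # bootstrap only if the episode has not ended by t+n
        g += gamma ** n * V[states[t + n]]
    return g
```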

31 Learning with n-step Backups
- Learning with n-step backups: V(X_t) <- V(X_t) + α_t ( S_t^(n) - V(X_t) )
- n controls how much to bootstrap

32 Random Walk Examples
- How does 2-step TD work here?
- How about 3-step TD?

33 A Larger Example
- Task: 19-state random walk
- Do you think there is an optimal n? For everything?

34 Averaging N-step Returns
- Idea: back up an average of several returns
  - e.g., back up half of the 2-step and half of the 4-step return
- a "complex backup"

35 Forward View of TD(λ)
- Idea: average over multiple n-step backups
- λ-return: S_t^(λ) = (1 - λ) Σ_{n>=0} λ^n S_t^(n+1)
- TD(λ): ΔV(X_t) = α_t ( S_t^(λ) - V(X_t) )
- Relation to TD(0) and MC:
  - λ = 0 => TD(0)
  - λ = 1 => MC
[Sutton ’88]

36 λ-return on the Random Walk
- Same 19-state random walk as before
- Why are intermediate values of λ best?

37 Backward View of TD(λ)
- δ_t = R_t + γ V(X_{t+1}) - V(X_t)
- V(x) <- V(x) + α_t δ_t e(x)
- e(x) <- γ λ e(x) + I(x = X_t)
- Off-line updates: same as the forward-view TD(λ)
- e(x): eligibility trace
  - accumulating trace (above)
  - replacing traces speed up convergence: e(x) <- max( γ λ e(x), I(x = X_t) )
[Sutton ’88, Singh & Sutton ’96]
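
A tabular backward-view TD(λ) sketch with accumulating traces, using the same illustrative env/policy interfaces as the TD(0) sketch above:

```python
import numpy as np
from collections import defaultdict

def td_lambda(env, policy, gamma, lam, alpha=0.1, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                    # eligibility traces
        x, done = env.reset(), False
        while not done:
            a = policy(x, rng)
            x_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[x_next]) - V[x]
            e[x] += 1.0                           # accumulating trace; set e[x] = 1.0 for replacing traces
            for s in list(e.keys()):
                V[s] += alpha * delta * e[s]
                e[s] *= gamma * lam
            x = x_next
    return V
```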

38 Function Approximation with TD

39 Gradient Descent Methods
- Assume V_t is a differentiable function of the parameter vector θ_t = ( θ_t(1), ..., θ_t(n) )^T: V_t(x) = V(x; θ_t)
- Assume, for now, training examples of the form { (X_t, V^π(X_t)) }

40 Performance Measures
- Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
  MSE(θ) = Σ_x P(x) ( V^π(x) - V(x; θ) )^2
- Why P? Why minimize MSE?
- Let us assume that P is always the distribution of states at which backups are done.
- The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

41 Gradient Descent
- Let L be any function of the parameters θ. Its gradient at any point θ is:
  ∇_θ L(θ) = ( ∂L/∂θ(1), ..., ∂L/∂θ(n) )^T
- Iteratively move down the gradient:
  θ_{t+1} = θ_t - α_t ∇_θ L(θ_t)

42 Gradient Descent in RL
- Function to descend on: L(θ) = Σ_x P(x) ( V^π(x) - V(x; θ) )^2
- Gradient: ∇_θ L(θ) = -2 Σ_x P(x) ( V^π(x) - V(x; θ) ) ∇_θ V(x; θ)
- Gradient descent procedure (sampled): θ_{t+1} = θ_t + α_t ( V^π(X_t) - V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)
- Bootstrapping with S_t': θ_{t+1} = θ_t + α_t ( S_t' - V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)
- TD(λ) (forward view): θ_{t+1} = θ_t + α_t ( S_t^(λ) - V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)

43 Linear Methods
- Linear function approximation: V(x; θ) = θ^T φ(x)
- ∇_θ V(x; θ) = φ(x)
- Tabular representation: φ(x)_y = I(x = y)
- Backward view:
  δ_t = R_t + γ V(X_{t+1}) - V(X_t)
  θ <- θ + α_t δ_t e
  e <- γ λ e + ∇_θ V(X_t; θ)
- Theorem [TsiVaR’97]: V_t converges to a V such that ||V - V^π||_{D,2} <= ||V^π - Π V^π||_{D,2} / (1 - γ)
[Sutton ’84, ’88, Tsitsiklis & Van Roy ’97]
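
A sketch of backward-view TD(λ) with linear function approximation; the feature map features(x) returning a length-dim vector and the env/policy interfaces are illustrative assumptions:

```python
import numpy as np

def linear_td_lambda(env, policy, features, dim, gamma, lam,
                     alpha=0.01, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)                       # V(x; theta) = theta^T phi(x)
    for _ in range(n_episodes):
        e = np.zeros(dim)                       # eligibility trace vector
        x, done = env.reset(), False
        phi = features(x)
        while not done:
            a = policy(x, rng)
            x_next, r, done = env.step(a)
            phi_next = np.zeros(dim) if done else features(x_next)
            delta = r + gamma * phi_next @ theta - phi @ theta
            e = gamma * lam * e + phi           # grad_theta V(x; theta) = phi(x)
            theta += alpha * delta * e
            x, phi = x_next, phi_next
    return theta
```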

44 Control with FA
- Learning state-action values
  - Training examples of the form { ((X_t, A_t), target U_t) }
- The general gradient-descent rule:
  θ_{t+1} = θ_t + α_t ( U_t - Q(X_t, A_t; θ_t) ) ∇_θ Q(X_t, A_t; θ_t)
- Gradient-descent SARSA(λ):
  δ_t = R_t + γ Q(X_{t+1}, A_{t+1}; θ_t) - Q(X_t, A_t; θ_t)
  e_t = γ λ e_{t-1} + ∇_θ Q(X_t, A_t; θ_t)
  θ_{t+1} = θ_t + α_t δ_t e_t
[Rummery & Niranjan ’94]

45 Mountain-Car Task [Sutton ’96], [Singh & Sutton ’96]

46 Mountain-Car Results

47 Baird’s Counterexample: Off-policy Updates Can Diverge [Baird ’95]

48 Baird’s Counterexample (cont.)

49 Should We Bootstrap?

50 Batch Reinforcement Learning

51 Batch RL
- Goal: given a trajectory of the behavior policy π_b,
  X_1, A_1, R_1, ..., X_t, A_t, R_t, ..., X_N,
  compute a good policy!
- "Batch learning"
- Properties:
  - Data collection is not influenced
  - Emphasis is on the quality of the solution
  - Computational complexity plays a secondary role
- Performance measures:
  - ||V* - V^π||_∞ = sup_x |V*(x) - V^π(x)| = sup_x ( V*(x) - V^π(x) )
  - ||V* - V^π||_2 = ( ∫ ( V*(x) - V^π(x) )^2 dμ(x) )^{1/2}

52 Solution Methods
- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration => fitted Q-iteration
  - using policy iteration =>
    - policy evaluated by approximate value iteration
    - policy evaluated by Bellman-residual minimization (BRM)
    - policy evaluated by least-squares temporal difference learning (LSTD) => LSPI
- Policy search
[Bradtke, Barto ’96], [Lagoudakis, Parr ’03], [AnSzeMu ’07]

53 Evaluating a Policy: Fitted Value Iteration
- Choose a function space F.
- Solve, for i = 1, 2, ..., M, the LS (regression) problems:
  Q_{i+1} = argmin_{Q ∈ F} Σ_t ( R_t + γ Q_i(X_{t+1}, π(X_{t+1})) - Q(X_t, A_t) )^2
- Counterexamples?!? [Baird ’95, Tsitsiklis and van Roy ’96]
- When does this work?
- Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π
- We have to make some assumptions on F
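
A sketch of the fitted policy-evaluation loop with a linear function class F = { Q(x,a) = θ^T φ(x,a) }; the batch layout (X, A, R, X_next arrays), the feature map phi and the evaluated policy pi are illustrative assumptions (terminal-state handling is omitted):

```python
import numpy as np

def fitted_policy_eval(batch, phi, pi, dim, gamma, M=50):
    X, A, R, X_next = batch                                    # N transitions from the behavior policy
    theta = np.zeros(dim)
    Phi = np.stack([phi(x, a) for x, a in zip(X, A)])          # N x dim design matrix
    Phi_next = np.stack([phi(xn, pi(xn)) for xn in X_next])    # features of (X_{t+1}, pi(X_{t+1}))
    for _ in range(M):
        targets = R + gamma * Phi_next @ theta                 # R_t + gamma*Q_i(X_{t+1}, pi(X_{t+1}))
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # least-squares regression step
    return theta                                               # Q_M(x,a) ~= theta @ phi(x,a)
```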

54 Least-Squares vs. Gradient
- Linear least squares (ordinary regression): y_t = w*^T x_t + ε_t,
  with (x_t, y_t) jointly distributed r.v.s, iid, E[ε_t | x_t] = 0.
- Seeing (x_t, y_t), t = 1, ..., T, find out w*.
- Loss function: L(w) = E[ (y_1 - w^T x_1)^2 ].
- Least-squares approach: w_T = argmin_w Σ_{t=1..T} ( y_t - w^T x_t )^2
- Stochastic gradient method: w_{t+1} = w_t + α_t ( y_t - w_t^T x_t ) x_t
- Trade-offs:
  - Sample complexity: how good is the estimate?
  - Computational complexity: how expensive is the computation?
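
An illustrative comparison of the batch least-squares solution and the stochastic-gradient iterate on synthetic data generated as on the slide; the dimensions, noise level and step-size schedule are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)        # y_t = w*^T x_t + eps_t

# Batch least squares: w_T = argmin_w sum_t (y_t - w^T x_t)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: w_{t+1} = w_t + alpha_t (y_t - w_t^T x_t) x_t
w_sgd = np.zeros(d)
for t in range(T):
    alpha_t = 1.0 / (t + 10.0)                   # decaying step size
    w_sgd += alpha_t * (y[t] - w_sgd @ X[t]) * X[t]

print(np.linalg.norm(w_ls - w_star), np.linalg.norm(w_sgd - w_star))
```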

55 Fitted Value Iteration: Analysis
- Goal: bound ||Q_M - Q^π||_{μ,2} in terms of max_m ||ε_m||_{ν,2}, where
  ||ε_m||_{ν,2} = ( ∫ ε_m^2(x,a) ν(dx,da) )^{1/2},
  Q_{m+1} = T^π Q_m + ε_m, and ε_{-1} = Q_0 - Q^π
- U_m = Q_m - Q^π
After [AnSzeMu ’07]

56 Analysis/2

57 Summary
- If the regression errors are all small and the system is noisy (for all π and ρ: ρ P^π <= C_1 ν), then the final error will be small.
- How to make the regression errors small?
- Regression error decomposition:
  - approximation error
  - estimation error

58 Controlling the Approximation Error

59 Controlling the Approximation Error

60 Controlling the Approximation Error

61 Controlling the Approximation Error
- Assume smoothness!

62 Learning with (Lots of) Historical Data
- Data: a long trajectory of some exploration policy
- Goal: an efficient algorithm to learn a policy
- Idea: use fitted action values
- Algorithms:
  - Bellman-residual minimization (BRM), FQI [AnSzeMu ’07]
  - LSPI [Lagoudakis, Parr ’03]
- Bounds: oracle inequalities (BRM, FQI and LSPI) => consistency

63 BRM Insight
- TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) - Q(X_t, A_t)
- Bellman error: E[ E[δ_t | X_t, A_t]^2 ]
- What we can compute/estimate: E[ E[δ_t^2 | X_t, A_t] ]
- They are different!
- However: E[δ_t^2 | X_t, A_t] = ( E[δ_t | X_t, A_t] )^2 + Var[δ_t | X_t, A_t], so the gap is exactly the conditional variance of the TD error, which the modified BRM loss (next slides) corrects for. [AnSzeMu ’07]

64 Loss Function

65 Algorithm (BRM++)

66 Do We Need to Reweight or Throw Away Data?
- NO!
- WHY?
  - Intuition from regression: m(x) = E[Y | X = x] can be learnt no matter what p(x) is!
  - π*(a|x): the same should be possible!
- BUT... performance might be poor! => YES!
  - Like in supervised learning when the training and test distributions are different

67 Bound

68 The Concentration Coefficients
- Lyapunov exponents
- Our case:
  - y_t is infinite dimensional
  - P_t depends on the policy chosen
  - If the top Lyapunov exponent is <= 0, we are good

69 Open Question
- Abstraction: Let ...
- True?

70 Relation to LSTD [AnSzeMu ’07]
- LSTD:
  - Linear function space
  - Bootstrap the "normal equation"

71 Open Issues
- Adaptive algorithms to take advantage of regularity when present, to address the "curse of dimensionality"
  - Penalized least-squares / aggregation?
  - Feature relevance
  - Factorization
  - Manifold estimation
- Abstraction: build automatically
- Active learning
- Optimal on-line learning for infinite problems

72 References
- [Auer et al. ’02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- [AuSzeMu ’07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
- [Auer, Jaksch & Ortner ’07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning. 2007. Available at http://www.unileoben.ac.at/~infotech/publications/ucrlrevised.pdf
- [Singh & Sutton ’96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
- [Sutton ’88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.
- [Jaakkola et al. ’94] T. Jaakkola, M.I. Jordan, and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
- [Tsitsiklis ’94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
- [SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017–2059, 1999.
- [Watkins ’90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
- [Rummery & Niranjan ’94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
- [Sutton ’84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
- [Tsitsiklis & Van Roy ’97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
- [Sutton ’96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
- [Baird ’95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
- [Bradtke, Barto ’96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
- [Lagoudakis, Parr ’03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
- [AnSzeMu ’07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 2007.

