Schedule for presentations. 6.1: Chris? – The agent is driving home from work from a new work location, but enters the freeway from the same point. Thus,

Slides:

Advertisements

Similar presentations

Reinforcement Learning

Advertisements

Lecture 18: Temporal-Difference Learning

Programming exercises: Angel – lms.wsu.edu – Submit via zip or tar – Write-up, Results, Code Doodle: class presentations Student Responses First visit.

Reinforcement Learning

TEMPORAL DIFFERENCE LEARNING Mark Romero – 11/03/2011.

Monte-Carlo Methods Learning methods averaging complete episodic returns Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]

Ai in game programming it university of copenhagen Reinforcement Learning [Outro] Marco Loog.

Reinforcement learning (Chapter 21)

Effective Reinforcement Learning for Mobile Robots Smart, D.L and Kaelbing, L.P.

1 Monte Carlo Methods Week #5. 2 Introduction Monte Carlo (MC) Methods –do not assume complete knowledge of environment (unlike DP methods which assume.

1 Reinforcement Learning Introduction & Passive Learning Alan Fern * Based in part on slides by Daniel Weld.

1 Temporal-Difference Learning Week #6. 2 Introduction Temporal-Difference (TD) Learning –a combination of DP and MC methods updates estimates based on.

The loss function, the normal equation,

Infinite Horizon Problems

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction From Sutton & Barto Reinforcement Learning An Introduction.

Reinforcement learning

Università di Milano-Bicocca Laurea Magistrale in Informatica Corso di APPRENDIMENTO E APPROSSIMAZIONE Lezione 6 - Reinforcement Learning Prof. Giancarlo.

Reinforcement Learning Mitchell, Ch. 13 (see also Barto & Sutton book on-line)

Chapter 5: Monte Carlo Methods

1 Hybrid Agent-Based Modeling: Architectures,Analyses and Applications (Stage One) Li, Hailin.

Machine Learning Lecture 11: Reinforcement Learning

Persistent Autonomous FlightNicholas Lawrance Reinforcement Learning for Soaring CDMRG – 24 May 2010 Nick Lawrance.

Chapter 6: Temporal Difference Learning

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 From Sutton & Barto Reinforcement Learning An Introduction.

Chapter 6: Temporal Difference Learning

Chapter 6: Temporal Difference Learning

Reinforcement Learning: Learning algorithms Yishay Mansour Tel-Aviv University.

Reinforcement Learning: Generalization and Function Brendan and Yifang Feb 10, 2015.

Reinforcement Learning

1 Reinforcement Learning: Learning algorithms Function Approximation Yishay Mansour Tel-Aviv University.

CS Reinforcement Learning1 Reinforcement Learning Variation on Supervised Learning Exact target outputs are not given Some variation of reward is.

1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2006.

Natural Actor-Critic Authors: Jan Peters and Stefan Schaal Neurocomputing, 2008 Cognitive robotics 2008/2009 Wouter Klijn.

1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 11: Temporal Difference Learning (cont.), Eligibility Traces Dr. Itamar Arel College.

1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 7: Finite Horizon MDPs, Dynamic Programming Dr. Itamar Arel College of Engineering.

CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.

Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 2: Temporal difference learning.

CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.

Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

1 Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld.

Decision Making Under Uncertainty Lec #8: Reinforcement Learning UIUC CS 598: Section EA Professor: Eyal Amir Spring Semester 2006 Most slides by Jeremy.

1 Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld.

CMSC 471 Fall 2009 Temporal Difference Learning Prof. Marie desJardins Class #25 – Tuesday, 11/24 Thanks to Rich Sutton and Andy Barto for the use of their.

Regularization and Feature Selection in Least-Squares Temporal Difference Learning J. Zico Kolter and Andrew Y. Ng Computer Science Department Stanford.

Reinforcement Learning Eligibility Traces 主講人：虞台文大同大學資工所智慧型多媒體研究室.

Off-Policy Temporal-Difference Learning with Function Approximation Doina Precup McGill University Rich Sutton Sanjoy Dasgupta AT&T Labs.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

Monte Carlo Methods. Learn from complete sample returns – Only defined for episodic tasks Can work in different settings – On-line: no model necessary.

Retraction: I’m actually 35 years old. Q-Learning.

1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.

Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 31 January, 2012.

Reinforcement Learning Elementary Solution Methods

Reinforcement Learning: Learning algorithms Yishay Mansour Tel-Aviv University.

Rate Distortion Theory. Introduction The description of an arbitrary real number requires an infinite number of bits, so a finite representation of a.

TD(0) prediction Sarsa, On-policy learning Q-Learning, Off-policy learning.

1 Passive Reinforcement Learning Ruti Glick Bar-Ilan university.

Chapter 6: Temporal Difference Learning

Chapter 5: Monte Carlo Methods

Reinforcement learning (Chapter 21)

Reinforcement Learning

CMSC 671 – Fall 2010 Class #22 – Wednesday 11/17

Reinforcement learning

یادگیری تقویتی Reinforcement Learning

October 6, 2011 Dr. Itamar Arel College of Engineering

Chapter 6: Temporal Difference Learning

CMSC 471 – Fall 2011 Class #25 – Tuesday, November 29

Chapter 7: Eligibility Traces

Reinforcement Learning (2)

CHAPTER 11 REINFORCEMENT LEARNING VIA TEMPORAL DIFFERENCES

Presentation transcript:

Schedule for presentations

6.1: Chris? – The agent is driving home from work from a new work location, but enters the freeway from the same point. Thus, the second leg of our drive home is the same as it was before. But say traffic is significantly worse on the first leg of this drive than it was on the first leg before the change in work locations. With a MC approach, we'd be modifying our estimates of the time it takes to make the second leg of the drive based solely on the fact that the entire drive took longer. With a TD method, we'd only be modifying our estimates based on the next state, so this method would be able to learn that the first leg of the drive is taking longer and our estimates would reflect that. The second leg would be unaffected.

Ex. 6.4 V(B) = ¾ V(A) = 0? Or ¾?

4 is terminal state V(3) = 0.5 TD(0) here is better than MC. Why?

The above example illustrates a general difference between the estimates found by batch TD(0) and batch Monte Carlo methods. Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In general, the maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest. In this case, the maximum-likelihood estimate is the model of the Markov process formed in the obvious way from the observed episodes: the estimated transition probability from to is the fraction of observed transitions from that went to, and the associated expected reward is the average of the rewards observed on those transitions. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated. In general, batch TD(0) converges to the certainty-equivalence estimate. This helps explain why TD methods converge more quickly than Monte Carlo methods. In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate. This explains the advantage of TD(0) shown in the batch results on the random walk task (Figure 6.8). The relationship to the certainty-equivalence estimate may also explain in part the speed advantage of nonbatch TD(0) (e.g., Figure 6.7). Although the nonbatch methods do not achieve either the certainty-equivalence or the minimum squared-error estimates, they can be understood as moving roughly in these directions. Nonbatch TD(0) may be faster than constant- $\alpha$ MC because it is moving toward a better estimate, even though it is not getting all the way there. At the current time nothing more definite can be said about the relative efficiency of on-line TD and Monte Carlo methods. Finally, it is worth noting that although the certainty-equivalence estimate is in some sense an optimal solution, it is almost never feasible to compute it directly. If is the number of states, then just forming the maximum-likelihood estimate of the process may require memory, and computing the corresponding value function requires on the order of computational steps if done conventionally. In these terms it is indeed striking that TD methods can approximate the same solution using memory no more than and repeated computations over the training set. On tasks with large state spaces, TD methods may be the only feasible way of approximating the certainty- equivalence solution.

maximum-likelihood models, certainty- equivalence estimates Josh: while TD and MC use very similar methods for computing the values of states they will converge to a different values. It surprises me, I actually had to read the chapter a couple of times to come to grips with it. Example 6.4 in section 6.3 is what finally convinced me however I had to go over it several times

Sarsa

Why could this be a challenge for MC? How could you solve this task with MC?

Ex 6.6: Resolve the windy gridworld task assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

Q-Learning

On- vs. Off-Policy Learning

TD(0) prediction Sarsa, On-policy learning Q-Learning, Off-policy learning (Note that neither Sarsa or Q-learning have an explicit policy….)

When would you use one vs. the others? TD(0) prediction Sarsa, On-policy learning Q-Learning, Off-policy learning (Note that neither Sarsa or Q-learning have an explicit policy….)

Modified update On- or Off-policy? How do you think it’d do on the cliff world?

Actor-Critic On-policy learning Critic: Value function – If result is better than I thought it’d be, take that action more often Actor: Explicit policy Policy evaluation / improvement – Sound familiar?

Actor-Critic

Policy Improvement R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000 Policy gradient improvement θ is vector of policy parameters ρ is the policy performance (e.g., average reward per step) α is learning rate Can (usually) be assured to converge Small changes to θ can cause only small changes to policy / state- visitation distribution Approximate gradient via experience Also: V. Konda and J. Tsitsiklis. Actor-critic algorithms. In NIPS, 2000

Natural Actor-Critic: Peters and Schaal, 2008 Vanilla policy gradient follow gradient of the expected return function J(θ) Often gets stuck in plateaus

Natural Actor-Critic: Peters and Schaal, 2008 Natural gradients avoid this problem Does not follow steepest direction in parameters space, but steepest direction w.r.t. Fisher metric: G(θ) is Fisher information matrix – amount of information that an observed random variable X carries about an unknown parameter θ, upon which P(X) depends) Depends on choice of coordinate system Convergence to next local optimum is assured

Actor-critic advantage – Combine different methods for policy improvement with methods for policy evaluation – E.g., Least-Squares Temporal Difference (LSTD)

Ergodic process: nonzero probability of reaching any state from any other under any policy

Afterstates Where else would these be useful?

Unified View

Quiz! 1.If a policy is greedy w.r.t. the value function for the equiprobable random policy, then it is an optimal policy. Why or why not? 2.If a policy π is greed w.r.t. its own value function, V π, then it is an optimal policy. Why or why not? 3.When would you use TD(0), Sarsa, or Q-Learning (vs. the other 2 algorithms listed)