Presentation transcript:

Programming exercises: Angel – lms.wsu.edu – Submit via zip or tar – Write-up, Results, Code
Doodle: class presentations
Student Responses
First visit vs. every visit
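
To make the first-visit vs. every-visit distinction concrete, here is a minimal Python sketch of MC prediction (not the assignment's required code; the (state, reward) episode encoding and variable names are assumptions for illustration):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging sampled returns.
    episodes: each episode is a list of (state, reward) pairs, where
    reward is the reward received after leaving that state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t that follows each time step t.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue      # every-visit MC would also average this return
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

First-visit MC averages only the return following the first occurrence of a state in each episode; every-visit MC averages the returns following every occurrence.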

MC for Control, On-Policy (soft policies)
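
A minimal sketch of on-policy MC control with an ε-soft (ε-greedy) policy, in the spirit of the algorithm in Sutton & Barto. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the parameter values are assumptions, not code from the lecture:

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q, state, actions, epsilon):
    # Soft policy: with probability epsilon explore, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def on_policy_mc_control(env, actions, num_episodes=10000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        # Generate an episode while following the current epsilon-soft policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # First-visit MC update of Q(s,a) for every pair in the episode.
        G, first_return = 0.0, {}
        for s, a, r in reversed(trajectory):
            G = gamma * G + r
            first_return[(s, a)] = G   # keeps overwriting until the earliest visit
        for sa, ret in first_return.items():
            returns_sum[sa] += ret
            returns_count[sa] += 1
            Q[sa] = returns_sum[sa] / returns_count[sa]
    return Q
```

The policy stays ε-soft throughout, so every state-action pair keeps being explored while the greedy choice tracks the current Q.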

Off-Policy Control: Learn about π while following π’ – behavior policy vs. estimation policy

π’ = policy followed (behavior policy); π = policy being evaluated (estimation policy). π’(s,a): probability that π’ takes action a in state s. π(s,a) = 1 because π is deterministic, and we wouldn't be considering (s,a) if π didn't select a. First, we're looking only at the tail of the episode after the last exploratory action, including that action itself.

Example: we're considering some (s,a) and eventually observe a return of 100. Suppose π’ is unlikely to reach this goal state: π’ is 0.01 on one of the steps to the goal (and 1 on the rest) – w = 100, N = 100 * 100 = 10,000, D = 100 – Q = N/D = 100
Now consider a different (s',a') where we also get a return of 100, but the goal state is always reached (π’ = 1.0 for all steps in the trajectory) – w = 1, N = 100, D = 1 – Q = N/D = 100
Second: the difference is in how future updates for these two pairs will be weighted. We have to weight each update by how likely we are to experience that trajectory under the sampling (behavior) policy.
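
With weighted importance sampling, the per-episode weight is w = Π 1/π’(s_k, a_k) over the reweighted steps of the tail (the π(s_k, a_k) factors are all 1 because π is deterministic), and the running sums are N += w·R and D += w, with Q = N/D. A small sketch reproducing the two cases above (variable names mirror the slide; the step probabilities are illustrative):

```python
def weighted_is_update(N, D, behavior_probs, ret):
    """One off-policy MC update using weighted importance sampling.
    behavior_probs: pi'(s_k, a_k) for the reweighted steps of the episode
    tail; the target policy pi is deterministic, so its terms are all 1."""
    w = 1.0
    for p in behavior_probs:
        w *= 1.0 / p          # importance-sampling weight
    N += w * ret              # numerator: weight * observed return
    D += w                    # denominator: accumulated weight
    return N, D

# Case 1: one unlikely step under pi' (prob. 0.01), return 100.
N, D = weighted_is_update(0.0, 0.0, [1.0, 0.01, 1.0], 100.0)
print(N, D, N / D)            # 10000.0 100.0 100.0

# Case 2: same return, but every step is certain under pi'.
N, D = weighted_is_update(0.0, 0.0, [1.0, 1.0, 1.0], 100.0)
print(N, D, N / D)            # 100.0 1.0 100.0
```

Both estimates equal 100 right now; the difference shows up later, because the accumulated denominators differ (100 vs. 1). The return in case 1 was rare under π’ but would be routine under π, so it carries a large weight and future returns will shift that estimate much less, which is one way to read the slide's point about weighting updates by the sampling policy.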

Achieving Master Level Play in 9×9 Computer Go – 5 minutes: – Summary of the paper? – What's interesting? – How could you improve their idea?

DP, MC, TD: model, online/offline update, bootstrapping, on-policy/off-policy. Batch updating vs. online updating.
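
As a reminder of the two tabular prediction updates being compared, here is a minimal sketch (α, γ, and the value table V are the usual tabular quantities; nothing here is specific to the lecture's code):

```python
def td0_update(V, s, r, s_next, alpha, gamma=1.0):
    # TD(0): bootstrap off the current estimate of the next state's value.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def constant_alpha_mc_update(V, s, G, alpha):
    # Constant-alpha MC: move toward the full sampled return G_t.
    V[s] += alpha * (G - V[s])
```

Online updating applies these increments as soon as a step (TD) or a completed episode (MC) is available; batch updating repeatedly replays a fixed set of episodes, accumulating the increments and applying them only after each full pass, until the value function converges.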

Example 6.4: V(B) = ¾. V(A) = 0? Or ¾?
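
One way to see where both candidate answers come from is to compute the two batch estimates directly on the eight episodes of Example 6.4 (a throwaway sketch; the episode encoding below is an assumption, not from the slides):

```python
# Example 6.4 ("You are the Predictor"): eight undiscounted episodes.
# One episode A,0,B,0; six episodes B,1; one episode B,0.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch Monte Carlo answer: V(s) = average return observed after visiting s.
returns = {'A': [], 'B': []}
for ep in episodes:
    G = 0
    for state, reward in reversed(ep):
        G += reward
        returns[state].append(G)
print({s: sum(rs) / len(rs) for s, rs in returns.items()})  # {'A': 0.0, 'B': 0.75}

# Batch TD(0) / certainty-equivalence answer: in the maximum-likelihood model,
# A always transitions to B with reward 0, so V(A) = 0 + V(B) = 6/8 = 0.75.
V_B = 6 / 8
V_A = 0 + V_B
print(V_A, V_B)                                             # 0.75 0.75
```

Batch MC answers 0 because the only return ever observed from A is 0; batch TD(0) converges to ¾, as the certainty-equivalence argument later in these notes explains.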

4 is the terminal state; V(3) = 0.5. TD(0) here is better than MC. Why? Suppose state 2 is being visited for the kth time while state 3 has already been visited 10k times. The variance of the MC target will be much higher than that of TD(0), because TD(0) bootstraps off the well-estimated V(3).

Same setup: 4 is the terminal state, V(3) = 0.5. Now change the problem so that R(3,a,4) is deterministic. In that case, MC would be faster.

Exercise 6.4, Figure 6.7: the RMS error goes down and then up again at high learning rates. Why?

6.1: Chris? – The agent is driving home from work from a new work location, but enters the freeway from the same point. Thus, the second leg of the drive home is the same as it was before. But say traffic is significantly worse on the first leg of this drive than it was on the first leg before the change in work locations. With an MC approach, we'd be modifying our estimates of the time it takes to make the second leg of the drive based solely on the fact that the entire drive took longer. With a TD method, we'd only be modifying our estimates based on the next state, so this method would be able to learn that the first leg of the drive is taking longer, and our estimates would reflect that. The second leg would be unaffected.

The above example illustrates a general difference between the estimates found by batch TD(0) and batch Monte Carlo methods. Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In general, the maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest. In this case, the maximum-likelihood estimate is the model of the Markov process formed in the obvious way from the observed episodes: the estimated probability of a transition from one state to another is the fraction of observed transitions from the first state that went to the second, and the associated expected reward is the average of the rewards observed on those transitions. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated. In general, batch TD(0) converges to the certainty-equivalence estimate.

This helps explain why TD methods converge more quickly than Monte Carlo methods. In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate. This explains the advantage of TD(0) shown in the batch results on the random walk task (Figure 6.8). The relationship to the certainty-equivalence estimate may also explain in part the speed advantage of nonbatch TD(0) (e.g., Figure 6.7). Although the nonbatch methods do not achieve either the certainty-equivalence or the minimum squared-error estimates, they can be understood as moving roughly in these directions. Nonbatch TD(0) may be faster than constant-α MC because it is moving toward a better estimate, even though it is not getting all the way there. At the current time nothing more definite can be said about the relative efficiency of online TD and Monte Carlo methods.

Finally, it is worth noting that although the certainty-equivalence estimate is in some sense an optimal solution, it is almost never feasible to compute it directly. If N is the number of states, then just forming the maximum-likelihood estimate of the process may require on the order of N² memory, and computing the corresponding value function requires on the order of N³ computational steps if done conventionally. In these terms it is indeed striking that TD methods can approximate the same solution using memory no more than N and repeated computations over the training set. On tasks with large state spaces, TD methods may be the only feasible way of approximating the certainty-equivalence solution.
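
For a small state space the certainty-equivalence estimate can be computed directly: form the maximum-likelihood model from the observed episodes and solve the linear system V = R + γPV. A sketch under assumed conventions (episodes given as (state, reward, next_state) transitions, next_state = None for the terminal state, and every listed state visited at least once):

```python
import numpy as np

def certainty_equivalence_values(episodes, states, gamma=1.0):
    """Value function of the maximum-likelihood model (batch TD(0)'s fixed point).
    episodes: lists of (state, reward, next_state) transitions, where
    next_state=None marks a transition into the terminal state."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = np.zeros((n, n))     # transition counts between nonterminal states
    reward_sum = np.zeros(n)
    visits = np.zeros(n)
    for ep in episodes:
        for s, r, s_next in ep:
            i = idx[s]
            visits[i] += 1
            reward_sum[i] += r
            if s_next is not None:
                counts[i, idx[s_next]] += 1
    P = counts / visits[:, None]  # ML transition probabilities (terminal mass implicit)
    R = reward_sum / visits       # ML expected one-step rewards
    # Certainty equivalence: solve V = R + gamma * P V, i.e. (I - gamma*P) V = R.
    V = np.linalg.solve(np.eye(n) - gamma * P, R)
    return {s: float(v) for s, v in zip(states, V)}

# Example 6.4 again, encoded as transitions: the answer is V(A) = V(B) = 0.75.
eps = [[('A', 0, 'B'), ('B', 0, None)]] + [[('B', 1, None)]] * 6 + [[('B', 0, None)]]
print(certainty_equivalence_values(eps, ['A', 'B']))   # {'A': 0.75, 'B': 0.75}
```

The N×N matrix and the linear solve are exactly the N² memory and roughly N³ computation the passage mentions, which is why TD(0)'s ability to approach the same fixed point with O(N) memory is striking.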

maximum-likelihood models, certainty-equivalence estimates
Josh: While TD and MC use very similar methods for computing the values of states, they converge to different values. This surprised me; I actually had to read the chapter a couple of times to come to grips with it. Example 6.4 in Section 6.3 is what finally convinced me, although I had to go over it several times.