Monte Carlo Methods

Learn from complete sample returns
– Only defined for episodic tasks
Can work in different settings
– On-line: no model necessary
– Simulated: no need for a full model
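Since MC methods wait until an episode ends and then average complete sample returns, the basic quantity is the return G_t. A minimal sketch of computing it (the function name, episode format, and discount factor are illustrative assumptions, not from the slides):

```python
# Compute the return G_t for every time step of one finished episode.
# rewards[t] is the reward received after the action taken at step t.
def returns_from_episode(rewards, gamma=1.0):
    G = 0.0
    returns = [0.0] * len(rewards)
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Example: a 3-step episode with rewards 0, 0, 1 and gamma = 0.9
print(returns_from_episode([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```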

DP doesn't need any experience; MC doesn't need a full model
– MC learns from actual or simulated experience
MC's expense for estimating one state's value is independent of the total # of states
MC does no bootstrapping
– So it is not harmed by violations of the Markov property

Every-Visit MC: average returns every time s is visited in an episode
First-Visit MC: average returns only the first time s is visited in an episode
Both converge asymptotically

First-Visit MC for Vπ. What about control?
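This slide refers to the first-visit MC prediction algorithm; a minimal sketch is below. The episode format, function names, and dictionaries are illustrative assumptions; dropping the commented first-visit check turns it into every-visit MC.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi from complete episodes produced by a fixed policy pi.

    generate_episode() must return a list of (state, reward) pairs, where
    reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            # First-visit rule: only update if s does not occur earlier.
            # (Drop this check to get every-visit MC.)
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```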

Want to improve the policy by acting greedily with respect to Q
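Greedy improvement just picks, in each state, an action with the highest estimated action value. A tiny sketch, assuming Q is stored as a dictionary keyed by (state, action) pairs (e.g., a defaultdict) and the action set is known:

```python
def greedy_policy(Q, actions):
    # For each state, choose an action with the largest Q(state, a).
    def policy(state):
        return max(actions, key=lambda a: Q[(state, a)])
    return policy
```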

MC can estimate Q when a model isn't available
– Want to learn Q*
Converges asymptotically if every state-action pair is visited "enough"
– Need to explore: why?
Exploring starts: every state-action pair has a non-zero probability of being the starting pair
– Why necessary?
– Still necessary if π is probabilistic?

MC for Control (Exploring Starts)
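A compact sketch of Monte Carlo control with exploring starts (Monte Carlo ES). The environment interface, the `run_episode` function, and the data structures are assumptions made for illustration:

```python
import random
from collections import defaultdict

def mc_control_es(states, actions, run_episode, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (Monte Carlo ES).

    run_episode(start_state, start_action, policy) must return one complete
    episode as a list of (state, action, reward) triples.
    """
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: a random (state, action) pair begins the episode,
        # after which the current greedy policy is followed.
        s0 = random.choice(states)
        a0 = random.choice(actions)
        episode = run_episode(s0, a0, policy)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            # First-visit check on the (state, action) pair.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # Policy improvement: act greedily w.r.t. the updated Q.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, policy
```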

Q0(·,·) = 0, π0 = ?
Action: 1 or 2
State: amount of money
Transition: based on state, flip outcome, and action
Reward: amount of money
Return: money at end of episode
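The slide sketches a small coin-flip betting task (state = current money, action = bet 1 or 2, return = money at the end of the episode). The exact game rules are not given on the slide, so the environment below is a hypothetical reading of it, shown only to illustrate how episodes compatible with the ES sketch above could be generated:

```python
import random

def run_gambling_episode(start_money, start_action, policy, max_steps=20):
    """Hypothetical coin-flip betting game (rules assumed, not from the slide):
    bet 1 or 2; a fair flip decides whether the bet is won or lost.
    The episode ends when the money runs out or max_steps is reached.
    Returns a list of (state, action, reward) triples; the only nonzero
    reward is the money held at the end of the episode."""
    episode = []
    money = start_money
    action = start_action
    for step in range(max_steps):
        win = random.random() < 0.5                # coin flip
        next_money = max(money + action if win else money - action, 0)
        done = next_money == 0 or step == max_steps - 1
        reward = next_money if done else 0          # return = money at the end
        episode.append((money, action, reward))
        if done:
            break
        money = next_money
        action = policy.get(money, 1)               # follow the current policy
    return episode
```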

MC for Control, On-Policy
Soft policy:
– π(s,a) > 0 for all s, a
– e.g., ε-greedy
A version of the policy improvement theorem can be proven for soft policies
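An ε-greedy policy is the usual concrete soft policy: with probability ε take a uniformly random action, otherwise take a greedy one, so every action keeps probability at least ε/|A|. A small sketch with assumed names:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore uniformly at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick an action with the largest Q(state, a).
    return max(actions, key=lambda a: Q[(state, a)])
```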

MC for Control, On-Policy
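This slide presumably shows the on-policy first-visit MC control algorithm with an ε-greedy policy; a compact sketch under the same assumed interfaces as above:

```python
import random
from collections import defaultdict

def on_policy_mc_control(actions, run_episode, num_episodes,
                         gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy policy.

    run_episode(select_action) must play one full episode, calling
    select_action(state) to choose actions, and return a list of
    (state, action, reward) triples.
    """
    Q = defaultdict(float)
    counts = defaultdict(int)

    def select_action(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = run_episode(select_action)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            # First-visit update for the (state, action) pair.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q, select_action
```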

Off-Policy Control
Why consider off-policy methods?

Off-Policy Control
Learn about π while following π′
– Behavior policy (π′) vs. estimation policy (π)
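Because the returns are generated by the behavior policy π′ but we want values under the estimation policy π, each return must be reweighted by the relative probability of its action sequence under the two policies (importance sampling). A minimal sketch of off-policy MC evaluation with weighted importance sampling (the policy-probability interfaces and names are assumptions):

```python
from collections import defaultdict

def off_policy_mc_evaluation(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate Q under the estimation (target) policy from episodes
    generated by a behavior policy.

    episodes: list of episodes, each a list of (state, action, reward).
    target_prob(s, a), behavior_prob(s, a): action probabilities under the
    estimation policy and the behavior policy, respectively.
    Uses weighted importance sampling."""
    Q = defaultdict(float)
    C = defaultdict(float)    # cumulative importance weights

    for episode in episodes:
        G = 0.0
        W = 1.0               # importance-sampling ratio for the episode tail
        for s, a, r in reversed(episode):
            G = r + gamma * G
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            # Extend the ratio backwards by one more step.
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0:        # the target policy would never take this tail
                break
    return Q
```

This also bears on the next slide's question: once the behavior policy takes an action that a deterministic greedy estimation policy would not, the ratio W drops to zero, so only the tail of the episode after the last such disagreement actually contributes to the estimates.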

Off-Policy Learning
Why learn only from the tail of the episode?
How should the behavior policy be chosen?