Retraction: I’m actually 35 years old. Q-Learning.

Presentation transcript:

Retraction: I’m actually 35 years old

Q-Learning

Decaying learning rate? Decaying exploration rate?
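A minimal sketch of what such schedules might look like (my own illustration; the functional forms and constants below are assumptions, not from the lecture):

```python
# Hypothetical decay schedules for the learning rate (alpha) and the
# exploration rate (epsilon); forms and constants are illustrative only.
def decayed_alpha(episode, alpha0=0.5, decay=1e-3):
    """Learning rate that shrinks as 1 / (1 + decay * episode)."""
    return alpha0 / (1.0 + decay * episode)

def decayed_epsilon(episode, eps0=1.0, eps_min=0.05, decay=0.995):
    """Exploration rate that decays geometrically toward a floor."""
    return max(eps_min, eps0 * decay ** episode)
```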

Sarsa

On- vs. Off-Policy Learning

Why could this be a challenge for MC? How could you solve this task with MC?

Ex 6.6: Re-solve the windy gridworld task assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?
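One possible way to set the exercise up in code (a sketch under my own assumptions: the book's wind strengths per column, ε-greedy Sarsa, -1 reward per step, and a ninth stand-still action; not reference code):

```python
import random
from collections import defaultdict

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]           # upward push per column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),    # up, down, left, right
           (-1, -1), (-1, 1), (1, -1), (1, 1),  # the four diagonal "king" moves
           (0, 0)]                              # optional ninth action: stand still

def step(state, action):
    """Deterministic move plus the wind of the column the move starts from."""
    r, c = state
    dr, dc = action
    r2 = min(max(r + dr - WIND[c], 0), ROWS - 1)
    c2 = min(max(c + dc, 0), COLS - 1)
    next_state = (r2, c2)
    return next_state, -1, next_state == GOAL   # -1 per step until the goal

def epsilon_greedy(Q, state, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    values = [Q[(state, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))

def sarsa(episodes=500, alpha=0.5, gamma=1.0):
    Q = defaultdict(float)
    for _ in range(episodes):
        state, action = START, epsilon_greedy(Q, START)
        done = False
        while not done:
            next_state, reward, done = step(state, ACTIONS[action])
            next_action = epsilon_greedy(Q, next_state)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```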

TD(0): prediction. Sarsa: on-policy learning. Q-Learning: off-policy learning. Sarsa and Q-learning have no explicit policy (the policy is implicit in the Q-values, e.g., ε-greedy).
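For reference, the corresponding one-step updates in the book's notation (standard forms, written out here for convenience):

\begin{align*}
\text{TD(0):}\quad & V(S_t) \leftarrow V(S_t) + \alpha\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr] \\
\text{Sarsa:}\quad & Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\bigl[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\bigr] \\
\text{Q-learning:}\quad & Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\bigl[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\bigr]
\end{align*}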

Gridworld example. Each cell shows the current policy action and Q(up), Q(down), Q(left), Q(right). Deterministic transitions; α = 0.1, γ = 1.0, step reward r = -0.1.

U 0,0,0,0 | R 0,0,0,0 | R 0,0,0,0 | +1
U 0,0,0,0 | U 0,0,0,0 | U 0,0,0,0 |
R 0,0,0,0 | R 0,0,0,0 | U 0,0,0,0 |
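As a worked example of the arithmetic (my own, assuming entering the +1 cell yields reward +1 and every other step yields the -0.1 step reward), the first one-step backups from the all-zero table, which are identical for Sarsa and Q-learning while all values are still zero, would be:

\begin{align*}
\text{ordinary step:}\quad & Q(s,a) \leftarrow 0 + 0.1\bigl[-0.1 + 1.0 \cdot 0 - 0\bigr] = -0.01 \\
\text{step into the +1 cell:}\quad & Q(s,a) \leftarrow 0 + 0.1\bigl[\,1 + 0 - 0\,\bigr] = 0.1
\end{align*}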

Modified update: on- or off-policy? How do you think it'd do on the cliff world?

Actor-Critic: on-policy learning.
– Critic: value function. If the result is better than I thought it'd be, take that action more often.
– Actor: explicit policy.
Policy evaluation / improvement – sound familiar?
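A minimal tabular sketch of this idea (my own illustration; `env` is a hypothetical environment with `reset()`/`step()`, and the softmax-preference actor is an assumption about the parameterization):

```python
import math, random
from collections import defaultdict

def softmax_policy(prefs, state, n_actions):
    """Action probabilities from the actor's preferences h(s, a)."""
    hs = [prefs[(state, a)] for a in range(n_actions)]
    m = max(hs)
    exps = [math.exp(h - m) for h in hs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_episode(env, prefs, V, n_actions,
                         alpha_v=0.1, alpha_pi=0.1, gamma=1.0):
    """Run one episode, updating the critic's V and the actor's preferences."""
    state, done = env.reset(), False
    while not done:
        probs = softmax_policy(prefs, state, n_actions)
        action = random.choices(range(n_actions), weights=probs)[0]
        next_state, reward, done = env.step(action)

        # Critic: one-step TD error -- "was this better than I thought it'd be?"
        target = reward + (0.0 if done else gamma * V[next_state])
        td_error = target - V[state]
        V[state] += alpha_v * td_error

        # Actor: shift the softmax preferences so that an action with a
        # positive TD error becomes more probable in this state (and vice versa).
        for a in range(n_actions):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[(state, a)] += alpha_pi * td_error * grad

        state = next_state

# Usage sketch (hypothetical env): prefs = defaultdict(float); V = defaultdict(float)
# for _ in range(500): actor_critic_episode(env, prefs, V, n_actions=4)
```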

Actor-Critic

Advantages: minimal computation to select actions; can learn explicit stochastic policies.

Policy Improvement
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
Policy gradient improvement:
– θ is the vector of policy parameters
– ρ is the policy performance (e.g., average reward per step)
– α is the learning rate
Can (usually) be assured to converge: small changes to θ cause only small changes to the policy and state-visitation distribution. The gradient is approximated via experience.
Also: V. Konda and J. Tsitsiklis. Actor-critic algorithms. NIPS, 2000.
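The update referred to here (shown as an equation on the original slide; reproduced below in its standard gradient-ascent form) is:

\[
\theta_{k+1} = \theta_k + \alpha_k \left.\frac{\partial \rho}{\partial \theta}\right|_{\theta = \theta_k}
\]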

Natural Actor-Critic (Peters and Schaal, 2008). The vanilla policy gradient follows the gradient of the expected return function J(θ) and often gets stuck in plateaus.

Natural Actor-Critic (Peters and Schaal, 2008). Natural gradients avoid this problem: they follow not the steepest direction in parameter space, but the steepest direction w.r.t. the Fisher metric. G(θ) is the Fisher information matrix – the amount of information that an observed random variable X carries about an unknown parameter θ upon which P(X) depends. The steepest direction depends on the choice of coordinate system; under the Fisher metric, convergence to the next local optimum is assured.
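The equation omitted from the transcript is, in its standard form, the steepest-ascent direction under the Fisher metric:

\[
\tilde{\nabla}_\theta J(\theta) = G(\theta)^{-1} \nabla_\theta J(\theta),
\qquad
G(\theta) = \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)^{\top}\bigr]
\]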

Actor-critic advantage: you can combine different methods for policy improvement with different methods for policy evaluation – e.g., Least-Squares Temporal Difference (LSTD) for the critic.
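A minimal sketch of LSTD(0) used as such a critic (my own illustration; the feature vectors and the batch of transitions are assumed to be given):

```python
import numpy as np

def lstd0(transitions, n_features, gamma=1.0, reg=1e-6):
    """Batch LSTD(0): fit weights w so that V(s) is approximated by w . phi(s).

    `transitions` is an iterable of (phi_s, reward, phi_next) tuples, where
    phi_next should be the zero vector for transitions into a terminal state.
    """
    A = reg * np.eye(n_features)            # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for phi_s, reward, phi_next in transitions:
        A += np.outer(phi_s, phi_s - gamma * phi_next)
        b += reward * phi_s
    return np.linalg.solve(A, b)
```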

Ergodic process: nonzero probability of reaching any state from any other under any policy

Afterstates: where else would these be useful?

Quiz!
1. If a policy is greedy w.r.t. the value function for the equiprobable random policy, then it is an optimal policy. Why or why not?
2. If a policy π is greedy w.r.t. its own value function, V^π, then it is an optimal policy. Why or why not?
3. When would you use TD(0), Sarsa, or Q-Learning (vs. the other two algorithms listed)?

Unified View