Lecture 18: Temporal-Difference Learning
Presentation transcript:

RL2: Q(s,a) easily gets us U(s) = max_a Q(s,a) and pi(s) = argmax_a Q(s,a). The "left arrow with alpha above it" notation means: move the estimate toward the target, weighted by the learning rate, i.e. X <- X + alpha(target - X). For convergence the step sizes must satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity. Simulated annealing -> epsilon-greedy exploration – combine random with greedy: exploration + exploitation. GLIE: Greedy in the Limit with Infinite Exploration.
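A minimal sketch of the two ingredients above – the "move toward the target" update and epsilon-greedy action selection – assuming a tabular Q stored as a dict keyed by (state, action); the function names are mine:

import random

def move_toward(old, target, alpha):
    """X <-(alpha)- target: nudge the estimate a fraction alpha toward the target."""
    return old + alpha * (target - old)   # same as (1 - alpha)*old + alpha*target

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                # exploration
    return max(actions, key=lambda a: Q[(s, a)])     # exploitation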

Dyn. Prog.: t, r -> planner -> policy
Monte Carlo: s, a, s', r -> learner -> policy
Model-based: s, a, s', r -> modeler -> t, r model -> simulate -> s, a, s', r
Where do Q-Learning and TD fit in this picture?

Dimensions along which to compare DP, MC, and TD: model vs. model-free, online vs. offline updates, bootstrapping or not, on-policy vs. off-policy, and batch updating vs. online updating.

Ex. 6.4: V(B) = ¾.

Ex. 6.4: V(B) = ¾. V(A) = 0? Or ¾? (Batch MC says 0 – the only episode through A returned 0; batch TD(0) says ¾, since A transitioned to B with reward 0.)
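A quick sketch that reproduces both answers under batch updating, assuming undiscounted returns (gamma = 1) and the usual encoding of the eight episodes (six of B->1, one of B->0, one of A->0->B->0):

# Episodes as lists of (state, reward, next_state); 'T' is the terminal state.
episodes = [[('A', 0, 'B'), ('B', 0, 'T')]] + \
           [[('B', 1, 'T')]] * 6 + [[('B', 0, 'T')]]

# Batch MC: average the sampled returns observed from each state.
returns = {'A': [], 'B': []}
for ep in episodes:
    g = 0
    for s, r, s2 in reversed(ep):        # accumulate the return backwards
        g += r
        returns[s].append(g)
print({s: sum(gs) / len(gs) for s, gs in returns.items()})   # {'A': 0.0, 'B': 0.75}

# Batch TD(0): sweep the same episodes repeatedly with a small alpha
# until the estimates settle; V(A) gets pulled toward V(B).
V = {'A': 0.0, 'B': 0.0, 'T': 0.0}
for _ in range(10000):
    for ep in episodes:
        for s, r, s2 in ep:
            V[s] += 0.01 * (r + V[s2] - V[s])
print(V['A'], V['B'])                    # both approach 0.75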

4 is the terminal state; V(3) = 0.5. TD(0) here is better than MC for V(2). Why?

4 is the terminal state; V(3) = 0.5. TD(0) here is better than MC. Why? Suppose state 2 is being visited for the kth time while state 3 has already been visited 10k times. The variance of the MC target will be much higher than that of TD(0), because TD(0) bootstraps on the already-accurate estimate of V(3).

4 is the terminal state; V(3) = 0.5. Change the problem so that R(3,a,4) is deterministic: now MC would be faster.
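To make the variance argument concrete, here are the two updates side by side (an illustration of mine, not slide code; gamma = 1 and V is a dict of estimates):

def td0_update(V, s, r, s_next, alpha):
    # The target r + V[s_next] bootstraps on V[s_next]; after 10k visits
    # to state 3, V[3] is already accurate, so this target has low variance.
    V[s] += alpha * (r + V[s_next] - V[s])

def mc_update(V, s, g, alpha):
    # The target is the full sampled return g, which carries the noise of
    # every remaining reward (e.g. a random R(3,a,4)), hence higher variance.
    V[s] += alpha * (g - V[s])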

What happened in the 1st episode?

Ex 6.4, Figure 6.7: MC vs. TD? Why does TD's RMS error go down and then back up again with high learning rates?

Ex 6.4, Figure 6.7: the RMS error goes down and comes back up again with high learning rates. Would things change if there were a reward of −1 on each step?
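The effect is easy to reproduce on the five-state random walk from Sutton & Barto (my assumption about which task the figure shows): with a large constant alpha, the RMS error drops quickly and then climbs back up as the estimates keep chasing noisy targets.

import random

TRUE_V = [i / 6 for i in range(7)]           # true values of states 1..5 are i/6

def run_td0(alpha, n_episodes=100):
    V = [0.0] + [0.5] * 5 + [0.0]            # the usual 0.5 initialization
    rms_per_episode = []
    for _ in range(n_episodes):
        s = 3                                 # start in the middle
        while 0 < s < 6:                      # 0 and 6 are terminal
            s2 = s + random.choice((-1, 1))
            r = 1 if s2 == 6 else 0           # +1 only for exiting right
            V[s] += alpha * (r + V[s2] - V[s])    # TD(0), gamma = 1
            s = s2
        rms = (sum((V[i] - TRUE_V[i]) ** 2 for i in range(1, 6)) / 5) ** 0.5
        rms_per_episode.append(rms)
    return rms_per_episode

# run_td0(0.05) keeps improving slowly; run_td0(0.4) improves fast at first,
# then its error rises again.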

Q-Learning
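A minimal tabular Q-learning sketch. The env interface (env.reset() -> state, env.step(a) -> (state, reward, done)) is an assumption of mine, not something defined in the lecture:

import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, n_episodes=1000):
    Q = defaultdict(float)                       # Q[(s, a)], zero-initialized
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy behavior policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Off-policy target: bootstrap on the best action in s2,
            # regardless of what the behavior policy will actually do next.
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q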

Decaying learning rate? Decaying exploration rate?
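One common answer, sketched with names of my own (nothing here is prescribed by the slides): decay alpha per state-action visit and epsilon per episode, so that the conditions from the first slide hold.

from collections import defaultdict

visits = defaultdict(int)            # per-(state, action) visit counts

def alpha_for(s, a):
    # alpha = 1/n(s,a): the sum of alphas diverges, the sum of squares is finite.
    visits[(s, a)] += 1
    return 1.0 / visits[(s, a)]

def epsilon_for(k):
    # epsilon = 1/k over episodes: decays to 0 (greedy in the limit) but
    # slowly enough that every action keeps being tried (GLIE).
    return 1.0 / k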