Reinforcement Learning: Learning algorithms Yishay Mansour Tel-Aviv University

Outline
Last week:
– Goal of Reinforcement Learning
– Mathematical Model (MDP)
– Planning: Value Iteration, Policy Iteration
This week: Learning Algorithms
– Model based
– Model free

Planning - Basic Problems
Policy evaluation: given a policy π, estimate its return.
Optimal control: find an optimal policy π* (one that maximizes the return from any start state).
Both problems assume a complete MDP model is given.

Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
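
For concreteness, these definitions can be written out for the discounted-return criterion used later in these slides (a standard formulation, added here as an aside rather than taken from the slide):

\[
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s,\ \text{actions chosen by } \pi\Big],
\qquad
Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \text{then } \pi\Big].
\]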

Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a {Q^π(s,a)}   (Bellman Eq.)
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a {Q^π(s,a)}
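
A minimal sketch of extracting this greedy policy from a Q-table (the dict-of-dicts layout for Q is an illustrative assumption, not something fixed by the slides):

def greedy_policy(Q):
    """Return pi(s) = argmax_a Q(s, a), where Q is a dict: state -> {action: value}."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

# e.g. greedy_policy({0: {"left": 1.0, "right": 2.0}}) == {0: "right"}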

MDP - Computing the Optimal Policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
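
As a concrete illustration of option 2, here is a small value-iteration sketch; the P/R data layout and the stopping rule are assumptions made for the example, not part of the lecture:

import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs and R[s][a] is the
    immediate reward, both taken from the (known) MDP model."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < eps:   # stop when the values stabilize
            return V_new
        V = V_new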

Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.

Example - Elevator Control
Planning (alone): given an arrival model, build a schedule.
Learning (alone): learn the arrival model well.
Real objective: construct a schedule while updating the model.

Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).

Learning - Model Based
Estimate the model from the observations (both transition probabilities and rewards).
Use the estimated model as the true model, and find an optimal policy.
If we have a “good” estimated model, we should get a “good” estimate.

Learning - Model Based: Off Policy
Let the policy run for a “long” time.
– What is “long”?!
Build an “observed model”:
– Transition probabilities
– Rewards
Use the “observed model” to estimate the value of the policy.
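
A minimal sketch of the “observed model” construction; the (state, action, reward, next_state) trajectory format is a hypothetical interface chosen for the example:

from collections import defaultdict

def estimate_model(trajectories):
    """Estimate transition probabilities and mean rewards from observed
    (state, action, reward, next_state) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for trajectory in trajectories:
        for s, a, r, s_next in trajectory:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat

The estimated (P_hat, R_hat) can then be handed to a planner such as the value-iteration sketch above.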

Learning - Model Based: Sample Size
Sample size (optimal policy):
Naive: O(|S|^2 |A| log(|S| |A|)) samples (approximates each transition (s,a,s’) well).
Better: O(|S| |A| log(|S| |A|)) samples (sufficient to approximate an optimal policy). [KS, NIPS’98]

Learning - Model Based: On Policy
The learner has control over the actions.
– The immediate goal is to learn a model.
As before:
– Build an “observed model”: transition probabilities and rewards.
– Use the “observed model” to estimate the value of the policy.
Accelerating the learning:
– How to reach “new” places?!

Learning - Model Based: On Policy
[Diagram: the state space divides into well-sampled nodes and relatively unknown nodes.]

Learning: Policy Improvement
Assume that we can perform:
– Given a policy π,
– Compute the V and Q functions of π.
Can run policy improvement:
– π' = Greedy(Q)
The process converges if the estimates are accurate.
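
A schematic version of this loop, assuming some evaluation routine evaluate_q(policy) is available (model based, Monte Carlo, or TD; all names here are illustrative):

def policy_improvement_loop(evaluate_q, states, actions, init_policy, max_rounds=50):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = dict(init_policy)
    for _ in range(max_rounds):
        Q = evaluate_q(policy)                       # Q-table: state -> {action: value}
        new_policy = {s: max(actions, key=lambda a: Q[s][a]) for s in states}
        if new_policy == policy:                     # greedy policy stopped changing
            break
        policy = new_policy
    return policy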

Learning: Monte Carlo Methods
Assume we can run in episodes:
– Terminating MDP
– Discounted return
Simplest: sample the return of state s:
– Wait to reach state s,
– Compute the return from s,
– Average all the returns.

Learning: Monte Carlo Methods
First visit:
– For each state in the episode,
– Compute the return from its first occurrence,
– Average the returns.
Every visit:
– Might be biased!
Computing an optimal policy:
– Run policy iteration.
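
A first-visit Monte Carlo sketch; each episode is assumed to be recorded as a list of (state, reward) pairs generated by the fixed policy:

def first_visit_mc(episodes, gamma=0.95):
    """Average, for every state, the discounted return from its first occurrence."""
    returns = {}                                # state -> list of sampled returns
    for episode in episodes:
        # compute the return following each step with a backwards pass
        G = 0.0
        tail_returns = []
        for s, r in reversed(episode):
            G = r + gamma * G
            tail_returns.append((s, G))
        tail_returns.reverse()
        seen = set()
        for s, G in tail_returns:
            if s not in seen:                   # first occurrence only
                seen.add(s)
                returns.setdefault(s, []).append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}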

Learning - Model Free
Policy evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our “estimation error” is Δ_t = r_t + γ V(s_{t+1}) − V(s_t).
The update: V_{t+1}(s_t) = V_t(s_t) + α Δ_t
Note that for the correct value function we have: E[r + γ V(s’) − V(s)] = 0.
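
A TD(0) sketch of the update above; env_step(s, a) -> (reward, next_state, done) and select_action(s) are hypothetical interfaces standing in for the environment and the fixed policy:

def td0(env_step, select_action, n_states, n_steps,
        alpha=0.1, gamma=0.95, start_state=0):
    """Online TD(0) policy evaluation."""
    V = [0.0] * n_states
    s = start_state
    for _ in range(n_steps):
        a = select_action(s)
        r, s_next, done = env_step(s, a)
        delta = (r - V[s]) if done else (r + gamma * V[s_next] - V[s])  # error Δ_t
        V[s] += alpha * delta                                           # V(s_t) += α Δ_t
        s = start_state if done else s_next
    return V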

Learning - Model Free
Optimal control: off-policy
Learn the Q function online:
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α [r_t + γ V_t(s_{t+1}) − Q_t(s_t,a_t)], where V_t(s) = max_a Q_t(s,a).
OFF POLICY: Q-Learning
– Any underlying policy selects the actions.
– Assumes every state-action pair is performed infinitely often.
– Learning rate dependency.
– Convergence in the limit: GUARANTEED [DW,JJS,S,TS]
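
A tabular Q-learning sketch, using the same hypothetical env_step interface as above and a uniformly random behavior policy (any sufficiently exploring policy would do, per the slide):

import random

def q_learning(env_step, n_states, n_actions, n_steps,
               alpha=0.1, gamma=0.95, start_state=0):
    """Off-policy Q-learning with a random behavior policy."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = start_state
    for _ in range(n_steps):
        a = random.randrange(n_actions)                       # any underlying policy
        r, s_next, done = env_step(s, a)
        target = r if done else r + gamma * max(Q[s_next])    # r_t + γ max_a Q(s_{t+1}, a)
        Q[s][a] += alpha * (target - Q[s][a])
        s = start_state if done else s_next
    return Q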

Learning - Model Free
Optimal control: on-policy
Learn the Q function online:
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α [r_t + γ Q_t(s_{t+1},a_{t+1}) − Q_t(s_t,a_t)]
ON POLICY: SARSA
– a_{t+1} is chosen by the ε-greedy policy for Q_t.
– The policy itself selects the actions!
– Need to balance exploration and exploitation.
– Convergence in the limit: GUARANTEED [DW,JJS,S,TS]
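
A SARSA sketch with ε-greedy action selection; env_step is the same hypothetical interface as before, and the fixed ε is only illustrative:

import random

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """With probability eps explore; otherwise exploit the current Q estimates."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def sarsa(env_step, n_states, n_actions, n_steps,
          alpha=0.1, gamma=0.95, eps=0.1, start_state=0):
    """On-policy SARSA: the ε-greedy policy for Q selects the actions it learns about."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = start_state
    a = epsilon_greedy(Q, s, n_actions, eps)
    for _ in range(n_steps):
        r, s_next, done = env_step(s, a)
        if done:
            Q[s][a] += alpha * (r - Q[s][a])
            s = start_state
            a = epsilon_greedy(Q, s, n_actions, eps)
            continue
        a_next = epsilon_greedy(Q, s_next, n_actions, eps)
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next
    return Q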

Learning - Model Free
Policy evaluation: TD(λ)
Again: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our “estimation error” is Δ_t = r_t + γ V(s_{t+1}) − V(s_t).
Update every state s: V_{t+1}(s) = V_t(s) + α Δ_t e(s)
Update of the eligibility trace e(s):
– When visiting s: incremented by 1: e(s) = e(s) + 1
– For all s: decayed by λ every step: e(s) = λ e(s)
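
A TD(λ) sketch with eligibility traces, following the updates above (the decay factor is taken to be λ alone, as the slide suggests; other formulations decay the trace by γλ). env_step and select_action are the same hypothetical interfaces as in the TD(0) sketch:

def td_lambda(env_step, select_action, n_states, n_steps,
              alpha=0.1, gamma=0.95, lam=0.8, start_state=0):
    """Online TD(λ) policy evaluation with eligibility traces."""
    V = [0.0] * n_states
    e = [0.0] * n_states
    s = start_state
    for _ in range(n_steps):
        a = select_action(s)
        r, s_next, done = env_step(s, a)
        delta = (r - V[s]) if done else (r + gamma * V[s_next] - V[s])
        e[s] += 1.0                          # visiting s: e(s) = e(s) + 1
        for x in range(n_states):
            V[x] += alpha * delta * e[x]     # update every state
            e[x] *= lam                      # decay all traces each step
        if done:
            e = [0.0] * n_states             # reset traces between episodes
            s = start_state
        else:
            s = s_next
    return V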

Summary
Markov Decision Process: mathematical model.
Planning algorithms.
Learning algorithms:
– Model based
– Monte Carlo
– TD(0)
– Q-Learning
– SARSA
– TD(λ)