Decision Making Under Uncertainty Lec #8: Reinforcement Learning UIUC CS 598: Section EA Professor: Eyal Amir Spring Semester 2006 Most slides by Jeremy.

Slides:



Advertisements
Similar presentations
Reinforcement learning
Advertisements

Markov Decision Process
Reinforcement Learning
Monte-Carlo Methods Learning methods averaging complete episodic returns Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]
INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN © The MIT Press, Lecture.
1 Slides for the book: Probabilistic Robotics Authors: Sebastian Thrun Wolfram Burgard Dieter Fox Publisher: MIT Press, Web site for the book & more.
1 Reinforcement Learning Introduction & Passive Learning Alan Fern * Based in part on slides by Daniel Weld.
1 Temporal-Difference Learning Week #6. 2 Introduction Temporal-Difference (TD) Learning –a combination of DP and MC methods updates estimates based on.
Partially Observable Markov Decision Process By Nezih Ergin Özkucur.
COSC 878 Seminar on Large Scale Statistical Machine Learning 1.
Planning under Uncertainty
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Università di Milano-Bicocca Laurea Magistrale in Informatica Corso di APPRENDIMENTO E APPROSSIMAZIONE Lezione 6 - Reinforcement Learning Prof. Giancarlo.
Reinforcement Learning Tutorial
Markov Decision Processes CSE 473 May 28, 2004 AI textbook : Sections Russel and Norvig Decision-Theoretic Planning: Structural Assumptions.
Bayesian Reinforcement Learning with Gaussian Processes Huanren Zhang Electrical and Computer Engineering Purdue University.
An Introduction to Reinforcement Learning (Part 1) Jeremy Wyatt Intelligent Robotics Lab School of Computer Science University of Birmingham
Reinforcement Learning
Reinforcement Learning Mitchell, Ch. 13 (see also Barto & Sutton book on-line)
לביצוע מיידי ! להתחלק לקבוצות –2 או 3 בקבוצה להעביר את הקבוצות – היום בסוף השיעור ! ספר Reinforcement Learning – הספר קיים online ( גישה מהאתר של הסדנה.
Robot Learning Jeremy Wyatt School of Computer Science University of Birmingham.
Machine Learning Lecture 11: Reinforcement Learning
Chapter 6: Temporal Difference Learning
Chapter 6: Temporal Difference Learning
Reinforcement Learning: Learning algorithms Yishay Mansour Tel-Aviv University.
Reinforcement Learning Yishay Mansour Tel-Aviv University.
Reinforcement Learning (1)
Making Decisions CSE 592 Winter 2003 Henry Kautz.
Exploration in Reinforcement Learning Jeremy Wyatt Intelligent Robotics Lab School of Computer Science University of Birmingham, UK
INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
1 Reinforcement Learning: Learning algorithms Function Approximation Yishay Mansour Tel-Aviv University.
REINFORCEMENT LEARNING LEARNING TO PERFORM BEST ACTIONS BY REWARDS Tayfun Gürel.
Reinforcement Learning
1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 7: Finite Horizon MDPs, Dynamic Programming Dr. Itamar Arel College of Engineering.
Model-based Bayesian Reinforcement Learning in Partially Observable Domains by Pascal Poupart and Nikos Vlassis (2008 International Symposium on Artificial.
Q-learning Watkins, C. J. C. H., and Dayan, P., Q learning,
Introduction to Reinforcement Learning Dr Kathryn Merrick 2008 Spring School on Optimisation, Learning and Complexity Friday 7 th.
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Hierarchical Reinforcement Learning Using Graphical Models Victoria Manfredi and.
UIUC CS 498: Section EA Lecture #21 Reasoning in Artificial Intelligence Professor: Eyal Amir Fall Semester 2011 (Some slides from Kevin Murphy (UBC))
Privacy-Preserving Bayes-Adaptive MDPs CS548 Term Project Kanghoon Lee, AIPR Lab., KAIST CS548 Advanced Information Security Spring 2010.
Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.
Reinforcement Learning Ata Kaban School of Computer Science University of Birmingham.
Solving POMDPs through Macro Decomposition
© D. Weld and D. Fox 1 Reinforcement Learning CSE 473.
Reinforcement Learning
Reinforcement Learning Yishay Mansour Tel-Aviv University.
A Tutorial on the Partially Observable Markov Decision Process and Its Applications Lawrence Carin June 7,2006.
CMSC 471 Fall 2009 Temporal Difference Learning Prof. Marie desJardins Class #25 – Tuesday, 11/24 Thanks to Rich Sutton and Andy Barto for the use of their.
INTRODUCTION TO Machine Learning
Reinforcement Learning Eligibility Traces 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.
CHAPTER 16: Reinforcement Learning. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 Introduction Game-playing:
MDPs (cont) & Reinforcement Learning
Model Minimization in Hierarchical Reinforcement Learning Balaraman Ravindran Andrew G. Barto Autonomous Learning Laboratory.
Retraction: I’m actually 35 years old. Q-Learning.
Reinforcement Learning Based on slides by Avi Pfeffer and David Parkes.
Reinforcement Learning Elementary Solution Methods
Reinforcement Learning: Learning algorithms Yishay Mansour Tel-Aviv University.
Decision Making Under Uncertainty Lec #9: Approximate Value Function UIUC CS 598: Section EA Professor: Eyal Amir Spring Semester 2006 Some slides by Jeremy.
CS 5751 Machine Learning Chapter 13 Reinforcement Learning1 Reinforcement Learning Control learning Control polices that choose optimal actions Q learning.
Chapter 6: Temporal Difference Learning
Reinforcement learning (Chapter 21)
An Overview of Reinforcement Learning
Reinforcement learning
Hierarchical POMDP Solutions
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Chapter 6: Temporal Difference Learning
CS 416 Artificial Intelligence
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 7
Presentation transcript:

Decision Making Under Uncertainty Lec #8: Reinforcement Learning UIUC CS 598: Section EA Professor: Eyal Amir Spring Semester 2006 Most slides by Jeremy Wyatt (U Birmingham)

So Far MDPs –Known transition model –Known rewards

Return: A long term measure of performance Let’s assume our agent acts according to some rules, called a policy,  The return R t is a measure of long term reward collected after time t r9r9 r5r5 r4r4 r1r1

Bellman equations Conditional independence allows us to define expected return in terms of a recurrence relation: where and where s  (s)

Today: Reinforcement Learning (RL) Learning from punishments and rewards Agent moves through world, observing states and rewards Adapts its behaviour to maximise some function of reward Issues: –Unknown transition model –Action selection while finding optimal policy

Two types of RL We can bootstrap using explicit knowledge of P and R (Dynamic Programming) Or we can bootstrap using samples from P and R (Temporal Difference learning) stst s t+1 r t+1 s 435  (s) a t =  (s t )

TD(0) learning t:=0  is the policy to be evaluated Initialise arbitrarily for all Repeat select an action a t from  (s t ) observe the transition update according to t:=t+1 stst s t+1 r t+1 a t

On and Off policy learning On policy: evaluate the policy you are following, e.g. TD learning Off-policy: evaluate one policy while following another policy E.g. One step Q-learning stst s t+1 r t+1 a t s t+1 atat stst r t+1

Off policy learning of control Q-learning is powerful because –it allows us to evaluate   –while taking non-greedy actions (explore)  -greedy is a simple and popular exploration rule: take a greedy action with probability  Take a random action with probability 1-  Q-learning is guaranteed to converge for MDPs (with the right exploration policy) Is there a way of finding   with an on-policy learner?

On policy learning of control: Sarsa Discard the max operator Learn about the policy you are following Change the policy gradually Guaranteed to converge for Greedy in the Limit Infinite Exploration Policies atat s t+1 stst a t+1 r t+1

On policy learning of control: Sarsa t:=0 Initialise arbitrarily for all select an action a t from explore( ) Repeat observe the transition select an action a t+1 from explore( ) update according to t:=t+1 atat s t+1 stst r t+1

Summary: TD, Q-learning, Sarsa TD learning One step Q-learning Sarsa learning stst r t+1 a t s t+1 atat stst r t+1 s t+1 atat stst a t+1 r t+1

We add an eligibility for each state: where Update in every state proportional to the eligibility Speeding up learning: Eligibility traces, TD(  TD learning only passes the TD error to one state a t-2 a t-1 atat r t-1 rtrt r t+1 s t-2 s t-1 stst s t+1

Eligibility traces for learning control: Q( ) There are various eligibility trace methods for Q-learning Update for every s,a pair Pass information backwards through a non-greedy action Lose convergence guarantee of one step Q-learning Watkin’s Solution: zero all eligibilities after a non-greedy action Problem: you lose most of the benefit of eligibility traces r t+1 s t+1 rtrt a t-1 atat a t+1 s t-1 stst a greedy

Eligibility traces for learning control: Sarsa( ) Solution: use Sarsa since it’s on policy Update for every s,a pair Keeps convergence guarantees atat a t+1 a t+2 r t+1 r t+2 stst s t+1 s t+2

Hot Topics in Reinforcement Learning Efficient Exploration and Optimal learning Learning with structured models (eg. Bayes Nets) Learning with relational models Learning in continuous state and action spaces Hierarchical reinforcement learning Learning in processes with hidden state (eg. POMDPs) Policy search methods

Reinforcement Learning: key papers Overviews R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, J. Wyatt, Reinforcement Learning: A Brief Overview. Perspectives on Adaptivity and Learning. Springer Verlag, L.Kaelbling, M.Littman and A.Moore, Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4: , Value Function Approximation D. Bersekas and J.Tsitsiklis. Neurodynamic Programming. Athena Scientific, Eligibility Traces S.Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22: , 1996.

Reinforcement Learning: key papers Structured Models and Planning C. Boutillier, T. Dean and S. Hanks. Decision Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, R. Dearden, C. Boutillier and M.Goldsmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, B. Sallans. Reinforcement Learning for Factored Markov Decision Processes Ph.D. Thesis, Dept. of Computer Science, University of Toronto, K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.

Reinforcement Learning: key papers Policy Search R. Williams. Simple statistical gradient algorithms for connectionist reinforcement learning. Machine Learning, 8: R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, Hierarchical Reinforcement Learning R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112: R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD Thesis, University of California, Berkeley, A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Systems Journal 13: 41-77, 2003.

Reinforcement Learning: key papers Exploration N. Meuleau and P.Bourgnine. Exploration of multi-state environments: Local Measures and back-propagation of uncertainty. Machine Learning, 35: , J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of 18 th International Conference on Machine Learning, POMDPs L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.