Introduction to Reinforcement Learning
Dr Kathryn Merrick
2008 Spring School on Optimisation, Learning and Complexity
Friday 7th November, 15:30-17:00

Reinforcement Learning is… learning from trial and error and reward, by interaction with an environment.

Today’s Lecture
A formal framework: Markov Decision Processes
Optimality criteria
Value functions
Solution methods: Q-learning
Examples and exercises
Alternative models
Summary and applications

Markov Decision Processes
The reinforcement learning problem can be represented as:
A set S of states {s1, s2, s3, …}
A set A of actions {a1, a2, a3, …}
A transition function T: S x A → S (deterministic) or T: S x A x S → [0, 1] (stochastic)
A reward function R: S x A → Real or R: S x A x S → Real
A policy π: S → A (deterministic) or π: S x A → [0, 1] (stochastic)
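
As an illustration only (not part of the original slides), a stochastic MDP of this form can be captured with plain Python data structures; every state name, action name, probability and reward below is a made-up placeholder.

```python
# A minimal sketch of a stochastic MDP; all values are illustrative placeholders.

states = ["s1", "s2"]
actions = ["a1", "a2"]

# T[(s, a)] maps each next state s' to the probability T(s, a, s').
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# Reward for the transition from s to s' under action a; unlisted transitions give 0.
R = {("s1", "a1", "s2"): 1.0, ("s2", "a1", "s1"): 1.0}

def reward(s, a, s_next):
    return R.get((s, a, s_next), 0.0)

# A deterministic policy maps each state to an action;
# a stochastic policy would map each (state, action) pair to a probability.
policy = {"s1": "a1", "s2": "a1"}
```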

Optimality Criteria
Suppose an agent receives a reward rt at time t. Then optimal behaviour might maximise the sum of expected future reward, for example:
Maximise over a finite horizon h: E[ rt + rt+1 + … + rt+h ]
Maximise over an infinite horizon: E[ rt + rt+1 + rt+2 + … ]
Maximise over a discounted infinite horizon: E[ Σ k≥0 γ^k rt+k ], with 0 ≤ γ < 1
Maximise average reward: lim h→∞ (1/h) E[ rt + rt+1 + … + rt+h ]
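
To make the discounted and average-reward criteria concrete, here is a tiny illustrative calculation (the reward sequence and γ = 0.9 are made-up values, not from the slides):

```python
# Illustrative only: evaluating the discounted and average-reward criteria
# for a short, made-up reward sequence.

gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
average_reward = sum(rewards) / len(rewards)

print(discounted_return)  # 0.9**2 + 0.9**4 = 1.4661
print(average_reward)     # 2 / 5 = 0.4
```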

Value Functions
State value function Vπ: S → Real, written Vπ(s): the expected sum of discounted reward for following the policy π from state s to the end of time.
State-action value function Qπ: S x A → Real, written Qπ(s, a): the expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from the next state s’ to the end of time.

Optimal State Value Function
V*(s) = max_a E{ R(s, a, s’) + γ V*(s’) | s, a }
      = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
A Bellman equation
Can be solved using dynamic programming
Requires knowledge of the transition function T
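
As a sketch only (not from the slides), value iteration applies this Bellman backup repeatedly until the values stop changing. It assumes the illustrative `states`, `actions`, `T` and `reward` structures sketched earlier.

```python
# Minimal value-iteration sketch for the Bellman optimality equation above.

def value_iteration(states, actions, T, reward, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Backed-up value of each action: sum over s' of T(s,a,s') [R + gamma V(s')].
            q = [
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for s2, p in T.get((s, a), {}).items())
                for a in actions
            ]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```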

Optimal State-Action Value Function
Q*(s, a) = E{ R(s, a, s’) + γ max_a’ Q*(s’, a’) | s, a }
         = Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ max_a’ Q*(s’, a’) ]
Also a Bellman equation
Also requires knowledge of the transition function T to solve using dynamic programming
Can now define action selection: π*(s) = argmax_a Q*(s, a)
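
Continuing the same illustrative Python setup (not the slides' own code), the greedy policy π* can be read off a computed Q* table like this:

```python
# Extract the greedy policy pi*(s) = argmax_a Q*(s, a) from a dictionary Q
# keyed by (state, action). Purely illustrative.

def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}
```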

A Possible Application…

Solution Methods
Model-based:
– For example dynamic programming
– Require a model (transition function) of the environment for learning
Model-free:
– Learn from interaction with the environment without requiring a model
– For example Q-learning…

Q-Learning by Example: Driving in Canberra
[State diagram: the four states Parked Clean, Parked Dirty, Driving Clean and Driving Dirty, connected by transitions labelled with the actions Drive, Park and Clean.]

Formulating the Problem
States: s1 Park clean, s2 Park dirty, s3 Drive clean, s4 Drive dirty
Actions: a1 Drive, a2 Clean, a3 Park
Reward: rt = 1 for transitions to a ‘clean’ state, 0 otherwise
State-Action Table or Q-Table (entries initially unknown):
      a1   a2   a3
s1    ?    ?    ?
s2    ?    ?    ?
s3    ?    ?    ?
s4    ?    ?    ?
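
One possible way to encode this Q-table in Python (a sketch; the slides offer Matlab code on request, and the zero initialisation and state/action identifiers here are assumptions):

```python
# Illustrative encoding of the Canberra driving problem's Q-table.

states = ["s1_park_clean", "s2_park_dirty", "s3_drive_clean", "s4_drive_dirty"]
actions = ["a1_drive", "a2_clean", "a3_park"]

# Unknown entries are initialised to 0.0 (an assumption, not from the slides).
Q = {(s, a): 0.0 for s in states for a in actions}

def reward(s_next):
    # r_t = 1 for transitions into a 'clean' state, 0 otherwise.
    return 1.0 if "clean" in s_next else 0.0
```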

A Q-Learning Agent
[Agent-environment loop: at each step the agent receives the state st and reward rt from the environment, performs a learning update to its policy πt, and uses action selection from πt to send an action at back to the environment.]

Q-Learning Algorithmic Components
Learning update (to Q-Table):
Q(s, a) ← (1-α) Q(s, a) + α [r + γ max_a’ Q(s’, a’)]
or equivalently
Q(s, a) ← Q(s, a) + α [r + γ max_a’ Q(s’, a’) - Q(s, a)]
Action selection (from Q-Table):
a = f(Q(s, a)), for example greedy or ε-greedy selection
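
Putting the two components together, a minimal tabular Q-learning loop might look like the following sketch. The environment interface (`env.reset()` / `env.step(a)` returning a next state, a reward and a done flag), the ε-greedy selection and the parameter values are assumptions, not from the slides.

```python
import random

# Minimal tabular Q-learning sketch; `env` is an assumed interface.

def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection from the Q-table.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```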

Matlab Code Available on Request

Exercise
You need to program a small robot to learn to find food.
– What assumptions will you make about the robot’s sensors and actuators to represent the environment?
– How could you model the problem as an MDP?
– Calculate a few learning iterations in your domain by hand.

Alternatives
Function approximation of the Q-table:
– Neural networks
– Decision trees
– Gradient descent methods
Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning
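
As one hedged illustration of the function-approximation idea (the slides do not give code for this), a linear approximation of Q trained with a gradient-descent step might look like the sketch below; the feature map `phi` and all numbers are made-up placeholders.

```python
import numpy as np

# Replace the Q-table with a linear function Q(s, a) ≈ w · phi(s, a),
# trained by a semi-gradient Q-learning step.

def linear_q_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    target = r + gamma * max(w @ phi(s_next, b) for b in actions)
    td_error = target - w @ phi(s, a)
    return w + alpha * td_error * phi(s, a)

# Toy usage with a 3-dimensional feature vector (purely illustrative):
phi = lambda s, a: np.array([1.0, float(s), float(s == a)])
w = np.zeros(3)
w = linear_q_update(w, phi, s=1, a=1, r=1.0, s_next=0, actions=[0, 1])
```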

A final application…

References and Further Reading
Sutton, R. and Barto, A. (2000) Reinforcement Learning: An Introduction, The MIT Press.
Kaelbling, L., Littman, M. and Moore, A. (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4:237-285.
Barto, A. and Mahadevan, S. (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77.