Markov Decision Processes & Reinforcement Learning Megan Smith Lehigh University, Fall 2006

Outline: Stochastic Process, Markov Property, Markov Chain, Markov Decision Process, Reinforcement Learning, RL Techniques, Example Applications

Stochastic Process. Quick definition: a random process, often viewed as a collection of indexed random variables. Useful to us: a set of states, with probabilities of being in those states, indexed over time. We'll deal with discrete stochastic processes.

Stochastic Process Example. Classic: the random walk. Start at state X_0 at time t_0. At each time t_i, move a step Z_i, where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p. At time t_i, the state is X_i = X_0 + Z_1 + … + Z_i.
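
As a quick illustration, here is a minimal simulation of that walk (a sketch; the choices p = 0.5 and X_0 = 0 are assumptions, since the slide leaves them open):

```python
import random

def random_walk(steps, p=0.5, x0=0):
    """Simulate the random walk described above.

    Each step Z_i is -1 with probability p and +1 with probability 1 - p;
    the state is the running sum X_i = X_0 + Z_1 + ... + Z_i.
    """
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk(10))
```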

Markov Property. Also thought of as the "memoryless" property. A stochastic process is said to have the Markov property if the probability of state X_{n+1} taking any given value depends only upon state X_n. Whether this holds depends very much on how the states are described.

Markov Property Example. Checkers: the current state is the current configuration of the board, which contains all the information needed for the transition to the next state. Thus each configuration can be said to have the Markov property.

Markov Chain. A discrete-time stochastic process with the Markov property. Industry example: Google's PageRank algorithm, which computes a probability distribution representing the likelihood that random link-following ends up on a given page.
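
That distribution is the stationary distribution of the random-surfer Markov chain, which can be approximated by repeatedly applying the chain's transition matrix. A minimal sketch (the three-page link structure and the 0.85 damping factor are illustrative assumptions, not details from the slide):

```python
import numpy as np

# Hypothetical 3-page web: links[i][j] = 1 if page i links to page j.
links = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)

# Row-normalize to get the transition matrix of the random-surfer Markov chain.
P = links / links.sum(axis=1, keepdims=True)

d = 0.85                      # damping factor (assumed value)
n = P.shape[0]
G = d * P + (1 - d) / n       # uniform teleportation keeps the chain well-behaved

# Power iteration: repeatedly apply the chain until the distribution settles.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = rank @ G
print(rank)                   # stationary distribution, i.e. the PageRank scores
```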

Markov Decision Process (MDP). A discrete-time stochastic control process; an extension of Markov chains. Differences: the addition of actions (choice) and of rewards (motivation). If the action taken in each state is fixed, an MDP reduces to a Markov chain.

Description of MDPs. Tuple (S, A, P(·,·), R(·)): S -> state space; A -> action space; P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a); R(s) = immediate reward at state s. The goal is to maximize some cumulative function of the rewards. Finite MDPs have finite state and action spaces.

Simple MDP Example: the recycling robot. The robot can search for a trashcan, wait for someone to bring it a trashcan, or go home and recharge its battery. It has two energy levels, high and low. Searching runs down the battery, waiting does not, and a depleted battery has a very low reward.

Transition Probabilities (s = s_t, s' = s_{t+1}, a = a_t):

s      s'     a         P^a_ss'   R^a_ss'
high   high   search    α         R_search
high   low    search    1 - α     R_search
low    high   search    1 - β     -3
low    low    search    β         R_search
high   high   wait      1         R_wait
high   low    wait      0         R_wait
low    high   wait      0         R_wait
low    low    wait      1         R_wait
low    high   recharge  1         0
low    low    recharge  0         0
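
The same table can be encoded directly as data. A sketch (the numeric values chosen for α, β, R_search, and R_wait are placeholders; the slide leaves them symbolic):

```python
ALPHA, BETA = 0.9, 0.6          # placeholder transition probabilities
R_SEARCH, R_WAIT = 2.0, 1.0     # placeholder rewards (searching pays more than waiting)

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
```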

Transition Graph. [Figure: transition graph of the recycling-robot MDP, with state nodes and action nodes.]

Solution to an MDP = Policy π. A policy gives the action to take from a given state, regardless of history. The standard solution methods keep two arrays indexed by state: V, the value function, namely the expected discounted sum of rewards obtained on average by following the policy, and π, the action to be taken in each state (the policy itself). The two basic steps alternate between updating π and updating V via V(s) := R(s) + γ ∑_s' P_{π(s)}(s, s') V(s').
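
Written out, the two interleaved updates are (a standard formulation, consistent with the R(s) reward convention above):

```latex
\pi(s) := \arg\max_{a} \sum_{s'} P_{a}(s, s')\, V(s')
\qquad
V(s) := R(s) + \gamma \sum_{s'} P_{\pi(s)}(s, s')\, V(s')
```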

Variants: value iteration, policy iteration, modified policy iteration, prioritized sweeping. All of them rearrange the same two basic steps: the value-function update and the policy update.

Value Iteration. [Table: value estimates V_k(PU), V_k(PF), V_k(RU), V_k(RF) across iterations k.] The backup used at each iteration is V(s) = R(s) + γ max_a ∑_s' P_a(s, s') V(s').
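
A compact value-iteration sketch on the recycling-robot MDP from the earlier table (same placeholder numbers; it uses the per-transition rewards of that table rather than the slide's R(s) convention, and is not the example behind the PU/PF/RU/RF table):

```python
GAMMA = 0.9                     # assumed discount factor
ALPHA, BETA = 0.9, 0.6          # placeholder transition probabilities
R_SEARCH, R_WAIT = 2.0, 1.0     # placeholder rewards

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
states = ["high", "low"]

def actions(s):
    return [a for (st, a) in transitions if st == s]

V = {s: 0.0 for s in states}
for _ in range(100):            # sweep until approximately converged
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions[(s, a)])
                for a in actions(s))
         for s in states}

print(V)                        # approximate optimal state values
```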

Why So Interesting? If the transition probabilities are known, solving an MDP is a straightforward computational problem. If the transition probabilities are unknown, however, it becomes a problem for reinforcement learning.

Typical Agent In reinforcement learning (RL), the agent observes a state and takes an action. Afterward, the agent receives a reward.

Mission: Optimize Reward. Rewards are calculated in the environment and are used to teach the agent how to reach a goal state. They must signal what we ultimately want achieved, not necessarily subgoals, and they may be discounted over time. In general, the agent seeks to maximize the expected return.

Value Functions. V^π is the state-value function for policy π (how good is it to be in this state?). V^π is the unique solution to its Bellman equation, which expresses the relationship between the value of a state and the values of its successor states.
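
For reference, the Bellman equation for the state-value function under policy π, in the standard Sutton & Barto notation:

```latex
V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```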

Another Value Function. Q^π(s, a) defines the value of taking action a in state s under policy π: the expected return starting from s, taking action a, and thereafter following π. Q^π is the action-value function for policy π. [Figure: backup diagrams for (a) V^π and (b) Q^π.]
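
Written out in the same notation, the standard definition is:

```latex
Q^{\pi}(s, a) = E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]
```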

Dynamic Programming. Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP. The classical view is of limited practical use, since we rarely have a perfect environment model, but it provides the foundation for other methods. DP is not practical for large problems.

DP Continued… Use value functions to organize and structure the search for good policies. Turn Bellman equations into update rules. Iterative policy evaluation uses full backups.
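
As an example of turning a Bellman equation into an update rule, a minimal iterative-policy-evaluation sketch for a fixed policy on the recycling-robot MDP (same placeholder numbers as before; the chosen policy is arbitrary):

```python
GAMMA = 0.9
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

policy = {"high": "search", "low": "recharge"}   # an arbitrary fixed policy

V = {"high": 0.0, "low": 0.0}
while True:
    delta = 0.0
    for s, a in policy.items():
        # Full backup: expectation over all possible next states.
        v_new = sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions[(s, a)])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:
        break

print(V)                        # V^pi for the fixed policy
```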

Policy Improvement. When should we change the policy? If picking a new action a' in state s and thereafter following the current policy π gives an expected return at least as large as V^π(s), then the policy that picks a' in s (and otherwise follows π) is at least as good overall. This follows from the policy improvement theorem.

Policy Iteration. Keep improving the policy π and recalculating V^π. A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations.
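
Putting the two previous sketches together, a compact policy-iteration loop for the same recycling-robot MDP (placeholder numbers again):

```python
GAMMA = 0.9
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
states = ["high", "low"]

def actions(s):
    return [a for (st, a) in transitions if st == s]

def q(s, a, V):
    """Expected return of taking a in s and then following the values in V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions[(s, a)])

policy = {"high": "wait", "low": "wait"}          # arbitrary initial policy
while True:
    # Policy evaluation (iterative, full backups).
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: q(s, policy[s], V) for s in states}
    # Policy improvement: act greedily with respect to V.
    new_policy = {s: max(actions(s), key=lambda a: q(s, a, V)) for s in states}
    if new_policy == policy:
        break
    policy = new_policy

print(policy, V)
```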

Remember Value Iteration? It truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its iterations.

Monte Carlo Methods. Require only episodic experience, on-line or simulated. Based on averaging sample returns. Value estimates and policies are changed only at the end of each episode, not on a step-by-step basis.

Policy Evaluation. Compute average returns as episodes are generated. Two methods: first-visit and every-visit; first-visit is the most widely studied. [Algorithm: first-visit MC method.]
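
A sketch of first-visit MC policy evaluation (the generate_episode(policy) helper is hypothetical and assumed to return one complete episode as a list of (state, reward) pairs, with the reward being the one received after leaving that state):

```python
from collections import defaultdict

GAMMA = 1.0   # assume an undiscounted episodic task

def first_visit_mc(generate_episode, policy, num_episodes=10_000):
    """First-visit Monte Carlo policy evaluation."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)
        # Work backward to get the return G following each time step.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = GAMMA * G + reward
            step_returns.append((state, G))
        step_returns.reverse()
        # Average G only over the first visit to each state in the episode.
        seen = set()
        for state, G in step_returns:
            if state not in seen:
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```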

Estimation of Action Values. State values are not enough without a model; we need action values as well. Q^π(s, a) is the expected return when starting in state s, taking action a, and thereafter following policy π. Exploration vs. exploitation. Exploring starts.
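
On the exploration side, a tiny ε-greedy action-selection helper (a common choice; ε = 0.1 is an assumed value):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action.

    Q is assumed to be a dict mapping (state, action) pairs to value estimates.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```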

Example Monte Carlo Algorithm First-visit Monte Carlo assuming exploring starts

Another MC Algorithm On-line, first-visit, ε-greedy MC without exploring starts

Temporal-Difference Learning. Central and novel to reinforcement learning; combines Monte Carlo and DP ideas. Can learn from experience without a model, like MC. Updates estimates based on other learned estimates (it bootstraps), like DP.

TD(0) Simplest TD method Uses sample backup from single successor state or state-action pair instead of full backup of DP methods
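
The TD(0) backup written as code (the standard update rule; the step size α = 0.1 and γ = 0.9 are assumed values):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) backup: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    V is assumed to be a dict mapping states to value estimates.
    """
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
```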

SARSA – On-policy Control. Quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). Continually estimate Q^π while changing π.
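
The corresponding update, as a sketch (α and γ values assumed, as above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA backup from the quintuple (s, a, r, s', a'):
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)].

    Q is assumed to be a dict mapping (state, action) pairs to value estimates.
    """
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```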

Q-Learning – Off-policy Control. The learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed.
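
The Q-learning update differs from SARSA only in using the maximizing next action rather than the action actually taken. A sketch (assumed α and γ as before):

```python
def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)].

    Q is assumed to be a dict mapping (state, action) pairs to value estimates;
    next_actions lists the actions available in s_next.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in next_actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```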

Case Study: Job-shop Scheduling. Temporal and resource constraints; the task is to find constraint-satisfying schedules of short duration. In its general form the problem is NP-complete.

NASA Space Shuttle Payload Processing Problem (SSPPP). Schedule the tasks required for installation and testing of shuttle cargo-bay payloads. Typical problems involve 2-6 shuttle missions, each requiring its own set of tasks. Zhang and Dietterich (1995, 1996; Zhang, 1996): the first successful instance of RL applied in plan-space, where states = complete plans and actions = plan modifications.

SSPPP – continued… States were entire schedules. Two types of actions: REASSIGN-POOL operators, which reassign a resource to a different pool, and MOVE operators, which move a task to the first earlier or later time at which its resource constraints are satisfied. A small negative reward is given for each step, and a resource dilation factor (RDF) formula rewards the final schedule's duration.

Even More SSPPP… Used TD(λ) to learn the value function. Actions were selected by a decreasing-ε greedy policy with one-step lookahead. Function approximation used multilayer neural networks. Training generally took 10,000 episodes. Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!

RL and CBR. Example: CBR used to store various policies, with RL used to learn and modify those policies (Ashwin Ram and Juan Carlos Santamaría, 1993, autonomous robotic control). Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make. Similar methods can be used for IDSS.

References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes.
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.