An Introduction to Markov Decision Processes
Sarah Hickmott



Decision Theory
Probability Theory + Utility Theory = Decision Theory
Probability theory describes what an agent should believe based on evidence; utility theory describes what an agent wants; decision theory describes what an agent should do. MDPs fall under the umbrella of decision theory.

Markov Assumption
Andrei Markov (1913)
kth-order Markov process: the next state's conditional probability depends only on a finite history of the k previous states.
1st-order Markov process: the next state's conditional probability depends only on the immediately previous state.
The definitions are equivalent: any kth-order Markov process can be converted into a 1st-order one by augmenting the state to include the last k states. Hence any algorithm that makes the 1st-order Markov assumption can be applied to any Markov process.
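In symbols, writing x_t for the state at time t (the notation the deck introduces later), the two properties are:

```latex
% kth-order Markov assumption: only the last k states matter
P(x_{t+1} \mid x_t, x_{t-1}, \dots, x_0) = P(x_{t+1} \mid x_t, \dots, x_{t-k+1})

% 1st-order Markov assumption: only the current state matters
P(x_{t+1} \mid x_t, x_{t-1}, \dots, x_0) = P(x_{t+1} \mid x_t)
```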

Markov Decision Process
The specification of a sequential decision problem for a fully observable environment that satisfies the Markov Assumption and yields additive costs.

Markov Decision Process
An MDP has:
  A set of states S = {s1, s2, …, sN}
  A set of actions A = {a1, a2, …, aM}
  A real-valued cost function g(s, a)
  A transition probability function p(s' | s, a)
Note: we assume the stationary Markov transition property, i.e. the effect of an action is independent of the time at which it is taken.
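To make the components concrete, here is a minimal Python sketch of one way to represent such an MDP. The sizes and all numbers are illustrative assumptions, not values from the deck:

```python
import numpy as np

# 3 states, 2 actions (hypothetical sizes for illustration)
n_states, n_actions = 3, 2

# Cost function g(s, a): one row per state, one column per action
g = np.array([[1.0, 4.0],
              [2.0, 0.5],
              [0.0, 3.0]])

# Transition probabilities p(s' | s, a), stored as p[a, s, s'].
# Stationarity: these probabilities do not depend on the time step.
p = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]],  # action a1
    [[0.2, 0.8, 0.0], [0.5, 0.0, 0.5], [0.0, 0.3, 0.7]],  # action a2
])

# Each (a, s) row must be a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)
```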

Notation
k indexes discrete time.
xk is the state of the system at time k.
μk(xk) is the control variable selected when the system is in state xk at time k; μk : Sk → Ak.
π is a policy: π = {μ0, …, μN−1}.
π* is the optimal policy.
N is the horizon, i.e. the number of times the control is applied.
The system dynamics are xk+1 = f(xk, μk(xk)), k = 0, …, N−1.

Policy
A policy is a mapping from states to actions. Following a policy:
1. Determine the current state xk
2. Execute action μk(xk)
3. Repeat 1-2
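A minimal rollout sketch in Python, assuming a stationary policy (an array mapping state to action) rather than the deck's time-indexed μk, and reusing the g and p arrays from the previous sketch:

```python
import numpy as np

def rollout(x0, mu, g, p, horizon, seed=0):
    """Follow policy mu for `horizon` steps; return the accumulated cost.

    mu[s] gives the action to execute in state s; g and p are the cost and
    transition arrays shaped as in the earlier sketch.
    """
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(horizon):
        a = mu[x]                              # 1-2. observe state x, execute mu(x)
        total += g[x, a]                       # pay the per-stage cost
        x = rng.choice(p.shape[2], p=p[a, x])  # sample next state from p(. | x, a)
    return total
```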

Solution to an MDP
The expected cost of a policy π = {μ0, …, μN−1} starting at state x0 is the expected sum of the per-stage costs accumulated along the trajectory it generates (written out below).
Goal: find the policy π* that specifies which action to take in each state so as to minimise this cost. The optimality condition is encapsulated by Bellman's equation (also below).
In other words: an MDP is just like a Markov chain, except that the transition matrix depends on the action taken by the decision maker (agent) at each time step, and each action in each state incurs a cost (equivalently, earns a reward). The goal is to find a policy that minimises some function of the sequence of costs, e.g. the expected sum or expected discounted sum. This can be formalised via Bellman's equation, which can be solved iteratively, for example by policy iteration. The unique fixed point of this equation is the optimal cost-to-go function, from which the optimal policy is extracted by acting greedily.
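In the deck's notation, with g and p as defined earlier, these two quantities can be written as follows (a sketch: the Bellman equation is given in its stationary, discounted form, with discount factor α an assumption):

```latex
% Expected cost of policy \pi = \{\mu_0, \dots, \mu_{N-1}\} from start state x_0
J_\pi(x_0) = \mathbb{E}\left[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k)\big) \right]

% Bellman's equation for the optimal cost-to-go
J^*(s) = \min_{a \in A} \Big[ g(s, a) + \alpha \sum_{s'} p(s' \mid s, a)\, J^*(s') \Big]
```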

Assigning Costs to Sequences
The objective cost function maps (possibly infinite) sequences of per-stage costs to single real numbers. Options:
  Set a finite horizon N and simply add the costs.
  If the horizon is infinite, i.e. N → ∞, some possibilities are:
    Discount, so that costs incurred earlier carry more weight than later ones.
    Average the cost per stage.
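In symbols (with α the discount factor, consistent with the discounted form above), the two infinite-horizon options are:

```latex
% Discounted: costs incurred earlier carry more weight
J_\pi(x_0) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \alpha^{k}\, g\big(x_k, \mu_k(x_k)\big) \right], \qquad 0 < \alpha < 1

% Average cost per stage
J_\pi(x_0) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\left[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k)\big) \right]
```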

MDP Algorithms: Value Iteration
For each state s, select an arbitrary initial value J0(s)
k = 1
while k < maximum iterations
    for each state s, find the action a that minimises
        g(s, a) + α Σs' p(s' | s, a) Jk−1(s')
    (α is the discount factor; α = 1 in the undiscounted case)
    set Jk(s) to that minimum and assign μ(s) = a
    k = k + 1
end
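A runnable Python version of the loop above, assuming the discounted setting and the g and p arrays shaped as in the earlier sketch (alpha, the tolerance, and the iteration cap are illustrative choices):

```python
import numpy as np

def value_iteration(g, p, alpha=0.9, max_iters=1000, tol=1e-6):
    """Value iteration for a cost-minimising MDP.

    g: (S, A) array of costs g(s, a)
    p: (A, S, S) array with p[a, s, t] = p(next state t | state s, action a)
    Returns the cost-to-go estimate J and a greedy policy mu.
    """
    n_states, _ = g.shape
    J = np.zeros(n_states)                   # arbitrary initial values J_0(s)
    for _ in range(max_iters):
        # Q[s, a] = g(s, a) + alpha * sum_t p(t | s, a) * J(t)
        Q = g + alpha * np.einsum("ast,t->sa", p, J)
        J_new = Q.min(axis=1)                # minimise over actions in each state
        if np.max(np.abs(J_new - J)) < tol:  # backups have converged
            J = J_new
            break
        J = J_new
    mu = Q.argmin(axis=1)                    # greedy action in each state
    return J, mu
```

For example, reusing g and p from the earlier sketch: J, mu = value_iteration(g, p).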

MDP Algorithms: Policy Iteration
Start with an arbitrary initial policy, then refine it repeatedly by alternating two steps:
  Value determination: solve the |S| simultaneous Bellman equations for the current policy's costs.
  Policy improvement: for any state, if an action exists that reduces the current estimated cost, change the policy to use it.
Each step of policy iteration is computationally more expensive than a step of value iteration. However, policy iteration typically needs fewer steps to converge than value iteration.
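A matching Python sketch, under the same assumptions as the value-iteration code; value determination solves the |S| linear equations directly with a linear solver:

```python
import numpy as np

def policy_iteration(g, p, alpha=0.9):
    """Policy iteration for the same cost-minimising, discounted MDP."""
    n_states = g.shape[0]
    mu = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Value determination: solve J(s) = g(s, mu(s))
        #   + alpha * sum_t p(t | s, mu(s)) * J(t) for all s at once.
        P_mu = p[mu, np.arange(n_states)]       # P_mu[s, t] = p(t | s, mu(s))
        g_mu = g[np.arange(n_states), mu]       # g_mu[s] = g(s, mu(s))
        J = np.linalg.solve(np.eye(n_states) - alpha * P_mu, g_mu)
        # Policy improvement: switch to any action that lowers the estimated cost
        Q = g + alpha * np.einsum("ast,t->sa", p, J)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):          # no state improved: mu is optimal
            return J, mu
        mu = mu_new
```

Each pass is more expensive than a value-iteration sweep because of the linear solve, but far fewer passes are typically needed, matching the trade-off stated above.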

MDPs and PNs
MDPs modelled by live Petri nets lead to Average Cost per Stage problems. A policy is equivalent to a trace through the net. The aim is to use the finite prefix of an unfolding to derive decentralised Bellman equations, possibly associated with local configurations, and to capture the communication between interacting parts. Initially we will assume actions and their effects are deterministic. Some work has been done on unfolding Petri nets such that concurrent events are statistically independent.