Markov Decision Processes
Definitions; stationary policies; the value iteration algorithm, the policy improvement algorithm, and linear programming for the discounted cost and average cost criteria.

Markov Decision Process
Let $X = \{X_0, X_1, \dots\}$ be a system description process on state space $E$ and let $D = \{D_0, D_1, \dots\}$ be a decision process with action space $A$. The process $(X, D)$ is a Markov decision process if, for $j \in E$ and $n = 0, 1, \dots$,
$$P\{X_{n+1} = j \mid X_0, D_0, \dots, X_n, D_n\} = P\{X_{n+1} = j \mid X_n, D_n\}.$$
Furthermore, for each $k \in A$, let $f_k$ be a cost vector and $P_k$ be a one-step transition probability matrix. Then the cost $f_k(i)$ is incurred whenever $X_n = i$ and $D_n = k$, and
$$P\{X_{n+1} = j \mid X_n = i, D_n = k\} = P_k(i, j).$$
The problem is to determine how to choose a sequence of actions in order to minimize cost.
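The ingredients above (one cost vector $f_k$ and one transition matrix $P_k$ per action) map directly onto arrays. Below is a minimal sketch, not part of the original slides, using an invented two-state, two-action example; all of the numbers are arbitrary.

```python
import numpy as np

# Hypothetical two-state (E = {0, 1}), two-action (A = {0, 1}) MDP.
# f[k][i] is the cost incurred when X_n = i and D_n = k.
# P[k][i, j] is P{X_{n+1} = j | X_n = i, D_n = k}.
f = {0: np.array([2.0, 0.5]),
     1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]]),
     1: np.array([[0.1, 0.9],
                  [0.6, 0.4]])}

# Each P[k] must be a stochastic matrix: rows sum to 1.
for k, Pk in P.items():
    assert np.allclose(Pk.sum(axis=1), 1.0), f"action {k}: rows must sum to 1"
```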

Policies
A policy is a rule that specifies which action to take at each point in time. Let $\Pi$ denote the set of all policies. In general, the decisions specified by a policy may
– depend on the current state of the system description process
– be randomized (depend on some external random event)
– also depend on past states and/or decisions
A stationary policy is defined by a (deterministic) action function that assigns an action to each state, independent of previous states, previous actions, and the time $n$. Under a stationary policy, the MDP is a Markov chain.

Cost Minimization Criteria
Since an MDP goes on indefinitely, the total cost is likely to be infinite. In order to meaningfully compare policies, two criteria are commonly used:
1. The expected total discounted cost computes the present worth of future costs using a discount factor $\alpha < 1$, such that one dollar obtained at time $n = 1$ has a present value of $\alpha$ at time $n = 0$. Typically, if $r$ is the rate of return, then $\alpha = 1/(1 + r)$. The expected total discounted cost is
$$E\left[\sum_{n=0}^{\infty} \alpha^{n} f_{D_n}(X_n)\right].$$
2. The long run average cost is
$$\lim_{N \to \infty} \frac{1}{N}\, E\left[\sum_{n=0}^{N-1} f_{D_n}(X_n)\right].$$
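To see the discounted criterion operationally, the sketch below (not from the slides) estimates it by simulating the chain under a fixed stationary policy and truncating the infinite sum; the MDP data, the policy, and the horizon are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
policy = np.array([1, 0])          # a(i): action used in state i (arbitrary choice)

def discounted_cost_mc(start, n_paths=2000, horizon=100):
    """Monte Carlo estimate of E[sum_n alpha^n f_{D_n}(X_n) | X_0 = start]."""
    total = 0.0
    for _ in range(n_paths):
        i, disc = start, 1.0
        for _ in range(horizon):               # truncate the infinite sum
            k = policy[i]
            total += disc * f[k][i]
            i = rng.choice(2, p=P[k][i])
            disc *= alpha
    return total / n_paths

print(discounted_cost_mc(start=0))
```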

Optimization with Stationary Policies
If the state space $E$ is finite, there exists a stationary policy that solves the problem of minimizing the discounted cost:
$$\min_{\pi \in \Pi}\, E_{\pi}\left[\sum_{n=0}^{\infty} \alpha^{n} f_{D_n}(X_n) \,\middle|\, X_0 = i\right].$$
If every stationary policy results in an irreducible Markov chain, there exists a stationary policy that solves the problem of minimizing the average cost:
$$\min_{\pi \in \Pi}\, \lim_{N \to \infty} \frac{1}{N}\, E_{\pi}\left[\sum_{n=0}^{N-1} f_{D_n}(X_n)\right].$$

Computing Expected Discounted Costs
Let $X = \{X_0, X_1, \dots\}$ be a Markov chain with one-step transition probability matrix $P$, let $f$ be a cost function that assigns a cost to each state of the chain, and let $\alpha$ $(0 < \alpha < 1)$ be a discount factor. Then the expected total discounted cost is
$$v = (I - \alpha P)^{-1} f.$$
Why? Starting from state $i$, the expected discounted cost can be found recursively as
$$v(i) = f(i) + \alpha \sum_{j \in E} P(i, j)\, v(j),$$
that is, $v = f + \alpha P v$, which rearranges to $(I - \alpha P)v = f$. Note that the expected discounted cost always depends on the initial state, while for the average cost criterion the initial state is unimportant.
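Since $v = f + \alpha P v$ is a linear system, computing it is one call to a linear solver. A sketch with made-up two-state data (the chain obtained by always using one action of the earlier example):

```python
import numpy as np

alpha = 0.9
P = np.array([[0.8, 0.2],     # transition matrix of the fixed chain
              [0.3, 0.7]])
f = np.array([2.0, 0.5])      # per-period cost of each state

# v = f + alpha * P v  <=>  (I - alpha * P) v = f
v = np.linalg.solve(np.eye(2) - alpha * P, f)
print(v)   # expected total discounted cost from each starting state
```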

Solution Procedures for Discounted Costs
Let $v_\alpha$ be the (vector) optimal value function whose $i$th component is
$$v_\alpha(i) = \min_{\pi \in \Pi}\, E_{\pi}\left[\sum_{n=0}^{\infty} \alpha^{n} f_{D_n}(X_n) \,\middle|\, X_0 = i\right].$$
For each $i \in E$,
$$v_\alpha(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v_\alpha(j) \right\}.$$
These equations uniquely determine $v_\alpha$. If we can somehow obtain the values $v_\alpha$ that satisfy the above equations, then the optimal policy is the vector $a$, where
$$a(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v_\alpha(j) \right\}$$
("arg min" is "the argument that minimizes").

Value Iteration for Discounted Costs
Make a guess, then keep applying the optimal value equations until the fixed point is reached.
Step 1. Choose $\epsilon > 0$, set $n = 0$, and let $v_0(i) = 0$ for each $i \in E$.
Step 2. For each $i \in E$, find $v_{n+1}(i)$ as
$$v_{n+1}(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v_n(j) \right\}.$$
Step 3. Let $\delta = \max_{i \in E} \left| v_{n+1}(i) - v_n(i) \right|$.
Step 4. If $\delta < \epsilon$, stop with $v_\alpha = v_{n+1}$. Otherwise, set $n = n + 1$ and return to Step 2.
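A direct transcription of Steps 1-4 into code, run on the same invented two-state, two-action example; $\alpha$ and $\epsilon$ are arbitrary choices.

```python
import numpy as np

alpha, eps = 0.9, 1e-8
f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
actions, n_states = sorted(f), 2

v = np.zeros(n_states)                      # Step 1: v_0 = 0
while True:
    # Step 2: v_{n+1}(i) = min_k { f_k(i) + alpha * sum_j P_k(i,j) v_n(j) }
    q = np.array([f[k] + alpha * P[k] @ v for k in actions])   # |A| x |E|
    v_next = q.min(axis=0)
    # Step 3: delta = max_i |v_{n+1}(i) - v_n(i)|
    delta = np.max(np.abs(v_next - v))
    v = v_next
    if delta < eps:                         # Step 4: stop at the fixed point
        break

policy = q.argmin(axis=0)                   # greedy policy at the fixed point
print(v, policy)
```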

Policy Improvement for Discounted Costs
Start myopic, then consider longer-term consequences.
Step 1. Set $n = 0$ and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.
Step 2. Adopt the cost vector and transition matrix of the current policy: $f(i) = f_{a_n(i)}(i)$ and $P(i, j) = P_{a_n(i)}(i, j)$.
Step 3. Find the value function $v = (I - \alpha P)^{-1} f$.
Step 4. Re-optimize:
$$a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v(j) \right\}.$$
Step 5. If $a_{n+1}(i) = a_n(i)$ for every $i$, then stop with $v_\alpha = v$ and $a^*(i) = a_n(i)$. Otherwise, set $n = n + 1$ and return to Step 2.
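The same invented example pushed through the policy improvement loop; this is a sketch of the steps above, not code from the slides.

```python
import numpy as np

alpha = 0.9
f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
actions, n = sorted(f), 2

# Step 1: myopic start, a_0(i) = argmin_k f_k(i)
a = np.array([min(actions, key=lambda k: f[k][i]) for i in range(n)])

while True:
    # Step 2: cost vector and transition matrix of the current policy
    f_a = np.array([f[a[i]][i] for i in range(n)])
    P_a = np.array([P[a[i]][i] for i in range(n)])
    # Step 3: value of the current policy, v = (I - alpha P_a)^{-1} f_a
    v = np.linalg.solve(np.eye(n) - alpha * P_a, f_a)
    # Step 4: re-optimize against v
    q = np.array([f[k] + alpha * P[k] @ v for k in actions])
    a_next = q.argmin(axis=0)
    # Step 5: stop when the policy no longer changes
    if np.array_equal(a_next, a):
        break
    a = a_next

print(a, v)
```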

Linear Programming for Discounted Costs
Consider the linear program:
$$\begin{aligned} \max\ & \sum_{i \in E} u(i) \\ \text{s.t. } & u(i) \le f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, u(j), \quad i \in E,\ k \in A. \end{aligned}$$
The optimal value of $u(i)$ will be $v_\alpha(i)$, and the optimal policy is identified by the constraints that hold as equalities in the optimal solution (slack variables equal 0). Note: the decision variables are unrestricted in sign!
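One way to set this LP up with an off-the-shelf solver is sketched below, using scipy.optimize.linprog on the same invented data; linprog minimizes, so the objective is negated, and the default nonnegativity bounds are opened up because $u$ is unrestricted in sign.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
n = 2

# max sum_i u(i)  s.t.  u(i) - alpha * sum_j P_k(i,j) u(j) <= f_k(i)
A_ub, b_ub = [], []
for k in sorted(f):
    for i in range(n):
        row = -alpha * P[k][i].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(f[k][i])

res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)        # u is free in sign
u = res.x                                       # u(i) = v_alpha(i)
# The optimal action in state i is any k whose constraint holds with equality.
slack = np.array(b_ub) - np.array(A_ub) @ u
print(u, slack.round(6))
```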

Long Run Average Cost per Period
For a given stationary policy $d$, its long run average cost can be found from its cost vector $f_d$ and one-step transition probability matrix $P_d$. First, find the limiting probabilities $\pi$ by solving
$$\pi P_d = \pi, \qquad \sum_{j \in E} \pi(j) = 1.$$
Then
$$\varphi_d = \sum_{j \in E} \pi(j)\, f_d(j) = \pi f_d.$$
So, in principle, we could simply enumerate all policies and choose the one with the smallest average cost, but this is not practical if $A$ and $E$ are large.
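For a single policy these are two small linear-algebra steps. A sketch on made-up numbers (for this chain the limiting probabilities are $(0.6, 0.4)$ and the average cost is $1.4$):

```python
import numpy as np

# Hypothetical fixed policy d: its cost vector and transition matrix.
f_d = np.array([2.0, 0.5])
P_d = np.array([[0.8, 0.2],
                [0.3, 0.7]])
n = len(f_d)

# Solve pi P_d = pi together with sum_j pi(j) = 1 (overdetermined but consistent).
A = np.vstack([P_d.T - np.eye(n), np.ones((1, n))])
b = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

avg_cost = pi @ f_d          # phi_d = sum_j pi(j) f_d(j)
print(pi, avg_cost)          # approx [0.6 0.4] and 1.4
```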

Recursive Equation for Average Cost
Assume that every stationary policy yields an irreducible Markov chain. There exists a scalar $\varphi^*$ and a vector $h$ such that for all states $i \in E$,
$$\varphi^* + h(i) = \min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}.$$
The scalar $\varphi^*$ is the optimal average cost, and the optimal policy is found by choosing for each state the action that achieves the minimum on the right-hand side. The vector $h$ is unique up to an additive constant; as we will see, the difference $h(i) - h(j)$ represents the increase in total cost from starting out in state $i$ rather than $j$.
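For one fixed policy (so the min on the right-hand side disappears), $\varphi$ and $h$ can be found from a single linear system once $h$ is pinned to 0 at a reference state; this is the evaluation step that the policy improvement algorithm two slides below reuses. A sketch on the same invented data, with state 0 as the reference state:

```python
import numpy as np

f_d = np.array([2.0, 0.5])               # cost vector of a fixed policy
P_d = np.array([[0.8, 0.2],
                [0.3, 0.7]])
n = len(f_d)

# Fixed-policy version of the equation (no min):
#   phi + h(i) = f_d(i) + sum_j P_d(i,j) h(j),  with h(0) = 0.
# Unknowns: [phi, h(1), ..., h(n-1)].
H = np.eye(n) - P_d
A = np.hstack([np.ones((n, 1)), H[:, 1:]])
sol = np.linalg.solve(A, f_d)

phi = sol[0]
h = np.concatenate([[0.0], sol[1:]])
print(phi, h)        # phi matches the limiting-probability answer (1.4)
```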

Relationships between Discounted Cost and Long Run Average Cost
If a cost of $c$ is incurred each period and $\alpha$ is the discount factor, then the total discounted cost is
$$\sum_{n=0}^{\infty} \alpha^{n} c = \frac{c}{1 - \alpha}.$$
Therefore, a total discounted cost $v$ is equivalent to an average cost of $c = (1 - \alpha)v$ per period, so
$$\varphi^* \approx (1 - \alpha)\, v_\alpha(i) \quad \text{as } \alpha \to 1.$$
Let $v_\alpha$ be the optimal discounted cost vector, $\varphi^*$ be the optimal average cost, and $h$ be the mystery vector from the previous slide. Then, for $\alpha$ close to 1,
$$v_\alpha(i) \approx \frac{\varphi^*}{1 - \alpha} + h(i).$$

Policy Improvement for Average Costs
Designate one state in $E$ to be "state number 1".
Step 1. Set $n = 0$ and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.
Step 2. Adopt the cost vector and transition matrix of the current policy: $f(i) = f_{a_n(i)}(i)$ and $P(i, j) = P_{a_n(i)}(i, j)$.
Step 3. With $h(1) = 0$, solve
$$\varphi + h(i) = f(i) + \sum_{j \in E} P(i, j)\, h(j), \quad i \in E.$$
Step 4. Re-optimize:
$$a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}.$$
Step 5. If $a_{n+1}(i) = a_n(i)$ for every $i$, then stop with $\varphi^* = \varphi$ and $a^*(i) = a_n(i)$. Otherwise, set $n = n + 1$ and return to Step 2.
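Putting the five steps together on the invented example (ties in the arg min are broken by lowest action index, which is enough for a small sketch):

```python
import numpy as np

f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
actions, n = sorted(f), 2

def evaluate(a):
    """Step 3: with h(0) = 0, solve phi + h(i) = f_a(i) + sum_j P_a(i,j) h(j)."""
    f_a = np.array([f[a[i]][i] for i in range(n)])
    P_a = np.array([P[a[i]][i] for i in range(n)])
    H = np.eye(n) - P_a
    sol = np.linalg.solve(np.hstack([np.ones((n, 1)), H[:, 1:]]), f_a)
    return sol[0], np.concatenate([[0.0], sol[1:]])     # phi, h

a = np.array([min(actions, key=lambda k: f[k][i]) for i in range(n)])  # Step 1
while True:
    phi, h = evaluate(a)                                 # Steps 2-3
    q = np.array([f[k] + P[k] @ h for k in actions])     # Step 4: re-optimize
    a_next = q.argmin(axis=0)
    if np.array_equal(a_next, a):                        # Step 5
        break
    a = a_next

print(a, phi)
```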

Linear Programming for Average Costs
Consider randomized policies: let $w_i(k) = P\{D_n = k \mid X_n = i\}$. A stationary policy has $w_i(k) = 1$ for $k = a(i)$ and 0 otherwise. The decision variables are $x(i, k) = w_i(k)\, \pi(i)$. The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):
$$\begin{aligned} \min\ & \sum_{i \in E} \sum_{k \in A} f_k(i)\, x(i, k) \\ \text{s.t. } & \sum_{k \in A} x(j, k) = \sum_{i \in E} \sum_{k \in A} x(i, k)\, P_k(i, j), \quad j \in E, \\ & \sum_{i \in E} \sum_{k \in A} x(i, k) = 1, \\ & x(i, k) \ge 0, \quad i \in E,\ k \in A. \end{aligned}$$
Note that one of the balance constraints will be redundant and may be dropped.
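A sketch of this LP with scipy.optimize.linprog on the same invented data; the balance constraint for one state is dropped as redundant, and an optimal stationary policy is read off from which $x(i, k)$ carries the probability mass in each state.

```python
import numpy as np
from scipy.optimize import linprog

f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
actions, n = sorted(f), 2
n_act = len(actions)

def idx(i, k):
    """Position of the decision variable x(i, k) in the flattened vector."""
    return i * n_act + k

c = np.array([f[k][i] for i in range(n) for k in actions])  # sum_{i,k} f_k(i) x(i,k)

A_eq, b_eq = [], []
for j in range(1, n):            # balance constraints; the one for j = 0 is redundant
    row = np.zeros(n * n_act)
    for k in actions:
        row[idx(j, k)] += 1.0                 # sum_k x(j, k)
        for i in range(n):
            row[idx(i, k)] -= P[k][i, j]      # - sum_i sum_k x(i, k) P_k(i, j)
    A_eq.append(row)
    b_eq.append(0.0)
A_eq.append(np.ones(n * n_act))               # total probability mass is 1
b_eq.append(1.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq))  # x >= 0 by default
x = res.x.reshape(n, n_act)
policy = x.argmax(axis=1)        # the action carrying the mass in each state
print(res.fun, policy)           # optimal average cost and an optimal stationary policy
```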