Markov Decision Problems


Markov Decision Problems
Date: 2003-8-31

The purpose of this week is to introduce the basic concepts of Markov Decision Problems (MDPs). As we will see in next week's module, there are relationships between MDPs and PA. The reference for this week is: Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., 1994.

The Multistage System
Figure 2.2.1 Flow chart for a multistage system (Applied Optimal Control: Optimization, Estimation, and Control, Arthur E. Bryson, Jr. and Yu-Chi Ho, Hemisphere Publishing Corporation, 1975, p. 44).

First recall the multistage system introduced last week; it is the basis for understanding decision problems. A multistage system has three elements: the state, the input, and the transition function. In the figure above, x represents the state, u the input, and f the transition function. In general, the state, input, and transition function vary with the stage. After defining a cost function based on the state and input, an optimization problem is formulated. Usually the control of interest is the one that meets the constraints with the least cost; such a control is called the optimal control.

Recall the big picture in the last slide of last week. One special application of optimal control is decision problems: u is the input, chosen by a person according to some policy. Since cost(x, u) is not known before the action is taken, this is also an optimal control problem. Furthermore, if the state transition depends only on the current state and the input, without remembering the earlier history, it is a Markov Decision Problem. More generally, if the lifetime of a state has a non-exponential distribution, it is a Semi-Markov Decision Problem.
Copyright by Yu-Chi Ho
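To make the three elements concrete, here is a minimal sketch in Python, assuming a deterministic multistage system; the particular dynamics f, the cost function, the horizon, and the controls are illustrative placeholders, not taken from Bryson & Ho's figure.

```python
# A minimal sketch of a multistage system: state x, input u, stage-indexed
# transition f and cost. The dynamics, cost, and controls below are
# illustrative placeholders.

def f(t, x, u):
    """Transition function: next state from stage t, current state, and input."""
    return x + u

def cost(t, x, u):
    """Stage cost based on the current state and input."""
    return x ** 2 + u ** 2

def rollout(x0, controls):
    """Apply a sequence of inputs and accumulate the total cost."""
    x, total = x0, 0.0
    for t, u in enumerate(controls):
        total += cost(t, x, u)
        x = f(t, x, u)
    return x, total

final_state, total_cost = rollout(x0=1.0, controls=[-0.5, -0.25, 0.0])
print(final_state, total_cost)
```

A decision problem adds a chooser of u (a policy) and uncertainty in f, which leads to the MDP formulation on the following slides.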

The Content of MDP
- Providing conditions under which there exist easily implementable optimal policies;
- Determining how to recognize these policies;
- Developing and enhancing algorithms for computing them;
- Establishing the convergence of these algorithms.
(Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., 1994, Preface, p. xv.)

Markov Decision Problems are models for sequential decision making with uncertain outcomes. The stage of the multistage system becomes the decision epoch, and the input becomes the decision. The state transition function determines the next state based on the current state, the input, and the transition probabilities. A more rigorous formulation is given on the next slide. The four points above summarize the main topics studied in MDP theory; in this week's module we only touch on the third point.
Copyright by Yu-Chi Ho

Problem Formulation
- T, decision epochs and periods
- S, states
- A_s, action sets
- p_t(·|s,a), transition probabilities
- r_t(s,a), rewards
- Decision rules
- Policies

We use the notation in Puterman's book to illustrate the basic concepts of an MDP. A Markov Decision Problem can be described by a collection of five objects:

The decision epochs and periods. They are similar to the stages in a multistage system. They can be discrete or continuous, deterministic or stochastic, with exponential or non-exponential lifetimes.

The states. They are similar to the states in a multistage system. In an MDP, the current state contains all the information needed by the state transition to determine the next state, possibly probabilistically.

The action sets. They are similar to the feasible event sets in a GSMP (recall week 3) and, by comparison with the multistage system, to the feasible input sets. In an MDP the action set is a function of the current state, since some actions cannot be taken in particular states; e.g., a light cannot be turned off if it is already off. The decision must be selected from this set.

The transition probabilities. To generalize the deterministic case, transition probabilities are introduced. Thus even when the action at a state has been decided, the next state generally cannot be determined; only its distribution is known. The transition probability is a function of the current state, the action, and the time. In the special case where it is independent of time, it is called stationary.

The rewards. They are similar to the cost function in the multistage system, but more general: a reward can be positive for income and negative for cost. For the optimization problem, an optimality criterion must be introduced. Most optimality criteria are based on the rewards, though they take different forms; details are given in the classification of MDPs on a following slide.

Besides the five objects introduced above, two other concepts are important:

Decision rules. A decision rule determines how to choose the decision based on the current state, input, transition probabilities, and optimality criterion. There are many kinds of decision rules; Puterman's book classifies them as history-dependent and randomized, history-dependent and deterministic, Markovian and randomized, or Markovian and deterministic.

Policies. Intuitively, a policy can be seen as a sequence of decision rules, one for each decision epoch.
Copyright by Yu-Chi Ho
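As a concrete illustration, here is a minimal sketch of these objects as plain Python data, assuming a hypothetical two-state light (on/off) example; the particular states, actions, probabilities, and rewards are illustrative assumptions, not taken from Puterman's book.

```python
# The five MDP objects as plain Python data, following the notation
# T, S, A_s, p_t(.|s,a), r_t(s,a). The two-state "light" example is a
# hypothetical illustration.

T = range(3)                       # decision epochs 0, 1, 2 (finite horizon)
S = ["on", "off"]                  # state set
A = {"on": ["keep", "switch"],     # action sets depend on the state:
     "off": ["keep", "switch"]}    # only feasible actions are listed

# p[(s, a)] maps a next state j to its probability p(j | s, a)
p = {("on",  "keep"):   {"on": 0.9, "off": 0.1},
     ("on",  "switch"): {"off": 1.0},
     ("off", "keep"):   {"off": 1.0},
     ("off", "switch"): {"on": 1.0}}

# r[(s, a)] is a stationary one-step reward (positive income, negative cost)
r = {("on",  "keep"): 1.0, ("on",  "switch"): 0.0,
     ("off", "keep"): 0.0, ("off", "switch"): -0.1}

# A Markovian deterministic decision rule maps the current state to an action;
# a policy is a sequence of such rules, one per decision epoch.
rule = {"on": "keep", "off": "switch"}
policy = [rule for _ in T]
print(policy)
```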

Classification of MDP
- Horizon: finite & infinite
- Criteria: Expected Total Reward; Expected Total Discounted Reward; Average Reward and related criteria
- Time: discrete & continuous

We can classify MDPs along several dimensions. The following are some examples:

Horizon. It is similar to the number of stages in a multistage system and can be finite or infinite. The finite-horizon MDP is the simpler one, and many practical systems can be formulated with this model; its optimality criterion is based on the whole horizon. The infinite-horizon MDP corresponds to infinitely many stages, so its optimality criterion is based on the average cost or on the whole horizon.

Criteria. Although most optimality criteria are based on the reward at each stage, different criteria serve different purposes. The expected total reward criterion can be used for both finite- and infinite-horizon MDPs; in the infinite case, the evaluation of the formula must be adjusted to deal with divergence. Its purpose is to consider the overall performance of the process. The expected total discounted reward criterion is used in the infinite-horizon case. Its purpose is to capture the basic idea from economics that one unit of money tomorrow is worth only λ units today, where 0 < λ < 1. The average reward and related criteria consider the average performance of the system.

Time. In the discrete case, decisions can be made only at deterministic time points, usually after transitions. In the continuous case, decisions can be made at any time point after a transition, and transitions can occur over continuous time periods.
Copyright by Yu-Chi Ho
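To see how the criteria differ numerically, here is a small sketch that evaluates all three on one sample reward stream; the stream, the discount factor lam, and the finite horizon are illustrative assumptions, and in the infinite-horizon case the sums become limits.

```python
# The three criteria evaluated on one sample reward stream r_0, r_1, ...
# Stream, discount factor, and horizon are illustrative assumptions.

rewards = [1.0, 0.0, 2.0, 1.0, 1.0]   # per-stage rewards over a finite horizon
lam = 0.9                              # discount factor, 0 < lam < 1

total = sum(rewards)                                             # expected total reward
discounted = sum(lam ** t * rt for t, rt in enumerate(rewards))  # total discounted reward
average = total / len(rewards)                                   # average reward

print(total, discounted, average)
```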

Optimality Equations and the Principle of Optimality
The Principle of Optimality: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957, p. 83.

The optimality equation shown on the slide is from Puterman's book, p. 83. It is the optimality equation for the finite-horizon MDP; the optimality equations for the other cases are similar. The h here represents the history up to the current state. Recall the iterative equation for u in Dynamic Programming: they are similar in form. Here we want to maximize the reward function; in general we use the supremum rather than the maximum. The equation embodies the basic idea used to solve it, backward induction: if the decision at epoch t+1 is given, the decision at epoch t can be determined from this equation. This is also the basic idea used in Dynamic Programming. Recall the principle of optimality in Dynamic Programming; it is one of the most important principles in DP.
Copyright by Yu-Chi Ho
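Since the equation itself did not survive the transcript, here is a minimal backward-induction sketch in Python that implements the same idea. It uses Markovian value functions v_t(s) rather than the history-dependent u_t(h) referred to on the slide (for a Markov model they yield the same optimal values), and the two-state example data and horizon N are illustrative assumptions.

```python
# Backward induction for a finite-horizon MDP: the optimality equation
# solved backwards in time on a hypothetical two-state example.

S = ["on", "off"]
A = {"on": ["keep", "switch"], "off": ["keep", "switch"]}
p = {("on", "keep"): {"on": 0.9, "off": 0.1}, ("on", "switch"): {"off": 1.0},
     ("off", "keep"): {"off": 1.0}, ("off", "switch"): {"on": 1.0}}
r = {("on", "keep"): 1.0, ("on", "switch"): 0.0,
     ("off", "keep"): 0.0, ("off", "switch"): -0.1}
N = 3  # number of decision epochs

v = {s: 0.0 for s in S}     # terminal values v_N(s)
policy = []                 # one decision rule per epoch, filled backwards
for t in reversed(range(N)):
    # optimality equation: v_t(s) = max_a [ r(s,a) + sum_j p(j|s,a) * v_{t+1}(j) ]
    q = {s: {a: r[(s, a)] + sum(prob * v[j] for j, prob in p[(s, a)].items())
             for a in A[s]}
         for s in S}
    rule = {s: max(q[s], key=q[s].get) for s in S}   # best action at epoch t
    v = {s: q[s][rule[s]] for s in S}                # v_t from the best actions
    policy.insert(0, rule)

print(v)        # optimal expected total reward from each initial state
print(policy)   # optimal Markovian deterministic policy, one rule per epoch
```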

The Algorithms
- Finite horizon: Dynamic Programming
- Infinite horizon: Value Iteration; Policy Iteration; Action Elimination; etc.

We now discuss how to solve the optimality equations. For finite-horizon MDPs, as shown on the last slide, DP is a useful and basic approach.

For infinite-horizon MDPs, we generally start from the optimality equation for the discounted MDP (Puterman's book, p. 159); the λ here is the discount factor. In many cases an optimal deterministic stationary Markovian policy can be found, so we only discuss this case. The determination of v can then be viewed as an iterative process, which is the basic idea of the Value Iteration algorithm (a sketch of these steps follows this slide):
- Start with an initial v.
- Use the optimality equation to obtain the iterated v.
- If the difference between the new v and the previous one is small enough, the iteration stops; otherwise, it continues.
There are theorems on the convergence of this algorithm; see Puterman's book, Section 6.3.1.

The basic idea of Policy Iteration is:
- Start with an initial policy.
- Each iteration finds the best policy so far, maximizing the reward.
- The iteration continues until the policies of two successive iterations are the same.
This algorithm requires that the action sets be the same for every state.

The difference between Value Iteration and Policy Iteration: the former tries to find the optimum of the reward function directly and requires little structure from the MDP, so it can be applied in many cases; the latter applies only to stationary infinite-horizon problems.

Action Elimination combines Value Iteration and Policy Iteration. The basic idea is that, as the iteration proceeds, the action sets may change, and sometimes we know that certain actions in an action set lead to less reward than the best found so far. This information can be obtained through Value Iteration and then used to reduce the search space for Policy Iteration by simply eliminating the worse actions. In many cases the combination of Value Iteration and Policy Iteration leads to faster convergence. There are also many other extensions for special cases; see Puterman's book, Sections 6.5 and 6.6.

We have focused on the finite-horizon MDP and the discounted MDP to discuss the algorithms. In the other cases, the basic idea is first to change the optimality equation to the proper form and then adapt the iteration algorithms accordingly; see Puterman's book, Chapter 8. Continuous-time MDPs are an extension of discrete-time MDPs; the model, including the rewards, decision rules, and policies, can be defined similarly to the discrete case.
Copyright by Yu-Chi Ho
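As a concrete illustration of the three Value Iteration steps above, here is a minimal sketch for a discounted infinite-horizon MDP, reusing the hypothetical two-state light example; the data, discount factor lam, and stopping tolerance eps are illustrative assumptions, not parameters from Puterman's book.

```python
# Value iteration for a discounted infinite-horizon MDP: a minimal sketch.
# Example data, discount factor, and tolerance are illustrative assumptions.

S = ["on", "off"]
A = {"on": ["keep", "switch"], "off": ["keep", "switch"]}
p = {("on", "keep"): {"on": 0.9, "off": 0.1}, ("on", "switch"): {"off": 1.0},
     ("off", "keep"): {"off": 1.0}, ("off", "switch"): {"on": 1.0}}
r = {("on", "keep"): 1.0, ("on", "switch"): 0.0,
     ("off", "keep"): 0.0, ("off", "switch"): -0.1}
lam, eps = 0.9, 1e-6   # discount factor and stopping tolerance

def backup(s, a, v):
    """One-step lookahead: r(s,a) + lam * sum_j p(j|s,a) * v(j)."""
    return r[(s, a)] + lam * sum(prob * v[j] for j, prob in p[(s, a)].items())

v = {s: 0.0 for s in S}                                        # 1. an initial v
while True:
    new_v = {s: max(backup(s, a, v) for a in A[s]) for s in S} # 2. apply the optimality equation
    if max(abs(new_v[s] - v[s]) for s in S) < eps:             # 3. stop when the change is small
        v = new_v
        break
    v = new_v

# Greedy stationary deterministic Markovian policy from the final v
policy = {s: max(A[s], key=lambda a: backup(s, a, v)) for s in S}
print(v, policy)
```

Policy Iteration would instead alternate between evaluating the current policy's value and choosing the greedy actions with respect to it, stopping when the policy no longer changes.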