1 Introduction to MDP
Speaker: Xu Jia-Hao
Adviser: Ke Kai-Wei

2 Outline
Simple Markov Process
Markov Process with Reward
Introduction of Alternatives
The Policy-Iteration Method for the Solution of Sequential Decision Processes
Conclusion
Reference

4 Simple Markov Process (1)
Definition: The probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i.
Example: a frog in a lily pond.

5 Simple Markov Process (2)
We may specify a set of conditional transition probabilities p_ij that a system which now occupies state i will occupy state j after its next transition.

6 Toymaker Example (1)
The toymaker's business has two states (a discrete-time Markov process):
State 1: the toy he is currently producing has found great favor with the public.
State 2: his toy is out of favor.
Define the state probability pi_i(n): the probability that the system will occupy state i after n transitions if its state at n = 0 is known.

7 Toymaker Example (2)
Transition matrix P and transition diagram for states 1 and 2 (figures not reproduced in this transcript).
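
The matrix itself is missing from the transcript, so as an illustrative sketch the Python fragment below assumes the transition probabilities usually quoted for Howard's toymaker example (p11 = 0.5, p12 = 0.5, p21 = 0.4, p22 = 0.6); the helper name state_probabilities is ours, not the slide's. It computes the state probabilities after n transitions by repeated multiplication with the transition matrix.

import numpy as np

# Transition matrix assumed from Howard's original toymaker example:
# state 1 = toy in favor with the public, state 2 = toy out of favor.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

def state_probabilities(pi0, P, n):
    # pi(n) = pi(n-1) P, i.e. pi(n) = pi(0) P^n
    pi = np.asarray(pi0, dtype=float)
    for _ in range(n):
        pi = pi @ P
    return pi

# Start in state 1 with certainty and watch the probabilities settle down.
for n in (0, 1, 2, 5, 10):
    print(n, state_probabilities([1.0, 0.0], P, n))

With these assumed numbers the printed rows approach the limiting probabilities 4/9 and 5/9 regardless of the starting state.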

9 Markov Processes with Rewards
Definition: Suppose that an N-state Markov process earns r_ij "dollars" when it makes a transition from state i to state j. We call r_ij (which may be negative) the "reward" associated with the transition from i to j. The r_ij form the reward matrix R.
Example: the frog in the lily pond.

10 Definitions
v_i(n): the expected total earnings in the next n transitions if the system is now in state i.
q_i: the reward to be expected in the next transition out of state i; it will be called the expected immediate reward for state i.

11 Formula for v_i(n):
v_i(n) = q_i + sum_j p_ij v_j(n-1),  n = 1, 2, 3, ...,
where q_i = sum_j p_ij r_ij is the expected immediate reward for state i and v_i(0) = 0.

12 Toymaker with Reward Example (1)
Reward matrix R and transition matrix P (not reproduced in this transcript). From them we can find the expected immediate rewards q_i.

13 Toymaker with Reward Example (2)

14 Toymaker with Reward Example (3)
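
The numerical tables behind slides 12-14 are not reproduced here. As a sketch of the computation they describe, the fragment below again assumes the reward matrix usually quoted for Howard's toymaker (r11 = 9, r12 = 3, r21 = 3, r22 = -7) together with the transition matrix above, and evaluates the recurrence v_i(n) = q_i + sum_j p_ij v_j(n-1).

import numpy as np

# Probabilities and rewards assumed from Howard's original toymaker example.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
R = np.array([[9.0, 3.0],
              [3.0, -7.0]])

# Expected immediate reward per state: q_i = sum_j p_ij * r_ij  ->  [6, -3]
q = (P * R).sum(axis=1)

def expected_total_earnings(n, P, q):
    # v_i(n) = q_i + sum_j p_ij * v_j(n-1), with v_i(0) = 0
    v = np.zeros(len(q))
    for _ in range(n):
        v = q + P @ v
    return v

for n in range(5):
    print(n, expected_total_earnings(n, P, q))

With these assumed values q = (6, -3), and the first few iterations give v(1) = (6, -3), v(2) = (7.5, -2.4), v(3) = (8.55, -1.44), which agree with the table in Howard's book.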

16 Introduction of Alternatives
The concept of an alternative for an N-state system: in each state the decision maker may choose among several courses of action ("alternatives"), each with its own transition probabilities and rewards.

17 The Constraint on Alternatives
The number of alternatives in any state must be finite, but the number of alternatives may differ from state to state.

18 Concepts of Decision (1)
d_i(n): the number of the alternative in the ith state that will be used at stage n. We call d_i(n) the "decision" in state i at the nth stage.
When d_i(n) has been specified for all i and all n, a "policy" has been determined. The optimal policy is the one that maximizes the total expected return for each i and n.

19 Concepts of Decision (2)
Here we redefine v_i(n) as the total expected return in n stages starting from state i if an optimal policy is followed, so that
v_i(n) = max_k [ q_i^k + sum_j p_ij^k v_j(n-1) ],
where k ranges over the alternatives available in state i.

20 The Toymaker's Problem Solved by Value Iteration (1)

21 The Toymaker's Problem Solved by Value Iteration (2)

22 Value Iteration
The method that has just been described for the solution of the sequential process may be called the value-iteration method because the v_i(n), or "values", are determined iteratively.
Even if we were patient enough to solve a long-duration process by value iteration, the convergence on the best alternative in each state is asymptotic and difficult to measure analytically.
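
The tables on slides 20-21 are not reproduced, so the sketch below shows one way to carry out the computation, assuming the two alternatives per state given in Howard's book for the toymaker ("no advertising"/"advertising" in state 1, "no research"/"research" in state 2) with the probability rows and expected immediate rewards listed in the comments. Names and structure are illustrative, not the slide's own.

import numpy as np

# Two alternatives per state, assumed from Howard's toymaker example.
# Each entry is (transition-probability row p_i^k, expected immediate reward q_i^k).
alternatives = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],    # state 1: no advertising / advertising
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],  # state 2: no research / research
]

def value_iteration(n_stages):
    # v_i(n) = max_k [ q_i^k + sum_j p_ij^k v_j(n-1) ], with v_i(0) = 0
    v = np.zeros(len(alternatives))
    for n in range(1, n_stages + 1):
        new_v = np.zeros_like(v)
        decisions = []
        for i, alts in enumerate(alternatives):
            returns = [q + p @ v for (p, q) in alts]
            decisions.append(int(np.argmax(returns)) + 1)  # 1-based alternative number
            new_v[i] = max(returns)
        v = new_v
        print(f"n={n}  v={v}  decisions={decisions}")
    return v

value_iteration(4)

After a stage or two the decisions settle on alternative 2 in both states, but, as the slide says, this convergence is asymptotic and offers no analytic guarantee, which motivates the policy-iteration method of the next section.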

24 Policy
(Figure not reproduced: a five-state example with 4, 3, 2, 1, and 5 alternatives in the respective states.)

25 Decision and Policy
The alternative thus selected is called the "decision" for that state; it is no longer a function of n.
The set of decisions for all the states is called a "policy".
Selection of a policy thus determines the Markov process with rewards that will describe the operation of the system.

26 Problems of Finding Optimal Policies
In the previous figure there are 4 x 3 x 2 x 1 x 5 = 120 different policies. Enumerating them is still conceivable here, but it becomes infeasible for very large problems. For example, a problem with 50 states and 50 alternatives in each state contains 50^50 (about 10^85) possible policies.
Solution: the policy-iteration method, composed of two parts, the value-determination operation and the policy-improvement routine.

27 Formulas
The value-determination operation: for the current policy, solve
g + v_i = q_i + sum_j p_ij v_j,  i = 1, 2, ..., N,
for the gain g and the relative values v_i (setting v_N = 0).
The policy-improvement routine: for each state i, find the alternative k that maximizes
q_i^k + sum_j p_ij^k v_j
using the relative values of the previous policy; that k becomes the new decision for state i.

28 The Iteration Cycle
Starting from an arbitrary policy, alternate between the value-determination operation and the policy-improvement routine; when the policy no longer changes, it is optimal and the cycle stops.
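
The cycle diagram is not reproduced. The fragment below is one possible implementation of the cycle for the toymaker data assumed earlier: value_determination solves the linear system g + v_i = q_i + sum_j p_ij v_j with v_N = 0, and policy_improvement picks, in each state, the alternative maximizing q_i^k + sum_j p_ij^k v_j. The function names are illustrative.

import numpy as np

# Same assumed toymaker data as the value-iteration sketch above.
alternatives = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],
]
N = len(alternatives)

def value_determination(policy):
    # Solve g + v_i = q_i + sum_j p_ij v_j for i = 1..N with v_N fixed at 0;
    # the unknowns are [g, v_1, ..., v_{N-1}].
    A = np.zeros((N, N))
    b = np.zeros(N)
    for i in range(N):
        p, q = alternatives[i][policy[i]]
        A[i, 0] = 1.0                                  # coefficient of g
        for j in range(N - 1):
            A[i, j + 1] = (1.0 if i == j else 0.0) - p[j]
        b[i] = q
    x = np.linalg.solve(A, b)
    return x[0], np.append(x[1:], 0.0)                 # gain g, relative values v

def policy_improvement(v):
    # In each state pick the alternative maximizing q_i^k + sum_j p_ij^k v_j.
    return [int(np.argmax([q + p @ v for (p, q) in alts])) for alts in alternatives]

policy = [0, 0]                                        # start with alternative 1 everywhere
while True:
    g, v = value_determination(policy)
    print("policy", [d + 1 for d in policy], " gain", g, " relative values", v)
    new_policy = policy_improvement(v)
    if new_policy == policy:                           # policy repeats -> optimal, stop
        break
    policy = new_policy

With these assumed numbers the cycle stops after two passes at the policy that uses alternative 2 in both states, with a gain of 2 per transition versus 1 for the starting policy, which is the result reported in Howard's book.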

29 The Toymaker's Problem
Two alternatives in each state: alternative 1 and alternative 2.
Four possible policies (the 2 x 2 combinations of decisions).

30 Result

32 Conclusion
In a Markov decision process, we can use the iteration cycle to select an optimal policy in order to earn the maximum expected reward.

34 Reference
Ronald A. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960 (an introduction to Markov decision processes).

35 z-Transform
For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform.
The relationship between a time function f(n) and its transform F(z) is unique; each time function has only one transform, and the inverse transformation of the transform produces once more the original time function.
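
A short worked example, assuming the generating-function convention the slide alludes to (and which Howard's book uses), F(z) = sum_{n=0..inf} f(n) z^n:
the unit step f(n) = 1 has F(z) = 1 + z + z^2 + ... = 1 / (1 - z), and the geometric sequence f(n) = a^n has F(z) = sum_n (a z)^n = 1 / (1 - a z) for |a z| < 1.
Expanding 1 / (1 - a z) back into a power series recovers a^n term by term, illustrating the one-to-one correspondence described above.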