Markov Decision Processes: A Survey II
Cheng-Ta Lee
March 27, 2006
Outline: Introduction, Markov Theory, Markov Decision Processes, Conclusion, Future Work in Sensor Networks
Introduction: Decision Theory
Probability Theory + Utility Theory = Decision Theory
Probability theory describes what an agent should believe based on evidence.
Utility theory describes what an agent wants.
Decision theory describes what an agent should do.
Introduction
The theory of Markov decision processes (MDPs) has developed substantially over the last three decades and has become an established topic within operational research.
MDPs model an (infinite) sequence of recurring decision problems (general behavioral strategies).
An MDP is defined by objective functions (utility function, revenue, cost) and by policies (sets of decisions), which may be dynamic (as in MDPs) or static.
Outline: Introduction, Markov Theory, Markov Decision Processes, Conclusion, Future Work in Sensor Networks
Markov Theory
A Markov process is a mathematical model that is useful in the study of complex systems. Its basic concepts are the "state" of a system and state "transitions". A graphic example of a Markov process is a frog in a lily pond, hopping from pad to pad.
A state-transition system may evolve as a discrete-time process or a continuous-time process.
Markov Theory
To study the discrete-time process, suppose that there are N states in the system, numbered from 1 to N. If the system is a simple Markov process, then the probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i (the memoryless property). In other words, we may specify a set of conditional probabilities $p_{ij}$, where $p_{ij} \ge 0$ and $\sum_{j=1}^{N} p_{ij} = 1$ for every i.
The Toymaker Example
First state: the toy is in great favor. Second state: the toy is out of favor.
In matrix form, the transition probabilities (recoverable from Tables 1.1 and 1.2 below) are
$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}$,
which can also be shown as a transition diagram.
The Toymaker Example
Let $\pi_i(n)$ denote the probability that the system will occupy state i after n transitions, given that its state at n = 0 is known. It follows that
$\pi_j(n+1) = \sum_{i=1}^{N} \pi_i(n)\, p_{ij}$, or in vector form $\pi(n+1) = \pi(n)P$, so that $\pi(n) = \pi(0)P^{n}$.
The Toymaker Example
If the toymaker starts with a successful toy, then $\pi(0) = [1\;\;0]$ and $\pi(1) = \pi(0)P = [0.5\;\;0.5]$, so that $\pi(2) = [0.45\;\;0.55]$, and so on.
The Toymaker Example
Table 1.1. Successive State Probabilities of Toymaker Starting with a Successful Toy
n:         0     1      2       3        4         5        ...
pi_1(n):   1     0.5    0.45    0.445    0.4445    0.44445  ...
pi_2(n):   0     0.5    0.55    0.555    0.5555    0.55555  ...
Table 1.2. Successive State Probabilities of Toymaker Starting without a Successful Toy
n:         0     1      2       3        4         5        ...
pi_1(n):   0     0.4    0.44    0.444    0.4444    0.44444  ...
pi_2(n):   1     0.6    0.56    0.556    0.5556    0.55556  ...
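A minimal numerical check of these tables (a sketch; the transition matrix is the one implied by Tables 1.1 and 1.2):

```python
import numpy as np

# Toymaker transition matrix implied by Tables 1.1 and 1.2
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

def state_probabilities(pi0, P, n_steps):
    """Return pi(n) = pi(0) P^n for n = 0..n_steps."""
    pi = np.array(pi0, dtype=float)
    out = [pi]
    for _ in range(n_steps):
        pi = pi @ P
        out.append(pi)
    return np.array(out)

print(state_probabilities([1, 0], P, 5))  # rows approach [0.4444..., 0.5555...]
print(state_probabilities([0, 1], P, 5))  # same limit regardless of the start
```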
The Toymaker Example
The row vector $\pi$ with components $\pi_i$ is thus the limit of $\pi(n)$ as n approaches infinity; it satisfies $\pi = \pi P$ together with $\sum_i \pi_i = 1$.
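Solving these two conditions for the toymaker is a short worked step, consistent with the limits in Tables 1.1 and 1.2:

$$ \pi_1 = 0.5\,\pi_1 + 0.4\,\pi_2, \qquad \pi_1 + \pi_2 = 1 \;\Longrightarrow\; 0.5\,\pi_1 = 0.4\,\pi_2 \;\Longrightarrow\; \pi = \left[\tfrac{4}{9},\ \tfrac{5}{9}\right] \approx [0.4444,\ 0.5556]. $$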
z-Transformation
For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform. Consider a time function f(n) that takes on arbitrary values f(0), f(1), f(2), and so on, at nonnegative, discrete, integrally spaced points of time and that is zero for negative time. Such a time function is shown in Fig. 2.4 (an arbitrary discrete-time function).
z-Transformation
The z-transform F(z) of f(n) is defined by $F(z) = \sum_{n=0}^{\infty} f(n) z^{n}$.
Table 1.3. z-Transform Pairs
Time function (n >= 0)             z-transform
f(n)                               F(z)
f1(n) + f2(n)                      F1(z) + F2(z)
k f(n)  (k a constant)             k F(z)
f(n-1)                             z F(z)
f(n+1)                             z^{-1}[F(z) - f(0)]
1  (unit step)                     1/(1-z)
n  (unit ramp)                     z/(1-z)^2
alpha^n                            1/(1 - alpha z)
z-Transformation
Consider first the step function f(n) = 1 for n >= 0; its z-transform is $F(z) = \sum_{n=0}^{\infty} z^{n} = \frac{1}{1-z}$.
For the geometric sequence $f(n) = \alpha^{n}$, n >= 0, $F(z) = \sum_{n=0}^{\infty} (\alpha z)^{n} = \frac{1}{1-\alpha z}$.
z-Transformation
We shall now use the z-transform to analyze Markov processes. Transforming $\pi(n+1) = \pi(n)P$ gives $z^{-1}[\Pi(z) - \pi(0)] = \Pi(z)P$, so that $\Pi(z) = \pi(0)(I - zP)^{-1}$. In this expression I is the identity matrix.
z-Transformation
Let us investigate the toymaker's problem by z-transformation. Let the matrix H(n) be the inverse transform of $(I - zP)^{-1}$ on an element-by-element basis, so that $\pi(n) = \pi(0)H(n)$.
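For the toymaker's matrix P above, carrying out the inversion gives the following closed form (a worked step: it satisfies H(0) = I and H(1) = P, and reproduces the limits 4/9 and 5/9; the decay factor 0.1 is the second eigenvalue of P):

$$ H(n) = \begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + (0.1)^{n} \begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}, \qquad n \ge 0. $$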
z-Transformation
If the toymaker starts in the successful state 1, then $\pi(0) = [1\;\;0]$ and $\pi(n) = \left[\tfrac{4}{9} + \tfrac{5}{9}(0.1)^{n},\;\; \tfrac{5}{9} - \tfrac{5}{9}(0.1)^{n}\right]$.
If the toymaker starts in the unsuccessful state 2, then $\pi(0) = [0\;\;1]$ and $\pi(n) = \left[\tfrac{4}{9} - \tfrac{4}{9}(0.1)^{n},\;\; \tfrac{5}{9} + \tfrac{4}{9}(0.1)^{n}\right]$.
We have now obtained analytic forms for the data in Tables 1.1 and 1.2.
Laplace Transformation
We shall extend our previous work to the case in which the process may make transitions at random time intervals. The Laplace transform of a time function f(t) which is zero for t < 0 is defined by $F(s) = \int_{0}^{\infty} f(t)\, e^{-st}\, dt$.
Table 2.4. Laplace Transform Pairs
Time function (t >= 0)             Laplace transform
f(t)                               F(s)
f1(t) + f2(t)                      F1(s) + F2(s)
k f(t)  (k a constant)             k F(s)
df(t)/dt                           s F(s) - f(0)
1  (unit step)                     1/s
t  (unit ramp)                     1/s^2
e^{-at} f(t)                       F(s+a)
Laplace Transformation
We shall now use the Laplace transform to analyze Markov processes. For discrete processes we had $\pi(n+1) = \pi(n)P$, or $\Pi(z) = \pi(0)(I - zP)^{-1}$; for the continuous-time process the analogous relations are $\frac{d}{dt}\pi(t) = \pi(t)A$ and $\Pi(s) = \pi(0)(sI - A)^{-1}$, where A is the transition-rate matrix.
Laplace Transformation
Recall the toymaker's initial policy, for which the transition-probability matrix was $P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}$.
Laplace Transformation
Let the matrix H(t) be the inverse transform of $(sI - A)^{-1}$. Then $\Pi(s) = \pi(0)(sI - A)^{-1}$ becomes, by means of inverse transformation, $\pi(t) = \pi(0)H(t)$.
Laplace Transformation
If the toymaker starts in the successful state 1, then $\pi(0) = [1\;\;0]$; if the toymaker starts in the unsuccessful state 2, then $\pi(0) = [0\;\;1]$. In each case $\pi(t) = \pi(0)H(t)$ gives analytic forms analogous to those obtained for Tables 1.1 and 1.2.
Outline: Introduction, Markov Theory, Markov Decision Processes, Conclusion, Future Work in Sensor Networks
Markov Decision Processes
MDPs apply dynamic programming to the solution of a stochastic decision problem with a finite number of states. The transition probabilities between the states are described by a Markov chain. The reward structure of the process is described by a matrix that represents the revenue (or cost) associated with movement from one state to another. Both the transition and revenue matrices depend on the decision alternatives available to the decision maker. The objective of the problem is to determine the optimal policy that maximizes the expected revenue over a finite or infinite number of stages.
Markov Process with Rewards
Suppose that an N-state Markov process earns $r_{ij}$ dollars when it makes a transition from state i to state j. We call $r_{ij}$ the "reward" associated with the transition from i to j. The rewards need not be in dollars; they could be voltage levels, units of production, or any other physical quantity relevant to the problem. Let us define $v_i(n)$ as the expected total earnings in the next n transitions if the system is now in state i.
Markov Process with Rewards
Recurrence relation: $v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1)$, where $q_i = \sum_{j=1}^{N} p_{ij} r_{ij}$ is the expected immediate reward in state i; in vector form, $v(n) = q + P\,v(n-1)$.
The Toymaker Example
Table 3.1. Total Expected Reward for Toymaker as a Function of State and Number of Weeks Remaining
n:        0     1     2      3       4        5        ...
v_1(n):   0     6     7.5    8.55    9.555    10.5555  ...
v_2(n):   0    -3    -2.4   -1.44   -0.444     0.5556  ...
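A small sketch that reproduces Table 3.1 from the recurrence v(n) = q + P v(n-1); the rewards 9, 3, 3, -7 are those of the toymaker's initial policy, listed in the alternatives table later in the deck:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
R = np.array([[9.0, 3.0],      # rewards r_ij for the initial policy
              [3.0, -7.0]])
q = (P * R).sum(axis=1)        # expected immediate reward, q = [6, -3]

v = np.zeros(2)
for n in range(1, 6):
    v = q + P @ v
    print(n, np.round(v, 4))   # matches 6/-3, 7.5/-2.4, 8.55/-1.44, ...
```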
Toymaker's problem: total expected reward in each state as a function of weeks remaining (figure based on Table 3.1).
z-Transform Analysis of the Markov Process with Rewards
The z-transform of the total-value vector v(n) will be called $V(z) = \sum_{n=0}^{\infty} v(n) z^{n}$. Since $v(n) = q + P\,v(n-1)$ with v(0) = 0, transforming gives $V(z) = \frac{z}{1-z}\, q + zP\,V(z)$, so that $V(z) = \frac{z}{1-z}(I - zP)^{-1} q$.
z-Transform Analysis of the Markov Process with Rewards
Let the matrix F(n) be the inverse transform of $\frac{z}{1-z}(I - zP)^{-1}$; the total-value vector is then v(n) = F(n)q by inverse transformation. We see that, as n becomes very large, both $v_1(n)$ and $v_2(n)$ grow with slope 1 and $v_1(n) - v_2(n)$ approaches 10.
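The slope of 1 is the gain of the process, i.e. the expected reward per transition under the limiting state probabilities; a quick check using the quantities already computed:

$$ g = \sum_i \pi_i q_i = \tfrac{4}{9}(6) + \tfrac{5}{9}(-3) = \tfrac{24 - 15}{9} = 1. $$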
Optimization Techniques in General Markov Decision Processes
Value Iteration
Exhaustive Enumeration
Policy Iteration (Policy Improvement)
Linear Programming
Lagrangian Relaxation
Value Iteration
The toymaker's alternatives (decision-tree figure): in the successful state, advertise or not; in the unsuccessful state, do research or not.
Diagram of States and Alternatives (figure).
The Toymaker's Problem Solved by Value Iteration
The quantity $q_i^{k} = \sum_j p_{ij}^{k} r_{ij}^{k}$ is the expected reward from a single transition from state i under alternative k. The alternatives for the toymaker are presented in the following table.
State i                Alternative k         p_i1^k   p_i2^k    r_i1^k   r_i2^k    q_i^k
1 (Successful toy)     1 (No advertising)    0.5      0.5       9        3         6
1 (Successful toy)     2 (Advertising)       0.8      0.2       4        4         4
2 (Unsuccessful toy)   1 (No research)       0.4      0.6       3        -7        -3
2 (Unsuccessful toy)   2 (Research)          0.7      0.3       1        -19       -5
The Toymaker's Problem Solved by Value Iteration
We call $d_i(n)$ the "decision" in state i at the nth stage. When $d_i(n)$ has been specified for all i and all n, a "policy" has been determined. The optimal policy is the one that maximizes total expected return for each i and n. To analyze this problem, let us redefine $v_i(n)$ as the total expected return in n stages starting from state i if an optimal policy is followed. It follows that for any n
$v_i(n) = \max_{k} \left[ q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j(n-1) \right]$,
with $d_i(n)$ the alternative k that attains the maximum. "Principle of optimality" of dynamic programming: in an optimal sequence of decisions or choices, each subsequence must also be optimal.
The Toymaker's Problem Solved by Value Iteration
Table 3.6. Toymaker's Problem Solved by Value Iteration
n:        0     1      2       3        4       ...
v_1(n):   0     6      8.2     10.22    12.222  ...
v_2(n):   0    -3     -1.7     0.23     2.223   ...
d_1(n):   -     1      2       2        2       ...
d_2(n):   -     1      2       2        2       ...
Sample calculations:
n = 1:  0.5(9)+0.5(3) = 6;  0.8(4)+0.2(4) = 4;  0.4(3)+0.6(-7) = -3;  0.7(1)+0.3(-19) = -5
n = 2:  6+0.5(6)+0.5(-3) = 7.5;  4+0.8(6)+0.2(-3) = 8.2;  -3+0.4(6)+0.6(-3) = -2.4;  -5+0.7(6)+0.3(-3) = -1.7
n = 3:  6+0.5(8.2)+0.5(-1.7) = 9.25;  4+0.8(8.2)+0.2(-1.7) = 10.22;  -3+0.4(8.2)+0.6(-1.7) = -0.74;  -5+0.7(8.2)+0.3(-1.7) = 0.23
(Note: -0.74 + (-2.4 - (-1.7)) = -1.44; compare Table 3.1.)
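A compact sketch of this value-iteration computation; the transition probabilities and rewards are those in the alternatives table above:

```python
import numpy as np

# P[i][k] = transition probabilities, q[i][k] = expected immediate reward
P = {0: {1: [0.5, 0.5], 2: [0.8, 0.2]},
     1: {1: [0.4, 0.6], 2: [0.7, 0.3]}}
q = {0: {1: 6.0, 2: 4.0},
     1: {1: -3.0, 2: -5.0}}

v = np.zeros(2)
for n in range(1, 5):
    new_v, decision = np.zeros(2), {}
    for i in (0, 1):
        best_k, best_val = max(
            ((k, q[i][k] + np.dot(P[i][k], v)) for k in (1, 2)),
            key=lambda kv: kv[1])
        new_v[i], decision[i] = best_val, best_k
    v = new_v
    print(n, np.round(v, 3), decision)   # reproduces Table 3.6
```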
The Toymaker's Problem Solved by Value Iteration
Note that for n = 2, 3, and 4, the second alternative in each state is preferred. This means that the toymaker is better advised to advertise and to carry on research in spite of the costs of these activities. For this problem the policy seems to have converged at n = 2, with the second alternative chosen in each state. However, in many problems it is difficult to tell when convergence has been obtained.
Evaluation of the Value-Iteration Approach
Even so, the value-iteration method is not particularly suited to long-duration processes; in practice one iterates until successive policies (or value differences) stop changing, and then stops.
Exhaustive Enumeration
This is a method for solving the infinite-stage problem. It calls for evaluating all possible stationary policies of the decision problem. This is equivalent to an exhaustive enumeration process and can be used only if the number of stationary policies is reasonably small.
Exhaustive Enumeration
Suppose that the decision problem has S stationary policies, and assume that $P^{s}$ and $R^{s}$ are the (one-step) transition and revenue matrices associated with policy s, s = 1, 2, ..., S.
Exhaustive Enumeration
The steps of the exhaustive enumeration method are as follows.
Step 1. Compute $v_i^{s}$, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, ..., m.
Step 2. Compute $\pi_i^{s}$, the long-run stationary probabilities of the transition matrix $P^{s}$ associated with policy s. These probabilities, when they exist, are computed from the equations $\pi^{s} P^{s} = \pi^{s}$ and $\pi_1^{s} + \pi_2^{s} + \dots + \pi_m^{s} = 1$, where $\pi^{s} = (\pi_1^{s}, \pi_2^{s}, \dots, \pi_m^{s})$.
Step 3. Determine $E^{s}$, the expected revenue of policy s per transition step (period), by using the formula $E^{s} = \sum_{i=1}^{m} \pi_i^{s} v_i^{s}$.
Step 4. The optimal policy $s^{*}$ is determined such that $E^{s^{*}} = \max_{s} E^{s}$.
Exhaustive Enumeration
We illustrate the method by solving the gardener problem for an infinite-period planning horizon. The gardener problem has a total of eight stationary policies, as the following table shows:
Stationary policy s    Action
1                      Do not fertilize at all.
2                      Fertilize regardless of the state.
3                      Fertilize if in state 1.
4                      Fertilize if in state 2.
5                      Fertilize if in state 3.
6                      Fertilize if in state 1 or 2.
7                      Fertilize if in state 1 or 3.
8                      Fertilize if in state 2 or 3.
Exhaustive Enumeration
The matrices $P^{s}$ and $R^{s}$ for policies 3 through 8 are derived from those of policies 1 and 2 (taking the "fertilize" row for the states where fertilizer is applied and the "do not fertilize" row otherwise).
Exhaustive Enumeration
Step 1: The values of $v_i^{s}$ can thus be computed as given in the following table.
s     i=1     i=2     i=3
1     5.3     3       -1
2     4.7     3.1     0.4
3     4.7     3       -1
4     5.3     3.1     -1
5     5.3     3       0.4
6     4.7     3.1     -1
7     4.7     3       0.4
8     5.3     3.1     0.4
Exhaustive Enumeration
Step 2: The computation of the stationary probabilities is carried out using the equations $\pi^{s} P^{s} = \pi^{s}$ and $\pi_1^{s} + \pi_2^{s} + \dots + \pi_m^{s} = 1$.
As an illustration, consider s = 2 (fertilize in every state), whose transition matrix (the one appearing in the LP formulation later) is
$P^{2} = \begin{bmatrix} 0.3 & 0.6 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.05 & 0.4 & 0.55 \end{bmatrix}$.
The associated equations are
$\pi_1 = 0.3\pi_1 + 0.1\pi_2 + 0.05\pi_3$,
$\pi_2 = 0.6\pi_1 + 0.6\pi_2 + 0.4\pi_3$,
$\pi_3 = 0.1\pi_1 + 0.3\pi_2 + 0.55\pi_3$,
$\pi_1 + \pi_2 + \pi_3 = 1$.
The solution yields $\pi^{2} = (6/59,\ 31/59,\ 22/59) \approx (0.1017,\ 0.5254,\ 0.3729)$. In this case, the expected yearly revenue is
$E^{2} = 4.7(6/59) + 3.1(31/59) + 0.4(22/59) = 2.256$.
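A short sketch that solves these balance equations numerically (P2 here is the fertilize-everywhere matrix read off the LP coefficients later in the deck):

```python
import numpy as np

P2 = np.array([[0.30, 0.60, 0.10],
               [0.10, 0.60, 0.30],
               [0.05, 0.40, 0.55]])
v2 = np.array([4.7, 3.1, 0.4])          # one-step revenues under policy 2

# Solve pi (P2 - I) = 0 together with sum(pi) = 1 as a least-squares system
A = np.vstack([(P2 - np.eye(3)).T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(pi, 4))        # [0.1017 0.5254 0.3729] = (6/59, 31/59, 22/59)
print(round(pi @ v2, 3))      # 2.256 = expected yearly revenue E^2
```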
Exhaustive Enumeration
Steps 3 and 4: The following table summarizes $\pi^{s}$ and $E^{s}$ for all the stationary policies (for policies 1, 3, 4, and 6 the chain is eventually absorbed in state 3, so $\pi^{s} = (0, 0, 1)$ and $E^{s} = v_3^{s} = -1$).
s     pi_1       pi_2       pi_3       E^s
1     0          0          1          -1
2     6/59       31/59      22/59      2.256
3     0          0          1          -1
4     0          0          1          -1
5     5/154      69/154     80/154     1.724
6     0          0          1          -1
7     5/137      62/137     70/137     1.734
8     12/135     69/135     54/135     2.216
Policy 2 yields the largest expected yearly revenue. The optimum long-range policy calls for applying fertilizer regardless of the state of the system, with $E^{2} = 2.256$.
Policy Iteration
If the system is completely ergodic, the limiting state probabilities $\pi_i$ are independent of the starting state, and the gain g of the system is
$g = \sum_{i=1}^{N} \pi_i q_i$,
where $q_i = \sum_{j=1}^{N} p_{ij} r_{ij}$ is the expected immediate return in state i.
(An ergodic Markov chain is one whose state space is finite, irreducible, and aperiodic; in an irreducible chain every state can be reached from every other state, as opposed to a reducible chain.)
Policy Iteration
Consider a possible five-state problem in which one alternative is selected (marked with an X) in each state. The alternative thus selected is called the "decision" for that state; it is no longer a function of n. The set of X's, or the set of decisions for all states, is called a "policy".
Policy Iteration
It is possible to describe the policy by a decision vector d whose elements represent the number of the alternative selected in each state. An optimal policy is defined as a policy that maximizes the gain, or average return per transition.
Policy Iteration
In the five-state problem diagrammed, there are 120 different policies. However feasible this may be for 120 policies, it becomes infeasible for very large problems. For example, a problem with 50 states and 50 alternatives in each state contains $50^{50} \approx 10^{85}$ policies. The policy-iteration method described next finds the optimal policy in a small number of iterations. It is composed of two parts, the value-determination operation and the policy-improvement routine.
Policy Iteration
The Iteration Cycle
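The two halves of the cycle, stated here as a summary of what the cycle diagram shows, in the standard form used by Howard:

$$ \text{Value determination: solve } g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, \dots, N, \quad \text{with } v_N = 0. $$

$$ \text{Policy improvement: for each } i, \text{ choose the alternative } k \text{ maximizing } q_i^{k} + \sum_{j=1}^{N} p_{ij}^{k} v_j. $$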
The Toymaker's Problem
Let us suppose that we have no a priori knowledge about which policy is best. Then if we set $v_1 = v_2 = 0$ and enter the policy-improvement routine, it will select as an initial policy the one that maximizes expected immediate reward in each state. For the toymaker, this policy consists of selecting alternative 1 in both states 1 and 2. For this policy,
$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}$, $\quad q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}$.
The Toymaker's Problem
We are now ready to begin the value-determination operation that will evaluate our initial policy:
$g + v_1 = 6 + 0.5 v_1 + 0.5 v_2$,
$g + v_2 = -3 + 0.4 v_1 + 0.6 v_2$.
Setting $v_2 = 0$ and solving these equations, we obtain $v_1 = 10$, $v_2 = 0$, $g = 1$. We are now ready to enter the policy-improvement routine, as shown in Table 3.8.
Table 3.8.
State i    Alternative k    Test quantity q_i^k + sum_j p_ij^k v_j
1          1                6 + 0.5(10) + 0.5(0) = 11
1          2                4 + 0.8(10) + 0.2(0) = 12
2          1                -3 + 0.4(10) + 0.6(0) = 1
2          2                -5 + 0.7(10) + 0.3(0) = 2
The Toymaker's Problem
The policy-improvement routine reveals that the second alternative in each state produces a higher value of the test quantity than does the first alternative. For this new policy,
$P = \begin{bmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{bmatrix}$, $\quad q = \begin{bmatrix} 4 \\ -5 \end{bmatrix}$.
We now repeat the value-determination operation to evaluate this policy. With $v_2 = 0$, the results are $v_1 = 10$, $v_2 = 0$, $g = 2$. The gain of this policy is thus twice that of the original policy, and a further pass through the policy-improvement routine produces no change, so we have found the optimal policy. For the optimal policy $v_1 = 10$, $v_2 = 0$, so that $v_1 - v_2 = 10$. This means that, even when the toymaker is following the optimal policy by using advertising and research, having the toy in favor (state 1) is worth 10 more in expected future reward than having it out of favor (state 2).
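A sketch of the complete policy-iteration cycle for the toymaker, using the same data as above (value determination fixes v_2 = 0 and solves the small linear system for g and v_1):

```python
import numpy as np

P = {1: np.array([[0.5, 0.5], [0.4, 0.6]]),   # alternative 1 in both states
     2: np.array([[0.8, 0.2], [0.7, 0.3]])}   # alternative 2 in both states
q = {1: np.array([6.0, -3.0]), 2: np.array([4.0, -5.0])}

def value_determination(policy):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_2 fixed at 0."""
    Pp = np.array([P[policy[i]][i] for i in range(2)])
    qp = np.array([q[policy[i]][i] for i in range(2)])
    # Unknowns: [g, v_1]
    A = np.array([[1.0, 1.0 - Pp[0, 0]],
                  [1.0,      -Pp[1, 0]]])
    g, v1 = np.linalg.solve(A, qp)
    return g, np.array([v1, 0.0])

policy = (1, 1)                      # initial policy from immediate rewards
for _ in range(5):
    g, v = value_determination(policy)
    print(policy, "gain =", g, "v =", v)   # (1,1): g=1; (2,2): g=2
    new = tuple(max((1, 2), key=lambda k: q[k][i] + P[k][i] @ v)
                for i in range(2))
    if new == policy:
        break
    policy = new
```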
Linear Programming
Infinite-stage Markov decision problems can be formulated and solved as linear programs. A policy of the MDP can be described by a matrix $D = [d_{ik}]$: each state i has a set of decisions k, and each row of D must contain a single 1 with the rest of the elements zero. When an element $d_{ik} = 1$, it can be interpreted as calling for decision k when the system is in state i.
Linear Programming
The linear programming formulation is best expressed in terms of a variable $w_{ik}$, which is related to $d_{ik}$ as follows. Let $w_{ik}$ be the unconditional steady-state probability that the system is in state i and decision k is made; that is, $w_{ik} = \pi_i d_{ik}$. From the rules of conditional probability, $\pi_i = \sum_k w_{ik}$. Furthermore, $\sum_i \sum_k w_{ik} = 1$, so that $d_{ik} = \dfrac{w_{ik}}{\sum_k w_{ik}}$.
Linear Programming
There exist several constraints on the $w_{ik}$:
1. $\sum_i \pi_i = 1$, so that $\sum_i \sum_k w_{ik} = 1$.
2. From the results on steady-state probabilities, $\pi_j = \sum_i \pi_i p_{ij}$, so that $\sum_k w_{jk} = \sum_i \sum_k w_{ik}\, p_{ij}^{k}$ for j = 1, ..., m.
3. $w_{ik} \ge 0$ for all i and k.
Linear Programming
The long-run expected average revenue per unit time is given by $E = \sum_i \sum_k v_i^{k} w_{ik}$; hence the problem is to choose the $w_{ik}$ that maximize E, subject to the constraints:
1. $\sum_i \sum_k w_{ik} = 1$,
2. $\sum_k w_{jk} = \sum_i \sum_k w_{ik}\, p_{ij}^{k}$ for all j,
3. $w_{ik} \ge 0$ for all i and k.
This is clearly a linear programming problem that can be solved by the simplex method. Once the optimal $w_{ik}^{*}$ are obtained, the optimal policy follows from $d_{ik}^{*} = w_{ik}^{*} / \sum_k w_{ik}^{*}$.
Linear Programming
The following is an LP formulation of the gardener problem without discounting:
Maximize E = 5.3 w11 + 4.7 w12 + 3 w21 + 3.1 w22 - w31 + 0.4 w32
subject to
w11 + w12 - (0.2 w11 + 0.3 w12 + 0.1 w22 + 0.05 w32) = 0
w21 + w22 - (0.5 w11 + 0.6 w12 + 0.5 w21 + 0.6 w22 + 0.4 w32) = 0
w31 + w32 - (0.3 w11 + 0.1 w12 + 0.5 w21 + 0.3 w22 + w31 + 0.55 w32) = 0
w11 + w12 + w21 + w22 + w31 + w32 = 1
wik >= 0, for all i and k
The optimal solution is w11 = w21 = w31 = 0 and w12 = 0.1017, w22 = 0.5254, w32 = 0.3729. This means that d12 = d22 = d32 = 1: the optimal policy selects alternative k = 2 for i = 1, 2, and 3. The optimal value of E is 4.7(0.1017) + 3.1(0.5254) + 0.4(0.3729) = 2.256.
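A sketch of this LP solved with SciPy (linprog minimizes, so the objective is negated; the equality constraints are the ones listed above, rearranged so all unknowns appear on the left):

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: w11, w12, w21, w22, w31, w32
c = -np.array([5.3, 4.7, 3.0, 3.1, -1.0, 0.4])      # maximize -> minimize -E

A_eq = np.array([
    [0.8,  0.7,  0.0, -0.1, 0.0, -0.05],   # state-1 balance
    [-0.5, -0.6, 0.5,  0.4, 0.0, -0.4],    # state-2 balance
    [-0.3, -0.1, -0.5, -0.3, 0.0, 0.45],   # state-3 balance
    [1.0,  1.0,  1.0,  1.0, 1.0, 1.0],     # probabilities sum to 1
])
b_eq = np.array([0.0, 0.0, 0.0, 1.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6, method="highs")
print(np.round(res.x, 4))   # [0. 0.1017 0. 0.5254 0. 0.3729]
print(round(-res.fun, 3))   # 2.256
```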
Lagrangian Relaxation
If the linear programming method cannot find the optimal solution once additional constraints are imposed, we can use Lagrangian relaxation to fold those constraints into the objective function and then solve the resulting subproblem without the additional constraints. By adjusting the Lagrange multipliers, we can obtain upper and lower bounds on the problem. In effect, the multipliers are used to re-weight the revenues of the Markov decision process, after which the original Markovian solution procedure is applied.
Comparison
The five methods can be compared along four characteristics: simplicity of calculation, suitability for large problems, guarantee of an optimal policy, and ability to handle additional constraints. Methods compared: Value Iteration, Exhaustive Enumeration, Policy Iteration, Linear Programming, and Lagrangian Relaxation.
Semi-Markov Decision Processes
So far we have assumed that decisions are taken at each of a sequence of unit time intervals. We now allow decisions to be taken at varying integral multiples of the unit time interval. The interval between decisions may be predetermined or random.
Partially Observable MDPs
MDPs assume complete observability (the agent can always tell what state it is in).
In practice we cannot always be certain of the current state (for example, judging a lamp's brightness level).
POMDPs are more difficult to solve than MDPs.
Most real-world problems are POMDPs.
One approach is a state-space transformation [22].
Applications of MDPs
Capacity expansion
Decision analysis
Update video (2004/9/16, VoD)
Network control
Optimization of GPRS time slot allocation
Packet classification
Queueing system control
Inventory management
Outline: Introduction, Markov Theory, Markov Decision Processes, Conclusion, Future Work in Sensor Networks
Conclusion
MDPs provide an elegant and formal framework for sequential decision making, and a powerful tool for formulating models and finding optimal policies.
Five algorithms were presented:
Value Iteration
Exhaustive Enumeration
Policy Iteration
Linear Programming
Lagrangian Relaxation
Outline: Introduction, Markov Theory, Markov Decision Processes, Conclusion, Future Work in Sensor Networks
Future Work in Sensor Networks
Markovian recovery policy in object-tracking sensor networks
  Objective function: minimize communication delay (response time) or maximize system lifetime
  Policies: ALL_NBR and ALL_NODE
  Constraint: energy
Markovian monitoring and reporting policy in WSNs (2004/10/7, WSNs Oral)
  Objective functions: minimize communication cost or delay (response time)
  Policies: sensor node density and number of sinks
Markovian sensor node placement policy with application to WSNs
  Objective functions: minimize budget cost or maximize coverage of the sensor field
  Policies: planning and deployment, post-deployment, and redeployment
References
1. Hamdy A. Taha, "Operations Research: An Introduction," 3rd ed., 1982.
2. Hillier and Lieberman, "Introduction to Operations Research," 4th ed., Holden-Day, Inc., 1986.
3. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, "Network Flows," Prentice-Hall, 1993.
4. Leslie Pack Kaelbling, "Techniques in Artificial Intelligence: Markov Decision Processes," MIT OpenCourseWare, Fall 2002.
5. Ronald A. Howard, "Dynamic Programming and Markov Processes," Wiley, New York, 1970.
6. D. J. White, "Markov Decision Processes," Wiley, 1993.
7. Dean L. Isaacson and Richard W. Madsen, "Markov Chains: Theory and Applications," Wiley, 1976.
8. M. H. A. Davis, "Markov Models and Optimization," Chapman & Hall, 1993.
9. Martin L. Puterman, "Markov Decision Processes: Discrete Stochastic Dynamic Programming," Wiley, New York, 1994.
10. Hsu-Kuan Hung (adviser: Yeong-Sung Lin), "Optimization of GPRS Time Slot Allocation," June 2001.
11. Hui-Ting Chuang (adviser: Yeong-Sung Lin), "Optimization of GPRS Time Slot Allocation Considering Call Blocking Probability Constraints," June 2002.
References (continued)
12. 高孔廉, "Operations Research: Quantitative Methods for Management Decisions" (in Chinese), 4th ed., San Min, 1985.
13. 李朝賢, "Introduction to Operations Research" (in Chinese), Hung Yeh Cultural Enterprise, August 1977.
14. 楊超然, "Operations Research" (in Chinese), 1st ed., San Min Book Co., September 1977.
15. 葉若春, "Operations Research" (in Chinese), 5th ed., Chung Hsing Management Consulting, August 1997.
16. 薄喬萍, "Decision Analysis in Operations Research" (in Chinese), 1st ed., Fu Wen Books, June 1989.
17. 葉若春, "Linear Programming: Theory and Applications" (in Chinese), 10th rev. ed., September 1984.
18. Leonard Kleinrock, "Queueing Systems, Volume I: Theory," Wiley, New York, 1975.
19. Hsien-Ming Chiu, "Lagrangian Relaxation," Tamkang University, Fall 2003.
20. L. Cheng, E. Subrahmanian, and A. W. Westerberg, "Design and planning under uncertainty: issues on problem formulation and solution," Computers and Chemical Engineering, 27, 2003, pp. 781-801.
21. Regis Sabbadin, "Possibilistic Markov Decision Processes," Engineering Applications of Artificial Intelligence, 14, 2001, pp. 287-300.
22. K. Karen Yin, Hu Liu, and Neil E. Johnson, "Markovian Inventory Policy with Application to the Paper Industry," Computers and Chemical Engineering, 26, 2002, pp. 1399-1413.
Q & A