Dynamic Programming Applications Lecture 6 Infinite Horizon
DPA62 Infinite horizon
Rules of the game:
- Infinite number of stages
- Stationary system
- Finite number of states
Why do we care?
- Good approximation for problems w/many states
- Analysis is elegant and insightful
- Implementation of the optimal policy is simple
Stationary policy: …
DPA63 Total Cost Problems
$J_\pi(x_0) = \lim_{N\to\infty} E\{\sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k), w_k)\}$
$J^*(x_0) = \min_\pi J_\pi(x_0)$
- Stochastic Shortest Paths (SSP): $\alpha = 1$. Objective: reach a cost-free termination state.
- Discounted problems w/bounded cost/stage: $\alpha < 1$ and $|g| < M$ (e.g. if the state and control sets are finite), so $J_\pi(x_0)$ is well defined with $|J_\pi(x_0)| < M/(1-\alpha)$.
- Discounted problems w/unbounded cost/stage: $\alpha \le 1$. Hard: we don't do it here.
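The bound for the discounted case with bounded cost per stage is a geometric-series argument. A short derivation, using only the symbols defined on this slide:

```latex
\begin{align*}
|J_\pi(x_0)|
  &\le \lim_{N\to\infty} \sum_{k=0}^{N-1} \alpha^k \, E\{\, |g(x_k, \mu_k(x_k), w_k)| \,\} \\
  &<   \sum_{k=0}^{\infty} \alpha^k M \;=\; \frac{M}{1-\alpha},
  \qquad 0 \le \alpha < 1 .
\end{align*}
```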
DPA64 Average cost problems
- $J_\pi(x_0) = \infty$ for all feasible policies $\pi$ and states $x_0$
- Average cost/stage: $\lim_{N\to\infty} \frac{1}{N} E\{\sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k)\}$; this is well defined and finite
- Covered LATER
DPA65 Preview
- Convergence: $J^*(x) = \lim_{N\to\infty} J^*_N(x)$, for all $x$
- Limiting solution (Bellman Equation): $J^*(x) = \min_u E_w\{g(x,u,w) + \alpha J^*(f(x,u,w))\}$
- Optimal stationary policy: $\mu(x)$ that attains the minimum above.
DPA66 SSP
- Finite* constraint set $U(i)$ for all $i$
- Zero-cost state: $p_{00}(u) = 1$, $g(0,u,0) = 0$, $\forall u \in U(0)$
- Special cases: deterministic SP, finite horizon
DPA67 Shorthand
$J = (J(1), \dots, J(n))$; $J(0) = 0$
- $(TJ)(i) = \min_{u \in U(i)} \sum_{j=0}^{n} p_{ij}(u)\,(g(i,u,j) + J(j))$
  $TJ$: optimal cost-to-go for the one-stage problem w/cost per stage $g$ and initial cost $J$.
- $(T_\mu J)(i) = \sum_{j=0}^{n} p_{ij}(\mu(i))\,(g(i,\mu(i),j) + J(j))$
  $T_\mu J$: cost-to-go under $\mu$ for the one-stage problem w/cost per stage $g$ and initial cost $J$.
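DPA68 Shorthand
- $T_\mu J = g_\mu + P_\mu J$, where $g_\mu(i) = \sum_j p_{ij}(\mu(i))\, g(i,\mu(i),j)$ and $P_\mu = (p_{ij}(\mu(i)))$ for $i,j = 1,\dots,n$ (not 0)
- $TJ = g_{\bar\mu} + P_{\bar\mu} J$, where $\bar\mu(i)$ attains the minimum in $(TJ)(i)$, $g_{\bar\mu}(i) = \sum_j p_{ij}(\bar\mu(i))\, g(i,\bar\mu(i),j)$ and $P_{\bar\mu} = (p_{ij}(\bar\mu(i)))$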
DPA69 Value iteration
- $T^k J = T(T^{k-1} J)$, $T^0 J = J$
- $T^k J$: optimal cost-to-go for the $k$-stage problem w/cost/stage $g$ and initial cost $J$
- ...and similarly for $T_\mu$
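A minimal sketch of value iteration as repeated application of $T$. The two-state model, its action names and its costs are hypothetical and chosen only for illustration; state 0 is the cost-free terminal state.

```python
# Hypothetical two-state SSP (assumption, not from the lecture).
# MODEL[i][u] is a list of (probability, next_state, cost) triples.
MODEL = {
    1: {"a": [(1.0, 0, 4.0)], "b": [(1.0, 2, 1.0)]},
    2: {"a": [(0.5, 0, 1.0), (0.5, 2, 1.0)], "b": [(1.0, 0, 3.0)]},
}


def q_value(J, i, u):
    """sum_j p_ij(u) * (g(i,u,j) + J(j)) for state i and control u."""
    return sum(p * (g + J[j]) for p, j, g in MODEL[i][u])


def T(J):
    """Apply the DP operator T once; J(0) stays fixed at 0."""
    new_J = {0: 0.0}
    for i in MODEL:
        new_J[i] = min(q_value(J, i, u) for u in MODEL[i])
    return new_J


def value_iteration(J, k):
    """Return T^k J."""
    for _ in range(k):
        J = T(J)
    return J


J = value_iteration({0: 0.0, 1: 0.0, 2: 0.0}, 60)
print(J)
```

For this toy model the fixed point of $J = TJ$ is $J^*(1) = 3$, $J^*(2) = 2$, and the iterates approach it geometrically.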
DPA610 T Properties
- Monotonicity Lemma: If $J \le J'$ and $\mu$ is stationary, then $T^k J \le T^k J'$ and $T_\mu^k J \le T_\mu^k J'$.
- Subadditivity: If $\mu$ is stationary, $e = (1,1,\dots,1)$, $r > 0$, then $(T^k(J + re))(i) \le (T^k J)(i) + r$ and $(T_\mu^k(J + re))(i) \le (T_\mu^k J)(i) + r$.
DPA611 Property
Define: proper stationary policy $\mu$: terminal state reachable from any state w.p. > 0 (in $n$ stages)
Assumptions:
1. There exists at least one proper policy.
2. The cost-to-go $J_\mu(i)$ of an improper $\mu$ is infinite for some $i$.
2'. Expected cost/stage: $g(i,u) = \sum_j p_{ij}(u)\, g(i,u,j) > 0$, $\forall i \ne 0$ and $u \in U(i)$.
What do these mean in the deterministic case?
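Properness of a stationary policy is a reachability question on the transition graph that the policy induces. A minimal sketch, assuming a hypothetical two-state model in which only probabilities matter (costs are omitted):

```python
from collections import deque

# Hypothetical model (assumption): MODEL[i][u] lists (probability, next_state);
# state 0 is the terminal state.
MODEL = {
    1: {"stay": [(1.0, 1)], "go": [(0.5, 0), (0.5, 2)]},
    2: {"back": [(1.0, 1)], "exit": [(1.0, 0)]},
}


def is_proper(policy):
    """True if state 0 is reachable w.p. > 0 from every state under `policy`."""
    for start in MODEL:
        seen, queue = {start}, deque([start])
        while queue:
            i = queue.popleft()
            if i == 0:
                break  # terminal state reached from `start`
            for p, j in MODEL[i][policy[i]]:
                if p > 0 and j not in seen:
                    seen.add(j)
                    queue.append(j)
        if 0 not in seen:
            return False
    return True


print(is_proper({1: "go", 2: "exit"}))
print(is_proper({1: "stay", 2: "exit"}))
```

With finitely many states, reachability at all is the same as reachability within $n$ stages, so a breadth-first search is enough.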
DPA612 Alternative assumption
(3) There exists an integer $m$ such that for any policy $\pi$ and initial state $x$, the probability of reaching the terminal state from $x$ in $m$ stages under $\pi$ is non-zero.
This is a stronger assumption than 1 & 2.
DPA613 Main Theorem
Under assumptions 1 and 2 (or under 3):
1. $\lim_{k\to\infty} T^k J = J^*$, for every vector $J$.
2. $J^* = TJ^*$, and $J^*$ is the only solution of $J = TJ$.
3. For any proper policy $\mu$ and for every vector $J$, $\lim_{k\to\infty} T_\mu^k J = J_\mu$, $J_\mu = T_\mu J_\mu$, and $J_\mu$ is the only solution.
4. A stationary $\mu$ is optimal iff $T_\mu J^* = TJ^*$.
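DPA614 Lemma
Suppose all stationary policies are proper. Then $\exists\, \xi > 0$ (a vector with positive components) s.t. for all stationary $\mu$, $T$ and $T_\mu$ are contraction mappings w.r.t. the weighted max-norm $\|\cdot\|_\xi$.
- Weighted max-norm: $\|J\|_\xi = \max_i |J(i)| / \xi(i)$
- Contraction mapping: $\|TJ - TJ'\|_\xi \le \beta \|J - J'\|_\xi$ for some $\beta < 1$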
DPA615 How to find $J^*$ and $\mu^*$?
- Value iteration
- Policy iteration
- Variants
DPA616 Asynchronous Value Iteration
- Start with arbitrary $J_0$.
- Stage $k$: pick $i_k$ and iterate $J_{k+1}(i_k) \leftarrow (TJ_k)(i_k)$ (all the rest stay the same: $J_{k+1}(i) = J_k(i)$ for $i \ne i_k$).
- Assume each state $i$ is chosen infinitely often. Then $J_k \to J^*$.
- This is also called the Gauss-Seidel method.
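A minimal sketch of the one-state-at-a-time update, assuming a hypothetical two-state model and a fixed cyclic order of states (any order that visits every state infinitely often would do):

```python
# Hypothetical two-state SSP (assumption, not from the lecture).
# MODEL[i][u] is a list of (probability, next_state, cost) triples; state 0 is terminal.
MODEL = {
    1: {"a": [(1.0, 0, 4.0)], "b": [(1.0, 2, 1.0)]},
    2: {"a": [(0.5, 0, 1.0), (0.5, 2, 1.0)], "b": [(1.0, 0, 3.0)]},
}


def gauss_seidel_sweep(J, order):
    """Update J in place, one state at a time, always using the newest values."""
    for i in order:
        J[i] = min(
            sum(p * (g + J[j]) for p, j, g in outcomes)
            for outcomes in MODEL[i].values()
        )
    return J


J = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(60):
    gauss_seidel_sweep(J, order=[2, 1])
print(J)
```

Unlike the synchronous version, no copy of $J$ is kept: the update of state 1 already sees the new value of state 2 within the same sweep.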
DPA617 Decomposition
- Suppose $S$ can be partitioned into $S_1, S_2, \dots, S_M$ so that if $i \in S_m$ then, under any policy, the successor state is $j = 0$ or $j \in S_{m-k}$ for some $0 \le k \le m-1$.
- Then the solution decomposes into the sequential solution of $M$ SSPs, each of which can be solved using the optimal solutions of the preceding subproblems.
- If $k > 0$ above, then the Gauss-Seidel method that iterates on states in order of their membership in $S_m$ needs only one iteration per state to reach the optimum (e.g. finite horizon problems).
DPA618 Policy Iteration
Start with a given policy $\mu^k$:
- Policy evaluation step: compute $J_{\mu^k}(i)$ by solving the linear system (with $J(0) = 0$): $J = g_{\mu^k} + P_{\mu^k} J$
- Policy improvement step: compute a new policy $\mu^{k+1}$ as the solution to $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$, that is, $\mu^{k+1}(i) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\,(g(i,u,j) + J_{\mu^k}(j))$
- Terminate iff $J_{\mu^{k+1}} = J_{\mu^k}$ (no improvement).
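A minimal sketch of the two steps, assuming a hypothetical two-state model in which every stationary policy is proper (otherwise the evaluation system is singular). The small linear solver is written out only to keep the example free of external libraries; stopping when the greedy policy repeats is one way to detect "no improvement".

```python
# Hypothetical two-state SSP (assumption, not from the lecture).
# MODEL[i][u] is a list of (probability, next_state, cost) triples; state 0 is terminal.
MODEL = {
    1: {"a": [(1.0, 0, 4.0)], "b": [(1.0, 2, 1.0)]},
    2: {"a": [(0.5, 0, 1.0), (0.5, 2, 1.0)], "b": [(1.0, 0, 3.0)]},
}
STATES = sorted(MODEL)  # non-terminal states 1..n


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]


def evaluate(policy):
    """Policy evaluation: solve (I - P_mu) J = g_mu over the non-terminal states."""
    idx = {s: k for k, s in enumerate(STATES)}
    n = len(STATES)
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    b = [0.0] * n
    for i in STATES:
        for p, j, g in MODEL[i][policy[i]]:
            b[idx[i]] += p * g          # expected one-stage cost g_mu(i)
            if j != 0:
                A[idx[i]][idx[j]] -= p  # entry of I - P_mu
    x = solve(A, b)
    J = {0: 0.0}
    J.update({s: x[idx[s]] for s in STATES})
    return J


def improve(J):
    """Policy improvement: pick the minimizing control in every state."""
    return {
        i: min(MODEL[i], key=lambda u: sum(p * (g + J[j]) for p, j, g in MODEL[i][u]))
        for i in STATES
    }


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        J = evaluate(policy)
        new_policy = improve(J)
        if new_policy == policy:
            return policy, J
        policy = new_policy


policy, J = policy_iteration({1: "a", 2: "b"})
print(policy, J)
```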
DPA619 Policy Iteration Theorem
The algorithm generates an improving sequence of proper policies, that is, for all $i, k$: $J_{\mu^{k+1}}(i) \le J_{\mu^k}(i)$, and terminates with an optimal policy.
DPA620 Multistage Look-ahead
- Start at state $i$
- Make $m$ subsequent decisions & incur costs
- End up in state $j$ and pay terminal cost $J_\mu(j)$
Multistage policy iteration terminates w/optimal policy under the same conditions.
DPA621 Value vs. Policy iteration
- In general, value iteration requires an infinite number of iterations to obtain the optimal cost-to-go
- Policy iteration always terminates finitely
- A value iteration step is an easier operation than a policy iteration step
- Idea: should combine them.
DPA622 Modified policy iteration
Let $J_0$ be s.t. $TJ_0 \le J_0$, and let $J_1, J_2, \dots$ and $\mu^0, \mu^1, \mu^2, \dots$ be s.t. $T_{\mu^k} J_k = TJ_k$ and $J_{k+1} = (T_{\mu^k})^{m_k}(J_k)$
- if $m_k = 1$ for all $k$: value iteration
- if $m_k = \infty$ for all $k$: policy iteration, where the evaluation step is done iteratively via value iteration
- heuristic choices of $m_k > 1$, keeping in mind that $T_\mu J$ is much cheaper to compute than $TJ$
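A minimal sketch of the scheme with a constant $m_k = m$, assuming a hypothetical two-state model and a starting vector that satisfies $TJ_0 \le J_0$:

```python
# Hypothetical two-state SSP (assumption, not from the lecture).
# MODEL[i][u] is a list of (probability, next_state, cost) triples; state 0 is terminal.
MODEL = {
    1: {"a": [(1.0, 0, 4.0)], "b": [(1.0, 2, 1.0)]},
    2: {"a": [(0.5, 0, 1.0), (0.5, 2, 1.0)], "b": [(1.0, 0, 3.0)]},
}


def q_value(J, i, u):
    """sum_j p_ij(u) * (g(i,u,j) + J(j))."""
    return sum(p * (g + J[j]) for p, j, g in MODEL[i][u])


def modified_policy_iteration(J, m, sweeps):
    """Repeat: pick mu with T_mu J = T J, then set J <- (T_mu)^m J."""
    mu = {}
    for _sweep in range(sweeps):
        mu = {i: min(MODEL[i], key=lambda u: q_value(J, i, u)) for i in MODEL}
        for _step in range(m):
            J = {0: 0.0, **{i: q_value(J, i, mu[i]) for i in MODEL}}
    return mu, J


# J_0 = 10 in both states satisfies T J_0 <= J_0 for this model.
mu, J = modified_policy_iteration({0: 0.0, 1: 10.0, 2: 10.0}, m=3, sweeps=30)
print(mu, J)
```

Setting `m=1` reduces the inner loop to a single application of $T$ (value iteration); a large `m` approximates exact policy evaluation.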
DPA623 Asynchronous Policy Iteration
Generate a sequence of costs-to-go $J_k$ and stationary policies $\mu^k$.
Given $(J_k, \mu^k)$: select $S_k$ and generate a new $(J_{k+1}, \mu^{k+1})$ by one of two alternative updates:
a) $J_{k+1}(i) = (T_{\mu^k} J_k)(i)$ if $i \in S_k$, $J_{k+1}(i) = J_k(i)$ otherwise; and $\mu^{k+1} = \mu^k$
b) $\mu^{k+1}(i) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\,(g(i,u,j) + J_k(j))$ if $i \in S_k$, $\mu^{k+1}(i) = \mu^k(i)$ otherwise; and $J_{k+1} = J_k$
DPA624 Convergence
- If both the value update and the policy update are executed infinitely often for all states, and
- if the initial conditions $J_0$ and $\mu^0$ are s.t. $T_{\mu^0} J_0 \le J_0$ (for example, select $\mu^0$ and $J_0 = J_{\mu^0}$),
- then $J_k$ converges to $J^*$.
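DPA625 Linear programming
- Since $\lim_{k\to\infty} T^k J = J^*$ for all $J$: $J \le TJ \Rightarrow J \le J^* = TJ^*$
- So $J^* = \arg\max\{J \mid J \le TJ\}$, that is:
  maximize $\sum_i \lambda_i$
  subject to $\lambda_i \le \sum_j p_{ij}(u)\,(g(i,u,j) + \lambda_j)$, $i = 1,\dots,n$, $u \in U(i)$
- Problem: very big when $n$ is big!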
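The implication $J \le TJ \Rightarrow J \le J^*$ combines the monotonicity of $T$ with the convergence of value iteration. Spelled out as a chain:

```latex
\begin{align*}
J \le TJ
  &\;\Longrightarrow\; TJ \le T^2 J
     && \text{(monotonicity of } T\text{)} \\
  &\;\Longrightarrow\; J \le TJ \le T^2 J \le \dots \le T^k J
     && \text{(induction on } k\text{)} \\
  &\;\Longrightarrow\; J \le \lim_{k\to\infty} T^k J = J^*
     && \text{(convergence of value iteration)}
\end{align*}
```

Since $J^*$ itself satisfies $J^* = TJ^*$, it is feasible and dominates every other feasible $J$ componentwise, which is why maximizing the sum of the components recovers it.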
DPA626 Discounted problems
Let $\alpha < 1$. No termination state.
- Prove as a special case of SSP
- Modify definitions and proofs:
  $(TJ)(i) = \min_{u \in U(i)} \sum_j p_{ij}(u)\,(g(i,u,j) + \alpha J(j))$
  $(T_\mu J)(i) = \sum_j p_{ij}(\mu(i))\,(g(i,\mu(i),j) + \alpha J(j))$
  $T_\mu J = g_\mu + \alpha P_\mu J$
DPA627 T-Properties
- Monotonicity Lemma: If $J \le J'$ and $\mu$ is stationary, then $T^k J \le T^k J'$ and $T_\mu^k J \le T_\mu^k J'$.
- $\alpha$-Subadditivity: If $\mu$ is stationary and $r > 0$, then $(T^k(J + re))(i) \le (T^k J)(i) + \alpha^k r$ and $(T_\mu^k(J + re))(i) \le (T_\mu^k J)(i) + \alpha^k r$.
DPA628 Contraction
For any $J$ and $J'$ and any policy $\mu$, the following contraction properties hold:
- $\|TJ - TJ'\| \le \alpha \|J - J'\|$
- $\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|$
- max-norm: $\|J\| = \max_i |J(i)|$
DPA629 Convergence Theorem
Convert to SSP:
- Define a new terminal state 0 and transition probabilities: $\tilde P(j \mid i,u) = \alpha P(j \mid i,u)$, $\tilde P(0 \mid i,u) = 1 - \alpha$
- All policies are proper
- All previous algorithms & convergence properties carry over.
- Separate proof for an infinite number of states
- Can extend to a compact control set w/continuous probabilities.
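A minimal sketch of the conversion, assuming a hypothetical two-state discounted model. One detail is an assumption of this sketch rather than something stated on the slide: every transition of the converted problem is charged the expected stage cost $g(i,u)$, so that the expected cost per stage is unchanged.

```python
ALPHA = 0.9

# Hypothetical discounted problem (assumption, not from the lecture).
# D[i][u] is a list of (probability, next_state, cost) triples; there is no state 0.
D = {
    1: {"a": [(1.0, 1, 1.0)], "b": [(0.5, 1, 2.0), (0.5, 2, 0.0)]},
    2: {"a": [(1.0, 2, 0.5)], "b": [(1.0, 1, 0.0)]},
}


def to_ssp(model, alpha):
    """Scale every transition by alpha and send the remaining 1 - alpha to state 0."""
    ssp = {}
    for i, actions in model.items():
        ssp[i] = {}
        for u, outcomes in actions.items():
            expected = sum(p * g for p, _, g in outcomes)  # g(i,u)
            ssp[i][u] = [(alpha * p, j, expected) for p, j, _ in outcomes]
            ssp[i][u].append((1.0 - alpha, 0, expected))
    return ssp


def value_iteration(model, alpha, iters):
    """Synchronous value iteration with discount factor alpha; J(0) is fixed at 0."""
    J = {0: 0.0, **{i: 0.0 for i in model}}
    for _ in range(iters):
        J = {0: 0.0, **{
            i: min(sum(p * (g + alpha * J[j]) for p, j, g in out)
                   for out in acts.values())
            for i, acts in model.items()
        }}
    return J


J_disc = value_iteration(D, ALPHA, 500)               # discounted, no terminal state
J_ssp = value_iteration(to_ssp(D, ALPHA), 1.0, 500)   # undiscounted SSP
print(J_disc, J_ssp)
```

Both runs apply the same operator in different clothing, so the two cost vectors agree up to rounding.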
DPA630 Applications
1. Asset selling w/infinite horizon (continued)
2. Inventory w/batch processing, infinite horizon:
- An order is placed at time $t$ w.p. $p$
- Given current backlog $j$, the manufacturer can either process the whole batch at a fixed cost $K$, or postpone and incur a cost $c$/unit
- The maximum backlog is $n$
- Policy that minimizes expected total cost?
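A minimal sketch of the batch-processing problem solved by value iteration. Several modelling details are assumptions of this sketch, not statements from the slide: costs are discounted by a factor `alpha < 1`, postponing with backlog $j$ costs $c \cdot j$ in that period, a new order can still arrive in the period in which the batch is processed, and processing is forced at backlog $n$. The function name and the numeric parameters are hypothetical.

```python
def batch_policy(n, p, K, c, alpha, iters=2000):
    """Value iteration for the batch-processing problem; returns (J, policy)."""

    def process_cost(J):
        # Pay K and clear the backlog; a new order arrives w.p. p.
        return K + alpha * ((1 - p) * J[0] + p * J[1])

    def wait_cost(J, j):
        # Pay c per backlogged order; the backlog grows by one w.p. p.
        return c * j + alpha * ((1 - p) * J[j] + p * J[j + 1])

    J = [0.0] * (n + 1)
    for _ in range(iters):
        proc = process_cost(J)
        # At the maximum backlog n the only feasible decision is to process.
        J = [proc if j == n else min(proc, wait_cost(J, j)) for j in range(n + 1)]

    proc = process_cost(J)
    policy = [
        "process" if j == n or proc <= wait_cost(J, j) else "wait"
        for j in range(n + 1)
    ]
    return J, policy


J, policy = batch_policy(n=5, p=0.5, K=5.0, c=1.0, alpha=0.9)
print(policy)
```

Under these assumptions the computed policy has a threshold form: wait while the backlog is small, process once it reaches some level.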