Dynamic Optimization and Learning for Renewal Systems
Michael J. Neely, University of Southern California
Asilomar Conference on Signals, Systems, and Computers, Nov.
PDF of paper at:
Sponsored in part by the NSF CAREER grant CCF, ARL Network Science Collaborative Tech. Alliance
A General Renewal System
[Timeline figure: frame durations T[0], T[1], T[2], ... with penalty vectors y[0], y[1], y[2], ...]
Renewal frames r in {0, 1, 2, ...}.
π[r] = policy chosen on frame r.
P = abstract policy space (π[r] in P for all r).
The policy π[r] affects the frame size and penalty vector on frame r. These are random functions of π[r] (their distribution depends on π[r]):
  y[r] = [y_0(π[r]), y_1(π[r]), ..., y_L(π[r])]
  T[r] = T(π[r]) = frame duration
Example realizations: y[r] = [1.2, 1.8, ..., 0.4] with T[r] = 8.1; y[r] = [0.0, 3.8, ..., -2.0] with T[r] = 12.3; y[r] = [1.7, 2.2, ..., 0.9] with T[r] = 5.6.
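A minimal simulation sketch of this abstract model (not from the talk): the two policies and the policy-dependent distributions of (T, y) below are hypothetical placeholders, chosen only to show the frame-by-frame structure.

```python
import random

POLICIES = ["policy_A", "policy_B"]          # stand-in for the abstract policy space P

def run_frame(policy):
    """Return (T, y) for one frame; the distribution depends on the chosen policy."""
    if policy == "policy_A":
        T = random.expovariate(1.0)          # mean frame size 1.0
        y = [random.gauss(1.0, 0.2), random.gauss(2.0, 0.5)]
    else:
        T = random.expovariate(0.5)          # mean frame size 2.0
        y = [random.gauss(0.5, 0.2), random.gauss(3.0, 0.5)]
    return T, y

total_T, total_y0 = 0.0, 0.0
for r in range(10000):                        # frames r = 0, 1, 2, ...
    pi_r = random.choice(POLICIES)            # placeholder for a real policy rule
    T_r, y_r = run_frame(pi_r)
    total_T += T_r
    total_y0 += y_r[0]

print("time average of y0:", total_y0 / total_T)
```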
Example 1: Opportunistic Scheduling
All frames are 1 slot.
S[r] = (S_1[r], S_2[r], S_3[r]) = channel states for slot r.
Policy π[r]: on frame r, first observe S[r], then choose a channel to serve (i.e., one of {1, 2, 3}).
Example objectives: throughput, energy, fairness, etc.
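A per-slot sketch of this example under illustrative assumptions: the ON/OFF channel model and the pure max-rate selection rule are not from the talk; energy or fairness objectives would change the rule.

```python
import random

def observe_channels():
    """Draw an illustrative ON/OFF state vector S[r] = (S1, S2, S3)."""
    return [random.choice([0, 1]) for _ in range(3)]

for r in range(5):
    S = observe_channels()
    chosen = max(range(3), key=lambda i: S[i])   # serve the best-looking channel
    print(f"slot {r}: S={S}, serve channel {chosen + 1}")
```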
Example 2: Markov Decision Problems
M(t) = recurrent Markov chain (continuous or discrete time).
Renewals are defined as returns to state 1.
T[r] = random inter-renewal frame size (frame r).
y[r] = penalties incurred over frame r.
π[r] = policy that affects the transition probabilities over frame r.
Objective: minimize the time average of one penalty subject to time average constraints on the others.
Example 3: Task Processing over Networks
[Figure: T/R nodes, a network coordinator, and a sequence of tasks (Task 1, Task 2, Task 3, ...)]
Infinite sequence of tasks, e.g., query sensors and/or perform computations.
Renewal frame r = processing time for frame r.
Policy types:
  Low level: specify transmission decisions over the network.
  High level: choose among {Backpressure1, Backpressure2, Shortest Path}.
Example objective: maximize quality of information per unit time subject to per-node power constraints.
Quick Review of Renewal-Reward Theory (Pop Quiz Next Slide!)
Define the frame average of y_0[r] over the first R frames:
  ybar_0[R] = (1/R) sum_{r=0}^{R-1} y_0[r]
The time average of y_0[r] is then:
  ( sum_{r=0}^{R-1} y_0[r] ) / ( sum_{r=0}^{R-1} T[r] ) = ybar_0[R] / Tbar[R]
*If behavior is i.i.d. over frames, by the LLN this converges to E{y_0}/E{T}.
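A quick numerical illustration of this review (distributions below are arbitrary stand-ins): with i.i.d. frames, the time average of y_0 converges to the ratio of expectations E{y_0}/E{T}, which generally differs from the average of per-frame ratios E{y_0/T}.

```python
import random

random.seed(0)
R = 200000
y0 = [random.uniform(0.0, 2.0) for _ in range(R)]       # energy on frame r, E{y0} = 1.0
T  = [random.uniform(0.5, 4.0) for _ in range(R)]       # frame size, E{T} = 2.25

time_avg  = sum(y0) / sum(T)                            # (total energy)/(total time)
frame_avg = sum(y / t for y, t in zip(y0, T)) / R       # average of per-frame ratios

print("E{y0}/E{T} estimate:", time_avg)    # about 0.44
print("E{y0/T}  estimate  :", frame_avg)   # about 0.59, noticeably larger
```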
Pop Quiz (10 points):
Let y_0[r] = energy expended on frame r.
Time avg. power = (total energy use)/(total time).
Suppose (for simplicity) behavior is i.i.d. over frames.
To minimize time average power, which one should we minimize?
  (a) the expected per-frame ratio E{y_0[r]/T[r]}
  (b) the ratio of expectations E{y_0[r]}/E{T[r]}
Two General Problem Types:
1) Minimize a time average subject to time average constraints:
   Minimize: ybar_0
   Subject to: ybar_l <= c_l for all l in {1, ..., L},
   where ybar_l = lim_{R -> inf} ( sum_{r=0}^{R-1} y_l[r] ) / ( sum_{r=0}^{R-1} T[r] ).
2) Maximize a concave function φ(x_1, ..., x_L) of the time averages:
   Maximize: φ(ybar_1, ..., ybar_L)
   Subject to: ybar_l <= c_l for all l in {1, ..., L}.
Solving the Problem (Type 1):
Define a "virtual queue" Z_l[r] for each inequality constraint (ybar_l <= c_l):
[Queue figure: arrivals y_l[r], service c_l T[r], backlog Z_l[r]]
  Z_l[r+1] = max[ Z_l[r] - c_l T[r] + y_l[r], 0 ]
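A minimal sketch of this virtual queue update for one constraint; the sample (y_l[r], T[r]) values and c_l = 0.25 are illustrative.

```python
def update_virtual_queue(Z_l, y_l, c_l, T):
    """Z_l[r+1] = max(Z_l[r] - c_l*T[r] + y_l[r], 0)."""
    return max(Z_l - c_l * T + y_l, 0.0)

Z = 0.0
for (y_l, T) in [(1.2, 8.1), (3.8, 12.3), (2.2, 5.6)]:   # sample (y_l[r], T[r]) pairs
    Z = update_virtual_queue(Z, y_l, c_l=0.25, T=T)
    print("Z_l =", Z)
```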
Lyapunov Function and "Drift-Plus-Penalty Ratio":
Scalar measure of queue sizes: L[r] = Z_1[r]^2 + Z_2[r]^2 + ... + Z_L[r]^2
Δ(Z[r]) = E{L[r+1] - L[r] | Z[r]} = "frame-based Lyapunov drift"
Algorithm technique: every frame r, observe Z_1[r], ..., Z_L[r]. Then choose a policy π[r] in P to minimize the "drift-plus-penalty ratio":
  [ Δ(Z[r]) + V E{y_0[r] | Z[r]} ] / E{T[r] | Z[r]}
The Algorithm Becomes:
Observe Z[r] = (Z_1[r], ..., Z_L[r]). Choose π[r] in P to minimize:
  [ Δ(Z[r]) + V E{y_0[r] | Z[r]} ] / E{T[r] | Z[r]}
Then update the virtual queues:
  Z_l[r+1] = max[ Z_l[r] - c_l T[r] + y_l[r], 0 ]
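A sketch of this frame-based rule over a small finite policy set, assuming the conditional means E{y_l(π)} and E{T(π)} are known for each policy. After dropping terms that do not depend on π, minimizing (a standard upper bound on) the drift-plus-penalty ratio reduces to minimizing [ V E{y_0(π)} + sum_l Z_l E{y_l(π)} ] / E{T(π)}; the policy table below, V, and c are made-up values with a single constraint ybar_1 <= c.

```python
V, c = 10.0, 0.25
# policy -> (E{T}, E{y0}, E{y1}), purely illustrative numbers
POLICY_STATS = {"A": (8.0, 1.5, 2.5), "B": (12.0, 2.0, 2.0), "C": (5.0, 1.0, 1.8)}

def dpp_ratio(stats, Z1):
    ET, Ey0, Ey1 = stats
    return (V * Ey0 + Z1 * Ey1) / ET

Z1 = 0.0
for r in range(5):
    pi = min(POLICY_STATS, key=lambda p: dpp_ratio(POLICY_STATS[p], Z1))
    ET, Ey0, Ey1 = POLICY_STATS[pi]             # use means in place of random outcomes
    Z1 = max(Z1 - c * ET + Ey1, 0.0)            # virtual queue update
    print(f"frame {r}: pi={pi}, Z1={Z1:.2f}")
```

With these numbers the rule settles on a policy whose constraint ratio E{y_1}/E{T} is below c, so the virtual queue stays bounded.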
Theorem: Assume the constraints are feasible. Then, under the algorithm that minimizes the DPP ratio
  [ Δ(Z[r]) + V E{y_0[r] | Z[r]} ] / E{T[r] | Z[r]}
on every frame r in {1, 2, 3, ...}, we achieve:
(a) All time average constraints ybar_l <= c_l are satisfied.
(b) The time average of y_0 is within O(1/V) of its optimal value.
Solving the Problem (Type 2):
We reduce it to a problem with the structure of Type 1 via:
  Auxiliary variables γ[r] = (γ_1[r], ..., γ_L[r]).
  The following variation on Jensen's inequality: for any concave function φ(x_1, ..., x_L) and any (arbitrarily correlated) vector of random variables (X_1, X_2, ..., X_L, T), where T > 0, we have:
  E{T φ(X_1, ..., X_L)} / E{T}  <=  φ( E{T X_1}/E{T}, ..., E{T X_L}/E{T} )
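A quick numerical check of this inequality (not from the talk), using φ = sqrt and a correlated stand-in pair (X, T):

```python
import math, random

random.seed(1)
samples = []
for _ in range(100000):
    T = random.uniform(1.0, 5.0)
    X = T + random.uniform(0.0, 2.0)          # correlated with T on purpose
    samples.append((T, X))

ET  = sum(T for T, _ in samples) / len(samples)
lhs = sum(T * math.sqrt(X) for T, X in samples) / len(samples) / ET
rhs = math.sqrt(sum(T * X for T, X in samples) / len(samples) / ET)
print(lhs, "<=", rhs)    # lhs should not exceed rhs (up to sampling noise)
```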
The Algorithm (Type 2) Becomes:
On frame r, observe Z[r] = (Z_1[r], ..., Z_L[r]) and G[r] = (G_1[r], ..., G_L[r]).
(Auxiliary variables) Choose γ_1[r], ..., γ_L[r] to maximize the deterministic problem:
  Maximize: V φ(γ_1[r], ..., γ_L[r]) - sum_l G_l[r] γ_l[r]   (over γ_l in a bounded interval)
(Policy selection) Choose π[r] in P to minimize:
  [ sum_l Z_l[r] E{y_l[r] - c_l T[r] | Z[r], G[r]} + sum_l G_l[r] E{γ_l[r] T[r] - y_l[r] | Z[r], G[r]} ] / E{T[r] | Z[r], G[r]}
Then update the virtual queues:
  Z_l[r+1] = max[ Z_l[r] - c_l T[r] + y_l[r], 0 ]
  G_l[r+1] = max[ G_l[r] + γ_l[r] T[r] - y_l[r], 0 ]
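A sketch of the auxiliary-variable step for one entry γ_1[r], under illustrative assumptions: φ = log(1+γ), the interval [0, 5], the grid search, and the sample (T[r], y_1[r]) values are all stand-ins; a concave solver could replace the grid.

```python
import math

V, G1 = 10.0, 3.0
phi = lambda g: math.log(1.0 + g)              # example concave utility
grid = [i / 100.0 for i in range(0, 501)]      # gamma in [0, 5]
gamma_1 = max(grid, key=lambda g: V * phi(g) - G1 * g)
print("gamma_1[r] =", gamma_1)                 # about V/G1 - 1 for this phi

# G-queue update after the frame reveals T[r] and y_1[r] (illustrative numbers):
T_r, y1_r = 6.0, 9.5
G1 = max(G1 + gamma_1 * T_r - y1_r, 0.0)
print("G_1[r+1] =", G1)
```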
Example Problem – Task Processing:
[Figure: network coordinator with T/R nodes; frame r structure = setup, transmit, idle time I[r]]
Every task reveals random task parameters η[r]:
  η[r] = [(qual_1[r], T_1[r]), (qual_2[r], T_2[r]), ..., (qual_5[r], T_5[r])]
Choose π[r] = [which node transmits, how much idle time] in {1, 2, 3, 4, 5} × [0, I_max].
Transmissions incur power. We use a quality distribution that tends to be better for higher-numbered nodes.
Maximize quality/time subject to p_av <= 0.25 for all nodes.
Minimizing the Drift-Plus-Penalty Ratio:
Minimizing a pure expectation, rather than a ratio, is typically easier (see Bertsekas & Tsitsiklis, Neuro-Dynamic Programming).
Define, for a real number θ:
  f(θ) = min over π in P of E{ (drift-plus-penalty numerator for π) - θ T(π) }
"Bisection Lemma": f(θ) is decreasing in θ, and the optimal ratio θ* satisfies f(θ*) = 0. Hence θ* can be found by bisection, where each bisection step minimizes a pure expectation.
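A sketch of this bisection idea over a small made-up policy table (each policy mapped to its expected frame size and its expected drift-plus-penalty numerator): to minimize num(π)/E{T(π)}, bisect on θ using f(θ) = min_π [ num(π) - θ E{T(π)} ], which is a plain minimization at each step.

```python
# pi -> (E{T(pi)}, expected numerator), illustrative values only
POLICY_TABLE = {"A": (8.0, 17.5), "B": (12.0, 22.0), "C": (5.0, 11.8)}

def f(theta):
    return min(num - theta * ET for (ET, num) in POLICY_TABLE.values())

lo, hi = 0.0, 10.0                       # theta* assumed to lie in [lo, hi]
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if f(mid) > 0:                       # all ratios still above mid -> search higher
        lo = mid
    else:
        hi = mid

print("optimal ratio via bisection:", hi)
print("direct check               :", min(num / ET for ET, num in POLICY_TABLE.values()))
```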
Learning via Sampling from the Past:
Suppose the randomness is characterized by past random samples {η_1, η_2, ..., η_W}.
Want to compute, over the unknown random distribution of η, expectations of the frame quantities for each candidate policy, e.g.:
  E{ h(π, η) }
Approximate this via the W samples from the past:
  E{ h(π, η) } ≈ (1/W) sum_{w=1}^{W} h(π, η_w)
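A minimal sketch of this sampling idea: the per-frame function g, the candidate policy set, and the sample model below are hypothetical placeholders; the point is replacing the unknown expectation by an average over the W most recent samples and then optimizing over policies.

```python
import random

random.seed(2)
W = 10
past_samples = [random.gauss(1.0, 0.5) for _ in range(W)]   # {eta_1, ..., eta_W}

def g(pi, eta):
    return (pi - eta) ** 2                                  # hypothetical per-frame cost

def empirical_mean(pi, samples):
    return sum(g(pi, eta) for eta in samples) / len(samples)

candidate_policies = [0.0, 0.5, 1.0, 1.5, 2.0]
best = min(candidate_policies, key=lambda pi: empirical_mean(pi, past_samples))
print("policy chosen from W =", W, "samples:", best)
```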
Simulation:
[Plot: Quality of Information per unit time vs. sample size W, comparing the drift-plus-penalty ratio algorithm with bisection against an alternative algorithm with time averaging]
Concluding Sims (values for W = 10):
Quick Advertisement – New Book:
M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems, Morgan & Claypool, 2010.
PDF also available from the "Synthesis Lectures" series (on the digital library).
  Lyapunov optimization theory (including these renewal system problems).
  Detailed examples and problem set questions.