Online Sampling for Markov Decision Processes
  Bob Givan
  Joint work w/ E. K. P. Chong, H. Chang, G. Wu
  Electrical and Computer Engineering, Purdue University
Bob Givan, Electrical and Computer Engineering, Purdue University, November 4-9, 2001
Markov Decision Process (MDP)
  Ingredients:
    System state $x$ in state space $X$
    Control action $a$ in $A(x)$
    Reward $R(x,a)$
    State-transition probability $P(x,y,a)$
  Find control policy to maximize objective function
Optimal Policies
  Policy: mapping from state and time to actions
  Stationary policy: mapping from state to actions
  Goal: a policy maximizing the objective function
    $V_H^*(x_0) = \max \mathrm{Obj}[R(x_0,a_0), \ldots, R(x_{H-1},a_{H-1})]$
    where the “max” is over all policies $u = u_0, \ldots, u_{H-1}$
  For large $H$, $a_0$ is independent of $H$ (with an ergodicity assumption)
  Stationary optimal action $a_0$ for $H = \infty$ via receding-horizon control
Q Values
  Fix a large $H$ and focus on finite-horizon reward
  Define $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)]$
    The “utility” of action $a$ at state $x$
    Name: Q-value of action $a$ at state $x$
  Key identities (Bellman’s equations):
    $V_H^*(x) = \max_a Q(x,a)$
    $u_0^*(x) = \arg\max_a Q(x,a)$
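The two identities above can be exercised on a tiny tabular problem. The following is a minimal sketch, assuming the objective is total undiscounted reward and $V_0^* = 0$; the two-state MDP, its numbers, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state, two-action MDP (illustrative numbers only).
STATES = [0, 1]
ACTIONS = [0, 1]
# R[x][a] is the reward; P[x][a][y] is the probability of moving from x to y under a.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]


def q_values(x, v_next):
    """Q(x,a) = R(x,a) + E[V_{H-1}^*(y)] for every action a."""
    return [R[x][a] + sum(P[x][a][y] * v_next[y] for y in STATES) for a in ACTIONS]


def optimal_values(horizon):
    """Return V_H^* as a list indexed by state, assuming V_0^* = 0."""
    v = [0.0 for _ in STATES]
    for _ in range(horizon):
        # Bellman backup: V_H^*(x) = max_a Q(x,a)
        v = [max(q_values(x, v)) for x in STATES]
    return v


def greedy_action(x, horizon):
    """u_0^*(x) = argmax_a Q(x,a) for the given horizon."""
    q = q_values(x, optimal_values(horizon - 1))
    return max(ACTIONS, key=lambda a: q[a])


if __name__ == "__main__":
    print(optimal_values(2), greedy_action(0, 2), greedy_action(1, 2))
```

This exact recursion touches every state at every stage, which is why it stops being practical once the state space is large.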
Solution Methods
  Recall:
    $u_0^*(x) = \arg\max_a Q(x,a)$
    $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)]$
  Problem:
    Q-value depends on optimal policy
    State space is extremely large (often continuous)
  Two-pronged solution approach:
    Apply a receding-horizon method
    Estimate Q-values via simulation/sampling
Methods for Q-value Estimation
  Previous work by other authors:
    Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
    Policy rollout (lower bound) [Bertsekas & Castanon, 1999]
  Our techniques:
    Hindsight optimization (upper bound)
    Parallel rollout (lower bound)
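The bound labels in this list can be read as one chain of inequalities. The following is a sketch in the deck's notation ($V_{H-1}^u$ for the value of following policy $u$, $U$ for a set of base policies that contains $u$). The symbols $\omega$ (a sampled random trace) and $W_{H-1}(y,\omega)$ (the best total reward achievable from $y$ when the trace $\omega$ is known in advance) are assumptions introduced here for illustration.

```latex
\begin{align*}
R(x,a) + E\big[V_{H-1}^{u}(y)\big]
  &\le R(x,a) + E\big[\max_{u' \in U} V_{H-1}^{u'}(y)\big]
     && \text{(policy rollout} \le \text{parallel rollout, for } u \in U\text{)} \\
  &\le R(x,a) + E\big[V_{H-1}^{*}(y)\big] = Q(x,a)
     && \text{(both are lower bounds on the Q-value)} \\
  &\le R(x,a) + E_{y,\omega}\big[W_{H-1}(y,\omega)\big]
     && \text{(hindsight optimization, an upper bound)}
\end{align*}
```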
Expectimax Tree for $V^*$
Unbiased Sampling
Unbiased Sampling (Cont’d)
  For a given desired accuracy, how large should sampling width and depth be?
    Answered: Kearns, Mansour, and Ng (1999)
  Requires prohibitive sampling width and depth
    e.g. $C \approx 10^8$, $H_s > 60$ to distinguish “best” and “worst” policies in our scheduling domain
  We evaluate with smaller width and depth
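To make the width/depth trade-off concrete, here is a minimal sketch of a sampled expectimax tree, assuming an undiscounted finite-depth objective. The generative model `simulate`, the names `width` and `depth`, and the helper `simulator_calls` are hypothetical; this is an illustration of the idea, not the cited authors' implementation.

```python
import random

ACTIONS = [0, 1]


def simulate(x, a, rng):
    """Hypothetical generative model: returns (reward, next state)."""
    reward = 1.0 if a == x else 0.0
    # The state flips with probability 0.3, otherwise it stays put.
    y = 1 - x if rng.random() < 0.3 else x
    return reward, y


def sampled_value(x, depth, width, rng):
    """Estimate V_depth^*(x): max over actions at a state node."""
    if depth == 0:
        return 0.0
    return max(sampled_q(x, a, depth, width, rng) for a in ACTIONS)


def sampled_q(x, a, depth, width, rng):
    """Estimate Q(x,a): average over `width` sampled next states."""
    total = 0.0
    for _ in range(width):
        reward, y = simulate(x, a, rng)
        total += reward + sampled_value(y, depth - 1, width, rng)
    return total / width


def simulator_calls(depth, width, n_actions):
    """Number of simulator calls made by sampled_value at the root."""
    return sum((n_actions * width) ** d for d in range(1, depth + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    print(sampled_value(0, 3, 4, rng), simulator_calls(3, 4, len(ACTIONS)))
```

The call count grows as (number of actions × width) raised to the depth, which is why the width and depth quoted above are out of reach in practice.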
How to Look Deeper?
Policy Roll-out
Policy Rollout in Equations
  Write $V_H^u(y)$ for the value of following policy $u$
  Recall: $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)] = R(x,a) + E[\max_u V_{H-1}^u(y)]$
  Given a base policy $u$, use $R(x,a) + E[V_{H-1}^u(y)]$ as a lower-bound estimate of the Q-value
  Resulting policy is PI($u$), given infinite sampling
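The rollout estimate $R(x,a) + E[V_{H-1}^u(y)]$ can be approximated by averaging simulated traces. The following is a minimal sketch under stated assumptions: the generative model `simulate`, the trace count `n_traces`, and the example base policies are hypothetical, and rewards are summed without discounting.

```python
import random

ACTIONS = [0, 1]


def simulate(x, a, rng):
    """Hypothetical generative model: returns (reward, next state)."""
    reward = 1.0 if a == x else 0.0
    y = 1 - x if rng.random() < 0.3 else x
    return reward, y


def rollout_return(x, policy, steps, rng):
    """Total reward of following `policy` for `steps` steps from state x."""
    total = 0.0
    for _ in range(steps):
        reward, x = simulate(x, policy(x), rng)
        total += reward
    return total


def rollout_q(x, a, policy, horizon, n_traces, rng):
    """Estimate R(x,a) + E[V_{H-1}^u(y)] by averaging sampled traces."""
    total = 0.0
    for _ in range(n_traces):
        reward, y = simulate(x, a, rng)  # take action a first
        total += reward + rollout_return(y, policy, horizon - 1, rng)
    return total / n_traces


def rollout_action(x, policy, horizon, n_traces, rng):
    """Pick the action with the largest rollout Q estimate."""
    return max(ACTIONS, key=lambda a: rollout_q(x, a, policy, horizon, n_traces, rng))


if __name__ == "__main__":
    rng = random.Random(0)
    always_zero = lambda x: 0  # a deliberately weak base policy
    print(rollout_q(1, 1, always_zero, 5, 20, rng))
```

With finitely many traces the estimates are noisy, so the chosen action matches PI($u$) only approximately.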
Policy Roll-out (cont’d)
Parallel Policy Rollout
  Generalization of policy rollout, due to [Chang, Givan, and Chong, 2000]
  Given a set $U$ of base policies, use $R(x,a) + E[\max_{u \in U} V_{H-1}^u(y)]$ as an estimate of the Q-value
    More accurate estimate than policy rollout
    Still gives a lower bound on the true Q-value
    Still gives a policy no worse than any in $U$
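One way to approximate $R(x,a) + E[\max_{u \in U} V_{H-1}^u(y)]$ by simulation is sketched below. It assumes a hypothetical generative model `simulate`, samples `n_next` next states, and estimates each $V_{H-1}^u(y)$ from `n_traces` traces before taking the maximum. These names and the sampling scheme are illustrative assumptions, not necessarily the scheme used in the cited work; with few traces the maximum of noisy estimates is biased upward, so the lower-bound property holds only in the limit of many samples.

```python
import random

ACTIONS = [0, 1]


def simulate(x, a, rng):
    """Hypothetical generative model: returns (reward, next state)."""
    reward = 1.0 if a == x else 0.0
    y = 1 - x if rng.random() < 0.3 else x
    return reward, y


def rollout_return(x, policy, steps, rng):
    """Total reward of following `policy` for `steps` steps from state x."""
    total = 0.0
    for _ in range(steps):
        reward, x = simulate(x, policy(x), rng)
        total += reward
    return total


def policy_value(y, policy, steps, n_traces, rng):
    """Sampled estimate of V_steps^u(y)."""
    return sum(rollout_return(y, policy, steps, rng) for _ in range(n_traces)) / n_traces


def parallel_rollout_q(x, a, policies, horizon, n_next, n_traces, rng):
    """Estimate R(x,a) + E[max_{u in U} V_{H-1}^u(y)]."""
    total = 0.0
    for _ in range(n_next):
        reward, y = simulate(x, a, rng)
        # Evaluate every base policy from the sampled next state, keep the best.
        best = max(policy_value(y, u, horizon - 1, n_traces, rng) for u in policies)
        total += reward + best
    return total / n_next


if __name__ == "__main__":
    rng = random.Random(1)
    base = [lambda x: 0, lambda x: 1]  # two weak, complementary base policies
    print(parallel_rollout_q(0, 0, base, 5, 6, 4, rng))
```

Because the maximum is taken separately at each sampled next state, the estimate can exceed the rollout estimate of every single policy in $U$.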
Hindsight Optimization: Tree View
Hindsight Optimization: Equations
  Swap Max and Exp in the expectimax tree
  Solve each off-line optimization problem
    $O(k C' f(H))$ time, where $f(H)$ is the offline problem complexity
  Jensen’s inequality implies upper bounds
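Swapping the max and the expectation means: sample the randomness first, then solve a deterministic planning problem for each sample. The following is a minimal sketch under stated assumptions. The coin-guessing dynamics in `step`, the brute-force offline solver, and all names are hypothetical; a real application would replace `best_in_hindsight` with a domain-specific offline algorithm, whose cost plays the role of $f(H)$.

```python
import itertools
import random

ACTIONS = [0, 1]


def step(x, a, w):
    """Hypothetical dynamics, deterministic once the noise w is known."""
    reward = 1.0 if a == w else 0.0  # reward for guessing the coin w
    return reward, x  # the state does not change in this toy


def best_in_hindsight(x, noise):
    """Offline problem: best total reward for a known noise sequence."""
    best = float("-inf")
    # Brute force over action sequences; cost is len(ACTIONS) ** len(noise).
    for seq in itertools.product(ACTIONS, repeat=len(noise)):
        total, state = 0.0, x
        for a, w in zip(seq, noise):
            reward, state = step(state, a, w)
            total += reward
        best = max(best, total)
    return best


def hindsight_q(x, a, horizon, n_traces, rng):
    """Average, over sampled traces, of first-step reward plus hindsight value."""
    total = 0.0
    for _ in range(n_traces):
        noise = [rng.randint(0, 1) for _ in range(horizon)]
        reward, y = step(x, a, noise[0])
        total += reward + best_in_hindsight(y, noise[1:])
    return total / n_traces


if __name__ == "__main__":
    print(hindsight_q(0, 0, 4, 20, random.Random(0)))
```

In this toy the coins are fair and unpredictable, so no policy can average more than 0.5 per step and the true Q-value for horizon 4 is 2.0. The hindsight estimate is at least 3.0, since every step after the first can be guessed perfectly, which shows how loose the upper bound can be.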
Hindsight Optimization (Cont’d)
Application to Example Problems
  Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
    Multi-class deadline scheduling
    Random early dropping
    Congestion control
Basic Approach
  Traffic model provides a stochastic description of possible future outcomes
  Method:
    Formulate network decision problems as POMDPs by incorporating the traffic model
    Solve the belief-state MDP online using sampling (choose time-scale to allow for computation time)
Domain 1: Deadline Scheduling
  Objective: Minimize weighted loss
Domain 2: Random Early Dropping
  Objective: Minimize delay without sacrificing throughput
Domain 3: Congestion Control
Traffic Modeling
  A Hidden Markov Model (HMM) for each source
  Note: state is hidden, model is partially observed
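A small sketch may help show what "hidden state, partially observed" means for a traffic source. The two-state model, its probabilities, and the function names below are hypothetical and are not the traffic model used in the talk. The controller sees only arrivals and tracks a belief over the hidden state with a Bayes filter.

```python
import random

# Hypothetical two-state source: state 0 is "idle", state 1 is "bursty".
TRANSITION = [[0.9, 0.1], [0.3, 0.7]]  # TRANSITION[s][t] = P(next = t | current = s)
ARRIVAL_PROB = [0.1, 0.8]              # P(packet arrives | hidden state)


def sample_arrivals(steps, rng, state=0):
    """Sample a 0/1 arrival sequence; the hidden state is never returned."""
    arrivals = []
    for _ in range(steps):
        arrivals.append(1 if rng.random() < ARRIVAL_PROB[state] else 0)
        state = 0 if rng.random() < TRANSITION[state][0] else 1
    return arrivals


def update_belief(belief, arrival):
    """Bayes filter: condition on the observation, then advance one step."""
    likelihood = [ARRIVAL_PROB[s] if arrival else 1.0 - ARRIVAL_PROB[s] for s in (0, 1)]
    posterior = [belief[s] * likelihood[s] for s in (0, 1)]
    norm = sum(posterior)
    posterior = [p / norm for p in posterior]
    return [sum(posterior[s] * TRANSITION[s][t] for s in (0, 1)) for t in (0, 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    belief = [0.5, 0.5]
    for arrival in sample_arrivals(10, rng):
        belief = update_belief(belief, arrival)
    print(belief)
```

The belief vector is the state of the belief-state MDP; sampling future arrivals from it is what lets the sampling methods look ahead.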
Deadline Scheduling Results
  Non-sampling policies:
    EDF: earliest deadline first. Deadline sensitive, class insensitive.
    SP: static priority. Deadline insensitive, class sensitive.
    CM: current minloss [Givan et al., 2000]. Deadline and class sensitive. Minimizes weighted loss for the current packets.
Deadline Scheduling Results
  Objective: minimize weighted loss
  Comparison:
    Non-sampling policies
    Unbiased sampling (Kearns et al.)
    Hindsight optimization
    Rollout with CM as base policy
    Parallel rollout
  Results due to H. S. Chang
Deadline Scheduling Results
Random Early Dropping Results
  Objective: minimize delay subject to throughput loss-tolerance
  Comparison:
    Candidate policies: RED and “buffer-k”
    KMN-sampling
    Rollout of buffer-k
    Parallel rollout
    Hindsight optimization
  Results due to H. S. Chang
Random Early Dropping Results
Congestion Control Results
  MDP objective: minimize weighted sum of throughput, delay, and loss-rate
    Fairness is hard-wired
  Comparisons:
    PD-k (proportional-derivative with k target queue)
    Hindsight optimization
    Rollout of PD-k == parallel rollout
  Results due to G. Wu, in progress
Congestion Control Results
Results Summary
  Unbiased sampling cannot cope
  Parallel rollout wins in 2 domains
    Not always equal to simple rollout of one base policy
  Hindsight optimization wins in 1 domain
  Simple policy rollout: the cheapest method
    Poor in domain 1
    Strong in domain 2 with best base policy, but how to find this policy?
    So-so in domain 3 with any base policy
Talk Summary
  Case study of MDP sampling methods
  New methods offering practical improvements
    Parallel policy rollout
    Hindsight optimization
  Systematic methods for using traffic models to help make network control decisions
  Feasibility of real-time implementation depends on problem timescale
Ongoing Research
  Apply to other control problems (different timescales):
    Admission/access control
    QoS routing
    Link bandwidth allotment
    Multiclass connection management
    Problems arising in proxy-services
    Diagnosis and recovery
Ongoing Research (Cont’d)
  Alternative traffic models
    Multi-timescale models
    Long-range dependent models
    Closed-loop traffic
    Fluid models
  Learning traffic model online
  Adaptation to changing traffic conditions
Congestion Control (Cont’d)
Congestion Control Results
Hindsight Optimization (Cont’d)
Policy Rollout (Cont’d)
  Figure labels: Base Policy, Policy performance
Receding-horizon Control
  For large horizon $H$, the policy is approximately stationary
  At each time, if the state is $x$, then apply action
    $u^*(x) = \arg\max_a Q(x,a) = \arg\max_a \{ R(x,a) + E[V_{H-1}^*(y)] \}$
  Compute an estimate of the Q-value at each time
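The control loop described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the generative model `simulate` is hypothetical, and the Q estimator is a simple rollout of a hypothetical base policy $u(x) = x$ standing in for whichever sampling method is used.

```python
import random

ACTIONS = [0, 1]


def simulate(x, a, rng):
    """Hypothetical generative model: returns (reward, next state)."""
    reward = 1.0 if a == x else 0.0
    y = 1 - x if rng.random() < 0.3 else x
    return reward, y


def rollout_q(x, a, horizon, n_traces, rng):
    """Sampled Q estimate using the hypothetical base policy u(x) = x."""
    total = 0.0
    for _ in range(n_traces):
        reward, y = simulate(x, a, rng)
        total += reward
        for _ in range(horizon - 1):
            reward, y = simulate(y, y, rng)  # base policy picks action y in state y
            total += reward
    return total / n_traces


def receding_horizon_run(x, steps, horizon, n_traces, rng):
    """Control for `steps` steps, re-planning over `horizon` at every step."""
    total = 0.0
    for _ in range(steps):
        # Re-estimate Q at the current state and commit only to the first action.
        a = max(ACTIONS, key=lambda act: rollout_q(x, act, horizon, n_traces, rng))
        reward, x = simulate(x, a, rng)
        total += reward
    return total


if __name__ == "__main__":
    print(receding_horizon_run(0, 10, 5, 8, random.Random(0)))
```

Only the first action of each plan is executed; the horizon then slides forward one step and the estimate is recomputed from the new state.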
Congestion Control (Cont’d)
Domain 3: Congestion Control
  High-priority traffic: open-loop controlled
  Low-priority traffic: closed-loop controlled
  Resources: bandwidth and buffer
  Objective: optimize throughput, delay, loss, and fairness
  Figure labels: Bottleneck Node, High-priority Traffic, Best-effort Traffic...
Congestion Control Results