Slide 1: Online Sampling for Markov Decision Processes
Bob Givan, joint work with E. K. P. Chong, H. Chang, and G. Wu
Electrical and Computer Engineering, Purdue University
November 4-9, 2001
Slide 2: Markov Decision Process (MDP)
Ingredients:
- System state x in state space X
- Control action a in A(x)
- Reward R(x, a)
- State-transition probability P(x, y, a)
Goal: find a control policy that maximizes the objective function.
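The sampling methods in this talk only ever need to simulate the MDP, never to enumerate it. A minimal Python sketch of these ingredients as a generative-model interface (all names here are illustrative, not from the talk):

```python
class MDP:
    """Generative-model view of an MDP: we assume only the ability to
    sample successor states, which is all the methods below require."""

    def actions(self, x):
        """Admissible actions A(x) in state x."""
        raise NotImplementedError

    def reward(self, x, a):
        """Immediate reward R(x, a)."""
        raise NotImplementedError

    def sample_next(self, x, a):
        """Draw a successor state y ~ P(x, ., a)."""
        raise NotImplementedError
```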
Slide 3: Optimal Policies
- Policy: a mapping from state and time to actions
- Stationary policy: a mapping from state to actions
- Goal: a policy maximizing the objective function
    V_H*(x_0) = max Obj[R(x_0, a_0), ..., R(x_{H-1}, a_{H-1})]
  where the max is over all policies u = u_0, ..., u_{H-1}
- For large H, the optimal first action a_0 is independent of H (with an ergodicity assumption)
- The stationary optimal action a_0 for H = ∞ is obtained via receding-horizon control
Slide 4: Q Values
- Fix a large H and focus on the finite-horizon reward
- Define Q(x, a) = R(x, a) + E[V_{H-1}*(y)]: the "utility" of action a at state x, called the Q-value of action a at state x
- Key identities (Bellman's equations):
    V_H*(x) = max_a Q(x, a)
    u_0*(x) = argmax_a Q(x, a)
Slide 5: Solution Methods
- Recall: u_0*(x) = argmax_a Q(x, a), where Q(x, a) = R(x, a) + E[V_{H-1}*(y)]
- Problems: the Q-value depends on the optimal policy, and the state space is extremely large (often continuous)
- Two-pronged solution approach:
  - Apply a receding-horizon method
  - Estimate Q-values via simulation/sampling
Slide 6: Methods for Q-value Estimation
Previous work by other authors:
- Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
- Policy rollout (lower bound) [Bertsekas & Castañon, 1999]
Our techniques:
- Hindsight optimization (upper bound)
- Parallel rollout (lower bound)
Slide 7: Expectimax Tree for V* (figure)
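The tree figure did not survive this transcript. As a stand-in, here is a small recursive sketch of the expectimax computation such a tree depicts, under the added assumption of an enumerable transition distribution `transition_dist(x, a)` yielding (successor, probability) pairs:

```python
def value(mdp, x, h):
    """V_h*(x): optimal h-steps-to-go value, by full expectimax expansion."""
    if h == 0:
        return 0.0
    return max(q_value(mdp, x, a, h) for a in mdp.actions(x))

def q_value(mdp, x, a, h):
    """Q(x, a) = R(x, a) + E[V_{h-1}*(y)], with the expectation taken exactly."""
    expected = sum(p * value(mdp, y, h - 1)
                   for y, p in mdp.transition_dist(x, a))
    return mdp.reward(x, a) + expected
```

The tree alternates max nodes (action choice) with expectation nodes (state transition), and its size grows exponentially in the horizon, which is what motivates the sampling approximations that follow.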
Slide 8: Unbiased Sampling (figure)
Slide 9: Unbiased Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
- Answered by Kearns, Mansour, and Ng (1999)
- Requires prohibitive sampling width and depth, e.g. C > 10^8 and H_s > 60 to distinguish the "best" and "worst" policies in our scheduling domain
- We evaluate with smaller width and depth
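A minimal sketch of this sampled approximation to expectimax (the Kearns-Mansour-Ng estimator): each expectation node is replaced by an average over `width` sampled successors, expanded to depth `h`. Parameter names are illustrative:

```python
def sampled_value(mdp, x, h, width):
    """Sampled analogue of V_h*(x) with branching factor `width`."""
    if h == 0:
        return 0.0
    return max(sampled_q(mdp, x, a, h, width) for a in mdp.actions(x))

def sampled_q(mdp, x, a, h, width):
    """Estimate Q(x, a) = R(x, a) + E[V_{h-1}*(y)] from sampled successors."""
    total = sum(sampled_value(mdp, mdp.sample_next(x, a), h - 1, width)
                for _ in range(width))
    return mdp.reward(x, a) + total / width
```

With k actions the cost is O((k * width)^h) simulator calls, which is why the width and depth demanded by the accuracy guarantee are prohibitive in practice.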
Slide 10: How to Look Deeper? (figure)
Slide 11: Policy Rollout (figure)
Slide 12: Policy Rollout in Equations
- Write V_H^u(y) for the value of following policy u
- Recall: Q(x, a) = R(x, a) + E[V_{H-1}*(y)] = R(x, a) + E[max_u V_{H-1}^u(y)]
- Given a base policy u, use R(x, a) + E[V_{H-1}^u(y)] as a lower-bound estimate of the Q-value
- The resulting policy is PI(u), one step of policy improvement on u, given infinite sampling
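A sketch of the rollout estimator: take action a once, then follow the base policy for the rest of the horizon, averaging over Monte Carlo trajectories. The trajectory count is an illustrative knob, not a value from the talk:

```python
def rollout_q(mdp, x, a, base_policy, horizon, n_trajectories=32):
    """Monte Carlo estimate of R(x, a) + E[V_{horizon-1}^u(y)] for base policy u."""
    total = 0.0
    for _ in range(n_trajectories):
        ret = mdp.reward(x, a)
        y = mdp.sample_next(x, a)
        for _ in range(horizon - 1):
            b = base_policy(y)          # follow the base policy thereafter
            ret += mdp.reward(y, b)
            y = mdp.sample_next(y, b)
        total += ret
    return total / n_trajectories       # lower-bounds the true Q(x, a) in expectation
```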
Slide 13: Policy Rollout (Cont'd) (figure)
Slide 14: Parallel Policy Rollout
- A generalization of policy rollout, due to [Chang, Givan, and Chong, 2000]
- Given a set U of base policies, use R(x, a) + E[max_{u ∈ U} V_{H-1}^u(y)] as an estimate of the Q-value
- A more accurate estimate than policy rollout
- Still gives a lower bound on the true Q-value
- Still gives a policy no worse than any policy in U
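A sketch of parallel rollout under the same assumptions as above; note the max over base policies sits inside the expectation over successor states. For brevity each V^u(y) is estimated here from a single trajectory, though in practice one would average several:

```python
def simulate(mdp, y, policy, steps):
    """Monte Carlo return of following `policy` from y for `steps` steps."""
    ret = 0.0
    for _ in range(steps):
        b = policy(y)
        ret += mdp.reward(y, b)
        y = mdp.sample_next(y, b)
    return ret

def parallel_rollout_q(mdp, x, a, base_policies, horizon, n_trajectories=32):
    """Estimate R(x, a) + E[max over u in U of V_{horizon-1}^u(y)]."""
    total = 0.0
    for _ in range(n_trajectories):
        y = mdp.sample_next(x, a)
        total += max(simulate(mdp, y, u, horizon - 1) for u in base_policies)
    return mdp.reward(x, a) + total / n_trajectories
```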
Slide 15: Hindsight Optimization – Tree View (figure)
Slide 16: Hindsight Optimization – Equations
- Swap the max and the expectation in the expectimax tree
- Solve each resulting offline (deterministic) optimization problem
- O(k C' f(H)) time, where f(H) is the offline problem complexity
- Jensen's inequality implies the resulting estimates are upper bounds on the true Q-values
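A sketch of the swap: sample entire futures (all randomness fixed in advance), solve each resulting deterministic problem optimally, and average. `sample_future` and `offline_optimal_return` are hypothetical helpers standing in for the domain-specific offline solver:

```python
def hindsight_q(mdp, x, a, horizon, n_futures=32):
    """Average, over sampled futures, of the optimal return achievable
    in hindsight after taking action a first."""
    total = 0.0
    for _ in range(n_futures):
        future = sample_future(mdp, horizon)                # fix all randomness up front
        total += offline_optimal_return(mdp, x, a, future)  # deterministic optimum
    return total / n_futures
```

Because max E[...] <= E[max ...] (Jensen's inequality applied to the convex max), this estimate upper-bounds the true Q-value.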
Slide 17: Hindsight Optimization (Cont'd) (figure)
Slide 18: Application to Example Problems
Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
- Multi-class deadline scheduling
- Random early dropping
- Congestion control
Slide 19: Basic Approach
- A traffic model provides a stochastic description of possible future outcomes
- Method:
  - Formulate network decision problems as POMDPs by incorporating the traffic model
  - Solve the belief-state MDP online using sampling (choose the time scale to allow for computation time)
Slide 20: Domain 1: Deadline Scheduling
Objective: minimize weighted loss
Slide 21: Domain 2: Random Early Dropping
Objective: minimize delay without sacrificing throughput
Slide 22: Domain 3: Congestion Control (figure)
Slide 23: Traffic Modeling
- A Hidden Markov Model (HMM) for each traffic source
- Note: the state is hidden, so the model is only partially observed
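Because the source state is hidden, the controller tracks a belief over it. A minimal sketch of the standard HMM forward-filter update (all names illustrative), assuming a row-stochastic transition matrix T and observation likelihoods O:

```python
import numpy as np

def belief_update(belief, T, O, obs):
    """One forward-filtering step: belief[i] = P(hidden state i | history).

    T[i, j] = P(next state j | current state i)
    O[j, k] = P(observation k | state j)
    """
    predicted = belief @ T              # propagate through the hidden dynamics
    posterior = predicted * O[:, obs]   # reweight by the new observation
    return posterior / posterior.sum()  # renormalize to a distribution
```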
Slide 24: Deadline Scheduling Results
Non-sampling policies:
- EDF (earliest deadline first): deadline-sensitive, class-insensitive
- SP (static priority): deadline-insensitive, class-sensitive
- CM (current minloss) [Givan et al., 2000]: deadline- and class-sensitive; minimizes weighted loss for the current packets
Slide 25: Deadline Scheduling Results
Objective: minimize weighted loss
Comparison:
- Non-sampling policies
- Unbiased sampling (Kearns et al.)
- Hindsight optimization
- Rollout with CM as the base policy
- Parallel rollout
Results due to H. S. Chang
Slides 26-28: Deadline Scheduling Results (charts)
Slide 29: Random Early Dropping Results
Objective: minimize delay subject to a throughput loss tolerance
Comparison:
- Candidate policies: RED and "buffer-k"
- KMN sampling
- Rollout of buffer-k
- Parallel rollout
- Hindsight optimization
Results due to H. S. Chang
Slides 30-31: Random Early Dropping Results (charts)
Slide 32: Congestion Control Results
MDP objective: minimize a weighted sum of throughput, delay, and loss rate; fairness is hard-wired
Comparisons:
- PD-k (proportional-derivative control with target queue length k)
- Hindsight optimization
- Rollout of PD-k (which coincides with parallel rollout here)
Results due to G. Wu, in progress
Slides 33-36: Congestion Control Results (charts)
Slide 37: Results Summary
- Unbiased sampling cannot cope
- Parallel rollout wins in 2 domains; it is not always equal to simple rollout of one base policy
- Hindsight optimization wins in 1 domain
- Simple policy rollout, the cheapest method:
  - Poor in domain 1
  - Strong in domain 2 with the best base policy (but how to find this policy?)
  - So-so in domain 3 with any base policy
Slide 38: Talk Summary
- A case study of MDP sampling methods
- New methods offering practical improvements:
  - Parallel policy rollout
  - Hindsight optimization
- Systematic methods for using traffic models to help make network control decisions
- Feasibility of real-time implementation depends on the problem timescale
Slide 39: Ongoing Research
Apply to other control problems (different timescales):
- Admission/access control
- QoS routing
- Link bandwidth allotment
- Multiclass connection management
- Problems arising in proxy services
- Diagnosis and recovery
Slide 40: Ongoing Research (Cont'd)
- Alternative traffic models:
  - Multi-timescale models
  - Long-range dependent models
  - Closed-loop traffic
  - Fluid models
- Learning the traffic model online
- Adaptation to changing traffic conditions
Slide 41: Congestion Control (Cont'd) (figure)
Slide 42: Congestion Control Results (chart)
Slide 43: Hindsight Optimization (Cont'd) (figure)
Slide 44: Policy Rollout (Cont'd) (chart: base-policy performance vs. rollout-policy performance)
Slide 45: Receding-Horizon Control
- For a large horizon H, the optimal policy is approximately stationary
- At each time, if the state is x, apply the action
    u*(x) = argmax_a Q(x, a) = argmax_a R(x, a) + E[V_{H-1}*(y)]
- Compute an estimate of the Q-value at each time step
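The overall control loop, sketched with the rollout estimator from slide 12 standing in for any of the Q-value estimators above (all names illustrative):

```python
def receding_horizon_control(mdp, x0, base_policy, horizon, n_steps):
    """At every step, act greedily with respect to a fresh Q-value estimate."""
    x = x0
    for _ in range(n_steps):
        a = max(mdp.actions(x),
                key=lambda act: rollout_q(mdp, x, act, base_policy, horizon))
        x = mdp.sample_next(x, a)   # apply the chosen action, observe next state
    return x
```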
Slide 46: Congestion Control (Cont'd) (figure)
Slide 47: Domain 3: Congestion Control
- High-priority traffic: open-loop controlled
- Low-priority traffic: closed-loop controlled
- Resources: bandwidth and buffer
- Objective: optimize throughput, delay, loss, and fairness
(diagram: high-priority and best-effort traffic sharing a bottleneck node)
Slides 48-51: Congestion Control Results (charts)