Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study
Anthony Bonifonte and Qiushi Chen
ISYE8813 Stochastic Processes and Algorithms
4/18/2014
Restless Multi-arm Bandit Problem 2/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 3/31 Restless Multi-arm Bandits Problem
[Diagram: a collection of arms, each of which is either active or passive in a given period]
Restless Multi-arm Bandit Problem 4/31 Objective
▫ Discounted rewards (finite or infinite horizon)
▫ Time-average rewards
A general modeling framework
▫ N-choose-M problem: make M of the N arms active in each period
▫ Limited capacity (production capacity, service capacity)
Connection with the multi-arm bandit problem
▫ In the classic multi-arm bandit, a passive arm is frozen: it does not change state and earns no reward; in the restless version, passive arms keep evolving and may earn (passive) rewards. The objective is sketched below.
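A compact statement of the discounted objective, as a sketch; the notation (per-arm reward R_n, action a_n(t) in {0,1}, discount factor beta) is an assumption for illustration rather than copied from the slides:

\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T}\beta^{t}\sum_{n=1}^{N} R_n\big(s_n(t),a_n(t)\big)\Big]
\quad\text{subject to}\quad \sum_{n=1}^{N} a_n(t)=M \ \text{ for every period } t,

where a_n(t)=1 if arm n is active at time t and a_n(t)=0 if it is passive; T=\infty gives the infinite-horizon case, and the time-average criterion replaces the discounted sum by a long-run average.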
Restless Multi-arm Bandit Problem 5/31 Exact Optimal Solution: Dynamic Programming
Markov decision process (MDP)
▫ State: the joint state of all N arms
▫ Action: which M of the N arms to make active
▫ Transition matrix: product of the per-arm transition dynamics
▫ Rewards: sum of the per-arm rewards
Algorithm:
▫ Finite horizon: backward induction
▫ Infinite horizon (discounted): value iteration, policy iteration
Problem size: becomes a disaster quickly (see the sketch after the table)

S   N   M   # of states   Space for transition matrices (MB)
3   5   2   243           4.5
4   5   2   1,024         80
4   6   2   4,096         1,920 (~2 GB)
4   7   2   16,384        43,008 (~43 GB)
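A quick back-of-the-envelope check of the table, as a sketch: it assumes one dense S^N-by-S^N transition matrix of 8-byte floats per feasible action (there are C(N, M) of them) and 1 MB = 2^20 bytes, which reproduces the figures above.

from math import comb

def joint_mdp_size(S, N, M):
    """Size of the joint MDP obtained by combining N arms with S states each."""
    n_states = S ** N               # joint states: one per combination of arm states
    n_actions = comb(N, M)          # feasible actions: which M arms are active
    # one dense n_states x n_states matrix of 8-byte floats per action
    bytes_needed = n_states ** 2 * n_actions * 8
    return n_states, bytes_needed / 2 ** 20   # (# of states, memory in MB)

for S, N, M in [(3, 5, 2), (4, 5, 2), (4, 6, 2), (4, 7, 2)]:
    states, mb = joint_mdp_size(S, N, M)
    print(f"S={S}, N={N}, M={M}: {states} states, {mb:,.1f} MB")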
Restless Multi-arm Bandit Problem 6/31 Lagrangian Relaxation: Upper Bound
Relax the hard "exactly M active arms per period" constraint and price it with a multiplier W; the relaxed problem decomposes arm by arm, and optimizing over W yields an upper bound on the optimal value (a sketch of the standard formulation follows).
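A sketch of the standard Whittle-style Lagrangian relaxation for the discounted case, using the same assumed notation as above (it is not copied from the slide's formula):

L(W)\;=\;\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t\ge 0}\beta^{t}\Big(\sum_{n} R_n(s_n(t),a_n(t)) + W\big(M-\textstyle\sum_{n} a_n(t)\big)\Big)\Big]
\;=\;\frac{WM}{1-\beta}\;+\;\sum_{n=1}^{N}\ \max_{\pi_n}\ \mathbb{E}_{\pi_n}\Big[\sum_{t\ge 0}\beta^{t}\big(R_n(s_n(t),a_n(t)) - W\,a_n(t)\big)\Big].

For any W, L(W) is an upper bound on the optimal value (the added term vanishes for any policy that satisfies the exact constraint), so minimizing L(W) over W gives the tightest such bound. Each single-arm subproblem is, up to an additive constant, the W-subsidy problem used to define Whittle's index.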
Restless Multi-arm Bandit Problem 7/31 Index Policies
Restless Multi-arm Bandit Problem 8/31 The Whittle's Index Policy (Discounted Rewards)
W-subsidy problem: for a fixed arm, consider the single-arm problem in which the passive action earns an extra "subsidy" W in every period (sketched below).
The Whittle's index W(s)
▫ The subsidy that makes the passive and active actions indifferent in state s
▫ If W is too small, the active action is better; if W is too large, the passive action is better
A closed-form solution exists only for specific models; otherwise the index is computed numerically. The index policy activates the M arms with the largest current indices.
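A sketch of the single-arm W-subsidy problem behind this definition, with assumed notation (per-arm passive/active rewards r(s,0), r(s,1) and transition matrices P_0, P_1):

V_W(s)\;=\;\max\Big\{\,W + r(s,0) + \beta\sum_{s'}P_0(s,s')\,V_W(s'),\ \ r(s,1) + \beta\sum_{s'}P_1(s,s')\,V_W(s')\,\Big\}.

The Whittle index W(s) is the subsidy at which the two terms inside the max are equal, so that state s is indifferent between being passive and active (unique when the arm is indexable).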
Restless Multi-arm Bandit Problem 9/31 Numerical Algorithm for Solving Whittle's Index
STEP 1: Find a plausible range [L, U] for W
STEP 2: Binary search within [L, U]: for each candidate W, solve the single-arm W-subsidy problem by value iteration; if the active action is still preferred in the state of interest, raise L, otherwise lower U (a sketch follows).
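A minimal Python sketch of this two-step procedure, with assumed per-arm inputs (passive/active transition matrices P0, P1 and reward vectors r0, r1); it takes indexability for granted, so the active-vs-passive preference is treated as monotone in W, and the default bracket is an arbitrary assumption standing in for STEP 1.

import numpy as np

def value_iteration(P0, P1, r0, r1, W, beta, tol=1e-8, max_iter=10_000):
    """Solve the single-arm W-subsidy problem: passive earns r0 + W, active earns r1."""
    V = np.zeros(len(r0))
    for _ in range(max_iter):
        Q_passive = W + r0 + beta * P0 @ V
        Q_active = r1 + beta * P1 @ V
        V_new = np.maximum(Q_passive, Q_active)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q_passive, Q_active

def whittle_index(P0, P1, r0, r1, s, beta, lo=-100.0, hi=100.0, iters=50):
    """Bisection on the subsidy W until state s is indifferent between the two actions."""
    for _ in range(iters):
        W = 0.5 * (lo + hi)
        Q_passive, Q_active = value_iteration(P0, P1, r0, r1, W, beta)
        if Q_active[s] > Q_passive[s]:
            lo = W      # subsidy too small: the arm still prefers to be active
        else:
            hi = W      # subsidy large enough: the arm prefers to be passive
    return 0.5 * (lo + hi)

The resulting index policy would then, at each period, activate the M arms whose current states have the largest whittle_index values.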
Restless Multi-arm Bandit Problem 10/31 The Primal-Dual Index Policy
From the optimal primal and dual solutions of the relaxation, each arm gets two quantities in its current state:
▫ how harmful it is to switch the arm from passive to active
▫ how harmful it is to switch the arm from active to passive
Combining them gives an index; an arm is made active if its index is > 0 (the underlying LP relaxation is sketched below).
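For reference, a sketch of the first-order LP relaxation that this type of primal-dual heuristic is usually built on, in the style of Bertsimas and Niño-Mora; the occupation-measure notation x_n(s,a) is an assumption, not taken from the slides:

\max_{x\ge 0}\ \sum_{n}\sum_{s,a} R_n(s,a)\,x_n(s,a)
\quad\text{s.t.}\quad \sum_{a} x_n(s',a) \;=\; \alpha_n(s') + \beta\sum_{s,a} P_n^{a}(s,s')\,x_n(s,a)\ \ \forall\,n,\,s',
\qquad \sum_{n}\sum_{s} x_n(s,\text{active}) \;=\; \frac{M}{1-\beta},

where x_n(s,a) is the expected discounted time arm n spends in state s under action a and \alpha_n is its initial state distribution. The two "how harmful" quantities above are reduced costs computed from the optimal primal and dual solutions of this LP.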
Restless Multi-arm Bandit Problem 11/31 Heuristic Index Policies
Restless Multi-arm Bandit Problem 12/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 13/31 Experiment Settings
Assume active rewards are larger than passive rewards
Non-identical arms
Structures in transition dynamics (a sampling sketch follows)
▫ Uniformly sampled transition matrix
▫ IFR matrix with non-increasing rewards
▫ P1 is stochastically smaller than P2
▫ Less-connected chain
Evaluation
▫ Small instances: exact optimal solution
▫ Large instances: upper bound & Monte-Carlo simulation
Performance measure
▫ Average gap from optimality or from the upper bound
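A minimal sketch of how the first of these structures could be generated; drawing each row uniformly from the probability simplex (a symmetric Dirichlet with all parameters equal to 1) is one standard way to get a "uniformly sampled" transition matrix, and is an assumption here rather than the authors' exact generator.

import numpy as np

def uniform_transition_matrix(S, rng):
    """Each row drawn uniformly from the S-dimensional probability simplex."""
    return rng.dirichlet(np.ones(S), size=S)

rng = np.random.default_rng(0)
P_passive = uniform_transition_matrix(4, rng)   # per-arm passive dynamics
P_active = uniform_transition_matrix(4, rng)    # per-arm active dynamics
r_passive = rng.uniform(size=4)                 # per-state passive rewards
r_active = r_passive + rng.uniform(size=4)      # active rewards kept larger than passive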
Restless Multi-arm Bandit Problem 14/31 5 Questions of Interest
1. How do different policies compare under different problem structures?
2. How do different policies compare under various problem sizes?
3. How do different policies compare under different discount factors?
4. How does a multi-period look ahead improve a myopic policy?
5. How do different policies compare under different time horizons?
Restless Multi-arm Bandit Problem 15/31 Question 1: Does problem structure help?
▫ Uniformly sampled transition matrix and rewards
▫ Increasing failure rate (IFR) matrix and non-increasing rewards
▫ Less-connected Markov chain
▫ P1 stochastically smaller than P2, non-increasing rewards
Restless Multi-arm Bandit Problem 16/31 Question 1: Does problem structure help?
Restless Multi-arm Bandit Problem 17/31 Question 2: Does problem size matter? Optimality gap: Fixed N and M, increasing S
Restless Multi-arm Bandit Problem 18/31 Question 2: Does problem size matter? Optimality gap: Fixed M and S, increasing N
Restless Multi-arm Bandit Problem 19/31 Question 3: Does discount factor matter? Infinite horizon: varying discount factors
Restless Multi-arm Bandit Problem 20/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 21/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 22/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 23/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
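For concreteness, a minimal sketch of the greedy (myopic) baseline in these comparisons, under one plausible reading: activate the M arms with the largest immediate expected reward advantage of being active rather than passive. The exact greedy criterion, and its rolling-horizon extension (which would replace the one-step advantage with an H-step DP value), is an assumption for illustration, not the authors' exact rule.

import numpy as np

def greedy_action(states, r_active, r_passive, M):
    """Myopic policy sketch: activate the M arms with the largest one-step reward advantage.

    states: current state index of each arm
    r_active, r_passive: per-arm reward vectors indexed by state (lists of np.ndarray)
    """
    advantage = np.array([r_active[n][s] - r_passive[n][s] for n, s in enumerate(states)])
    chosen = np.argsort(advantage)[-M:]          # indices of the M most attractive arms
    action = np.zeros(len(states), dtype=int)
    action[chosen] = 1
    return action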
Restless Multi-arm Bandit Problem 24/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 25/31 Clinical Capacity Management Problem (Deo et al. 2013)
School-based asthma care for children: given the medical records of the patient population and a limited van capacity, the scheduling policy decides who to schedule (treat) at each visit.
▫ h = health state at the last appointment
▫ n = time since the last appointment
▫ Per-patient state (h, n), capacity M, population N
OBJECTIVE: maximize the total benefit to the community
Policies compared:
▫ Current guidelines (fixed-duration policy)
▫ Whittle's index policy
▫ Primal-dual index policy
▫ Greedy (myopic) policy
▫ Rolling-horizon policy
▫ H-N priority policy, N-H priority policy
▫ No-schedule [baseline]
Restless Multi-arm Bandit Problem 26/31 How Large Is It?
Restless Multi-arm Bandit Problem 27/31 Performance of Policies
Restless Multi-arm Bandit Problem 28/31 Performance of Policies
Restless Multi-arm Bandit Problem 29/31 Performance of Policies
Restless Multi-arm Bandit Problem 30/31 Whittle's Index vs. Gittins' Index: (S, N, M=1) vs. (S, N, M=2)
Sample 20 instances for each problem size
Whittle's index policy vs. DP exact solution
▫ Optimality tolerance = 0.002
Percentage of time when the Whittle's index policy is NOT optimal:

S   N   M=1   M=2
3   5   0%    25%
3   6   0%    25%
5   5   0%    40%
Restless Multi-arm Bandit Problem 31/31 Summary
Whittle's index and primal-dual index policies work well and efficiently
The relative greedy policy can work well, depending on the problem structure
Policies perform worse on the less-connected Markov chain
All policies tend to work better when capacity is tight
Look-ahead policies have limited marginal benefit for small discount factors
Restless Multi-arm Bandit Problem 33/31 Question 5: Does decision horizon matter? Finite horizon: varying the number of periods