Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study
Anthony Bonifonte and Qiushi Chen
ISYE8813 Stochastic Processes and Algorithms
4/18/2014
Restless Multi-arm Bandit Problem 2/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 3/31 Restless Multi-arm Bandits Problem
[Diagram: a collection of arms, each of which is either active or passive in a given period]
Restless Multi-arm Bandit Problem 4/31 Objective
▫ Discounted rewards (finite or infinite horizon)
▫ Time-average rewards
A general modeling framework
▫ N-choose-M problem: make M of the N arms active in each period
▫ Limited capacity (production capacity, service capacity)
Connection with the multi-arm bandit problem
▫ In the classic multi-arm bandit, a passive arm is frozen: it does not change state and earns no reward; in the restless version, passive arms keep evolving and may earn (passive) rewards. The objective is sketched below.
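A compact statement of the discounted objective, as a sketch; the notation (per-arm reward R_n, action a_n(t) in {0,1}, discount factor beta) is an assumption for illustration rather than copied from the slides:

\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T}\beta^{t}\sum_{n=1}^{N} R_n\big(s_n(t),a_n(t)\big)\Big]
\quad\text{subject to}\quad \sum_{n=1}^{N} a_n(t)=M \ \text{ for every period } t,

where a_n(t)=1 if arm n is active at time t and a_n(t)=0 if it is passive; T=\infty gives the infinite-horizon case, and the time-average criterion replaces the discounted sum by a long-run average.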
Restless Multi-arm Bandit Problem 5/31 Exact Optimal Solution: Dynamic Programming
Markov decision process (MDP)
▫ State: the joint state of all N arms
▫ Action: which M of the N arms to make active
▫ Transition matrix: product of the per-arm transition dynamics
▫ Rewards: sum of the per-arm rewards
Algorithm:
▫ Finite horizon: backward induction
▫ Infinite horizon (discounted): value iteration, policy iteration
Problem size: becomes a disaster quickly (see the sketch after the table)

S   N   M   # of states   Space for transition matrices (MB)
3   5   2   243           4.5
4   5   2   1,024         80
4   6   2   4,096         1,920 (~2 GB)
4   7   2   16,384        43,008 (~43 GB)
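A quick back-of-the-envelope check of the table, as a sketch: it assumes one dense S^N-by-S^N transition matrix of 8-byte floats per feasible action (there are C(N, M) of them) and 1 MB = 2^20 bytes, which reproduces the figures above.

from math import comb

def joint_mdp_size(S, N, M):
    """Size of the joint MDP obtained by combining N arms with S states each."""
    n_states = S ** N               # joint states: one per combination of arm states
    n_actions = comb(N, M)          # feasible actions: which M arms are active
    # one dense n_states x n_states matrix of 8-byte floats per action
    bytes_needed = n_states ** 2 * n_actions * 8
    return n_states, bytes_needed / 2 ** 20   # (# of states, memory in MB)

for S, N, M in [(3, 5, 2), (4, 5, 2), (4, 6, 2), (4, 7, 2)]:
    states, mb = joint_mdp_size(S, N, M)
    print(f"S={S}, N={N}, M={M}: {states} states, {mb:,.1f} MB")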
Restless Multi-arm Bandit Problem 6/31 Lagrangian Relaxation: Upper Bound
Relax the hard "exactly M active arms per period" constraint and price it with a multiplier W; the relaxed problem decomposes arm by arm, and optimizing over W yields an upper bound on the optimal value (a sketch of the standard formulation follows).
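A sketch of the standard Whittle-style Lagrangian relaxation for the discounted case, using the same assumed notation as above (it is not copied from the slide's formula):

L(W)\;=\;\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t\ge 0}\beta^{t}\Big(\sum_{n} R_n(s_n(t),a_n(t)) + W\big(M-\textstyle\sum_{n} a_n(t)\big)\Big)\Big]
\;=\;\frac{WM}{1-\beta}\;+\;\sum_{n=1}^{N}\ \max_{\pi_n}\ \mathbb{E}_{\pi_n}\Big[\sum_{t\ge 0}\beta^{t}\big(R_n(s_n(t),a_n(t)) - W\,a_n(t)\big)\Big].

For any W, L(W) is an upper bound on the optimal value (the added term vanishes for any policy that satisfies the exact constraint), so minimizing L(W) over W gives the tightest such bound. Each single-arm subproblem is, up to an additive constant, the W-subsidy problem used to define Whittle's index.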
Restless Multi-arm Bandit Problem 7/31 Index Policies
Restless Multi-arm Bandit Problem 8/31 The Whittle's Index Policy (Discounted Rewards)
W-subsidy problem: for a fixed arm, consider the single-arm problem in which the passive action earns an extra "subsidy" W in every period (sketched below).
The Whittle's index W(s)
▫ The subsidy that makes the passive and active actions indifferent in state s
▫ If W is too small, the active action is better; if W is too large, the passive action is better
A closed-form solution exists only for specific models; otherwise the index is computed numerically. The index policy activates the M arms with the largest current indices.
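A sketch of the single-arm W-subsidy problem behind this definition, with assumed notation (per-arm passive/active rewards r(s,0), r(s,1) and transition matrices P_0, P_1):

V_W(s)\;=\;\max\Big\{\,W + r(s,0) + \beta\sum_{s'}P_0(s,s')\,V_W(s'),\ \ r(s,1) + \beta\sum_{s'}P_1(s,s')\,V_W(s')\,\Big\}.

The Whittle index W(s) is the subsidy at which the two terms inside the max are equal, so that state s is indifferent between being passive and active (unique when the arm is indexable).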
Restless Multi-arm Bandit Problem 9/31 Numerical Algorithm for Solving Whittle's Index
STEP 1: Find a plausible range [L, U] for W
STEP 2: Binary search within [L, U]: for each candidate W, solve the single-arm W-subsidy problem by value iteration; if the active action is still preferred in the state of interest, raise L, otherwise lower U (a sketch follows).
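A minimal Python sketch of this two-step procedure, with assumed per-arm inputs (passive/active transition matrices P0, P1 and reward vectors r0, r1); it takes indexability for granted, so the active-vs-passive preference is treated as monotone in W, and the default bracket is an arbitrary assumption standing in for STEP 1.

import numpy as np

def value_iteration(P0, P1, r0, r1, W, beta, tol=1e-8, max_iter=10_000):
    """Solve the single-arm W-subsidy problem: passive earns r0 + W, active earns r1."""
    V = np.zeros(len(r0))
    for _ in range(max_iter):
        Q_passive = W + r0 + beta * P0 @ V
        Q_active = r1 + beta * P1 @ V
        V_new = np.maximum(Q_passive, Q_active)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q_passive, Q_active

def whittle_index(P0, P1, r0, r1, s, beta, lo=-100.0, hi=100.0, iters=50):
    """Bisection on the subsidy W until state s is indifferent between the two actions."""
    for _ in range(iters):
        W = 0.5 * (lo + hi)
        Q_passive, Q_active = value_iteration(P0, P1, r0, r1, W, beta)
        if Q_active[s] > Q_passive[s]:
            lo = W      # subsidy too small: the arm still prefers to be active
        else:
            hi = W      # subsidy large enough: the arm prefers to be passive
    return 0.5 * (lo + hi)

The resulting index policy would then, at each period, activate the M arms whose current states have the largest whittle_index values.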
Restless Multi-arm Bandit Problem 10/31 The Primal-Dual Index Policy
From the optimal primal and dual solutions of the relaxation, each arm gets two quantities in its current state:
▫ how harmful it is to switch the arm from passive to active
▫ how harmful it is to switch the arm from active to passive
Combining them gives an index; an arm is made active if its index is > 0 (the underlying LP relaxation is sketched below).
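For reference, a sketch of the first-order LP relaxation that this type of primal-dual heuristic is usually built on, in the style of Bertsimas and Niño-Mora; the occupation-measure notation x_n(s,a) is an assumption, not taken from the slides:

\max_{x\ge 0}\ \sum_{n}\sum_{s,a} R_n(s,a)\,x_n(s,a)
\quad\text{s.t.}\quad \sum_{a} x_n(s',a) \;=\; \alpha_n(s') + \beta\sum_{s,a} P_n^{a}(s,s')\,x_n(s,a)\ \ \forall\,n,\,s',
\qquad \sum_{n}\sum_{s} x_n(s,\text{active}) \;=\; \frac{M}{1-\beta},

where x_n(s,a) is the expected discounted time arm n spends in state s under action a and \alpha_n is its initial state distribution. The two "how harmful" quantities above are reduced costs computed from the optimal primal and dual solutions of this LP.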
Restless Multi-arm Bandit Problem 11/31 Heuristic Index Policies
Restless Multi-arm Bandit Problem 12/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 13/31 Experiment Settings
Assume active rewards are larger than passive rewards
Non-identical arms
Structures in transition dynamics (a sampling sketch follows)
▫ Uniformly sampled transition matrix
▫ IFR matrix with non-increasing rewards
▫ P1 is stochastically smaller than P2
▫ Less-connected chain
Evaluation
▫ Small instances: exact optimal solution
▫ Large instances: upper bound & Monte-Carlo simulation
Performance measure
▫ Average gap from optimality or from the upper bound
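A minimal sketch of how the first of these structures could be generated; drawing each row uniformly from the probability simplex (a symmetric Dirichlet with all parameters equal to 1) is one standard way to get a "uniformly sampled" transition matrix, and is an assumption here rather than the authors' exact generator.

import numpy as np

def uniform_transition_matrix(S, rng):
    """Each row drawn uniformly from the S-dimensional probability simplex."""
    return rng.dirichlet(np.ones(S), size=S)

rng = np.random.default_rng(0)
P_passive = uniform_transition_matrix(4, rng)   # per-arm passive dynamics
P_active = uniform_transition_matrix(4, rng)    # per-arm active dynamics
r_passive = rng.uniform(size=4)                 # per-state passive rewards
r_active = r_passive + rng.uniform(size=4)      # active rewards kept larger than passive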
Restless Multi-arm Bandit Problem 14/31 5 Questions of Interest
1. How do different policies compare under different problem structures?
2. How do different policies compare under various problem sizes?
3. How do different policies compare under different discount factors?
4. How does a multi-period look ahead improve a myopic policy?
5. How do different policies compare under different time horizons?
Restless Multi-arm Bandit Problem 15/31 Question 1: Does problem structure help?
▫ Uniformly sampled transition matrix and rewards
▫ Increasing failure rate (IFR) matrix and non-increasing rewards
▫ Less-connected Markov chain
▫ P1 stochastically smaller than P2, non-increasing rewards
Restless Multi-arm Bandit Problem 16/31 Question 1: Does problem structure help?
Restless Multi-arm Bandit Problem 17/31 Question 2: Does problem size matter? Optimality gap: Fixed N and M, increasing S
Restless Multi-arm Bandit Problem 18/31 Question 2: Does problem size matter? Optimality gap: Fixed M and S, increasing N
Restless Multi-arm Bandit Problem 19/31 Question 3: Does discount factor matter? Infinite horizon: varying discount factors
Restless Multi-arm Bandit Problem 20/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 21/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 22/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
Restless Multi-arm Bandit Problem 23/31 Question 4: Does look ahead help a myopic policy?
Greedy policies vs. rolling-horizon policies with different look-ahead horizons H
Problem size: S=8, N=6, M=2
Problem structure: uniform vs. less-connected
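For concreteness, a minimal sketch of the greedy (myopic) baseline in these comparisons, under one plausible reading: activate the M arms with the largest immediate expected reward advantage of being active rather than passive. The exact greedy criterion, and its rolling-horizon extension (which would replace the one-step advantage with an H-step DP value), is an assumption for illustration, not the authors' exact rule.

import numpy as np

def greedy_action(states, r_active, r_passive, M):
    """Myopic policy sketch: activate the M arms with the largest one-step reward advantage.

    states: current state index of each arm
    r_active, r_passive: per-arm reward vectors indexed by state (lists of np.ndarray)
    """
    advantage = np.array([r_active[n][s] - r_passive[n][s] for n, s in enumerate(states)])
    chosen = np.argsort(advantage)[-M:]          # indices of the M most attractive arms
    action = np.zeros(len(states), dtype=int)
    action[chosen] = 1
    return action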
Restless Multi-arm Bandit Problem 24/31 Agenda
Restless multi-arm bandits problem
Algorithms and policies
Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandit Problem 25/31 Clinical Capacity Management Problem (Deo et al. 2013)
School-based asthma care for children: given the medical records of the patient population and a limited van capacity, the scheduling policy decides who to schedule (treat) at each visit.
▫ h = health state at the last appointment
▫ n = time since the last appointment
▫ Per-patient state (h, n), capacity M, population N
OBJECTIVE: maximize the total benefit to the community
Policies compared:
▫ Current guidelines (fixed-duration policy)
▫ Whittle's index policy
▫ Primal-dual index policy
▫ Greedy (myopic) policy
▫ Rolling-horizon policy
▫ H-N priority policy, N-H priority policy
▫ No-schedule [baseline]
Restless Multi-arm Bandit Problem 26/31 How Large Is It?
Restless Multi-arm Bandit Problem 27/31 Performance of Policies
Restless Multi-arm Bandit Problem 28/31 Performance of Policies
Restless Multi-arm Bandit Problem 29/31 Performance of Policies
Restless Multi-arm Bandit Problem 30/31 Whittle's Index vs. Gittins' Index: (S, N, M=1) vs. (S, N, M=2)
Sample 20 instances for each problem size
Whittle's index policy vs. DP exact solution
▫ Optimality tolerance = 0.002
Percentage of time when the Whittle's index policy is NOT optimal:

S   N   M=1   M=2
3   5   0%    25%
3   6   0%    25%
5   5   0%    40%
Restless Multi-arm Bandit Problem 31/31 Summary
Whittle's index and primal-dual index policies work well and efficiently
The relative greedy policy can work well, depending on the problem structure
Policies perform worse on the less-connected Markov chain
All policies tend to work better when capacity is tight
Look-ahead policies have limited marginal benefit for small discount factors
Restless Multi-arm Bandit Problem 33/31 Question 5: Does decision horizon matter? Finite horizon: varying the number of periods