1 Monte-Carlo Planning: Policy Improvement Alan Fern
2 Monte-Carlo Planning. Often a simulator of a planning domain is available or can be learned from data. Example domains: fire & emergency response, conservation planning.
3 Large Worlds: Monte-Carlo Approach. Often a simulator of a planning domain is available or can be learned from data. Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator. [Diagram: the planner interacts with a world simulator in place of the real world, sending actions and receiving states + rewards.]
4 MDP: Simulation-Based Representation. A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r: stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a; the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial-state function I: stochastically returns a state according to an initial-state distribution
These stochastic functions can be implemented in any language!
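To make this interface concrete, here is a minimal Python sketch of a simulation-based MDP. The corridor dynamics, the class name, and all numbers are illustrative assumptions rather than anything from the slides; the point is only that R, T, and I are plain stochastic functions.

```python
import random

class SimulatedMDP:
    """Illustrative simulation-based MDP: a 1-D corridor with noisy moves.
    (Hypothetical example; any simulator exposing R, T, I in this form works.)"""

    def __init__(self, length=10, seed=None):
        self.length = length            # states are 0 .. length-1
        self.actions = [-1, +1]         # move left or right
        self.rng = random.Random(seed)

    def I(self):
        """Stochastic initial-state function: sample a start state."""
        return self.rng.randrange(self.length)

    def T(self, s, a):
        """Stochastic transition function: intended move succeeds 80% of the time."""
        move = a if self.rng.random() < 0.8 else -a
        return min(max(s + move, 0), self.length - 1)

    def R(self, s, a):
        """Stochastic, bounded reward: noisy bonus for being near the right end."""
        return (s / (self.length - 1)) + self.rng.uniform(-0.1, 0.1)
```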
5 Outline. You already learned how to evaluate a policy given a simulator: just run the policy multiple times for a finite horizon and average the rewards. In the next two lectures we'll learn how to use the simulator to select good actions in the real world.
6 Monte-Carlo Planning Outline:
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Today
7 Single State Monte-Carlo Planning. Suppose the MDP has a single state s and k actions. We can sample rewards of actions using calls to the simulator; sampling action a_i is like pulling a slot machine arm with random payoff function R(s, a_i). This is the Multi-Armed Bandit Problem. [Diagram: state s with arms a_1, a_2, ..., a_k, whose pulls return samples of R(s,a_1), R(s,a_2), ..., R(s,a_k).]
Multi-Armed Bandits. We will use bandit algorithms as components for multi-state Monte-Carlo planning, but they are useful in their own right; pure bandit problems arise in many applications. Bandits are applicable whenever:
- we have a set of independent options with unknown utilities;
- there is a cost for sampling options or a limit on total samples;
- we want to find the best option or maximize the utility of our samples.
Multi-Armed Bandits: Examples.
- Clinical trials: arms = possible treatments; arm pulls = application of a treatment to an individual; rewards = outcome of the treatment; objective = determine the best treatment quickly.
- Online advertising: arms = different ads/ad types for a web page; arm pulls = displaying an ad upon a page access; rewards = click-throughs; objective = find the best ad quickly (i.e., maximize clicks).
10 Simple Regret Objective. Different applications suggest different types of bandit objectives; today, minimizing simple regret will be the objective. Simple regret minimization (informal): quickly identify an arm with close to optimal expected reward. [Diagram: the same multi-armed bandit with arms a_1, ..., a_k and payoffs R(s,a_i).]
11 Simple Regret Objective: Formal Definition. One standard formalization: after n arm pulls the algorithm recommends an arm j_n; the simple regret is E[R(s, a*)] - E[R(s, a_j_n)], the gap between the optimal arm's expected reward and the recommended arm's expected reward, and the objective is to drive its expected value to zero as quickly as possible in n.
12 UniformBandit Algorithm (or Round Robin): pull the k arms in round-robin order, splitting the n total pulls uniformly across arms, and recommend the arm with the highest empirical mean reward. Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
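A minimal sketch of this uniform allocation in Python, assuming the bandit is given as a list of stochastic reward functions (one per arm); the function name and interface are illustrative, not from the slides.

```python
def uniform_bandit(arms, n):
    """UniformBandit / round robin: spread n pulls evenly over the arms,
    then recommend the arm with the highest empirical mean reward.
    `arms` is a list of zero-argument callables returning a sampled reward."""
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for t in range(n):
        i = t % len(arms)          # round-robin allocation
        totals[i] += arms[i]()     # one simulator call / arm pull
        counts[i] += 1
    means = [totals[i] / max(counts[i], 1) for i in range(len(arms))]
    return max(range(len(arms)), key=lambda i: means[i])
```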
13 Can we do better? Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence. Sampling schemes designed for simple regret, such as theirs, are often more effective than UniformBandit in practice.
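As an illustration of the idea (not necessarily the algorithm referenced above), here is a hedged sketch of an epsilon-greedy allocation for simple regret: most pulls go to the currently best-looking arm, the rest explore uniformly. The 0.5 split and the function name are assumptions made for the example.

```python
import random

def epsilon_greedy_bandit(arms, n, eps=0.5, seed=None):
    """One simple non-uniform allocation (illustrative, not from the slides):
    with probability eps pull the arm with the best empirical mean so far,
    otherwise pull a uniformly random arm; recommend the best empirical mean."""
    rng = random.Random(seed)
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for _ in range(n):
        untried = [i for i in range(len(arms)) if counts[i] == 0]
        if untried:
            i = untried[0]                      # try every arm at least once
        elif rng.random() < eps:
            i = max(range(len(arms)),
                    key=lambda j: totals[j] / counts[j])   # exploit
        else:
            i = rng.randrange(len(arms))        # explore
        totals[i] += arms[i]()
        counts[i] += 1
    return max(range(len(arms)), key=lambda i: totals[i] / max(counts[i], 1))
```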
14 Monte-Carlo Planning Outline:
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Today
Policy Improvement via Monte-Carlo. Now consider a very large multi-state MDP. Suppose we have a simulator and a non-optimal policy (e.g., the policy could be a standard heuristic or based on intuition). Can we somehow compute an improved policy? [Diagram: the planner interacts with a world simulator plus a base policy, in place of the real world, exchanging actions and states + rewards.]
16 Policy Improvement Theorem. In its standard form: if π' is greedy with respect to the Q-function of a base policy π, i.e., π'(s) = argmax_a Q^π(s,a), then V^π'(s) >= V^π(s) for all states s; acting greedily with respect to an evaluated policy never makes it worse.
17 Policy Improvement via Bandits. Treat the current state s as a bandit whose arms are the actions a_1, a_2, ..., a_k, where pulling arm a_i returns a sample of SimQ(s, a_i, π, h). How can we implement SimQ?
18 Policy Improvement via Bandits.
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)                         (simulate a in s)
  for i = 1 to h-1: q = q + R(s, π(s)); s = T(s, π(s))   (simulate h-1 steps of the policy)
  return q
[Diagram: from s, one trajectory per action a_1, ..., a_k, each following π after the first step; the sum of rewards along the i-th trajectory is a sample of SimQ(s, a_i, π, h).]
19 Policy Improvement via Bandits.
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)                         (simulate a in s)
  for i = 1 to h-1: q = q + R(s, π(s)); s = T(s, π(s))   (simulate h-1 steps of the policy)
  return q
Simply simulate taking a in s and then following the policy for h-1 steps, returning the discounted sum of rewards. The expected value of SimQ(s,a,π,h) is Q^π(s,a,h), so averaging across multiple runs of SimQ quickly converges to Q^π(s,a,h).
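A runnable Python version of SimQ under the simulator interface sketched earlier (R and T as stochastic functions, π as a callable). The discount parameter gamma is an assumption added so the returned value matches the slide's "discounted sum of rewards"; set gamma=1.0 to match the undiscounted pseudocode above.

```python
def sim_q(mdp, s, a, policy, h, gamma=1.0):
    """One Monte-Carlo sample of Q^pi(s, a, h): take action a in state s,
    then follow `policy` for h-1 steps, summing (discounted) rewards."""
    q = mdp.R(s, a)                 # reward for the first action
    s = mdp.T(s, a)                 # simulate a in s
    discount = gamma
    for _ in range(h - 1):          # simulate h-1 steps of the policy
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q
```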
20 Policy Improvement via Bandits. [Diagram: the bandit over actions at state s, with arms a_1, a_2, ..., a_k whose pulls return samples of SimQ(s,a_1,π,h), SimQ(s,a_2,π,h), ..., SimQ(s,a_k,π,h).]
21 UniformRollout. For each action a_i, run w SimQ(s, a_i, π, h) trajectories, giving samples q_i1, q_i2, ..., q_iw; each trajectory simulates taking action a_i and then following π for h-1 steps.
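A sketch of UniformRollout built on the sim_q helper above: it is the uniform bandit applied to the actions, with SimQ samples as arm pulls. Returning the action with the best average is the natural selection rule implied by the slides.

```python
def uniform_rollout(mdp, s, actions, policy, h, w, gamma=1.0):
    """UniformRollout: w SimQ trajectories per action, pick the best average."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        avg = sum(sim_q(mdp, s, a, policy, h, gamma) for _ in range(w)) / w
        if avg > best_value:
            best_action, best_value = a, avg
    return best_action
```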
22 Rollout with a non-uniform bandit. Instead of w trajectories per action, allocate a non-uniform number of trials across actions (q_11, ..., q_1u for a_1; q_21, ..., q_2v for a_2; and so on), focusing samples on the more promising actions.
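One standard way to do this non-uniform allocation (my choice for illustration; the slide does not name a specific bandit) is to run a UCB1-style bandit over the actions, spending more SimQ calls on actions whose estimates look best so far.

```python
import math

def ucb_rollout(mdp, s, actions, policy, h, n, c=math.sqrt(2), gamma=1.0):
    """Rollout with a UCB1 bandit over actions (illustrative choice of bandit):
    n total SimQ calls are allocated adaptively instead of uniformly."""
    totals = [0.0] * len(actions)
    counts = [0] * len(actions)
    for t in range(n):
        if t < len(actions):
            i = t                                # pull each arm once first
        else:
            i = max(range(len(actions)),
                    key=lambda j: totals[j] / counts[j]
                    + c * math.sqrt(math.log(t) / counts[j]))
        totals[i] += sim_q(mdp, s, actions[i], policy, h, gamma)
        counts[i] += 1
    return actions[max(range(len(actions)),
                       key=lambda i: totals[i] / max(counts[i], 1))]
```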
23 Executing Rollout in Real World. At each real-world state, run the rollout procedure in the simulator to select an action, execute that action in the real world, observe the next state, and repeat; the real-world state/action sequence is driven by simulated experience. How much time does each decision take?
24 Policy Rollout: # of Simulator Calls. Each decision makes a total of n SimQ calls, each using h calls to the simulator and to the policy, for a total of hn calls to the simulator and to the policy (this dominates the time to make a decision). [Diagram: the SimQ(s, a_i, π, h) trajectories from state s; each simulates taking action a_i and then following π for h-1 steps.]
25 Practical Issues: Accuracy. In general, larger values of the number of trajectories and the horizon give more accurate Q-estimates, but this increases decision time.
26 Practical Issues: Speed. There are three ways to speed up decision-making time: 1. Use a faster policy.
27 Practical Issues: Speed. There are three ways to speed up decision-making time: 1. Use a faster policy. 2. Decrease the number of trajectories n.
Decreasing trajectories: if n is small compared to the number of actions k, then performance could be poor, since actions don't get tried very often. One way to get away with a smaller n is to use an action filter.
Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider. You can use your domain knowledge to filter out obviously bad actions; rollout then decides among the remaining actions returned by f(s). Since rollout only tries actions in f(s), a smaller value of n can be used.
28 Practical Issues: Speed. There are three ways to speed up either rollout procedure: 1. Use a faster policy. 2. Decrease the number of trajectories n. 3. Decrease the horizon h.
Decreasing the horizon h: if h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate. One can get away with a smaller h by using a value-estimation heuristic.
Heuristic function: a function v(s) that returns an estimate of the value of state s. SimQ is adjusted to run the policy for h steps, ending in some state s', and to return the sum of rewards up to s' plus the estimate v(s'). A Python sketch of this adjusted SimQ follows.
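A minimal sketch of the heuristic-truncated SimQ, assuming the same simulator interface as before; the heuristic v is passed in as a callable, and gamma is again an added assumption.

```python
def sim_q_heuristic(mdp, s, a, policy, h, v, gamma=1.0):
    """SimQ with a truncated horizon: follow the policy for h-1 steps after
    taking a, then add the heuristic value estimate v(s') of the final state."""
    q = mdp.R(s, a)
    s = mdp.T(s, a)
    discount = gamma
    for _ in range(h - 1):
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q + discount * v(s)       # heuristic estimate of the remaining value
```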
29 Multi-Stage Rollout. A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting from policy π. We can use more computation time to get multiple iterations of policy improvement by nesting calls to Rollout: Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement, and this can be nested arbitrarily. It gives a way to use more time in order to improve performance.
30 Multi-Stage Rollout. [Diagram: trajectories of SimQ(s, a_i, Rollout[π,h,w], h); each step of a trajectory requires nh simulator calls to evaluate the Rollout policy.] Two-stage rollout computes the rollout policy of the "rollout policy of π", which requires (nh)^2 calls to the simulator for 2 stages; in general the cost is exponential in the number of stages.
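A sketch of the nesting in code, assuming the uniform_rollout and sim_q helpers above; the rollout policy is itself just a policy (a function from state to action), so it can be passed back into rollout.

```python
def rollout_policy(mdp, actions, policy, h, w, gamma=1.0):
    """Return the one-step-improved (rollout) policy of `policy` as a callable."""
    return lambda s: uniform_rollout(mdp, s, actions, policy, h, w, gamma)

def multi_stage_rollout(mdp, actions, base_policy, h, w, stages, gamma=1.0):
    """Nest rollout `stages` times: each level treats the previous rollout
    policy as its base policy.  Cost grows roughly as (k*w*h)**stages."""
    policy = base_policy
    for _ in range(stages):
        policy = rollout_policy(mdp, actions, policy, h, w, gamma)
    return policy
```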
31 Example: Rollout for Solitaire [Yan et al. NIPS'04]. Multiple levels of rollout can pay off, but are expensive.
Player               Success Rate   Time/Game
Human Expert         36.6%          20 min
(naive) Base Policy  13.05%         0.021 sec
1 rollout            31.20%         0.67 sec
2 rollout            47.6%          7.13 sec
3 rollout            56.83%         1.5 min
4 rollout            60.51%         18 min
5 rollout            70.20%         1 hour 45 min
32 Rollout in 2-Player Games. [Diagram: rollout at state s for actions a_1, ..., a_k with samples q_11, ..., q_kw; the simulated trajectories alternate moves between players p1 and p2.]
33 Another Useful Technique: Policy Switching. Suppose you have a set of base policies {π_1, π_2, ..., π_M}. Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select among them. Policy switching is a simple way to select which policy to use at a given step via a simulator.
34 Another Useful Technique: Policy Switching. The stochastic function Sim(s, π, h) simply samples the h-horizon value of π starting in state s; implement it by simulating π starting in s for h steps and returning the discounted total reward. Use a bandit algorithm over the arms Sim(s, π_1, h), Sim(s, π_2, h), ..., Sim(s, π_M, h) to select the best policy, and then select the action chosen by that policy.
35 PolicySwitching.
PolicySwitch[{π_1, π_2, ..., π_M}, h, n](s):
  1. Define a bandit with M arms, where pulling arm i gives reward Sim(s, π_i, h).
  2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials.
  3. Return the action π_i*(s).
[Diagram: for each policy π_i, the Sim(s, π_i, h) trajectories v_i1, v_i2, ..., v_iw; each simulates following π_i for h steps and records the discounted cumulative reward.]
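A runnable sketch of PolicySwitch with the uniform bandit standing in as the "favorite bandit algorithm"; that choice, and the simulator interface, are assumptions carried over from the earlier sketches.

```python
def sim_value(mdp, s, policy, h, gamma=1.0):
    """Sim(s, pi, h): one sampled h-horizon (discounted) return of `policy` from s."""
    total, discount = 0.0, 1.0
    for _ in range(h):
        a = policy(s)
        total += discount * mdp.R(s, a)
        s = mdp.T(s, a)
        discount *= gamma
    return total

def policy_switch(mdp, s, policies, h, n, gamma=1.0):
    """PolicySwitch: run a uniform bandit over the policies using n Sim calls,
    then return the action chosen in s by the winning policy."""
    per_policy = max(n // len(policies), 1)
    means = [sum(sim_value(mdp, s, pi, h, gamma) for _ in range(per_policy)) / per_policy
             for pi in policies]
    best = max(range(len(policies)), key=lambda i: means[i])
    return policies[best](s)
```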
36 Executing Policy Switching in Real World. At each real-world state, run the policy-switching procedure in the simulator and execute the action chosen by the selected policy (e.g., π_2(s) at one state, π_k(s') at the next), then observe the next state and repeat; the real-world state/action sequence is driven by simulated experience.
37 Policy Switching: Quality. Ignoring sampling error, the switching policy is guaranteed to be at least as good as the best single policy in the set: its h-horizon value at every state is at least max_i V^π_i(s, h).
Policy Switching in 2-Player Games
MaxiMin Policy Switching. Given the current state s and a game simulator, build a game matrix over the max player's policies and the min player's policies. Each entry gives the estimated value (for the max player) of playing one policy pair against the other, estimated by averaging across w simulated games. The max player then plays the maximin choice from this matrix, and because the matrix is rebuilt at each state, it can switch between policies based on the state of the game!
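A sketch of the game-matrix construction and maximin selection in Python; the two-player simulator interface (simulate_game returning the max player's value) is an assumption made for the example.

```python
def maximin_policy_switch(simulate_game, s, max_policies, min_policies, w):
    """Build the w-sample game matrix at state s and return the max player's
    maximin policy.  `simulate_game(s, pi_max, pi_min)` is assumed to play one
    simulated game from s and return its value for the max player."""
    # matrix[i][j] = estimated value of max policy i vs. min policy j
    matrix = [[sum(simulate_game(s, pi_max, pi_min) for _ in range(w)) / w
               for pi_min in min_policies]
              for pi_max in max_policies]
    # maximin: pick the row whose worst-case (minimum) entry is largest
    best_row = max(range(len(max_policies)), key=lambda i: min(matrix[i]))
    return max_policies[best_row]
```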
42 Policy Switching: Quality. MaxiMin policy switching will often do better than any single policy in practice. The theoretical guarantees for basic MaxiMin policy switching are quite weak, but tweaks to the algorithm can fix this. For single-agent MDPs, policy switching is guaranteed to improve over the best policy in the set.