1 Monte-Carlo Planning: Policy Improvement Alan Fern
2 Monte-Carlo Planning. Often a simulator of a planning domain is available or can be learned from data. Example domains: fire & emergency response, conservation planning.
3 Large Worlds: Monte-Carlo Approach. Often a simulator of a planning domain is available or can be learned from data. Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator. [Diagram: the planner interacts with a world simulator in place of the real world, sending actions and receiving states + rewards.]
4 MDP: Simulation-Based Representation. A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r: stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a; the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial-state function I: stochastically returns a state according to an initial-state distribution
These stochastic functions can be implemented in any language!
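To make this interface concrete, here is a minimal Python sketch of a simulation-based MDP. The corridor dynamics, the class name, and all numbers are illustrative assumptions rather than anything from the slides; the point is only that R, T, and I are plain stochastic functions.

```python
import random

class SimulatedMDP:
    """Illustrative simulation-based MDP: a 1-D corridor with noisy moves.
    (Hypothetical example; any simulator exposing R, T, I in this form works.)"""

    def __init__(self, length=10, seed=None):
        self.length = length            # states are 0 .. length-1
        self.actions = [-1, +1]         # move left or right
        self.rng = random.Random(seed)

    def I(self):
        """Stochastic initial-state function: sample a start state."""
        return self.rng.randrange(self.length)

    def T(self, s, a):
        """Stochastic transition function: intended move succeeds 80% of the time."""
        move = a if self.rng.random() < 0.8 else -a
        return min(max(s + move, 0), self.length - 1)

    def R(self, s, a):
        """Stochastic, bounded reward: noisy bonus for being near the right end."""
        return (s / (self.length - 1)) + self.rng.uniform(-0.1, 0.1)
```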
5 Outline. You already learned how to evaluate a policy given a simulator: just run the policy multiple times for a finite horizon and average the rewards. In the next two lectures we'll learn how to use the simulator to select good actions in the real world.
6 Monte-Carlo Planning Outline:
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Today
7 Single State Monte-Carlo Planning. Suppose the MDP has a single state s and k actions. We can sample rewards of actions using calls to the simulator; sampling action a_i is like pulling a slot machine arm with random payoff function R(s, a_i). This is the Multi-Armed Bandit Problem. [Diagram: state s with arms a_1, a_2, ..., a_k, whose pulls return samples of R(s,a_1), R(s,a_2), ..., R(s,a_k).]
Multi-Armed Bandits. We will use bandit algorithms as components for multi-state Monte-Carlo planning, but they are useful in their own right; pure bandit problems arise in many applications. Bandits are applicable whenever:
- we have a set of independent options with unknown utilities;
- there is a cost for sampling options or a limit on total samples;
- we want to find the best option or maximize the utility of our samples.
Multi-Armed Bandits: Examples.
- Clinical trials: arms = possible treatments; arm pulls = application of a treatment to an individual; rewards = outcome of the treatment; objective = determine the best treatment quickly.
- Online advertising: arms = different ads/ad types for a web page; arm pulls = displaying an ad upon a page access; rewards = click-throughs; objective = find the best ad quickly (i.e., maximize clicks).
10 Simple Regret Objective. Different applications suggest different types of bandit objectives; today, minimizing simple regret will be the objective. Simple regret minimization (informal): quickly identify an arm with close to optimal expected reward. [Diagram: the same multi-armed bandit with arms a_1, ..., a_k and payoffs R(s,a_i).]
11 Simple Regret Objective: Formal Definition. One standard formalization: after n arm pulls the algorithm recommends an arm j_n; the simple regret is E[R(s, a*)] - E[R(s, a_j_n)], the gap between the optimal arm's expected reward and the recommended arm's expected reward, and the objective is to drive its expected value to zero as quickly as possible in n.
12 UniformBandit Algorithm (or Round Robin): pull the k arms in round-robin order, splitting the n total pulls uniformly across arms, and recommend the arm with the highest empirical mean reward. Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
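A minimal sketch of this uniform allocation in Python, assuming the bandit is given as a list of stochastic reward functions (one per arm); the function name and interface are illustrative, not from the slides.

```python
def uniform_bandit(arms, n):
    """UniformBandit / round robin: spread n pulls evenly over the arms,
    then recommend the arm with the highest empirical mean reward.
    `arms` is a list of zero-argument callables returning a sampled reward."""
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for t in range(n):
        i = t % len(arms)          # round-robin allocation
        totals[i] += arms[i]()     # one simulator call / arm pull
        counts[i] += 1
    means = [totals[i] / max(counts[i], 1) for i in range(len(arms))]
    return max(range(len(arms)), key=lambda i: means[i])
```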
13 Can we do better? Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence. Sampling schemes designed for simple regret, such as theirs, are often more effective than UniformBandit in practice.
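As an illustration of the idea (not necessarily the algorithm referenced above), here is a hedged sketch of an epsilon-greedy allocation for simple regret: most pulls go to the currently best-looking arm, the rest explore uniformly. The 0.5 split and the function name are assumptions made for the example.

```python
import random

def epsilon_greedy_bandit(arms, n, eps=0.5, seed=None):
    """One simple non-uniform allocation (illustrative, not from the slides):
    with probability eps pull the arm with the best empirical mean so far,
    otherwise pull a uniformly random arm; recommend the best empirical mean."""
    rng = random.Random(seed)
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for _ in range(n):
        untried = [i for i in range(len(arms)) if counts[i] == 0]
        if untried:
            i = untried[0]                      # try every arm at least once
        elif rng.random() < eps:
            i = max(range(len(arms)),
                    key=lambda j: totals[j] / counts[j])   # exploit
        else:
            i = rng.randrange(len(arms))        # explore
        totals[i] += arms[i]()
        counts[i] += 1
    return max(range(len(arms)), key=lambda i: totals[i] / max(counts[i], 1))
```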
14 Monte-Carlo Planning Outline:
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Today
Policy Improvement via Monte-Carlo. Now consider a very large multi-state MDP. Suppose we have a simulator and a non-optimal policy (e.g., the policy could be a standard heuristic or based on intuition). Can we somehow compute an improved policy? [Diagram: the planner interacts with a world simulator plus a base policy, in place of the real world, exchanging actions and states + rewards.]
16 Policy Improvement Theorem. In its standard form: if π' is greedy with respect to the Q-function of a base policy π, i.e., π'(s) = argmax_a Q^π(s,a), then V^π'(s) >= V^π(s) for all states s; acting greedily with respect to an evaluated policy never makes it worse.
17 Policy Improvement via Bandits. Treat the current state s as a bandit whose arms are the actions a_1, a_2, ..., a_k, where pulling arm a_i returns a sample of SimQ(s, a_i, π, h). How can we implement SimQ?
18 Policy Improvement via Bandits.
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)                         (simulate a in s)
  for i = 1 to h-1: q = q + R(s, π(s)); s = T(s, π(s))   (simulate h-1 steps of the policy)
  return q
[Diagram: from s, one trajectory per action a_1, ..., a_k, each following π after the first step; the sum of rewards along the i-th trajectory is a sample of SimQ(s, a_i, π, h).]
19 Policy Improvement via Bandits.
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)                         (simulate a in s)
  for i = 1 to h-1: q = q + R(s, π(s)); s = T(s, π(s))   (simulate h-1 steps of the policy)
  return q
Simply simulate taking a in s and then following the policy for h-1 steps, returning the discounted sum of rewards. The expected value of SimQ(s,a,π,h) is Q^π(s,a,h), so averaging across multiple runs of SimQ quickly converges to Q^π(s,a,h).
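A runnable Python version of SimQ under the simulator interface sketched earlier (R and T as stochastic functions, π as a callable). The discount parameter gamma is an assumption added so the returned value matches the slide's "discounted sum of rewards"; set gamma=1.0 to match the undiscounted pseudocode above.

```python
def sim_q(mdp, s, a, policy, h, gamma=1.0):
    """One Monte-Carlo sample of Q^pi(s, a, h): take action a in state s,
    then follow `policy` for h-1 steps, summing (discounted) rewards."""
    q = mdp.R(s, a)                 # reward for the first action
    s = mdp.T(s, a)                 # simulate a in s
    discount = gamma
    for _ in range(h - 1):          # simulate h-1 steps of the policy
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q
```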
20 Policy Improvement via Bandits. [Diagram: the bandit over actions at state s, with arms a_1, a_2, ..., a_k whose pulls return samples of SimQ(s,a_1,π,h), SimQ(s,a_2,π,h), ..., SimQ(s,a_k,π,h).]
21 UniformRollout. For each action a_i, run w SimQ(s, a_i, π, h) trajectories, giving samples q_i1, q_i2, ..., q_iw; each trajectory simulates taking action a_i and then following π for h-1 steps.
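A sketch of UniformRollout built on the sim_q helper above: it is the uniform bandit applied to the actions, with SimQ samples as arm pulls. Returning the action with the best average is the natural selection rule implied by the slides.

```python
def uniform_rollout(mdp, s, actions, policy, h, w, gamma=1.0):
    """UniformRollout: w SimQ trajectories per action, pick the best average."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        avg = sum(sim_q(mdp, s, a, policy, h, gamma) for _ in range(w)) / w
        if avg > best_value:
            best_action, best_value = a, avg
    return best_action
```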
22 Rollout with a non-uniform bandit. Instead of w trajectories per action, allocate a non-uniform number of trials across actions (q_11, ..., q_1u for a_1; q_21, ..., q_2v for a_2; and so on), focusing samples on the more promising actions.
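One standard way to do this non-uniform allocation (my choice for illustration; the slide does not name a specific bandit) is to run a UCB1-style bandit over the actions, spending more SimQ calls on actions whose estimates look best so far.

```python
import math

def ucb_rollout(mdp, s, actions, policy, h, n, c=math.sqrt(2), gamma=1.0):
    """Rollout with a UCB1 bandit over actions (illustrative choice of bandit):
    n total SimQ calls are allocated adaptively instead of uniformly."""
    totals = [0.0] * len(actions)
    counts = [0] * len(actions)
    for t in range(n):
        if t < len(actions):
            i = t                                # pull each arm once first
        else:
            i = max(range(len(actions)),
                    key=lambda j: totals[j] / counts[j]
                    + c * math.sqrt(math.log(t) / counts[j]))
        totals[i] += sim_q(mdp, s, actions[i], policy, h, gamma)
        counts[i] += 1
    return actions[max(range(len(actions)),
                       key=lambda i: totals[i] / max(counts[i], 1))]
```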
23 Executing Rollout in Real World. At each real-world state, run the rollout procedure in the simulator to select an action, execute that action in the real world, observe the next state, and repeat; the real-world state/action sequence is driven by simulated experience. How much time does each decision take?
24 Policy Rollout: # of Simulator Calls. Each decision makes a total of n SimQ calls, each using h calls to the simulator and to the policy, for a total of hn calls to the simulator and to the policy (this dominates the time to make a decision). [Diagram: the SimQ(s, a_i, π, h) trajectories from state s; each simulates taking action a_i and then following π for h-1 steps.]
25 Practical Issues: Accuracy. In general, larger values of the number of trajectories and the horizon give more accurate Q-estimates, but this increases decision time.
26 Practical Issues: Speed. There are three ways to speed up decision-making time: 1. Use a faster policy.
27 Practical Issues: Speed. There are three ways to speed up decision-making time: 1. Use a faster policy. 2. Decrease the number of trajectories n.
Decreasing trajectories: if n is small compared to the number of actions k, then performance could be poor, since actions don't get tried very often. One way to get away with a smaller n is to use an action filter.
Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider. You can use your domain knowledge to filter out obviously bad actions; rollout then decides among the remaining actions returned by f(s). Since rollout only tries actions in f(s), a smaller value of n can be used.
28 Practical Issues: Speed. There are three ways to speed up either rollout procedure: 1. Use a faster policy. 2. Decrease the number of trajectories n. 3. Decrease the horizon h.
Decreasing the horizon h: if h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate. One can get away with a smaller h by using a value-estimation heuristic.
Heuristic function: a function v(s) that returns an estimate of the value of state s. SimQ is adjusted to run the policy for h steps, ending in some state s', and to return the sum of rewards up to s' plus the estimate v(s'). A Python sketch of this adjusted SimQ follows.
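A minimal sketch of the heuristic-truncated SimQ, assuming the same simulator interface as before; the heuristic v is passed in as a callable, and gamma is again an added assumption.

```python
def sim_q_heuristic(mdp, s, a, policy, h, v, gamma=1.0):
    """SimQ with a truncated horizon: follow the policy for h-1 steps after
    taking a, then add the heuristic value estimate v(s') of the final state."""
    q = mdp.R(s, a)
    s = mdp.T(s, a)
    discount = gamma
    for _ in range(h - 1):
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q + discount * v(s)       # heuristic estimate of the remaining value
```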
29 Multi-Stage Rollout. A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting from policy π. We can use more computation time to get multiple iterations of policy improvement by nesting calls to Rollout: Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement, and this can be nested arbitrarily. It gives a way to use more time in order to improve performance.
30 Multi-Stage Rollout. [Diagram: trajectories of SimQ(s, a_i, Rollout[π,h,w], h); each step of a trajectory requires nh simulator calls to evaluate the Rollout policy.] Two-stage rollout computes the rollout policy of the "rollout policy of π", which requires (nh)^2 calls to the simulator for 2 stages; in general the cost is exponential in the number of stages.
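A sketch of the nesting in code, assuming the uniform_rollout and sim_q helpers above; the rollout policy is itself just a policy (a function from state to action), so it can be passed back into rollout.

```python
def rollout_policy(mdp, actions, policy, h, w, gamma=1.0):
    """Return the one-step-improved (rollout) policy of `policy` as a callable."""
    return lambda s: uniform_rollout(mdp, s, actions, policy, h, w, gamma)

def multi_stage_rollout(mdp, actions, base_policy, h, w, stages, gamma=1.0):
    """Nest rollout `stages` times: each level treats the previous rollout
    policy as its base policy.  Cost grows roughly as (k*w*h)**stages."""
    policy = base_policy
    for _ in range(stages):
        policy = rollout_policy(mdp, actions, policy, h, w, gamma)
    return policy
```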
31 Example: Rollout for Solitaire [Yan et al. NIPS'04]. Multiple levels of rollout can pay off, but are expensive.
Player               Success Rate   Time/Game
Human Expert         36.6%          20 min
(naive) Base Policy  13.05%         0.021 sec
1 rollout            31.20%         0.67 sec
2 rollout            47.6%          7.13 sec
3 rollout            56.83%         1.5 min
4 rollout            60.51%         18 min
5 rollout            70.20%         1 hour 45 min
32 Rollout in 2-Player Games. [Diagram: rollout at state s for actions a_1, ..., a_k with samples q_11, ..., q_kw; the simulated trajectories alternate moves between players p1 and p2.]
33 Another Useful Technique: Policy Switching. Suppose you have a set of base policies {π_1, π_2, ..., π_M}. Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select among them. Policy switching is a simple way to select which policy to use at a given step via a simulator.
34 Another Useful Technique: Policy Switching. The stochastic function Sim(s, π, h) simply samples the h-horizon value of π starting in state s; implement it by simulating π starting in s for h steps and returning the discounted total reward. Use a bandit algorithm over the arms Sim(s, π_1, h), Sim(s, π_2, h), ..., Sim(s, π_M, h) to select the best policy, and then select the action chosen by that policy.
35 PolicySwitching.
PolicySwitch[{π_1, π_2, ..., π_M}, h, n](s):
  1. Define a bandit with M arms, where pulling arm i gives reward Sim(s, π_i, h).
  2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials.
  3. Return the action π_i*(s).
[Diagram: for each policy π_i, the Sim(s, π_i, h) trajectories v_i1, v_i2, ..., v_iw; each simulates following π_i for h steps and records the discounted cumulative reward.]
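A runnable sketch of PolicySwitch with the uniform bandit standing in as the "favorite bandit algorithm"; that choice, and the simulator interface, are assumptions carried over from the earlier sketches.

```python
def sim_value(mdp, s, policy, h, gamma=1.0):
    """Sim(s, pi, h): one sampled h-horizon (discounted) return of `policy` from s."""
    total, discount = 0.0, 1.0
    for _ in range(h):
        a = policy(s)
        total += discount * mdp.R(s, a)
        s = mdp.T(s, a)
        discount *= gamma
    return total

def policy_switch(mdp, s, policies, h, n, gamma=1.0):
    """PolicySwitch: run a uniform bandit over the policies using n Sim calls,
    then return the action chosen in s by the winning policy."""
    per_policy = max(n // len(policies), 1)
    means = [sum(sim_value(mdp, s, pi, h, gamma) for _ in range(per_policy)) / per_policy
             for pi in policies]
    best = max(range(len(policies)), key=lambda i: means[i])
    return policies[best](s)
```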
36 Executing Policy Switching in Real World. At each real-world state, run the policy-switching procedure in the simulator and execute the action chosen by the selected policy (e.g., π_2(s) at one state, π_k(s') at the next), then observe the next state and repeat; the real-world state/action sequence is driven by simulated experience.
37 Policy Switching: Quality. Ignoring sampling error, the switching policy is guaranteed to be at least as good as the best single policy in the set: its h-horizon value at every state is at least max_i V^π_i(s, h).
Policy Switching in 2-Player Games
MaxiMin Policy Switching. Given the current state s and a game simulator, build a game matrix over the max player's policies and the min player's policies. Each entry gives the estimated value (for the max player) of playing one policy pair against the other, estimated by averaging across w simulated games. The max player then plays the maximin choice from this matrix, and because the matrix is rebuilt at each state, it can switch between policies based on the state of the game!
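A sketch of the game-matrix construction and maximin selection in Python; the two-player simulator interface (simulate_game returning the max player's value) is an assumption made for the example.

```python
def maximin_policy_switch(simulate_game, s, max_policies, min_policies, w):
    """Build the w-sample game matrix at state s and return the max player's
    maximin policy.  `simulate_game(s, pi_max, pi_min)` is assumed to play one
    simulated game from s and return its value for the max player."""
    # matrix[i][j] = estimated value of max policy i vs. min policy j
    matrix = [[sum(simulate_game(s, pi_max, pi_min) for _ in range(w)) / w
               for pi_min in min_policies]
              for pi_max in max_policies]
    # maximin: pick the row whose worst-case (minimum) entry is largest
    best_row = max(range(len(max_policies)), key=lambda i: min(matrix[i]))
    return max_policies[best_row]
```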
42 Policy Switching: Quality. MaxiMin policy switching will often do better than any single policy in practice. The theoretical guarantees for basic MaxiMin policy switching are quite weak, but tweaks to the algorithm can fix this. For single-agent MDPs, policy switching is guaranteed to improve over the best policy in the set.