1 Monte-Carlo Planning: Policy Improvement
Alan Fern
2 Monte-Carlo Planning
Often a simulator of a planning domain is available or can be learned from data.
[Figure: example domains — Fire & Emergency Response, Conservation Planning]
3 Large Worlds: Monte-Carlo Approach
Often a simulator of a planning domain is available or can be learned from data.
Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator.
[Figure: the planner sends an action to the World Simulator / Real World and receives back a state and reward]
4 MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r: stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e. a simulator): stochastically returns a state s' given inputs s and a; the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial-state function I: stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
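As a concrete illustration, here is a minimal sketch (in Python) of what such a simulation-based representation might look like. The class name, the toy two-state domain, and the method names are my own illustrative assumptions; the lecture only requires that R, T, and I be callable stochastic functions.

    import random

    class SimulatedMDP:
        """Illustrative simulation-based MDP: a toy 2-state, 2-action domain."""

        def initial_state(self):          # I: sample a start state
            return random.choice([0, 1])

        def reward(self, s, a):           # R(s,a): stochastic, bounded reward
            return random.gauss(mu=float(s == a), sigma=0.1)

        def transition(self, s, a):       # T(s,a): sample next state from Pr(s'|s,a)
            return a if random.random() < 0.9 else 1 - a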
5 Outline
You already learned how to evaluate a policy given a simulator: just run the policy multiple times for a finite horizon and average the rewards.
In the next two lectures we'll learn how to select good actions.
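A minimal sketch of that evaluation procedure, assuming the illustrative SimulatedMDP interface above and a policy given as a function from states to actions:

    def evaluate_policy(mdp, policy, horizon, num_episodes, gamma=1.0):
        """Estimate the policy's expected h-horizon (discounted) return by averaging rollouts."""
        total = 0.0
        for _ in range(num_episodes):
            s, ret, discount = mdp.initial_state(), 0.0, 1.0
            for _ in range(horizon):
                a = policy(s)
                ret += discount * mdp.reward(s, a)
                s = mdp.transition(s, a)
                discount *= gamma
            total += ret
        return total / num_episodes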
6 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
[Today]
7 Single State Monte-Carlo Planning
Suppose the MDP has a single state s and k actions.
We can sample rewards of actions using calls to the simulator.
Sampling action a is like pulling a slot machine arm with random payoff function R(s,a).
[Figure: the Multi-Armed Bandit Problem — arms a_1, a_2, ..., a_k with payoffs R(s,a_1), R(s,a_2), ..., R(s,a_k)]
8 Multi-Armed Bandits
We will use bandit algorithms as components for multi-state Monte-Carlo planning, but they are useful in their own right. Pure bandit problems arise in many applications.
Applicable whenever:
- we have a set of independent options with unknown utilities
- there is a cost for sampling options, or a limit on total samples
- we want to find the best option or maximize the utility of our samples
9 Multi-Armed Bandits: Examples
Clinical Trials
- Arms = possible treatments
- Arm pulls = application of a treatment to an individual
- Rewards = outcome of the treatment
- Objective = find the best treatment quickly (debatable)
Online Advertising
- Arms = different ads/ad-types for a web page
- Arm pulls = displaying an ad upon a page access
- Rewards = click-throughs
- Objective = find the best ad quickly (i.e., maximize clicks)
10 Simple Regret Objective
Different applications suggest different types of bandit objectives. Today minimizing simple regret will be the objective.
Simple regret: quickly identify an arm with high expected reward.
[Figure: the Multi-Armed Bandit Problem — arms a_1, ..., a_k with payoffs R(s,a_1), ..., R(s,a_k)]
11 Simple Regret Objective: Formal Definition
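The formula on this slide is an image and did not survive extraction. The standard definition (as in the Bubeck et al. reference cited on the next slide) is: if the algorithm recommends arm a_j after its sampling budget of n pulls, the simple regret is the expected gap to the best arm,

    \mathrm{SReg}_n \;=\; \max_{i}\,\mathbb{E}[R(s,a_i)] \;-\; \mathbb{E}[R(s,a_j)],

and the objective is to minimize the expected simple regret for a given budget n.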
12 UniformBandit Algorithm (or Round Robin)
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
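The slide's formula is not in the transcript, but the algorithm itself is simple: pull each of the k arms the same number of times and recommend the arm with the highest empirical mean. A minimal sketch, with the arm-pulling simulator passed in as a callable:

    def uniform_bandit(arms, pull, budget):
        """Round-robin: spread `budget` pulls evenly, return the arm with the best empirical mean.
        `arms` is a list of actions; `pull(a)` returns one stochastic reward sample."""
        sums = {a: 0.0 for a in arms}
        counts = {a: 0 for a in arms}
        for t in range(budget):
            a = arms[t % len(arms)]
            sums[a] += pull(a)
            counts[a] += 1
        return max(arms, key=lambda a: sums[a] / max(counts[a], 1))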
13 Can we do better?
Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
This alternative is often seen to be more effective in practice as well.
14 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits): a basic tool for other algorithms
- Monte-Carlo Policy Improvement: policy rollout, policy switching
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
[Today]
15 Policy Improvement via Monte-Carlo
Now consider a very large multi-state MDP. Suppose we have a simulator and a non-optimal policy (e.g., the policy could be a standard heuristic or based on intuition).
Can we somehow compute an improved policy?
[Figure: the planner (World Simulator + Base Policy) sends an action to the Real World and receives back a state and reward]
16 Policy Improvement Theorem
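The statement on this slide is an image and did not survive extraction. The standard form of the result: given a policy π, define the greedy policy π'(s) = argmax_a Q^π(s,a) for all s; then

    V^{\pi'}(s) \;\ge\; V^{\pi}(s) \quad \text{for every state } s,

with strict improvement at some state unless π is already optimal. This is what justifies using Q^π estimates to pick better actions.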
17 Policy Improvement via Bandits
Idea: define a stochastic function SimQ(s,a,π,h) that we can implement and whose expected value is Q^π(s,a,h). Then use a bandit algorithm to select (approximately) the action with the best Q-value.
How do we implement SimQ?
[Figure: a bandit at state s whose arms a_1, a_2, ..., a_k return samples SimQ(s,a_1,π,h), SimQ(s,a_2,π,h), ..., SimQ(s,a_k,π,h)]
18 Policy Improvement via Bandits
SimQ(s,a,π,h)
    r = R(s,a); s = T(s,a)        // simulate taking a in s
    for i = 1 to h-1              // simulate h-1 steps of the policy
        r = r + R(s, π(s))
        s = T(s, π(s))
    return r
Simply simulate taking a in s and then following the policy for h-1 steps, returning the discounted sum of rewards.
The expected value of SimQ(s,a,π,h) is Q^π(s,a,h), which can be made arbitrarily close to Q^π(s,a) by increasing h.
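A directly runnable version of this pseudocode, assuming the illustrative SimulatedMDP interface sketched earlier and making the discount factor (left implicit on the slide) explicit:

    def sim_q(mdp, s, a, policy, h, gamma=1.0):
        """One stochastic sample whose expected value is Q^pi(s, a, h)."""
        r = mdp.reward(s, a)              # simulate taking a in s
        s = mdp.transition(s, a)
        discount = gamma
        for _ in range(h - 1):            # then follow pi for h-1 steps
            a2 = policy(s)
            r += discount * mdp.reward(s, a2)
            s = mdp.transition(s, a2)
            discount *= gamma
        return r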
19 Policy Improvement via Bandits
(Same SimQ pseudocode as the previous slide.)
[Figure: from state s, one trajectory under π is simulated for each of a_1, a_2, ..., a_k; the sum of rewards along each trajectory is one sample of SimQ(s,a_1,π,h), SimQ(s,a_2,π,h), ..., SimQ(s,a_k,π,h)]
20 Policy Improvement via Bandits
[Figure: the bandit view again — at state s, each arm a_i returns a sample of SimQ(s,a_i,π,h)]
21 UniformRollout
For each action a_i, sample w SimQ(s,a_i,π,h) trajectories; each trajectory simulates taking action a_i and then following π for h-1 steps. The returns q_i1, q_i2, ..., q_iw are samples of SimQ(s,a_i,π,h).
[Figure: from state s, each action a_1, ..., a_k gets w sampled trajectories]
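A sketch of UniformRollout built from the sim_q sketch above; averaging w samples per action is exactly the UniformBandit allocation:

    def uniform_rollout(mdp, s, actions, policy, h, w, gamma=1.0):
        """Sample w SimQ trajectories per action; return the action with the best average."""
        def avg_q(a):
            return sum(sim_q(mdp, s, a, policy, h, gamma) for _ in range(w)) / w
        return max(actions, key=avg_q)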
22
Allocates a non-uniform number of trials across actions (focuses on more promising actions).
[Figure: from state s, the number of sampled trajectories differs per action — a_1 gets samples q_11 ... q_1u, a_2 gets q_21 ... q_2v, and so on]
23 Executing Rollout in Real World
[Figure: the agent is in real-world state s, runs a policy rollout over actions a_1, a_2, ..., a_k (simulated experience), executes the chosen action (e.g., a_2), reaches a new state, runs another rollout, executes the next chosen action (e.g., a_k), and so on — alternating real-world steps with simulated rollouts]
24 Policy Rollout: # of Simulator Calls
Each SimQ trajectory simulates taking an action a_i and then following π for h-1 steps.
Total of n SimQ calls, each using h calls to the simulator, for a total of hn calls to the simulator.
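For concreteness (numbers chosen purely for illustration): with k = 10 actions and w = 10 trajectories per action, n = kw = 100 SimQ calls, and with horizon h = 20 a single rollout decision costs hn = 2,000 simulator calls.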
25 Multi-Stage Rollout
A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π.
We can use more computation time to get multiple iterations of policy improvement by nesting calls to Rollout: Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement, and this nesting can be repeated arbitrarily.
This gives a way to use more time in order to improve performance (see the sketch below).
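A sketch of how the nesting can be expressed with the uniform_rollout sketch above: the improved policy is itself a function of state, so it can be passed back in as the base policy.

    def rollout_policy(mdp, actions, base_policy, h, w, gamma=1.0):
        """Rollout[pi, h, w]: returns the improved policy as a callable."""
        return lambda s: uniform_rollout(mdp, s, actions, base_policy, h, w, gamma)

    # Two iterations of improvement, Rollout[Rollout[pi,h,w],h,w]:
    # two_stage = rollout_policy(mdp, actions, rollout_policy(mdp, actions, pi, h, w), h, w)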
26 Practical Issues
There are three ways to speed up either rollout procedure:
1. Use a faster policy
2. Decrease the number of trajectories n
3. Decrease the horizon h
Decreasing trajectories: if n is small compared to the number of actions k, then performance could be poor since actions don't get tried very often. One way to get away with a smaller n is to use an action filter.
Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider. You can use your domain knowledge to filter out obviously bad actions; rollout decides among the rest.
27 Practical Issues
There are three ways to speed up either rollout procedure:
1. Use a faster policy
2. Decrease the number of trajectories n
3. Decrease the horizon h
Decreasing the horizon h: if h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate. One way to get away with a smaller h is to use a value-estimation heuristic.
Heuristic function: a function v(s) that returns an estimate of the value of state s. SimQ is adjusted to run the policy for h steps, ending in some state s', and to return the sum of rewards plus the estimate v(s'). A sketch of this adjustment follows.
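A sketch of that adjustment to the earlier sim_q sketch; v is any user-supplied heuristic value estimate and is an assumption here, not code from the lecture:

    def sim_q_with_heuristic(mdp, s, a, policy, h, v, gamma=1.0):
        """Like sim_q, but adds a heuristic estimate v(s') of the value beyond the horizon."""
        r = mdp.reward(s, a)
        s = mdp.transition(s, a)
        discount = gamma
        for _ in range(h - 1):
            a2 = policy(s)
            r += discount * mdp.reward(s, a2)
            s = mdp.transition(s, a2)
            discount *= gamma
        return r + discount * v(s)        # bootstrap the truncated tail with the heuristic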
28 Multi-Stage Rollout
[Figure: from state s, trajectories of SimQ(s,a_i,Rollout[π,h,w],h) for each action a_i]
Each step of the rollout policy requires nh simulator calls.
Two stages: computing the rollout policy of the "rollout policy of π" requires (nh)^2 calls to the simulator.
In general, the cost is exponential in the number of stages.
29 Example: Rollout for Solitaire [Yan et al. NIPS'04]
Multiple levels of rollout can pay off, but are expensive.
Player               | Success Rate | Time/Game
Human Expert         | 36.6%        | 20 min
(naïve) Base Policy  | 13.05%       | 0.021 sec
1 rollout            | 31.20%       | 0.67 sec
2 rollout            | 47.6%        | 7.13 sec
3 rollout            | 56.83%       | 1.5 min
4 rollout            | 60.51%       | 18 min
5 rollout            | 70.20%       | 1 hour 45 min
30 Rollout in 2-Player Games
[Figure: the UniformRollout diagram adapted to a two-player setting — from state s, each action a_1, ..., a_k gets w sampled returns q_i1, ..., q_iw, with players p1 and p2 labeled along the simulated trajectories]
31 Another Useful Technique: Policy Switching
Suppose you have a set of base policies {π_1, π_2, ..., π_M}.
Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select among them.
Policy switching is a simple way to select which policy to use at a given step via a simulator.
32 Another Useful Technique: Policy Switching
The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s. Implement it by simulating π starting in s for h steps and returning the discounted total reward.
Use a bandit algorithm to select the best policy, and then take the action chosen by that policy.
[Figure: a bandit at state s whose arms are the policies π_1, π_2, ..., π_M, returning samples Sim(s,π_1,h), Sim(s,π_2,h), ..., Sim(s,π_M,h)]
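A minimal sketch of Sim, under the same assumed simulator interface as before:

    def sim(mdp, s, policy, h, gamma=1.0):
        """One stochastic sample of the h-horizon (discounted) value of `policy` starting from s."""
        ret, discount = 0.0, 1.0
        for _ in range(h):
            a = policy(s)
            ret += discount * mdp.reward(s, a)
            s = mdp.transition(s, a)
            discount *= gamma
        return ret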
33 PolicySwitching
PolicySwitch[{π_1, π_2, ..., π_M},h,n](s)
    1. Define a bandit with M arms giving rewards Sim(s,π_i,h)
    2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
    3. Return action π_i*(s)
[Figure: from state s, each policy π_i gets w Sim(s,π_i,h) trajectories, each following π_i for h steps; the discounted cumulative rewards v_i1, v_i2, ..., v_iw are the bandit's reward samples]
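A sketch of the procedure using the uniform_bandit and sim sketches above as the "favorite bandit algorithm" (any simple-regret bandit could be substituted):

    def policy_switch(mdp, s, policies, h, n, gamma=1.0):
        """Pick the base policy whose sampled h-horizon value from s looks best, then act with it."""
        indices = list(range(len(policies)))
        best = uniform_bandit(indices,
                              pull=lambda i: sim(mdp, s, policies[i], h, gamma),
                              budget=n)
        return policies[best](s)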
34 Executing Policy Switching in Real World
[Figure: the agent is in real-world state s, runs a policy-switching rollout over π_1, π_2, ..., π_M (simulated experience), executes the chosen action (e.g., π_2(s)), reaches a new state s', runs another rollout, executes the next chosen action (e.g., π_k(s')), and so on]
35 Policy Switching: Quality
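The result on this slide did not survive extraction. The guarantee usually stated for policy switching is that the switching policy π_sw is at least as good as every base policy at every state,

    V_h^{\pi_{sw}}(s) \;\ge\; \max_{i} V_h^{\pi_i}(s) \quad \text{for all } s,

assuming the per-state policy values are evaluated exactly; with sampled estimates an additional approximation term appears.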
36 Policy Switching in 2-Player Games
37 Minimax Policy Switching
Build a game matrix from the game simulator at the current state s. Each entry gives the estimated value (for the max player) of playing a policy pair against one another; each value is estimated by averaging across w simulated games.
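A sketch of building and using that game matrix; simulate_game is an assumed helper that returns the max player's payoff for one simulated game in which the max player follows pi and the min player follows rho:

    def minimax_policy_switch(max_policies, min_policies, simulate_game, w, state):
        """Estimate the policy-vs-policy payoff matrix, pick the maximin row policy, act with it."""
        def entry(pi, rho):                            # average payoff of pi vs rho over w games
            return sum(simulate_game(state, pi, rho) for _ in range(w)) / w
        matrix = [[entry(pi, rho) for rho in min_policies] for pi in max_policies]
        best_row = max(range(len(max_policies)),
                       key=lambda i: min(matrix[i]))   # maximin choice for the max player
        return max_policies[best_row](state)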
38 Minimax Switching
[Figure: the game matrix built from the game simulator at the current state s]