1 Monte-Carlo Tree Search Alan Fern. 2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement.

1 Monte-Carlo Tree Search Alan Fern

2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement (under certain conditions)  Theoretical Question: Can we develop Monte-Carlo methods that give us near optimal policies?  With computation that does NOT depend on number of states!  This was an open theoretical question until late 90’s.  Practical Question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?  Multi-stage rollout can improve arbitrarily, but each stage adds a big increase in time per move.

3 Look-Ahead Trees  Rollout can be viewed as performing one level of search on top of the base policy  In deterministic games and search problems it is common to build a look-ahead tree at a state to select best action  Can we generalize this to general stochastic MDPs?  Sparse Sampling is one such algorithm  Strong theoretical guarantees of near optimality a1a1 a 2 a k … … … … … … … … … Maybe we should search for multiple levels.

4 Online Planning with Look-Ahead Trees  At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate optimal Q-values of each action  Select action with highest Q-value estimate  s = current state  Repeat  T = BuildLookAheadTree(s) ;; sparse sampling or UCT ;; tree provides Q-value estimates for root action  a = BestRootAction(T) ;; action with best Q-value  Execute action a in environment  s is the resulting state

5 Planning with Look-Ahead Trees …… s a2 a1 Build look-ahead tree Real world state/action sequence s a1a1 a2a2 … a1a1 a2a2 … s 11 a1a1 a2a2 s 1w … a1a1 a2a2 … s 21 a1a1 a2a2 s 2w … R(s 11,a 1 ) R(s 11,a 2 ) R(s 1w,a 1 ) R(s 1w,a 2 ) R(s 21,a 1 ) R(s 21,a 2 ) R(s 2w,a 1 ) R(s 2w,a 2 ) s a1a1 a2a2 … a1a1 a2a2 … s 11 a1a1 a2a2 s 1w … a1a1 a2a2 … s 21 a1a1 a2a2 s 2w … R(s 11,a 1 ) R(s 11,a 2 ) R(s 1w,a 1 ) R(s 1w,a 2 ) R(s 21,a 1 ) R(s 21,a 2 ) R(s 2w,a 1 ) R(s 2w,a 2 ) …………

Expectimax Tree (depth 1) Alternate max & expectation …… s max node (max over actions) expectation node (weighted average over states) a b States -- weighted by probability of occuring after taking a in s  After taking each action there is a distribution over next states (nature’s moves)  The value of an action depends on the immediate reward and weighted value of those states s 1 s 2 s n-1 s n

Expectimax Tree (depth 2) Alternate max & expectation …… … … … … …...... s max node (max over actions) expectation node (max over actions) a b Max Nodes: value equal to max of values of child expectation nodes Expectation Nodes: value is weighted average of values of its children Compute values from bottom up (leaf values = 0). Select root action with best value. a b a b

Expectimax Tree (depth H) …… … … … … …...... s a b a b a b … … H # states n k = # actions In general can grow tree to any horizon H Size depends on size of the state-space. Bad!

Sparse Sampling Tree (depth H) … …...... s a b a b a b H # samples w k = # actions Replace expectation with average over w samples w will typically be much smaller than n.

Sparse Sampling [Kearns et. al. 2002] SparseSampleTree(s,h,w) If h == 0 Then Return [0, null] For each action a in s Q*(s,a,h) = 0 For i = 1 to w Simulate taking a in s resulting in s i and reward r i [V*(s i,h),a*] = SparseSample(s i,h-1,w) Q*(s,a,h) = Q*(s,a,h) + r i + β V*(s i,h) Q*(s,a,h) = Q*(s,a,h) / w ;; estimate of Q*(s,a,h) V*(s,h) = max a Q*(s,a,h) ;; estimate of V*(s,h) a* = argmax a Q*(s,a,h) Return [V*(s,h), a*] The Sparse Sampling algorithm computes root value via depth first expansion Return value estimate V*(s,h) of state s and estimated optimal action a*

SparseSample(h=2,w=2) Alternate max & averaging s a Max Nodes: value equal to max of values of child expectation nodes Average Nodes: value is average of values of w sampled children Compute values from bottom up (leaf values = 0). Select root action with best value.

SparseSample(h=2,w=2) Alternate max & averaging s a Max Nodes: value equal to max of values of child expectation nodes Average Nodes: value is average of values of w sampled children Compute values from bottom up (leaf values = 0). Select root action with best value. SparseSample(h=1,w=2) 10

SparseSample(h=2,w=2) Alternate max & averaging s a Max Nodes: value equal to max of values of child expectation nodes Average Nodes: value is average of values of w sampled children Compute values from bottom up (leaf values = 0). Select root action with best value. SparseSample(h=1,w=2) 10 SparseSample(h=1,w=2) 0

SparseSample(h=2,w=2) Alternate max & averaging s a Max Nodes: value equal to max of values of child expectation nodes Average Nodes: value is average of values of w sampled children Compute values from bottom up (leaf values = 0). Select root action with best value. SparseSample(h=1,w=2) 10 SparseSample(h=1,w=2) 0 5

SparseSample(h=2,w=2) Alternate max & averaging s a Max Nodes: value equal to max of values of child expectation nodes Average Nodes: value is average of values of w sampled children Compute values from bottom up (leaf values = 0). Select root action with best value. 5 SparseSample(h=1,w=2) 4 2 b 3 Select action ‘a’ since its value is larger than b’s.

Sparse Sampling (Cont’d)  For a given desired accuracy, how large should sampling width and depth be?  Answered: Kearns, Mansour, and Ng (1999)  Good news: can gives values for w and H to achieve policy arbitrarily close to optimal  Values are independent of state-space size!  First near-optimal general MDP planning algorithm whose runtime didn’t depend on size of state-space  Bad news: the theoretical values are typically still intractably large---also exponential in H  Exponential in H is the best we can do in general  In practice: use small H and use heuristic at leaves

Sparse Sampling (Cont’d)  In practice: use small H & evaluate leaves with heuristic  For example, if we have a policy, leaves could be evaluated by estimating the policy value (i.e. average reward across runs of policy) …… … … … … …...... a1a1 a2a2 policy sims

18 Uniform vs. Adaptive Tree Search  Sparse sampling wastes time on bad parts of tree  Devotes equal resources to each state encountered in the tree  Would like to focus on most promising parts of tree  But how to control exploration of new parts of tree vs. exploiting promising parts?  Need adaptive bandit algorithm that explores more effectively

19 What now?  Adaptive Monte-Carlo Tree Search  UCB Bandit Algorithm  UCT Monte-Carlo Tree Search

20 Bandits: Cumulative Regret Objective s a1a1 a2a2 akak …  Problem: find arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)  Optimal (in expectation) is to pull optimal arm n times  UniformBandit is poor choice --- waste time on bad arms  Must balance exploring machines to find good payoffs and exploiting current knowledge

21 Cumulative Regret Objective

22 UCB Algorithm for Minimizing Cumulative Regret  Q(a) : average reward for trying action a (in our single state s) so far  n(a) : number of pulls of arm a so far  Action choice by UCB after n pulls:  Assumes rewards in [0,1]. We can always normalize if we know max value. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2), 235-256.

23 UCB: Bounded Sub-Optimality Value Term: favors actions that looked good historically Exploration Term: actions get an exploration bonus that grows with ln(n) Expected number of pulls of sub-optimal arm a is bounded by: where is the sub-optimality of arm a Doesn’t waste much time on sub-optimal arms, unlike uniform!

24 UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]

25 What now?  Adaptive Monte-Carlo Tree Search  UCB Bandit Algorithm  UCT Monte-Carlo Tree Search

 UCT is an instance of Monte-Carlo Tree Search  Applies principle of UCB  Similar theoretical properties to sparse sampling  Much better anytime behavior than sparse sampling  Famous for yielding a major advance in computer Go  A growing number of success stories UCT Algorithm [Kocsis & Szepesvari, 2006]

 Builds a sparse look-ahead tree rooted at current state by repeated Monte-Carlo simulation of a “rollout policy”  During construction each tree node s stores:  state-visitation count n(s)  action counts n(s,a)  action values Q(s,a)  Repeat until time is up 1. Execute rollout policy starting from root until horizon (generates a state-action-reward trajectory) 2. Add first node not in current tree to the tree 3. Update statistics of each tree node s on trajectory  Increment n(s) and n(s,a) for selected action a  Update Q(s,a) by total reward observed after the node Monte-Carlo Tree Search

 For illustration purposes we will assume MDP is:  Deterministic  Only non-zero rewards are at terminal/leaf nodes  Algorithm is well-defined without these assumpitons Example MCTS

Current World State Rollout Policy Terminal (reward = 1) 1 1 1 Initially tree is single leaf Assume all non-zero reward occurs at terminal nodes. new tree node Iteration 1

Current World State 1 1 0 Rollout Policy Terminal (reward = 0) new tree node Iteration 2

Current World State 1 1/2 0 Iteration 3

Current World State 1 1/2 0 Rollout Policy Iteration 3

Current World State 1 1/2 0 Rollout Policy 0 new tree node Iteration 3

Current World State 1/2 1/3 0 Rollout Policy 0 Iteration 4

Current World State 1 1/2 1/3 0 0 Iteration 4

Current World State 2/3 0 Rollout Policy 0 What should the rollout policy be? 1

 Monte-Carlo Tree Search algorithms mainly differ on their choice of rollout policy  Rollout policies have two distinct phases  Tree policy: selects actions at nodes already in tree (each action must be selected at least once)  Default policy: selects actions after leaving tree  Key Idea: the tree policy can use statistics collected from previous trajectories to intelligently expand tree in most promising direction  Rather than uniformly explore actions at each node Rollout Policies

38  Basic UCT uses random default policy  In practice often use hand-coded or learned policy  Tree policy is based on UCB:  Q(s,a) : average reward received in trajectories so far after taking action a in state s  n(s,a) : number of times action a taken in s  n(s) : number of times state s encountered Theoretical constant that must be selected empirically in practice UCT Algorithm [Kocsis & Szepesvari, 2006]

Current World State 1 1/2 1/3 When all node actions tried once, select action according to tree policy 0 Tree Policy 0 a1a1 a2a2

Current World State 1 1/2 1/3 When all node actions tried once, select action according to tree policy 0 Tree Policy 0

Current World State 1 1/2 1/3 When all node actions tried once, select action according to tree policy 0 Tree Policy 0 0

Current World State 1 1/3 1/4 When all node actions tried once, select action according to tree policy 0 Tree Policy 0/1 0

Current World State 1 1/3 1/4 When all node actions tried once, select action according to tree policy 0 Tree Policy 0/1 0 1

Current World State 1 1/3 2/5 When all node actions tried once, select action according to tree policy 1/2 Tree Policy 0/1 0 1

45 UCT Recap  To select an action at a state s  Build a tree using N iterations of monte-carlo tree search  Default policy is uniform random  Tree policy is based on UCB rule  Select action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)  The more simulations the more accurate

Some Successes Computer Go Klondike Solitaire (wins 40% of games) General Game Playing Competition Real-Time Strategy Games Combinatorial Optimization Crowd Sourcing List is growing Usually extend UCT is some ways

Practical Issues Selecting K There is no fixed rule Experiment with different values in your domain Rule of thumb – try values of K that are of the same order of magnitude as the reward signal you expect The best value of K may depend on the number of iterations N Usually a single value of K will work well for values of N that are large enough

Practical Issues a b s UCT can have trouble building deep trees when actions can result in a large # of possible next states Each time we try action ‘a’ we get a new state (new leaf) and very rarely resample an existing leaf

Practical Issues UCT can have trouble building deep trees when actions can result in a large # of possible next states Each time we try action ‘a’ we get a new state (new leaf) and very rarely resample an existing leaf a b s

Practical Issues Degenerates into a depth 1 tree search Solution: We have a “sparseness parameter” that controls how many new children of an action will be generated Set sparseness to something like 5 or 10 then see if smaller values hurt performance or not a b s …..

Some Improvements Use domain knowledge to handcraft a more intelligent default policy than random E.g. don’t choose obviously stupid actions In Go a hand-coded default policy is used Learn a heuristic function to evaluate positions Use the heuristic function to initialize leaf nodes (otherwise initialized to zero)

1 Monte-Carlo Tree Search Alan Fern. 2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement.

Similar presentations

Presentation on theme: "1 Monte-Carlo Tree Search Alan Fern. 2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Monte-Carlo Tree Search Alan Fern. 2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement.

Similar presentations

Presentation on theme: "1 Monte-Carlo Tree Search Alan Fern. 2 Introduction  Rollout does not guarantee optimality or near optimality  It only guarantees policy improvement."— Presentation transcript:

Similar presentations

About project

Feedback