Presentation transcript:

1 Monte-Carlo Tree Search Alan Fern

2 Introduction
- Rollout does not guarantee optimality or near-optimality; it only guarantees policy improvement (under certain conditions).
- Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies, with computation that does NOT depend on the number of states? This was an open theoretical question until the late 1990s.
- Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time? Multi-stage rollout can improve arbitrarily, but each stage adds a big increase in time per move.

3 Look-Ahead Trees
- Rollout can be viewed as performing one level of search on top of the base policy.
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action.
- Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm, with strong theoretical guarantees of near-optimality.
[Figure: a look-ahead tree branching on actions a_1, a_2, ..., a_k. Caption: maybe we should search for multiple levels.]

4 Online Planning with Look-Ahead Trees
- At each state we encounter in the environment, build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action.
- Select the action with the highest Q-value estimate.
- The resulting loop (see the sketch below):
  s = current state
  Repeat:
    T = BuildLookAheadTree(s)   ;; sparse sampling or UCT; the tree provides Q-value estimates for the root actions
    a = BestRootAction(T)       ;; action with best Q-value
    Execute action a in the environment
    s = the resulting state
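To make the loop concrete, here is a minimal Python sketch of it. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the `tree.q_value(a)` accessor are assumptions for illustration, not part of the original slides.

```python
# A minimal sketch of the online planning loop, assuming a hypothetical interface:
#   env.reset() -> state, env.step(a) -> (next_state, reward, done)
#   build_lookahead_tree(s) -> object with q_value(a) estimates (sparse sampling or UCT)

def plan_and_act(env, build_lookahead_tree, actions, max_steps=100):
    """Repeatedly build a look-ahead tree at the current state and act greedily."""
    s = env.reset()                          # s = current state
    for _ in range(max_steps):
        tree = build_lookahead_tree(s)       # sparse sampling or UCT
        a = max(actions, key=tree.q_value)   # BestRootAction: highest Q-value estimate
        s, reward, done = env.step(a)        # execute the action in the real environment
        if done:
            break
```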

5 Planning with Look-Ahead Trees
[Figure: the real-world state/action sequence alternates with tree building. At each real state s a look-ahead tree is built by sampling: actions a_1 and a_2 at s, sampled next states s_11 ... s_1w and s_21 ... s_2w, and their rewards R(s_ij, a_k).]

Expectimax Tree (depth 1): alternate max & expectation
- Max node: max over actions. Expectation node: weighted average over next states s_1, ..., s_n, weighted by their probability of occurring after taking a in s.
- After taking each action there is a distribution over next states (nature's moves).
- The value of an action depends on the immediate reward and the weighted value of those next states.
[Figure: root max node s with actions a and b, each leading to an expectation node over next states s_1, s_2, ..., s_n.]

Expectimax Tree (depth 2): alternate max & expectation
- Max nodes: value equals the max of the values of the child expectation nodes.
- Expectation nodes: value is the weighted average of the values of their children.
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
[Figure: two-level tree alternating max nodes (actions a, b) and expectation nodes.]

Expectimax Tree (depth H)
- In general we can grow the tree to any horizon H.
- With k actions and n possible next states per action, the tree has on the order of (k·n)^H nodes: its size depends on the size of the state space. Bad!
[Figure: depth-H tree alternating actions a, b and all n next states at each expectation node.]
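For reference, a small Python sketch of the expectimax recursion described above. The model interface (`model.actions(s)` and `model.transitions(s, a)` returning `(probability, next_state, reward)` triples) is hypothetical, and the full enumeration of next states is exactly what makes the tree size depend on the state space.

```python
# A minimal expectimax sketch under assumed access to the full transition model:
#   model.actions(s) -> iterable of actions
#   model.transitions(s, a) -> list of (prob, next_state, reward)
# Because every reachable next state is enumerated at each expectation node,
# the tree blows up as roughly (k*n)**H -- the problem the slide calls "Bad!".

def expectimax_value(model, s, h, beta=0.95):
    """Expectimax value of state s with horizon h (leaf values = 0)."""
    if h == 0:
        return 0.0
    best = float("-inf")
    for a in model.actions(s):                       # max node: max over actions
        q = 0.0
        for prob, s2, r in model.transitions(s, a):  # expectation node: weighted average
            q += prob * (r + beta * expectimax_value(model, s2, h - 1, beta))
        best = max(best, q)
    return best
```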

Sparse Sampling Tree (depth H)
- Replace each exact expectation with an average over w sampled next states; w will typically be much smaller than n.
[Figure: depth-H tree with k actions and w sampled children per action.]

Sparse Sampling [Kearns et al., 2002]
The Sparse Sampling algorithm computes the root value via depth-first expansion. It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.

  SparseSampleTree(s, h, w)
    If h == 0 Then Return [0, null]
    For each action a in s
      Q*(s,a,h) = 0
      For i = 1 to w
        Simulate taking a in s, resulting in next state s_i and reward r_i
        [V*(s_i, h-1), a*] = SparseSampleTree(s_i, h-1, w)
        Q*(s,a,h) = Q*(s,a,h) + r_i + β V*(s_i, h-1)
      Q*(s,a,h) = Q*(s,a,h) / w       ;; estimate of Q*(s,a,h)
    V*(s,h) = max_a Q*(s,a,h)         ;; estimate of V*(s,h)
    a* = argmax_a Q*(s,a,h)
    Return [V*(s,h), a*]
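Below is one possible runnable Python rendering of this pseudocode (a sketch, not the authors' reference implementation). It assumes a hypothetical generative model `sim` with `sim.actions(s)` and `sim.sample(s, a)` returning a sampled `(next_state, reward)` pair.

```python
# A sketch of Sparse Sampling against a hypothetical generative model:
#   sim.actions(s) -> iterable of actions
#   sim.sample(s, a) -> (next_state, reward)   # one stochastic sample

def sparse_sample_tree(sim, s, h, w, beta=0.95):
    """Return (estimate of V*(s,h), estimated optimal action a*)."""
    if h == 0:
        return 0.0, None
    q = {}
    for a in sim.actions(s):
        total = 0.0
        for _ in range(w):                       # average over w sampled children
            s_i, r_i = sim.sample(s, a)
            v_i, _ = sparse_sample_tree(sim, s_i, h - 1, w, beta)
            total += r_i + beta * v_i
        q[a] = total / w                         # estimate of Q*(s,a,h)
    a_star = max(q, key=q.get)                   # argmax_a Q*(s,a,h)
    return q[a_star], a_star                     # V*(s,h) = max_a Q*(s,a,h)
```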

Example: SparseSample(h=2, w=2), alternating max & averaging
- Max nodes: value equals the max of the values of the child average nodes.
- Average nodes: value is the average of the values of their w sampled children.
- Compute values from the bottom up (leaf values = 0), then select the root action with the best value.
[Figure sequence: the two sampled children under action a return 10 and 0 from SparseSample(h=1, w=2), so a's value is 5; the children under action b return 4 and 2, so b's value is 3. Select action 'a' since its value is larger than b's.]

Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width w and depth H be? Answered by Kearns, Mansour, and Ng (1999).
- Good news: they give values for w and H that achieve a policy arbitrarily close to optimal, and those values are independent of the state-space size. This was the first near-optimal general MDP planning algorithm whose runtime did not depend on the size of the state space.
- Bad news: the theoretical values are typically still intractably large, and the tree is exponential in H. For example, with k = 5 actions, w = 10 samples, and H = 4, the tree already has (k·w)^H ≈ 6.3 million leaves. Exponential in H is the best we can do in general.
- In practice: use a small H and use a heuristic at the leaves.

Sparse Sampling (Cont'd)
- In practice: use a small H and evaluate leaves with a heuristic.
- For example, if we have a base policy, leaves can be evaluated by estimating the policy's value (i.e., the average reward across simulated runs of the policy), as sketched below.
[Figure: sparse sampling tree whose leaves are evaluated by policy simulations.]
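A small sketch of such a leaf evaluation, assuming the same hypothetical `sim.sample(s, a) -> (next_state, reward)` generative model as above and a `policy(s) -> action` function: the leaf value is the average discounted return over a few simulated runs of the policy.

```python
# Evaluate a leaf state by averaging the discounted return of several policy rollouts.
# sim.sample(s, a) -> (next_state, reward) and policy(s) -> action are assumed interfaces.

def rollout_leaf_value(sim, policy, s, num_rollouts=10, depth=50, beta=0.95):
    """Average discounted return of num_rollouts simulated runs of policy from s."""
    total = 0.0
    for _ in range(num_rollouts):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(depth):
            state, reward = sim.sample(state, policy(state))
            ret += discount * reward
            discount *= beta
        total += ret
    return total / num_rollouts
```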

18 Uniform vs. Adaptive Tree Search
- Sparse sampling wastes time on bad parts of the tree: it devotes equal resources to each state encountered in the tree.
- We would like to focus on the most promising parts of the tree. But how do we balance exploring new parts of the tree against exploiting parts that already look promising?
- We need an adaptive bandit algorithm that explores more effectively.

19 What now?
- Adaptive Monte-Carlo Tree Search
- UCB Bandit Algorithm
- UCT Monte-Carlo Tree Search

20 Bandits: Cumulative Regret Objective
[Figure: a single state s with k arms a_1, a_2, ..., a_k.]
- Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step).
- The optimal strategy (in expectation) is to pull the optimal arm n times.
- UniformBandit is a poor choice: it wastes time on bad arms.
- We must balance exploring machines to find good payoffs against exploiting current knowledge.

21 Cumulative Regret Objective
The cumulative regret after n pulls is the gap between n times the expected reward of the best arm and the expected total reward the strategy actually collects; the objective is to keep this regret small.

22 UCB Algorithm for Minimizing Cumulative Regret
- Q(a): average reward for trying action a (in our single state s) so far.
- n(a): number of pulls of arm a so far.
- Action choice by UCB after n pulls: pull the arm a maximizing Q(a) + sqrt( 2 ln(n) / n(a) ).
- Assumes rewards are in [0,1]; we can always normalize if we know the maximum value.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2).
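A minimal sketch of this choice rule in Python, using the standard UCB1 bonus from the cited paper; the dictionaries `q` and `counts` (average reward and pull count per arm) are bookkeeping assumed for illustration.

```python
import math

# UCB1 arm selection: after n total pulls, pull the arm maximizing
# Q(a) + sqrt(2 ln n / n(a)), after first trying every arm once.
# q[a] is the average reward of arm a so far; counts[a] its number of pulls.

def ucb1_choose(q, counts):
    n = sum(counts.values())
    for a, c in counts.items():
        if c == 0:                    # pull every arm once before using the bonus
            return a
    return max(q, key=lambda a: q[a] + math.sqrt(2.0 * math.log(n) / counts[a]))
```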

23 UCB: Bounded Sub-Optimality
- Value term: favors actions that looked good historically.
- Exploration term: actions get an exploration bonus that grows with ln(n).
- The expected number of pulls of a sub-optimal arm a is bounded by roughly (8 ln n) / Δ_a² (plus a constant), where Δ_a is the sub-optimality (gap) of arm a.
- So UCB doesn't waste much time on sub-optimal arms, unlike the uniform strategy!

24 UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
UCB achieves expected cumulative regret that grows only logarithmically in the number of pulls n, which is the best possible rate up to constant factors.

25 What now?
- Adaptive Monte-Carlo Tree Search
- UCB Bandit Algorithm
- UCT Monte-Carlo Tree Search

UCT Algorithm [Kocsis & Szepesvari, 2006]
- UCT is an instance of Monte-Carlo Tree Search that applies the principle of UCB.
- It has similar theoretical properties to sparse sampling, but much better anytime behavior.
- Famous for yielding a major advance in computer Go, with a growing number of other success stories.

Monte-Carlo Tree Search
- Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy".
- During construction, each tree node s stores: a state-visitation count n(s), action counts n(s,a), and action values Q(s,a).
- Repeat until time is up (see the sketch below):
  1. Execute the rollout policy starting from the root until the horizon (this generates a state-action-reward trajectory).
  2. Add the first node not in the current tree to the tree.
  3. Update the statistics of each tree node s on the trajectory: increment n(s) and n(s,a) for the selected action a, and update Q(s,a) with the total reward observed after the node.
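The following Python sketch shows one way to organize this loop (an illustration, not Fern's or Kocsis & Szepesvari's code). It assumes hashable states, a generative model `sim.sample(s, a) -> (next_state, reward)` with `sim.actions(s)`, and caller-supplied `tree_policy` and `default_policy` functions.

```python
import time

# A compact sketch of the MCTS loop described above. Node statistics
# n(s), n(s,a), Q(s,a) live in the `nodes` dict, keyed by (hashable) state.

def mcts(sim, root, tree_policy, default_policy, horizon, time_budget=1.0):
    nodes = {root: {"n": 0, "n_sa": {}, "q": {}}}        # tree rooted at current state
    deadline = time.time() + time_budget
    while time.time() < deadline:
        s, traj, in_tree = root, [], True
        for _ in range(horizon):                         # 1. roll out to the horizon
            if in_tree and s in nodes:
                a = tree_policy(nodes[s], sim.actions(s))
            else:
                if in_tree:                              # 2. add first off-tree node
                    nodes[s] = {"n": 0, "n_sa": {}, "q": {}}
                    in_tree = False
                a = default_policy(s, sim.actions(s))
            s2, r = sim.sample(s, a)
            traj.append((s, a, r))
            s = s2
        total = 0.0                                      # 3. update statistics bottom-up
        for s, a, r in reversed(traj):
            total += r                                   # total reward observed after node s
            if s in nodes:
                node = nodes[s]
                node["n"] += 1
                node["n_sa"][a] = node["n_sa"].get(a, 0) + 1
                old_q = node["q"].get(a, 0.0)
                node["q"][a] = old_q + (total - old_q) / node["n_sa"][a]
    return nodes
```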

Example MCTS
- For illustration purposes we will assume the MDP is deterministic and that the only non-zero rewards are at terminal/leaf nodes.
- The algorithm is well-defined without these assumptions.

Example iterations
[Figure sequence over the current world state:]
- Iteration 1: initially the tree is a single leaf. The rollout policy is run to a terminal node with reward 1 (recall that all non-zero reward occurs at terminal nodes), and the first node encountered off the tree is added as a new tree node.
- Iteration 2: another rollout reaches a terminal node with reward 0, and again a new tree node is added.
- Iterations 3-4: further rollouts add new tree nodes and update the value estimates stored in the tree (e.g., 1, 1/2, 1/3, 2/3, 0 at different nodes).
- This raises the question: what should the rollout policy be?

Rollout Policies
- Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy.
- Rollout policies have two distinct phases:
  - Tree policy: selects actions at nodes already in the tree (each action must be selected at least once).
  - Default policy: selects actions after leaving the tree.
- Key idea: the tree policy can use the statistics collected from previous trajectories to intelligently expand the tree in the most promising direction, rather than uniformly exploring actions at each node.

38 UCT Algorithm [Kocsis & Szepesvari, 2006]
- Basic UCT uses a random default policy; in practice a hand-coded or learned policy is often used.
- The tree policy is based on UCB: at node s, select the action a maximizing Q(s,a) + K · sqrt( ln n(s) / n(s,a) ), where
  - Q(s,a): average reward received in trajectories so far after taking action a in state s,
  - n(s,a): number of times action a has been taken in s,
  - n(s): number of times state s has been encountered,
  - K: a theoretical constant that must be selected empirically in practice.
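A sketch of this tree policy, written against the node dictionaries used in the earlier MCTS sketch (`n`, `n_sa`, `q` are assumed bookkeeping fields, and `K` is the exploration constant mentioned above):

```python
import math

# UCT tree policy: try any untried action first, then pick
# argmax_a Q(s,a) + K * sqrt(ln n(s) / n(s,a)).

def uct_tree_policy(node, actions, K=1.0):
    for a in actions:
        if node["n_sa"].get(a, 0) == 0:      # each action must be selected at least once
            return a
    return max(actions, key=lambda a: node["q"][a]
               + K * math.sqrt(math.log(node["n"]) / node["n_sa"][a]))
```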

Example (continued)
[Figure sequence over the current world state:] once all actions at a tree node have been tried at least once, actions at that node are selected according to the tree policy rather than the default policy. Each new trajectory updates the node statistics along its path (e.g., value estimates such as 1/2, 1/3, 1/4, 2/5 at interior nodes and 0, 0/1, 1 at leaves).

45 UCT Recap
- To select an action at a state s, build a tree using N iterations of Monte-Carlo tree search, with a uniform-random default policy and a tree policy based on the UCB rule.
- Then select the action that maximizes Q(s,a). Note that this final action selection does not take the exploration term into account, just the Q-value estimate (see the sketch below).
- The more simulations, the more accurate the result.
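The final selection step is then just an argmax over the root node's Q-value estimates, with no exploration bonus; a one-line sketch using the same hypothetical node dictionary as before:

```python
# Pick the root action with the highest Q-value estimate (no exploration term),
# e.g. best_root_action(nodes[root]) with the `nodes` dict from the MCTS sketch.

def best_root_action(root_node):
    return max(root_node["q"], key=root_node["q"].get)
```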

Some Successes
- Computer Go
- Klondike Solitaire (wins 40% of games)
- General Game Playing Competition
- Real-Time Strategy Games
- Combinatorial Optimization
- Crowd Sourcing
The list is growing, and these applications usually extend UCT in some way.

Practical Issues: Selecting K
- There is no fixed rule; experiment with different values in your domain.
- Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect.
- The best value of K may depend on the number of iterations N, but usually a single value of K works well once N is large enough.

Practical Issues
- UCT can have trouble building deep trees when actions can result in a large number of possible next states.
- Each time we try action 'a' we get a new state (a new leaf) and very rarely resample an existing leaf.
[Figure: state s with actions a and b, where each sample of a produces a fresh child.]

Practical Issues (cont'd)
- In that case the search degenerates into a depth-1 tree search.
- Solution: use a "sparseness parameter" that controls how many new children of an action will be generated (see the sketch below).
- Set the sparseness to something like 5 or 10, then check whether smaller values hurt performance or not.
[Figure: state s whose actions a and b each keep a bounded set of sampled children.]
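A sketch of one way to implement such a sparseness cap (hypothetical `sim.sample(s, a) -> (next_state, reward)` interface as before; `children` is a dictionary the caller keeps from one simulation to the next, keyed by hashable (state, action) pairs):

```python
import random

# Cap the number of distinct sampled children stored per (state, action).
# Once the cap is reached, reuse an existing child instead of creating a new leaf.

def sample_child(sim, children, s, a, sparseness=5):
    kids = children.setdefault((s, a), [])     # children already generated for (s, a)
    if len(kids) < sparseness:
        s2, r = sim.sample(s, a)               # generate a new child (new leaf)
        kids.append((s2, r))
        return s2, r
    return random.choice(kids)                 # otherwise resample an existing child
```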

Some Improvements
- Use domain knowledge to handcraft a more intelligent default policy than random, e.g. one that never chooses obviously stupid actions (in Go, a hand-coded default policy is used).
- Learn a heuristic function to evaluate positions, and use it to initialize leaf nodes (which are otherwise initialized to zero). A sketch of both ideas follows.
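A sketch of both ideas, where `heuristic(s, a)` and `is_obviously_bad(s, a)` stand in for domain-specific knowledge that the slides leave abstract:

```python
import random

# (1) Initialize a new tree node's action values with a heuristic instead of zeros,
#     using a small count of "virtual visits" so the values act like prior samples.
# (2) An informed default policy that avoids obviously bad actions.

def make_node(s, actions, heuristic, prior_count=1):
    return {"n": 0,
            "n_sa": {a: prior_count for a in actions},    # virtual visits
            "q": {a: heuristic(s, a) for a in actions}}   # heuristic initialization

def informed_default_policy(s, actions, is_obviously_bad):
    good = [a for a in actions if not is_obviously_bad(s, a)]
    return random.choice(good if good else list(actions))
```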