1
Progressive Strategies for Monte-Carlo Tree Search
Authors: G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk, H.J. van den Herik, and B. Bouzy
Presenter: Ling Zhao, University of Alberta, November 5, 2007
2
2 Outline
- Monte-Carlo Tree Search (MCTS) and its implementation in MANGO.
- Progressive strategies: progressive bias and progressive unpruning.
- Experiments.
- Conclusions and future work.
3
3 MCTS
4
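4 Selection
- Process: select moves in the UCT tree so as to balance exploitation and exploration.
- Each node is treated as a multi-armed bandit problem.
- UCB formula: $k \in \arg\max_{i \in I} \left( v_i + C \sqrt{\frac{\ln n_p}{n_i}} \right)$, where I is the set of children of node p.
  - k: index of the selected child of node p
  - v_i: value of node i
  - n_i: visit count of node i
  - n_p: visit count of node p
  - C: a constant
- Selection precondition: n_p >= T (T = 30).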
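The selection rule on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not MANGO's code: the `Child` class, the `select_ucb` function, and the default C = 0.7 are hypothetical names and values chosen for the example. The score used is the standard UCB expression v_i + C * sqrt(ln n_p / n_i). Giving unvisited children infinite priority is a common convention that the slides do not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    value: float  # v_i: mean result of the simulated games through this child
    visits: int   # n_i: visit count of this child


def select_ucb(children, parent_visits, c=0.7):
    """Return the index of the child with the highest UCB score."""
    def score(child):
        if child.visits == 0:
            # Assumption: try every child once before comparing scores.
            return math.inf
        exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
        return child.value + exploration

    return max(range(len(children)), key=lambda i: score(children[i]))
```

With C = 0 the rule reduces to pure exploitation (pick the highest v_i); larger values of C favor children that have been visited less often.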
5
5 Expansion
- Process: for a given leaf node, decide whether to expand it by storing one or more of its children in the UCT tree.
- Simple rule: expand one node per simulated game (the first node encountered that is not yet in the UCT tree).
- In MANGO, when n_p = T (= 30), all children of node p are expanded.
6
6 Simulation
- Process: self-play until the end of the game.
- Rules:
  1. A player may not play in its own eyes.
  2. The game is stopped after a certain number of moves.
- In MANGO, the probability of selecting a move during simulation is proportional to its urgency, which is the sum of its capture value, its 3x3 pattern value, and a proximity modification.
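Urgency-proportional move selection can be sketched as weighted random sampling. The function names `urgency` and `pick_move` and the dictionary representation of moves are assumptions for illustration; how MANGO actually computes the three components is not shown on the slide.

```python
import random


def urgency(capture_value, pattern_value, proximity_modification):
    """Urgency as the sum of the three components named on the slide."""
    return capture_value + pattern_value + proximity_modification


def pick_move(urgencies, rng=random):
    """Sample one move with probability proportional to its urgency.

    urgencies: dict mapping a move to a non-negative urgency
    (at least one urgency must be positive).
    """
    moves = list(urgencies)
    weights = [urgencies[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```

For example, with urgencies `{"x": 3.0, "y": 1.0}` the move `"x"` is chosen about three times out of four.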
7
7 Backpropagation
- Process: use the result of a simulated game to update the nodes it traversed.
- Result: +1 for a win, -1 for a loss, 0 for a draw.
- The value v_i of node i is the average result of all simulated games played through it.
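The averaging step can be written as an incremental mean, so that no per-game history needs to be stored. This is a sketch only: the `Node` class and `backpropagate` function are hypothetical, and the sign flip between the two players' perspectives is left out because the slide does not describe it.

```python
class Node:
    def __init__(self):
        self.visits = 0   # n_i: number of simulated games through this node
        self.value = 0.0  # v_i: running mean of their results


def backpropagate(path, result):
    """Update every node on the traversed path with one game result.

    result: +1 for a win, -1 for a loss, 0 for a draw.
    """
    for node in path:
        node.visits += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        node.value += (result - node.value) / node.visits
```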
8
8 Progressive Strategies
- A soft transition between the simulation strategy and the selection strategy.
- Intuition: the selection strategy becomes more accurate than the simulation strategy only when the number of simulated games is large.
- A progressive strategy uses the information available to the selection strategy, plus some expensive domain knowledge.
- A progressive strategy behaves like the simulation strategy when few games have been played, and converges to the selection strategy when many games have been played.
9
9 Progressive Bias
- Direct the search using heuristic knowledge that may be expensive to compute.
- Modify the selection strategy so that the influence of this knowledge decreases quickly as more games are played.
10
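10 Progressive Bias Formula
- $k \in \arg\max_{i \in I} \left( v_i + C \sqrt{\frac{\ln n_p}{n_i}} + f(n_i) \right)$, with $f(n_i) = \frac{H_i}{n_i + 1}$.
- H_i is a coefficient representing heuristic knowledge.
- For children with n_i = 0, the UCB term (undefined when n_i = 0) is replaced by a constant M with M >> any v_i, so the child with the highest f(n_i) is selected.
- If n_p is in [30, 100], f(n_i) is dominant.
- If n_p is in (100, 500], f(n_i) has a partial impact.
- If n_p > 500, f(n_i) is dominated by the other terms, but can still act as a tie-breaker.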
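The effect of the bias term can be illustrated with a short sketch. It assumes the bias has the form f(n_i) = H_i / (n_i + 1), which decays as a child is visited more often; the function names, the tuple layout, and the constants C = 0.7 and M = 1e6 are hypothetical choices for the example.

```python
import math


def biased_score(v, n, h, n_parent, c=0.7, big_m=1e6):
    """UCB score plus the progressive bias f(n) = h / (n + 1)."""
    bias = h / (n + 1)
    if n == 0:
        # The UCB part is undefined for an unvisited child, so a large
        # constant M stands in for it; ties are then broken by the bias.
        return big_m + bias
    return v + c * math.sqrt(math.log(n_parent) / n) + bias


def select_with_bias(children, n_parent, c=0.7):
    """children: list of (v_i, n_i, H_i) tuples. Returns the chosen index."""
    return max(
        range(len(children)),
        key=lambda i: biased_score(*children[i], n_parent, c=c),
    )
```

When every child is unvisited, the child with the largest H_i is chosen. After many visits the bias shrinks toward zero and the choice is driven by v_i and the exploration term.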
11
11 Alternative Approach Using prior knowledge (Gelly and Silver): “Scalability of this approach to larger board sizes is an open question”.
12
12 Progressive Unpruning Reducing the branching factor artificially when the selection strategy is used. Increase the branching factor progressively when more games are simulated. Pruning or unpruning is done according to the heuristic value of the children.
13
13 Progressive Unpruning (Details)
- If n_p = T, only the k_0 (= 5) children with the highest heuristic values are left unpruned.
- If n_p > T, the k = lg(n_p / 40) * 2.67 + k_0 best children are left unpruned (lg is the base-2 logarithm; k is rounded down).
- Examples: k = 5 (n_p = 40), 7 (n_p = 80), 9 (n_p = 120).
- A similar idea is used by Coulom (progressive widening).
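The unpruning schedule can be sketched as follows. Rounding k down and never letting it fall below k_0 (which matters for n_p between 30 and 40, where the logarithm is negative) are assumptions made for this example, as are the function names and the `(move, heuristic)` tuple layout.

```python
import math


def unpruned_count(n_parent, k0=5, a=40, scale=2.67):
    """Number of children left unpruned after n_parent visits."""
    if n_parent <= a:
        # Assumption: never keep fewer than k0 children.
        return k0
    return k0 + math.floor(math.log2(n_parent / a) * scale)


def unpruned_children(children, n_parent):
    """Keep the k children with the highest heuristic values.

    children: list of (move, heuristic_value) tuples.
    """
    k = unpruned_count(n_parent)
    ranked = sorted(children, key=lambda c: c[1], reverse=True)
    return ranked[:k]
```

Because k grows only logarithmically with n_p, each doubling of the parent's visit count adds two or three candidate moves.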
14
14 Heuristic Values
- Pattern value: learned offline using pattern matching (89,119 patterns from 2000 pro games).
- Capture value: the number of stones the move captures or saves from capture.
- Proximity value: based on the Euclidean distance to the last move.
15
15 Heuristic Value Formula
- $H_i = (C_i + P_i) \sum_k \frac{1}{(2 D_{k,i})^{\alpha_k}}$, with $\alpha_k = 1.25 + k/2$.
- C_i: capture value
- P_i: pattern value
- D_{k,i}: distance to the k-th last move
- Computing P_i is the time-consuming part.
16
16 Time For Computing Heuristics
- Computing H is around 1000 times slower than playing a move in a simulated game.
- So H is computed only once per node, when T (= 30) games have been played through it.
- The speed reduction is only 4%, because the number of nodes with visit count >= 30 is low compared to the total number of moves in simulated games.
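Computing H lazily, once per node and only after the visit threshold is reached, can be sketched as below. The `Node` class, the `visit` function, and the `compute_heuristic` callback are hypothetical; the sketch shows only the caching pattern, not MANGO's data structures.

```python
T = 30  # visit threshold from the slides


class Node:
    def __init__(self, position):
        self.position = position
        self.visits = 0
        self.heuristic = None  # H, filled in at most once


def visit(node, compute_heuristic):
    """Count one visit and compute H the first time the threshold is met."""
    node.visits += 1
    if node.heuristic is None and node.visits >= T:
        # The expensive call happens once per node, never per simulation.
        node.heuristic = compute_heuristic(node.position)
    return node.heuristic
```

Nodes that never reach T visits, which is most of them, never pay for the expensive evaluation.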
17
17 Domain Knowledge Calls Vs. T
18
18 Visit Count Vs. Number of Nodes
19
19 Experiments
- Self-play on a 13x13 board (10 sec per move): MANGO with progressive strategies won 91% of 500 games against MANGO without them.
- MANGO: 20,000 simulated games per move (about 1 sec on 9x9, 2 sec on 13x13, 5 sec on 19x19).
- GNU Go: level 10 on 9x9 and 13x13, level 0 on 19x19.
20
20 MANGO Vs. GNU Go
21
21 MANGO Vs. GNU Go
- Plain MCTS does not scale well to 13x13 or 19x19 boards.
- Progressive strategies are useful on every board size.
- The two progressive strategies combined are the most powerful, especially on 19x19.
22
22 Tournament Results Always in the top half. But were negative results removed?
23
23 Conclusions and Future Work
- The two progressive strategies are useful because they provide a soft transition between simulation and selection.
- Overhead is negligible.
- Combine with RAVE and UCT with prior knowledge.
- Combine with the advanced knowledge developed by Coulom.
- Use life-and-death information.
- Better progressive bias.
- Reference: P.-A. Coquelin and R. Munos. Bandit Algorithm for Tree Search. Technical Report 6141, INRIA, 2007.