UCT (Upper Confidence bounds applied to Trees)
UCT is an efficient game-tree search algorithm for the game of Go by Levente Kocsis and Csaba Szepesvari [1]. The UCB1 (Upper Confidence Bound) algorithm [2] is employed to search the game tree. Features of UCT include:
- balancing the exploration-exploitation dilemma,
- a small probability of selecting erroneous moves even if the algorithm is stopped prematurely,
- convergence to the best action if sufficient time is allowed,
- selective sampling of actions with a fixed look-ahead depth d,
- a tree built incrementally by selecting one branch, or episode, at a time (Figure 2).

Seen from UCT, the Go problem is a multi-armed bandit problem [3], i.e. a K-armed bandit (Figure 1).

Figure 1. The multi-armed bandit problem.
Figure 2. The game tree expressed in terms of the Markov Decision Process.

A bandit problem is similar to a slot machine with multiple levers, and the goal is to maximize the expected payoff. The problem can be formulated as a Markov Decision Process:
- state s is the current board position,
- action a is the choice of a bandit, from the action set A = {1, ..., K},
- reward r_{a,n} is the payoff of action a at index n, in the range [0, 1].

In the problem of Go, the goal is to select the best move, action, or bandit that maximizes the expected total reward given a state (the positions on the Go board). UCT selects the action a that maximizes the sum of the estimated action-value function at depth d, Q_t(s,a), and a bias term b(s,a):

    a* = argmax_{a in A} [ Q_t(s,a) + b(s,a) ]                        (1)

Pseudocode for a generic Monte-Carlo planning algorithm is given in [1].

The action-value function Q_t(s,a) is estimated by averaging the rewards r(a,n), where N_{s,a,t} is the number of times action a has been selected by time t:

    Q_t(s,a) = (1 / N_{s,a,t}) * sum_{n=1..N_{s,a,t}} r(a,n)          (2)

The bias term used by UCT is

    b(s,a) = C_d * sqrt( ln(N_{s,t}) / N_{s,a,t} )                    (3)

where C_d is a non-zero constant that prevents drifting and N_{s,t} is the number of times state s has been visited by time t.
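Equations (1)-(3) can be combined into a short illustrative sketch. The names here (`ucb1_select`, `run_bandit`, `counts`, `q`) are invented for illustration, not from the poster's implementation; C_d is taken as sqrt(2), the standard UCB1 constant, and the Go position is reduced to a single-state Bernoulli bandit so the convergence-to-the-best-action property can be observed directly.

```python
import math
import random

def ucb1_select(counts, q, total, c=math.sqrt(2)):
    """Equation (1): return argmax_a [ Q_t(s,a) + b(s,a) ], with the
    bias term b(s,a) = c * sqrt(ln N_{s,t} / N_{s,a,t}) of equation (3).
    Unvisited actions are tried first so the bias is well defined."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: q[a] + c * math.sqrt(math.log(total) / counts[a]))

def run_bandit(means, episodes=5000, seed=0):
    """Simulate UCT at a single state: each episode selects an arm,
    draws a Bernoulli reward in [0, 1], and updates the running mean
    Q_t(s,a) incrementally, which is equation (2) in online form."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    q = [0.0] * len(means)
    for t in range(1, episodes + 1):
        a = ucb1_select(counts, q, t)
        r = 1.0 if rng.random() < means[a] else 0.0  # simulated payoff
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]               # incremental average, eq. (2)
    return counts, q
```

With hypothetical arm payoff probabilities (0.2, 0.8), the better arm accumulates the vast majority of pulls while the weaker arm is still sampled occasionally, matching the exploration-exploitation balance described above.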
GPU GO: Accelerating Monte-Carlo UCT
Tae-hyung Kim, Ryan Meuth, Paul Robinette, Donald C. Wunsch II
Department of Electrical and Computer Engineering, University of Missouri - Rolla, Rolla, MO 65409
tk424@umr.edu, rmeuth@umr.edu, pmrmq3@umr.edu, dwunsch@umr.edu

Combining novel computational-intelligence methods with state-of-the-art brute-force acceleration to solve a game with 10^60 states and wide-ranging applications in science and industry. Go is considered the most important strategy game still to be solved by computers.

A Go board. Image source: http://upload.wikimedia.org/wikipedia/en/2/2e/Go_board.jpg

Applications in Strategy and Economics
Any solution to Go, and this algorithm in particular, can be applied to decision processes such as how to price gasoline for maximum profit. Image source: http://www.gasbuddy.com/gb_gastemperaturemap.aspx
The exploration-exploitation components of this algorithm, combined with the concept of territory control offered by Go, can give insight into deciding where to place a new Starbucks in Manhattan. Image source: http://gothamist.com/attachments/jake/2006_10_starbuckslocations.jpg

GPU: Graphics Processing Units
GPU capabilities are increasing exponentially through the use of parallel architectures. A GPU implementation of HDP yields a 22x performance increase on legacy hardware. Modern graphics processing units (GPUs) and game consoles are used for much more than 3D graphics applications and video games: from machine vision to finite-element analysis, GPUs are being used in diverse applications, collectively called General-Purpose computation on Graphics Processing Units (GPGPU). Additionally, game consoles are entering the market of high-performance computing as inexpensive nodes in computing clusters.
NVIDIA Tesla GPU. Image source: http://www.nvidia.com/docs/IO/43399/tesla_main.jpg
Significant performance gains can be achieved by implementing neural networks and approximate dynamic programming algorithms on graphics processing units. However, there is an element of art to these implementations: in some cases the gains can be as high as 200x, but they can also be as low as 2x, or even below CPU performance. It is therefore necessary to understand the limitations of the graphics processing hardware and to take them into account when developing algorithms targeted at the GPU. In the context of Go, GPUs can not only reduce the time needed to evaluate game states, but can also serve as massively parallel processors, computing move trees in parallel.

Figure 1 (detail): a decision maker choosing among the 1st, 2nd, ..., Kth bandits via actions a1, a2, ..., aK.
Figure 2 (detail): an episode s0, a0, s1, a1, ..., sn traced through the game tree.

References
[1] Levente Kocsis and Csaba Szepesvari, "Bandit Based Monte-Carlo Planning", Lecture Notes in Artificial Intelligence, 2006.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem", Machine Learning, 2002.
[3] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction", MIT Press, 1998.

The most natural application of Go is in military strategy. The UCT algorithm can be applied to maximizing zones of control, planning efficient and safe routes, and allocating resources correctly. Image source: http://www.globalsecurity.org/military/ops/images/oif_mil-ground-routes-map_may04.jpg
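The massively parallel playout pattern described in the GPU discussion can be sketched on a CPU, with a thread pool standing in for GPU thread blocks and a seeded coin flip standing in for a full random Go playout. All names here (`random_playout`, `batch_playouts`) are illustrative assumptions, not the poster's actual implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def random_playout(seed):
    """One Monte-Carlo episode from the current position to a terminal
    state. A real implementation would play random legal Go moves and
    score the final board; here a seeded coin flip stands in for that,
    returning a reward in [0, 1]."""
    return 1.0 if random.Random(seed).random() < 0.5 else 0.0

def batch_playouts(n_playouts, workers=8):
    """Launch independent playouts concurrently and average their
    rewards. On a GPU the same map-reduce shape runs with one playout
    per thread across thousands of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rewards = list(pool.map(random_playout, range(n_playouts)))
    return sum(rewards) / n_playouts
```

Because every playout is independent, this workload scales with the number of processing elements, which is what makes move-tree evaluation attractive for GPU acceleration.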