Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, et al.
The Game of Go (围棋, 囲碁, 바둑)
Invented in China more than 2,000 years ago.
Two simple rules:
- Players alternately place black and white stones on the intersections of the board; the player who controls the most intersections at the end wins.
- If all liberties of a stone, or of a group of connected stones, are occupied by stones of the other color, the stone or group is removed from the board.
A huge search space: the board has 361 (19 × 19) intersections, giving approximately 250^150 possible sequences of moves (a branching factor of about 250 over a game of about 150 moves).
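The capture rule above is easy to state procedurally: flood-fill the group of connected stones containing a given stone and count its liberties, i.e. the empty intersections adjacent to the group. Below is a minimal Python sketch of that check; the board encoding (a 19 × 19 grid of '.', 'B' and 'W' characters) and the function names are illustrative assumptions, not anything taken from the paper.

```python
# Minimal sketch of the capture rule: find the group of connected stones
# containing (row, col), count its liberties, and remove it if it has none.
# Board encoding ('.', 'B', 'W' in a list of lists) is an assumption.

def group_and_liberties(board, row, col):
    """Return (group, liberties) for the stone at (row, col)."""
    color = board[row][col]
    assert color in ('B', 'W'), "no stone at this intersection"
    size = len(board)
    group, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == '.':
                    liberties.add((nr, nc))      # empty neighbour = liberty
                elif board[nr][nc] == color:
                    stack.append((nr, nc))       # same colour: same group
    return group, liberties

def capture_if_dead(board, row, col):
    """Remove the group at (row, col) from the board if it has no liberties."""
    group, liberties = group_and_liberties(board, row, col)
    if not liberties:
        for r, c in group:
            board[r][c] = '.'
    return len(liberties) == 0
```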
Monte Carlo Tree Search (MCTS)
- Selection: start from the root R and select successive child nodes down to a leaf node L. Child nodes are chosen by a rule that balances exploration and exploitation, which lets the game tree expand towards the most promising moves; this is the essence of Monte Carlo tree search.
- Expansion: unless L ends the game with a win/loss for either player, create one or more child nodes and choose a node C from them.
- Simulation: play a random playout from node C.
- Backpropagation: use the result of the playout to update the information in the nodes on the path from C back to R.
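A generic version of these four steps can be written down compactly. The sketch below uses UCT for the selection step; the GameState interface (copy, legal_moves, play, is_terminal, result, player_just_moved) is a hypothetical placeholder chosen for illustration, not the interface used by AlphaGo.

```python
# Minimal, generic MCTS sketch following the four steps above:
# selection (UCT), expansion, random simulation, backpropagation.

import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state                      # game state reached at this node
        self.parent = parent
        self.move = move                        # move that led from the parent to here
        self.children = []
        self.untried = list(state.legal_moves())
        self.visits = 0
        self.wins = 0.0                         # wins for the player who just moved into this node

    def uct_child(self, c=1.4):
        # Selection rule: trade off exploitation (win rate) and exploration.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node, state = root, root_state.copy()

        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = node.uct_child()
            state.play(node.move)

        # 2. Expansion: add one child node for an untried move.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            state.play(move)
            node.children.append(Node(state.copy(), parent=node, move=move))
            node = node.children[-1]

        # 3. Simulation: random playout from C to the end of the game.
        while not state.is_terminal():
            state.play(random.choice(list(state.legal_moves())))

        # 4. Backpropagation: update statistics on the path from C back to R.
        while node is not None:
            node.visits += 1
            # result(player) is assumed to return 1 for a win and 0 for a loss
            # from the point of view of the player who just moved into `node`.
            node.wins += state.result(node.state.player_just_moved)
            node = node.parent

    # Play the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```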
MCTS
Supervised Learning of Policy Network
The SL policy network pσ(a|s) alternates between convolutional layers with weights σ and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state. The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s: Δσ ∝ ∂log pσ(a|s)/∂σ.
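The training rule is stochastic gradient ascent on log pσ(a|s). The sketch below shows one such update for a single linear-plus-softmax layer over flattened board features; the paper's network is a much deeper convolutional network, so the shapes, feature encoding and learning rate here are illustrative assumptions only.

```python
# One supervised-learning update: gradient ascent on log p_sigma(a|s)
# for a linear + softmax stand-in policy (shapes are assumptions).

import numpy as np

N_FEATURES = 19 * 19        # flattened board representation (assumption)
N_MOVES = 19 * 19           # one output per intersection (ignoring "pass")

rng = np.random.default_rng(0)
sigma = rng.normal(scale=0.01, size=(N_FEATURES, N_MOVES))   # weights sigma

def policy(s):
    """Softmax policy p_sigma(a|s) over all moves."""
    logits = s @ sigma
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sl_update(s, a, lr=0.01):
    """One stochastic gradient ascent step on log p_sigma(a|s)."""
    global sigma
    p = policy(s)
    # Gradient of log-softmax w.r.t. the logits is one_hot(a) - p;
    # the chain rule through the linear layer gives an outer product with s.
    grad_logits = -p
    grad_logits[a] += 1.0
    sigma += lr * np.outer(s, grad_logits)

# Usage: one update on a randomly sampled (state, human move) pair.
s = rng.random(N_FEATURES)      # stand-in board features
a = rng.integers(N_MOVES)       # stand-in expert move index
sl_update(s, a)
```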
Reinforcement Learning of Policy Network
Games are played between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. The reward function r(s) is zero for all non-terminal time steps t < T. The outcome zt = ± r(sT) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing.
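A tiny sketch of this self-play setup is given below: an opponent is drawn at random from a pool of earlier policy snapshots, and the only nonzero reward is the terminal outcome. The play_game function and the pool management are hypothetical placeholders, not code from the paper.

```python
# Sketch of the self-play loop described above (hypothetical interfaces).

import random

opponent_pool = []   # snapshots of earlier iterations of the policy network

def self_play_episode(current_policy, play_game):
    """Play one game and return the terminal outcome z for the current player."""
    # Randomizing over earlier opponents prevents overfitting to the current policy.
    opponent = random.choice(opponent_pool) if opponent_pool else current_policy
    # play_game (hypothetical) returns r(sT): +1 for a win, -1 for a loss,
    # from the current player's perspective; r(s) = 0 at every step t < T.
    return play_game(current_policy, opponent)

def snapshot(current_policy):
    """Periodically add the current weights to the opponent pool."""
    opponent_pool.append(current_policy)
```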
Reinforcement Learning of Policy Network
Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes the expected outcome: Δρ ∝ zt · ∂log pρ(at|st)/∂ρ. To evaluate the performance of the RL policy network in game play, each move is sampled from its output probability distribution over actions.
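Written out, this is a REINFORCE-style update applied to every time step of a finished game: each move is reinforced by the terminal outcome zt. The sketch below reuses the single linear-plus-softmax stand-in from the supervised-learning sketch; the shapes and learning rate are assumptions for illustration.

```python
# Policy-gradient update: delta_rho proportional to z_t * d log p_rho(a_t|s_t) / d rho,
# for a linear + softmax stand-in policy (shapes are assumptions).

import numpy as np

N_FEATURES, N_MOVES = 19 * 19, 19 * 19
rho = np.zeros((N_FEATURES, N_MOVES))    # RL policy weights rho (the paper initialises them from sigma)

def policy(s, w):
    logits = s @ w
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def rl_update(trajectory, z, lr=0.01):
    """REINFORCE update for one self-play game.

    trajectory: list of (state features s_t, sampled move index a_t) for the
                player whose network is being trained
    z: terminal outcome from that player's perspective (+1 win, -1 loss)
    """
    global rho
    for s, a in trajectory:
        p = policy(s, rho)
        grad_logits = -p
        grad_logits[a] += 1.0                # d log p(a|s) / d logits
        rho += lr * z * np.outer(s, grad_logits)
```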
Reinforcement Learning of Value Network
Estimate a value function vp(s) that predicts the outcome from position s of games played by using policy p for both players. The value function is approximated by a value network vθ(s) with weights θ. The weights of the value network are trained by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value vθ(s) and the corresponding outcome z.
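The regression target is the single scalar game outcome z ∈ {+1, −1}, so one SGD step only needs the gradient of the squared error. The sketch below uses a linear model with a tanh output in place of the paper's convolutional value network; everything about the model is an illustrative assumption.

```python
# One SGD step on the squared error between v_theta(s) and the outcome z,
# using a linear + tanh stand-in for the value network (assumption).

import numpy as np

N_FEATURES = 19 * 19
theta = np.zeros(N_FEATURES)                 # value-network weights theta

def value(s):
    """v_theta(s) in (-1, 1): predicted outcome from position s."""
    return np.tanh(s @ theta)

def value_update(s, z, lr=0.01):
    """One SGD step on the MSE loss (v_theta(s) - z)^2 / 2."""
    global theta
    v = value(s)
    # d loss / d theta = (v - z) * (1 - v^2) * s   (tanh derivative).
    theta -= lr * (v - z) * (1.0 - v * v) * s
```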
Neural network training pipeline and architecture
D. Silver et al. Nature 529, 484–489 (2016). doi:10.1038/nature16961
MCTS in AlphaGo
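During search, AlphaGo selects actions inside the tree by maximizing Q(s, a) + u(s, a), where the exploration bonus u(s, a) is proportional to the policy network's prior P(s, a) and decays with the visit count N(s, a). A minimal sketch of that selection rule is below; the Edge container and the c_puct value are illustrative assumptions.

```python
# Sketch of the in-tree selection rule: pick argmax_a Q(s, a) + u(s, a),
# with u(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).

import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # P(s, a) from the policy network
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # accumulated leaf evaluations for this edge

    @property
    def q(self):
        return self.total_value / self.visits if self.visits else 0.0

def select_action(edges, c_puct=5.0):
    """Pick the action maximizing Q(s, a) + u(s, a) among a node's edges."""
    total_visits = sum(e.visits for e in edges.values())
    def score(item):
        _, e = item
        u = c_puct * e.prior * math.sqrt(total_visits) / (1 + e.visits)
        return e.q + u
    return max(edges.items(), key=score)[0]
```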
Thank You