1
AlphaGo with Deep RL
2
Why is Go hard for computers?
All games of perfect information have an optimal value function, v*(s), which determines the outcome from every board state s under perfect play. In principle, v*(s) can be computed by recursively expanding a search tree containing approximately b^d possible sequences of moves, where b is the breadth (legal moves per position) and d is the depth (game length). For chess, b ≈ 35 and d ≈ 80; for Go, b ≈ 250 and d ≈ 150. This makes exhaustive brute-force search infeasible.
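As a rough sanity check on these numbers, here is a short back-of-the-envelope calculation in Python (illustrative only) of the order of magnitude of b^d:

    import math

    # Order-of-magnitude size of the naive search tree, b^d
    chess = 80 * math.log10(35)      # exponent for chess: b ≈ 35,  d ≈ 80
    go = 150 * math.log10(250)       # exponent for Go:    b ≈ 250, d ≈ 150
    print(f"chess ≈ 10^{chess:.0f} move sequences, Go ≈ 10^{go:.0f}")
    # chess ≈ 10^124, Go ≈ 10^360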
3
http://www0.cs.ucl.ac.uk/staff/d
4
Solution: reduce search space
Reduce the breadth of the search: sample actions from a policy p(a|s), a probability distribution over possible moves a in position s.
Reduce the depth of the search: truncate the search tree at state s and replace the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s.
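The idea can be sketched in a few lines of Python. This is an illustration, not AlphaGo's code: state, policy and value are assumed interfaces (a game state with is_terminal/play/result methods, a move distribution p(a|s), and an approximate value function).

    import random

    def truncated_rollout(state, policy, value, max_depth):
        """Estimate the outcome from `state` by sampling moves from a policy p(a|s)
        (reducing breadth) and cutting the search off with a value function v(s)
        (reducing depth)."""
        for _ in range(max_depth):
            if state.is_terminal():
                return state.result()                    # true outcome reached
            moves, probs = policy(state)                 # p(a|s) over legal moves
            state = state.play(random.choices(moves, weights=probs)[0])
        return value(state)                              # v(s) ≈ v*(s) below the cut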
5
http://www0.cs.ucl.ac.uk/staff/d
6
http://www0.cs.ucl.ac.uk/staff/d
7
What is AlphaGo? AlphaGo = Supervised learning of policy networks + Reinforcement learning of policy networks + Reinforcement learning of value networks + Monte Carlo tree search (MCTS)
8
Overall training First, train a supervised learning (SL) policy network p_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Also train a fast rollout policy p_π that can rapidly sample actions during rollouts. Next, train a reinforcement learning (RL) policy network p_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, train a value network that predicts the winner of games played by the RL policy network against itself. AlphaGo efficiently combines the policy and value networks with MCTS.
9
Training pipeline
10
Supervised learning of policy networks
Policy network: 12-layer convolutional neural network
Training data: 30M positions from human expert games (KGS 5+ dan)
Training algorithm: maximise likelihood by stochastic gradient descent
Training time: 4 weeks on 50 GPUs using Google Cloud
Results: 57% accuracy on held-out test data (state of the art was 44%)
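A minimal sketch of this training step in PyTorch, assuming a toy two-layer network rather than the actual 12-layer architecture; the 48 input feature planes, layer sizes and learning rate are illustrative:

    import torch
    import torch.nn as nn

    policy_net = nn.Sequential(
        nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 1, kernel_size=1),
        nn.Flatten(),                     # logits over the 19x19 = 361 board points
    )
    optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.003)
    loss_fn = nn.CrossEntropyLoss()

    def sl_update(board_features, expert_moves):
        """One SGD step on a minibatch: board_features is (N, 48, 19, 19),
        expert_moves is (N,) holding the index of the human move."""
        logits = policy_net(board_features)
        loss = loss_fn(logits, expert_moves)   # maximise likelihood of the expert move
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()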
11
Reinforcement learning of policy networks
Policy network: 12-layer convolutional neural network
Training data: games of self-play between policy networks
Training algorithm: maximise wins z by policy gradient reinforcement learning
Training time: 1 week on 50 GPUs using Google Cloud
Results: 80% win rate vs the supervised learning network; the raw network plays at roughly 3 amateur dan
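A sketch of the policy-gradient update, assuming the self-play game has already been played and the log-probabilities of the chosen moves were recorded; this shows the REINFORCE idea, not AlphaGo's exact procedure:

    import torch

    def rl_policy_update(policy_net, optimizer, log_probs, z):
        """One policy-gradient update from a finished self-play game.
        log_probs: list of log p_rho(a_t | s_t) tensors for the moves played;
        z: +1.0 if the game was won, -1.0 if it was lost."""
        # Reinforce moves from won games, discourage moves from lost games.
        loss = -z * torch.stack(log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()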
12
Reinforcement learning of value networks
Value network: 12-layer convolutional neural network
Training data: 30 million games of self-play
Training algorithm: minimise MSE by stochastic gradient descent
Training time: 1 week on 50 GPUs using Google Cloud
Results: first strong position evaluation function, previously thought impossible
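A minimal sketch of the value regression in PyTorch, again with an illustrative toy network rather than the real architecture; board_features and outcomes are assumed tensors of positions and final game results z:

    import torch
    import torch.nn as nn

    value_net = nn.Sequential(
        nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 19 * 19, 256), nn.ReLU(),
        nn.Linear(256, 1), nn.Tanh(),          # predicted outcome in [-1, 1]
    )
    optimizer = torch.optim.SGD(value_net.parameters(), lr=0.003)

    def value_update(board_features, outcomes):
        """One SGD step minimising MSE between the prediction and the outcome z."""
        pred = value_net(board_features).squeeze(1)
        loss = nn.functional.mse_loss(pred, outcomes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()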
13
Monte Carlo tree search (MCTS)
Monte Carlo tree search (MCTS) is a heuristic search algorithm for certain kinds of decision processes. It focuses on analysing the most promising moves, expanding the search tree based on random sampling of the search space. In games, MCTS relies on many playouts: in each playout, the game is played out to the very end by selecting moves at random. The final result of each playout is then used to weight the nodes in the game tree, so that better nodes are more likely to be chosen in future playouts.
14
The Tree Structure MCTS encodes the game state and its potential moves into a tree. Each node in the tree represents a potential game state, with the root node representing the current state. Each edge represents a legal move from one game state to another. At the start of a game of Tic-Tac-Toe the root node may have up to nine children, one for each possible move. Each child can have at most one fewer child than its parent, since the previous moves are no longer available as options.
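A possible Node structure for such a tree, sketched in Python; state.legal_moves() is an assumed game-state method, and the fields mirror the statistics described on the following slides:

    class Node:
        """One node of the MCTS tree: a game state plus visit statistics."""
        def __init__(self, state, parent=None, move=None):
            self.state = state                              # game state this node represents
            self.parent = parent
            self.move = move                                # move that led here from the parent
            self.children = []                              # expanded child nodes
            self.untried_moves = list(state.legal_moves())  # legal moves not yet in the tree
            self.visits = 0                                 # playouts through this node
            self.wins = 0.0                                 # wins backed up through this node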
15
https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
MCTS The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player and then choose the move that led to the most victories. Each round of Monte Carlo tree search consists of four steps: selection, expansion, simulation, and backpropagation.
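Putting the four steps together, a minimal sketch of one call to MCTS (using the Node class above and the select/expand/simulate/backpropagate functions sketched under the following slides; all names are illustrative):

    def mcts(root_state, n_playouts):
        """Run n_playouts rounds of the four steps, then pick the move that
        accumulated the most victories."""
        root = Node(root_state)
        for _ in range(n_playouts):
            leaf = select(root)                  # 1. selection
            child = expand(leaf)                 # 2. expansion
            result = simulate(child.state)       # 3. simulation (playout)
            backpropagate(child, result)         # 4. backpropagation
        return max(root.children, key=lambda c: c.wins).move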
16
Selection In the selection step, the MCTS algorithm traverses the current tree using a tree policy. The tree policy uses an evaluation function that prioritizes nodes with the greatest estimated value. Once the traversal reaches a node that still has children (moves) left to be added, MCTS transitions into the expansion step.
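A sketch of the selection step over the Node class from the earlier sketch, using UCT as the evaluation function (the slide does not prescribe a particular one); the exploration constant is illustrative:

    import math

    def select(node, c=1.41):
        """Tree policy: descend through fully expanded nodes, always preferring
        the child with the greatest estimated value (here, the UCT score)."""
        while not node.untried_moves and node.children:
            node = max(
                node.children,
                key=lambda ch: ch.wins / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits),
            )
        return node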
17
Expansion In the expansion step, a new node is added to the tree as a child of the node reached in the selection step.
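A sketch of the expansion step, again assuming the Node class and a game-state .play() method from the earlier sketches:

    import random

    def expand(node):
        """Add one new child node for a move that has not been tried yet."""
        if not node.untried_moves:               # terminal node: nothing to add
            return node
        move = random.choice(node.untried_moves)
        node.untried_moves.remove(move)
        child = Node(node.state.play(move), parent=node, move=move)
        node.children.append(child)
        return child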
18
Simulation In this step, a simulation (also referred to as a playout or rollout) is performed by choosing moves until either an end state or a predefined threshold is reached.
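A sketch of the simulation step with uniformly random moves; is_terminal, play, legal_moves and result are assumed game-state methods, and the move threshold is illustrative:

    import random

    def simulate(state, max_moves=500):
        """Playout/rollout: play random legal moves until the game ends (or a
        move threshold is hit) and return the result, e.g. 1.0 win, 0.0 loss."""
        for _ in range(max_moves):
            if state.is_terminal():
                break
            state = state.play(random.choice(state.legal_moves()))
        return state.result()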
19
Backpropagation Now that the value of the newly added node has been determined, the rest of the tree must be updated. Starting at the new node, the algorithm traverses back to the root node. During the traversal the number of simulations stored in each node is incremented, and if the new node’s simulation resulted in a win then the number of wins is also incremented.
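A sketch of the backpropagation step over the same Node class; for brevity it credits the raw result at every node, whereas a real two-player implementation flips the perspective at each ply:

    def backpropagate(node, result):
        """Walk from the newly added node back to the root, updating statistics."""
        while node is not None:
            node.visits += 1             # one more simulation passed through this node
            node.wins += result          # credit the playout result
            node = node.parent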
20
MCTS in AlphaGo
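A sketch of the two AlphaGo-specific ingredients: selection guided by the policy network's prior (a PUCT-style score, where child.prior and child.value_sum are extra node fields beyond the earlier Node sketch) and leaf evaluation mixing the value network with a fast rollout, V(s) = (1 − λ)·v_θ(s) + λ·z, with λ = 0.5 as in the paper; the constants are illustrative:

    import math

    def puct_score(child, parent_visits, c_puct=5.0):
        """Selection score: action value Q plus an exploration bonus weighted by
        the policy network's prior probability P(s, a) stored on the child."""
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
        return q + u

    def evaluate_leaf(v_theta, rollout_outcome, lam=0.5):
        """Mixed leaf evaluation: (1 - lam) * v_theta(s) + lam * z_rollout."""
        return (1 - lam) * v_theta + lam * rollout_outcome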
21
Thank you!