AlphaGo with Deep RL
Why is Go hard for computers? All games of perfect information have an optimal value function, v*(s), defined for every board state s. In principle v*(s) can be obtained by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves (b legal moves per position, game length d). For chess, b ≈ 35, d ≈ 80; for Go, b ≈ 250, d ≈ 150. This makes exhaustive, brute-force search infeasible.
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
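To put those numbers in perspective, a quick illustrative calculation (plain Python, nothing AlphaGo-specific, just the b^d figures quoted above):

# Rough size of the game trees implied by the branching factor b and depth d above.
chess = 35 ** 80
go = 250 ** 150
print(f"chess ~ 10^{len(str(chess)) - 1}")   # roughly 10^123
print(f"go    ~ 10^{len(str(go)) - 1}")      # roughly 10^359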
Solution: reduce the search space. Reduce the breadth of the search: sample actions from a policy p(a|s), a probability distribution over possible moves a in position s. Reduce the depth of the search: truncate the search tree at state s and replace the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s.
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
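A minimal sketch of the two reductions, assuming a hypothetical game interface (legal_moves, play, is_terminal, outcome), a policy function p(a|s) and an approximate value function v(s). None of these are AlphaGo's actual components; the code only illustrates sampling moves to cut breadth and truncating with a value estimate to cut depth:

import random

def estimate_value(state, policy, value_fn, legal_moves, play, is_terminal, outcome,
                   depth=0, max_depth=4):
    if is_terminal(state):
        return outcome(state)            # exact result at a finished game
    if depth >= max_depth:
        return value_fn(state)           # depth reduction: truncate here with v(s) ~ v*(s)
    # Breadth reduction: sample a single move from p(a|s) instead of expanding every move.
    moves = legal_moves(state)
    a = random.choices(moves, weights=policy(state, moves), k=1)[0]
    return estimate_value(play(state, a), policy, value_fn, legal_moves, play,
                          is_terminal, outcome, depth + 1, max_depth)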
What is AlphaGo? AlphaGo = Supervised learning of policy networks + Reinforcement learning of policy networks + Reinforcement learning of value networks + Monte Carlo tree search (MCTS)
Overall training First, train a supervised learning (SL) policy network P_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Also train a fast rollout policy P_π that can rapidly sample actions during rollouts. Next, train a reinforcement learning (RL) policy network P_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, train a value network that predicts the winner of games played by the RL policy network against itself. AlphaGo efficiently combines the policy and value networks with MCTS. https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
Training pipeline http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
Supervised learning of policy networks Policy network: 12 layer convolutional neural network Training data: 30M positions from human expert games (KGS 5+ dan) Training algorithm: maximise likelihood by stochastic gradient descent Training time: 4 weeks on 50 GPUs using Google Cloud Results: 57% accuracy on held-out test data (state-of-the-art was 44%) http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
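A rough sketch, not DeepMind's code, of what "maximise likelihood by stochastic gradient descent" means here, written with PyTorch. The layer sizes (48 input feature planes, 192 filters) roughly follow the numbers in the Nature paper, but the class and training helper are illustrative only:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Small stand-in for the 12-layer convolutional policy network."""
    def __init__(self, planes=48, filters=192, layers=12):
        super().__init__()
        convs = [nn.Conv2d(planes, filters, 5, padding=2)]
        convs += [nn.Conv2d(filters, filters, 3, padding=1) for _ in range(layers - 2)]
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(filters, 1, 1)          # one logit per board point

    def forward(self, x):                             # x: (batch, planes, 19, 19)
        for conv in self.convs:
            x = F.relu(conv(x))
        return self.head(x).flatten(1)                # (batch, 361) move logits

def sl_step(net, optimizer, boards, expert_moves):
    """One SGD step maximising the likelihood of the expert move."""
    loss = F.cross_entropy(net(boards), expert_moves)  # -log p_sigma(a_expert | s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: net = PolicyNet(); opt = torch.optim.SGD(net.parameters(), lr=0.003)
# then call sl_step(net, opt, batch_of_boards, batch_of_expert_moves) over the 30M positions.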
Reinforcement learning of policy networks Policy network: 12 layer convolutional neural network Training data: games of self-play between policy networks Training algorithm: maximise wins z by policy gradient reinforcement learning Training time: 1 week on 50 GPUs using Google Cloud Results: won 80% of games against the supervised learning network. Raw network ~3 amateur dan. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
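A sketch of the policy-gradient (REINFORCE-style) update on game outcomes, again not the actual AlphaGo code. It assumes the PolicyNet sketch above and a hypothetical play_self_play_game helper that returns the visited states, the moves the network chose, and the final outcome z = +1 for a win, -1 for a loss:

import torch
import torch.nn.functional as F

def reinforce_step(net, optimizer, states, moves, z):
    """Increase the log-probability of the moves played, scaled by the game
    outcome z (+1 win, -1 loss), so behaviour that wins is reinforced."""
    log_probs = F.log_softmax(net(states), dim=1)               # (T, 361)
    chosen = log_probs.gather(1, moves.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    loss = -(z * chosen).mean()                                  # gradient ascent on z * log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# states, moves, z = play_self_play_game(net, opponent_net)   # hypothetical helper
# reinforce_step(net, torch.optim.SGD(net.parameters(), lr=1e-4), states, moves, z)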
Reinforcement learning of value networks Value network: 12 layer convolutional neural network Training data: 30 million games of self-play Training algorithm: minimise MSE by stochastic gradient descent Training time: 1 week on 50 GPUs using Google Cloud Results: First strong position evaluation function - previously thought impossible http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
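And the value-network regression, minimising mean squared error between the predicted outcome v(s) and the actual self-play result z. Again only a sketch, using a hypothetical ValueNet with a scalar tanh head rather than AlphaGo's exact architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Convolutional trunk plus a scalar head predicting the game outcome in [-1, 1]."""
    def __init__(self, planes=48, filters=192, layers=12):
        super().__init__()
        convs = [nn.Conv2d(planes, filters, 5, padding=2)]
        convs += [nn.Conv2d(filters, filters, 3, padding=1) for _ in range(layers - 2)]
        self.convs = nn.ModuleList(convs)
        self.fc = nn.Linear(filters * 19 * 19, 1)

    def forward(self, x):
        for conv in self.convs:
            x = F.relu(conv(x))
        return torch.tanh(self.fc(x.flatten(1))).squeeze(1)   # v(s) in [-1, 1]

def value_step(net, optimizer, boards, z):
    """One SGD step minimising MSE between v(s) and the self-play outcome z."""
    loss = F.mse_loss(net(boards), z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()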
Monte Carlo tree search (MCTS) Monte Carlo tree search is a heuristic search algorithm for certain kinds of decision processes. The focus of MCTS is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts. https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
The Tree Structure MCTS encodes the game state and its potential moves into a tree. Each node in the tree represents a potential game state, with the root node representing the current state. Each edge represents a legal move that can be made from one game state to another. At the start of a game of Tic-Tac-Toe the root node may have up to nine children, one for each possible move. Each child can have at most one fewer child than its parent, since the moves already played are no longer available as options. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
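A minimal node structure for such a tree (illustrative only; it assumes the game-state object exposes a legal_moves() method that returns an empty list once the game is over):

class Node:
    """One game state in the MCTS tree."""
    def __init__(self, state, parent=None, move=None):
        self.state = state          # the game position this node represents
        self.parent = parent        # None for the root (the current position)
        self.move = move            # the move (edge) that led here from the parent
        self.children = []          # expanded child nodes
        self.untried_moves = list(state.legal_moves())  # legal moves not yet expanded
        self.visits = 0             # number of simulations through this node
        self.wins = 0.0             # number of those simulations that were wins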
MCTS The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player, then to choose the move that led to the most victories. Each round of Monte Carlo tree search consists of four steps: Selection, Expansion, Simulation, Backpropagation. https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
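That "most basic way" (sometimes called flat Monte Carlo) can be sketched as follows, assuming hypothetical helpers legal_moves(state), play(state, move) and random_playout(state) that returns 1 if the playout ends in a win for the player who made the move and 0 otherwise:

def flat_monte_carlo(state, legal_moves, play, random_playout, n_playouts=100):
    """Run the same number of random playouts after each legal move and
    pick the move with the most wins."""
    best_move, best_wins = None, -1
    for move in legal_moves(state):
        wins = sum(random_playout(play(state, move)) for _ in range(n_playouts))
        if wins > best_wins:
            best_move, best_wins = move, wins
    return best_move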
Selection In the selection process, the MCTS algorithm traverses the current tree using a tree policy. A tree policy uses an evaluation function that prioritizes nodes with the greatest estimated value. Once the traversal reaches a node that still has children (or moves) left to be added, MCTS transitions into the expansion step. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
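A commonly used tree policy (though not the only choice) is UCT, which scores each child by its win rate plus an exploration bonus. A sketch over the Node fields from the earlier snippet, assuming every child has already been visited at least once, which holds when selection only descends into fully expanded nodes:

import math

def uct_select(node, c=math.sqrt(2)):
    """Pick the child maximising win-rate + exploration bonus (UCB1 applied to trees)."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits),
    )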
Expansion In the expansion step, a new node is added to the tree as a child of the node reached in the selection step. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
Simulation In this step, a simulation (also referred to as a playout or rollout) is performed by choosing moves until either an end state or a predefined threshold is reached. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
Backpropagation Now that the value of the newly added node has been determined, the rest of the tree must be updated. Starting at the new node, the algorithm traverses back to the root node. During the traversal the number of simulations stored in each node is incremented, and if the new node’s simulation resulted in a win then the number of wins is also incremented. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
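Putting the four steps together, a compact generic MCTS loop. This is a sketch building on the Node class and uct_select above; the game-state interface (play, is_terminal, result, player_to_move) is hypothetical, and draws are ignored for brevity:

import random

def mcts(root_state, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1. Selection: walk down while the node is fully expanded and has children.
        while not node.untried_moves and node.children:
            node = uct_select(node)
        # 2. Expansion: add one child for a move that has not been tried yet.
        if node.untried_moves:
            move = node.untried_moves.pop()
            node.children.append(Node(node.state.play(move), parent=node, move=move))
            node = node.children[-1]
        # 3. Simulation: play random moves from the new node until the game ends.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(list(state.legal_moves())))
        # 4. Backpropagation: update visit/win counts on the path back to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None:
                # Credit the result (1 win / 0 loss) to the player whose move led into this node.
                node.wins += state.result(node.parent.state.player_to_move())
            node = node.parent
    # After the search budget is spent, play the most-visited move at the root.
    return max(root.children, key=lambda child: child.visits).move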
MCTS in AlphaGo https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
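In AlphaGo's search, each edge stores a visit count N(s,a), an action value Q(s,a) and a prior P(s,a) supplied by the policy network. Selection picks argmax_a Q(s,a) + u(s,a), with u(s,a) proportional to P(s,a) / (1 + N(s,a)), and a leaf s_L is evaluated by mixing the value network with a fast-rollout result, V(s_L) = (1 - λ)·v_θ(s_L) + λ·z_L (see the Nature paper linked above). A sketch of those two formulas as plain functions, not the actual implementation:

def alphago_selection_score(Q, P, N_edge, N_parent_total, c_puct=5.0):
    # The exploration bonus u is proportional to the policy-network prior P(s,a)
    # and decays as the edge's visit count grows, so the search gradually
    # prefers high-value moves over high-prior ones.
    u = c_puct * P * (N_parent_total ** 0.5) / (1 + N_edge)
    return Q + u

def alphago_leaf_value(v_theta, z_rollout, lam=0.5):
    # Leaf evaluation mixes the value network's prediction v_theta(s_L)
    # with the outcome z_L of a fast-rollout playout, weighted by lambda.
    return (1 - lam) * v_theta + lam * z_rollout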
Thank you!