AlphaGo with Deep RL

Why is Go hard for computers? All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game from every board state s under perfect play. In principle, v*(s) can be computed by recursively expanding a search tree containing approximately b^d possible sequences of moves, where b is the breadth (number of legal moves per position) and d the depth (game length). For chess, b ≈ 35 and d ≈ 80; for Go, b ≈ 250 and d ≈ 150. This makes exhaustive (brute-force) search infeasible.
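As a rough back-of-the-envelope check of these numbers (approximate, for illustration only):

$$b^d \approx 35^{80} \approx 10^{123} \ \text{(chess)}, \qquad b^d \approx 250^{150} \approx 10^{360} \ \text{(Go)}$$

Even the chess figure dwarfs the roughly 10^80 atoms in the observable universe; the Go figure is larger still by hundreds of orders of magnitude.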

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf

Solution: reduce the search space. Reduce the breadth of the search by sampling actions from a policy p(a|s), a probability distribution over the possible moves a in position s. Reduce the depth of the search by truncating the search tree at state s and replacing the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s.
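A minimal sketch of how these two reductions plug into an ordinary game-tree search; the policy, value, legal_moves, and apply_move helpers are hypothetical placeholders standing in for the learned networks and the game rules, not AlphaGo's actual search:

import random

def reduced_search(state, depth, policy, value, legal_moves, apply_move,
                   breadth=5, max_depth=4):
    # Hypothetical sketch: breadth is reduced by sampling a few moves from
    # p(a|s); depth is reduced by returning v(s) ~ v*(s) at a fixed cutoff.
    moves = legal_moves(state)
    if not moves or depth == max_depth:
        return value(state)                      # truncate: v(s) instead of the full subtree
    probs = policy(state, moves)                 # p(a|s) over the legal moves
    sampled = random.choices(moves, weights=probs, k=min(breadth, len(moves)))
    # Negamax convention: what is good for the opponent is bad for us.
    return max(-reduced_search(apply_move(state, m), depth + 1, policy, value,
                               legal_moves, apply_move, breadth, max_depth)
               for m in sampled)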

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf

What is AlphaGo? AlphaGo = Supervised learning of policy networks + Reinforcement learning of policy networks + Reinforcement learning of value networks + Monte Carlo tree search (MCTS)

Overall training: First, train a supervised learning (SL) policy network P_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Also train a fast rollout policy P_π that can rapidly sample actions during rollouts. Next, train a reinforcement learning (RL) policy network P_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, train a value network that predicts the winner of games played by the RL policy network against itself. AlphaGo efficiently combines the policy and value networks with MCTS. https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
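The ordering of these stages, written as an illustrative sketch (the trainer functions passed in are placeholders for the steps above, not actual DeepMind code):

def alphago_training_pipeline(train_sl_policy, train_rollout_policy,
                              train_rl_policy, train_value_network,
                              expert_positions):
    # Illustrative ordering only; each argument is a placeholder trainer.
    p_sigma = train_sl_policy(expert_positions)             # SL policy network
    p_pi = train_rollout_policy(expert_positions)           # fast rollout policy
    p_rho = train_rl_policy(initial_policy=p_sigma)         # RL policy via self-play
    v_theta = train_value_network(self_play_policy=p_rho)   # value network
    return p_sigma, p_pi, p_rho, v_theta                    # all four feed into MCTS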

Training pipeline http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf

Supervised learning of policy networks
Policy network: 12-layer convolutional neural network
Training data: 30M positions from human expert games (KGS 5+ dan)
Training algorithm: maximise likelihood by stochastic gradient descent
Training time: 4 weeks on 50 GPUs using Google Cloud
Results: 57% accuracy on held-out test data (previous state of the art was 44%)
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
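A hedged PyTorch-style sketch of this supervised step; the network below is a tiny stand-in, not the actual 12-layer architecture, and the input/feature shapes are illustrative assumptions:

import torch
import torch.nn as nn

# Tiny stand-in for the SL policy network: feature planes -> logits over the 19x19 points.
policy_net = nn.Sequential(
    nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),
    nn.Flatten(),                                  # (batch, 361) move logits
)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.003)

def sl_update(board_planes, expert_moves):
    # One SGD step that maximises the likelihood of the expert moves.
    logits = policy_net(board_planes)
    loss = nn.functional.cross_entropy(logits, expert_moves)   # -log P_sigma(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random stand-in data: a mini-batch of 8 positions and their expert moves.
planes = torch.randn(8, 48, 19, 19)
expert = torch.randint(0, 361, (8,))
sl_update(planes, expert)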

Reinforcement learning of policy networks
Policy network: 12-layer convolutional neural network
Training data: games of self-play between the current policy network and previous iterations of itself
Training algorithm: maximise wins z by policy gradient reinforcement learning
Training time: 1 week on 50 GPUs using Google Cloud
Results: won 80% of games against the SL policy network; the raw network alone plays at roughly 3 amateur dan
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
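A hedged sketch of the corresponding policy-gradient (REINFORCE) step, using the same toy stand-in network as above; generating the self-play games themselves is omitted:

import torch
import torch.nn as nn

policy_net = nn.Sequential(                        # same toy stand-in as the SL sketch
    nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1), nn.Flatten(),
)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.001)

def reinforce_update(states, actions, z):
    # Push the probability of the played moves up when the game was won
    # (z = +1) and down when it was lost (z = -1).
    log_probs = nn.functional.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(z * chosen).mean()                    # gradient ascent on z * log P_rho(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stand-in data: 8 (state, move) pairs from one self-play game the agent won.
states = torch.randn(8, 48, 19, 19)
actions = torch.randint(0, 361, (8,))
reinforce_update(states, actions, torch.tensor(1.0))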

Reinforcement learning of value networks
Value network: 12-layer convolutional neural network
Training data: 30 million positions, each sampled from a distinct game of self-play
Training algorithm: minimise mean squared error (MSE) by stochastic gradient descent
Training time: 1 week on 50 GPUs using Google Cloud
Results: the first strong position-evaluation function for Go, previously thought impossible
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/AlphaGo-tutorial-slides.pdf
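And a hedged sketch of the value-network regression; again, the architecture and shapes are simplified stand-ins rather than the real network:

import torch
import torch.nn as nn

# Tiny stand-in for the value network: board planes -> scalar outcome prediction in [-1, 1].
value_net = nn.Sequential(
    nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 19 * 19, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.003)

def value_update(positions, outcomes):
    # One SGD step minimising the MSE between v(s) and the final game result z.
    predictions = value_net(positions).squeeze(1)
    loss = nn.functional.mse_loss(predictions, outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stand-in data: 8 positions, each labelled with the outcome of its game.
positions = torch.randn(8, 48, 19, 19)
outcomes = torch.tensor([1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0])
value_update(positions, outcomes)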

Monte Carlo tree search (MCTS) Monte Carlo tree search is a heuristic search algorithm for certain kinds of decision processes. It focuses the analysis on the most promising moves, expanding the search tree based on random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final result of each playout is then used to weight the nodes in the game tree, so that better nodes are more likely to be chosen in future playouts. https://en.wikipedia.org/wiki/Monte_Carlo_tree_search

The Tree Structure MCTS encodes the game state and its potential moves as a tree. Each node in the tree represents a potential game state, with the root node representing the current state. Each edge represents a legal move that leads from one game state to another. At the start of a game of TicTacToe the root node may have up to nine children, one for each possible move. Each child can in turn have at most one fewer child than its parent, since the moves already played are no longer available as options. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
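A minimal sketch of such a tree node for TicTacToe; the field names are illustrative, not taken from any particular library:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: tuple                       # 9-cell board, e.g. ('X', None, 'O', ...)
    move: Optional[int] = None         # the edge: the move that led here from the parent
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    wins: int = 0                      # statistics filled in during backpropagation
    visits: int = 0

# The root is the current position; at the start of the game it can grow
# up to nine children, and each child one fewer than its parent.
root = Node(state=(None,) * 9)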

MCTS The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player, and then to choose the move that led to the most victories. Each round of Monte Carlo tree search consists of four steps: selection, expansion, simulation, and backpropagation. https://en.wikipedia.org/wiki/Monte_Carlo_tree_search

Selection In the selection step, the MCTS algorithm traverses the current tree using a tree policy. A tree policy uses an evaluation function that prioritizes nodes with the greatest estimated value. Once the traversal reaches a node that still has children (untried moves) left to be added, MCTS transitions into the expansion step. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
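The most widely used tree policy is UCT (Upper Confidence bounds applied to Trees). As a reference formula, a child i with w_i wins out of n_i visits, whose parent has been visited N times, is scored as

$$\mathrm{UCT}_i = \frac{w_i}{n_i} + c\,\sqrt{\frac{\ln N}{n_i}}$$

where c is an exploration constant (often taken as √2). The first term favours moves that have performed well so far; the second favours moves that have been tried rarely.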

Expansion In the expansion step, a new node is added to the tree as a child of the node reached in the selection step. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons

Simulation In this step, a simulation (also referred to as a playout or rollout) is performed by choosing moves until either an end state or a predefined threshold is reached. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons

Backpropagation Now that the value of the newly added node has been determined, the rest of the tree must be updated. Starting at the new node, the algorithm traverses back to the root node. During the traversal the number of simulations stored in each node is incremented, and if the new node’s simulation resulted in a win then the number of wins is also incremented. http://digitalcommons.morris.umn.edu/cgi/viewcontent.cgi?article=1028&context=horizons
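Putting the four steps together, a compact, game-agnostic sketch of one MCTS decision. The game object with legal_moves(), play(), and result() methods is a hypothetical placeholder, and for brevity wins are credited from the root player's perspective only (a full two-player version would alternate the perspective at each ply):

import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.untried = [], None
        self.wins, self.visits = 0, 0

def mcts(root_state, game, n_playouts=1000, c=1.4):
    root = Node(root_state)
    root.untried = list(game.legal_moves(root_state))
    for _ in range(n_playouts):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT.
        while not node.untried and node.children:
            node = max(node.children,
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one child for a move not yet tried from this node.
        if node.untried:
            move = node.untried.pop()
            state = game.play(node.state, move)
            child = Node(state, parent=node, move=move)
            child.untried = list(game.legal_moves(state))
            node.children.append(child)
            node = child
        # 3. Simulation: play random moves from the new node to the end of the game.
        state = node.state
        while game.legal_moves(state):
            state = game.play(state, random.choice(game.legal_moves(state)))
        reward = game.result(state)        # e.g. 1 if the root player won, else 0
        # 4. Backpropagation: update visit and win counts back up to the root.
        while node is not None:
            node.visits += 1
            node.wins += reward
            node = node.parent
    # Recommend the most visited move from the root.
    return max(root.children, key=lambda ch: ch.visits).move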

MCTS in AlphaGo https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
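For reference, the selection rule and leaf evaluation used by AlphaGo's search, as given in the Nature paper: inside the tree, actions are chosen to maximise the stored action value plus a bonus driven by the policy network's prior probability, and leaf positions are scored by mixing the value network with the outcome of a fast rollout:

$$a_t = \arg\max_a \bigl( Q(s_t, a) + u(s_t, a) \bigr), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

Here P(s, a) is the prior from the policy network, N(s, a) the visit count of the edge, v_θ the value network's evaluation of the leaf s_L, and z_L the result of a rollout played to the end with the fast policy P_π.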

Thank you!