Alpha Go …and Higher Ed Reuben Ternes Oakland University

Alpha Go …and Higher Ed Reuben Ternes Oakland University MI AIR, Nov. 2017 Contact: ternes@oakland.edu

DeepMind Google’s DeepMind recently created an algorithm that has now beaten the world’s highest-rated Go player. Why does that matter? How did they do it? And what does it have to do with the future of Machine Learning, AI, and Higher Education?

Go What is it? Why is it seen as an important achievement in AI? Was it really that much of a surprise?

Go – The Game Objective: To capture the most territory by placing one stone at a time. Rules: Black places one stone, then white, then black again, then white, etc. Stones are placed only on intersections. A stone or group of solidly connected stones is captured (removed) when all intersections directly adjacent to it are occupied by the enemy. No stone may be placed to recreate a former board position.

More About Go Go falls into a class of games known as ‘Perfect Information’ games. Nothing is hidden from either player, unlike in poker. It’s a lot like chess (in some ways). In principle, all Perfect Information games can be solved computationally by simply computing all possible moves and selecting the next move that guarantees a path to victory. Mathematically, the number of possible ‘game states’ is about B^D, where B is the number of legal moves per position and D is the number of moves left in the game (B = Breadth, D = Depth). In Chess, the search space is about 35^80. In Go, that number is about 250^150. That’s more possible board states than there are suspected atoms in the universe. Solving this kind of computationally difficult problem has long been seen within the AI community as something akin to human intelligence.
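To get a feel for the size of these numbers, the short sketch below compares the two search spaces on a log scale. The B and D values are the ones quoted on the slide; the figure of roughly 10^80 atoms in the observable universe is a commonly quoted estimate and is an assumption here, not something taken from the talk.

```python
import math

def log10_states(breadth: int, depth: int) -> float:
    """Return log10(B^D): roughly how many decimal digits B^D has."""
    return depth * math.log10(breadth)

chess_digits = log10_states(35, 80)    # B = 35, D = 80 (values from the slide)
go_digits = log10_states(250, 150)     # B = 250, D = 150 (values from the slide)

# Assumption: about 10^80 atoms in the observable universe (common estimate).
ATOMS_LOG10 = 80

print(f"Chess: about 10^{chess_digits:.0f} states")
print(f"Go:    about 10^{go_digits:.0f} states")
print(f"Go exceeds the atom estimate by about 10^{go_digits - ATOMS_LOG10:.0f}")
```

Working in logarithms avoids printing a 360-digit integer and makes the gap between the two games easy to read.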

Is the Achievement Over-Hyped? Not all members of the AI community think that solving Go would be that impressive. In particular, the game has simple rules, and a perfect solution to all game states is possible. Given enough computational power, previously explored algorithms would actually be sufficient to create a ‘perfect move’ algorithm. Still, the large number of potential moves, and the question of how to prioritize strategies for exploring them, daunted many programmers for decades. Go has a long history of active AI research. In the months leading up to DeepMind’s publication, several researchers noted improvements in past algorithms. These improvements were mostly lost to the public at large during the media blitz surrounding AlphaGo.

About the Algorithm Ok. Let’s get technical! Essentially, all of these details come from Silver et al.’s article “Mastering the game of Go with deep neural networks and tree search” in Nature (January 2016). There’s a newer paper, released just a couple of weeks ago. More on this later. The basic framework revolves around reducing the search space for the computer to make the computation of ‘best next move’ a whole lot easier.

AlphaGo Framework for Innovation AlphaGo’s major claimed innovation is the combination of a ‘Value Network’ and a ‘Policy Network’. These ‘networks’ reduce B and D, so efficient search becomes feasible. The Networks: Value Network = effective evaluation of the strength of a position (i.e. board state). Probability of winning at move (ai). Policy Network = a network that can be used to sample possible future moves (for both players). Probability distribution of future moves. Both of these ‘things’ are called ‘networks’ probably because the authors really like Neural Networks (NN), and used NN to create these probabilities.

Sidenote: Neural Nets I am not going to talk about Neural Nets in this talk. Describing how Neural Nets work is an entirely separate topic. For those interested, some great introductory videos and blogs on Neural Nets are listed below. https://www.youtube.com/watch?v=bxe2T-V8XRs https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/ https://www.youtube.com/watch?feature=youtu.be&utm_campaign=Data%2BElixir&utm_medium=email&utm_source=Data_Elixir_152&v=aircAruvnKk This last one is especially good.

Restricting the Search Space of D Reduce D: Game length. You do this by defining a ‘Position of Power’ from which you know you can win. The further from the end game you can define this Position of Power, the smaller your search space becomes. ‘Position of Power’ can be approximated with a probability of winning. Formally: an ‘Approximate Value Function’ that predicts the outcome of the game from a particular board state ‘S’. (This will become the ‘Value Network’.) For example, in Chess: winning end game states occur when it is your turn, your opponent has only a King, and you have your King plus one of the following: a Queen, a Rook, two Bishops, or a certain number of pawns in certain positions. By knowing these endgames, you can greatly reduce your search space.

Restricting the Search Space of B Many modern Go algorithms do not restrict D. Instead, they focus on B. Basically, previous attempts created a heuristic to determine future moves. Monte Carlo ‘Rollouts’ – most popular heuristic. ‘Rollout’ = all the way to the end of the game. Step 1 – Choose a future move (ai) to take (usually at random). Step 2 – Simulate an entire game based on (ai) by selecting all additional future moves at random. Step 3 – Record the win or loss for (ai). Step 4 – Repeat many times. Step 5 – Choose the next move that has the highest recorded win percentage from your random simulations. This method was used to create near perfect performance in backgammon and Scrabble. But such a procedure could not best even moderately good Go players. Note that with so many random elements only a small fraction of possible moves can be explored.
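The five rollout steps can be sketched on a toy game. The example below is a minimal illustration, not the code of any Go program: it assumes a simple take-away game (remove 1 to 3 stones from a pile; whoever takes the last stone wins), and the names `legal_moves`, `random_rollout`, and `choose_move` are hypothetical. For simplicity it tries every legal next move in turn rather than picking (ai) at random.

```python
import random

def legal_moves(pile):
    """Toy game (an assumption for illustration): take 1, 2, or 3 stones."""
    return [m for m in (1, 2, 3) if m <= pile]

def random_rollout(pile, my_turn, rng):
    """Step 2: play random moves to the end of the game.

    Returns True if 'we' take the last stone (and therefore win).
    """
    while pile > 0:
        pile -= rng.choice(legal_moves(pile))
        if pile == 0:
            return my_turn          # whoever just moved took the last stone
        my_turn = not my_turn
    # The pile was already empty: the player who moved before us took it.
    return not my_turn

def choose_move(pile, n_rollouts=2000, seed=0):
    rng = random.Random(seed)
    win_rate = {}
    for move in legal_moves(pile):                  # Step 1: candidate move
        wins = sum(
            random_rollout(pile - move, False, rng)  # Steps 2-3: simulate, record
            for _ in range(n_rollouts)               # Step 4: repeat many times
        )
        win_rate[move] = wins / n_rollouts
    best = max(win_rate, key=win_rate.get)           # Step 5: highest win rate
    return best, win_rate

best_move, rates = choose_move(5)
print(best_move, rates)
```

In this tiny game the random simulations happen to favour the move that leaves the opponent with four stones, which is also the best move under perfect play. In a game as large as Go, random play is a much noisier guide, which is the weakness the slide points out.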

Best in Class Go Algorithms The (previously) best in class Go algorithms refined this procedure by using Monte Carlo Tree Search (MCTS). MCTS basically uses decision trees to evaluate future moves. Unlike the previous example, which only evaluates the board state at the end of the game, MCTS evaluates the board state at every move. It does this through backpropagation – basically, it updates the value of child nodes, and then ‘guides’ the random process to select more promising victory paths. It still uses a ‘Roll out’ strategy, but child nodes with higher win rates are chosen more often to explore future moves. This approach will actually converge upon the optimal gameplay for a given move (a) as the number of simulations approaches infinity. Not only will it tell you the optimal next move, but it will tell you the optimal future moves for both players, for all future moves! You can do other things to enhance MCTS, like guiding it to predict expert human moves and focus on those sample spaces instead of all sample spaces. This approach eventually developed into strong amateur play, but was still no match for professional play.
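A compact sketch of the MCTS loop described above (select, expand, roll out, backpropagate) is shown below on the same kind of toy take-away game (remove 1 to 3 stones; taking the last stone wins). This is an illustration under stated assumptions: the slide does not say which selection rule was used, so the UCB1 formula and its constant `c=1.4` are one common choice picked here, and `Node`, `ucb1`, `rollout`, and `mcts` are hypothetical names.

```python
import math
import random

class Node:
    def __init__(self, pile, parent=None, move=None):
        self.pile = pile              # stones left for the player to move
        self.parent = parent
        self.move = move              # move that led from the parent to here
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= pile]
        self.visits = 0
        self.wins = 0.0               # wins for the player who moved INTO this node

def ucb1(child, parent_visits, c=1.4):
    """Win rate plus an exploration bonus (UCB1; an assumed selection rule)."""
    return child.wins / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(pile, rng):
    """Random playout; returns True if the player to move at `pile` wins."""
    to_move_wins, mover = False, True   # an empty pile means the mover already lost
    while pile > 0:
        pile -= rng.choice([m for m in (1, 2, 3) if m <= pile])
        to_move_wins = mover            # last player to move wins if pile hits 0
        mover = not mover
    return to_move_wins

def mcts(root_pile, n_iter=3000, seed=0):
    rng = random.Random(seed)
    root = Node(root_pile)
    for _ in range(n_iter):
        node = root
        # Selection: follow the best-scoring child while fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # Expansion: add one new child if any move is still untried.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.pile - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # Simulation: the player to move at `node` is the opponent of
        # the player who moved into it.
        result = not rollout(node.pile, rng)
        # Backpropagation: flip the result at each level going up.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if result else 0.0
            result = not result
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.move, {ch.move: ch.visits for ch in root.children}

print(mcts(5))
```

The visit counts returned alongside the move show the ‘guiding’ effect: simulations concentrate on the child with the higher win rate instead of being spread evenly as in plain rollouts.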

AlphaGo Algorithm Overview AlphaGo creates two very important improvements. It finds a way to efficiently restrict the Depth of search by approximating end game states (‘Value Network’). It improves Breadth of search by improving the starting parameters of MCTS (‘Policy Network’). One of the great things about MCTS is that you can set it to run for a fixed time (like 5 seconds) and it will return the best result it has found within that time frame. So, changing the starting parameters for MCTS greatly improves what it can achieve in that time by helping it focus on the strongest moves first. The policy network and the value network can be thought of as ‘training’ data – computations that happen before a game even begins.

Improving MCTS – Starting Parameters To Improve MCTS: AlphaGo essentially uses three different layers. Layer #1 predicts what an expert human would do. They create both a ‘full’ version (accurate but slow) and a ‘fast’ version (fast but less accurate) of this layer. Fast version = about 500,000 move selections per second (roughly 2 microseconds each). Full version = about 3 milliseconds per move selection (roughly 330 per second). Full version = 56% accuracy. Fast version = 25% accuracy. Known as the ‘SL Policy Network’ (Supervised Learning). Equivalent to ‘what is the probability an expert human would choose move (ai)?’ Full version – created from a Neural Net trained on 30 million positions from expert games. Fast version – created by a linear softmax function (mathematically, a normalized exponential function); used to enhance the MCTS rollout strategy. This ‘network’ does not produce a single output (i.e. place this stone here), but rather an entire probability distribution. (An expert human would move here 50% of the time, and here 25% of the time…)
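The ‘linear softmax’ behind the fast version can be illustrated in a few lines: each candidate move gets a score that is a weighted sum of its features, and the scores are turned into a probability distribution with a normalized exponential. The features, weights, and scores below are made up for illustration and are not AlphaGo’s actual inputs.

```python
import math

def softmax(scores):
    """Normalized exponential: turns arbitrary scores into probabilities."""
    m = max(scores)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_softmax_policy(move_features, weights):
    """Score each move as a weighted sum of its features, then apply softmax."""
    scores = [sum(w * x for w, x in zip(weights, feats)) for feats in move_features]
    return softmax(scores)

# Hypothetical scores chosen so the output matches the slide's example:
# one move at 50%, two others at 25% each.
probs = softmax([math.log(2), 0.0, 0.0])
print(probs)

# Hypothetical two-feature description of three candidate moves.
policy = linear_softmax_policy([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, -0.5])
print(policy)
```

Because the score is just a weighted sum, evaluating it is very cheap, which is why this kind of policy can be run hundreds of thousands of times per second inside rollouts.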

Layer #2 Layer #2 improves on Layer #1. Still predicting expert human moves. Layer 1 is used as the starting point for Layer 2. Layer 2 is known as the ‘RL Policy Network’ (Reinforcement Learning). RL = SL initially. Then, play a full game using the RL policy against a randomly selected previous iteration of itself. The policy network is then updated after the game is played. Then repeat, through an arbitrarily large number of games. The final RL model beats the initial SL model 80% of the time. Essentially, it is simulating games of expert human players. Remember, a policy network selects future moves based on a probability distribution.

Layer #3 Layer 3 makes vast improvements on the Depth of games. Remember, previous incarnations of MCTS simulate games all the way to the end. The ‘Value Network’ basically evaluates the probability of winning a game based on the current board state. Trained using Neural Networks again, based on 30 million distinct positions. Essentially, it generates a probability of winning, based on some theoretical board state (i.e. if I move ‘here’ next, what is my win %?). It actually achieves much of the same strength as the ‘Roll Out’ strategy used in a Monte Carlo rollout (using the RL policy), but uses about 15,000 times less computation.

Putting it all together The real innovation comes from putting the Policy Network and the Value Network together inside the MCTS. Remember, in classical MCTS, child nodes are updated through ‘backpropagation’ to update the ‘value’ of the node so the next simulation has a higher chance of choosing the node again if the node is correlated with winning. Both the policy and the value network actually reconceptualize this form of backpropagation. Ultimately, this reconceptualization leads to both a more complete search and a more efficient search.

Details on ‘The How’ Evaluate a parent node using the following strategy: Step 1 – Choose a parent node that represents a possible next move. Step 2 – Select the first child node (C1) from a probability distribution based on the SL policy. Step 3 – Explore one ‘Leaf’ of the child node (C1). Do this by: Step 3A – Expand the leaf of the child node using the current policy network. Probabilities of all moves on the leaf are stored. Step 3B – Evaluate the leaf node (i.e. the ‘grandchild’ node) in two ways: using the value network (trained by the RL policy) and running one rollout to the end of the game with the fast rollout policy. Mix the two results to come up with a single index. Step 4 – Update the quality of all nodes in the system: parent, child, and grandchild. Step 5 – Repeat this process on another possible next move. Choose possible next moves based on the strength of the updated nodes, with a bonus given to nodes that have had fewer explorations – to encourage further explorations.
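Two pieces of this procedure are simple arithmetic and can be sketched directly: the ‘mix’ in Step 3B and the exploration bonus in Step 5. The sketch below follows the form given in Silver et al. (a weighted average of the value-network estimate and the rollout outcome, with equal weights reported as working best, and a bonus proportional to the prior probability divided by one plus the visit count). The function names and the constant `c` are hypothetical, and the bonus is the simplified proportional form rather than the full formula.

```python
def mixed_leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    """Step 3B: blend the value network's estimate with one rollout result.

    lam = 0.5 weights the two sources equally; lam = 0 would use only the
    value network and lam = 1 only the rollout.
    """
    return (1 - lam) * value_net_estimate + lam * rollout_outcome

def selection_score(mean_value, prior, visits, c=1.0):
    """Step 5: node strength plus a bonus that shrinks as visits grow.

    `prior` is the policy network's probability for the move; `c` is an
    assumed scaling constant for illustration.
    """
    return mean_value + c * prior / (1 + visits)

# A leaf the value network rates at 0.6, whose single rollout ended in a win (1.0).
leaf = mixed_leaf_value(0.6, 1.0)

# The same move looks more attractive when it has been explored less.
fresh = selection_score(mean_value=0.5, prior=0.2, visits=0)
well_explored = selection_score(mean_value=0.5, prior=0.2, visits=9)
print(leaf, fresh, well_explored)
```

The decaying bonus is what lets the policy network steer early simulations toward plausible moves while the accumulated values take over as a node is visited more often.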

Results – How Much Improvement? Image taken directly from “Mastering the game of Go with deep neural networks and tree search” (2016, Nature)

Critiques Yes, the combination of MCTS + Value Network + Policy Network improved performance to near superhuman levels. BUT it required massive computational resources. A single computer, even a very powerful one, would likely not have beaten Fan Hui. The final version that beat Hui used 48 CPUs and another 8 GPUs. Fan Hui actually took 2 of 5 informal games when AlphaGo was required to use a 30 second timer (technically 3 periods of 30 second byoyomi). He almost won a 3rd. He lost 5 of 5 formal games (1 hour + 3 periods of 30 second byoyomi). It’s not clear that either the policy or the value network actually adds much, conceptually, to the problem of limiting the search space. It’s mostly that they used ‘a process’ that was able to take advantage of the massive amount of computational resources Google had. Other modifications of MCTS may lead to just as good or better results – but other research teams don’t have Google’s computational resources. In other words – Google used ‘Deep Neural Networks’ to solve this problem. But it’s not clear they needed to.

New Paper! On October 19th, just a couple of weeks ago, another paper was published in Nature by the makers of AlphaGo: “Mastering the game of Go without human knowledge”. DeepMind calls it ‘AlphaGo Zero’ because it does not use human games at all – the training set consists only of games it plays against itself. I haven’t had time to digest the new paper, but the broad strokes are: An Elo rating of over 5,000 was achieved. AlphaGo Zero beat AlphaGo 100 games to 0. It does not use two neural networks (a ‘policy’ network to select the next move and a ‘value’ network to predict the winner). These networks have been combined into one network. It does not use rollouts to simulate games. It simply uses the estimate of the single neural network to establish board strength.

DeepMind and Higher Ed Can AlphaGo’s Algorithm be used in higher ed? Not really. It’s the ability to create such algorithms that is of interest. (It’s possible that the algorithm might have some applicability to scheduling, with sufficient modification.) So far, all of the problems AI is solving have crystal clear outcome states. Many of these outcome states also have crystal clear causal pathways OR are uninterested in understanding or manipulating those pathways. Recommendation engines don’t care why you want something and don’t want to change what you want. They just want to know what you want. But most of the problems in Higher Ed either do not have clear outcome states or need to estimate causation from hundreds of variables. There are some exceptions.

Do I Think AlphaGo Learns? I’m going to argue no. But I think my case is weak. My biggest issue is that it doesn’t learn from mistakes. It’s not clear that it makes mistakes at all. Is training data that shows a losing position a ‘mistake’? Is it too pre-programmed? It is primarily a form of prediction, rather than learning. Does that even matter? Is there even a meaningful distinction between prediction and learning? AlphaGo Zero trains against itself. So it appears to make mistakes and ‘learn’ from them. But it also takes millions of games to establish a training set. Humans need far fewer than that. It still does not appear to strategize in general.

Discussion That’s the end. Now let’s talk!