
1 Alpha Go …and Higher Ed Reuben Ternes Oakland University
MI AIR, Nov. 2017 Contact:

2 DeepMind Google’s DeepMind recently created an algorithm that has now beaten the world’s highest-rated Go player. Why does that matter? How did they do it? And what does it have to do with the future of Machine Learning, AI, and Higher Education?

3 Go What is it? Why is mastering it seen as an important achievement in AI?
Was it really that much of a surprise?

4 Go – The Game
Objective: Capture the most territory by placing one stone at a time.
Rules:
Black places one stone, then white, then black again, then white, and so on.
Stones are placed only on intersections.
A stone, or group of solidly connected stones, is captured (removed) when all intersections directly adjacent to it are occupied by the enemy.
No stone may be placed so as to recreate a former board position.

5 More About Go Go falls into a class of games known as ‘Perfect Information’ games: nothing is hidden from either player, unlike Poker. It’s a lot like chess (in some ways). All perfect information games can be solved computationally by simply computing all possible moves and selecting the next move that guarantees a path to victory. Mathematically, the number of possible ‘game states’ is roughly B^D, where B is the number of legal moves (Breadth) and D is the number of moves left in the game (Depth). In Chess, the search space is about 35^80; in Go, it is about 250^150. That’s more possible board states than there are suspected atoms in the universe. Solving this kind of computationally difficult problem has long been seen within the AI community as something akin to human intelligence.
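To put those exponents in perspective, here is a quick Python check of the magnitudes (the branching factors and game lengths, 35 and 80 for chess, 250 and 150 for Go, are the approximate figures cited by Silver et al.):

```python
# Rough scale of the B^D game-tree estimate, using approximate
# branching factor (B) and game length (D) for chess and Go.
import math

def log10_states(b: float, d: float) -> float:
    """log10(b**d): the number of decimal digits in the state count."""
    return d * math.log10(b)

print(f"Chess: 35^80   ~ 10^{log10_states(35, 80):.0f} states")    # ~10^123
print(f"Go:    250^150 ~ 10^{log10_states(250, 150):.0f} states")  # ~10^360
```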

6 Is the Achievement Over-Hyped?
Not all members of the AI community think that solving Go is that impressive. In particular, the game has simple rules, and a perfect solution to all game states is possible. Given enough computational power, previously explored algorithms would actually be sufficient to create a ‘perfect move’ algorithm. Still, the large number of potential moves, and how to prioritize strategies for exploring them, daunted programmers for decades. Go has long had a history of active AI research. In the months leading up to DeepMind’s publication, several researchers noted improvements to past algorithms. These improvements were mostly lost on the public at large during the media blitz surrounding AlphaGo.

7 About the Algorithm Ok. Let’s get technical!
Essentially, all of these details come from “Mastering the game of Go with deep neural networks and tree search,” Silver et al.’s article in Nature (January 2016). There’s a newer paper, released just a couple of weeks ago – more on this later. The basic framework revolves around reducing the search space so that computing the ‘best next move’ becomes a whole lot easier.

8 AlphaGo Framework for Innovation
The major claimed innovation in AlphaGo is the concept of a ‘Value Network’ and a ‘Policy Network’. These ‘networks’ reduce B and D, so efficient search becomes feasible. The Networks: Value Network = an effective evaluation of the strength of a position (i.e. board state) – the probability of winning after move (ai). Policy Network = a network that can be used to sample possible future moves (for both players) – a probability distribution over future moves. Both of these ‘things’ are called ‘networks’ probably because the authors really like Neural Networks (NN) and used NNs to create these probabilities.

9 Sidenote: Neural Nets I am not going to talk about Neural Nets in this talk – describing how they work is an entirely separate topic. For those interested, there are some great introductory videos and blogs on Neural Nets; this one is especially good: https://www.youtube.com/watch?v=aircAruvnKk

10 Restricting the Search Space of D
Reduce D (game length). You do this by defining a ‘Position of Power’ from which you know you can win. The further from the end game you can define this Position of Power, the smaller your search space becomes. A ‘Position of Power’ can be approximated with a probability of winning – formally, an ‘Approximate Value Function’ that predicts the outcome of the game from a particular board state S. (This will become the ‘Value Network’.) For example, in Chess: winning endgame states occur when it is your turn, your opponent has only a King, and you have your King plus one of the following: a Queen, a Rook, two Bishops, or a certain number of pawns in certain positions. By knowing these endgames, you can greatly reduce your search space.
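To make the depth-reduction idea concrete, here is a minimal, hypothetical sketch (not AlphaGo’s code): a depth-limited search that stops early and trusts an approximate value function at the frontier. The game-interface functions `legal_moves`, `play`, and `value_fn` are placeholders.

```python
# Minimal sketch: cut the search off at a fixed depth and substitute an
# approximate value function v(s) for playing the game out to the end.
def truncated_search(state, depth, value_fn, legal_moves, play):
    """Negamax-style search; value_fn(state) approximates the outcome
    from the current player's point of view (the 'position of power')."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return value_fn(state)
    return max(-truncated_search(play(state, m), depth - 1,
                                 value_fn, legal_moves, play)
               for m in moves)
```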

11 Restricting the Search Space of B
Many modern Go algorithms do not restrict D. Instead, they focus on B. Basically, previous attempts created a heuristic to determine future moves. Monte Carlo ‘Rollouts’ are the most popular heuristic (‘rollout’ = play all the way to the end of the game). Step 1 – Choose a future move (ai) to evaluate (usually at random). Step 2 – Simulate an entire game from (ai) by selecting all additional future moves at random. Step 3 – Record a win or loss for (ai). Step 4 – Repeat many times. Step 5 – Choose the next move that has the highest recorded win percentage across your random simulations. This method was used to achieve near-perfect performance in backgammon and Scrabble, but such a procedure could not best even moderately good Go players. Note that with so many random elements, only a small fraction of possible moves can be explored.
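A minimal sketch of that rollout procedure, under assumptions: `legal_moves`, `play`, and `winner` are hypothetical game-interface functions, and moves are assumed hashable.

```python
# Pure Monte Carlo rollout move selection (steps 1-5 above).
import random
from collections import defaultdict

def rollout_move(state, player, legal_moves, play, winner, n_sims=1000):
    wins, visits = defaultdict(int), defaultdict(int)
    for _ in range(n_sims):
        first = random.choice(legal_moves(state))   # Step 1: pick a candidate move
        s = play(state, first)
        while legal_moves(s):                       # Step 2: random playout to the end
            s = play(s, random.choice(legal_moves(s)))
        visits[first] += 1
        if winner(s) == player:                     # Step 3: record win/loss for that move
            wins[first] += 1
    # Step 5: choose the move with the best simulated win rate
    return max(visits, key=lambda m: wins[m] / visits[m])
```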

12 Best in Class Go Algorithms
The (previously) best-in-class Go algorithms refined this procedure by using Monte Carlo Tree Search (MCTS). MCTS basically uses decision trees to evaluate future moves. Unlike the previous example, which only evaluates the board state at the end of the game, MCTS evaluates the board state at every move. It does this through backpropagation – basically, it updates the value of child nodes and then ‘guides’ the random process toward more promising victory paths. It still uses a ‘rollout’ strategy, but child nodes with higher win rates are chosen more often to explore future moves. This approach will actually converge on optimal gameplay for a given move (a) as the number of samples approaches infinity. Not only will it tell you the optimal next move, it will tell you the optimal future moves for both players! You can do other things to enhance MCTS, like guiding it to predict expert human moves and focus on those parts of the sample space instead of all of it. This approach eventually developed into strong amateur play, but was still no match for professional players.
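A compact, hypothetical sketch of the MCTS loop described above (selection with an exploration bonus, one rollout per iteration, and backpropagation of the result). The game interface (`legal_moves`, `play`, `winner`) is assumed, and the win bookkeeping is simplified to a single fixed perspective rather than alternating players.

```python
import math, random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.wins, self.visits = [], 0, 0

def untried_moves(node, legal_moves):
    tried = {child.move for child in node.children}
    return [m for m in legal_moves(node.state) if m not in tried]

def mcts(root_state, player, legal_moves, play, winner, n_iter=1000, c=1.4):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while fully expanded, preferring children with
        #    a high win rate plus an exploration bonus (UCT).
        while node.children and not untried_moves(node, legal_moves):
            node = max(node.children,
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one previously untried child move.
        untried = untried_moves(node, legal_moves)
        if untried:
            move = random.choice(untried)
            child = Node(play(node.state, move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Rollout: play random moves to the end of the game.
        state = node.state
        while legal_moves(state):
            state = play(state, random.choice(legal_moves(state)))
        result = 1 if winner(state) == player else 0
        # 4. Backpropagation: update win/visit counts up the path.
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent
    # The most-visited child of the root is the recommended next move.
    return max(root.children, key=lambda ch: ch.visits).move
```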

13 AlphaGo Algorithm Overview
AlphaGo makes two very important improvements. It finds a way to efficiently restrict the Depth of search by approximating end-game states (the ‘Value Network’), and it improves the Breadth of search by improving the starting parameters of MCTS (the ‘Policy Network’). One of the great things about MCTS is that you can run it on a fixed time budget (like 5 seconds) and it will spit out its best result after that time frame. So, changing the starting parameters for MCTS greatly improves what it can do in that runtime by helping it focus on the strongest moves first. The policy network and the value network can be thought of as ‘training’ – computation that happens before a game even begins.

14 Improving MCTS – Starting Parameters
To improve MCTS, AlphaGo essentially uses three different layers.
Layer #1 predicts what an expert human would do. They create both a ‘full’ version of this layer (accurate but slow) and a ‘fast’ version (fast but less accurate).
Fast version = roughly 500,000 move selections per second. The full version is far slower.
Full version = 56% accuracy. Fast version = 25% accuracy.
Known as the ‘SL Policy Network’ (Supervised Learning): equivalent to asking ‘what is the probability an expert human would choose move (ai)?’
Full (slow) version – created from a Neural Net trained on 30 million positions from expert games.
Fast version – created by a linear softmax function (mathematically, a normalized exponential function); used to enhance the MCTS rollout strategy.
This ‘network’ does not output a single move (i.e. place this stone here) but an entire probability distribution. (An expert human would move here 50% of the time, and here 25% of the time…)
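As an illustration only (the real SL network is a deep convolutional net and the fast policy uses hand-crafted features), here is what a linear-softmax move policy looks like in miniature; `move_features` and `weights` are placeholders, not the paper’s:

```python
import numpy as np

def softmax_policy(state, legal_moves, move_features, weights):
    """Return a probability distribution over legal moves from linear scores."""
    feats = np.array([move_features(state, m) for m in legal_moves])  # shape (M, F)
    logits = feats @ weights                                          # one score per move
    probs = np.exp(logits - logits.max())                             # numerically stable softmax
    probs /= probs.sum()
    return dict(zip(legal_moves, probs))  # e.g. {move_a: 0.50, move_b: 0.25, ...}
```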

15 Layer #2 Layer #2 improves on Layer #1.
Still predicting expert human moves. Layer 1 is used as a direct input for Layer 2.
Layer 2 is known as the ‘RL Policy Network’ (Reinforcement Learning).
RL = SL initially. Then play a full game using the RL policy against a randomly selected previous iteration of itself. The policy network is updated after the game is played. Then repeat, through an arbitrarily large number of games.
The final RL model beats the initial SL model 80% of the time.
Essentially, it is simulating games between expert human players. Remember, a policy network selects future moves based on a probability distribution.
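A heavily simplified sketch of that self-play update, in the spirit of REINFORCE. Everything here is a placeholder, not DeepMind’s code: `policy_net` is the current policy, `play_self_play_game` is assumed to return the log-probabilities of the moves the network chose plus the final outcome (+1 win, -1 loss), and the optimizer is supplied by the caller.

```python
import copy, random
import torch

def reinforce_step(policy_net, opponent_pool, play_self_play_game, optimizer):
    opponent = random.choice(opponent_pool)          # a past iteration of itself
    log_probs, outcome = play_self_play_game(policy_net, opponent)
    # Nudge the policy toward its own moves if it won, away if it lost.
    loss = -outcome * torch.stack(log_probs).sum()   # REINFORCE gradient estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    opponent_pool.append(copy.deepcopy(policy_net))  # grow the pool of past selves
```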

16 Layer #3 Layer 3 makes a vast improvement on the Depth of games.
Remember, previous incarnations of MCTS simulate games all the way to the end. The ‘Value Network’ instead evaluates the probability of winning a game based on the current board state. It is trained using Neural Networks again, based on 30 million distinct positions. Essentially, it generates a probability of winning from some theoretical board state (i.e. if I move ‘here’ next, what is my win %?). It achieves much of the same strength as the ‘rollout’ strategy used in Monte Carlo rollouts (using the RL policy), but uses about 15,000 times less computational power.
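For flavor, a toy version of what a ‘value network’ looks like: a network that maps a board tensor straight to an estimated win probability. The architecture, input encoding, and training pipeline in the paper are much larger; everything below is illustrative only.

```python
import torch
import torch.nn as nn

class ToyValueNet(nn.Module):
    """Maps a board state to an estimated probability of winning."""
    def __init__(self, board_size=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * board_size * board_size, 1),
        )

    def forward(self, board):                    # board: (N, 1, 19, 19) stone planes
        return torch.sigmoid(self.net(board))    # win probability in [0, 1]
```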

17 Putting it all together
The real innovation comes from putting the Policy Network and the Value Network together inside MCTS. Remember, in classical MCTS, ‘backpropagation’ updates the ‘value’ of each child node so that the next simulation has a higher chance of choosing that node again if it is correlated with winning. Both the policy network and the value network reconceptualize this form of backpropagation. Ultimately, this reconceptualization leads to both a more complete search and a more efficient one.

19 Details on ‘The How’ Evaluate a parent node using the following strategy:
Step 1 – Choose a parent node that represents a possible next move.
Step 2 – Select the first child node (C1) from a probability distribution based on the SL policy.
Step 3 – Explore one ‘leaf’ of the child node (C1). Do this by:
Step 3A – Expanding the leaf of the child node using the current policy network. The probabilities of all moves on the leaf are stored.
Step 3B – Evaluating the leaf node (i.e. the ‘grandchild’ node) in two ways: using the value network (trained by the RL policy) and running one rollout to the end of the game with the fast rollout policy. Mix the two results into a single index.
Step 4 – Update the quality of all nodes in the system: parent, child, and grandchild.
Step 5 – Repeat this process on another possible next move. Choose possible next moves based on the strength of the updated nodes, with a bonus given to nodes that have had fewer explorations – to encourage further exploration.
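Step 3B boils down to a simple mixing rule. A one-line sketch (the paper blends the value-network estimate and the single rollout outcome with a weighting parameter; treat the exact weight here as illustrative):

```python
def leaf_evaluation(value_net_estimate, rollout_outcome, lam=0.5):
    """Blend the value network's estimate with one fast-rollout result."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome
```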

19 Results – How Much Improvement?
Image taken directly from “Mastering the game of Go with deep neural networks and tree search” (2016, Nature)

20 Critiques Yes, the combination of MCTS + Value Network + Policy Network improved performance to near-superhuman levels. BUT:
It required massive computational resources. A single computer, even a very powerful one, would likely not have beaten Fan Hui. The final version that beat Hui used 48 CPUs and another 8 GPUs.
Fan Hui actually took 2 of 5 informal games when AlphaGo was required to use a 30-second timer (technically, 3 periods of 30-second byoyomi), and he almost won a 3rd. He lost 5 of 5 formal games (1 hour main time plus byoyomi).
It’s not clear that either the policy network or the value network actually adds much, conceptually, to the problem of limiting the search space. It’s mostly that they used ‘a process’ that was able to take advantage of the massive computational resources Google had. Other modifications of MCTS may lead to results just as good or better – but other research teams don’t have Google’s computational resources.
In other words – Google used ‘Deep Neural Networks’ to solve this problem. But it’s not clear they needed to.

21 New Paper! On October 19th, just a couple of weeks ago, another paper was published in Nature by the makers of AlphaGo: “Mastering the game of Go without human knowledge.” DeepMind calls the new player ‘AlphaGo Zero’ because it does not use human games at all – its training set consists only of games it plays against itself. I haven’t had time to digest the new paper, but the broad strokes are: It achieved an Elo rating of over 5,000. It beat AlphaGo 100 games to 0. It does not use two separate neural networks (a ‘policy’ network to select the next move and a ‘value’ network to predict the winner); these have been combined into one network. It does not use rollouts to simulate games; it simply uses the single neural network’s estimate to establish board strength.
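A toy sketch of that ‘one network, two heads’ idea: a shared trunk with a policy head (a distribution over moves) and a value head (an estimated outcome), which MCTS can query at leaf nodes instead of running rollouts. Layer sizes and names are illustrative only, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class ToyPolicyValueNet(nn.Module):
    def __init__(self, board_size=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        n_hidden = 64 * board_size * board_size
        self.policy_head = nn.Linear(n_hidden, board_size * board_size + 1)  # points + pass
        self.value_head = nn.Linear(n_hidden, 1)

    def forward(self, board):                      # board: (N, 1, 19, 19)
        h = self.trunk(board)
        move_logits = self.policy_head(h)          # prior over next moves
        value = torch.tanh(self.value_head(h))     # expected outcome in [-1, 1]
        return move_logits, value
```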

22 DeepMind and Higher Ed Can AlphaGo’s Algorithm be used in higher ed?
Not really. It’s the ability to create such algorithms that is of interest. (It’s possible that the algorithm might have some applicability to scheduling, with sufficient modification.) So far, all of the problems AI is solving have crystal-clear outcome states. Many of these outcome states also have crystal-clear causal pathways OR don’t require understanding or manipulating those pathways: recommendation engines don’t care why you want what you want and don’t try to change what you want – they just want to know what you want. But most of the problems in Higher Ed either do not have clear outcome states or require estimating causation from hundreds of variables. There are some exceptions.

23 Do I Think AlphaGo Learns?
I’m going to argue no, but I think my case is weak. My biggest issue is that it doesn’t learn from mistakes – it’s not clear that it makes mistakes at all. Is training data that shows a losing position a ‘mistake’? Is it too pre-programmed? It is primarily a form of prediction rather than learning. Does that even matter? Is there even a meaningful distinction between prediction and learning? AlphaGo Zero trains against itself, so it appears to make mistakes and ‘learn’ from them. But it also takes millions of games to build its training set; humans need far fewer than that. And it still does not appear to strategize in any general way.

24 Discussion That’s the end. Now let’s talk!

