Applications of Machine Learning to the Game of Go
David Stern, Applied Games Group, Microsoft Research Cambridge
(working with Thore Graepel, Ralf Herbrich, David MacKay)

Contents The Game of Go Uncertainty in Go Move Prediction Territory Prediction Monte Carlo Go

The Game of Go
Started about 4000 years ago in ancient China. About 60 million players worldwide.
2 players: Black and White. Board: 19×19 grid.
Rules: –Turn: one stone placed on a vertex. –Capture.
Aim: Gather territory by surrounding it.
[Figure annotations: a stone is captured by completely surrounding it; a chain is captured all together; white and black territory shown.]

One eye = death

Two eyes = life

Computer Go
11th May 1997: Garry Kasparov beaten by Deep Blue. Best Go programs cannot beat amateurs. Go recognised as grand challenge for AI. Müller, M. (2002). Computer Go. Artificial Intelligence, 134.

Computer Go
Minimax search is defeated by:
–High branching factor (Go: ~200; Chess: ~35).
–Complex position evaluation: a stone's value derives from the configuration of the surrounding stones.
[Figure: lookahead / evaluation / minimax diagram.]

Use of Knowledge in Computer Go
Trade-off between search and knowledge. Most current Go programs use hand-coded knowledge:
1. Slow knowledge acquisition.
2. Tuning hand-crafted heuristics is difficult.
Previous approaches to automated knowledge acquisition:
–Neural networks (Erik van der Werf et al., 2002).
–Exact pattern matching (Frank de Groot, 2005; Bouzy, 2002).

Uncertainty in Go
Go is a game of perfect information, but the complexity of the game tree plus limited computer speed → uncertainty. Go players have a word for this latent potential: 'aji' (= 'taste'). Represent uncertainty using probabilities.

Move Prediction Learning from Expert Game Records

Pattern Matching for Move Prediction
Move associated with a set of patterns:
–Exact arrangement of stones.
–Centred on proposed move.
Sequence of nested templates.

Patterns
[Figures: a sequence of nested pattern templates of increasing size, centred on the proposed move (marked '?').]

13 Pattern Sizes
–Smallest is vertex only.
–Biggest is full board.

Pattern Matching
Pattern information stored in hash table:
–Constant-time access.
–No need to store patterns explicitly.
Need rapid incremental hash function:
–Commutative.
–Reversible.
64-bit random numbers for each template vertex, one for each of {black, white, empty, off}. Combine with XOR (Zobrist, 1970).
[Figure: key built by XORing the random numbers, e.g. for 'black at (-1,2)' and 'empty at (1,1)'.]

Pattern Hash Key
Transformation-invariant key: compute the key for each transformed version of the pattern (rotations, reflections, colour inversion); the MIN of the transformed keys gives invariance.
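
A minimal sketch of this hashing scheme in Python; the offset range, state names, and function names are illustrative, not from the talk:

```python
import random

# One 64-bit random number per (template offset, vertex state).
STATES = ("black", "white", "empty", "off")
random.seed(0)
ZOBRIST = {(dx, dy, s): random.getrandbits(64)
           for dx in range(-9, 10)
           for dy in range(-9, 10)
           for s in STATES}

def pattern_key(vertices):
    """Zobrist key of a pattern: XOR of the random numbers of its vertices.

    `vertices` maps (dx, dy) offsets (relative to the proposed move)
    to one of the four states. XOR is commutative, and reversible:
    XORing the same number again removes that vertex from the key.
    """
    key = 0
    for (dx, dy), state in vertices.items():
        key ^= ZOBRIST[(dx, dy, state)]
    return key

def update_key(key, dx, dy, old_state, new_state):
    """Incremental update when a single vertex changes state."""
    key ^= ZOBRIST[(dx, dy, old_state)]   # remove old contribution
    key ^= ZOBRIST[(dx, dy, new_state)]   # add new contribution
    return key

def invariant_key(transformed_patterns):
    """Transformation-invariant key: MIN over the keys of all
    symmetry-transformed versions of the pattern."""
    return min(pattern_key(p) for p in transformed_patterns)
```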

Harvesting
Automatically harvest patterns from game records: 180,000 games × 250 moves × 13 pattern sizes… …gives ~600 million potential patterns. Need to limit the number stored:
–Space.
–Generalisation.
Keep all patterns played more than n times. Bloom filter: approximate test for set membership with minimal memory footprint (Bloom, 1970).
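
A Bloom filter answers "possibly in set" or "definitely not in set" from a fixed bit array, which is why it suits counting pattern occurrences over ~600 million candidates. A minimal sketch, with illustrative sizes and hash scheme (not from the talk):

```python
import hashlib

class BloomFilter:
    """Approximate set membership with minimal memory (Bloom, 1970).

    May return false positives, never false negatives.
    """
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Harvesting idea: only promote a pattern key to the real table once the
# filter has seen it before, so one-off patterns cost no table storage.
```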

Relative Frequencies of Pattern Sizes
[Plot] Smaller patterns matched later in the game; big patterns matched at the beginning of the game.

Training
First phase: harvest. Second phase: training. Use the same games for both. Represent a move by its largest pattern only.

Matching Largest Pattern
[Figure: the largest matching pattern indexes a table entry holding a Gaussian skill, e.g. μ=4.5, σ=0.2.]

Bayesian Ranking Model
Pattern value: $u_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$.
Latent urgency: $x_i \sim \mathcal{N}(u_i, \beta^2)$.

Bayesian Ranking Model

Training Example: Chosen move (pattern). Set of moves (patterns) not chosen.

Bayesian Ranking Model

Online Learning from Expert Games
Training example:
–Chosen move (pattern).
–Set of moves (patterns) not chosen.
Posterior: approximate (Gaussian) posterior determined by Gaussian message passing.
Online learning (assumed density filtering):
–After each training example we have a new μ and σ for each pattern.
–Replace the values and go to the next position.

Message Passing
[Figure: messages between a factor f and its neighbouring variables.]

Marginal Calculation

Gaussian Message Passing
All messages Gaussian! Most factors Gaussian. Messages from 'ordering' factors are approximated by expectation propagation:
–True marginal distribution: $p(x) \propto \mathbb{I}(x > y)\,\mathcal{N}(x; \mu, \sigma^2)$ (a truncated Gaussian).
–Approximation: $q(x) = \mathcal{N}(x; \hat\mu, \hat\sigma^2)$.
–Moment match: choose $\hat\mu = \mathbb{E}_p[x]$ and $\hat\sigma^2 = \operatorname{Var}_p[x]$.
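
As one concrete instance, suppose the ordering factor leaves a Gaussian truncated to x > 0 (the exact factorisation is not shown on these slides); the moment-matched Gaussian then has a closed form via the standard v and w correction functions:

```python
import math

def _pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def _cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def moment_match_truncated(mu, sigma):
    """Gaussian N(mu, sigma^2) truncated to x > 0: return the mean and
    standard deviation of the moment-matched Gaussian approximation."""
    t = mu / sigma
    v = _pdf(t) / _cdf(t)   # additive correction to the mean
    w = v * (v + t)         # multiplicative shrinkage of the variance
    new_mu = mu + sigma * v
    new_sigma = sigma * math.sqrt(1.0 - w)
    return new_mu, new_sigma
```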

Gaussian Message Passing

Gaussian Message Passing – EP Approximation

Gaussian Message Passing – Posterior Calculation

Move Prediction Performance
[Plot; 100 moves / second.]

Rank Error vs Game Phase

Rank Error vs Pattern Size

Hierarchical Gaussian Model of Move Values
Use all patterns that match, not just the biggest:
–Evidence from a larger pattern should dominate a smaller pattern at the same location.
–However, big patterns are seen less frequently.
Hierarchical model of move values (one way to write it is sketched below).
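
One natural parameterisation of such a hierarchy (an assumption; the slides only describe the idea): the value of each pattern size is drawn around the value of the next-smaller pattern at the same location, so rarely-seen large patterns shrink towards their well-estimated parents:

$u_1 \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad u_k \sim \mathcal{N}(u_{k-1}, \tau^2), \quad k = 2, \dots, 13,$

where $u_k$ is the value of the $k$-th nested pattern size.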

Pattern Hierarchy
[Figure: nested pattern templates arranged in a hierarchy from smallest to largest.]

Hierarchical Gaussian Model
[Figure: graphical model linking the values of nested patterns.]

Move Prediction Performance

Predictive Probability

Territory Prediction

Territory
1. Empty intersections surrounded.
2. The stones themselves (Chinese method).

Predicting Territory
[Figures: one Go position shown with three alternative territory hypotheses.]

Predicting Territory
Board position: $c$. Territory outcome: $s$.
Model distribution: $p(s \mid c)$. Expected black score: $E[\sum_i s_i \mid c]$.
Boltzmann machine (CRF).

Territory Prediction Model
Boltzmann machine (Ising model):
$p(s \mid c) \propto \exp\big(\sum_{(i,j)} w_{ij}\, s_i s_j + \sum_i b_i\, s_i\big)$
–Couplings $w_{ij}$ between neighbouring vertices $i$ and $j$.
–Biases $b_i$ at each vertex.

Game Records
1,000,000 expert games, some labelled with final territory outcome.
Training data: pairs of game position $c_i$ and territory outcome $s_i$.
Maximise log-likelihood: $\sum_i \log p(s_i \mid c_i)$.
Inference by Swendsen-Wang sampling (sketched below).
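
A minimal Swendsen-Wang sweep for an Ising model with per-vertex biases, to illustrate the sampler named above. The ferromagnetic (non-negative coupling) assumption, the data layout, and all names are illustrative, not from the talk:

```python
import math
import random

def swendsen_wang_sweep(spins, couplings, biases):
    """One Swendsen-Wang sweep over an Ising model.

    spins:     dict vertex -> +1/-1
    couplings: dict {frozenset({u, v}): J_uv} with J_uv >= 0
               (ferromagnetic; negative couplings need a different scheme)
    biases:    dict vertex -> b_v
    Model: p(s) proportional to exp(sum J_uv s_u s_v + sum b_v s_v).
    """
    # 1. Place a bond between equal neighbouring spins
    #    with probability 1 - exp(-2 J_uv).
    neighbours = {v: [] for v in spins}
    for edge, j in couplings.items():
        u, v = tuple(edge)
        if spins[u] == spins[v] and random.random() < 1.0 - math.exp(-2.0 * j):
            neighbours[u].append(v)
            neighbours[v].append(u)

    # 2. Find clusters (connected components of the bond graph); these are
    #    the "regions of common fate". Resample each cluster's spin as a
    #    block, using only the summed biases of its vertices.
    unseen = set(spins)
    while unseen:
        root = unseen.pop()
        cluster, frontier = [root], [root]
        while frontier:
            x = frontier.pop()
            for y in neighbours[x]:
                if y in unseen:
                    unseen.remove(y)
                    cluster.append(y)
                    frontier.append(y)
        # p(cluster = +1) / p(cluster = -1) = exp(2 * sum of biases)
        b_sum = sum(biases[v] for v in cluster)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * b_sum))
        new_spin = 1 if random.random() < p_plus else -1
        for v in cluster:
            spins[v] = new_spin
    return spins
```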

1,000,000 Expert Games…
Example: So Yokoku [6d] (black) vs. Kato Masao [9d] (white).

Final Position

Final Position + Territory

Final Territory Outcome

Position + Final Territory Train CRF by Maximum Likelihood

Position + Boltzmann Sample (generated by Swendsen-Wang)
[Figure annotations: hypotheses that certain stones are or are not captured; one area owned by white, one side owned by black.]

Position + Boltzmann Sample (generated by Swendsen-Wang)
[Figure annotations: hypothesis that certain stones are captured; black controls one region.]

Sample with Swendsen-Wang, Clusters Shown
Bonds link regions of common fate.

Boltzmann Machine – Expectation over Swendsen-Wang Samples
Size of squares indicates degree of certainty; no square indicates uncertainty.
[Figure annotation: certain stones are probably going to be captured.]

Position + Boltzmann Sample (generated by Swendsen-Wang)
[Figure annotation: Illegal!]

Monte Carlo Go (work in progress)

Monte Carlo Go
[Figures: a Go position maps to territory hypotheses in two ways – via the Boltzmann machine, and via Monte Carlo 'rollouts'.]

Monte Carlo Go
'Rollout' or 'Playout':
–Complete game from current position to end.
–Do not fill in own eyes.
–Score at final position easily calculated.
1 sample = 1 rollout.
Brügmann's Monte Carlo Go (MC):
–For each available move m sample s rollouts.
–At each position, moves selected uniformly at random.
–Rollout value: $z_k \in \{1, 0\}$ (win or loss).
–Move value: $V(m) = \frac{1}{s} \sum_{k=1}^{s} z_k$.
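
A sketch of Brügmann-style move evaluation by uniform random rollouts. The board interface (legal_moves, is_own_eye, play, winner, and so on) is assumed, not defined in the talk:

```python
import random

def rollout(board, player):
    """Play uniformly random legal moves (never filling own eyes)
    until the game ends; return the winner."""
    while not board.game_over():
        moves = [m for m in board.legal_moves(player)
                 if not board.is_own_eye(m, player)]
        if moves:
            board.play(random.choice(moves), player)
        else:
            board.pass_turn(player)
        player = board.opponent(player)
    return board.winner()

def move_value(board, move, player, num_rollouts):
    """V(m) = (1/s) * sum of z_k: the fraction of s rollouts won
    after playing `move` from the current position."""
    wins = 0
    for _ in range(num_rollouts):
        b = board.copy()
        b.play(move, player)
        if rollout(b, b.opponent(player)) == player:
            wins += 1
    return wins / num_rollouts
```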

Adaptive Monte Carlo Planning
Update values of all positions in the rollout:
–Store value (distribution) for each node.
–Store tree in memory.
Bootstrap policy:
–UCT.
–Strong play on small boards, e.g. 'MoGo' (Sylvain Gelly).
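
UCT selects the child maximising the UCB1 upper confidence bound on the stored win rate. A sketch; the Node fields and the exploration constant are illustrative:

```python
import math

class Node:
    def __init__(self):
        self.visits = 0      # times this node was seen in rollouts
        self.wins = 0        # rollouts through this node that were won
        self.children = {}   # move -> Node

def uct_select(node, c=math.sqrt(2)):
    """Pick the (move, child) maximising win rate plus exploration bonus.
    Assumes node.visits >= 1, i.e. selection happens after expansion."""
    log_n = math.log(node.visits)

    def ucb1(child):
        if child.visits == 0:
            return float("inf")   # try unvisited moves first
        return child.wins / child.visits + c * math.sqrt(log_n / child.visits)

    return max(node.children.items(), key=lambda kv: ucb1(kv[1]))
```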

Adaptive Monte Carlo Go
[Figures: a search tree grown over successive rollouts, each ending in a win (W) or loss (L); each node stores its statistics, e.g. 'this node seen 3 times, win 2/3 times'.]

Bayesian Model for Policy
Prior: Gaussian over each move's value.
Result of rollout = WIN (true | false).
Moment match a Gaussian to the resulting marginal after each rollout.

‘Bayesian’ Adaptive MC …

'Bayesian' Adaptive MC (BMC)
Policy:
–Sample from the distribution for each available move.
–Pick best.
Exploitation vs exploration: automatically adapted.
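
This policy is Thompson-style sampling over Gaussian value beliefs: draw one sample per move and play the argmax, so uncertain moves win the draw often enough to be explored. A sketch; the per-move (μ, σ) store is an assumption:

```python
import random

def bmc_policy(beliefs):
    """beliefs: dict move -> (mu, sigma), a Gaussian over the move's value.
    Sample each move's value once and pick the best; high-variance
    (uncertain) moves get explored automatically."""
    return max(beliefs, key=lambda m: random.gauss(*beliefs[m]))
```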

Exploitation vs Exploration

P-Game Trees
Moves have numerical values:
–MAX moves drawn uniformly from [0,1].
–MIN moves drawn uniformly from [-1,0].
Value of a leaf is the sum of the moves from the root. If leaf value > 0 then win for MAX; if leaf value < 0 then loss for MAX. Assign win/loss to all nodes via minimax.
Qualitatively more like a real Go game tree. Can simulate the addition of domain knowledge.
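
A sketch of P-game tree construction and minimax labelling under the rules above (tie handling at exactly zero is not specified on the slides; this sketch counts it as a loss for MAX):

```python
import random

def build_pgame(depth, branching, max_to_move=True, value=0.0):
    """Return (win_for_MAX, children) for a random P-game subtree.
    `value` accumulates the sum of move values from the root."""
    if depth == 0:
        return value > 0, []          # leaf: win for MAX iff the sum > 0
    lo, hi = (0.0, 1.0) if max_to_move else (-1.0, 0.0)
    children = [build_pgame(depth - 1, branching, not max_to_move,
                            value + random.uniform(lo, hi))
                for _ in range(branching)]
    child_wins = [w for w, _ in children]
    # Minimax: MAX wins if some child is a MAX win; at a MIN node,
    # MAX wins only if every child is a MAX win.
    win = any(child_wins) if max_to_move else all(child_wins)
    return win, children

root_win, tree = build_pgame(depth=7, branching=2)  # small example
```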

Monte Carlo Planning on P-Game Trees (B=2, D=20)
[Plot: mean error vs. iteration for alpha-beta (AB), MC, UCT, and BayesMC.]

MC Planning on P-Game Trees (B=7, D=7)
[Plot.]

Conclusions
Areas addressed:
–Move Prediction
–Territory Prediction
–Monte Carlo Go
Probabilities are good for modelling uncertainty in Go.
Go is a good test bed for machine learning:
–Wide range of sub-tasks.
–Complex game yet simple rules.
–Loads of training data.
–Humans play the game well.