AlphaGo and learning methods

Presentation transcript:

AlphaGo and learning methods (Machine learning)

Ranking in Miyagi prefecture, 2016. An animation hero resembling me in my youth.

Rules of Go
Go is a game between two players: Bob (black) and Willy (white).
Bob and Willy alternately place stones on a 19 x 19 grid board: Bob places black stones and Willy places white stones.
Stones cannot move once played, but stones completely surrounded by the opponent's stones are removed from the board.
The game terminates when there is no "profitable" move left, or by resignation.
The player with more "territory" wins.
Bob plays first, and gives 6.5 compensation points (komi) to Willy.
There is an additional rule named "Ko" (劫, meaning an infinitely long time) to avoid infinite loops.

Mathematical formulation
A finite, deterministic, two-person, zero-sum game with complete information.
Theorem: one of the players has a winning strategy (unlike Poker, Mahjong, etc.).
s: state of Go = placement of stones on the 19 x 19 board.
s is a 361-dimensional vector with entries in {1, -1, 0}: 1 means a black stone, -1 a white stone, 0 no stone (with special treatment for "Ko").
The number of states is bounded by 3^361.
a: next move, i.e., a grid point not occupied in the current state.
F(s): function indicating whether Bob wins (1, or b), Willy wins (-1, or w), or the game is a draw (0).
Draws are very rare (roughly once in 10,000 games), so let us ignore them.
A winning move a for Bob at s satisfies F(s ∪ {a}) = b.
If F(s) = b and it is Bob's turn, there exists a move a such that F(s ∪ {a}) = b.
If F(s) = b and it is Willy's turn, then F(s ∪ {a}) = b for every possible move a of Willy.
In practice, we consider a function f(s) indicating the "advantage" for Bob at s, e.g., the winning probability or the difference in territory.
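As a concrete illustration of this state encoding, here is a minimal Python sketch; the names empty_state and play are hypothetical, and captures and the Ko rule are left out.

```python
import numpy as np

BOARD_SIZE = 19

def empty_state():
    """Return the empty-board state s as a 361-dimensional vector of {1, -1, 0}."""
    return np.zeros(BOARD_SIZE * BOARD_SIZE, dtype=np.int8)

def play(s, a, color):
    """Return s with a stone of the given color (+1 black, -1 white) added at an
    unoccupied grid point a (index 0..360).  Captures and Ko are omitted here."""
    assert s[a] == 0, "a must be an unoccupied grid point"
    t = s.copy()
    t[a] = color
    return t

s = play(empty_state(), a=3 * BOARD_SIZE + 3, color=1)   # Bob plays at (3, 3)
```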

Game tree
A tree indicating how the play proceeds.
The depth of the tree is bounded by 361 (a bit of a cheat).
Each node has at most 361 child nodes.
The tree indicates the function F (or f).
A game = a path from the root to a leaf.
A node at odd depth is Bob's play; a node at even depth is Willy's play.
Each leaf indicates a Bob win (F = b) or a Willy win (F = w).
If an odd-depth node u has a child node v with F(v) = b, then F(u) = b; otherwise F(u) = w.
If an even-depth node u has a child node v with F(v) = w, then F(u) = w; otherwise F(u) = b.
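This propagation rule is exactly minimax. A runnable toy sketch on an explicit tree, with leaves marked 'b' or 'w', could look like this:

```python
def F(node, bob_to_move):
    """Winner at a game-tree node under optimal play ('b' = Bob, 'w' = Willy)."""
    if isinstance(node, str):                  # leaf: the winner is already decided
        return node
    values = [F(child, not bob_to_move) for child in node]
    if bob_to_move:
        return 'b' if 'b' in values else 'w'   # Bob picks a winning child if one exists
    return 'w' if 'w' in values else 'b'       # Willy picks a child that beats Bob if possible

# Tiny example: Bob to move at the root.  The left branch wins for Bob
# whatever Willy replies; the right branch loses.
tree = [['b', 'b'], ['w']]
print(F(tree, bob_to_move=True))               # prints 'b'
```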

A part of the game tree of Tic Tac Toe

Method outline of Go programs
Game-tree search:
Exhaustive search.
Depth-bounded exhaustive search with a position evaluation function: find the move leading to the "best leaf" in a subtree. This was the strategy of the Deep Blue chess machine, which beat Kasparov in 1997.
How to build the position evaluation: in chess, a linear combination of several basic features; in the Bonanza system for Shogi (Japanese chess), which won against professional players, machine learning was introduced to provide the position evaluation.
Monte Carlo rollout:
Play the game out many times and estimate the probability of winning. It looks silly, but it produced the first human-level Go machines. If the players used in the rollout are stronger, the probability estimate becomes more accurate.
Monte Carlo Tree Search (MCTS):
Give/improve the evaluation of positions by using Monte Carlo rollouts. As the players become stronger, the system is improved further by using stronger players in the rollouts.
Idea of AlphaGo: use deep neural networks to represent the evaluation functions.
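The Monte Carlo rollout idea can be sketched as follows; play_out_randomly is a hypothetical stand-in that only flips a coin, whereas a real Go program would play random legal moves to the end of the game and count territory.

```python
import random

def play_out_randomly(position):
    """Dummy playout: +1 if Black wins, -1 if White wins.  Only a stand-in
    so that the sketch runs; a real playout plays random legal moves to the end."""
    return 1 if random.random() < 0.5 else -1

def rollout_value(position, n_rollouts=1000):
    """Estimated winning probability for Black at the given position."""
    wins = sum(1 for _ in range(n_rollouts) if play_out_randomly(position) == 1)
    return wins / n_rollouts

print(rollout_value(position=None))            # about 0.5 with the dummy playout
```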

Evaluation functions
Chess: evaluation of pieces and position.
Pawn = 1, Knight = 3, Bishop = 3, Rook = 5, Queen = 9, King = ∞.
Position: how many possible moves? How well are the pieces connected?
Go: it is difficult to find a good function.
Objective: territory (which part of the board is your territory?).
Stones: alive or dead.
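A toy version of the chess-style material evaluation mentioned above (position features omitted):

```python
PIECE_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}   # King excluded (value = infinity)

def material(white_pieces, black_pieces):
    """Material balance from White's point of view, given piece letters."""
    return (sum(PIECE_VALUE[p] for p in white_pieces)
            - sum(PIECE_VALUE[p] for p in black_pieces))

print(material('QRRPPP', 'QRBNPP'))   # 22 - 22 = 0: material is even
```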

Function represented by a network
Linear function: y = Ax (y: m-dimensional vector, x: n-dimensional vector, A: m x n weight matrix).
How to find the best weights? Solve a linear system, i.e., regression analysis.
A multi-layer linear network amounts to repeated matrix multiplication.
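A small numpy sketch of fitting the weights A of y = Ax by least-squares regression on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, samples = 4, 3, 200
A_true = rng.normal(size=(m, n))           # the unknown m x n weight matrix

X = rng.normal(size=(samples, n))          # inputs x, one per row
Y = X @ A_true.T                           # targets y = A x, one per row

# Least-squares estimate of A: solve min ||X W - Y||^2 with W = A^T
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
print(np.allclose(A_hat, A_true))          # True: exact recovery from noiseless data
```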

Neural network
Represents nonlinear functions via (nonlinear) node functions.
Threshold (step) function: if the incoming value exceeds a threshold, the output is 1, otherwise 0.
Sigmoid function: a smooth version of the threshold.
The node functions are fixed; the edge weights are learned from data by back-propagation algorithms.
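A minimal sketch of the two node functions and a one-hidden-layer forward pass; the weight shapes are arbitrary and only for illustration.

```python
import numpy as np

def step(z, threshold=0.0):
    """Threshold (step) node function: 1 if the input exceeds the threshold, else 0."""
    return (z > threshold).astype(float)

def sigmoid(z):
    """Smooth node function; its differentiability is what makes back-propagation work."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """One-hidden-layer network: node functions are fixed, the weights W1, W2 are learned."""
    return sigmoid(W2 @ sigmoid(W1 @ x))

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W1, W2 = rng.normal(size=(8, 5)), rng.normal(size=(1, 8))
print(forward(x, W1, W2))                  # a value in (0, 1)
```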

Convolutional Neural Network (CNN)
Good for image recognition.
Input: a vector arranged in the shape of a rectangular array.
Nodes of the k-th layer: a set of templates, i.e., smaller arrays (feature vectors of textures, shapes, directions, etc.).
Compute the matching score between the input and the templates.
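The "template matching" view of one convolutional filter can be sketched directly; the conv2d function below is a naive illustration, not an optimized CNN layer.

```python
import numpy as np

def conv2d(board, template):
    """Valid 2D cross-correlation of the board with one template (one CNN filter)."""
    H, W = board.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(board[i:i + h, j:j + w] * template)
    return out

board = np.zeros((19, 19))
board[3, 3] = board[3, 4] = 1.0                  # two adjacent black stones
template = np.array([[1.0, 1.0]])                # template: "two stones in a row"
print(conv2d(board, template).max())             # 2.0 at the matching location
```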

Go as image processing
What are good moves? Easy cases for humans and machines:
Capturing the opponent's stones: local, featured patterns.
Ladders (local patterns with global consequences).
Guarding territory.
"Joseki": frequent patterns developed by humans over a 2,500-year history.

Global pattern recognition
Opening patterns: what is the value of move 7? A decent move, probably not the best move; no one knows!
A good evaluation function is hard to design, but a good player feels "this move is beautiful, so it should be good."
Global features and local features: similar to image processing, so a CNN looks like a good fit.

Human evaluation (what I do)
Imagine the expected variations, compare them with games you know or consider "even", and evaluate relatively.

Small differences matter a lot
Fighting! Life or death.
This is different from image processing: there, small bit flips do not matter, but in Go a small difference matters a lot; the ladder is a typical example.
Thus MCTS tree search is necessary, and human players do it too.

Basic design
Use a CNN as the learning mechanism for pattern matching on Go positions.
p(a, s): the probability of move a at position s.
v(s): the "value" of position s.
The parameters θ of the CNN can be modified, so f(s) = f_θ(s) = (p(a, s), v(s)) depends on θ.
How should the system be designed? (question to the students)
Supervised learning: AlphaGo, which utilizes 160,000 games from a database of strong human players.
Unsupervised learning: AlphaGo Zero.
Reinforcement learning: used to make a weak learning system stronger.
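One possible shape of such a network, sketched in PyTorch: a shared convolutional trunk with a policy head p(a, s) over the 361 points and a value head v(s). The class name and layer sizes are illustrative assumptions, not AlphaGo's actual architecture.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Illustrative dual-head CNN: f_theta(s) = (p(a, s), v(s))."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(                          # shared convolutional trunk
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(channels * 19 * 19, 361)   # move probabilities p(a, s)
        self.value_head = nn.Linear(channels * 19 * 19, 1)      # position value v(s)

    def forward(self, s):                                   # s: (batch, 1, 19, 19) board tensor
        h = self.body(s).flatten(1)
        p = torch.softmax(self.policy_head(h), dim=1)
        v = torch.tanh(self.value_head(h))
        return p, v

net = PolicyValueNet()
p, v = net(torch.zeros(1, 1, 19, 19))
print(p.shape, v.shape)                                     # (1, 361) and (1, 1)
```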

Shortage of data and two solutions
Obstacle: shortage of data.
The amount of human play data is tiny compared with the number of possible positions.
Data from good players is heavily biased (only nearly even positions appear).
Random data is not suitable either, given the huge number of possibilities.
So we are short of data for fully supervised learning, and we should generate "good data" by computer simulation.
Solution 1: semi-supervised learning with simulation data and data assimilation.
Solution 2: unsupervised learning.

AlphaGo: semi-supervised learning with simulation
Method: a kind of "data assimilation".
Build a system that proposes the next move (with probabilities), called the policy network.
Learn from 30,000,000 positions (170,000 games) of professional (or near-professional) human Go players to obtain an initial policy network.
Make "old-version clones" of the policy network and play against them. Training is divided into 10,000 stages; in each stage, the player we want to educate plays 128 games against randomly chosen opponents among its old versions. Looking at the outcome at each position, tune the parameters of the CNN.
Build a value network that outputs the winning probability of each position: run Monte Carlo tree search with strong policy networks to estimate the winning probability of many positions, and train a CNN from this (huge) data.
Final play: combine the policy network, the value network, and MCTS.
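A schematic sketch of how the staged self-play against frozen older versions could be organised; play_one_game and reinforce_update are dummy stand-ins, not AlphaGo's actual implementation, and only the loop structure follows the description above.

```python
import random

def play_one_game(current, opponent):
    """Dummy stand-in for one full game between the current policy and an older
    clone.  Returns (list of (state, move) pairs, outcome for the current player)."""
    trajectory = [("state", "move")]
    return trajectory, random.choice([+1, -1])

def reinforce_update(policy, trajectory, outcome):
    """Dummy stand-in for the gradient step: nudge the probabilities of the
    played moves up if the game was won and down if it was lost."""
    return policy

old_versions = ["policy_v0"]              # frozen earlier versions of the policy
current = "policy_v0"
for stage in range(10_000):               # the slide mentions 10,000 stages
    for game in range(128):               # 128 games per stage
        opponent = random.choice(old_versions)
        trajectory, outcome = play_one_game(current, opponent)
        current = reinforce_update(current, trajectory, outcome)
    old_versions.append(current)          # freeze a clone as a future opponent
```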

Solution 2. AlphaGo Zero: unsupervised learning
A single CNN combines the policy network and the value network.
Start from many CNNs with random parameters. Learning is divided into periods.
Choose an "evaluator" for each period.
The evaluator plays 25,000 self-play games in parallel, using 1,600 MCTS simulations at each move (spending only 0.4 seconds per move).
The other players learn from the data of the evaluator's play.
Each player then challenges the evaluator (250 games each), and the most winning one replaces the evaluator.
This is how the system improves.
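A schematic sketch of this evaluator-replacement loop; self_play, train_on, and wins_against are dummy stand-ins, with only the game counts and the loop structure taken from the description above.

```python
import random

def self_play(evaluator, n_games=25_000):
    """Dummy stand-in: the evaluator plays itself to generate training games.
    In the real system each move uses 1,600 MCTS simulations (about 0.4 s)."""
    return ["game_record"] * n_games

def train_on(player, data):
    """Dummy stand-in for updating a player's CNN from the evaluator's games."""
    return player

def wins_against(challenger, evaluator, n_games=250):
    """Dummy stand-in: number of games the challenger wins out of n_games."""
    return random.randint(0, n_games)

players = [f"net_{i}" for i in range(4)]      # CNNs initialised with random parameters
evaluator = players[0]
for period in range(100):                     # learning proceeds in periods
    data = self_play(evaluator)
    players = [train_on(p, data) for p in players]
    scores = {p: wins_against(p, evaluator) for p in players}
    evaluator = max(scores, key=scores.get)   # the most winning player becomes the evaluator
```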