Understanding AlphaGo
Go Overview
Originated in ancient China ~2,500 years ago
Two-player game
Goal: surround more territory than the opponent
Played on a 19x19 grid board with playing pieces called "stones"
A turn = place a stone or pass
The game ends when both players pass
Go Overview
Only two basic rules:
1. Capture rule: stones that have no liberties are captured and removed from the board (see the liberty-counting sketch below)
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position
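The capture rule amounts to a flood-fill over a connected group of stones, counting the empty points adjacent to the group. A minimal Python sketch, assuming an illustrative board encoding (a dict mapping (row, col) to "B"/"W"); none of this is AlphaGo code:

```python
# Sketch of the capture rule: a group's liberties are the empty points adjacent
# to any stone in the group; a group with zero liberties is captured.
# The board encoding (dict of (row, col) -> "B"/"W") is illustrative.
def group_and_liberties(board, start, size=19):
    color = board[start]
    group, liberties, frontier = {start}, set(), [start]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue
            if (nr, nc) not in board:
                liberties.add((nr, nc))          # empty neighbour = liberty
            elif board[(nr, nc)] == color and (nr, nc) not in group:
                group.add((nr, nc))              # same-coloured stone joins the group
                frontier.append((nr, nc))
    return group, liberties

# A lone black stone surrounded on all four sides has no liberties -> captured
board = {(1, 1): "B", (0, 1): "W", (2, 1): "W", (1, 0): "W", (1, 2): "W"}
group, libs = group_and_liberties(board, (1, 1))
print(len(libs) == 0)   # True: the black stone would be removed
```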
Go Overview
Final position - who won?
White score: 12, Black score: 13
Go in a Reinforcement Learning Set-Up
Environment states: S = board positions
Actions: A = legal moves (place a stone or pass)
Transitions between states: determined by the players' moves
Reinforcement function: r(s) = 0 if s is not a terminal state; at a terminal state, +1 for a win and -1 for a loss
Goal: find a policy that maximizes the expected total payoff
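As a concrete, purely illustrative reading of this reward structure, here is a minimal Python sketch; the GameState class is a hypothetical stand-in, not AlphaGo code:

```python
# Minimal sketch of the reward signal described above (illustrative only).
class GameState:
    def __init__(self, terminal=False, winner=None):
        self.terminal = terminal   # True once both players have passed
        self.winner = winner       # "black" or "white" when terminal

def reward(state: GameState, player: str) -> float:
    """r(s) = 0 for non-terminal states, +1 for a win, -1 for a loss."""
    if not state.terminal:
        return 0.0
    return 1.0 if state.winner == player else -1.0

print(reward(GameState(), "black"))                            # 0.0: game still running
print(reward(GameState(terminal=True, winner="black"), "black"))  # 1.0: win
```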
Why is it hard for computers to play Go?
The number of possible board configurations is extremely high (~10^700), so brute-force exhaustive search is impossible
Chess: branching factor b ≈ 35, depth d ≈ 80
Go: branching factor b ≈ 250, depth d ≈ 150
Main challenges: the branching factor and the value function
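A quick back-of-the-envelope comparison of the two game trees using the b^d estimate and the figures above, on a log10 scale (a sketch of the arithmetic only):

```python
import math

# Rough game-tree size estimate b**d, compared as exponents of 10
# (b = branching factor, d = typical game depth, figures from the slide).
chess = 80 * math.log10(35)    # ≈ 124
go = 150 * math.log10(250)     # ≈ 360
print(f"chess ~ 10^{chess:.0f}, go ~ 10^{go:.0f}")
```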
Training the Deep Neural Networks
Pipeline overview: human expert (state, action) pairs train a policy network; self-play (state, win/loss) pairs train a value network; both are then combined with Monte Carlo Tree Search
Training the Deep Neural Networks
Two networks are trained: a policy network and a value network
SL Policy Network
Trained on ~30 million (state, action) pairs from human expert games
Goal: maximize the log likelihood of the expert action
Input: 19x19x48 feature planes
Architecture: 12 convolutional + rectifier layers, followed by a softmax
Output: a probability map over moves
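A hedged sketch of such a policy network in PyTorch; the filter count and kernel sizes here are illustrative choices, not the exact AlphaGo configuration:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Sketch of a policy network: 48 input feature planes, 12 convolutional
    + rectifier layers, softmax probability map over the 361 board points."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(10):
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, 1)]       # 12th conv: one logit per point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                          # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)           # (batch, 361)
        return torch.softmax(logits, dim=1)        # probability map over moves

# Supervised training: minimize cross-entropy between the predicted map and the
# expert move, i.e. maximize the log likelihood of the action actually played.
```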
Bigger networks -> better accuracy, but slower
Move-prediction accuracy:
AlphaGo (all input features): 57.0%
AlphaGo (only raw board position): 55.7%
Previous state of the art: 44.4%
Accuracy vs. speed trade-off (forward-pass time / move-prediction accuracy):
SL policy network (12 convolutional + rectifier layers, softmax probability map): 3 milliseconds, 55.4%
Fast rollout policy: 2 microseconds, 24.2%
RL Policy Network
Same architecture as the SL policy network (19x19x48 input, 12 convolutional + rectifier layers, softmax probability map)
Trained with stochastic gradient ascent (policy gradient) on games of self-play
Plays against randomly selected previous iterations of itself, preventing overfitting to the current policy
Won more than 80% of games against the SL policy
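A minimal sketch of a REINFORCE-style policy-gradient step of the kind described above; the function and tensor names are illustrative, and it assumes a policy like the PolicyNet sketch earlier:

```python
import torch

def reinforce_update(policy, optimizer, states, actions, outcomes):
    """One REINFORCE-style step: increase the log-probability of the moves that
    were played, weighted by the final game outcome z in {+1, -1} (win/loss).
    states: (batch, 48, 19, 19); actions: (batch,) move indices; outcomes: (batch,).
    All names here are illustrative, not AlphaGo's implementation."""
    probs = policy(states)                                    # (batch, 361)
    logp = torch.log(probs.gather(1, actions.unsqueeze(1)))   # log pi(a|s)
    loss = -(logp.squeeze(1) * outcomes).mean()               # gradient ascent on z*logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```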
Training the Deep Neural Networks
~30M human expert positions train the policy networks; ~30M self-play positions, labelled with the game outcome (win/loss), train the value network
Position Evaluation (Value Network)
Approximates the optimal value function
Input: state (19x19x48 feature planes); output: a scalar - the probability of winning
Architecture: convolutional + rectifier layers, followed by a fully connected layer and a scalar output
Goal: minimize the MSE between the predicted value and the game outcome
Overfitting risk: positions within the same game are strongly correlated, so training samples positions from distinct games
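A hedged PyTorch sketch of a value network with this input/output contract; layer sizes are illustrative, not the exact AlphaGo architecture:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Sketch of a value network: 48 feature planes in, one scalar in (-1, 1)
    out, interpreted as the expected outcome (probability of winning)."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, 1, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),          # scalar value
        )

    def forward(self, x):                          # x: (batch, 48, 19, 19)
        return self.fc(self.conv(x).flatten(1)).squeeze(1)

# Training: minimize MSE between the predicted value and the actual outcome z;
# to limit overfitting, sample positions from distinct games.
```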
Training the Deep Neural Networks
At play time, the trained policy and value networks are combined by Monte Carlo Tree Search
Monte Carlo: experiments that use repeated random sampling to obtain numerical results
Tree search: a search method for making optimal decisions in artificial intelligence (AI) problems
The strongest prior Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS
Monte Carlo Tree Search
Each round of Monte Carlo tree search consists of four steps (a generic skeleton follows):
1. Selection
2. Expansion
3. Simulation
4. Backpropagation
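A generic MCTS skeleton showing how the four steps fit together; the Node interface (children, untried_moves, visits, wins, select_child, expand, rollout, parent) is hypothetical, not AlphaGo's implementation:

```python
import random

def mcts(root, n_iterations):
    """Generic MCTS loop: selection -> expansion -> simulation -> backpropagation.
    `root` is a hypothetical Node whose .parent is None."""
    for _ in range(n_iterations):
        node = root
        # 1. Selection: descend with a tree policy (e.g. UCT) while fully expanded
        while not node.untried_moves and node.children:
            node = node.select_child()
        # 2. Expansion: add one child for an untried move
        if node.untried_moves:
            node = node.expand(random.choice(node.untried_moves))
        # 3. Simulation: play a random rollout from the new node to a terminal state
        outcome = node.rollout()
        # 4. Backpropagation: update statistics on the path back to the root
        # (real implementations credit wins from the perspective of each node's player)
        while node is not None:
            node.visits += 1
            node.wins += outcome
            node = node.parent
    # Final decision: the most-visited child of the root
    return max(root.children, key=lambda c: c.visits)
```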
MCTS – Upper Confidence Bounds for Trees (UCT)
Balances the exploration-exploitation trade-off: select the child i that maximizes
w_i / n_i + C * sqrt(ln(t) / n_i)
where w_i = number of wins after visiting node i, n_i = number of times node i has been visited, t = number of times node i's parent has been visited, and C = exploration parameter; the first term is exploitation, the second is exploration
Kocsis, L. & Szepesvári, C., "Bandit based Monte-Carlo planning" (2006): converges to the optimal solution
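The UCT rule as a small, self-contained Python function; the constant C = 1.4 (≈ sqrt(2)) is a common illustrative choice, not a prescribed value:

```python
import math

def uct_score(wins_i, visits_i, parent_visits, c=1.4):
    """UCT (Kocsis & Szepesvári, 2006): exploitation term w_i/n_i plus
    exploration term c * sqrt(ln(t) / n_i)."""
    if visits_i == 0:
        return float("inf")          # always try unvisited children first
    exploitation = wins_i / visits_i
    exploration = c * math.sqrt(math.log(parent_visits) / visits_i)
    return exploitation + exploration

# Example: a child with 7 wins in 10 visits, whose parent was visited 50 times
print(uct_score(7, 10, 50))          # ≈ 1.58
```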
AlphaGo MCTS
Four steps per simulation: selection, expansion, evaluation, backpropagation
Each edge (s, a) of the search tree stores:
Q(s, a) – action value (average value of its subtree)
N(s, a) – visit count
P(s, a) – prior probability (from the SL policy network)
Why not use the RL policy? The SL policy worked better here, presumably because humans select a diverse beam of promising moves rather than the single best move
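A sketch of the in-tree selection rule built on these edge statistics: pick argmax_a Q(s, a) + u(s, a), where the bonus u is proportional to the prior P(s, a) and decays with the visit count N(s, a). The PUCT-style form, edge encoding, and c_puct value below are illustrative:

```python
import math

def select_action(edges, c_puct=5.0):
    """Sketch of AlphaGo-style in-tree selection: argmax_a Q(s,a) + u(s,a),
    with u(s,a) proportional to P(s,a) / (1 + N(s,a)).
    `edges` maps action -> (Q, N, P); c_puct is an illustrative constant."""
    total_visits = sum(N for _, N, _ in edges.values())

    def score(edge):
        Q, N, P = edge
        u = c_puct * P * math.sqrt(total_visits) / (1 + N)
        return Q + u

    return max(edges, key=lambda a: score(edges[a]))

# Example: "a" has the higher observed value, but "b" has a large prior and few
# visits, so its exploration bonus dominates early in the search
edges = {"a": (0.6, 20, 0.2), "b": (0.4, 5, 0.6)}
print(select_action(edges))   # "b"
```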
AlphaGo MCTS – Evaluation
Leaf evaluation combines two signals:
1. The value network's prediction
2. The outcome of a random rollout played until a terminal state
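The two signals are mixed with a weighting parameter lambda; the AlphaGo paper reports using lambda = 0.5. A tiny sketch of the mixing, with illustrative names:

```python
def leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    """Leaf evaluation V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L,
    mixing the value-network prediction v with the rollout outcome z."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome

print(leaf_value(0.3, 1.0))   # 0.65: a rollout win pulls the estimate up
```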
AlphaGo MCTS – Choosing the next move
After the search, play the move with the maximum visit count
Less sensitive to outliers than taking the maximum action value
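A one-liner sketch of this final decision rule, with an illustrative edge encoding (action -> (Q, N)):

```python
def choose_move(root_edges):
    """Pick the root action with the maximum visit count N: more robust to
    outliers than picking the maximum action value Q."""
    return max(root_edges, key=lambda a: root_edges[a][1])

print(choose_move({"a": (0.9, 3), "b": (0.6, 120)}))  # "b": well-explored beats lucky
```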
AlphaGo
AlphaGo vs. Experts
Fan Hui: 5:0    Lee Sedol: 4:1
Take Home
A modular system combining reinforcement learning and deep learning
Generic, general-purpose algorithms vs. handcrafted rules
Critical differences between AlphaGo & Deep Blue
AlphaGo used general-purpose algorithms, not a set of handcrafted rules
A modular system combining planning and pattern recognition - closer to how humans think
Lee Sedol – among the world's best players (South Korea); lost the 5-game match 4:1
Fan Hui – European champion; lost 5:0