Understanding AlphaGo
Go Overview
- Originated in ancient China about 2,500 years ago
- A two-player game
- Goal: surround more territory than the opponent
- Played on a 19×19 grid with playing pieces called "stones"
- A turn = place a stone or pass
- The game ends when both players pass
Go Overview: only two basic rules
1. Capture rule: stones that have no liberties are captured and removed from the board
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position
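To make the capture rule concrete, here is a minimal sketch (plain Python, not AlphaGo code; the 0/1/2 board encoding and list-of-lists representation are assumptions for this example) that counts a group's liberties with a flood fill; a group whose liberty count reaches zero is captured:

```python
# Illustrative sketch of the capture rule (not AlphaGo code).
# Board encoding assumed for this example: 0 = empty, 1 = black, 2 = white.

def liberties(board, row, col):
    """Count the liberties of the group of stones containing (row, col)."""
    size = len(board)
    colour = board[row][col]
    assert colour != 0, "empty point has no group"
    seen, libs, stack = {(row, col)}, set(), [(row, col)]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == 0:
                    libs.add((nr, nc))        # empty neighbour = a liberty
                elif board[nr][nc] == colour and (nr, nc) not in seen:
                    seen.add((nr, nc))        # same-colour stone joins the group
                    stack.append((nr, nc))
    return len(libs)

# Capture rule: a group whose liberty count is zero is removed from the board.
```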
Go Overview: final position. Who won? White score: 12, Black score: 13.
Go in a Reinforcement Learning Set-Up
- Environment states: S = the set of board positions
- Actions: A = the legal moves (place a stone or pass)
- Transitions between states
- Reward function: r(s) = 0 if s is not a terminal state, +1/-1 (win/loss) otherwise
- Goal: find a policy that maximizes the expected total payoff
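As a minimal sketch of this reward signal (illustrative Python; state.is_terminal() and the score() helper are hypothetical interfaces, not from the slides):

```python
# Minimal sketch of the reward signal in this episodic framing (illustration only).
# `state.is_terminal()` and `score(state)` are hypothetical helpers.

def reward(state) -> float:
    """0 for every non-terminal state; +1 for a win, -1 for a loss at the end."""
    if not state.is_terminal():
        return 0.0
    return 1.0 if score(state) > 0 else -1.0
```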
Why is it hard for computers to play Go?
- The number of possible board configurations is extremely high (~10^700)
- Brute-force exhaustive search is impossible
- Chess: b ≈ 35, d ≈ 80; Go: b ≈ 250, d ≈ 150
- Main challenges: the branching factor and the value function
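To make the branching-factor comparison concrete, here is a small back-of-the-envelope computation (Python used for illustration) of the naive game-tree sizes b^d implied by the numbers above:

```python
# Back-of-the-envelope game-tree sizes b^d for the branching factors quoted above.
import math

for name, b, d in [("Chess", 35, 80), ("Go", 250, 150)]:
    log10_size = d * math.log10(b)            # log10 of b^d
    print(f"{name}: {b}^{d} is about 10^{log10_size:.0f}")

# Chess: ~10^124, Go: ~10^360 -- far beyond brute-force enumeration.
```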
https://googleblog.blogspot.co.il/2016/01/alphago-machine-learning-game-go.html
Training the Deep Neural Networks (pipeline diagram): human expert (state, action) pairs and self-play (state, win/loss) pairs are used to train the networks, which are then combined with Monte Carlo Tree Search.
Training the Deep Neural Networks: a policy network and a value network.
SL policy network
- Training data: ~30 million (state, action) pairs from human expert games
- Goal: maximize the log likelihood of the expert's action
- Input: 19×19×48 feature planes
- Architecture: 12 convolutional + rectifier layers, followed by a softmax
- Output: a probability map over moves
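A minimal sketch of such a policy network (illustrative PyTorch; the framework, layer widths, and kernel sizes are assumptions, and the paper's exact architecture differs in detail):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """12 convolutional + ReLU layers over 48 feature planes, softmax over 19x19 moves."""
    def __init__(self, planes: int = 48, width: int = 192):
        super().__init__()
        layers = [nn.Conv2d(planes, width, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(10):                                   # 10 more hidden conv layers
            layers += [nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, kernel_size=1)]        # final 1x1 conv, one output plane
        self.trunk = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.trunk(x).flatten(1)                     # (batch, 361)
        return torch.softmax(logits, dim=1)                   # probability map over moves

net = PolicyNet()
probs = net(torch.zeros(1, 48, 19, 19))                       # a dummy position
print(probs.shape)                                            # torch.Size([1, 361])
```

Training on the ~30 million expert (state, action) pairs would then minimize the cross-entropy against the expert's move, i.e. maximize the log likelihood of the chosen action.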
Bigger networks -> better accuracy, but slower. Move-prediction accuracy:
- AlphaGo (all input features): 57.0%
- AlphaGo (only raw board position): 55.7%
- Previous state of the art: 44.4%
Forward-pass time vs. accuracy:
- SL policy network (12 convolutional + rectifier layers, softmax probability map): ~3 milliseconds, 55.4% accuracy
- Fast rollout policy: ~2 microseconds, 24.2% accuracy
RL policy network
- Same architecture as the SL policy: 19×19×48 input, 12 convolutional + rectifier layers, softmax probability map
- Trained by stochastic gradient ascent (SGA) on the policy gradient, through self-play
- Plays against randomly selected previous iterations of itself to prevent overfitting
- The RL policy won more than 80% of its games against the SL policy
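A sketch of the REINFORCE-style policy-gradient step this stage performs (illustrative PyTorch, reusing the hypothetical PolicyNet sketched earlier; not the paper's training code):

```python
import torch

def reinforce_update(policy, optimizer, states, actions, outcome):
    """One policy-gradient step on a finished self-play game.

    states  : tensor of feature planes for the positions this agent saw
    actions : tensor of the move indices it chose at those positions
    outcome : +1.0 if this agent won the game, -1.0 if it lost
    """
    probs = policy(states)                                          # (T, 361)
    chosen = probs.gather(1, actions.unsqueeze(1)).squeeze(1)       # probability of each chosen move
    loss = -(outcome * torch.log(chosen)).sum()                     # ascend the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```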
Training the Deep Neural Networks (pipeline diagram, revisited): human expert (state, action) pairs, self-play (state, win/loss) pairs, Monte Carlo Tree Search.
Training the Deep Neural Networks (pipeline diagram): ~30 million human expert positions and ~30 million self-play position states feed the networks used by Monte Carlo Tree Search.
Value network: position evaluation
- Approximates the optimal value function
- Input: a state (19×19×48 feature planes); output: the probability of winning from that state
- Architecture: convolutional + rectifier layers, a fully connected layer, and a scalar output
- Goal: minimize the MSE between the prediction and the game outcome
- Overfitting risk: positions within the same game are strongly correlated, so training samples positions from distinct self-play games
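A matching sketch of the value network and its objective (again illustrative PyTorch with assumed layer sizes; the output here is in (-1, 1) and can be rescaled to a win probability as (v + 1) / 2):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Convolution + ReLU trunk, a fully connected layer, and a scalar output."""
    def __init__(self, planes: int = 48, width: int = 192, board: int = 19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, width, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board * board, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.trunk(x))        # predicted game outcome in (-1, 1)

# Training minimizes the mean squared error against the actual outcome z in {-1, +1}:
#   loss = ((ValueNet()(states).squeeze(1) - z) ** 2).mean()
```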
Training the Deep Neural Networks (pipeline diagram): human expert (state, action) pairs, self-play (state, won/loss) pairs; next, Monte Carlo Tree Search.
Monte Carlo Tree Search (MCTS)
- Monte Carlo experiments: repeated random sampling to obtain numerical results
- MCTS: a search method for making optimal decisions in artificial intelligence (AI) problems
- The strongest prior Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS
Monte Carlo Tree Search: each round consists of four steps
1. Selection
2. Expansion
3. Simulation
4. Backpropagation
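A schematic of one such round in Python (the Node and state interfaces are assumed placeholders, and uct_score is the selection formula introduced on the next slide):

```python
import random

def mcts_round(root):
    """One schematic round of generic MCTS; Node/state interfaces are assumed."""
    # 1. Selection: descend the tree, picking children by the tree policy (UCT below)
    node = root
    while node.children and node.is_fully_expanded():
        node = max(node.children, key=uct_score)
    # 2. Expansion: add one previously unexplored child
    if not node.state.is_terminal():
        node = node.expand_random_child()
    # 3. Simulation: play random moves until the game ends
    state = node.state
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    winner = state.winner()
    # 4. Backpropagation: update win/visit statistics back up to the root
    while node is not None:
        node.visits += 1
        if winner == node.player_to_move:       # exact bookkeeping varies by implementation
            node.wins += 1
        node = node.parent
```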
MCTS with Upper Confidence Bounds for Trees (UCT): the exploration-exploitation trade-off
Kocsis, L. & Szepesvári, C., "Bandit based Monte-Carlo planning" (2006); converges to the optimal solution.
Each child node i is scored as

    UCT(i) = W_i / n_i + C * sqrt(ln(t) / n_i)

where the first term is exploitation and the second is exploration, and
- W_i = number of wins after visiting node i
- n_i = number of times node i has been visited
- C = exploration parameter
- t = number of times node i's parent has been visited
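The same formula as a small Python function (this is the uct_score used in the sketch above; returning infinity for unvisited children is a common convention, not something the slide specifies):

```python
import math

def uct_score(node, c: float = 1.4):
    """UCT score of a child node: exploitation term plus exploration term."""
    if node.visits == 0:
        return math.inf                          # always try unvisited children first
    exploitation = node.wins / node.visits                                   # W_i / n_i
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)  # C * sqrt(ln t / n_i)
    return exploitation + exploration
```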
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
Each edge (s, a) stores:
- Q(s, a): action value (the average value of the subtree below it)
- N(s, a): visit count
- P(s, a): prior probability, supplied by the SL policy network
Why not use the RL policy? In practice the SL policy worked better here, presumably because human play is more diverse than the single best move the RL policy converges to.
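AlphaGo's selection step combines these edge statistics into a prior-weighted exploration bonus; a simplified sketch follows (the c_puct constant and the exact form of the bonus are taken loosely from the paper and should be read as illustrative):

```python
import math

def select_action(edges, c_puct: float = 5.0):
    """Pick the action maximizing Q(s,a) + u(s,a), where u is a prior-weighted bonus.

    edges: dict mapping each action to an object with fields Q, N, P as on the slide.
    """
    total_visits = sum(e.N for e in edges.values())
    def score(action):
        e = edges[action]
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)   # decays as the edge is visited
        return e.Q + u
    return max(edges, key=score)
```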
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation.
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
Leaf evaluation combines two signals:
1. The value network's estimate
2. The outcome of a random rollout played until a terminal position
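A sketch of how the two signals can be blended into a single leaf value (the mixing weight lambda follows the paper's convention of weighting both equally; value_net.evaluate and play_rollout are placeholder names):

```python
def leaf_value(leaf_state, value_net, rollout_policy, lam: float = 0.5):
    """Blend the value network's estimate with the outcome of a fast rollout."""
    v = value_net.evaluate(leaf_state)             # value-network estimate, in [-1, 1]
    z = play_rollout(leaf_state, rollout_policy)   # +/-1 result of a rollout to the end
    return (1 - lam) * v + lam * z
```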
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
How is the next move chosen once the search finishes? The move with the maximum visit count is played; this is less sensitive to outliers than choosing the maximum action value.
AlphaGo
AlphaGo vs. Experts: 5:0 against Fan Hui, 4:1 against Lee Sedol.
Take Home
- A modular system combining reinforcement learning and deep learning
- Generic, general-purpose methods vs. hand-crafted rules
Critical differences between AlphaGo and Deep Blue
- AlphaGo uses general-purpose learning algorithms, not a set of hand-crafted rules
- It is a modular system combining planning and pattern recognition, closer to how humans think
Speaker notes: Lee Sedol (South Korea) is among the world's best players; AlphaGo won their five-game match 4:1. Fan Hui, the European champion, lost 5:0.