Understanding AlphaGo
Go Overview
- Originated in ancient China about 2,500 years ago
- A two-player game
- Goal: surround more territory than the opponent
- Played on a 19×19 grid with playing pieces called "stones"
- A turn = place a stone or pass
- The game ends when both players pass
Go Overview: only two basic rules
1. Capture rule: stones that have no liberties are captured and removed from the board
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position
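To make the capture rule concrete, here is a minimal sketch (plain Python, not AlphaGo code; the 0/1/2 board encoding and list-of-lists representation are assumptions for this example) that counts a group's liberties with a flood fill; a group whose liberty count reaches zero is captured:

```python
# Illustrative sketch of the capture rule (not AlphaGo code).
# Board encoding assumed for this example: 0 = empty, 1 = black, 2 = white.

def liberties(board, row, col):
    """Count the liberties of the group of stones containing (row, col)."""
    size = len(board)
    colour = board[row][col]
    assert colour != 0, "empty point has no group"
    seen, libs, stack = {(row, col)}, set(), [(row, col)]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == 0:
                    libs.add((nr, nc))        # empty neighbour = a liberty
                elif board[nr][nc] == colour and (nr, nc) not in seen:
                    seen.add((nr, nc))        # same-colour stone joins the group
                    stack.append((nr, nc))
    return len(libs)

# Capture rule: a group whose liberty count is zero is removed from the board.
```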
Go Overview: final position. Who won? White score: 12, Black score: 13.
Go in a Reinforcement Learning Set-Up
- Environment states: S = the set of board positions
- Actions: A = the legal moves (place a stone or pass)
- Transitions between states
- Reward function: r(s) = 0 if s is not a terminal state, +1/-1 (win/loss) otherwise
- Goal: find a policy that maximizes the expected total payoff
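As a minimal sketch of this reward signal (illustrative Python; state.is_terminal() and the score() helper are hypothetical interfaces, not from the slides):

```python
# Minimal sketch of the reward signal in this episodic framing (illustration only).
# `state.is_terminal()` and `score(state)` are hypothetical helpers.

def reward(state) -> float:
    """0 for every non-terminal state; +1 for a win, -1 for a loss at the end."""
    if not state.is_terminal():
        return 0.0
    return 1.0 if score(state) > 0 else -1.0
```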
Why is it hard for computers to play Go?
- The number of possible board configurations is extremely high (~10^700)
- Brute-force exhaustive search is impossible
- Chess: b ≈ 35, d ≈ 80; Go: b ≈ 250, d ≈ 150
- Main challenges: the branching factor and the value function
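To make the branching-factor comparison concrete, here is a small back-of-the-envelope computation (Python used for illustration) of the naive game-tree sizes b^d implied by the numbers above:

```python
# Back-of-the-envelope game-tree sizes b^d for the branching factors quoted above.
import math

for name, b, d in [("Chess", 35, 80), ("Go", 250, 150)]:
    log10_size = d * math.log10(b)            # log10 of b^d
    print(f"{name}: {b}^{d} is about 10^{log10_size:.0f}")

# Chess: ~10^124, Go: ~10^360 -- far beyond brute-force enumeration.
```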
https://googleblog.blogspot.co.il/2016/01/alphago-machine-learning-game-go.html
Training the Deep Neural Networks (pipeline diagram): human expert (state, action) pairs and self-play (state, win/loss) pairs are used to train the networks, which are then combined with Monte Carlo Tree Search.
Training the Deep Neural Networks: a policy network and a value network.
SL policy network
- Training data: ~30 million (state, action) pairs from human expert games
- Goal: maximize the log likelihood of the expert's action
- Input: 19×19×48 feature planes
- Architecture: 12 convolutional + rectifier layers, followed by a softmax
- Output: a probability map over moves
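A minimal sketch of such a policy network (illustrative PyTorch; the framework, layer widths, and kernel sizes are assumptions, and the paper's exact architecture differs in detail):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """12 convolutional + ReLU layers over 48 feature planes, softmax over 19x19 moves."""
    def __init__(self, planes: int = 48, width: int = 192):
        super().__init__()
        layers = [nn.Conv2d(planes, width, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(10):                                   # 10 more hidden conv layers
            layers += [nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, kernel_size=1)]        # final 1x1 conv, one output plane
        self.trunk = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.trunk(x).flatten(1)                     # (batch, 361)
        return torch.softmax(logits, dim=1)                   # probability map over moves

net = PolicyNet()
probs = net(torch.zeros(1, 48, 19, 19))                       # a dummy position
print(probs.shape)                                            # torch.Size([1, 361])
```

Training on the ~30 million expert (state, action) pairs would then minimize the cross-entropy against the expert's move, i.e. maximize the log likelihood of the chosen action.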
Bigger networks -> better accuracy, but slower. Move-prediction accuracy:
- AlphaGo (all input features): 57.0%
- AlphaGo (only raw board position): 55.7%
- Previous state of the art: 44.4%
Forward-pass time vs. accuracy:
- SL policy network (12 convolutional + rectifier layers, softmax probability map): ~3 milliseconds, 55.4% accuracy
- Fast rollout policy: ~2 microseconds, 24.2% accuracy
RL policy network
- Same architecture as the SL policy: 19×19×48 input, 12 convolutional + rectifier layers, softmax probability map
- Trained by stochastic gradient ascent (SGA) on the policy gradient, through self-play
- Plays against randomly selected previous iterations of itself to prevent overfitting
- The RL policy won more than 80% of its games against the SL policy
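A sketch of the REINFORCE-style policy-gradient step this stage performs (illustrative PyTorch, reusing the hypothetical PolicyNet sketched earlier; not the paper's training code):

```python
import torch

def reinforce_update(policy, optimizer, states, actions, outcome):
    """One policy-gradient step on a finished self-play game.

    states  : tensor of feature planes for the positions this agent saw
    actions : tensor of the move indices it chose at those positions
    outcome : +1.0 if this agent won the game, -1.0 if it lost
    """
    probs = policy(states)                                          # (T, 361)
    chosen = probs.gather(1, actions.unsqueeze(1)).squeeze(1)       # probability of each chosen move
    loss = -(outcome * torch.log(chosen)).sum()                     # ascend the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```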
Training the Deep Neural Networks (pipeline diagram, revisited): human expert (state, action) pairs, self-play (state, win/loss) pairs, Monte Carlo Tree Search.
Training the Deep Neural Networks (pipeline diagram): ~30 million human expert positions and ~30 million self-play position states feed the networks used by Monte Carlo Tree Search.
Value network: position evaluation
- Approximates the optimal value function
- Input: a state (19×19×48 feature planes); output: the probability of winning from that state
- Architecture: convolutional + rectifier layers, a fully connected layer, and a scalar output
- Goal: minimize the MSE between the prediction and the game outcome
- Overfitting risk: positions within the same game are strongly correlated, so training samples positions from distinct self-play games
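A matching sketch of the value network and its objective (again illustrative PyTorch with assumed layer sizes; the output here is in (-1, 1) and can be rescaled to a win probability as (v + 1) / 2):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Convolution + ReLU trunk, a fully connected layer, and a scalar output."""
    def __init__(self, planes: int = 48, width: int = 192, board: int = 19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, width, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board * board, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.trunk(x))        # predicted game outcome in (-1, 1)

# Training minimizes the mean squared error against the actual outcome z in {-1, +1}:
#   loss = ((ValueNet()(states).squeeze(1) - z) ** 2).mean()
```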
Training the Deep Neural Networks (pipeline diagram): human expert (state, action) pairs, self-play (state, won/loss) pairs; next, Monte Carlo Tree Search.
Monte Carlo Tree Search (MCTS)
- Monte Carlo experiments: repeated random sampling to obtain numerical results
- MCTS: a search method for making optimal decisions in artificial intelligence (AI) problems
- The strongest prior Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS
Monte Carlo Tree Search: each round consists of four steps
1. Selection
2. Expansion
3. Simulation
4. Backpropagation
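A schematic of one such round in Python (the Node and state interfaces are assumed placeholders, and uct_score is the selection formula introduced on the next slide):

```python
import random

def mcts_round(root):
    """One schematic round of generic MCTS; Node/state interfaces are assumed."""
    # 1. Selection: descend the tree, picking children by the tree policy (UCT below)
    node = root
    while node.children and node.is_fully_expanded():
        node = max(node.children, key=uct_score)
    # 2. Expansion: add one previously unexplored child
    if not node.state.is_terminal():
        node = node.expand_random_child()
    # 3. Simulation: play random moves until the game ends
    state = node.state
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    winner = state.winner()
    # 4. Backpropagation: update win/visit statistics back up to the root
    while node is not None:
        node.visits += 1
        if winner == node.player_to_move:       # exact bookkeeping varies by implementation
            node.wins += 1
        node = node.parent
```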
MCTS with Upper Confidence Bounds for Trees (UCT): the exploration-exploitation trade-off
Kocsis, L. & Szepesvári, C., "Bandit based Monte-Carlo planning" (2006); converges to the optimal solution.
Each child node i is scored as

    UCT(i) = W_i / n_i + C * sqrt(ln(t) / n_i)

where the first term is exploitation and the second is exploration, and
- W_i = number of wins after visiting node i
- n_i = number of times node i has been visited
- C = exploration parameter
- t = number of times node i's parent has been visited
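The same formula as a small Python function (this is the uct_score used in the sketch above; returning infinity for unvisited children is a common convention, not something the slide specifies):

```python
import math

def uct_score(node, c: float = 1.4):
    """UCT score of a child node: exploitation term plus exploration term."""
    if node.visits == 0:
        return math.inf                          # always try unvisited children first
    exploitation = node.wins / node.visits                                   # W_i / n_i
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)  # C * sqrt(ln t / n_i)
    return exploitation + exploration
```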
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
Each edge (s, a) stores:
- Q(s, a): action value (the average value of the subtree below it)
- N(s, a): visit count
- P(s, a): prior probability, supplied by the SL policy network
Why not use the RL policy? In practice the SL policy worked better here, presumably because human play is more diverse than the single best move the RL policy converges to.
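AlphaGo's selection step combines these edge statistics into a prior-weighted exploration bonus; a simplified sketch follows (the c_puct constant and the exact form of the bonus are taken loosely from the paper and should be read as illustrative):

```python
import math

def select_action(edges, c_puct: float = 5.0):
    """Pick the action maximizing Q(s,a) + u(s,a), where u is a prior-weighted bonus.

    edges: dict mapping each action to an object with fields Q, N, P as on the slide.
    """
    total_visits = sum(e.N for e in edges.values())
    def score(action):
        e = edges[action]
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)   # decays as the edge is visited
        return e.Q + u
    return max(edges, key=score)
```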
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation.
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
Leaf evaluation combines two signals:
1. The value network's estimate
2. The outcome of a random rollout played until a terminal position
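A sketch of how the two signals can be blended into a single leaf value (the mixing weight lambda follows the paper's convention of weighting both equally; value_net.evaluate and play_rollout are placeholder names):

```python
def leaf_value(leaf_state, value_net, rollout_policy, lam: float = 0.5):
    """Blend the value network's estimate with the outcome of a fast rollout."""
    v = value_net.evaluate(leaf_state)             # value-network estimate, in [-1, 1]
    z = play_rollout(leaf_state, rollout_policy)   # +/-1 result of a rollout to the end
    return (1 - lam) * v + lam * z
```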
AlphaGo MCTS: Selection, Expansion, Evaluation, Backpropagation
How is the next move chosen once the search finishes? The move with the maximum visit count is played; this is less sensitive to outliers than choosing the maximum action value.
AlphaGo
AlphaGo vs. Experts: 5:0 against Fan Hui, 4:1 against Lee Sedol.
Take Home
- A modular system combining reinforcement learning and deep learning
- Generic, general-purpose methods vs. hand-crafted rules
Critical differences between AlphaGo and Deep Blue
- AlphaGo uses general-purpose learning algorithms, not a set of hand-crafted rules
- It is a modular system combining planning and pattern recognition, closer to how humans think
Speaker notes: Lee Sedol (South Korea) is among the world's best players; AlphaGo won their five-game match 4:1. Fan Hui, the European champion, lost 5:0.