Applied Neuro-Dynamic Programming in the Game of Chess
James Gideon

Dynamic Programming (DP)
Family of algorithms applied to problems where decisions are made in stages and a reward or cost is received at each stage that is additive over time
Optimal control method
Example: Traveling Salesman Problem

Bellman’s Equation
Stochastic DP
Deterministic DP
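The equations themselves were lost in the transcript; a hedged reconstruction in the slide's J*(s) notation, assuming a reward-maximization setting with per-stage reward g, discount factor α, and successor function f, is roughly:

```latex
\text{Stochastic DP:}\quad
J^*(s) = \max_{a \in A(s)} \; \mathbb{E}_{s'}\!\big[\, g(s,a,s') + \alpha\, J^*(s') \,\big]
\qquad
\text{Deterministic DP:}\quad
J^*(s) = \max_{a \in A(s)} \big[\, g(s,a) + \alpha\, J^*\!\big(f(s,a)\big) \,\big]
```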

Key Aspects of DP
Problem must be structured into overlapping sub-problems
Storage and retrieval of intermediate results is necessary (tabular method)
State space must be manageable
Objective is to calculate numerically the state value function, J*(s), and optimize the right hand side of Bellman’s equation so that the optimal decision can be made for any given state

Neuro-Dynamic Programming (NDP)
Family of algorithms applied to DP-like problems with either a very large state space or an unknown environmental model
Sub-optimal control method
Example: Backgammon (TD-Gammon)

Key Aspects of NDP
Rather than calculating the optimal state value function, J*(s), the objective is to calculate the approximate state value function J~(s,w)
Neural networks are used to represent J~(s,w)
Reinforcement learning is used to improve the decision-making policy
Can be an on-line or off-line learning approach
The Q-Factors of the state-action value function, Q*(s,a), could be calculated or approximated (Q~(s,a,w)) instead of J~(s,w)

The Game of Chess
Played on an 8x8 board with 6 types of pieces per side (8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen and 1 king), each with its own rules of movement
The two sides (black and white) alternate turns
Goal is to capture the opposing side’s king
Initial Position (diagram)

The Game of Chess
Very complex, with an astronomically large number of states and possible games
Has clearly defined rules and is easy to simulate, making it an ideal problem for exploring and testing the ideas in NDP
Despite recent successes in computer chess, there is still much room for improvement, particularly in learning methodologies

The Problem
Given any legal initial position, choose the move leading to the largest long-term reward

Bellman’s Equation
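The equation on this slide also did not survive the transcript; for the chess problem, with reward only at terminal positions and f(s,a) the position after making move a, the recursion is presumably the minimax form of Bellman’s equation (a hedged reconstruction):

```latex
J^*(s) =
\begin{cases}
  g(s) & s \text{ terminal (win/draw/loss reward)}\\[4pt]
  \max_{a \in A(s)} J^*\!\big(f(s,a)\big) & \text{our side to move}\\[4pt]
  \min_{a \in A(s)} J^*\!\big(f(s,a)\big) & \text{opponent to move}
\end{cases}
```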

A Theoretical Solution
Solved with a direct implementation of the DP algorithm (a simple recursive implementation of Bellman’s Equation, e.g. the Minimax algorithm with last-stage reward evaluation)
Results in an optimal solution, J*(s)
Computationally intractable (would take an astronomical amount of memory and centuries of calculation)

A Practical Solution
Solved with a limited look-ahead version of the Minimax algorithm with approximated last-stage reward evaluation
Results in a sub-optimal solution, J~(s,w)
Useful because an arbitrary amount of time or look-ahead can be allocated to the computation of the solution

The Minimax Algorithm
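The slide's pseudocode/diagram did not come through in the transcript; below is a minimal Python sketch of depth-limited minimax. The helpers legal_moves, make_move, is_terminal, and evaluate are hypothetical stand-ins for the engine's own move generator and evaluation function J~(s,w), and are reused in the sketches that follow.

```python
# Hypothetical helpers assumed throughout these sketches:
#   legal_moves(state)  -> iterable of moves
#   make_move(state, m) -> successor state
#   is_terminal(state)  -> True if the game is over
#   evaluate(state)     -> static score J~(s, w), from White's point of view

def minimax(state, depth, maximizing):
    """Depth-limited minimax: back up the value of `state`."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)  # last-stage (approximate) reward
    if maximizing:
        return max(minimax(make_move(state, m), depth - 1, False)
                   for m in legal_moves(state))
    return min(minimax(make_move(state, m), depth - 1, True)
               for m in legal_moves(state))
```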

Alpha-Beta Minimax By adding lower (alpha) and upper (beta) bounds on the possible range of scores a branch can return, based on scores from previously analyzed branches, complete branches can be removed from the look-ahead without being expanded
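A minimal sketch of the alpha-beta version, under the same assumptions as the minimax sketch above (legal_moves, make_move, is_terminal, and evaluate are hypothetical helpers):

```python
def alphabeta(state, depth, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning.

    alpha: best score the maximizer can already guarantee on this path
    beta:  best score the minimizer can already guarantee on this path
    A branch whose value cannot fall inside (alpha, beta) is cut off.
    """
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for move in legal_moves(state):
            value = max(value, alphabeta(make_move(state, move), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the minimizer will avoid this branch
        return value
    else:
        value = float("inf")
        for move in legal_moves(state):
            value = min(value, alphabeta(make_move(state, move), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break  # alpha cutoff: the maximizer will avoid this branch
        return value
```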

Alpha-Beta Minimax with Move Ordering
Works best when moves at each node are tried in a reasonably good order
Use iterative deepening look-ahead, as sketched below
–Rather than analyzing a position at an arbitrary Minimax depth of n, analyze iteratively and incrementally at depth 1, 2, 3, …, n
–Then try the best move from the previous iteration first in the next iteration
–Counter-intuitive, but very good in practice!
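A sketch of the iterative-deepening driver described above; search_root is a hypothetical wrapper around the alpha-beta search that searches each root move to the given depth and returns the moves ranked best-first:

```python
def iterative_deepening(state, max_depth):
    """Search depth 1, 2, ..., max_depth, reusing each iteration's result
    to order moves (best move first) in the next, deeper iteration."""
    ordered_moves = list(legal_moves(state))
    best_move = None
    for depth in range(1, max_depth + 1):
        # Hypothetical helper: alpha-beta search of each root move to `depth`,
        # returning the moves sorted best-first and the best move found.
        ordered_moves, best_move = search_root(state, ordered_moves, depth)
    return best_move
```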

Alpha-Beta Minimax with Move Ordering
MVV/LVA – Most Valuable Victim, Least Valuable Attacker
–First sort all capture moves by the value of the captured piece (victim) and the value of the capturing piece (attacker), then try them in that order
Next try Killer Moves
–Moves that have caused an alpha or beta cutoff at the current depth in a previous iteration of iterative deepening
History Moves (History Heuristic)
–Finally, try the rest of the moves ordered by their historical results over the entire course of the iterative deepening Minimax algorithm, based on “Q-Factors” (sort of)
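A hedged sketch of this ordering scheme; is_capture, victim, attacker, and piece_value are hypothetical helpers, while killer_moves and history_scores are tables the search is assumed to maintain:

```python
def order_moves(state, moves, depth, killer_moves, history_scores):
    """Sort moves: captures by MVV/LVA, then killer moves, then history scores."""
    def key(move):
        if is_capture(state, move):
            # MVV/LVA: most valuable victim first, least valuable attacker second.
            return (0, -piece_value(victim(state, move)), piece_value(attacker(state, move)))
        if move in killer_moves.get(depth, ()):
            return (1, 0, 0)                          # killers right after captures
        return (2, -history_scores.get(move, 0), 0)   # remaining moves by history score
    return sorted(moves, key=key)
```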

Hash Tables
Minimax alone is not a DP algorithm because it does not reuse previously computed results
The Minimax algorithm frequently re-expands and recalculates the values of chess positions
Zobrist hashing is an efficient method of storing scores of previously analyzed positions in a table for reuse
Combined with hash tables, Minimax becomes a DP algorithm!
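A minimal, self-contained sketch of Zobrist hashing, assuming a board represented as a dict mapping square index 0–63 to a (piece_type, color) pair; real engines also hash castling rights and en passant squares:

```python
import random

random.seed(0)  # fixed seed so the keys are reproducible across runs

PIECE_TYPES = ["P", "N", "B", "R", "Q", "K"]
COLORS = ["white", "black"]

# One random 64-bit key per (square, piece_type, color) combination.
ZOBRIST = {
    (sq, pt, c): random.getrandbits(64)
    for sq in range(64)
    for pt in PIECE_TYPES
    for c in COLORS
}
SIDE_TO_MOVE_KEY = random.getrandbits(64)

def zobrist_hash(board, white_to_move):
    """board: dict {square: (piece_type, color)}; XOR the keys of all pieces."""
    h = 0
    for sq, (pt, c) in board.items():
        h ^= ZOBRIST[(sq, pt, c)]
    if white_to_move:
        h ^= SIDE_TO_MOVE_KEY
    return h

# Transposition table: hash -> (depth, score); the search consults it before expanding.
transposition_table = {}
```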

Minimal Window Alpha-Beta Minimax
NegaScout/PVS – Principal Variation Search
–Expands the decision tree with infinite alpha-beta bounds for the first move at each depth of recursion; subsequent expansions are performed with (alpha, alpha+1) bounds
–Works best when moves are ordered well in an iterative deepening framework
MTD(f) – Memory Enhanced Test Driver
–Very sophisticated; can be thought of as a “binary search” into the decision tree space by repeatedly probing the state-space with an alpha-beta window of width 1 and adjusting the bounds accordingly
–DP algorithm by design; requires a hash table
–Works best with a good first guess f and well-ordered moves
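A hedged sketch of the PVS/NegaScout idea in negamax form, using the same hypothetical helpers as the earlier sketches; moves are assumed to arrive ordered best-first, and evaluate() is assumed to score from White's point of view:

```python
def pvs(state, depth, alpha, beta, color):
    """Principal Variation Search (negamax form).

    The first move (presumed best thanks to move ordering) is searched with
    the full (alpha, beta) window; later moves get a null window
    (alpha, alpha+1) and are re-searched only if they unexpectedly beat alpha.
    `color` is +1 when White is to move, -1 otherwise.
    """
    if depth == 0 or is_terminal(state):
        return color * evaluate(state)
    first = True
    for move in legal_moves(state):  # assumed ordered best-first (see above)
        child = make_move(state, move)
        if first:
            score = -pvs(child, depth - 1, -beta, -alpha, -color)
            first = False
        else:
            score = -pvs(child, depth - 1, -alpha - 1, -alpha, -color)  # null-window probe
            if alpha < score < beta:
                score = -pvs(child, depth - 1, -beta, -score, -color)   # re-search, full window
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff
    return alpha
```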

Other Minimax Enhancements
Quiescence Search
–At leaf positions, run the Minimax search to conclusion while only generating capture moves at each position
–Avoids an n-ply look-ahead terminating in the middle of a capture sequence and misevaluating the leaf position
–Results in increased accuracy of the position evaluation, J~(s,w)
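A hedged sketch of quiescence search with the usual stand-pat bound; capture_moves is a hypothetical helper returning only capturing moves, and evaluate() is assumed here to score from the side to move's point of view (negamax convention):

```python
def quiescence(state, alpha, beta):
    """Extend the search at leaf nodes along capture sequences only, so the
    static evaluation J~(s, w) is applied to a 'quiet' position."""
    stand_pat = evaluate(state)          # score if the side to move stops capturing
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in capture_moves(state):    # hypothetical: capturing moves only
        score = -quiescence(make_move(state, move), -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```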

Other Minimax Enhancements
Null-Move Forward Pruning
–At certain positions in the decision tree, let the current player “pass” the move to the other player and run the Minimax algorithm with a reduced look-ahead
–If the score returned is still greater than the upper bound, assume that if the current player had actually moved, the resulting Minimax score would also exceed the upper bound, so take the beta cutoff immediately
–Results in an excellent reduction of the number of nodes expanded in the decision tree
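A hedged sketch of the null-move test as it might sit at the top of the search routine; make_null_move, in_check, and alphabeta_negamax (a negamax-form variant of the alpha-beta sketch above) are hypothetical, as is the depth reduction R. Real engines also skip the test when in check or in simplified endgames to avoid zugzwang errors.

```python
NULL_MOVE_REDUCTION = 2  # assumed depth reduction R

def null_move_cutoff(state, depth, beta):
    """Return True if letting the current side 'pass' and searching at reduced
    depth still scores at least beta, justifying an immediate beta cutoff."""
    if depth <= NULL_MOVE_REDUCTION + 1 or in_check(state):
        return False  # too shallow, or in check (a null move would be illegal)
    passed = make_null_move(state)  # same position, other side to move
    score = -alphabeta_negamax(passed, depth - 1 - NULL_MOVE_REDUCTION, -beta, -beta + 1)
    return score >= beta
```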

Other Minimax Enhancements
Selective Extensions
–At “interesting” positions in the decision tree, extend the look-ahead by additional stages
Futility Pruning
–Based on the alpha-beta values at leaf nodes, it can sometimes be reasonably assumed that running the quiescence look-ahead would still return a result lower than alpha, so take an alpha cutoff immediately
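A tiny hedged sketch of the futility test; the margin value is an assumption for illustration:

```python
FUTILITY_MARGIN = 125  # assumed safety margin, in centipawns

def futile(state, depth, alpha):
    """At frontier nodes (depth == 1), assume a quiet move cannot raise the
    static score by more than the margin; if even that fails to reach alpha,
    take the alpha cutoff without running the quiescence look-ahead."""
    return depth == 1 and evaluate(state) + FUTILITY_MARGIN <= alpha
```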

Evaluating a Position
The approximate state (position) value function, J~(s,w), can be approximated with a “smoother” feature value function J~(f(s),w), where f(s) is the function that maps states into feature vectors
Process is called feature extraction
Could also calculate the approximate state-feature value function J~(s,f(s),w)
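A hedged sketch of what such a feature map f(s) might look like; the specific features (material, mobility, king safety) and the helpers material_count, mobility, and king_safety are illustrative assumptions, not the author's actual feature set:

```python
import numpy as np

def extract_features(state):
    """Map a position s to a feature vector f(s) for the approximator J~(f(s), w)."""
    return np.array([
        material_count(state, "white") - material_count(state, "black"),
        mobility(state, "white") - mobility(state, "black"),
        king_safety(state, "white"),
        king_safety(state, "black"),
    ], dtype=np.float32)
```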

Evaluating a Position
Most chess systems use only approximate DP when implementing the decision-making policy; that is, the weight vector w of J~(·,w) is predefined and constant
In a true NDP implementation the weight vector w is adjusted through reinforcements to improve the decision-making policy

Evaluating a Position

General Positional Evaluation Architecture
White Approximator
–Fully connected MLP neural network
–Inputs of state and feature vectors specific to white
–One output indicating favorability (+/-) of white positional structure
Black Approximator
–Fully connected MLP neural network
–Inputs of state and feature vectors specific to black
–One output indicating favorability (+/-) of black positional structure
Final output is the difference between both network outputs
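A minimal, self-contained numpy sketch of this two-network architecture; the layer sizes, tanh activation, feature-vector length, and random initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class MLP:
    """Small fully connected network: inputs -> hidden (tanh) -> 1 output."""
    def __init__(self, n_inputs, n_hidden):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.b2 = np.zeros(1)

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return float(np.tanh(self.W2 @ h + self.b2)[0])

N_FEATURES = 64           # assumed size of the per-side state/feature vector
white_net = MLP(N_FEATURES, 32)
black_net = MLP(N_FEATURES, 32)

def positional_score(white_features, black_features):
    """Final output: favorability of white minus favorability of black."""
    return white_net.forward(white_features) - black_net.forward(black_features)
```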

Material Balance Evaluation Architecture
Two simple linear tabular evaluators, one for white and one for black

Pawn Structure Evaluation Architecture
White Approximator
–Fully connected MLP neural network
–Inputs of state and feature vectors specific to white
–One output indicating favorability (+/-) of white pawn structure
Black Approximator
–Fully connected MLP neural network
–Inputs of state and feature vectors specific to black
–One output indicating favorability (+/-) of black pawn structure
Final output is the difference between both network outputs

The Learning Algorithm
Reinforcement learning method
Temporal difference learning
–Use the difference between two time-successive approximations of the position value to adjust the weights of the neural networks
–The value of the final position is a value suitably representative of the outcome of the game

The Learning Algorithm
TD(λ)
–Algorithm that applies the temporal difference error correction to decisions arbitrarily far back in time, discounted by a factor of λ at each stage
–λ must be in the interval [0,1]
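In the slide's notation, the TD(λ) update being described can be written as follows (a hedged reconstruction, with α the learning rate and δ_t the temporal difference between successive value estimates):

```latex
\delta_t = \tilde{J}(s_{t+1}, w) - \tilde{J}(s_t, w),
\qquad
\Delta w = \alpha \sum_{t} \delta_t \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_w \tilde{J}(s_k, w)
```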

The Learning Algorithm
Presentation of training samples is provided by the TDLeaf(λ) algorithm (uses look-ahead evaluation for training targets)
Weights for all networks are adjusted according to the Backpropagation algorithm
Neuron j local field and neuron j output (equations shown on the slide)
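The two neuron equations did not come through; in standard MLP notation (a hedged reconstruction, with y_i the inputs to neuron j, w_ji the weights, b_j the bias, and φ the activation function) they are presumably:

```latex
v_j = \sum_i w_{ji}\, y_i + b_j \quad \text{(local field)},
\qquad
y_j = \varphi(v_j) \quad \text{(output)}
```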

Self Play Training vs. On-Line Play Training
In self-play simulation the system will play itself to train the position evaluator neural networks
–The move selection policy should randomly select non-greedy actions a small percentage of the time so that there is a non-zero probability of exploring all actions (e.g. the Epsilon-Greedy algorithm)
–System can be fully trained before deployment
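A minimal sketch of epsilon-greedy move selection as described above; the value of epsilon, the score_move helper, and legal_moves are assumptions:

```python
import random

EPSILON = 0.05  # assumed exploration rate: 5% of moves chosen at random

def select_move(state, score_move):
    """Pick the greedy (highest-scoring) move most of the time, and a uniformly
    random legal move with probability EPSILON, so every action keeps a
    non-zero probability of being explored during self-play."""
    moves = list(legal_moves(state))   # hypothetical helper
    if random.random() < EPSILON:
        return random.choice(moves)
    return max(moves, key=lambda m: score_move(state, m))
```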

Self Play Training vs. On-Line Play Training
In on-line play the system will play other opponents to train the position evaluator neural networks
–Requires no randomization of the decision-making policy since opponents will provide sufficient exploration of the state-space
–System will be untrained initially at deployment