1
Evaluation Function in Game Playing Programs
M1 Yasubumi Nozawa, Chikayama & Taura Lab
2
Outline 1. Introduction 2. Parameter tuning 1. Supervised learning 2. Comparison training 3. Reinforcement learning 3. Conclusion
3
Introduction
4
Game Playing Programs
- A simple model of real-world problems: if one player wins, the other must lose (zero-sum).
- Very large search spaces: complete information cannot be obtained within a limited time.
5
Game Tree Search (minimax search)
- Root node: current position
- Node: position
- Branch: legal move
[Figure: minimax search tree with numeric evaluations at the leaves backed up to the root]
6
Requirements for an evaluation function
- Accuracy
- Efficiency
[Figure: leaf positions of the search tree scored by the static evaluation function]
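A minimal sketch of the minimax search described above, made concrete in Python (the helpers legal_moves, apply_move and static_evaluation are hypothetical placeholders, not from the slides): leaves are scored by the static evaluation function, and the scores are backed up to the root by alternating max and min.

```python
def minimax(position, depth, maximizing, legal_moves, apply_move, static_evaluation):
    """Depth-limited minimax: score leaves statically, back values up the tree."""
    moves = legal_moves(position)
    if depth == 0 or not moves:
        # This is where the static evaluation function's accuracy and
        # efficiency matter: it is called once per leaf of the search tree.
        return static_evaluation(position)
    children = (minimax(apply_move(position, m), depth - 1, not maximizing,
                        legal_moves, apply_move, static_evaluation)
                for m in moves)
    return max(children) if maximizing else min(children)
```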
7
Features and Weights
- Feature f: e.g. the number of pieces of each side
- Weight w: the weight of an important feature should be large
- Combination of features: linear, or non-linear (neural network, etc.)
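To make the linear case concrete, here is a minimal sketch (the feature functions and example position are made up for illustration): the evaluation is the weighted sum of feature values, and tuning the evaluation function means tuning the weight vector.

```python
def piece_count_difference(position):
    """Hypothetical feature: my piece count minus the opponent's."""
    return position["my_pieces"] - position["opponent_pieces"]

def mobility(position):
    """Hypothetical feature: number of legal moves available to me."""
    return position["my_legal_moves"]

FEATURES = [piece_count_difference, mobility]
WEIGHTS = [1.0, 0.1]   # the more important feature gets the larger weight

def evaluate(position):
    """Linear static evaluation: the weighted sum of all feature values."""
    return sum(w * f(position) for w, f in zip(WEIGHTS, FEATURES))

example = {"my_pieces": 10, "opponent_pieces": 8, "my_legal_moves": 12}
print(evaluate(example))   # 2 * 1.0 + 12 * 0.1 = 3.2
```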
8
Parameter tuning
9
Machine Learning in Games
- In simple games like Othello and backgammon, parameter tuning by machine learning has been successful.
- In complex games like Shogi, hand-crafted evaluation functions are still stronger; machine learning is used only in limited domains (e.g. only the values of material).
10
Outline 1. Introduction 2. Parameter tuning 1. Supervised learning 2. Comparison training 3. Reinforcement learning 3. Conclusion
11
Supervised Learning
- Training sample: (position, score)
- Minimize the error of the evaluation function on these positions.
12
Supervised Learning (1): Backgammon program [Tesauro 1989]
- Input: position and move (459 hand-crafted Boolean features); output: score of the move.
- Scores are given by human experts.
- Standard back-propagation.
- Result: far from human expert level.
[Figure: neural network with inputs In 1 ... In 459, weights w1 ... w5, and one output]
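As a simplified illustration of supervised parameter tuning (a linear model and made-up data stand in for the original neural network and expert-scored positions): the weights are adjusted by gradient descent to minimize the squared error between the evaluation and the expert score.

```python
def evaluate(weights, features):
    """Linear evaluation: dot product of weights and feature values."""
    return sum(w * f for w, f in zip(weights, features))

def train(samples, n_features, lr=0.01, epochs=200):
    """Minimize the squared error of the evaluation over the training positions."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, score in samples:
            error = evaluate(weights, features) - score
            # Gradient of (error)^2 with respect to w_i is 2 * error * f_i.
            weights = [w - lr * 2 * error * f for w, f in zip(weights, features)]
    return weights

# Toy training set: three Boolean features per position, scores from an "expert".
samples = [([1, 0, 1], 0.8), ([0, 1, 0], -0.5), ([1, 1, 1], 1.0)]
print(train(samples, n_features=3))
```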
13
Supervised Learning: Difficulties in obtaining training data from experts
- Creating a database consumes a great deal of experts' time.
- Human experts do not think in terms of absolute scores.
14
Supervised Learning (2): Bayesian learning [Lee et al. 1988]
- Each training position is labeled win or lose.
- Estimate the mean feature vector and the covariance matrix of each label from the training data.
[Figure: feature space (x1 ... x4) with class means μ_win and μ_lose and a test sample]
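A minimal sketch of this kind of Bayesian classification under simplifying assumptions (one Gaussian per label; the toy feature vectors are invented, and this is not the paper's exact procedure): each class is summarized by its mean vector and covariance matrix, and a test position is assigned to the class with the higher log-likelihood.

```python
import numpy as np

def fit_gaussian(feature_vectors):
    """Estimate the mean vector and covariance matrix from training samples."""
    X = np.asarray(feature_vectors, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_likelihood(x, mean, cov):
    """Log density of a multivariate normal, up to an additive constant."""
    diff = x - mean
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + np.log(np.linalg.det(cov)))

# Toy data: 2-dimensional feature vectors of won and lost positions.
win_positions  = [[2.0, 1.0], [2.5, 1.2], [1.8, 0.9], [2.2, 1.1]]
lose_positions = [[0.5, 0.2], [0.7, 0.1], [0.4, 0.3], [0.6, 0.2]]

mu_win,  cov_win  = fit_gaussian(win_positions)
mu_lose, cov_lose = fit_gaussian(lose_positions)

test = np.array([1.9, 1.0])
score = log_likelihood(test, mu_win, cov_win) - log_likelihood(test, mu_lose, cov_lose)
print("closer to win" if score > 0 else "closer to lose")
```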
15
Supervised Learning (3): LOGISTELLO [Buro 1998]
- Builds different classifiers for different stages of the game (Othello is a game of a fixed number of plies).
- Training proceeds from the last stage back toward the middle and first stages, since scores from the last stage are more reliable.
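The stage-splitting idea can be illustrated by a small sketch (this shows only the general mechanism of stage-dependent weights, not LOGISTELLO's actual classifiers; the stage boundaries are invented): positions are bucketed by the number of discs on the board, and each bucket has its own weight vector tuned on positions of that stage.

```python
N_STAGES = 4       # e.g. opening / early middle / late middle / endgame
N_FEATURES = 3     # hypothetical feature count

# One weight vector per stage, tuned separately on positions of that stage.
stage_weights = [[0.0] * N_FEATURES for _ in range(N_STAGES)]

def stage_of(position):
    """Map a position to a stage index by how many discs are on the board."""
    discs = position["disc_count"]                 # 4..64 in Othello
    return min(N_STAGES - 1, (discs - 4) * N_STAGES // 61)

def evaluate(position, features):
    """Use the weight vector belonging to the position's stage."""
    weights = stage_weights[stage_of(position)]
    return sum(w * f for w, f in zip(weights, features))
```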
16
Outline 1. Introduction 2. Parameter tuning 1. Supervised learning 2. Comparison training 3. Reinforcement learning 3. Conclusion
17
Comparison Training
- Training sample: (position_1, position_2, which is preferable)
- The evaluation function learns to satisfy the constraints given by these training samples.
- The expert's move is preferred above all other moves.
18
Comparison Training: Backgammon program [Tesauro 1989]
- The network compares two final positions (a) and (b) and outputs which is preferable.
- Tied weights (W1 = W2, W3 = -W4) enforce consistency and transitivity.
- Standard back-propagation.
- Simpler and stronger than the preceding supervised-learning versions.
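A minimal sketch of the pairwise idea behind comparison training (a shared linear evaluation and a logistic preference loss; this illustrates the general technique, not Tesauro's exact network): both positions of a training pair are scored by the same evaluation function, and the weights are pushed so that the expert-preferred position scores higher.

```python
import math

def evaluate(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def train(pairs, n_features, lr=0.1, epochs=200):
    """Each sample is (features of the preferred position, features of the other)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, other in pairs:
            margin = evaluate(weights, preferred) - evaluate(weights, other)
            # Logistic loss on the margin; its gradient pushes the margin up.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            weights = [w - lr * grad_scale * (p - o)
                       for w, p, o in zip(weights, preferred, other)]
    return weights

# Toy pairs: the first feature vector in each pair is the expert's choice.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
print(train(pairs, n_features=2))
```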
19
Comparison Training: Problems
- Is the assumption that the human expert's move is the best correct?
- A program trained on experts' games will imitate a human playing style, which makes it harder for the program to surprise a human being.
20
Outline 1. Introduction 2. Parameter tuning 1. Supervised learning 2. Comparison training 3. Reinforcement learning 3. Conclusion
21
Reinforcement Learning
- No training information from a domain expert.
- The program explores different actions and receives feedback (reward) from the environment: win or lose, and by what margin it won or lost.
[Figure: the program (learner) sends an action to the environment and receives a position and a reward]
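A minimal sketch of the interaction loop described above (the Environment methods and the choose_action / update helpers are hypothetical placeholders, not from any specific program): the learner acts, observes the resulting position, and receives the reward only when the game ends.

```python
def play_episode(env, weights, choose_action, update):
    """One self-play game: act, observe positions, learn from the final reward."""
    position = env.reset()
    history = [position]
    while not env.game_over():
        action = choose_action(weights, position)   # explore different actions
        position = env.step(action)                 # environment returns the next position
        history.append(position)
    reward = env.reward()                           # win/lose, possibly with the margin
    return update(weights, history, reward)
```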
22
TD(λ): Temporal Difference Learning
- w_t: weight vector at time t
- F: evaluation function (a function of the weight vector w and a position x)
- x_t: position at time t
- α: learning rate
- λ: influence of the current evaluation-function value on the weight updates of previous moves
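The update rule itself appears only as a figure in the original slide; written with the symbols defined above, the standard TD(λ) weight update is:

```latex
w_{t+1} = w_t + \alpha \, \bigl( F(x_{t+1}) - F(x_t) \bigr) \sum_{k=1}^{t} \lambda^{\,t-k} \, \nabla_{w} F(x_k)
```

so the temporal difference between two successive evaluations is credited to all earlier positions, with the credit decaying by a factor of λ per move.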
23
Temporal Difference Learning
[Figure: how λ spreads the update from F(x_{t+1}) over the earlier evaluations F(x_t), F(x_{t-1}), F(x_{t-2}), F(x_{t-3}); with λ = 0 only the most recent evaluation is affected, with λ = 1 all earlier evaluations are affected equally, and λ = 0.5 lies in between]
24
Temporal Difference Learning (1): TD-Gammon [Tesauro 1992-]
- Neural network (input: raw board information)
- TD(λ)
- Self-play (300,000 games)
- Reached human expert level.
25
Self-play in other games
- None of TD-Gammon's successors achieved a performance as impressive as TD-Gammon's.
- In the case of backgammon, the dice roll before each move ensured sufficient variety in the positions encountered.
- Deterministic games instead face the exploration-exploitation dilemma.
26
Temporal Difference Learning (2): KnightCap [Baxter et al. 1998]
- Learned by playing on an Internet chess server.
- 1468 features (linearly combined) × 4 game stages.
- TDLeaf(λ)
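A minimal sketch of the TDLeaf(λ) idea (the helpers principal_variation_leaf, evaluate and eval_gradient are hypothetical, and this is not KnightCap's actual code): instead of updating the evaluation of the root positions, the TD update is applied to the evaluation of the leaf of the principal variation returned by the search.

```python
def tdleaf_update(weights, positions, alpha, lam,
                  principal_variation_leaf, evaluate, eval_gradient):
    """positions: the root positions of one game, in move order."""
    leaves = [principal_variation_leaf(weights, p) for p in positions]
    values = [evaluate(weights, leaf) for leaf in leaves]
    grads  = [eval_gradient(weights, leaf) for leaf in leaves]
    delta_w = [0.0] * len(weights)
    for t in range(len(positions) - 1):
        d_t = values[t + 1] - values[t]             # temporal difference
        # Credit this difference to all earlier leaves, decayed by lambda.
        for k in range(t + 1):
            for i, g in enumerate(grads[k]):
                delta_w[i] += alpha * (lam ** (t - k)) * d_t * g
    return [w + dw for w, dw in zip(weights, delta_w)]
```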
27
KnightCap's rating
- All parameters except the material values were initially set to zero.
- After about 1000 games on the server, its rating exceeded 2150, an improvement from an average amateur to a strong expert.
28
Conclusion
29
Machine Learning in Games
- Successful in simple games.
- Used only in limited domains in complex games such as Shogi.
- Reinforcement learning has been successful in stochastic games such as backgammon.