Monte-Carlo Methods for Computation and Optimization, Spring 2015
N-Grams and the Last-Good-Reply Policy in MCTS
Presentation by Ayal Shwartz
Based on "N-Grams and the Last-Good-Reply Policy Applied in General Game Playing" (Mandy J. W. Tak, Mark H. M. Winands, Yngvi Björnsson), IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 2, June 2012.

Games
• A game is structured playing, usually undertaken for enjoyment, and sometimes used as an educational tool.
• Key components of games are goals, rules, challenge, and interaction. (Source: Wikipedia)
• In a game, we have players, each with their own reward function.
• Usually the only reward that interests us is the one at the end of the game – when no one can play anymore, and thus the rewards are set.
• Each player wants to maximize their own reward, as defined by the game.

Turn-Based Zero-Sum Games
• Many games are turn-based, meaning every player has his own turn to play and cannot play during someone else's turn.
• Many games of interest are zero-sum games, meaning the sum of the rewards is 0 (or some constant).
• This means that if one player has a positive reward, someone else must have a negative reward.
• In two-player games, this means that if your opponent wins, you lose!

Worst-Case scenario
• We assume that our opponents will always choose the move that guarantees them the maximal reward in the worst-case scenario.
• And we should do the same.
• In other words, we should choose the moves for which the minimal reward we can guarantee is maximal.

Minimax trees
• On our turn, we want to choose the move that gives us the highest worst-case reward.
• On his turn, our opponent wants to choose the move that gives him the highest worst-case reward.
• Since this is a zero-sum game, the maximal reward for us is the minimal reward for the opponent, and the minimal reward for us is the maximal reward for the opponent.

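A minimal sketch of the minimax rule for a two-player zero-sum game (the game-state interface used here is a hypothetical one, not taken from the paper):

```python
# Minimax: each player assumes the worst case and picks the move whose
# guaranteed reward is maximal.
# `state` is a hypothetical interface with is_terminal(), reward(),
# legal_moves() and apply(move); reward() is from the maximizing player's view.

def minimax(state, maximizing):
    if state.is_terminal():
        return state.reward()
    values = [minimax(state.apply(move), not maximizing) for move in state.legal_moves()]
    return max(values) if maximizing else min(values)
```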

• In almost any game of interest, there are too many possible states: we cannot consider all possible ways to continue the game from a given state, unless there aren't too many ways to finish the game from that state.
  • Not enough memory.
  • Not enough time to perform the necessary computations.
• We are limited in how deep we can develop the tree.

• In General Game Playing (GGP), the objective is to be able to play any game.
• Given a description of the game (in the Game Description Language), we would like to be able to play the game as soon as we have its description (with some time allowed for pre-processing). We won't have time to come up with a heuristic reward function.

Monte-Carlo Tree Search
• Rather than look at all the options, we can probabilistically sample moves and simulate a game from the chosen moves until it ends.
• Based on the results of the simulations, we decide which moves are best.
• This (potentially) saves us some time and memory.
• Hopefully, we will get a good estimate of the (true) victory probability.
  • Note that the victory probability also depends on how we select the next move we will play.
• This also allows us to interrupt the tree search at any time and come up with an action (though we cannot make the same guarantees as with minimax trees).

Monte-Carlo Tree Search (2)
• If we bias ourselves towards good moves in the simulation, we might get a different probability distribution over the states in which the game ends.
• We are more likely to win if we bias our choice of moves towards ones that give us a better victory probability
  • (assuming we also select the next move to play appropriately).

Selection and Expansion
• Selection:
  • Starting at the root (representing the current state), traverse the tree until you reach a leaf.
• Expansion:
  • If the leaf doesn't represent a state in which the game ends, select the leaf and create one (or more) children for the leaf.
  • Each child represents a state reachable from its parent using an action that can be performed from the parent state.
(Figure: partially expanded game tree with win/visit counts at each node.)

Simulation and Backpropagation
• Simulation:
  • From the newly created node, simulate a game. That is, sample moves for each player, create a child node representing the resulting state, and repeat from the new node until the end of the game is reached.
• Backpropagation:
  • Propagate the result of the simulation (victory/defeat, or the reward) back to the node created in the expansion phase, then update its parents.
(Figure: a simulation result being propagated up the tree, updating each node's win/visit counts.)
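Putting the four phases together, a minimal sketch of the MCTS loop (the Node bookkeeping, the state interface and the exploration constant are illustrative assumptions, not the paper's code; the selection rule anticipates the UCT formula on the next slides):

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.visits, self.reward_sum = [], 0, 0.0

def untried_moves(node):
    tried = {child.move for child in node.children}
    return [m for m in node.state.legal_moves() if m not in tried]

def uct_value(child, c):
    # Average reward plus an exploration bonus (the UCT rule on the next slides).
    return (child.reward_sum / child.visits
            + c * math.sqrt(math.log(child.parent.visits) / child.visits))

def simulate(state):
    # Simulation strategy: uniformly random play-out. Later slides replace this
    # with MAST, N-Grams or the Last-Good-Reply policy.
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.reward()

def mcts(root_state, iterations, c=0.4):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend while the node is fully expanded and not terminal.
        node = root
        while node.children and not untried_moves(node):
            node = max(node.children, key=lambda child: uct_value(child, c))
        # 2. Expansion: add a child for one untried move, if the game isn't over yet.
        moves = untried_moves(node)
        if moves:
            move = random.choice(moves)
            child = Node(node.state.apply(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: play the game out to the end from the new node.
        reward = simulate(node.state)
        # 4. Backpropagation: update statistics from the new node up to the root.
        # (In a two-player game the reward would be negated or re-interpreted at
        # alternating levels; omitted here for brevity.)
        while node is not None:
            node.visits += 1
            node.reward_sum += reward
            node = node.parent
    # Play the most visited child of the root (one common choice; see "Some fine details (4)").
    return max(root.children, key=lambda child: child.visits).move
```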

Upper confidence bound
• In the selection phase, we would like to balance two elements: exploration and exploitation.
• We should explore new moves.
  • Shorter paths to victory are preferable, as they contain less uncertainty.
  • Perhaps we will find a path with a higher reward.
• We should exploit promising moves.
  • We don't have enough time or memory to explore too many options.
  • There may not be any better options which we are likely to reach.

Upper confidence bound (2)
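The formula on this slide is missing from the transcript; the rule it refers to is the standard UCT selection rule, written here in the usual notation (not copied from the slide). At state $s$, select the action

```latex
a^{*} = \arg\max_{a \in A(s)} \left( Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right)
```

where $Q(s,a)$ is the average reward observed after taking $a$ in $s$, $N(s)$ is the number of times $s$ has been visited, $N(s,a)$ is the number of times $a$ was chosen there, and $C$ is a constant controlling how much exploration is encouraged.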

Upper confidence bound (3)
• The formula is a trade-off between exploration and exploitation.
• Every time an action is taken from a certain state, the chance of that action being taken again decreases, thus increasing the importance of its observed reward and decreasing the importance of exploring it again.
• At the same time, the chance of choosing any other action increases slightly, to promote exploration of the other actions.

Upper confidence bound (4)

Simulation strategies
• The simulation strategy is a key component of MCTS.
• A basic simulation strategy would be to select random moves.
• This, however, might not be such a good idea: we can generally assume that our opponent will favour moves which give him a good chance at victory.
• We would also like to choose good moves when we actually play the game.

Scoring
• If we have a "score" for each move, we can use that score when considering which move to select next.
• We will consider a simple scoring function: the average reward returned when the move (or series of moves) we're considering was used.
• For convenience, let's assume our rewards are 1 (victory) or 0 (defeat).
  • This can easily be generalized to games with different rewards and to non-zero-sum games.

Scoring (2)
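The formula on this slide is likewise missing from the transcript; the average-reward score described on the previous slide is simply (standard notation, not copied from the slide):

```latex
Q(a) = \frac{1}{N(a)} \sum_{i=1}^{N(a)} R_i(a)
```

where $N(a)$ is the number of play-outs in which $a$ was used and $R_i(a)$ is the reward of the $i$-th such play-out. With 0/1 rewards this is just the fraction of those play-outs that were won.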

Move-Average Sampling
• The Move-Average Sampling Technique (MAST) uses the average reward from the play-outs (simulations) in which a move was used as that move's score, in order to determine whether it is a good move or not.
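A sketch of the MAST bookkeeping, with simulation moves drawn from a Gibbs (softmax) measure over the move scores, the measure CadiaPlayer uses with MAST as noted on a later slide. The table layout, the optimistic default of 1.0 and the temperature value are illustrative assumptions:

```python
import math
import random
from collections import defaultdict

# Global move statistics for MAST: sum of rewards and play-out count per move.
mast_sum = defaultdict(float)
mast_count = defaultdict(int)

def mast_score(move):
    # Unseen moves get an optimistic default score so they still get tried.
    return mast_sum[move] / mast_count[move] if mast_count[move] else 1.0

def mast_update(moves_played, reward):
    # After a play-out, credit every move that appeared in it with the final reward.
    for move in moves_played:
        mast_sum[move] += reward
        mast_count[move] += 1

def mast_select(legal_moves, tau=1.0):
    # Gibbs (softmax) distribution over the scores: better-scoring moves are more likely.
    weights = [math.exp(mast_score(move) / tau) for move in legal_moves]
    return random.choices(legal_moves, weights=weights, k=1)[0]
```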

N-Grams

N-Grams (2)
• Reward lookup: the average reward obtained when a series of moves was used.
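A sketch of the reward lookup for sequences of 1, 2 and 3 consecutive moves, in the spirit of the N-Gram technique from the paper; the exact data structures and the visit threshold value are illustrative assumptions:

```python
from collections import defaultdict

# Average reward per N-gram: a tuple of 1 to 3 consecutive moves.
ngram_sum = defaultdict(float)
ngram_count = defaultdict(int)

def ngram_update(move_sequence, reward):
    # After a play-out, update every 1-, 2- and 3-gram of consecutive moves in it.
    for n in (1, 2, 3):
        for i in range(len(move_sequence) - n + 1):
            gram = tuple(move_sequence[i:i + n])
            ngram_sum[gram] += reward
            ngram_count[gram] += 1

def ngram_score(history, move, min_visits=7):
    # Score a candidate move by averaging the rewards of the N-grams ending in it.
    # Longer N-grams are only trusted once they have been seen often enough
    # (min_visits is an illustrative threshold).
    scores = []
    for n in (1, 2, 3):
        if len(history) < n - 1:
            continue
        gram = tuple(history[len(history) - (n - 1):]) + (move,)
        needed = 1 if n == 1 else min_visits
        if ngram_count[gram] >= needed:
            scores.append(ngram_sum[gram] / ngram_count[gram])
    return sum(scores) / len(scores) if scores else 1.0  # optimistic default for unseen moves
```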

Last-Good-Reply Policy
• During a game, we can consider moves to be "replies" to the previous move (or series of moves).
• If a reply to a series of moves was successful, we should try it again when we are able to.
• As with N-Grams, we can look back a few steps and try to take the context into account.
• If we lost one of our simulations, all of the "replies" which appeared in that simulation are removed from memory, since they are no longer considered good replies.
• If we won the simulation, all of the replies that appeared in it are stored, since they seem to be good replies.
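A sketch of the reply-table update described above (the paper keeps separate tables per player and also stores replies to the previous two moves; this simplified sketch keeps a single depth-1 table):

```python
# Last-Good-Reply table with forgetting: maps a previous move to the reply
# that last appeared in a won play-out.
last_good_reply = {}

def lgr_update(simulation_moves, we_won):
    # simulation_moves: the sequence of moves played in one play-out, in order.
    for previous, reply in zip(simulation_moves, simulation_moves[1:]):
        if we_won:
            last_good_reply[previous] = reply      # remember replies seen in a won play-out
        elif last_good_reply.get(previous) == reply:
            del last_good_reply[previous]          # forget replies that appeared in a lost one
```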

Some fine details
• Expansion when a node has unvisited moves: UCT cannot be applied when there are actions for which we have no score yet. In such a case, we use one of the simulation strategies to select an action from the set of unexplored actions that can be taken.
• Last-Good-Reply: if we cannot use a stored reply (if it isn't a legal move), we use a fall-back strategy – either N-Grams or MAST in this case.
• Basic MCTS algorithms might instead select simulation moves uniformly at random.
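A sketch of the resulting fall-back chain for picking a simulation move, reusing the hypothetical tables from the earlier sketches:

```python
def simulation_move(legal_moves, history):
    # Try the stored last good reply to the previous move first.
    if history:
        reply = last_good_reply.get(history[-1])
        if reply is not None and reply in legal_moves:
            return reply
    # Fall-back strategy: here, greedily pick the move with the best N-Gram score
    # (sampling from a Gibbs distribution over the scores, as in MAST, would also work).
    return max(legal_moves, key=lambda move: ngram_score(history, move))
```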

Some fine details (2)

Some fine details (3)
• Concurrent-move games: we can look at a turn-based game as a concurrent-move game with a special action called noop (no operation), which each player must play until a certain condition is met (signalling that it is his turn).
• Note that even in concurrent-move games, the N-Grams look back on previous turns, not on multiple moves in the same turn, though they are calculated from the perspective of each player.

Some fine details (4)
• And finally, how do we select the move when we actually need to play?
• Option 1: Select the move which leads to the highest reward we've seen so far.
  • This is the strategy employed by CadiaPlayer – a player that won multiple General Game Playing competitions using MCTS with MAST and the Gibbs measure.
• Option 2: Select the move which has the highest average reward.
• Both have their advantages and disadvantages.
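The two options, sketched against the Node fields from the earlier MCTS sketch; max_reward is a hypothetical extra field that would have to be tracked during backpropagation:

```python
def choose_final_move(root, option=2):
    if option == 1:
        # Highest single reward observed through the child (hypothetical max_reward field).
        return max(root.children, key=lambda child: child.max_reward).move
    # Highest average reward over the child's play-outs.
    return max(root.children, key=lambda child: child.reward_sum / child.visits).move
```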

Some fine details (5)

Compared Checkers Connect5 against MAST 74.4Score72.7Score N-Gram (49.9) (0.1) LGR, MAST (55.3) (0.1) LGR, N-Gram (43.5) (0.1) (85.7) (0.1)

Connect5 and Checkers
• It seems that in the case of Connect5, exploration is more important than exploitation, as opposed to Checkers.