Adversarial Reasoning: Sampling-Based Search with the UCT Algorithm
Joint work with Raghuram Ramanujan and Ashish Sabharwal



Upper Confidence bounds for Trees (UCT)

The UCT algorithm (Kocsis and Szepesvári, 2006), based on the UCB1 multi-armed bandit algorithm (Auer et al., 2002), has changed the landscape of game-playing programs in recent years:
- First program capable of master-level play in 9x9 Go (Gelly and Silver, 2007)
- A UCT-based agent is a two-time winner of the AAAI General Game Playing Contest (Finnsson and Björnsson, 2008)
- Also successful in Kriegspiel (Ciancarini and Favini, 2009) and real-time strategy games (Balla and Fern, 2009)
- Key to Google's AlphaGo success

Understanding UCT

Despite its impressive track record, our current understanding of UCT is mostly anecdotal. Our work focuses on gaining better insight into UCT by studying its behavior in search spaces where comparisons to Minimax search are feasible.

The multi-armed bandit problem: “find the best slot machine” while trying to make as much money as possible along the way.

The UCT Algorithm

Every node in the search tree is treated like a multi-armed bandit: “find the best slot machine” while trying to make as much money as possible. What strategy would you use, and what would you want to prove about it? The UCB score is computed for each child s' of the current node s, and the best-scoring child is chosen for expansion:

UCB(s') = Q(s') + c * sqrt(ln n(s) / n(s'))

The exploitation term Q(s') is the estimated utility of state s', i.e., the average payout of that arm so far; in the exploration term, n(s) is the number of visits to state s. Theorem: non-optimal arms are sampled at most O(log n) times!
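To make the selection rule concrete, here is a minimal sketch in Python; the Node class, its field names, and the default constant c are illustrative choices for this write-up, not from the talk.

```python
import math

class Node:
    """One node of the UCT search tree; every node acts as a multi-armed bandit."""
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move        # move that led from the parent to this node
        self.children = []
        self.visits = 0         # n(s): number of visits to this state
        self.value = 0.0        # Q(s): average playout reward seen so far

def ucb_score(parent, child, c=1.4):
    """UCB(s') = Q(s') + c * sqrt(ln n(s) / n(s'))."""
    if child.visits == 0:
        return float("inf")     # unvisited children are tried first
    return child.value + c * math.sqrt(math.log(parent.visits) / child.visits)

def select_child(node, c=1.4):
    # At our own nodes we maximize the upper confidence bound; at the
    # opponent's nodes the symmetric rule minimizes a lower bound.
    return max(node.children, key=lambda ch: ucb_score(node, ch, c))
```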

The UCT Algorithm (continued)

We descend the tree from the root, repeatedly selecting the best-scoring child; at the opponent's nodes, a symmetric lower confidence bound is minimized to pick the move. Upon reaching a node with an unexplored child, a new child is created, and a random playout from it estimates the utility R (+1, 0, or -1) of that state. In this way the game tree is grown incrementally, creating “lines of play” from the top while updating estimates of board values.
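A sketch of the descent-and-playout step, reusing the Node class above; the `game` interface (legal_moves, apply, is_terminal, outcome) is hypothetical glue invented for this sketch.

```python
import random

def random_playout(game, state):
    # Play uniformly random moves to a terminal state and return its
    # utility R (+1, 0, or -1) from the first player's perspective.
    while not game.is_terminal(state):
        state = game.apply(state, random.choice(game.legal_moves(state)))
    return game.outcome(state)

def tree_policy(game, node, c=1.4):
    # Descend by repeatedly taking the UCB-best child; at the first node
    # that still has an untried move, create that child and stop there.
    while not game.is_terminal(node.state):
        tried = {ch.move for ch in node.children}
        untried = [m for m in game.legal_moves(node.state) if m not in tried]
        if untried:
            move = untried[0]
            child = Node(game.apply(node.state, move), parent=node, move=move)
            node.children.append(child)
            return child
        node = select_child(node, c)
    return node
```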

The UCT Algorithm (continued)

An averaging backup updates the value estimates of all nodes on the path from the root to the new node: the visit count update is n(s) <- n(s) + 1, and the state utility update moves the running average toward the playout result, Q(s) <- Q(s) + (R - Q(s)) / n(s). On the next iteration, children are again selected by the UCB scheme.
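The backup and the overall loop, again as a sketch on top of the pieces above; the iteration count and the most-visited-child move rule are conventional choices, not stated in the talk.

```python
def backup(node, reward):
    # Averaging backup: walk from the new node up to the root, applying
    # n(s) <- n(s) + 1 and Q(s) <- Q(s) + (R - Q(s)) / n(s) at each node.
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent

def uct_search(game, root_state, iterations=10_000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = tree_policy(game, root)             # descend and expand
        reward = random_playout(game, leaf.state)  # estimate utility R
        backup(leaf, reward)                       # propagate the estimate
    # A common convention: play the most-visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```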

UCT versus Minimax

UCT tree: asymmetric; the best-performing method for Go.
Minimax tree: full-width up to some depth bound k; the best-performing method for, e.g., Chess.

Search Spaces

Minimax performs poorly in Go due to the large branching factor and the lack of a good board evaluation function. To understand UCT better, we need a domain with another competitive method, so we considered domains where Minimax search produces good play:
- Chess: a domain where Minimax yields world-class play but UCT performs poorly
- Mancala: a domain where both Minimax and UCT yield competent players ‘out of the box’
- Synthetic games (ongoing work)

Mancala

[Board diagram: each side owns a row of pits and a store.] A move consists of picking up all the stones from a pit and ‘sowing’ (distributing) them in a counter-clockwise fashion. Stones placed in the stores are captured and taken out of circulation. The game ends when one of the sides has no stones left; the player with more captured stones at the end wins. Both UCT and Minimax search produce good players.
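To illustrate the sowing rule, here is a minimal sketch for a Kalah-style board; the index layout (0-5 for South's pits, 6 for South's store, 7-12 for North's pits, 13 for North's store) is an encoding chosen for this sketch, and captures and end-of-game bookkeeping are omitted.

```python
def sow(board, pit, player):
    # Pick up all stones from `pit` and sow them counter-clockwise,
    # skipping the opponent's store.
    board = board[:]                      # boards are plain 14-entry lists
    opp_store = 13 if player == 0 else 6
    stones, board[pit] = board[pit], 0
    pos = pit
    while stones > 0:
        pos = (pos + 1) % 14
        if pos == opp_store:              # never sow into the opponent's store
            continue
        board[pos] += 1
        stones -= 1
    return board

# Example: South sows pit 2 of the opening position (4 stones per pit).
start = [4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 4, 0]
print(sow(start, 2, player=0))
# -> [4, 4, 0, 5, 5, 5, 1, 4, 4, 4, 4, 4, 4, 0]
```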

UCT on Mancala

We examine the three key components of UCT:
I. Exploration vs. exploitation
II. Random playouts
III. Information backup: averaging versus minimax
The resulting insights lead to an improved UCT player. We then deploy the winning UCT agent in a novel partial-game setting to understand how it wins.

I) Full-Width versus Selective Search

We vary the value of the exploration constant c and measure the win-rate of UCT against a standardized Minimax player. [Plot: win-rate as a function of c, with the optimal setting for c marked.]
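A sketch of how such a sweep might be run; `play_match` is a hypothetical callback (play one game of UCT with exploration constant c against the standardized Minimax player, returning 1 for a UCT win and 0 otherwise), and the grid of c values and game count are illustrative.

```python
def sweep_exploration_constant(play_match,
                               cs=(0.1, 0.25, 0.5, 1.0, 2.0, 4.0),
                               games=500):
    # Empirical win-rate of UCT for each candidate exploration constant c;
    # the optimal setting is the argmax over this dictionary.
    return {c: sum(play_match(c) for _ in range(games)) / games for c in cs}
```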

Lesson: UCT challenges the general notion that forward (i.e., unsafe) pruning is a bad idea in game-tree search. Not necessarily so: smart exploration/exploitation-driven search can work. Previous experience in game-tree search had suggested that “full width” plus selective extensions outperforms “trying to be smart in node expansion.”

II) Random Playouts (Samples)

During search-tree construction, UCT estimates the value of a newly added node using a “random playout” (resulting in +1, -1, or 0). The appeal: this is very general, since no board/state evaluation function needs to be constructed. But random games look very different from actual games. Is there any real information in the result of random play? Answer: yes, but it is a very “faint” signal; other aspects of UCT compensate. Also, how many playouts (“samples”) per node are needed? (A surprise.)

How many samples per leaf are optimal?

Given a fixed budget, should we examine fewer nodes more carefully, or more nodes with each evaluated less accurately (e.g., 4000 nodes in the tree with 5 playouts per node, versus 2000 nodes with 10 playouts per node)? It turns out to be better to quickly examine many nodes than to gather more random-playout samples per node: a single playout per leaf is optimal! Lesson: there is a weak but useful signal in random game playouts, and a single sample per leaf is enough to extract it.
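A sketch of the fixed-budget experiment behind this comparison; as above, `play_match` is a hypothetical callback (one game with a UCT tree of `nodes` nodes and `playouts` playouts per node, returning 1 for a UCT win), and the budget and grid are illustrative.

```python
BUDGET = 20_000   # total random playouts available per move decision

def sweep_playouts_per_node(play_match, options=(1, 2, 5, 10, 20), games=500):
    results = {}
    for playouts in options:
        nodes = BUDGET // playouts   # e.g., 4000 x 5 or 2000 x 10 playouts
        results[(nodes, playouts)] = (
            sum(play_match(nodes, playouts) for _ in range(games)) / games)
    return results
```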

III) Utility-Estimate Backup: Averaging versus Minimax

UCT uses an averaging strategy to back up reward values. What about using a minimax backup strategy instead? For random-playout information (i.e., a very noisy reward signal), averaging is best. This is confirmed on Mancala and with a synthetic game-tree model.

Aside: this is related to the “infamous” game-tree pathology, in which deeper minimax searches can lead to worse performance (Nau 1982; Pearl 1983). The phenomenon is tied to very noisy heuristics, such as random-playout information. Minimax uses heuristic estimates only from the fringe of the explored tree, ignoring any heuristic information from internal nodes. UCT's averaging over all nodes (including internal ones) can compensate for pathology and near-pathology effects.

Improving over Random-Playout Information: Heuristic Evaluation

Unlike Go, Mancala does have a good heuristic available.
- Replacing playouts with the heuristic gives UCT_H.
- The heuristic is much less noisy than a random playout, so we also consider a minimax backup; we call that variant UCTMAX_H.
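A sketch of the two variants on top of the earlier skeleton; `heuristic` is a hypothetical evaluation function returning a value in [-1, +1], and the assumption that each Node records whose turn it is (a `to_move` field, 0 for the maximizing player) is added here for illustration.

```python
def uct_h_leaf_value(heuristic, node):
    # UCT_H: evaluate the new leaf with the heuristic instead of a random
    # playout; the averaging backup stays unchanged.
    return heuristic(node.state)

def minimax_backup(node, value):
    # UCTMAX_H: the new leaf takes the heuristic value, and each ancestor
    # recomputes its value as the max over its children at the maximizing
    # player's nodes and the min at the opponent's nodes.
    # Assumes each Node carries a to_move field (see lead-in).
    node.visits += 1
    node.value = value
    node = node.parent
    while node is not None:
        node.visits += 1
        vals = [ch.value for ch in node.children]
        node.value = max(vals) if node.to_move == 0 else min(vals)
        node = node.parent
```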

We allow both algorithms to build same-sized trees and compare their win-rates. Lesson: when a state-evaluation heuristic is available, use minimax backup in conjunction with UCT.

Final Step: UCTMAX_H versus Minimax

We study the win-rate of UCTMAX_H against Minimax players searching to various depths; both agents use the same state heuristic. We also consider a hybrid strategy: in the region of the game with no search traps, play UCTMAX_H; in the region with search traps, switch to depth-16 Minimax (MM-16).

Mancala: UCTMAX_H vs. Minimax (with alpha-beta)

[Table: win-rate of UCTMAX_H, and of the hybrid, against Minimax at various look-ahead depths k, for two board sizes (6 pits with 4 stones per pit, and 8 pits with 6 stones per pit); the numeric entries did not survive in this transcript.] UCTMAX_H outperforms Minimax, and the UCTMAX_H hybrid outperforms it further. Note: both sides expand exactly the same number of nodes, so the difference is due entirely to the exploration/exploitation strategy of UCT.

UCT recap:
- First head-to-head comparison of UCT vs. Minimax, in the first domain where both are competitive.
- Multi-armed-bandit-inspired game-tree exploration is effective.
- Random playouts (fully domain-independent) contain weak but useful information.
- UCT's averaging backup is the right strategy for noisy playout rewards.
- When domain information is available, use UCTMAX_H: it outperforms Minimax given exactly the same information and number of node expansions.
But what about Chess?

Chess

What is it about the nature of the Chess search space that makes it so difficult for sampling-style algorithms such as UCT? We will look at the role of (shallow) search traps.

Search Traps

A state is “at risk” when one of its moves leads to a position from which the opponent has a guaranteed winning strategy within k plies; such a position is called a level-k search trap. [Diagram: a level-3 search trap for White, reached from a state at risk.]
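To make the notion precise, here is a sketch of a brute-force check, reusing the hypothetical `game` interface from the earlier sketches with assumed `to_move`, `opponent`, and `winner` methods; a real implementation would prune with alpha-beta rather than enumerate everything.

```python
def forced_win(game, state, player, plies):
    # Can `player` force a win from `state` within `plies` plies?
    if game.is_terminal(state):
        return game.winner(state) == player
    if plies == 0:
        return False
    successors = [game.apply(state, m) for m in game.legal_moves(state)]
    if game.to_move(state) == player:
        return any(forced_win(game, s, player, plies - 1) for s in successors)
    return all(forced_win(game, s, player, plies - 1) for s in successors)

def at_risk(game, state, k):
    # `state` is at risk of a level-k trap if some move by the player to
    # move hands the opponent a forced win within k plies.
    me = game.to_move(state)
    return any(forced_win(game, game.apply(state, m), game.opponent(me), k)
               for m in game.legal_moves(state))
```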

Search Traps in Chess

[Plots: frequency of search traps in random Chess positions and in positions from GM games.] Search traps in Chess are sprinkled throughout the search space.

Identifying Traps

Given the ubiquity of traps in Chess, how good is UCT at detecting them?
- Run UCT search on states at risk.
- Examine the values to which the utilities of the children of the root node converge.
- Compare these to the recommendations of a deep Minimax search (the ‘gold standard’).

Identifying Traps (continued)

[Table: for level-1, level-3, level-5, and level-7 traps, three quantities are compared: the utility of the move UCT thinks is best (UCT-best), the utility UCT assigns to the move deemed best by Minimax (Minimax-best), and the utility UCT assigns to the trap move, whose true utility is -1; the numeric entries did not survive in this transcript.] For very shallow traps, UCT has little trouble distinguishing between good and bad moves. With deeper traps, good and bad moves start looking the same to UCT, and it starts making fatal mistakes.

UCT conclusions:
- A promising, highly domain-independent alternative to minimax for adversarial search.
- Hybrid strategies are needed when search traps are present.
- We are currently exploring applications to generic reasoning using SAT and QBF, as well as to general optimization and constraint reasoning.
- The exploration/exploitation strategy has potential in any branching search setting!