Parallel Monte-Carlo Tree Search with Simulation Servers. Hideki Kato†‡ and Ikuo Takeuchi†. †The University of Tokyo, ‡Fixstars Corporation. November 7th, 2008.



Contents
– Computer Go
– Monte-Carlo Tree Search
– Parallel Monte-Carlo Tree Search
– Client-Server Approach
– Experiments and Discussion
– Conclusion and Future Work

Computer Go
The game of Go
– Task par excellence for AI (H. Berliner 1978)
– Most challenging; largest search space (19 x 19 and 9 x 9 boards; cf. Chess)
– Minimax tree search with a static evaluation function and domain knowledge has been used so far without major success
The Monte-Carlo Go revolution
– MoGo beat an 8-dan professional player on 9 x 9
– Crazy Stone beat a 4-dan professional player with 8 stones handicap

Monte-Carlo Tree Search (MCTS)
Repeat until time-up:
– Descend tree from root to leaf
– Add a node
– Simulate a game
– Update values of the moves
Then play the most visited move in the root.
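The loop above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the node expansion step is omitted for brevity, and all names (`Node`, `uct_select`, `mcts`) plus the UCT selection rule are assumptions standing in for the unspecified descent policy.

```python
import math

class Node:
    """One tree node; tracks visit and win counts for its move (illustrative)."""
    def __init__(self, move=None, parent=None):
        self.move, self.parent = move, parent
        self.children = []
        self.visits = 0
        self.wins = 0

def uct_select(node, c=1.4):
    # Descend: pick the child maximizing a UCT-style value (assumed policy).
    return max(node.children,
               key=lambda ch: ch.wins / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(root_moves, simulate, iterations):
    root = Node()
    root.children = [Node(m, root) for m in root_moves]
    for _ in range(iterations):          # "repeat until time-up" (fixed count here)
        node = root
        while node.children:             # descend tree from root to leaf
            node = uct_select(node)
        # a real MCTS would add a node here; omitted in this sketch
        result = simulate(node.move)     # simulate a game (1 = win, 0 = loss)
        while node is not None:          # update values of the moves
            node.visits += 1
            node.wins += result
            node = node.parent
    # play the most visited move in the root
    return max(root.children, key=lambda ch: ch.visits).move
```

With a simulator biased toward one move, the search converges on that move.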

Parallel MCTS (PMCTS)
Symmetrical multi-thread (SMT) PMCTS
– Commonly used, straightforward implementation
– MCTS threads (Thread 1 … Thread 4 in the diagram) share a search tree, protected by a lock
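The SMT scheme might look like the following sketch: several identical worker threads run the MCTS loop against one tree, serializing tree access with a lock while the simulations themselves run concurrently. All names are hypothetical, and a per-move statistics table stands in for the real tree.

```python
import threading

class SharedTree:
    """Stand-in for the shared search tree: per-move [visits, wins] statistics."""
    def __init__(self, moves):
        self.lock = threading.Lock()
        self.stats = {m: [0, 0] for m in moves}

    def select(self):
        with self.lock:                  # descent touches the shared tree: lock
            return min(self.stats, key=lambda m: self.stats[m][0])

    def update(self, move, result):
        with self.lock:                  # so does the value update
            self.stats[move][0] += 1
            self.stats[move][1] += result

def worker(tree, simulate, iterations):
    for _ in range(iterations):
        move = tree.select()             # tree access is serialized by the lock...
        result = simulate(move)          # ...but simulations run in parallel
        tree.update(move, result)

def smt_pmcts(moves, simulate, n_threads=4, iterations=100):
    tree = SharedTree(moves)
    threads = [threading.Thread(target=worker, args=(tree, simulate, iterations))
               for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return tree.stats
```

The selection rule here (least-visited move) is a placeholder; the point is the locking pattern, which is exactly the overhead the later slides identify as a scaling limit.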

Related Work
– S. Gelly et al. introduced SMT PMCTS for shared-memory SMP systems (2006)
– T. Cazenave et al. proposed and evaluated three PMCTS algorithms on an MPI cluster of 16 Intel Pentium 4 machines (2007)
– G. Chaslot et al. evaluated root, leaf and tree parallelization on a 2 x 8-core IBM Power5 (2008)
– S. Gelly et al. proposed SMT PMCTS for MPI clusters of shared-memory SMP nodes (2008)

Problems
Number of processors
– Shared-tree PMCTS can run only on shared-memory systems, currently up to 16 or 32 processors
– PMCTS algorithms for clusters of computers connected through networks are necessary
– Longer communication times decrease performance, as in other parallel applications
– Increasing the number of threads increases the overhead of the locks that protect the shared search tree

MoGo's Solution
Combine fine- and coarse-grain PMCTS
– For MPI clusters with shared-memory SMP nodes (S. Gelly et al. 2008)
– Runs SMT PMCTS on each node
– Periodically exchanges and merges values in the tree
– Excellent performance: MoGoTitan beat an 8-dan Korean professional Go player with 9 stones handicap (2008), running on the Huygens supercomputer at SARA in Amsterdam, the Netherlands; 25 out of 104 SMP nodes were used, each consisting of 16 dual-core Power6 processors at 4.7 GHz

MoGo's Solution (cont'd)
Disadvantages
– Expensive: high-speed network interfaces such as InfiniBand are very expensive (as are the clusters)
– Lack of flexibility: MPI does not allow computers to be added or removed on the fly, and it requires special setup; the cluster must be pre-configured
– Applicable to non-MPI clusters on moderate-speed networks? Nobody has tried yet

Client-Server Approach
Recent success of grid computing: one petaflop was achieved, with major cost benefits, by 41,145 Sony PlayStation 3 consoles all over the world (2007)
– A less expensive approach to massive parallelism
– Applicable to PMCTS?
Basic idea
– Separate the tree-search part from the simulation part
– Broadcast positions to be simulated using UDP/IP
– Don't wait for the end of slow simulations

Client-Server Approach (cont'd)
Client-server PMCTS
– A client searches the tree and sends a position; a server simulates a game from the position and sends back the result
– Runs on a cluster of loosely coupled computers
– Servers can run on small-memory computers even if the tree grows huge
– No special setup for servers; just a small application
– Longer communication times due to moderate-speed networks
– Performance? Does it scale well?

Client-Server PMCTS
Client (repeat until time-up):
– Descend tree from root to leaf
– Add a node
– Broadcast the position
– Receive a result (no wait)
– Update values of the moves
Then select the most visited move in the root.
Server 1, Server 2, … (repeat forever):
– Receive positions
– Simulate a game
– Send the result
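The client and server loops above can be sketched with UDP sockets. This is only an illustration of the pattern, not the actual system: the port number and message format are made up, the "simulation" is a coin flip, and positions are sent to a single loopback server rather than broadcast to many. The key detail shown is the non-blocking receive on the client, which drains whatever results have arrived without ever waiting for a slow simulation.

```python
import random
import socket
import threading
import time

SERVER_PORT = 40917  # arbitrary port chosen for this sketch

def simulation_server(port, stop):
    """Repeat forever: receive a position, simulate a game, send the result."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(0.1)                  # wake up periodically to check `stop`
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue
        result = b"1" if random.random() < 0.5 else b"0"  # stand-in simulation
        sock.sendto(data + b":" + result, addr)
    sock.close()

def client(n_positions):
    """Send positions; collect results without blocking on any simulation."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)               # "receive a result (no wait)"
    results = []
    for i in range(n_positions):
        sock.sendto(b"pos%d" % i, ("127.0.0.1", SERVER_PORT))
        try:
            while True:                   # drain any replies already queued
                results.append(sock.recvfrom(1024)[0])
        except BlockingIOError:
            pass                          # nothing pending: keep searching
    deadline = time.time() + 2.0          # final drain, for this sketch only
    while len(results) < n_positions and time.time() < deadline:
        try:
            results.append(sock.recvfrom(1024)[0])
        except BlockingIOError:
            time.sleep(0.01)
    sock.close()
    return results
```

In the real system the send would go to a broadcast address so every connected simulation server sees the position, which is what lets servers join or leave without any client-side reconfiguration.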

Experimental System
PC1 (1 client and 3 servers): CPU Q9550/3 GHz (400 x 7.5); M/B ASUS P5K-VM (G33); OS Ubuntu Linux 8.04; RAM PC3200 4 GiB; NIC Intel EXP9300PT (PCI-Ex x1)
PC2 (4 servers): CPU Q9550/3 GHz (400 x 7.5); M/B DFI LP JR P45-T2RS (P45); OS Ubuntu Linux 8.04; RAM PC3200 4 GiB; NIC Intel EXP9300PT (PCI-Ex x1); RTT 151±22 (1 kB)
PC3 (4 servers): CPU Q6600/3 GHz (333 x 9); M/B ASUS P5K-VM (G33); OS Ubuntu Linux 8.04; RAM PC3200 4 GiB; NIC Intel EXP9300PT (PCI-Ex x1); RTT 154±20 (1 kB)
PC4 (4 servers): CPU Q6600/3 GHz (333 x 9); M/B ASUS P5WDG2-WS Pro (975X); OS Ubuntu Linux 8.04; RAM PC3200 4 GiB; NIC Intel EXP9300GT (PCI); RTT 159±22 (1 kB)
Switch: Allied Telesis GS908XL

Experiments
– A tree searcher or a simulator exclusively uses one core
– On the client computer, one core runs the tree searcher and the other cores run simulator threads
– The simulators on the server computers run as individual processes
– All results are ELO ratings against GNU Go level 0
Settings: board sizes 9 x 9 and 13 x 13; games: 2,…; time per move: 0.005 to … s (9 x 9) and … to 6.4 s (13 x 13); simulation servers: 1 to 15

How to evaluate the results?
Simulations per second?
– Commonly used for shared-memory SMP systems, but not a good measure for clusters
– The benefits of individual simulations are not the same
Use equivalent-strength speed-up
– The ratio of the time-per-move settings that give the same strength at different numbers of simulators
– "Equivalent speed-up" for short
Number of simulators or cores
– The number of simulators is used to evaluate scalability, while the number of all cores is used to evaluate performance
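The equivalent speed-up defined above can be computed from two measured ELO-vs-time curves by interpolating, for each curve, the time-per-move that reaches a given target strength, then taking the ratio. The sketch below (function names and the log-time interpolation choice are assumptions, and the curves in the test are made-up numbers, not the paper's data) shows one way to do this.

```python
import math

def time_for_elo(curve, target_elo):
    """Interpolate, linearly in log(time), the time-per-move giving target_elo.
    `curve` is a list of (time_per_move, elo) pairs with elo rising with time."""
    pts = sorted(curve)
    for (t0, e0), (t1, e1) in zip(pts, pts[1:]):
        if e0 <= target_elo <= e1:
            f = (target_elo - e0) / (e1 - e0)
            return math.exp(math.log(t0) + f * (math.log(t1) - math.log(t0)))
    raise ValueError("target ELO outside measured range")

def equivalent_speedup(base_curve, parallel_curve, target_elo):
    """Ratio of time-per-move settings that give equal strength."""
    return time_for_elo(base_curve, target_elo) / time_for_elo(parallel_curve, target_elo)
```

For example, if a parallel configuration reaches the same ELO in one quarter of the time at every measured strength, its equivalent speed-up is 4.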

Equivalent Speed-up
[Chart: ELO rating vs. time per move (8 s down to 1/16 s, halving) for 4 cores and 16 cores, on 9 x 9 and 13 x 13.]

Performance (4 core vs. 16 core)
[Charts: results on 9 x 9 and 13 x 13.]

Scalability
[Chart: ELO rating vs. number of simulators, for 9 x 9 (0.08 s/move) and 13 x 13 (0.4 s/move).]

Conclusion and Future Work
Client-server parallel Monte-Carlo tree search
– Runs on a cluster of loosely coupled computers
– Small-memory computers such as game consoles can be used as simulation servers
– Allows servers to connect or disconnect on the fly
– Reduces communication by broadcasting
– No overhead to share a search tree
– Scales well on 13 x 13 with 15 simulators
Future work
– Multiple clients for single or multiple users