PARALLEL MODEL OF EVOLUTIONARY GAME DYNAMICS Amanda Peters MIT 18.337 5/13/2009.


Outline
- Motivation
- Model
- GPU Implementation
- Blue Gene Implementation
- Hardware
- Results
- Future Work

Motivation
Why does cooperation evolve?
Examples:
- Total War vs. Limited War
- Quorum-sensing bacteria
- Pathogens
Goal of the project: create a computational model to test the role of behavioral strategies and related variables

Model
Focus on finding evolutionarily stable strategies
Five strategies:
- Mouse
- Hawk
- Bully
- Retaliator
- Prober-Retaliator
Payoffs:
- Win: +60
- Seriously injured: -100
- Small injuries: -2 each
- Emerge from a short game uninjured: +20
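For concreteness, the strategy set and payoff values above could be written down as shared constants used by both the CUDA and MPI versions. The following is a minimal sketch with hypothetical identifier names, not the project's actual code.

    /* Hypothetical encoding of the five strategies and the payoff values
       listed above; names and layout are illustrative only. */
    enum Strategy { MOUSE, HAWK, BULLY, RETALIATOR, PROBER_RETALIATOR };

    static const float PAYOFF_WIN            =   60.0f;  /* win the contest                    */
    static const float PAYOFF_SERIOUS_INJURY = -100.0f;  /* seriously injured                  */
    static const float PAYOFF_SMALL_INJURY   =   -2.0f;  /* per small injury                   */
    static const float PAYOFF_SHORT_GAME     =   20.0f;  /* emerge uninjured from a short game */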

Why parallelize it?
- Reduce computation time
- Enable trials of more strategies
- Enable analysis of the roles of different variables
- Introduce more actions to the action space

CUDA Implementation
- Embarrassingly parallel code
- Distribute rounds of the game to different threads
- Only the payoff array is kept in global memory
- Copy it back to the host for post-processing

Sample Code

__global__ void gameGPU(int player1, int player2, float* d_payoff1,
                        float* d_payoff2, float* rand_si, int max_rounds)
{
    // Thread index
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Total number of threads in grid
    const int THREAD_N = blockDim.x * gridDim.x;
    int max_moves = 500;

    for (int round = tid; round < max_rounds; round += THREAD_N)
    {
        play_round(player1, player2, d_payoff1[round], d_payoff2[round],
                   rand_si[round], max_moves);
    }
}
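The slides do not show the host-side driver for this kernel. A minimal sketch of how it might be allocated, launched, and copied back follows; the run_game name, the launch configuration, and the random-number setup are assumptions for illustration, not the original code.

    #include <stdlib.h>   // malloc/free (compiled with nvcc, which pulls in the CUDA runtime)

    // Hypothetical host-side driver for gameGPU (not from the original slides).
    // Filling rand_si with per-round random numbers is omitted here.
    void run_game(int player1, int player2, int max_rounds)
    {
        size_t bytes = max_rounds * sizeof(float);
        float *d_payoff1, *d_payoff2, *d_rand;
        cudaMalloc(&d_payoff1, bytes);
        cudaMalloc(&d_payoff2, bytes);
        cudaMalloc(&d_rand, bytes);

        // Illustrative launch configuration: enough threads to cover all rounds.
        int threadsPerBlock = 256;
        int blocks = (max_rounds + threadsPerBlock - 1) / threadsPerBlock;
        gameGPU<<<blocks, threadsPerBlock>>>(player1, player2,
                                             d_payoff1, d_payoff2,
                                             d_rand, max_rounds);

        // Copy the payoff arrays back to the host for post-processing.
        float *h_payoff1 = (float*)malloc(bytes);
        float *h_payoff2 = (float*)malloc(bytes);
        cudaMemcpy(h_payoff1, d_payoff1, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_payoff2, d_payoff2, bytes, cudaMemcpyDeviceToHost);

        // ... post-processing of the payoffs happens on the host ...

        free(h_payoff1); free(h_payoff2);
        cudaFree(d_payoff1); cudaFree(d_payoff2); cudaFree(d_rand);
    }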

Blue Gene Implementation

System Overview

Design Fundamentals
- Low-power PPC440 processing core
- System-on-a-chip ASIC technology
- Dense packaging
- Ducted, air-cooled, 25 kW racks
- Standard, proven components for reliability and cost

Blue Gene/L packaging hierarchy (for the original 64-rack system):
- Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
- Compute card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
- Node card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2), 90/180 GF/s, 16 GB
- Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
- System: 64 racks, 180/360 TF/s, 32 TB

Blue Gene/P packaging hierarchy:
- Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 (or 4.0) GB DDR
- Node card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2), 435 GF/s, 64 GB
- Rack: 32 node cards, 14 TF/s, 2 TB
- System: 72 racks, cabled 8x8x16, 1 PF/s, 144 TB
Key differences from Blue Gene/L: 4 cores per chip, higher clock speed, 72 racks (+8)

BG System Overview: Integrated System
- Lightweight kernel on compute nodes
- Linux on I/O nodes handling syscalls
- Optimized MPI library for high-speed messaging
- Control system on the service node with a private control network
- Compilers and job launch on front-end nodes

Blue Gene/L Interconnection Networks

3-Dimensional Torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- Communications backbone for computations
- 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth

Global Collective Network
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link; tree traversal latency 2.5 µs
- ~23 TB/s total binary-tree bandwidth (64k machine)
- Interconnects all compute and I/O nodes (1024)

Low-Latency Global Barrier and Interrupt
- Round-trip latency 1.3 µs

Control Network
- Boot, monitoring, and diagnostics

Ethernet
- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external communication (file I/O, control, user interaction, etc.)

C/MPI Implementation of the Code
- Static partitioning of work units: work_unit = number_rounds / partition_size
- Each node gets a chunk of the data
- Loops that, in serial, iterate over the length of the game are split up so each node handles specific rounds
- 'Bookkeeping node' (rank 0)
- MPI collectives coalesce the data

Pseudo Code

foreach species:
    gamePlay(var1...);
    MPI_Reduce(var1...);
    if (rank == 0) Calculate_averages();
    if (rank == 0) Print_game_results;
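The pseudo code above can be expanded into a minimal C/MPI sketch. The gamePlay stub, the species count, the round count, and the payoff bookkeeping are illustrative assumptions; only the MPI calls correspond directly to the slide.

    /* Hypothetical expansion of the pseudo code above; everything except the
       MPI calls is illustrative, not the project's actual code. */
    #include <mpi.h>
    #include <stdio.h>

    #define NUM_SPECIES 5   /* Mouse, Hawk, Bully, Retaliator, Prober-Retaliator */

    /* Stub standing in for the actual game simulation (not shown in the slides). */
    static double gamePlay(int species, long first_round, long rounds)
    {
        (void)species; (void)first_round;
        return (double)rounds;   /* placeholder payoff */
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long number_rounds = 1000000;           /* illustrative total            */
        long work_unit = number_rounds / size;  /* static partitioning of rounds */

        for (int species = 0; species < NUM_SPECIES; species++) {
            /* Each rank plays its own chunk of rounds and returns a local sum. */
            double local_payoff = gamePlay(species, rank * work_unit, work_unit);

            /* The 'bookkeeping node' (rank 0) coalesces the results. */
            double total_payoff = 0.0;
            MPI_Reduce(&local_payoff, &total_payoff, 1, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("species %d: average payoff %f\n",
                       species, total_payoff / number_rounds);
        }

        MPI_Finalize();
        return 0;
    }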

Results

Game Dynamics
Evolutionarily stable strategies: Retaliator, ~Prober-Retaliator
Result: 'Limited War' is a stable and dominant strategy given individual selection

CUDA Implementation 97% time reduction

CUDA Implementation

Blue Gene Implementation 99% time reduction

Blue Gene Implementation

Future Directions
- Investigate more behavioral strategies
- Increase the action space
- CUDA implementation: data management
- Blue Gene implementation:
  - Examine superlinearity
  - Test larger problem sizes
  - Optimize single-node performance