Using Charm++ to Improve Extreme Parallel Discrete-Event Simulation (XPDES) Performance and Capability
Chris Carothers, Elsa Gonsiorowski, & Justin LaPre, Center for Computational Innovations/RPI
Nikhil Jain, Laxmikant Kale & Eric Mikida, Charm++ Group/UIUC
Peter Barnes & David Jefferson, LLNL/CASC

Outline
- The Big Push…
- Blue Gene/Q
- ROSS Implementation
- PHOLD Scaling Results
- Overview of LLNL Project
- PDES Miniapp Results
- Impacts and Synergies

The Big Push…
- David Jefferson, Peter Barnes, and Richard Linderman contacted Chris about repeating the 2009 ROSS/PHOLD performance study on the "Sequoia" Blue Gene/Q supercomputer
- AFRL's purpose was to use the scaling study as a basis for obtaining a Blue Gene/Q system as part of HPCMO systems
- Goal: (i) push the scaling limits of massively parallel OPTIMISTIC discrete-event simulation, and (ii) determine whether the new Blue Gene/Q could continue the scaling performance obtained on BG/L and BG/P
- We thought it would be easy and straightforward…

IBM Blue Gene/Q Architecture
One node:
- 1.6 GHz IBM A2 processor: 16 cores (4-way threaded), plus a 17th core for the OS to avoid jitter and an 18th to improve yield
- 204.8 GFLOPS (peak)
- 16 GB DDR3 per node, 42.6 GB/s bandwidth
- 32 MB L2 cache @ 563 GB/s
- 55 watts of power
- 5D Torus @ 2 GB/s per link for all P2P and collective comms
One rack = 1024 nodes, or 16,384 cores, or up to 65,536 threads or MPI tasks

LLNL's "Sequoia" Blue Gene/Q
Sequoia: 96 racks of IBM Blue Gene/Q
- 1,572,864 A2 cores @ 1.6 GHz
- 1.6 petabytes of RAM
- 16.32 petaflops for LINPACK/Top500; 20.1 petaflops peak
- 5-D Torus: 16x16x16x12x2; bisection bandwidth ~49 TB/sec
- Used exclusively by DOE/NNSA
- Power: ~7.9 MW
"Super Sequoia" @ 120 racks
- 24 racks from "Vulcan" added to the existing 96 racks
- Increased to 1,966,080 A2 cores
- 5-D Torus: 20x16x16x12x2 (bisection bandwidth did not increase)
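As a quick arithmetic cross-check (not on the original slide), the core counts follow from the rack size given earlier:
96 racks × 1024 nodes/rack × 16 cores/node = 1,572,864 cores
120 racks × 1024 nodes/rack × 16 cores/node = 1,966,080 cores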

ROSS: Local Control Implementation
- ROSS is written in ANSI C and executes on BGs, Cray XT3/4/5, SGI, and Linux clusters
- GitHub URL: ross.cs.rpi.edu
- Reverse computation used to implement event "undo"
- RNG is a CLCG with period 2^121
- MPI_Isend/MPI_Irecv used to send/recv off-core events
- Event & network memory is managed directly; the pool is allocated @ startup
- AVL tree used to match anti-msgs w/ events across processors
- Event list kept sorted using a splay tree (log N)
- LP-to-core mapping tables are computed, not stored, to avoid the need for large global LP maps
Local control mechanism: error detection and rollback
[Figure: rollback along virtual time across LP 1, LP 2, LP 3: (1) undo state Δ's, (2) cancel "sent" events]

ROSS: Global Control Implementation
GVT computation (kicks off when memory is low):
1. Each core counts #sent, #recv
2. Recv all pending MPI msgs
3. MPI_Allreduce sum on (#sent - #recv)
4. If #sent - #recv != 0, goto 2
5. Compute the local core's lower-bound time-stamp (LVT)
6. GVT = MPI_Allreduce min on LVTs
- Algorithm needs an efficient MPI collective
- LC/GC can be very sensitive to OS jitter (the 17th core should avoid this)
Global control mechanism: compute Global Virtual Time (GVT)
[Figure: along virtual time, collect versions of state/events & perform I/O operations that are < GVT, across LP 1, LP 2, LP 3]
So, how does this translate into Time Warp performance on BG/Q?
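The reduction loop above is small enough to sketch in plain MPI. The following is a minimal illustration of the idea, not the actual ROSS GVT code; the names (sent, recvd, lvt, drain_pending_messages) are assumptions for the example.

#include <mpi.h>

extern void drain_pending_messages(long *recvd);  /* hypothetical: recv any pending event msgs */

double compute_gvt(long sent, long recvd, double lvt)
{
    long local_diff, global_diff;
    double gvt;

    /* Steps 1-4: repeat until the whole machine agrees that every sent
     * event has been received, i.e., nothing is still in flight. */
    do {
        drain_pending_messages(&recvd);
        local_diff = sent - recvd;
        MPI_Allreduce(&local_diff, &global_diff, 1, MPI_LONG,
                      MPI_SUM, MPI_COMM_WORLD);
    } while (global_diff != 0);

    /* Steps 5-6: GVT is the global minimum of each core's
     * lower-bound time stamp (LVT). */
    MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    return gvt;
}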

PHOLD Configuration
PHOLD: synthetic "pathological" benchmark workload model
- 40 LPs for each MPI task, ~251 million LPs total
  - Originally designed for 96 racks running 6,291,456 MPI tasks
  - At 120 racks and 7.8M MPI ranks, this yields 32 LPs per MPI task
- Each LP has 16 initial events
- Remote LP events occur 10% of the time and are scheduled for a random LP
- Time stamps are exponentially distributed with a mean of 0.9 plus a fixed time of 0.10 (i.e., lookahead is 0.10)
ROSS parameters
- GVT_Interval (512): number of times through the "scheduler" loop before computing GVT
- Batch (8): number of local events to process before "checking" the network for new events
- Batch × GVT_Interval events are processed per GVT epoch (see the sketch below)
- KPs (16 per MPI task): kernel processes that hold the aggregated processed-event lists for LPs, to lower search overheads for fossil collection of "old" events
- RNGs: each LP has its own seed set; seeds are ~2^70 calls apart
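To make the Batch × GVT_Interval relationship concrete, here is a schematic of one GVT epoch; process_one_event and poll_network are placeholders for the real scheduler internals, which this sketch does not reproduce.

/* Schematic only: batch * gvt_interval events are processed per GVT epoch. */
void run_gvt_epoch(int gvt_interval, int batch,
                   void (*process_one_event)(void),
                   void (*poll_network)(void))
{
    int i, b;
    for (i = 0; i < gvt_interval; i++) {
        for (b = 0; b < batch; b++)
            process_one_event();   /* optimistically execute the next local event */
        poll_network();            /* "check" the network for newly arrived events */
    }
    /* ...GVT is then computed, and events older than GVT are fossil-collected... */
}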

PHOLD Implementation

void phold_event_handler(phold_state * s, tw_bf * bf, phold_message * m, tw_lp * lp)
{
    tw_lpid dest;

    /* 10% of the time (percent_remote), pick a random destination LP;
     * otherwise the new event is scheduled back to this LP. */
    if (tw_rand_unif(lp->rng) <= percent_remote) {
        bf->c1 = 1;
        dest = tw_rand_integer(lp->rng, 0, ttl_lps - 1);
    } else {
        bf->c1 = 0;
        dest = lp->gid;
    }

    if (dest < 0 || dest >= (g_tw_nlp * tw_nnodes()))
        tw_error(TW_LOC, "bad dest");

    /* Exponential time increment (mean 0.9) plus the fixed lookahead LA (0.1). */
    tw_event_send(
        tw_event_new(dest, tw_rand_exponential(lp->rng, mean) + LA, lp));
}
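Since the forward handler above draws from the LP's RNG, the reverse computation mentioned on the Local Control slide has to roll those draws back. Below is a sketch of what the matching reverse handler typically looks like, assuming each tw_rand_* call above consumes exactly one draw from the stream; it is not necessarily the exact PHOLD source used in these runs.

void phold_event_handler_rc(phold_state * s, tw_bf * bf, phold_message * m, tw_lp * lp)
{
    /* Undo the tw_rand_unif() and tw_rand_exponential() draws made on every event. */
    tw_rand_reverse_unif(lp->rng);
    tw_rand_reverse_unif(lp->rng);

    /* Undo the tw_rand_integer() draw only if the forward path took the remote branch. */
    if (bf->c1 == 1)
        tw_rand_reverse_unif(lp->rng);
}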

CCI/LLNL Performance Runs
CCI Blue Gene/Q runs
- Used to help tune performance by "simulating" the workload at 96 racks
- 2-rack runs (128K MPI tasks) configured with 40 LPs per MPI task; total LPs: 5.2M
Sequoia Blue Gene/Q runs
- Many, many pre-runs and failed attempts
- Two sets of experiment runs:
  - Late Jan./early Feb. 2013: 1 to 48 racks
  - Mid March 2013: 2 to 120 racks
- Sequoia went down for CLASSIFIED service on ~March 14, 2013
- All runs were fully deterministic across all core counts

Impact of Multiple MPI Tasks per Core
- Each line starts at 1 MPI task per core, moves to 2 MPI tasks per core, and finally 4 MPI tasks per core
- At 2048 nodes, observed a ~260% performance increase from 1 to 4 tasks/core
- Predicts we should obtain ~384 billion ev/sec at 96 racks

Detailed Sequoia Results: Jan 24 – Feb 5, 2013
- 75x speedup in scaling from 1 to 48 racks, with a peak event rate of 164 billion ev/sec!

Excitement, Warp Speed & Frustration
- At 786,432 cores and 3.1M MPI tasks, we were extremely encouraged by ROSS' performance
- From this, we defined "Warp Speed" to be: log10(event rate) - 9.0
  - Due to the 5000x increase, plotting historic speeds no longer makes sense on a linear scale
  - The metric scales 10 billion events per second as Warp 1.0
- However… we were unable to obtain a full-machine run!!
  - Was it a ROSS bug? How do you debug at O(1M) cores?
  - Fortunately NOT a problem within ROSS! The PAMI low-level message-passing system would not allow jobs larger than 48 racks to run
  - Solution: wait for an IBM Efix, but time was short…
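For concreteness, a small stand-alone snippet (not part of ROSS) applying the metric to the two rates quoted in these slides:

#include <math.h>
#include <stdio.h>

/* "Warp Speed" as defined above: log10(event rate) - 9.0,
 * so 10 billion ev/sec is exactly Warp 1.0. */
static double warp_speed(double events_per_sec)
{
    return log10(events_per_sec) - 9.0;
}

int main(void)
{
    printf("10 billion ev/sec  -> Warp %.1f\n", warp_speed(10.0e9));   /* 1.0 */
    printf("504 billion ev/sec -> Warp %.1f\n", warp_speed(504.0e9));  /* ~2.7 */
    return 0;
}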

Detailed Sequoia Results: March 8 – 11, 2013
With Efix #15 coupled with some magic env settings:
- 2-rack performance was nearly 10% faster
- 48-rack performance improved by 10B ev/sec
- 96-rack performance exceeds the prediction by 15B ev/sec
- 120 racks / 1.9M cores: 504 billion ev/sec with ~93% efficiency

ROSS/PHOLD Strong Scaling Performance
- 97x speedup for 60x more hardware
- Why? We believe it is due to much-improved cache performance at scale
- E.g., at 120 racks each node only requires ~65 MB, so most of the data fits within the 32 MB L2 cache

PHOLD Performance History
- "Jagged" phenomenon attributed to different PHOLD configs
- 2005: first time a large supercomputer reports PHOLD performance
- 2007: Blue Gene/L PHOLD performance
- 2009: Blue Gene/P PHOLD performance
- 2011: Cray XT5 PHOLD performance
- 2013: Blue Gene/Q

LLNL/LDRD: Planetary Scale Simulation Project
Summary:
- Demonstrated the highest PHOLD performance to date: 504 billion ev/sec on 1,966,080 cores, i.e., Warp 2.7
- PHOLD has 250x more LPs and yields a 40x improvement over the previous BG/P performance (2009)
- Enabler for thinking about billion-object simulations
LLNL/LDRD 3-year project: "Planetary Scale Simulation"
- App 1: DDoS attack on big networks
- App 2: Pandemic spread of flu virus
Opportunity to improve ROSS capabilities: shift from MPI to Charm++

Shifting ROSS from MPI to Charm++
Why shift? Potential for a 25% to 50% performance improvement over the all-MPI code base
- BG/Q single-node performance: ~4M ev/sec with MPI vs. ~7M ev/sec using all threads
- Gains:
  - Use of threads and shared memory internal to a node
  - Lower-latency P2P messages via direct access to PAMI
  - Asynchronous GVT
  - Scalable, near-seamless dynamic load balancing via the Charm++ RTS
Initial results: PDES miniapp in Charm++
- Quickly gain real knowledge about how best to leverage Charm++ for PDES
- Uses the YAWNS windowing conservative protocol (see the sketch below)
- Groups of LPs implemented as chares
- Charm++ messages used to transmit events
- TACC Stampede cluster used in the first experiments, up to 4K cores
- TRAM used to "aggregate" messages to lower comm overheads
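As a rough illustration of the YAWNS-style window mentioned above, the sketch below uses plain C/MPI rather than Charm++, with caller-supplied event-queue callbacks; it is an assumption-laden outline (fixed lookahead, global communicator), not the miniapp's actual code.

#include <mpi.h>

typedef double (*peek_ts_fn)(void);   /* timestamp of the next local event */
typedef void   (*exec_fn)(void);      /* process the next local event */

/* One conservative window: agree on the globally smallest pending timestamp,
 * then safely execute everything below that bound plus the lookahead. */
void yawns_window(double lookahead, peek_ts_fn peek, exec_fn exec)
{
    double my_min = peek();
    double window_end;

    MPI_Allreduce(&my_min, &window_end, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    window_end += lookahead;

    /* With lookahead as the minimum remote increment, no event with a
     * timestamp below window_end can still arrive from another rank. */
    while (peek() < window_end)
        exec();
}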

PDES Miniapp: LP Density

PDES Miniapp: Event Density

Impact on Research Activities with ROSS
- DOE CODES project continues
  - New focus on design trade-offs for Virtual Data Facilities
  - PI: Rob Ross @ ANL
- LLNL: massively parallel KMC
  - PI: Tomas Oppelstrup @ LLNL
- IBM/DOE Design Forward: co-design of exascale networks
  - ROSS as the core simulation engine for Venus models
  - PI: Phil Heidelberger @ IBM
- Use of Charm++ can improve all of these activities
Notes: a Virtual Data Facility is a COORDINATED, MULTI-SITE FACILITY whose purpose is to address the SHARED DATA INFRASTRUCTURE NEEDS of the Office of Science; ESNet == Energy Sciences Network

Thank You!