Gengbin Zheng and Eric Bohm

Slides:



Advertisements
Similar presentations
INTRODUCTION TO SIMULATION WITH OMNET++ José Daniel García Sánchez ARCOS Group – University Carlos III of Madrid.
Advertisements

MPI Message Passing Interface
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Parallel Programming Laboratory1 Fault Tolerance in Charm++ Sayantan Chakravorty.
Multilingual Debugging Support for Data-driven Parallel Languages Parthasarathy Ramachandran Laxmikant Kale Parallel Programming Laboratory Dept. of Computer.
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.
Cactus in GrADS (HFA) Ian Foster Dave Angulo, Matei Ripeanu, Michael Russell.
BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng Gunavardhan Kakulapati Laxmikant V. Kale University.
Estimating Flight CPU Utilization of Algorithms in a Desktop Environment Using Xprof Dr. Matt Wette 2013 Flight Software Workshop.
An Introduction Chapter Chapter 1 Introduction2 Computer Systems  Programmable machines  Hardware + Software (program) HardwareProgram.
1Charm++ Workshop 2009 BigSim Tutorial Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois.
Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Introduction to Charm++ Machine Layer Gengbin Zheng Parallel Programming Lab 4/3/2002.
Charm++ Workshop 2008 BigSim Tutorial Presented by Eric Bohm Charm++ Workshop 2008 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
Fault Tolerant Extensions to Charm++ and AMPI presented by Sayantan Chakravorty Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi.
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
1Charm++ Workshop 2010 The BigSim Parallel Simulation System Gengbin Zheng, Ryan Mokos Charm++ Workshop 2010 Parallel Programming Laboratory University.
University of Illinois at Urbana-Champaign Memory Architectures for Protein Folding: MD on million PIM processors Fort Lauderdale, May 03,
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Manifold Execution Model and System.
Simics: A Full System Simulation Platform Synopsis by Jen Miller 19 March 2004.
Using BigSim to Estimate Application Performance Ryan Mokos Parallel Programming Laboratory University of Illinois at Urbana-Champaign October 19, 2010.
Marcelo R.N. Mendes. What is FINCoS? A set of tools for data generation, load submission, and performance measurement of CEP systems; Main Characteristics:
IPDPS Workshop: Apr 2002PPL-Dept of Computer Science, UIUC A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops Gengbin Zheng,
BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2004 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana- Champaign.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.
1 Converse BlueGene Emulator Gengbin Zheng Parallel Programming Lab 2/27/2001.
BigSim Tutorial Presented by Eric Bohm LACSI Charm++ Workshop 2005 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Projections - A Step by Step Tutorial By Chee Wai Lee For the 2004 Charm++ Workshop.
FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
BigNetSim Tutorial Presented by Gengbin Zheng & Eric Bohm Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Debugging Large Scale Applications in a Virtualized Environment Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.
Introduction to Operating Systems Concepts
cs612/2002sp/projects/ CS612 Term Projects cs612/2002sp/projects/
Outline Installing Gem5 SPEC2006 for Gem5 Configuring Gem5.
Kai Li, Allen D. Malony, Sameer Shende, Robert Bell
Creating Simulations with POSE
Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming
Chapter 13: I/O Systems Modified by Dr. Neerja Mhaskar for CS 3SH3.
Self Healing and Dynamic Construction Framework:
In-situ Visualization using VisIt
CS399 New Beginnings Jonathan Walpole.
Structural Simulation Toolkit / Gem5 Integration
Introduction to SimpleScalar
Workshop in Nihzny Novgorod State University Activity Report
uGNI-based Charm++ Runtime for Cray Gemini Interconnect
Logistic Regression and Perceptron Prediction of Instruction Branches
Performance Evaluation of Adaptive MPI
Programmable Logic Controllers (PLCs) An Overview.
Chapter 3: Windows7 Part 2.
CMSC 611: Advanced Computer Architecture
Operation System Program 4
Chapter 2: System Structures
Chapter 3: Windows7 Part 2.
A Simulator to Study Virtual Memory Manager Behavior
Chapter 2: Operating-System Structures
MPJ: A Java-based Parallel Computing System
Projections Overview Ronak Buch & Laxmikant (Sanjay) Kale
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
BigSim: Simulating PetaFLOPS Supercomputers
System calls….. C-program->POSIX call
An Orchestration Language for Parallel Objects
Higher Level Languages on Adaptive Run-Time System
Support for Adaptivity in ARMCI Using Migratable Objects
Types of Parallel Computers
Emulating Massively Parallel (PetaFLOPS) Machines
Presentation transcript:

Gengbin Zheng and Eric Bohm BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2007 Parallel Programming Laboratory University of Illinois at Urbana-Champaign 11/26/2018 Charm++ Workshop 2007

Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework Post-mortem simulation Network simulation Performance analysis/visualization 11/26/2018 Charm++ Workshop 2007

BigSim Simulation Toolkit BigSim emulator Standalone emulator API Charm++ on emulator BigSim simulator Network simulator Early techniques before machines are built. 11/26/2018 Charm++ Workshop 2007

Architecture of BigSim (postmortem mode) Performance visualization (Projections) Offline PDES Network Simulator BigNetSim (POSE) Simulation output trace logs Charm++ Runtime Load Balancing Module Performance counters Instruction Sim (RSim, IBM, ..) Simple Network Model BigSim Emulator Charm++ and MPI applications 11/26/2018 Charm++ Workshop 2007 6

Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework Online mode simulation Post-mortem simulation Network simulation Performance analysis/visualization 11/26/2018 Charm++ Workshop 2007

Emulator Started with mimicking Blue Gene/C low level API Emulate full machine on existing parallel machines Actually run a parallel program with multi-million way parallelism Started with mimicking Blue Gene/C low level API Machine layer abstraction Many multiprocessor (SMP) nodes connected via message passing When BigSim project starts, we aim to be able to run application on non- existing bluegene/C machine using clusters. The idea is that we develop an emulator to mimic the BlueGene/C low level premitives. And we found this API is general for a class of next generation supercomputers. 11/26/2018 Charm++ Workshop 2007

BigSim Emulator: functional view Real Processor Communication processors Worker processors inBuf f Communication processors Worker processors inBuf f Non-affinity message queues Correctio nQ Non-affinity message queues Correctio nQ Affinity message queues Affinity message queues 8 million emulation on 90 ASCI processors Target Node Target Node Converse scheduler Converse Q 11/26/2018 Charm++ Workshop 2007 9

BigSim Programming API out Machine initialization Set/get machine configuration Get node ID: (x, y, z) Message passing Register handler functions on node Send packets to other nodes (x,y,z) with a handler ID With the abstraction, we emulate the BG machine API. 11/26/2018 Charm++ Workshop 2007 10

User’s API BgEmulatorInit(), BgNodeStart() BgGetXYZ() BgGetSize(), BgSetSize() BgGetNumWorkThread(), BgSetNumWorkThread() BgGetNumCommThread(), BgSetNumCommThread() BgGetNodeData(), BgSetNodeData() BgGetThreadID(), BgGetGlobalThreadID() BgGetTime() BgRegisterHandler() BgSendPacket(), etc BgShutdown() BgEmulatorInit() is called once per emulator node to configure the emulator; Emulator programs start at BgNodeStart(). See next slide for an example. 11/26/2018 Charm++ Workshop 2007 11

Examples charm/examples/bigsim/emulator ring jacobi3D maxReduce prime octo line littleMD 11/26/2018 Charm++ Workshop 2007

BigSim application example - Ring typedef struct { char core[CmiBlueGeneMsgHeaderSizeBytes]; int data; } RingMsg; void BgNodeStart(int argc, char **argv) { int x,y,z, nx, ny, nz; BgGetXYZ(&x, &y, &z); nextxyz(x, y, z, &nx, &ny, &nz); if (x == 0 && y==0 && z==0) { RingMsg msg = new RingMsg; msg->data = 888; BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), (char *)msg); } void passRing(char *msg) { int x, y, z, nx, ny, nz; BgGetXYZ(&x, &y, &z); nextxyz(x, y, z, &nx, &ny, &nz); if (x==0 && y==0 && z==0) if (++iter == MAXITER) BgShutdown(); BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), msg); Flavor of how a program written on BG Emulator’s API. 11/26/2018 Charm++ Workshop 2007 13

Emulator Compilation Emulator libraries implemented on top of Converse/machine layer: libconv-bigsim.a libconv-bigsim-logs.a Compile with normal Charm++ with “bigemulator” target ./build bigemulator net-linux Compile an application with emulator API charmc -o ring ring.C -language bigsim 11/26/2018 Charm++ Workshop 2007

Execute Application on the Emulator Define machine configuration Function API BgSetSize(x, y, z), BgSetNumWorkThread(), BgSetNumCommThread() Command line options +x +y +z +cth +wth E.g. charmrun +p4 ring +x10 +y10 +z10 +cth2 +wth4 Config file +bgconfig config 11/26/2018 Charm++ Workshop 2007

Running with bgconfig file +bgconfig ./bg_config x 10 y 10 z 10 cth 2 wth 4 stacksize 4000 timing walltime #timing bgelapse #timing counter #cpufactor 1.0 fpfactor 5e-7 traceroot /tmp log yes correct no network bluegene Explain stacksize 11/26/2018 Charm++ Workshop 2007

Ring Output clarity>./ring 2 2 2 2 2 Charm++: standalone mode (not using charmrun) BG info> Simulating 2x2x2 nodes with 2 comm + 2 work threads each. BG info> Network type: bluegene. alpha: 1.000000e-07 packetsize: 1024 CYCLE_TIME_FACTOR:1.000000e-03. CYCLES_PER_HOP: 5 CYCLES_PER_CORNER: 75. 0 0 0 => 0 0 1 0 0 1 => 0 1 0 0 1 0 => 0 1 1 0 1 1 => 1 0 0 1 0 0 => 1 0 1 1 0 1 => 1 1 0 1 1 0 => 1 1 1 1 1 1 => 0 0 0 BG> BlueGene emulator shutdown gracefully! BG> Emulation took 0.000265 seconds! Program finished. 11/26/2018 Charm++ Workshop 2007

Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework Online mode simulation Post-mortem simulation Network simulation Performance analysis/visualization 11/26/2018 Charm++ Workshop 2007

BigSim Charm++/AMPI Charm++/AMPI implemented on top of BigSim emulator, using it as another machine layer Support frameworks and libraries Load balancing framework Communication optimization library (comlib) FEM Multiphase Shared Array (MSA) 11/26/2018 Charm++ Workshop 2007

BigSim Charm++ Charm++ NS Selector Charm++ BGConverse Emulator UDP/TCP, MPI, Myrinet, etc NS Selector BGConverse Emulator Charm++ Converse UDP/TCP, MPI, Myrinet, etc Not straightforward 11/26/2018 Charm++ Workshop 2007

Build Charm++ on BigSim Compile Charm++ on top of BigSim emulator Build option “bigemulator” E.g. Charm++: ./build charm++ net-linux bigemulator AMPI: ./build AMPI net-linux bigemulator 11/26/2018 Charm++ Workshop 2007

Running Charm++/AMPI Applications Compile Charm++/AMPI applications Same as normal Charm++/AMPI Just use charm/net-linux-bigsim/bin/charmc Running BigSim Charm++ applications Same as running on emulator Use command line option, or Use bgconfig file Straightforward, same as normal charm++ Running is like running an emulator application 11/26/2018 Charm++ Workshop 2007

Example – AMPI Cjacobi3D cd charm/net-linux-bigemulator/examples/ampi/Cjacobi3D Make charmc -o jacobi jacobi.o -language ampi - module EveryLB 11/26/2018 Charm++ Workshop 2007

./charmrun +p2 ./jacobi 2 2 2 +vp8 +bgconfig ~/bg_config +balancer GreedyLB +LBDebug 1 [0] GreedyLB created iter 1 time: 1.022634 maxerr: 2020.200000 iter 2 time: 0.814523 maxerr: 1696.968000 iter 3 time: 0.787009 maxerr: 1477.170240 iter 4 time: 0.825189 maxerr: 1319.433024 iter 5 time: 1.093839 maxerr: 1200.918072 iter 6 time: 0.791372 maxerr: 1108.425519 iter 7 time: 0.823002 maxerr: 1033.970839 iter 8 time: 0.818859 maxerr: 972.509242 iter 9 time: 0.826524 maxerr: 920.721889 iter 10 time: 0.832437 maxerr: 876.344030 [GreedyLB] Load balancing step 0 starting at 11.647364 in PE0 n_obj:8 migratable:8 ncom:24 GreedyLB: 5 objects migrating. [GreedyLB] Load balancing step 0 finished at 11.777964 [GreedyLB] duration 0.130599s memUsage: LBManager:800KB CentralLB:0KB iter 11 time: 1.627869 maxerr: 837.779089 iter 12 time: 0.951551 maxerr: 803.868831 iter 13 time: 0.960144 maxerr: 773.751705 iter 14 time: 0.952085 maxerr: 746.772667 iter 15 time: 0.956356 maxerr: 722.424056 iter 16 time: 0.965365 maxerr: 700.305763 iter 17 time: 0.947866 maxerr: 680.097726 iter 18 time: 0.957245 maxerr: 661.540528 iter 19 time: 0.961152 maxerr: 644.421422 iter 20 time: 0.960874 maxerr: 628.564089 BG> Bigsim mulator shutdown gracefully! BG> Emulation took 36.762261 seconds! 11/26/2018 Charm++ Workshop 2007

Performance Prediction How to predict performance? Different levels of fidelity Sequential portion: User supplied timing expression Wall clock time Performance counters Instruction level simulation Message passing: Simple latency-based network model Contention-based network simulation Example, previous time is time on host machine. To predict parallel applications, we need to predict different parallel components. listed in the increasing order of accuracy and the complexity involved. 11/26/2018 Charm++ Workshop 2007

How to Ensure Simulation Accuracy The idea: Take advantage of inherent determinacy of an application Don’t need rollback - same user function then is executed only once In case of out of order delivery, only timestamps of events are adjusted Inherent determinacy simply is no matter which order messages are delivered, applications are written to produce same result. That is application has inherent determinacy to tolerate out of order messages. 11/26/2018 Charm++ Workshop 2007

Timestamp Correction (Jacobi1D) T(e1) T(e2) Original Timeline T(e2) T”(e1) Incorrect Updated Timeline Parallel simulation is challenging, how to ensure simulation accuracy? First figure shows the original timeline when messages are delivered in simulator It turned out message e1 came too early, it actually should received at a later time as in second figure. This timeline is not correct because the doWork() is not executed even before message e1 is received T’’’(e1) T(e2) Correct Updated Timeline LEGEND: getStripFromRight (e1) getStripFromLeft (e2) doWork 11/26/2018 Charm++ Workshop 2007

Structured Dagger (Jacobi1D) entry void jacobiLifeCycle() { for (i=0; i<MAX_ITER; i++) atomic {sendStripToLeftAndRight();} overlap when getStripFromLeft(Msg *leftMsg) { atomic { copyStripFromLeft(leftMsg); } } when getStripFromRight(Msg *rightMsg) { atomic { copyStripFromRight(rightMsg); } } } atomic{ doWork(); /* Jacobi Relaxation */ } 11/26/2018 Charm++ Workshop 2007

Sequential time - BgElapse entry void jacobiLifeCycle() { for (i=0; i<MAX_ITER; i++) atomic {sendStripToLeftAndRight();} overlap when getStripFromLeft(Msg *leftMsg) { atomic { copyStripFromLeft(leftMsg); } } when getStripFromRight(Msg *rightMsg) { atomic { copyStripFromRight(rightMsg); } } } atomic{ doWork(); BgElapse(10e-3);} 11/26/2018 Charm++ Workshop 2007

Sequential Time – using Wallclock Wallclock measurement of the time can be used via a suitable multiplier (scale factor) Run application with +bgwalltime and +bgcpufactor, or +bgconfig ./bgconfig: timing walltime cpufactor 0.7 Good for predicting a larger machine using a fraction of the machine 11/26/2018 Charm++ Workshop 2007

Sequential Time – performance counters Count floating-point, integer, memory and branch instructions (for example) with hardware counters with a simple heuristic, use the expected time for each of these operations on the target machine to give the predicted total computation time. Cache performance and the memory footprint effects can be approximated by percentage of memory accesses and cache hit/miss ratio. Perfex and PAPI are supported Example of use, for a floating-point intensive code: +bgconfig ./bg_config timing counter fpfactor 5e-7 11/26/2018 Charm++ Workshop 2007

Sequential Time – Instruction level simulation Run instruction-level simulator separately to get accurate timing information (sampling) An interpolation-based scheme Use result of a smaller scale instruction level simulation to interpolate for large dataset do a least-squares fit to determine the coefficients of an approximation polynomial function This work was motivated by the recent track 1 proposal when we want to predict performance of NAMD on a proposed IBM power7 machine. We found it is 11/26/2018 Charm++ Workshop 2007

Case study: BigSim / Mambo void func( ) { StartBigSim( ) … EndBigSim( ) } Mambo Prediction for Target System Cycle-accurate prediction of sequential blocks on POWER7 processor BigSim Parallel Emulation Interpolation BigSim Parallel Simulation Results are not available + Replace sequential timing Trace files Parameter files for sequential blocks Adjusted trace files 11/26/2018 Charm++ Workshop 2007

Using interpolation tool Compile interpolation tool Install GSL, the GNU Scientific Library cd charm/examples/bigsim/tools/rewritelog Modify the file interpolatelog.C to match your particular tastes. OUTPUTDIR specifies a directory for the new logfiles CYCLE_TIMES_FILE specifies the file which contains accurate timing information Make Modify source code Insert startTraceBigSim() call before a compute kernel. Add an endTraceBigSim() call after the kernel. Currently the first call takes between 0 and 20 parameters describing the computation. startTraceBigSim(param1, param2, param3, …); // Some serial computational kernel goes here endTraceBigSim("EventName"); 11/26/2018 Charm++ Workshop 2007

Using interpolation tool (cont.) Run the application through emulator, generating trace logs (bgTrace*)and parameter files (param.*) Run the same application with instruction- level simulator, get accurate timing indexed by parameters Run interpolation tool under bgTrace dir: ./interpolatelog 11/26/2018 Charm++ Workshop 2007

Out-of-core Emulation Motivation Physical memory is shared VM system would not handle well Message driven execution Peek msg queue => what execute next? (prefetch) 11/26/2018 Charm++ Workshop 2007

Using Out-of-core Compile an application with bigemulator Run the application through the emulator, and command line option: +ooc 512 11/26/2018 Charm++ Workshop 2007

How to Obtain Predicted Time (cont.) BgPrint (char *) Bookmarking events E.g. BgPrint(“start at %f\n”); Output to bgPrintFile.0 when simulation finishes Look back these bookmarks Replace “%f” with the committed time 11/26/2018 Charm++ Workshop 2007

BigSim Trace Log Execution of messages on each target processor is stored in trace logs (binary format) named bgTrace[#], # is simulating processor number. Can be used for Visualization/Performance study Post-mortem simulation with different network models Loadlog tool Binary to human readable ascii format conversion charm/examples/bigsim/tools/loadlog with machine translation capability Each simulating processor stores trace logs of all target processors simulated by this processor 11/26/2018 Charm++ Workshop 2007

ASCII Log Sample [22] 0x80a7a60 name:msgep (srcnode:0 msgID:21) ep:1 [[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]] backward: forward: [0x80a7af0 23] [23] 0x80a7af0 name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]] msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208 msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208 backward: [0x80a7a60 22] forward: [0x80a7ca8 24] [24] 0x80a7ca8 name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]] backward: [0x80a7af0 23] forward: [0x80a7dc8 25] [0x80a8170 28] 11/26/2018 Charm++ Workshop 2007