BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2004 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

MPI Message Passing Interface
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.
Router Architecture : Building high-performance routers Ian Pratt
Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.
Processes CSCI 444/544 Operating Systems Fall 2008.
Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.
BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng Gunavardhan Kakulapati Laxmikant V. Kale University.
OMNET++. Outline Introduction Overview The NED Language Simple Modules.
Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.
1 LiveViz – What is it? Charm++ library Visualization tool Inspect your program’s current state Client runs on any machine (java) You code the image generation.
1 Charm++ on the Cell Processor David Kunzman, Gengbin Zheng, Eric Bohm, Laxmikant V. Kale.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Synchronization and Communication in the T3E Multiprocessor.
1Charm++ Workshop 2009 BigSim Tutorial Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois.
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
FINAL MPX DELIVERABLE Due when you schedule your interview and presentation.
Fast Multi-Threading on Shared Memory Multi-Processors Joseph Cordina B.Sc. Computer Science and Physics Year IV.
Charm++ Workshop 2008 BigSim Tutorial Presented by Eric Bohm Charm++ Workshop 2008 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Advanced / Other Programming Models Sathish Vadhiyar.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
1 Optical Packet Switching Techniques Walter Picco MS Thesis Defense December 2001 Fabio Neri, Marco Ajmone Marsan Telecommunication Networks Group
The Cosmic Cube Charles L. Seitz Presented By: Jason D. Robey 2 APR 03.
Fault Tolerant Extensions to Charm++ and AMPI presented by Sayantan Chakravorty Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi.
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
1Charm++ Workshop 2010 The BigSim Parallel Simulation System Gengbin Zheng, Ryan Mokos Charm++ Workshop 2010 Parallel Programming Laboratory University.
University of Illinois at Urbana-Champaign Memory Architectures for Protein Folding: MD on million PIM processors Fort Lauderdale, May 03,
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Manifold Execution Model and System.
Lab 2 Parallel processing using NIOS II processors
1 Computer Systems II Introduction to Processes. 2 First Two Major Computer System Evolution Steps Led to the idea of multiprogramming (multiple concurrent.
Optimizing Charm++ Messaging for the Grid Gregory A. Koenig Parallel Programming Laboratory Department of Computer.
1 ©2004 Board of Trustees of the University of Illinois Computer Science Overview Laxmikant (Sanjay) Kale ©
Modeling Billion-Node Torus Networks Using Massively Parallel Discrete-Event Simulation Ning Liu, Christopher Carothers 1.
Using BigSim to Estimate Application Performance Ryan Mokos Parallel Programming Laboratory University of Illinois at Urbana-Champaign October 19, 2010.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
Marcelo R.N. Mendes. What is FINCoS? A set of tools for data generation, load submission, and performance measurement of CEP systems; Main Characteristics:
Interconnection network network interface and a case study.
IPDPS Workshop: Apr 2002PPL-Dept of Computer Science, UIUC A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops Gengbin Zheng,
Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.
Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.
NGS/IBM: April2002PPL-Dept of Computer Science, UIUC Programming Environment and Performance Modeling for million-processor machines Laxmikant (Sanjay)
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana- Champaign.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.
1 Converse BlueGene Emulator Gengbin Zheng Parallel Programming Lab 2/27/2001.
Performance analysis of a Pose application -- BigNetSim Nilesh Choudhury.
BigSim Tutorial Presented by Eric Bohm LACSI Charm++ Workshop 2005 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.
Projections - A Step by Step Tutorial By Chee Wai Lee For the 2004 Charm++ Workshop.
1 Creating Simulations with POSE Terry L. Wilmarth, Nilesh Choudhury, David Kunzman, Eric Bohm Parallel Programming Laboratory University of Illinois at.
FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.
Array computers. Single Instruction Stream Multiple Data Streams computer There two types of general structures of array processors SIMD Distributerd.
1 Network Access to Charm Programs: CCS Orion Sky Lawlor 2003/10/20.
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
BigNetSim Tutorial Presented by Gengbin Zheng & Eric Bohm Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Introduction to Operating Systems Concepts
Creating Simulations with POSE
Chapter 3: Process Concept
Performance Evaluation of Adaptive MPI
Peter Poplavko, Saddek Bensalem, Marius Bozga
Gengbin Zheng and Eric Bohm
Projections Overview Ronak Buch & Laxmikant (Sanjay) Kale
BigSim: Simulating PetaFLOPS Supercomputers
Outline Chapter 2 (cont) Chapter 3: Processes Virtual machines
An Orchestration Language for Parallel Objects
Emulating Massively Parallel (PetaFLOPS) Machines
Presentation transcript:

BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2004 Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop Simulation-based Performance Prediction Extremely large parallel machines are being built with enormous compute power  Very large number of processors with petaflops level peak performance Are existing software environments ready for these new machines?  How to write a peta-scale parallel application?  What will be the performance like? Can these applications scale?

Charm++ Workshop BigSim Objective Aim at developing techniques and methods to facilitate the development of efficient peta-scale applications on very large parallel machines. Based on performance prediction via simulation

Charm++ Workshop Simulation-based Performance Prediction With focus on Charm++ and AMPI programming models Performance prediction is based on Parallel Discrete Event Simulation (PDES) Simulation is challenging, aims at different levels of fidelity  Processor prediction  Network prediction Two approaches  Direct execution (online mode)  Trace-driven (post-mortem mode)

Charm++ Workshop Charm++ and MPI applications Simulation output trace logs Performance visualization (Projections) BigSim Emulator Charm++ Runtime Online PDES engine Instruction Sim (RSim, IBM,..) Simple Network Model Performance counters Load Balancing Module Architecture of BigSim (online mode)

Charm++ Workshop Charm++ and MPI applications Simulation output trace logs BigNetSim (POSE) Network Simulator Performance visualization (Projections) BigSim Emulator Charm++ Runtime Online PDES engine Instruction Sim (RSim, IBM,..) Simple Network Model Performance counters Load Balancing Module Offline PDES Architecture of BigSim (postmortem mode)

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop Emulator Emulate full machine on existing parallel machines  Actually run a parallel program with multi-million way parallelism Started with mimicking Blue Gene/C low level API Machine layer abstraction  Many multiprocessor (SMP) nodes connected via message passing

Charm++ Workshop BigSim Emulator: functional view Affinity message queues Communication processors Worker processors inBuf f Non-affinity message queues Correctio nQ Converse scheduler Converse Q Communication processors Worker processors inBuf f Non-affinity message queues Correctio nQ Affinity message queues Real Processor Target Node

Charm++ Workshop BigSim Programming API Machine initialization  Set/get machine configuration  Get node ID: (x, y, z) Message passing  Register handler functions on node  Send packets to other nodes (x,y,z) with a handler ID in out

Charm++ Workshop User ’ s API BgEmulatorInit(), BgNodeStart() BgGetXYZ() BgGetSize(), BgSetSize() BgGetNumWorkThread(), BgSetNumWorkThread() BgGetNumCommThread(), BgSetNumCommThread() BgGetNodeData(), BgSetNodeData() BgGetThreadID(), BgGetGlobalThreadID() BgGetTime() BgRegisterHandler() BgSendPacket(), etc BgShutdown()

Charm++ Workshop Examples charm/examples/bigsim/emulator  ring  jacobi3D  maxReduce  prime  octo  line  littleMD

Charm++ Workshop BigSim application example - Ring typedef struct { char core[CmiBlueGeneMsgHeaderSizeBytes]; int data; } RingMsg; void BgNodeStart(int argc, char **argv) { int x,y,z, nx, ny, nz; BgGetXYZ(&x, &y, &z); nextxyz(x, y, z, &nx, &ny, &nz); if (x == 0 && y==0 && z==0) { RingMsg msg = new RingMsg; msg->data = 888; BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), (char *)msg); } void passRing(char *msg) { int x, y, z, nx, ny, nz; BgGetXYZ(&x, &y, &z); nextxyz(x, y, z, &nx, &ny, &nz); if (x==0 && y==0 && z==0) if (++iter == MAXITER) BgShutdown(); BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), msg); }

Charm++ Workshop Emulator Compilation Emulator libraries implemented on top of Converse/machine layer:  libconv-bluegene.a  libconv-bluegene-logs.a Compile with normal Charm++ with “bluegene” target ./build bluegene net-linux Compile an application with emulator API  charmc -o ring ring.C -language bluegene

Charm++ Workshop Execute Application on the Emulator Define machine configuration  Function API BgSetSize(x, y, z), BgSetNumWorkThread(), BgSetNumCommThread()  Command line options +x +y +z +cth +wth E.g. charmrun +p4 ring +x10 +y10 +z10 +cth2 +wth4  Config file +bgconfig config

Charm++ Workshop Running with bgconfig file +bgconfig./bg_config x 10 y 10 z 10 cth 2 wth 4 stacksize 4000 timing walltime #timing bgelapse #timing counter #cpufactor 1.0 fpfactor 5e-7 traceroot /tmp log yes correct no network bluegene

Charm++ Workshop Ring Output clarity>./ring Charm++: standalone mode (not using charmrun) BG info> Simulating 2x2x2 nodes with 2 comm + 2 work threads each. BG info> Network type: bluegene. alpha: e-07 packetsize: 1024 CYCLE_TIME_FACTOR: e-03. CYCLES_PER_HOP: 5 CYCLES_PER_CORNER: => => => => => => => => BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds! Program finished.

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop BigSim Charm++/AMPI Need high level programming language such as Charm++/AMPI Charm++/AMPI implemented on top of BigSim emulator, using it as another machine layer Support frameworks and libraries  Load balancing framework  Communication optimization library (comlib)  FEM  Multiphase Shared Array (MSA)

Charm++ Workshop BigSim Charm++ Charm++ Converse UDP/TCP, MPI, Myrinet, etc Converse Charm++ UDP/TCP, MPI, Myrinet, etc NS Selector BGConverse Emulator

Charm++ Workshop Build Charm++ on BigSim Compile Charm++ on top of BigSim emulator  Build option “bluegene”  E.g. Charm++:./build bluegene net-linux bluegene AMPI:./build bgampi net-linux bluegene

Charm++ Workshop Running Charm++/AMPI Applications Compile Charm++/AMPI applications  Same as normal Charm++/AMPI  Just use charm/net-inux-bluegene/bin/charmc Running BigSim Charm++ applications  Same as running on emulator Use command line option, or Use bgconfig file

Charm++ Workshop Example - simplearrayhello cd charm/net-linux-bluegene/pgms/charm++/simplearrayhello Make  charmc -language charm++ -o hello hello.o Output: clarity>./hello +bgconfig ~/bg_config Charm++: standalone mode (not using charmrun) Reading Bluegene Config file /expand8/home/gzheng/bg_config... BG info> Simulating 2x2x1 nodes with 1 comm + 1 work threads each. BG info> Network type: bluegene. BG info> Generating timing log. Running Hello on 4 processors for 5 elements Hello 0 created Hello 4 created Hi[17] from element 0 Hello 1 created Hello 2 created Hello 3 created Hi[18] from element 1 Hi[19] from element 2 Hi[20] from element 3 Hi[21] from element 4 All done BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds! Program finished.

Charm++ Workshop Example – AMPI Cjacobi3D cd charm/net-linux- bluegene/pgms/charm++/ampi/Cjacobi3D Make  charmc -o jacobi jacobi.o -language ampi - module EveryLB

Charm++ Workshop /charmrun +p2./jacobi vp8 +bgconfig ~/bg_config +balancer GreedyLB +LBDebug 1 [0] GreedyLB created iter 1 time: maxerr: iter 2 time: maxerr: iter 3 time: maxerr: iter 4 time: maxerr: iter 5 time: maxerr: iter 6 time: maxerr: iter 7 time: maxerr: iter 8 time: maxerr: iter 9 time: maxerr: iter 10 time: maxerr: [GreedyLB] Load balancing step 0 starting at in PE0 n_obj:8 migratable:8 ncom:24 GreedyLB: 5 objects migrating. [GreedyLB] Load balancing step 0 finished at [GreedyLB] duration s memUsage: LBManager:800KB CentralLB:0KB iter 11 time: maxerr: iter 12 time: maxerr: iter 13 time: maxerr: iter 14 time: maxerr: iter 15 time: maxerr: iter 16 time: maxerr: iter 17 time: maxerr: iter 18 time: maxerr: iter 19 time: maxerr: iter 20 time: maxerr: BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds!

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation Performance analysis/visualization

Charm++ Workshop Performance Prediction How to predict performance?  Different levels of fidelity  Processor model: User supplied timing expression Wallclock time Performance counters Instruction level simulation Not supported yet  Network model: Simple latency-based network model Contention-based network simulation

Charm++ Workshop How to Ensure Simulation Accuracy The idea:  Take advantage of inherent determinacy of an application  Don’t need rollback - same user function then is executed only once  In case of out of order delivery, only timestamps of events are adjusted

Charm++ Workshop T(e1) T(e2) T ” (e1) Original Timeline Incorrect Updated Timeline T(e2) T ’’’ (e1) Correct Updated Timeline LEGEND: getStripFromRight (e1) getStripFromLeft (e2) doWork Timestamp Correction (Jacobi1D)

Charm++ Workshop Structured Dagger (Jacobi1D) entry void jacobiLifeCycle() { for (i=0; i<MAX_ITER; i++) { atomic {sendStripToLeftAndRight();} overlap { when getStripFromLeft(Msg *leftMsg) { atomic { copyStripFromLeft(leftMsg); } } when getStripFromRight(Msg *rightMsg) { atomic { copyStripFromRight(rightMsg); } } } atomic{ doWork(); /* Jacobi Relaxation */ } }

Charm++ Workshop Timestamp correction Needed for out-of-order message delivery  Two messages are not executed in the order of their timestamps Need to capture event dependency  Use structured dagger  Only timestamp needs to be changed, no need to execute same function twice

Charm++ Workshop Structured Dagger Express order of message passing Four categories of control structures are provided for expressing dependencies:  When-Block  Ordering construct Overlap  Conditional and Looping Constructs If construct While, for/forall construct  Atomic Construct

Charm++ Workshop Sequential time - BgElapse BgElapse entry void jacobiLifeCycle() { for (i=0; i<MAX_ITER; i++) { atomic {sendStripToLeftAndRight();} overlap { when getStripFromLeft(Msg *leftMsg) { atomic { copyStripFromLeft(leftMsg); } } when getStripFromRight(Msg *rightMsg) { atomic { copyStripFromRight(rightMsg); } } } atomic{ doWork(); BgElapse(10e-3);} }

Charm++ Workshop Sequential Time – using Wallclock Wallclock measurement of the time can be used via a suitable multiplier (scale factor) Run application with +bgwalltime and +bgcpufactor, or +bgconfig./bgconfig: timing walltime cpufactor 0.7 Good for predicting a larger machine using a fraction of the machine

Charm++ Workshop Sequential Time – performance counters Count floating-point, integer, memory and branch instructions (for example) with hardware counters  with a simple heuristic, use the expected time for each of these operations on the target machine to give the predicted total computation time. Cache performance and the memory footprint effects can be approximated by percentage of memory accesses and cache hit/miss ratio. Perfex and PAPI are supported Example of use, for a floating-point intensive code: +bgconfig./bg_config timing counter fpfactor 5e-7

Charm++ Workshop Simple Network Model No contention modeling  Latency and topology based Built-in network models for  Quadrics (Lemieux)  Blue Gene/C  Blue Gene/L

Charm++ Workshop Choose Network Model at Run-time Command line option:  +bgnetwork bluegenel BigSim config file:  +bgconfig./bg_config network bluegenel

Charm++ Workshop How to Add a New Network Model Inherit from this base class defined in blue_network.h: class BigSimNetwork { protected: double alpha; // cpu overhead of sending a message char *myname; // name of this network public: inline double alphacost() { return alpha; } inline char *name() { return myname; } virtual double latency(int ox, int oy, int oz, int nx, int ny, int nz, int bytes) = 0; virtual void print() = 0; };

Charm++ Workshop How to Obtain Predicted Time BgGetTime()  Print to stdout is not useful actually  Because the printed time at execution time is not final.  Final timestamp can only be obtained after timestamp correction (simulation) finishes.

Charm++ Workshop How to Obtain Predicted Time (cont.) BgPrint (char *)  Bookmarking events  E.g. BgPrint(“start at %f\n”);  Output to bgPrintFile.0 when simulation finishes Look back these bookmarks Replace “%f” with the committed time

Charm++ Workshop Running Applications with Simulator Two modes  With simple network model (timestamp correction) +bgcorrect  Partial prediction only (no timestamp correction) +bglog Generate trace logs for post-mortem simulation

Charm++ Workshop With bgconfig +bgconfig./bg_config x 64 y 32 z 32 cth 1 wth 1 stacksize 4000 timing walltime #timing bgelapse #timing counter cpufactor 1.0 #fpfactor 5e-7 traceroot /tmp log yes correct no network bluegene

Charm++ Workshop BigSim Trace Log Execution of messages on each target processor is stored in trace logs (binary format)  named bgTrace[#], # is simulating processor number. Can be used for  Visualization/Performance study  Post-mortem simulation with different network models Loadlog tool  Binary to human readable ascii format conversion  charm/examples/bigsim/tools/loadlog

Charm++ Workshop ASCII Log Sample [22] 0x80a7a60 name:msgep (srcnode:0 msgID:21) ep:1 [[ recvtime: startTime: endTime: ]] backward: forward: [0x80a7af0 23] [23] 0x80a7af0 name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime: startTime: endTime: ]] msgID:3 sent: recvtime: dstPe:7 size:208 msgID:4 sent: recvtime: dstPe:1 size:208 backward: [0x80a7a60 22] forward: [0x80a7ca8 24] [24] 0x80a7ca8 name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime: startTime: endTime: ]] backward: [0x80a7af0 23] forward: [0x80a7dc8 25] [0x80a ]

Charm++ Workshop Example (Jacobi1D) cd charm/examples/bigsim/sdag/jacobi-no-redn Make Bgconfig: x 4 y 2 z 2 cth 1 wth 1 stacksize #timing walltime timing bgelapse #timing counter cpufactor 1.0 traceroot. log yes correct yes network lemieux #projections 2,4-8

Charm++ Workshop Output./charmrun +p4./jacobi bgconfig./bg_config Reading Bluegene Config file./bg_config... BG info> Simulating 4x2x2 nodes with 1 comm + 1 work threads each. BG info> Network type: lemieux. bandwidth: e+08; alpha: e-06. BG info> cpufactor is BG info> floating point factor is BG info> BG stack size: bytes. BG info> Using BgElapse calls for timing method. BG info> Generating timing log. BG info> bgTrace root is.//. Iter starts: Iteration: 1 Iter starts: Iteration: 2 Iter starts: Iteration: 3 Numfin=1, total=32, Pes = 16 Numfin=2, total=32, Pes = 16 Finished in s s/step and iters is 3 [0] Number is numX:4 numY:2 numZ:2 numCth:1 numWth:1 numPes:4 totalProcs:16 [3] Wrote to disk for 4 BG nodes. [2] Wrote to disk for 4 BG nodes. [1] Wrote to disk for 4 BG nodes. [0] Wrote to disk for 4 BG nodes. BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds!

Charm++ Workshop Example (AMPI CJacobi3D) cd charm/examples/ampi/Cjacobi3D Make Bgconfig: x 2 y 2 z 1 cth 1 wth 1 stacksize timing walltime #timing bgelapse #timing counter cpufactor 1.0 traceroot. log yes correct yes network lemieux #projections 2,4-8

Charm++ Workshop Output (using BgPrint)./charmrun +p3 jacobi vp8 +bgconfig./bg_config +bgelapse Reading Bluegene Config file./bg_config... BG info> Simulating 2x2x1 nodes with 1 comm + 1 work threads each. BG info> Network type: lemieux. bandwidth: e+08; alpha: e-06. BG info> cpufactor is BG info> BG stack size: bytes. BG info> Using BgElapse for timing method. BG info> Generating timing log. BG info> Perform timestamp correction. BG info> bgTrace root is.//. interation starts at interation starts at interation starts at interation starts at interation starts at interation starts at interation starts at interation starts at interation starts at interation starts at BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds!

Charm++ Workshop Final Predictions (using BgPrint) clarity>cat bgPrintFile.0 [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at [0] interation starts at

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop Postmortem Simulation Run application once, get trace logs, and run simulation with logs for a variety of network configurations Implemented on POSE simulation framework

Charm++ Workshop How to Obtain Predicted Time Use BgPrint(char *) in similar way  Each BgPrint() called at execution time in online execution mode is stored in BgLog as a printing event In postmortem simulation, strings associated with BgPrint event is printed when the event is committed “%f” in the string will be replaced by committed time.

Charm++ Workshop Compile Postmortem Simulator Compile pose  Use normal charm++  cd charm/net-linux/tmp  make pose Compile HiSim simulator  cd charm/net-linux/pgms/pose/HiSim/BlueGene  make

Charm++ Workshop Example (AMPI CJacobi3D cont.) charm/net-linux/examples/pose/HiSim/tmp/BGHiSim 0 0 bgtrace: totalBGProcs=4 X=2 Y=2 Z=1 #Cth=1 #Wth=1 #Pes=3 Opts: netsim on: 0 Initializing POSE... POSE initialization complete. Using Inactivity Detection for termination. Starting simulation Info> timing factor e Info> invoking startup task from proc 0... [0:AMPI_Barrier_END] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at [0:RECV_RESUME] interation starts at Simulation inactive at time: Final GVT =

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop Big Network Simulator When message passing performance is critical and needs to consider network contention

Charm++ Workshop BigNetSim Overview Networks Design POSE Catalog of Network Simulations Building Running Configuration Modular HiSim  Mix and match architecture, topology, routing Using the Transceiver Extensibility

Charm++ Workshop Networks Direct Network Indirect Network

Charm++ Workshop HiSim Design packetization flit level (QsNet) routing topology contention latency Node Processor Switch Channel Virtual Channel Network Interface Routing protocol Components Features

Charm++ Workshop Implementation Post-Mortem Network simulators are Parallel Discrete Event Simulations  Parallel Object Simulation Environment (POSE)  Network layer constructs (NIC, Switch, Node, etc) implemented as poser simulation objects  Network data constructs (message, packet, etc) implemented as event methods on simulation objects

Charm++ Workshop POSE

Charm++ Workshop NetSims Direct  HiSim/Bluegene  SimpleNetSim  BigNetSim (original Bluegene network)  TCSim time stamp correction Indirect  QsNet for transceiver  QsNetTrace

Charm++ Workshop Building POSE POSE  cd charm ./build pose net-linux  options are set in pose_config.h stats enabled by POSE_STATS_ON=1 user event tracing TRACE_DETAIL=1 more advanced configuration options speculation checkpoints load balancing

Charm++ Workshop Building HiSim Build HiSim/Bluegene  cd net-linux/examples/HiSim/Bluegene  make  for sequential simulator make clean; make SEQUENTIAL=1  cd../tmp

Charm++ Workshop Running charmrun +p4 BGHiSim 1 1 Parameters  First parameter controls detailed network simulation 1 will use the detailed model 0 will use simple latency  Second parameter controls simulation skip 1 will skip forward to the time stamp set during trace creation 0 if not set or network startup interesting

Charm++ Workshop Configuring HiSim USE_TRANSCEIVER 0 For network analysis ignore trace and generate random traffic NUM_NODES 0 Number of nodes, taken from trace file or set for transceiver MAX_PACKET_SIZE 256 Maximum packet size SWITCH_VC 4 The number of switch virtual channels SWITCH_PORT 8 Number of ports in switch, calculated automatically for direct networks SWITCH_BUF 1024 Size in memory of each virtual channel CHANNELBW 1.75 Bandwidth in 100 MB/s CHANNELDELAY 9 Delay in 10 ns. So 9 => 90ns RECEPTION_SERIAL 0 Used for direct networks where reception FIFO access has to be serialized INPUT_SPEEDUP 8 Used to limit simultaneous access by VC in a port. Should be less than or equal to number of VC. Currently used only for bluegene. ADAPTIVE_ROUTING 0 Additional flag to use adaptive/deterministic routing COLLECTION_INTERVAL Collection * 10ns gives statistics bin size DISPLAY_LINK_STATS 0 Display statistics for each link DISPLAY_MESSAGE_DELAY 0 Display message delay statistics

Charm++ Workshop Output Completion time for trace run Turn on -tproj to get simple updated trace of network performance POSE trace for projections output  limited value to end user Coming soon: projections output displaying user events in simulation time (like BigSim)

Charm++ Workshop HiSim Modules Build your own interconnect  Architecture Bluegene HypCubeArch OB Output Buffered IB Input Buffered RedStorm (incomplete)  Topology 3DMesh FatTree Torus HyperCube  Routing Adaptive Static  Hybrid Fat Tree + Torus

Charm++ Workshop Transceiver Generate traffic patterns instead of using trace files  additional command line parameters Pattern Frequency Pattern 1 kshift 2 ring 3 bittranspose 4 bitreversal 5 bitcomplement 6 poisson Frequency 0 linear 1 uniform 2 exponential

Charm++ Workshop HiSim Design

Charm++ Workshop HiSim: Data Flow

Charm++ Workshop HiSim API: Extensibility

Charm++ Workshop Future Projections trace log of user events in simulation time. Hybrid Networks: both direct and indirect Improved scalability  adaptive strategies  load balancing Collective communication Representative collection of netconfig files

Charm++ Workshop Case Study - NAMD Molecular Dynamics Simulation Applications Compile BigSim Charm++: ./build bluegene net-linux bluegene Compile NAMD:  Get source code from: ./config fftw Linux-i686-g++

Charm++ Workshop Validation with Simple Network Model Processors Actual time (ms) Predicted time (ms) NAMD Apo-Lipoprotein A1 with 92K atom. Performance simulation using 8 Lemieux processors

Charm++ Workshop Network Communication Pattern Analysis NAMD with apoa1 15 timestep

Charm++ Workshop Network Communication Pattern Analysis Data transferred (KB) in a single time step

Charm++ Workshop Contention Encountered by Messages

Charm++ Workshop Outline Overview BigSim Emulator Charm++ on the Emulator Simulation framework  Online mode simulation  Post-mortem simulation  Network simulation (Eric) Performance analysis/visualization

Charm++ Workshop Performance Analysis/Visualization trace-projections is available for BigSim One challenge:  Number of log files can be overwhelming

Charm++ Workshop Generate Projections Logs Link application with –tracemode projections Select subset of processors in bgconfig: projections 0-100,2000, With timestamp correction, two sets of projections logs are generated  Before and after timestamp correction

Charm++ Workshop Generate Projections Logs (the hideous secret) Problem:  Projections tracing function maintains a fix sized buffer for storing projections logs  Buffer is flushed to disk when it is filled up, disk I/O can effect predicted time Solution:  Use +logsize runtime option to provide large projections buffer size In fact, in online mode simulation, simulation aborts when disk I/O occurs.

Charm++ Workshop Projections with Jacobi cd charm/examples/bigsim/sdag/jacobi-no-redn./charmrun +p4./jacobi bgconfig./bg_config Config file: x 32 y 16 z 16 cth 1 wth 1 stacksize #timing walltime timing bgelapse #timing counter cpufactor 1.0 fpfactor 5e-7 traceroot. log yes correct yes network lemieux projections 0,1000,

Charm++ Workshop

Charm++ Workshop Make bgtest With 16 processors

Charm++ Workshop Performance Analysis Tool: Projections

Charm++ Workshop

Charm++ Workshop Projections with Postmortem Simulation

Charm++ Workshop Thank You! Free download of Charm++ and BigSim at Send comments to