1Charm++ Workshop 2009 BigSim Tutorial Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois.

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

MPI Message Passing Interface
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.
Router Architecture : Building high-performance routers Ian Pratt
Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.
Processes CSCI 444/544 Operating Systems Fall 2008.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 11: Monitoring Server Performance.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
WSN Simulation Template for OMNeT++
Cambodia-India Entrepreneurship Development Centre - : :.... :-:-
Operating Systems Concepts 1. A Computer Model An operating system has to deal with the fact that a computer is made up of a CPU, random access memory.
Connecting LANs, Backbone Networks, and Virtual LANs
BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng Gunavardhan Kakulapati Laxmikant V. Kale University.
Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı
Chapter 3.1:Operating Systems Concepts 1. A Computer Model An operating system has to deal with the fact that a computer is made up of a CPU, random access.
Toolbox for Dimensioning Windows Storage Systems Jalil Boukhobza, Claude Timsit 12/09/2006 Versailles Saint Quentin University.
Hands-On Microsoft Windows Server 2008
1 Computer System Overview Chapter 1. 2 n An Operating System makes the computing power available to users by controlling the hardware n Let us review.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
FINAL MPX DELIVERABLE Due when you schedule your interview and presentation.
Eric Keller, Evan Green Princeton University PRESTO /22/08 Virtualizing the Data Plane Through Source Code Merging.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
Charm++ Workshop 2008 BigSim Tutorial Presented by Eric Bohm Charm++ Workshop 2008 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
CHAPTER TEN AUTHORING.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
The Cosmic Cube Charles L. Seitz Presented By: Jason D. Robey 2 APR 03.
BigSim Tutorial Celso L. Mendes - Ryan Mokos Department of Computer Science University of Illinois at Urbana-Champaign
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
1Charm++ Workshop 2010 The BigSim Parallel Simulation System Gengbin Zheng, Ryan Mokos Charm++ Workshop 2010 Parallel Programming Laboratory University.
Lab 2 Parallel processing using NIOS II processors
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
McGraw-Hill©The McGraw-Hill Companies, Inc., 2004 Connecting Devices CORPORATE INSTITUTE OF SCIENCE & TECHNOLOGY, BHOPAL Department of Electronics and.
1 Computer Systems II Introduction to Processes. 2 First Two Major Computer System Evolution Steps Led to the idea of multiprogramming (multiple concurrent.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
Using BigSim to Estimate Application Performance Ryan Mokos Parallel Programming Laboratory University of Illinois at Urbana-Champaign October 19, 2010.
MPI and OpenMP.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.
BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2004 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.
Performance analysis of a Pose application -- BigNetSim Nilesh Choudhury.
3/12/2013Computer Engg, IIT(BHU)1 MPI-1. MESSAGE PASSING INTERFACE A message passing library specification Extended message-passing model Not a language.
3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,
BigSim Tutorial Presented by Eric Bohm LACSI Charm++ Workshop 2005 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Projections - A Step by Step Tutorial By Chee Wai Lee For the 2004 Charm++ Workshop.
1 Creating Simulations with POSE Terry L. Wilmarth, Nilesh Choudhury, David Kunzman, Eric Bohm Parallel Programming Laboratory University of Illinois at.
ECE 456 Computer Architecture Lecture #9 – Input/Output Instructor: Dr. Honggang Wang Fall 2013.
Embedded Real-Time Systems Processing interrupts Lecturer Department University.
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
BigNetSim Tutorial Presented by Gengbin Zheng & Eric Bohm Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Introduction to Operating Systems Concepts
Processes and threads.
Chapter 3: Process Concept
William Stallings Computer Organization and Architecture 8th Edition
Computer Engg, IIT(BHU)
MPI Message Passing Interface
Introduction to SimpleScalar
Performance Evaluation of Adaptive MPI
Projections Overview Ronak Buch & Laxmikant (Sanjay) Kale
BigSim: Simulating PetaFLOPS Supercomputers
Parallel Programming in C with MPI and OpenMP
Support for Adaptivity in ARMCI Using Migratable Objects
COMP755 Advanced Operating Systems
Emulating Massively Parallel (PetaFLOPS) Machines
SPL – PS1 Introduction to C++.
MPI Message Passing Interface
Presentation transcript:

1Charm++ Workshop 2009 BigSim Tutorial Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois at Urbana-Champaign 1

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 2

3Charm++ Workshop 2009 Simulation-based Performance Prediction Extremely large parallel machines are being built with enormous compute power Very large number of processors with petaflops level peak performance Are existing software environments ready for these new machines? How to write a peta-scale parallel application? What will be the performance like? Can these applications scale? 3

4 BigSim Infrastructure BigSim for whole-system simulation of a large parallel machine. Goal: Support early application development and identification of performance bottlenecks. What BigSim can do: An execution environment that can run both Charm++ and MPI applications on large scale target machines No or small changes to MPI application source codes. facilitate code development and debugging Predict parallel performance at varying levels of resolution Tune/scale performance Machine vendors designing future machines Charm++ Workshop 20094

5 BigSim Components BigSim Emulator Run AMPI/Charm++ on emulator Capture computation and communication information Parallel: Each physical processor is used to emulate multiple target processors, leveraging Charm++’s virtualization support BigSim Simulator PDES, Network contention Produce performance data in a format compatible with the Projections graphical browser 5

6Charm++ Workshop 2009 Charm++/MPI applications Simulation output trace logs BigNetSim (POSE)‏ Network Simulator Performance visualization (Projections)‏ BigSim Emulator AMPI Runtime Instruction Sim (RSim, IBM,..)‏ Simple Network Model Performance counters Load Balancing Module Offline PDES Architecture of BigSim 6 Charm++ Runtime

What BigSim Can not Do BigSim Itself does not predict cycle-accurate timing (needs instruction-level simulation) does not predict cache effect, virtual memory does not model O.S. jitter Charm++ Workshop 20097

Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 8

9 BigSim Emulator Emulate full machine on existing parallel machines Actually run a parallel program E.g. multi-million objects on 128K target processors Emulator is implemented on Charm++ Libraries that link to user application Simple architecture abstraction Many multiprocessor (SMP) nodes connected via message passing Charm++ Workshop 20099

10Charm++ Workshop 2009 BigSim Emulator: functional view Affinity message queues Communication processors Worker processors inBuf f Non-affinity message queues Correctio nQ Converse scheduler Converse Q Affinity message queues Communication processors Worker processors inBuf f Non-affinity message queues Correctio nQ Real Processor Target Node 10

11 Install BigSim Emulator Download Charm++ v ml Compile Charm++/AMPI with “bigemulator” option:./build AMPI net-linux-x86_64 bigemulator –O This builds charm++ and emulator libraries under net-linux-x86_64-bigemulator Compiler wrapper for MPI applications: charm/net-linux-x86_64-bigemulator/bin/mpicc, mpicxx, mpif90, etc Charm++ Workshop

12 Prepare MPI Applications Make sure applications are AMPI-complaint Adaptive MPI – an implementation of MPI standard on Charm++ Multithreaded Changes that may be needed: Fortran: Program Main => Program MPI_Main Handle global/static variables Manual: group globals into a big structure, and allocate on heap Semi-automatic: use thread local storage Int static __thread var; Automatic: -swapglobals compiler option (ELF binaries) Only handles globals, not statics Charm++ Workshop

13 Ring Example (ring.c) #include "mpi.h" #define TIMES 10 #if CMK_BLUEGENE_CHARM extern void BgPrintf(const char *); #define BGPRINTF(x) if (myid == 0) BgPrintf(x); #else #define BGPRINTF(x) #endif Int value = 0; int main(int argc, char *argv[]) { int myid, numprocs, i; double time; MPI_Status status; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); time = MPI_Wtime(); BGPRINTF("Start of major loop at %f \n"); for (i=0; i<TIMES; i++) { if (myid == 0) { MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD); MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status); } else { MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status); value += myid; MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD); } BGPRINTF("End of major loop at %f \n"); if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time); MPI_Finalize(); } Charm++ Workshop

14 Ring Example (AMPI-complaint) #include "mpi.h" #define TIMES 10 #if CMK_BLUEGENE_CHARM extern void BgPrintf(const char *); #define BGPRINTF(x) if (myid == 0) BgPrintf(x); #else #define BGPRINTF(x) #endif int main(int argc, char *argv[]) { int myid, numprocs, I, value=0; double time; MPI_Status status; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); time = MPI_Wtime(); BGPRINTF("Start of major loop at %f \n"); for (i=0; i<TIMES; i++) { if (myid == 0) { MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD); MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status); } else { MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status); value += myid; MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD); } BGPRINTF("End of major loop at %f \n"); if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time); MPI_Finalize(); } Charm++ Workshop Charm++ Workshop 2009

15Charm++ Workshop 2009 How to Compile and Run MPI Applications for the Emulator Compile with AMPI and emulator charm/net-linux-x86_64-bigemulator/mpicc –o ring./ring.c with performance trace module: charm/net-linux-x86_64-bigemulator/mpicc –o ring./ring.c –tracemode projections Run: Use mpirun provided by AMPI Give number of target processors as well number of real processors Define target machine Command line options +x +y +z +cth +wth E.g. mpirun –np 4./ring +x10 +y10 +z10 +cth2 +wth4 Or, use Config file mpirun –np 4./ring +bgconfig config 15

16Charm++ Workshop 2009 Bgconfig File Format +bgconfig./bg_config x 10 y 10 z 10 cth 2 wth 4 stacksize 4000 timing walltime #timing bgelapse #timing counter #cpufactor 1.0 fpfactor 5e-7 traceroot /tmp log yes correct no network bluegene 16

17Charm++ Workshop 2009 Ring Std Output Justice> mpirun –np 4./pgm +bgconfig./bg_config Reading Bluegene Config file./bg_config... BG info> Simulating 8x1x1 nodes with 1 comm + 1 work threads each. BG info> Network type: ibmpower. alpha: e-06 bandwidth : e+09. BG info> cpufactor is BG info> floating point factor is BG info> BG stack size: bytes. BG info> Using WallTimer for timing method. BG info> Generating timing log. BG info> bgTrace root is './'. LB> Load balancer ignores processor background load. Start of major loop at End of major loop at Sum=280, Time= [0] Number is numX:8 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:4 totalWorkerProcs:8 bglog_ver:5 [2] Wrote to disk for 2 BG nodes. [3] Wrote to disk for 2 BG nodes. [1] Wrote to disk for 2 BG nodes. [0] Wrote to disk for 2 BG nodes. BG> BlueGene emulator shutdown gracefully! BG> Emulation took seconds! 17

18 Ring Output Files Justice> ls -l -rwxr-xr-x 1 gzheng kale :03 ring -rw-r--r-- 1 gzheng kale :04 pgm.sts -rw-r--r-- 1 gzheng kale :04 pgm.projrc -rw-r--r-- 1 gzheng kale :04 pgm.7.log -rw-r--r-- 1 gzheng kale :04 pgm.6.log -rw-r--r-- 1 gzheng kale :04 pgm.5.log -rw-r--r-- 1 gzheng kale :04 pgm.4.log -rw-r--r-- 1 gzheng kale :04 pgm.3.log -rw-r--r-- 1 gzheng kale :04 pgm.2.log -rw-r--r-- 1 gzheng kale :04 pgm.1.log -rw-r--r-- 1 gzheng kale :04 pgm.0.log -rw-r--r-- 1 gzheng kale :04 bgTrace3 -rw-r--r-- 1 gzheng kale :04 bgTrace2 -rw-r--r-- 1 gzheng kale :04 bgTrace1 -rw-r--r-- 1 gzheng kale :04 bgTrace0 -rw-r--r-- 1 gzheng kale :04 bgTrace Charm++ Workshop 2009 Only 4 files 8 files

19Charm++ Workshop 2009 What is in the Trace Logs? Traces for 2 target processors Each SEB has: startTime, endTime Incoming Message ID Outgoing messages Dependences 19 Tools for reading bgTrace binary files: 1.charm/example/bigsim/tools/loadlog Convert to human-readable format 2.charm/example/bigsim/tools/log2proj Convert to trace projections log files

20 Ring Projections Timeline Charm++ Workshop 2009

Performance Prediction How to predict sequential performance? Different levels of fidelity: User supplied timing expression Wall clock time Performance counters Instruction level simulation Charm++ Workshop

22Charm++ Workshop 2009 Sequential Time - BgElapse BgElapse Manually advance processor time MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status); value += myid;... BgElapse( ); MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD); Run with +bgelapse 22

23Charm++ Workshop 2009 Sequential Time – using Wallclock Wallclock measurement of the time can be used via a suitable multiplier (scale factor)‏ T * factor Run application with +bgwalltime and +bgcpufactor, or +bgconfig./bgconfig: timing walltime cpufactor 0.7 Good for predicting a larger machine using a fraction of the machine 23

24Charm++ Workshop 2009 Sequential Time – Performance Counters Count floating-point, integer, memory and branch instructions (for example) with hardware counters Derive these hardware counters to expected time on target machine. Cache performance and the memory footprint effects can be approximated by percentage of memory accesses and cache hit/miss ratio. Example of use, for a floating-point intensive code: +bgconfig./bg_config timing counter fpfactor 5e-7 Perfex and PAPI are supported 24

25Charm++ Workshop 2009 Sequential Time – Instruction level simulation Run instruction-level simulator separately to get accurate timing information‏ Issues: It is a different third-party hardware simulator Hard to integrate with BigSim Sequential Does not model communication Slow! 25

Interpolation BigSim and instruction-level simulator interact through logs Reduce the problem size by sampling: An interpolation- based scheme Run a smaller sized problem, or Run just one processor Assume computation can be modelled by a set of parameters: T C = F n (p 1, p 2, p 3,...) Use sample data from the instruction-level simulation to interpolate large dataset With sampling data, do a least-squares fit to determine the coefficients of an approximation polynomial function Charm++ Workshop

27Charm++ Workshop 2009 Case study: BigSim / Mambo void func( )‏ { startTraceBigSim( )‏ … endTraceBigSim( )‏ } Mambo BigSim Parallel Emulation Cycle-accurate prediction of sequential blocks on POWER7 processor Parameter files for sequential blocks Trace files Interpolation Adjusted trace files Prediction for Target System + Replace sequential timing BigSim Parallel Simulation 27

Ring Example MPI_Recv(&value,1,MPI_INT,myid- 1,999,MPI_COMM_WORLD,&status); startTraceBigSim(); value += myid; endTraceBigSim(); char param[128]; sprintf(param, “sum %d”, myid); tagTraceBigSim(param); MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs, 999,MPI_COMM_WORLD); Charm++ Workshop

Output Files justice>ls -l total rw-r--r-- 1 gzheng kale :08 bgTrace -rw-r--r-- 1 gzheng kale :08 bgTrace0 -rw-r--r-- 1 gzheng kale :08 bgTrace1 -rwxr-xr-x 1 gzheng kale :46 charmrun* -rw-r--r-- 1 gzheng kale :08 param.0 -rw-r--r-- 1 gzheng kale :08 param.1 -rw-r--r-- 1 gzheng kale :08 param.2 -rw-r--r-- 1 gzheng kale :08 param.3 -rw-r--r-- 1 gzheng kale :08 param.4 -rw-r--r-- 1 gzheng kale :08 param.5 -rw-r--r-- 1 gzheng kale :08 param.6 -rw-r--r-- 1 gzheng kale :08 param.7 -rwxr-xr-x 1 gzheng kale :07 ring* -rw-r--r-- 1 gzheng kale :07 ring.C Charm++ Workshop justice>cat param.7 48 sum 7

Run ring Through Instruction-level Simulator Compile normal version of ring (not emulator) Run sequentially through an instruction- level simulator Sample line of Mambo output: : ( ): TRACE_END: sum 7 Charm++ Workshop

Compile and Run Interpolation Tool Install GSL, the GNU Scientific Library cd charm/examples/bigsim/tools/rewritelog Modify the file interpolatelog.C to match your particular tastes. OUTPUTDIR specifies a directory for the new logfiles CYCLE_TIMES_FILE specifies the file which contains accurate timing information Make Run interpolation tool under bgTrace dir:./interpolatelog Charm++ Workshop

32Charm++ Workshop 2009 Interpolation Tool Rewrites SEB Durations Replace the duration of a portion of each SEB with known exact times recorded in a execution or cycle-accurate simulator Scale begin/end portions by a constant factor Message send points are linearly mapped into the new times 32

Record/Replay Record only a subset of special logs when running full size emulation With the special logs, replay the execution of a particular target processor through hardware simulator Example:./pgm +x y 1 +z1 +bgrecord +bgrecordprocessors :1024./pgm +bgreplay Charm++ Workshop

34Charm++ Workshop 2009 Out-of-core Emulation Motivation Applications with large memory footprint VM system can not handle well Use hard drive Similar to checkpointing Message driven execution Peek msg queue => what execute next? (prefetch)‏ 34

35Charm++ Workshop 2009 Overview of the idea 35

36Charm++ Workshop 2009 Using Out-of-core Change bigsim configuration file: Charm/tmp/Conv-mach-bigemulator.h #define BIGSIM_OUT_OF_CORE 1 Recompile Charm++ and application Run the application through the emulator, with an addintional command line option: +bgooc

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 37

Charm++ Workshop 2009 Postmortem Simulation Run application once, get trace logs, and run simulation with logs for a variety of network configurations Big Network Simulator (BigNetSim) implemented on POSE simulation framework Particularly useful when message passing performance is critical and strongly affected by network contention Note: BigSim emulator and BigSim simulator both use same network models for latency-only calculations located in charm/src/langs/bluegene/bigsim_network.h 38

Charm++ Workshop 2009 Implementation Post-Mortem Network simulators are Parallel Discrete Event Simulations Parallel Object Simulation Environment (POSE)‏ Network layer constructs (NIC, Switch, Node, etc.) implemented as poser simulation objects Network data constructs (message, packet, etc.) implemented as event methods on simulation objects 39

Charm++ Workshop 2009 POSE 40

Charm++ Workshop 2009 Terms Several network models available Specific: e.g., BlueGene Latency-only model – does not account for contention Network contention model Generic: Simple Latency Model – uses a simple equation for determining message transmission time Emulating processors – physical processors on which emulation is run (+p?) Simulating processors – physical processors on which simulation (BigNetSim) is run (+p?) Target processors – virtual (or simulated) processors on which emulation and simulation are run (+vp?) 41

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 42

Charm++ Workshop 2009 BigNetSim Build Flow Download and compile charm Compile POSE Compile bigsim Download BigNetSim Compile BigNetSim Run simulator Output 43

Charm++ Workshop 2009 Download and compile charm (if not done already) Download the latest version of charm from the PPL archives: Compile charm cd charm./build charm++ net-linux 44

Charm++ Workshop 2009 Compile POSE cd charm./build pose net-linux options are set in pose_config.h stats enabled by POSE_STATS_ON=1 user event tracing TRACE_DETAIL=1 more advanced configuration options speculation checkpoints load balancing 45

Charm++ Workshop 2009 Compile bigsim cd charm/net-linux/tmp make bigsim 46

Charm++ Workshop 2009 Download BigNetSim Download latest revision from repository: svn co Directory structure: BigNetSim/trunk/ BlueGene/ RedStorm/ and others - network models SimpleLatency/ - Simple Latency Model Topology/ Routing/ InputVcSelection/ OutputVcSelection/ - network configuration choices Main/ - main simulation files tools/ - tools directory tmp/ - working directory created during build 47

Charm++ Workshop 2009 Compile BigNetSim Fix BigNetSim/trunk/Makefile.common so CHARMBASE points to your charm directory For the Simple Latency Model: cd BigNetSim/trunk/SimpleLatency For parallel simulator: make For sequential simulator (runs only on 1 simulating processor): make SEQUENTIAL=1 48

Charm++ Workshop 2009 Run Simulator cd BigNetSim/trunk/tmp Copy bgTrace files into /tmp directory For parallel build, run with:./charmrun +p4 bigsimulator -lat 1 -bw 1 For sequential build, run with:./bigsimulator -lat 1 -bw 1 49

Charm++ Workshop 2009 Output Simulation completion time Specified in “GVT ticks” (GVT = Global Virtual Time) GVT tick length is determined by the value of #define factor in BigNetSim/trunk/Main/TCsim.h Divide final GVT by factor to get simulation time in seconds factor = 1e8 => 1 tick = 10ns factor = 1e9 => 1 tick = 1ns 50

Charm++ Workshop 2009 Output (continued) Use BgPrint(char *) in source code to print event times Each BgPrint() called at execution time in online execution mode is stored in trace log as a printing event In postmortem simulation, strings associated with BgPrint() events are printed when the event is committed “%f” in the string will be replaced by committed time Useful for determining iteration times during simulation as well as emulation 51

Charm++ Workshop 2009 Output (continued) Projections Copy emulation Projections logs and sts file into BigNetSim/trunk/tmp Two ways to use: Command-line parameter: -projname Creates a new set of logs by updating the emulation logs Assumes emulation Projections logs are:.*.log Output: -bg.*.log Disadvantage: emulation Projections overhead included Command-line parameter: -tproj Creates a new set of logs from the trace files, ignoring the emulation logs Must first copy.sts file to tproj.sts Output: tproj.*.log Advantage: no emulation Projections overhead included 52

Charm++ Workshop 2009 Ring Example./bigsimulator -lat 1 -bw 1 Charm++: standalone mode (not using charmrun) Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work! Charm++> cpu topology info is being gathered! Charm++> 1 unique compute nodes detected! bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1 Opts: netsim on: 0 Initializing POSE... POSE initialization complete. Using Inactivity Detection for termination. netsim skip_on 0 0 Info> timing factor e Info> invoking startup task from proc 0... [0:RECV_RESUME] Start of major loop at [0:RECV_RESUME] End of major loop at Simulation inactive at time: Final GVT = PE Simulation finished at Program finished. 53

Charm++ Workshop 2009 Projections - Ring Example Emulation Simulation: -lat 1 (latency = 1  s) generated with -tproj 54

Charm++ Workshop 2009 Projections - Ring Example Simulation: -lat 1 (latency = 1  s) generated with -tproj Simulation: -lat 20 (latency = 20  s) generated with -tproj 55

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 56

Charm++ Workshop 2009 Simple Latency Model Calculates message transmission time using a simple equation: lat + (N / bw) + [cpp * (N / psize)] where lat = latency in  s bw = bandwidth in GB/sec cpp = cost per packet in  s psize = packet size in bytes N = number of bytes sent Only lat and bw are required; cpp and psize are optional Does not account for contention or topology, except for intra-node vs. inter-node messages 57

Charm++ Workshop 2009 BigNetSim Design – Simple Latency Model BGnode BGproc Net Interface Switch Channel Transceiver 58

Charm++ Workshop 2009 Simple Latency Model Parameters No netconfig file used for Simple Latency Model; only command-line parameters Required command-line parameters: -lat : latency in  s (double) -bw : bandwidth in GB/sec (double) Optional parameters: -cpp : cost-per-packet in  s (double) -psize : packet size in bytes (int) -lat_in : intra-node latency in  s (double) If not specified, lat_in defaults to charmcost (0.5  s) -bw_in : intra-node bandwidth in GB/sec (double) If not specified, bw_in defaults to the value from -bw -winsize : window size in # of log entries (int) 59

Charm++ Workshop 2009 Simple Latency Model Parameters More optional parameters: -winthresh : threshold beyond which windows will not be kept when looking at forward dependencies; specified in # of windows (int) -print_params : echoes the parameters fed into the simulation -debuglevel : specify level of debugging info displayed 0: display nothing extra 1: display some basic data, including memory usage and timing stats 2: display function-level debugging messages 3: display all debugging messages -skip : skip to a predefined point specified in the emulation source code -profile : display data for forward dependent distance histogram 60

Charm++ Workshop 2009 Other Parameters Other useful parameters for all network models: -check : at the end of the simulation, print all uncommitted events for each target processor None should be displayed, except for perhaps event 0 on each target processor -projname : create Projections logs based on existing emulation logs -tproj : create Projections logs from trace files, ignoring existing emulation logs 61

Charm++ Workshop 2009 Window Size Parameter Specifies the window size used for reading the bgTrace logs incrementally Necessary for large traces Available for all network models: Simple Latency Model: use -winsize on command line Other models: specify FILE_WINDOW_SIZE in netconfig file Reduces memory footprint of simulation at the cost of increasing runtime Setting to 0 sets size to the whole timeline (no windowing) Next window is loaded when simulation needs to look at forward dependencies of a particular event Doesn’t help much if dependencies span most of a timeline => use window threshold parameter 62

Charm++ Workshop 2009 Window Size Memory Savings Depends a lot on trace log profiles Use -profile command-line option in Simple Latency Model to get a forward dependent distance histogram (in text format) 63

Charm++ Workshop 2009 Window Size Memory Savings - NAMD Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per time line, no thresholding (-winthresh 0) Note: Memory savings is good: save ~55% 64

Charm++ Workshop 2009 Window Size Memory Savings - MILC Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per time line, no thresholding (-winthresh 0) Note: Memory savings is bad: only ~3% 65

Charm++ Workshop 2009 Window Load Threshold Parameter Specifies a threshold, in number of event windows, beyond which the simulator will not keep a window in memory when looking at forward dependencies Available for all network models: Simple Latency Model: use -winthresh on command line Other models: specify WINDOW_LOAD_THRESHOLD in netconfig file Balancing required Too small => extra window loading and longer runtime Too large => more windows loaded and larger memory footprint Setting to 0 sets threshold to infinity (no thresholding) 66

Charm++ Workshop 2009 Threshold Usage Note: each box is an event window; lines represent forward dependents Without thresholding, the simulator loads and stores a new window when it looks at the backward dependents of an event’s forward dependents to see if it can queue it for execution. This is fine for time lines where the forward dependents are “close”: But when they’re “far away,” this could load most of the timeline, which won’t be discarded until the events are executed: Solution: define a threshold beyond which the windows aren’t saved. 67

Charm++ Workshop 2009 Threshold Time Savings - MILC Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per time line, with thresholding (-winthresh 1) Note: Memory savings is much better: save ~80% 68

Charm++ Workshop 2009 Debuglevel Example - Ring./bigsimulator -lat 1 -bw 1 -winsize 5 -debuglevel 1 Does this for each target processor: Beginning first pass on PE 0 [0] Max memory during init: (6.4 MB) [0] FileWindower: totalTlineLength=60 [0] First pass took seconds [0] Max memory during first pass: (1.2 MB) Beginning execution on PE 0 Shows execution updates: *** Entire first pass sequence took about seconds *** Execution update: at event 0 of 60 on PE 0 (run time = seconds) [0:RECV_RESUME] Start of major loop at [0:RECV_RESUME] End of major loop at Simulation inactive at time: Final GVT = Prints this at the end for each target processor: On proc 0: [0] Max memory during execution: (1.6 MB) [0] Max windows open: 2 [0] Open windows at completion: 0 [0] Window deletes: first pass=12 later=12 69

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 70

Charm++ Workshop 2009 Networks Direct Network Indirect Network 71

Charm++ Workshop 2009 BigNetSim Design BGnode BGproc Net Interface Switch Channel Transceiver 72

Charm++ Workshop 2009 BigNetSim: Network Data Flow 73

Charm++ Workshop 2009 Interconnection Networks Flexible Interconnection Network modeling: Choose from a variety of Topologies Routing Algorithms Input Virtual Channel Selection strategies Output Virtual Channel Selection strategies 74

Charm++ Workshop 2009 Topology Topologies available HyperCube; Mesh; generalized k-ary-n-mesh; n-mesh; Torus; generalized k-ary-n-cube; FatTree; generalized k-ary-n-tree; Low Diameter Regular graphs(LDR) Hybrid topologies HyperCube-Fattree; HyperCube-LDR; 75

Charm++ Workshop 2009 Network Modeling Routing models Virtual cut-through routing Contention Modeling Port contention at a Switch Load contention: available buffer at next layer of switches Adaptive and static Routing algorithms Minimal deadlock-free Non-minimal Fault-tolerant 76

Charm++ Workshop 2009 Routing Algorithms K-ary-N-mesh / N-mesh Direction Ordered; Planar Routing; Static Direction Reversal Routing Optimally Fully Adaptive Routing (modified too)‏ K-ary-N-tree UpDown (modified, non-minimal)‏ HyperCube Hamming P-Cube (modified too)‏ 77

Charm++ Workshop 2009 Input/Output VC selection Input Virtual Channel Selection Round Robin; Shortest Length Queue Output Buffer length Output Virtual Channel Selection Max. available buffer length Max. available buffer bubble VC Output Buffer length 78

Charm++ Workshop 2009 Configuring BigNetSim netconfig file parameters: USE_TRANSCEIVER 0 For network analysis ignore trace and generate random traffic NUM_NODES 0 Number of nodes, taken from trace file or set for transceiver MAX_PACKET_SIZE 256 Maximum packet size SWITCH_VC 4 The number of switch virtual channels SWITCH_PORT 8 Number of ports in switch, calculated automatically for direct networks SWITCH_BUF 1024 Size in memory of each virtual channel CHANNELBW 1.75 Bandwidth in 100 MB/s CHANNELDELAY 9 Delay in 10 ns. So 9 => 90ns RECEPTION_SERIAL 0 Used for direct networks where reception FIFO access has to be serialized INPUT_SPEEDUP 8 Used to limit simultaneous access by VC in a port. Should be less than or equal to number of VC. Currently used only for bluegene. ADAPTIVE_ROUTING 1 Additional flag to use adaptive/deterministic routing COLLECTION_INTERVAL Collection * 10ns gives statistics bin size DISPLAY_LINK_STATS 1 Display statistics for each link DISPLAY_MESSAGE_DELAY 1 Display message delay statistics FILE_WINDOW_SIZE 0 Window size for incremental log reading DEBUG_PRINT_LEVEL 1 Level of debugging messages to display WINDOW_LOAD_THRESHOLD 0 Threshold beyond which windows are discarded INTRA_NODE_LATENCY 0.5 Intra-node latency in  s INTRA_NODE_BANDWIDTH 1.0 Intra-node bandwidth in GB/sec 79

Charm++ Workshop Charm++ Workshop 2009 Running Specific Network Model Simulations./charmrun +p4 bigsimulator 1 1 Parameters First parameter selects detailed network simulation 1 will use the detailed model (w/ or w/o contention) 0 will use latency-only model Second parameter controls simulation skip 1 will skip forward to a time stamp set during trace creation 0 will not skip - use if no skips points were set in the emulation code or if network startup portion of the simulation is interesting 80

Charm++ Workshop 2009 Artificial Network Loads - Transceiver Generate traffic patterns instead of using trace files additional command line parameters Pattern Frequency Pattern 1 kshift 2 ring 3 bittranspose 4 bitreversal 5 bitcomplement 6 poisson Frequency 0 linear 1 uniform 2 exponential 81

Charm++ Workshop 2009 Outline Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow Generic network model: Simple Latency Model Specific network models Extensibility 82

Charm++ Workshop 2009 Adding a Network mkdir new subdir in trunk copy boilerplate InitNetwork.h copy boilerplate Makefile change MACHINE make variable to your dirname new InitNetwork.C Define switch, channel, nic mappings Define how switches route and select virtual channels Define topology and default routing 83

Charm++ Workshop 2009 Adding a Topology New *.h *.C in trunk/Topology constructor()‏ getNeighbours()‏ getNext()‏ getNextChannel()‏ getStartPort()‏ getStartVC()‏ getStartSwitch()‏ getStartNode()‏ getEndNode()‏ 84

Charm++ Workshop 2009 Adding a Routing Strategy New *.h *.C files in trunk/Routing constructor()‏ selectRoute()‏ populateRoute()‏ loadTable()‏ getNextSwitch()‏ sourceToSwitchRoutes()‏ 85

Charm++ Workshop 2009 Adding a VC Selector Either Input or Output VC Selector new *.h *C in [Input/Output]VCSelector constructor()‏ select[Input/Output]VC()‏ 86

Charm++ Workshop 2009 BlueWaters BlueWaters network model design currently in progress 87

Thank you! Free download of Charm++ and BigSim: Send questions and comments to: