BigSim Tutorial
Presented by Gengbin Zheng and Ryan Mokos
Charm++ Workshop 2009
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Outline
- Overview
- BigSim Emulator
- BigSim Simulator
- Post-mortem simulation
- BigNetSim build flow
- Generic network model: Simple Latency Model
- Specific network models
- Extensibility

Simulation-based Performance Prediction
- Extremely large parallel machines are being built with enormous compute power: very large numbers of processors with petaflops-level peak performance.
- Are existing software environments ready for these new machines?
- How do we write a peta-scale parallel application?
- What will its performance be like?
- Can these applications scale?

BigSim Infrastructure
- BigSim is for whole-system simulation of a large parallel machine.
- Goal: support early application development and identification of performance bottlenecks.
- What BigSim can do:
  - Provide an execution environment that runs both Charm++ and MPI applications on large-scale target machines, with no or small changes to MPI application source code
  - Facilitate code development and debugging
  - Predict parallel performance at varying levels of resolution
  - Tune and scale performance
  - Support machine vendors designing future machines

BigSim Components
- BigSim Emulator
  - Run AMPI/Charm++ applications on the emulator
  - Capture computation and communication information
  - Parallel: each physical processor is used to emulate multiple target processors, leveraging Charm++'s virtualization support
- BigSim Simulator
  - PDES, network contention
  - Produces performance data in a format compatible with the Projections graphical browser

Architecture of BigSim (figure)
Components shown: Charm++/MPI applications; Charm++ Runtime and AMPI Runtime; BigSim Emulator; Simple Network Model; performance counters; instruction simulators (RSim, IBM, ...); Load Balancing Module; simulation output trace logs; BigNetSim (POSE) network simulator; offline PDES; performance visualization (Projections).

What BigSim Cannot Do
BigSim itself:
- does not predict cycle-accurate timing (that needs instruction-level simulation)
- does not predict cache effects or virtual memory behavior
- does not model OS jitter

BigSim Emulator
- Emulates a full machine on existing parallel machines
- Actually runs a parallel program, e.g. multi-million objects on 128K target processors
- The emulator is implemented on Charm++
- Libraries that link to the user application
- Simple architecture abstraction: many multiprocessor (SMP) nodes connected via message passing

BigSim Emulator: functional view (figure)
Elements shown: a real processor with a Converse scheduler and Converse queue; target nodes, each with communication processors, worker processors, an inBuff, affinity message queues, non-affinity message queues, and a CorrectionQ.

Install BigSim Emulator
- Download Charm++ v6.1.2: http://charm.cs.uiuc.edu/download/downloads.shtml
- Compile Charm++/AMPI with the "bigemulator" option:
    ./build AMPI net-linux-x86_64 bigemulator -O
- This builds the Charm++ and emulator libraries under net-linux-x86_64-bigemulator
- Compiler wrappers for MPI applications: charm/net-linux-x86_64-bigemulator/bin/mpicc, mpicxx, mpif90, etc.

Prepare MPI Applications
- Make sure applications are AMPI-compliant
  - Adaptive MPI: an implementation of the MPI standard on Charm++
  - Multithreaded
- Changes that may be needed:
  - Fortran: Program Main => Program MPI_Main
  - Handle global/static variables
    - Manual: group globals into a big structure and allocate it on the heap
    - Semi-automatic: use thread-local storage, e.g. static __thread int var;
    - Automatic: -swapglobals compiler option (ELF binaries); only handles globals, not statics
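The manual option (grouping globals into a heap-allocated structure) can be sketched as follows. This is only an illustration under assumptions: `RankState`, its fields, and the two helper functions are hypothetical names, not part of AMPI.

```c
#include <stdlib.h>

/* Hypothetical per-rank state: everything that used to be a global or
 * static variable in the MPI program moves into this structure. */
typedef struct {
    int value;       /* was a file-scope global */
    int iterations;  /* was a static counter inside a function */
} RankState;

/* Allocate one private, zero-initialized copy on the heap. Each rank
 * (a thread under AMPI) would call this once and pass the pointer to
 * the routines that previously touched the globals. */
RankState *rank_state_create(void)
{
    RankState *s = malloc(sizeof *s);
    if (s != NULL) {
        s->value = 0;
        s->iterations = 0;
    }
    return s;
}

void rank_state_destroy(RankState *s)
{
    free(s);
}
```

Because every rank owns its own `RankState`, two ranks sharing one address space no longer overwrite each other's data, which is the problem the slide's three options address.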

Ring Example (ring.c)

    #include <stdio.h>
    #include "mpi.h"

    #define TIMES 10

    #if CMK_BLUEGENE_CHARM
    extern void BgPrintf(const char *);
    #define BGPRINTF(x) if (myid == 0) BgPrintf(x);
    #else
    #define BGPRINTF(x)
    #endif

    /* Global variable: must be handled before running under AMPI */
    int value = 0;

    int main(int argc, char *argv[])
    {
        int myid, numprocs, i;
        double time;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        time = MPI_Wtime();
        BGPRINTF("Start of major loop at %f \n");
        for (i = 0; i < TIMES; i++) {
            if (myid == 0) {
                /* Rank 0 starts each lap and receives the token back */
                MPI_Send(&value, 1, MPI_INT, myid+1, 999, MPI_COMM_WORLD);
                MPI_Recv(&value, 1, MPI_INT, numprocs-1, 999, MPI_COMM_WORLD, &status);
            } else {
                /* Other ranks add their id and pass the token on */
                MPI_Recv(&value, 1, MPI_INT, myid-1, 999, MPI_COMM_WORLD, &status);
                value += myid;
                MPI_Send(&value, 1, MPI_INT, (myid+1)%numprocs, 999, MPI_COMM_WORLD);
            }
        }
        BGPRINTF("End of major loop at %f \n");

        if (myid == 0)
            printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
        MPI_Finalize();
        return 0;
    }

Ring Example (AMPI-compliant)

    #include <stdio.h>
    #include "mpi.h"

    #define TIMES 10

    #if CMK_BLUEGENE_CHARM
    extern void BgPrintf(const char *);
    #define BGPRINTF(x) if (myid == 0) BgPrintf(x);
    #else
    #define BGPRINTF(x)
    #endif

    int main(int argc, char *argv[])
    {
        /* value is now a local variable, so each rank has its own copy */
        int myid, numprocs, i, value = 0;
        double time;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        time = MPI_Wtime();
        BGPRINTF("Start of major loop at %f \n");
        for (i = 0; i < TIMES; i++) {
            if (myid == 0) {
                MPI_Send(&value, 1, MPI_INT, myid+1, 999, MPI_COMM_WORLD);
                MPI_Recv(&value, 1, MPI_INT, numprocs-1, 999, MPI_COMM_WORLD, &status);
            } else {
                MPI_Recv(&value, 1, MPI_INT, myid-1, 999, MPI_COMM_WORLD, &status);
                value += myid;
                MPI_Send(&value, 1, MPI_INT, (myid+1)%numprocs, 999, MPI_COMM_WORLD);
            }
        }
        BGPRINTF("End of major loop at %f \n");

        if (myid == 0)
            printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
        MPI_Finalize();
        return 0;
    }
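As a sanity check on what the ring program should print, the token's journey can be replayed sequentially without MPI. The helper below is a hypothetical sketch (the name `ring_sum` is not from the tutorial): in each lap rank 0 forwards the token unchanged and every other rank adds its own id.

```c
/* Sequentially replay the ring: `times` laps over `numprocs` ranks.
 * Rank 0 contributes nothing; rank r (r >= 1) adds r on every lap. */
int ring_sum(int numprocs, int times)
{
    int value = 0;
    for (int lap = 0; lap < times; lap++) {
        for (int rank = 1; rank < numprocs; rank++) {
            value += rank;
        }
    }
    return value;
}
```

With 8 ranks and TIMES = 10, each lap adds 1 + 2 + ... + 7 = 28, so `ring_sum(8, 10)` returns 280.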

How to Compile and Run MPI Applications for the Emulator
- Compile with AMPI and the emulator:
    charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c
- Compile with the performance trace module:
    charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c -tracemode projections
- Run:
  - Use the mpirun provided by AMPI
  - Give the number of target processors as well as the number of real processors
  - Define the target machine with the command-line options +x +y +z +cth +wth, e.g.
      mpirun -np 4 ./ring +x10 +y10 +z10 +cth2 +wth4
  - Or use a config file:
      mpirun -np 4 ./ring +bgconfig config

Bgconfig File Format
Pass the file with +bgconfig ./bg_config. Example contents:

    x 10
    y 10
    z 10
    cth 2
    wth 4
    stacksize 4000
    timing walltime
    #timing bgelapse
    #timing counter
    #cpufactor 1.0
    fpfactor 5e-7
    traceroot /tmp
    log yes
    correct no
    network bluegene

Ring Std Output

    Justice> mpirun -np 4 ./pgm +bgconfig ./bg_config
    Reading Bluegene Config file ./bg_config ...
    BG info> Simulating 8x1x1 nodes with 1 comm + 1 work threads each.
    BG info> Network type: ibmpower.
    alpha: 1.000000e-06 bandwidth :1.700000e+09.
    BG info> cpufactor is 1.000000.
    BG info> floating point factor is 0.000000.
    BG info> BG stack size: 30000 bytes.
    BG info> Using WallTimer for timing method.
    BG info> Generating timing log.
    BG info> bgTrace root is './'.
    LB> Load balancer ignores processor background load.
    Start of major loop at 0.268719
    End of major loop at 0.273697
    Sum=280, Time=0.00497856
    [0] Number is numX:8 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:4 totalWorkerProcs:8 bglog_ver:5
    [2] Wrote to disk for 2 BG nodes.
    [3] Wrote to disk for 2 BG nodes.
    [1] Wrote to disk for 2 BG nodes.
    [0] Wrote to disk for 2 BG nodes.
    BG> BlueGene emulator shutdown gracefully!
    BG> Emulation took 0.692498 seconds!
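The numbers in this output can be related to the machine definition with a little arithmetic. The sketch below assumes (it is not stated on the slide) that the total number of worker processors is x * y * z * wth and that target nodes are split evenly over the emulating processors; both function names are hypothetical.

```c
/* Total target worker processors: one per worker thread on every node
 * of the x * y * z target machine (assumed relationship). */
long total_worker_procs(int x, int y, int z, int wth)
{
    return (long)x * y * z * wth;
}

/* Target nodes handled by each emulating (real) processor, assuming an
 * even split; any remainder is ignored in this sketch. */
long nodes_per_emulating_pe(int x, int y, int z, int emulating_pes)
{
    return ((long)x * y * z) / emulating_pes;
}
```

For the run shown (8x1x1 nodes, 1 worker thread, 4 emulating processors) this gives 8 worker processors and 2 nodes per emulating processor, consistent with `totalWorkerProcs:8` and "Wrote to disk for 2 BG nodes". The command-line example `+x10 +y10 +z10 +wth4` would give 4000 worker processors under the same assumption.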

Ring Output Files

    Justice> ls -l
    -rwxr-xr-x 1 gzheng kale 2194434 2009-04-15 00:03 ring
    -rw-r--r-- 1 gzheng kale   10105 2009-04-15 00:04 pgm.sts
    -rw-r--r-- 1 gzheng kale       0 2009-04-15 00:04 pgm.projrc
    -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.7.log
    -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.6.log
    -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.5.log
    -rw-r--r-- 1 gzheng kale    4559 2009-04-15 00:04 pgm.4.log
    -rw-r--r-- 1 gzheng kale    4861 2009-04-15 00:04 pgm.3.log
    -rw-r--r-- 1 gzheng kale    5163 2009-04-15 00:04 pgm.2.log
    -rw-r--r-- 1 gzheng kale    5167 2009-04-15 00:04 pgm.1.log
    -rw-r--r-- 1 gzheng kale    6670 2009-04-15 00:04 pgm.0.log
    -rw-r--r-- 1 gzheng kale   23901 2009-04-15 00:04 bgTrace3
    -rw-r--r-- 1 gzheng kale   23938 2009-04-15 00:04 bgTrace2
    -rw-r--r-- 1 gzheng kale   24663 2009-04-15 00:04 bgTrace1
    -rw-r--r-- 1 gzheng kale   24242 2009-04-15 00:04 bgTrace0
    -rw-r--r-- 1 gzheng kale      60 2009-04-15 00:04 bgTrace

Slide annotations: 8 files (pgm.0.log to pgm.7.log); only 4 files (bgTrace0 to bgTrace3).

What is in the Trace Logs?
- Traces for 2 target processors
- Each SEB has:
  - startTime, endTime
  - Incoming message ID
  - Outgoing messages
  - Dependences
- Tools for reading bgTrace binary files:
  1. charm/examples/bigsim/tools/loadlog: convert to human-readable format
  2. charm/examples/bigsim/tools/log2proj: convert to Projections trace log files

Ring Projections Timeline (figure)

Performance Prediction
- How to predict sequential performance?
- Different levels of fidelity:
  - User-supplied timing expression
  - Wall clock time
  - Performance counters
  - Instruction-level simulation

Sequential Time - BgElapse
- BgElapse: manually advance processor time

    MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
    value += myid;
    ...
    BgElapse(0.000005);
    MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);

- Run with +bgelapse

Sequential Time - using Wallclock
- A wallclock measurement of the time can be used via a suitable multiplier (scale factor): T * factor
- Run the application with +bgwalltime and +bgcpufactor, or with +bgconfig ./bgconfig containing:
    timing walltime
    cpufactor 0.7
- Good for predicting a larger machine using a fraction of the machine
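The wallclock scaling amounts to a single multiplication. A minimal sketch, assuming the factor is applied directly to the measured duration of each sequential block (the function name is hypothetical):

```c
/* Predicted time on the target processor: measured wallclock time on
 * the host multiplied by the scale factor (T * factor). */
double scaled_time(double measured_seconds, double cpufactor)
{
    return measured_seconds * cpufactor;
}
```

Under this reading, a factor below 1 (such as the 0.7 in the config example) models a target processor that is faster than the host, and a factor above 1 models a slower one.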

Sequential Time - Performance Counters
- Count floating-point, integer, memory and branch instructions (for example) with hardware counters
- Convert these counts into expected time on the target machine
- Cache performance and memory footprint effects can be approximated by the percentage of memory accesses and the cache hit/miss ratio
- Example of use, for a floating-point intensive code: +bgconfig ./bg_config containing:
    timing counter
    fpfactor 5e-7
- Perfex and PAPI are supported

Sequential Time - Instruction-level simulation
- Run an instruction-level simulator separately to get accurate timing information
- Issues:
  - It is a different, third-party hardware simulator
  - Hard to integrate with BigSim
  - Sequential
  - Does not model communication
  - Slow!

Interpolation
- BigSim and the instruction-level simulator interact through logs
- Reduce the problem size by sampling: an interpolation-based scheme
  - Run a smaller-sized problem, or
  - Run just one processor
- Assume computation can be modelled by a set of parameters: $T_C = F_n(p_1, p_2, p_3, \ldots)$
- Use sample data from the instruction-level simulation to interpolate the large dataset
- With the sampling data, do a least-squares fit to determine the coefficients of an approximating polynomial function
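To make the least-squares step concrete, here is a minimal sketch for the simplest case: one parameter p and a degree-1 polynomial t = a + b * p. It is an illustration only; the tutorial's tool uses GSL and may fit a different model, and the names `fit_line` and `predict` are hypothetical.

```c
#include <stddef.h>

/* Fit t = a + b * p by ordinary least squares over n samples.
 * Returns 0 on success, -1 if there are fewer than two samples or
 * all p values are identical (the slope is then undefined). */
int fit_line(const double *p, const double *t, size_t n,
             double *a, double *b)
{
    if (n < 2) return -1;

    double sp = 0.0, st = 0.0, spp = 0.0, spt = 0.0;
    for (size_t i = 0; i < n; i++) {
        sp  += p[i];
        st  += t[i];
        spp += p[i] * p[i];
        spt += p[i] * t[i];
    }

    double denom = (double)n * spp - sp * sp;
    if (denom == 0.0) return -1;

    *b = ((double)n * spt - sp * st) / denom;  /* slope */
    *a = (st - *b * sp) / (double)n;           /* intercept */
    return 0;
}

/* Interpolate the time for a parameter value not in the sample set. */
double predict(double a, double b, double p)
{
    return a + b * p;
}
```

The samples would come from the instruction-level simulator (parameter value and measured time per sequential block); the fitted coefficients then supply times for the blocks in the full-size trace.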

Case study: BigSim / Mambo (workflow figure)
- Sequential blocks are bracketed in the source:

    void func()
    {
        startTraceBigSim();
        ...
        endTraceBigSim();
    }

- BigSim parallel emulation produces trace files and parameter files for sequential blocks
- Mambo: cycle-accurate prediction of sequential blocks on the POWER7 processor
- Interpolation replaces the sequential timing and produces adjusted trace files
- BigSim parallel simulation produces the prediction for the target system

Ring Example

    MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
    startTraceBigSim();
    value += myid;
    endTraceBigSim();
    char param[128];
    sprintf(param, "sum %d", myid);
    tagTraceBigSim(param);
    MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);

Output Files

    justice> ls -l
    total 2328
    -rw-r--r-- 1 gzheng kale      60 2009-04-15 11:08 bgTrace
    -rw-r--r-- 1 gzheng kale   36757 2009-04-15 11:08 bgTrace0
    -rw-r--r-- 1 gzheng kale   37023 2009-04-15 11:08 bgTrace1
    -rwxr-xr-x 1 gzheng kale   94886 2009-04-14 09:46 charmrun*
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.0
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.1
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.2
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.3
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.4
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.5
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.6
    -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.7
    -rwxr-xr-x 1 gzheng kale 2153700 2009-04-15 11:07 ring*
    -rw-r--r-- 1 gzheng kale    2965 2009-04-15 11:07 ring.C

    justice> cat param.7
    48 sum 7

Run ring Through Instruction-level Simulator
- Compile the normal version of ring (not the emulator version)
- Run it sequentially through an instruction-level simulator
- Sample line of Mambo output:
    10900820693: (10718653772): TRACE_END: sum 7

Compile and Run Interpolation Tool
- Install GSL, the GNU Scientific Library
- cd charm/examples/bigsim/tools/rewritelog
- Modify the file interpolatelog.C to suit your setup:
  - OUTPUTDIR specifies a directory for the new log files
  - CYCLE_TIMES_FILE specifies the file that contains accurate timing information
- Run make
- Run the interpolation tool under the bgTrace directory:
    ./interpolatelog

Interpolation Tool Rewrites SEB Durations
- Replace the duration of a portion of each SEB with known exact times recorded in an execution or cycle-accurate simulator
- Scale begin/end portions by a constant factor
- Message send points are linearly mapped into the new times
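The linear mapping of message send points can be illustrated with a small helper. This is a sketch of the general idea, not the tool's code: a send that happened at some fraction of the way through the old block is placed at the same fraction of the way through the rewritten block. The function name and argument layout are hypothetical.

```c
/* Map a time t inside [old_start, old_end] to the same relative
 * position inside [new_start, new_end]. */
double remap_time(double t, double old_start, double old_end,
                  double new_start, double new_end)
{
    if (old_end == old_start) {
        return new_start;  /* zero-length block: nothing to interpolate */
    }
    double fraction = (t - old_start) / (old_end - old_start);
    return new_start + fraction * (new_end - new_start);
}
```

For example, a send issued halfway through a block that originally spanned 0 to 10 lands at 110 when the block is rewritten to span 100 to 120.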

Record/Replay
- Record only a subset of special logs when running the full-size emulation
- With the special logs, replay the execution of a particular target processor through a hardware simulator
- Example:
    ./pgm +x 32768 +y 1 +z1 +bgrecord +bgrecordprocessors 0-32767:1024
    ./pgm +bgreplay 31744
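The argument `0-32767:1024` reads like a "first-last:stride" range. That interpretation is an assumption (the slide does not define the syntax), but it is consistent with the replay of processor 31744, which is 31 * 1024. Under that assumption, the selected processors can be worked out as in this hypothetical sketch:

```c
#include <stdio.h>

/* Parse a "first-last:stride" specification (assumed syntax).
 * Returns 0 on success, -1 on malformed or inconsistent input. */
int parse_range(const char *spec, long *first, long *last, long *stride)
{
    if (sscanf(spec, "%ld-%ld:%ld", first, last, stride) != 3) return -1;
    if (*stride <= 0 || *last < *first) return -1;
    return 0;
}

/* Number of processors selected by the range. */
long range_count(long first, long last, long stride)
{
    return (last - first) / stride + 1;
}

/* Highest processor number selected by the range. */
long range_last(long first, long last, long stride)
{
    return first + ((last - first) / stride) * stride;
}
```

For `0-32767:1024` this yields 32 recorded processors (0, 1024, 2048, ...), the last of which is 31744.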

Out-of-core Emulation
- Motivation:
  - Applications with a large memory footprint
  - The VM system cannot handle them well
- Use the hard drive
  - Similar to checkpointing
- Message-driven execution
  - Peek at the message queue => what executes next? (prefetch)

Overview of the idea (figure)

Using Out-of-core
- Change the BigSim configuration file Charm/tmp/Conv-mach-bigemulator.h:
    #define BIGSIM_OUT_OF_CORE 1
- Recompile Charm++ and the application
- Run the application through the emulator with an additional command-line option:
    +bgooc 1024

Postmortem Simulation
- Run the application once, get trace logs, and run the simulation with those logs for a variety of network configurations
- Big Network Simulator (BigNetSim) is implemented on the POSE simulation framework
- Particularly useful when message-passing performance is critical and strongly affected by network contention
- Note: the BigSim emulator and the BigSim simulator both use the same network models for latency-only calculations, located in charm/src/langs/bluegene/bigsim_network.h

Implementation
- Post-mortem network simulators are Parallel Discrete Event Simulations
- Parallel Object Simulation Environment (POSE)
  - Network layer constructs (NIC, Switch, Node, etc.) are implemented as poser simulation objects
  - Network data constructs (message, packet, etc.) are implemented as event methods on simulation objects

POSE (figure)

Terms
- Several network models are available:
  - Specific, e.g. BlueGene
    - Latency-only model: does not account for contention
    - Network contention model
  - Generic: Simple Latency Model, which uses a simple equation to determine message transmission time
- Emulating processors: physical processors on which the emulation is run (+p?)
- Simulating processors: physical processors on which the simulation (BigNetSim) is run (+p?)
- Target processors: virtual (or simulated) processors on which emulation and simulation are run (+vp?)

BigNetSim Build Flow
- Download and compile charm
- Compile POSE
- Compile bigsim
- Download BigNetSim
- Compile BigNetSim
- Run simulator
- Output

Download and compile charm (if not done already)
- Download the latest version of charm from the PPL archives: http://charm.cs.uiuc.edu/download/downloads.shtml
- Compile charm:
    cd charm
    ./build charm++ net-linux

Compile POSE
    cd charm
    ./build pose net-linux
- Options are set in pose_config.h
  - Stats enabled by POSE_STATS_ON=1
  - User event tracing: TRACE_DETAIL=1
  - More advanced configuration options: speculation, checkpoints, load balancing

Compile bigsim
    cd charm/net-linux/tmp
    make bigsim

Download BigNetSim
- Download the latest revision from the repository:
    svn co https://charm.cs.uiuc.edu/svn/repos/BigNetSim
- Directory structure:
    BigNetSim/trunk/
        BlueGene/, RedStorm/ and others - network models
        SimpleLatency/ - Simple Latency Model
        Topology/, Routing/, InputVcSelection/, OutputVcSelection/ - network configuration choices
        Main/ - main simulation files
        tools/ - tools directory
        tmp/ - working directory created during build

Compile BigNetSim
- Fix BigNetSim/trunk/Makefile.common so that CHARMBASE points to your charm directory
- For the Simple Latency Model:
    cd BigNetSim/trunk/SimpleLatency
  - For the parallel simulator:
      make
  - For the sequential simulator (runs on only 1 simulating processor):
      make SEQUENTIAL=1

Run Simulator
    cd BigNetSim/trunk/tmp
- Copy the bgTrace files into this tmp directory
- For the parallel build, run with:
    ./charmrun +p4 bigsimulator -lat 1 -bw 1
- For the sequential build, run with:
    ./bigsimulator -lat 1 -bw 1

Output
- Simulation completion time is specified in "GVT ticks" (GVT = Global Virtual Time)
- GVT tick length is determined by the value of #define factor in BigNetSim/trunk/Main/TCsim.h
- Divide the final GVT by factor to get the simulation time in seconds
  - factor = 1e8 => 1 tick = 10 ns
  - factor = 1e9 => 1 tick = 1 ns
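The conversion from GVT ticks to seconds is a single division. A minimal sketch (the function name is hypothetical; `factor` corresponds to the `#define factor` value mentioned above):

```c
/* Convert a final GVT value in ticks to seconds by dividing by the
 * timing factor (ticks per second). */
double gvt_to_seconds(long long gvt_ticks, double factor)
{
    return (double)gvt_ticks / factor;
}
```

With the ring example's reported "Final GVT = 38129444" and "timing factor 1.000000e+08", this gives roughly 0.381 seconds of simulated time.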

Output (continued)
- Use BgPrintf(const char *) in source code to print event times
- Each BgPrintf() called at execution time in online execution mode is stored in the trace log as a printing event
- In postmortem simulation, strings associated with BgPrintf() events are printed when the event is committed
- "%f" in the string is replaced by the committed time
- Useful for determining iteration times during simulation as well as emulation
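The "%f" substitution can be pictured with a small stand-alone helper. This is not BigSim's implementation, only a sketch of the described behavior: the first "%f" in the stored string is replaced by the event's time when the event is printed. The function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy fmt into out, replacing the first "%f" with committed_time.
 * Returns the number of characters the full result needs, in the
 * same way snprintf does. */
int format_event(char *out, size_t out_size,
                 const char *fmt, double committed_time)
{
    const char *marker = strstr(fmt, "%f");
    if (marker == NULL) {
        /* No placeholder: print the string unchanged. */
        return snprintf(out, out_size, "%s", fmt);
    }
    /* Prefix before the marker, then the time, then the remainder. */
    return snprintf(out, out_size, "%.*s%f%s",
                    (int)(marker - fmt), fmt, committed_time, marker + 2);
}
```

Treating the stored text as data rather than passing it to `printf` as a format string avoids undefined behavior if the string contains other conversion specifiers.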

Output (continued)
- Projections: copy the emulation Projections logs and sts file into BigNetSim/trunk/tmp
- Two ways to use:
  - Command-line parameter: -projname
    - Creates a new set of logs by updating the emulation logs
    - Assumes emulation Projections logs are: .*.log
    - Output: -bg.*.log
    - Disadvantage: emulation Projections overhead included
  - Command-line parameter: -tproj
    - Creates a new set of logs from the trace files, ignoring the emulation logs
    - Must first copy the .sts file to tproj.sts
    - Output: tproj.*.log
    - Advantage: no emulation Projections overhead included

Ring Example

    ./bigsimulator -lat 1 -bw 1
    Charm++: standalone mode (not using charmrun)
    Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
    Charm++> cpu topology info is being gathered!
    Charm++> 1 unique compute nodes detected!
    bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
    Opts: netsim on: 0
    Initializing POSE...
    POSE initialization complete.
    Using Inactivity Detection for termination.
    netsim skip_on 0 0
    Info> timing factor 1.000000e+08...
    Info> invoking startup task from proc 0...
    [0:RECV_RESUME] Start of major loop at 0.347418
    [0:RECV_RESUME] End of major loop at 0.349147
    Simulation inactive at time: 38129444
    Final GVT = 38129444
    1 PE Simulation finished at 0.052671.
    Program finished.

Projections - Ring Example (figure)
- Panel 1: Emulation
- Panel 2: Simulation with -lat 1 (latency = 1 µs), generated with -tproj

Projections - Ring Example (figure)
- Panel 1: Simulation with -lat 1 (latency = 1 µs), generated with -tproj
- Panel 2: Simulation with -lat 20 (latency = 20 µs), generated with -tproj

Simple Latency Model
- Calculates message transmission time using a simple equation:
    lat + (N / bw) + [cpp * (N / psize)]
  where
    lat = latency in µs
    bw = bandwidth in GB/sec
    cpp = cost per packet in µs
    psize = packet size in bytes
    N = number of bytes sent
- Only lat and bw are required; cpp and psize are optional
- Does not account for contention or topology, except for intra-node vs. inter-node messages
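The equation can be written out as a small function to make the units explicit. This is a sketch under stated assumptions, not BigNetSim's code: 1 GB/sec is taken as 10^9 bytes per second (1000 bytes per µs), N / psize is used as the real-valued quotient exactly as the formula is written (an implementation might round up to whole packets), and the function name is hypothetical.

```c
/* Message transmission time in microseconds:
 *     lat + (N / bw) + [cpp * (N / psize)]
 * lat_us and cpp_us are in microseconds, bw_gb_per_s in GB/sec
 * (assumed 1e9 bytes/sec), n_bytes and psize_bytes in bytes.
 * Pass cpp_us = 0 or psize_bytes = 0 to omit the optional
 * per-packet term. */
double transmit_time_us(double lat_us, double bw_gb_per_s, double n_bytes,
                        double cpp_us, double psize_bytes)
{
    /* 1 GB/sec moves 1000 bytes per microsecond. */
    double t = lat_us + n_bytes / (bw_gb_per_s * 1000.0);
    if (cpp_us > 0.0 && psize_bytes > 0.0) {
        t += cpp_us * (n_bytes / psize_bytes);
    }
    return t;
}
```

For example, with -lat 1 -bw 1 a 1000-byte message takes 1 + 1 = 2 µs; adding a per-packet cost of 0.5 µs with 250-byte packets adds 4 * 0.5 = 2 µs more.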

BigNetSim Design - Simple Latency Model (figure)
Components shown: BGnode, BGproc, Net Interface, Switch, Channel, Transceiver.

Simple Latency Model Parameters
- No netconfig file is used for the Simple Latency Model; only command-line parameters
- Required command-line parameters:
  - -lat : latency in µs (double)
  - -bw : bandwidth in GB/sec (double)
- Optional parameters:
  - -cpp : cost per packet in µs (double)
  - -psize : packet size in bytes (int)
  - -lat_in : intra-node latency in µs (double); if not specified, lat_in defaults to charmcost (0.5 µs)
  - -bw_in : intra-node bandwidth in GB/sec (double); if not specified, bw_in defaults to the value from -bw
  - -winsize : window size in number of log entries (int)

Simple Latency Model Parameters
More optional parameters:
- -winthresh : threshold beyond which windows will not be kept when looking at forward dependencies; specified in number of windows (int)
- -print_params : echoes the parameters fed into the simulation
- -debuglevel : specifies the level of debugging info displayed
  - 0: display nothing extra
  - 1: display some basic data, including memory usage and timing stats
  - 2: display function-level debugging messages
  - 3: display all debugging messages
- -skip : skip to a predefined point specified in the emulation source code
- -profile : display data for the forward dependent distance histogram

Other Parameters
Other useful parameters for all network models:
- -check : at the end of the simulation, print all uncommitted events for each target processor; none should be displayed, except perhaps event 0 on each target processor
- -projname : create Projections logs based on existing emulation logs
- -tproj : create Projections logs from trace files, ignoring existing emulation logs

Window Size Parameter
- Specifies the window size used for reading the bgTrace logs incrementally
- Necessary for large traces
- Available for all network models:
  - Simple Latency Model: use -winsize on the command line
  - Other models: specify FILE_WINDOW_SIZE in the netconfig file
- Reduces the memory footprint of the simulation at the cost of increased runtime
- Setting it to 0 sets the size to the whole timeline (no windowing)
- The next window is loaded when the simulation needs to look at forward dependencies of a particular event
- Doesn't help much if dependencies span most of a timeline => use the window threshold parameter

Window Size Memory Savings
- Depends a lot on trace log profiles
- Use the -profile command-line option in the Simple Latency Model to get a forward dependent distance histogram (in text format)

Window Size Memory Savings - NAMD (chart)
- Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per timeline, no thresholding (-winthresh 0)
- Note: memory savings are good: ~55%

Window Size Memory Savings - MILC (chart)
- Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per timeline, no thresholding (-winthresh 0)
- Note: memory savings are poor: only ~3%

Window Load Threshold Parameter
- Specifies a threshold, in number of event windows, beyond which the simulator will not keep a window in memory when looking at forward dependencies
- Available for all network models:
  - Simple Latency Model: use -winthresh on the command line
  - Other models: specify WINDOW_LOAD_THRESHOLD in the netconfig file
- Balancing required:
  - Too small => extra window loading and longer runtime
  - Too large => more windows loaded and a larger memory footprint
- Setting it to 0 sets the threshold to infinity (no thresholding)

Threshold Usage (figure: each box is an event window; lines represent forward dependents)
- Without thresholding, the simulator loads and stores a new window when it looks at the backward dependents of an event's forward dependents to see if it can queue it for execution.
- This is fine for timelines where the forward dependents are "close".
- But when they are "far away", this could load most of the timeline, which won't be discarded until the events are executed.
- Solution: define a threshold beyond which the windows aren't saved.

Threshold Time Savings - MILC (chart)
- Sequential Simple Latency Model, 280 target procs, 1 simulating proc, ~4000 events per timeline, with thresholding (-winthresh 1)
- Note: memory savings are much better: ~80%

Debuglevel Example - Ring

    ./bigsimulator -lat 1 -bw 1 -winsize 5 -debuglevel 1

Does this for each target processor:

    Beginning first pass on PE 0
    [0] Max memory during init: 6660424 (6.4 MB)
    [0] FileWindower: totalTlineLength=60
    [0] First pass took 0.004848 seconds
    [0] Max memory during first pass: 1297480 (1.2 MB)
    Beginning execution on PE 0

Shows execution updates:

    *** Entire first pass sequence took about 0.049207 seconds ***
    Execution update: at event 0 of 60 on PE 0 (run time = 0.000000 seconds)
    [0:RECV_RESUME] Start of major loop at 0.347418
    [0:RECV_RESUME] End of major loop at 0.349147
    Simulation inactive at time: 38129444
    Final GVT = 38129444

Prints this at the end for each target processor:

    On proc 0:
    [0] Max memory during execution: 1643560 (1.6 MB)
    [0] Max windows open: 2
    [0] Open windows at completion: 0
    [0] Window deletes: first pass=12 later=12

Networks (figure)
- Direct Network
- Indirect Network

BigNetSim Design (figure)
Components shown: BGnode, BGproc, Net Interface, Switch, Channel, Transceiver.

BigNetSim: Network Data Flow (figure)

Interconnection Networks
Flexible interconnection network modeling. Choose from a variety of:
- Topologies
- Routing algorithms
- Input virtual channel selection strategies
- Output virtual channel selection strategies

Topology
- Topologies available: HyperCube; Mesh; generalized k-ary-n-mesh; n-mesh; Torus; generalized k-ary-n-cube; FatTree; generalized k-ary-n-tree; Low Diameter Regular graphs (LDR)
- Hybrid topologies: HyperCube-Fattree; HyperCube-LDR

Network Modeling
- Routing models: virtual cut-through routing
- Contention modeling:
  - Port contention at a switch
  - Load contention: available buffer at the next layer of switches
- Adaptive and static routing algorithms:
  - Minimal deadlock-free
  - Non-minimal
  - Fault-tolerant

Routing Algorithms
- K-ary-N-mesh / N-mesh: Direction Ordered; Planar Routing; Static Direction Reversal Routing; Optimally Fully Adaptive Routing (modified too)
- K-ary-N-tree: UpDown (modified, non-minimal)
- HyperCube: Hamming; P-Cube (modified too)

Input/Output VC selection
- Input Virtual Channel Selection: Round Robin; Shortest Length Queue; Output Buffer length
- Output Virtual Channel Selection: Max. available buffer length; Max. available buffer bubble VC; Output Buffer length

Configuring BigNetSim
netconfig file parameters:
    USE_TRANSCEIVER 0 - For network analysis: ignore trace and generate random traffic
    NUM_NODES 0 - Number of nodes, taken from trace file or set for transceiver
    MAX_PACKET_SIZE 256 - Maximum packet size
    SWITCH_VC 4 - Number of switch virtual channels
    SWITCH_PORT 8 - Number of ports in switch, calculated automatically for direct networks
    SWITCH_BUF 1024 - Size in memory of each virtual channel
    CHANNELBW 1.75 - Bandwidth in units of 100 MB/s
    CHANNELDELAY 9 - Delay in units of 10 ns, so 9 => 90 ns
    RECEPTION_SERIAL 0 - Used for direct networks where reception FIFO access has to be serialized
    INPUT_SPEEDUP 8 - Used to limit simultaneous access by VCs in a port; should be less than or equal to the number of VCs; currently used only for bluegene
    ADAPTIVE_ROUTING 1 - Additional flag to use adaptive/deterministic routing
    COLLECTION_INTERVAL 1000000 - Collection interval * 10 ns gives the statistics bin size
    DISPLAY_LINK_STATS 1 - Display statistics for each link
    DISPLAY_MESSAGE_DELAY 1 - Display message delay statistics
    FILE_WINDOW_SIZE 0 - Window size for incremental log reading
    DEBUG_PRINT_LEVEL 1 - Level of debugging messages to display
    WINDOW_LOAD_THRESHOLD 0 - Threshold beyond which windows are discarded
    INTRA_NODE_LATENCY 0.5 - Intra-node latency in µs
    INTRA_NODE_BANDWIDTH 1.0 - Intra-node bandwidth in GB/sec
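Several netconfig values are expressed in scaled units, which is easy to misread. The helpers below are a hypothetical sketch (none of these names come from BigNetSim) that converts the three scaled parameters into conventional units, following the unit descriptions given in the parameter list.

```c
/* CHANNELBW is given in units of 100 MB/s. */
double channel_bw_mb_per_s(double channelbw)
{
    return channelbw * 100.0;
}

/* CHANNELDELAY is given in units of 10 ns. */
long channel_delay_ns(long channeldelay)
{
    return channeldelay * 10;
}

/* COLLECTION_INTERVAL * 10 ns is the statistics bin size;
 * returned here in milliseconds (1 ms = 1e6 ns). */
double collection_bin_ms(long collection_interval)
{
    return (double)collection_interval * 10.0 / 1e6;
}
```

With the example values, CHANNELBW 1.75 corresponds to 175 MB/s, CHANNELDELAY 9 to 90 ns, and COLLECTION_INTERVAL 1000000 to a 10 ms statistics bin.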

Running Specific Network Model Simulations
    ./charmrun +p4 bigsimulator 1 1
Parameters:
- The first parameter selects detailed network simulation
  - 1 uses the detailed model (with or without contention)
  - 0 uses the latency-only model
- The second parameter controls simulation skip
  - 1 skips forward to a time stamp set during trace creation
  - 0 does not skip; use it if no skip points were set in the emulation code or if the network startup portion of the simulation is interesting

Artificial Network Loads - Transceiver
- Generate traffic patterns instead of using trace files
- Additional command-line parameters: Pattern and Frequency
- Pattern: 1 kshift; 2 ring; 3 bittranspose; 4 bitreversal; 5 bitcomplement; 6 poisson
- Frequency: 0 linear; 1 uniform; 2 exponential
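For readers unfamiliar with these synthetic traffic patterns, the sketch below shows the textbook definitions of three of them as destination functions. These are the conventional definitions from the interconnection-network literature; BigNetSim's own implementation may differ in detail, and the function names are hypothetical. Node addresses are assumed to use `bits` address bits (with bits < 32).

```c
/* Bit complement: the destination is the source address with each of
 * its low `bits` address bits inverted. */
unsigned bit_complement(unsigned src, unsigned bits)
{
    unsigned mask = (1u << bits) - 1u;
    return ~src & mask;
}

/* Bit reversal: the destination is the source address with its low
 * `bits` address bits in reverse order. */
unsigned bit_reversal(unsigned src, unsigned bits)
{
    unsigned dst = 0;
    for (unsigned i = 0; i < bits; i++) {
        dst = (dst << 1) | ((src >> i) & 1u);
    }
    return dst;
}

/* k-shift: the destination is k positions further along, wrapping
 * around at num_nodes. */
unsigned k_shift(unsigned src, unsigned k, unsigned num_nodes)
{
    return (src + k) % num_nodes;
}
```

On an 8-node network (3 address bits), node 5 (101) sends to node 2 (010) under bit complement, node 1 (001) sends to node 4 (100) under bit reversal, and node 7 sends to node 0 under a 1-shift.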

Adding a Network
- mkdir a new subdirectory in trunk
- Copy the boilerplate InitNetwork.h
- Copy the boilerplate Makefile and change the MACHINE make variable to your directory name
- Write a new InitNetwork.C:
  - Define switch, channel, and NIC mappings
  - Define how switches route and select virtual channels
  - Define topology and default routing

Adding a Topology
New *.h and *.C files in trunk/Topology, providing:
- constructor()
- getNeighbours()
- getNext()
- getNextChannel()
- getStartPort()
- getStartVC()
- getStartSwitch()
- getStartNode()
- getEndNode()

Adding a Routing Strategy
New *.h and *.C files in trunk/Routing, providing:
- constructor()
- selectRoute()
- populateRoute()
- loadTable()
- getNextSwitch()
- sourceToSwitchRoutes()

Adding a VC Selector
- Either an Input or an Output VC Selector
- New *.h and *.C files in [Input/Output]VCSelector, providing:
  - constructor()
  - select[Input/Output]VC()

BlueWaters
- BlueWaters network model design is currently in progress

Thank you!
- Free download of Charm++ and BigSim: http://charm.cs.uiuc.edu
- Send questions and comments to: ppl@charm.cs.uiuc.edu
