The BigSim Parallel Simulation System
Gengbin Zheng, Ryan Mokos
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Charm++ Workshop 2010, 4/28/2010
Outline
- Overview
- BigSim Emulator
- BigSim Simulator
Summarizing the State of the Art: Petascale
- Very powerful parallel machines exist (Jaguar, Roadrunner, etc.)
- Application domains exist that need that kind of power
- A new generation of applications:
  - Use sophisticated algorithms
  - Dynamic adaptive refinements
  - Multi-scale, multi-physics
- Parallel applications are more complex than sequential ones and hard to predict without actually running them
- Challenge: is it possible to simulate these applications at large scale using small clusters?
BigSim
Why BigSim, and why on Charm++?
- Targets large-scale simulation
- Object-based processor virtualization provides a virtualized execution environment
- Efficient message-passing runtime provided by Charm++
- Supports fine-grained decomposition
- Portability
BigSim Infrastructure
- Emulator
  - A virtualized execution environment for Charm++ and MPI applications
  - No or small changes to MPI application source code
  - Facilitates code development and debugging
- Simulator
  - Trace-driven approach
  - Parallel Discrete Event Simulation
  - Simple latency or full network contention modeling
  - Predicts parallel performance at varying levels of resolution
Architecture of BigSim
(Diagram: Charm++/MPI applications run on the AMPI Runtime and BigSim Emulator, which sit on the Charm++ Runtime; the emulator produces simulation trace logs that feed the BigSim Simulator, built on POSE; results are viewed with performance visualization (Projections).)
MPI Alltoall Timeline
(Figure.)
BigSim Emulator
- Emulates the full target machine on existing machines by actually running a parallel program
  - E.g., NAMD on 256K target processors using 8K cores of the Ranger cluster
- Implemented on Charm++ as libraries that link to the user application
- Simple architecture abstraction: many multiprocessor (SMP) nodes connected via message passing
- Does not emulate at the instruction level
BigSim Emulator: Functional View
(Diagram: a physical processor runs the Converse scheduler and Converse queue, hosting multiple target nodes; each target node has an incoming queue, a node-level queue, and processor-level queues served by worker processors and communication processors.)
Processor Virtualization
- User view vs. system view
- Programmer: decomposes the computation into objects
- Runtime: maps the computation onto the processors (see the sketch below)
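As a concrete illustration of object-based decomposition, here is a minimal Charm++ sketch (file and class names are hypothetical, not from the slides): the programmer creates many more objects than processors, and the runtime decides where each one runs.

    // work.ci -- Charm++ interface file (all names hypothetical)
    mainmodule work {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Work {
        entry Work();
        entry void compute();
      };
    };

    // work.C
    #include "work.decl.h"

    /*readonly*/ CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int remaining;
     public:
      Main(CkArgMsg* m) {
        delete m;
        remaining = 64;                  // 64 objects, independent of core count
        mainProxy = thisProxy;
        CProxy_Work::ckNew(remaining).compute();  // create the array, broadcast compute()
      }
      void done() { if (--remaining == 0) CkExit(); }
    };

    class Work : public CBase_Work {
     public:
      Work() {}
      Work(CkMigrateMessage*) {}
      void compute() {
        // ... this object's share of the computation ...
        mainProxy.done();                // report completion to Main
      }
    };

    #include "work.def.h"

The runtime is free to place, and later migrate, the 64 Work objects across however many physical processors exist, which is exactly what BigSim exploits to emulate a large machine on a small one.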
Major Challenges
- Running multiple copies of the code on each processor
  - Shared global variables: Charm++ applications already handle this
  - AMPI: global/static variables need runtime techniques and compiler tools (see the sketch below)
- Simulation time and memory footprint
  - E.g., NAMD on 1024 target processors using 8 cores
  - Global read-only variables can be shared
  - Out-of-core execution
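One common privatization technique is to move each former global into a per-rank structure so that many emulated ranks can safely share one process; this is a minimal sketch with invented names, not the actual AMPI transformation tooling.

    /* Hypothetical illustration of global-variable privatization:
     * each emulated MPI rank carries its own copy of former globals. */

    // Before: a process-wide global, unsafe when many ranks share a process.
    // int iteration;

    // After: privatized per-rank state.
    struct RankGlobals {
      int iteration;
      double residual;
    };

    void step(RankGlobals* g) {  // every former global access goes through g
      g->iteration++;
      // ... compute, update g->residual ...
    }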
NAMD Emulation
- Only a 19x slowdown
- Only a 7x increase in memory
Out-of-core Emulation
- Motivation: applications with a memory footprint that the VM system cannot handle well
- Use the hard drive; similar to checkpointing
- Message-driven execution: peek at the message queue to see what executes next, and prefetch it (see the sketch below)
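The prefetch idea can be sketched as follows, with all types and names invented for illustration (the real emulator's scheduler is Converse-based):

    #include <deque>
    #include <unordered_set>

    struct Message { int targetObj; };

    struct ObjectStore {
      std::unordered_set<int> inCoreSet;
      bool inCore(int obj) const { return inCoreSet.count(obj) != 0; }
      void prefetchFromDisk(int obj) { inCoreSet.insert(obj); }  // start async read
      void ensureInCore(int obj) { inCoreSet.insert(obj); }      // wait if still on disk
      void maybeEvict() { /* write cold objects back to disk */ }
    };

    void execute(const Message& m) { /* run the handler for m */ }

    void schedulerLoop(std::deque<Message>& q, ObjectStore& store) {
      while (!q.empty()) {
        if (q.size() > 1 && !store.inCore(q[1].targetObj))
          store.prefetchFromDisk(q[1].targetObj);  // overlap disk I/O with execution

        Message m = q.front(); q.pop_front();
        store.ensureInCore(m.targetObj);           // block only on a prefetch miss
        execute(m);
        store.maybeEvict();
      }
    }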
What Is in the Trace Logs?
(Figure: traces for two target processors.)
- Each SEB (Sequential Execution Block) has (see the sketch below):
  - startTime, endTime
  - Incoming message ID
  - Outgoing messages
  - Dependences
- Tools for reading bgTrace binary files:
  1. charm/example/bigsim/tools/loadlog: converts to human-readable format
  2. charm/example/bigsim/tools/log2proj: converts to trace Projections log files
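For orientation, the listed SEB fields map naturally onto a record like the following; this is a hypothetical C++ rendering, not the actual bgTrace binary layout.

    #include <vector>

    struct OutgoingMsg {
      int dstPE;           // destination target processor
      double sendTime;     // when the message leaves this block
    };

    struct SEB {                            // Sequential Execution Block
      double startTime, endTime;            // execution interval on the target
      int incomingMsgID;                    // message that triggered this block
      std::vector<OutgoingMsg> outgoing;    // messages sent during the block
      std::vector<int> dependences;         // blocks that must complete first
    };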
BigSim Simulator: BigNetSim
- Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is in turn built on Charm++
- Parallel Discrete Event Simulation
- Passes emulator traces through different network models in BigNetSim to get final performance results
- Details of using BigNetSim: hop2009/slides/tut_BigSim09.ppt, manual.html
POSE
- Network-layer constructs (NIC, switch, node, etc.) are implemented as poser simulation objects
- Network data constructs (message, packet, etc.) are implemented as event methods on simulation objects
Posers
- Each poser is a tiny simulation (see the sketch below)
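Conceptually, each poser keeps its own object virtual time (OVT) and reacts to timestamped events; the sketch below illustrates the idea in plain C++ and is not the actual POSE API or syntax.

    /* Conceptual illustration only -- not POSE syntax. */
    struct Event { double timestamp; int packetID; };

    class SwitchPoser {
      double ovt = 0.0;                  // this object's local virtual time
     public:
      // An "event method": invoked with a timestamped event, never called directly.
      void recvPacket(const Event& e) {
        if (e.timestamp > ovt) ovt = e.timestamp;  // advance the local clock
        ovt += 0.25e-6;                  // hypothetical per-packet switch delay (s)
        // ... forward the packet as a new event to the next poser ...
      }
    };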
Performance Prediction
Two components:
- Time to execute blocks of sequential, computational code (SEBs = Sequential Execution Blocks)
- Communication time, based on a particular network topology
Sequential Time Prediction (Emulator)
- Manual: advance processor time using BgElapse() calls in application code (see the sketch below)
- Wallclock time: use a multiplier (scale factor) to account for architecture differences
- Performance counters: count instructions with hardware counters; use the expected time of each instruction on the target machine to derive execution time
- Instruction-level simulation (e.g., Mambo): record cycle-accurate execution times for functions; use the interpolation tool to replace SEB times
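A minimal sketch of the manual approach, assuming BgElapse() takes the elapsed target time in seconds (check the BigSim manual for the exact interface):

    void BgElapse(double t);   // provided by the BigSim emulator (assumed signature)

    void computeKernel(double* data, int n) {
      for (int i = 0; i < n; i++)
        data[i] = data[i] * 0.5 + 1.0;  // runs at host speed during emulation

      double predicted = n * 2.0e-9;    // hypothetical per-element cost on the target
      BgElapse(predicted);              // advance this target processor's virtual time
    }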
Sequential Time Prediction (continued)
- Model-based (recent work):
  - Performed after emulation
  - Determine the application functions responsible for most of the computation time
  - Run these functions on the target machine; obtain run times as a function of their parameters to create a model
  - Feed emulation traces through an offline modeling tool (like the interpolation tool) to replace SEB times
  - Generates a corrected set of traces
Communication Time Prediction (Simulator)
Valid for a particular network topology.
- Generic: Simple Latency model
  - A formula predicts time from latency and bandwidth parameters (see the sketch below)
- Specific: BlueGene, Blue Waters, and others
  - Latency-only option: uses a formula specific to the network
  - Full contention
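The slides do not spell out the formula; a standard first-order latency/bandwidth model looks like this (the exact BigNetSim formula may differ, e.g., by adding per-hop costs):

    /* First-order communication cost model: time = latency + size / bandwidth. */
    double predictCommTime(double bytes,
                           double latency,      // seconds, end to end
                           double bandwidth) {  // bytes per second
      return latency + bytes / bandwidth;
    }
    // e.g., 1 MB over a 5 us, 1 GB/s link: 5e-6 + 1e6 / 1e9 ~= 1.005e-3 s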
Specific Model (Full Network)
(Diagram of the simulated components: BGnode, BGproc, Net Interface, Switch, Transceiver, Channel.)
Generic Model (Simple Latency)
(Diagram of the same components: BGnode, BGproc, Net Interface, Switch, Transceiver, Channel.)
What We Model
- Processors
- Nodes
- NICs
- Switches/hubs
- Channels
- Packet-level direct and indirect routing
- Buffers with a credit scheme (see the sketch below)
- Virtual channels
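The credit scheme can be pictured as follows; this is a generic sketch of credit-based flow control with invented names, not BigNetSim's implementation.

    /* Generic credit-based flow control: a sender may inject a packet only
     * while it holds credits; the receiver returns a credit per drained packet. */
    struct ChannelBuffer {
      int credits;                        // free buffer slots downstream
      bool trySend() {
        if (credits == 0) return false;   // stall: no buffer space available
        credits--;                        // consume one slot
        // ... schedule the packet-arrival event ...
        return true;
      }
      void creditReturn() { credits++; }  // receiver drained one packet
    };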
Other BigNetSim Features
- Skip points
  - Set skip points in application code (e.g., after startup)
  - Simulate only between skip points
- Transceiver
  - Traffic-pattern generator; replaces nodes and processors
- Windowing
  - Set the file window size to decrease the memory footprint
  - Can cut the footprint in half or better, depending on trace structure
- Checkpoint-to-disk (recent work)
  - Saves simulator state at a time or GVT interval so the simulation can restart if a crash occurs
BigNetSim Tools
Located in BigNetSim/trunk/tools.
- Log Analyzer: provides info about a set of traces
  - Number of events per simulated processor
  - Number of messages sent
- Log Transformation (recently completed): produces a new set of traces with remapped objects
  - Useful for testing load-balancing scenarios
BigNetSim Output
- BgPrintf() statements
  - Added to application code; "%f" is converted to the committed time during simulation (see the sketch below)
  - GVT = Global Virtual Time; each GVT tick = 1/factor seconds, where factor is defined in BigNetSim/trunk/Main/TCsim.h
- Link utilization statistics
- Projections traces: use the -tproj command-line parameter
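A sketch of how BgPrintf() might be used in application code; the single-string signature is assumed from the slide's description that "%f" is substituted with the committed time.

    void BgPrintf(const char* fmt);  // from the BigSim/BigNetSim headers (assumed)

    void endOfTimestep(int step) {
      if (step % 10 == 0)
        // "%f" is replaced with the committed virtual time during simulation:
        BgPrintf("Completed a timestep at %f seconds\n");
    }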
BigNetSim Output Example

    Charm++: standalone mode (not using charmrun)
    Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
    Charm++> cpu topology info is being gathered!
    Charm++> 1 unique compute nodes detected!
    bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
    Opts: netsim on: 0
    Initializing POSE...
    POSE initialization complete.
    Using Inactivity Detection for termination.
    netsim skip_on 0 0
    Info> timing factor e
    Info> invoking startup task from proc 0...
    [0:RECV_RESUME] Start of major loop at
    [0:RECV_RESUME] End of major loop at
    Simulation inactive at time:
    Final GVT =
    Final link stats [Node 0, Channel 0, ### Link]: ovt: , utilization time: , utilization %: , packets sent:
    gvt=
    Final link stats [Node 0, Channel 3, ### Link]: ovt: , utilization time: , utilization %: , packets sent: 4259
    gvt=
    PE Simulation finished at
    Program finished.
Ring Projections Timeline
(Figure.)
BigNetSim Performance
Examples of sequential simulator performance on Blue Print:
- 4k-VP MILC
  - Startup time: 0.7 hours
  - Execution time: 5.6 hours
  - Total run time: 6.3 hours
  - Memory footprint: ~3.1 GB
- 256k-VP 3D Jacobi (10x10x10 grid, 3 iterations)
  - Startup time: 0.5 hours
  - Execution time: 1.5 hours
  - Total run time: 2.0 hours
  - Memory footprint: ~20 GB
Still tuning parallel simulator performance.
Thank you!
Free download of Charm++ and BigSim:
Send questions and comments to: