Using BigSim to Estimate Application Performance Ryan Mokos Parallel Programming Laboratory University of Illinois at Urbana-Champaign October 19, 2010.

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

Section 6.2. Record data by magnetizing the binary code on the surface of a disk. Data area is reusable Allows for both sequential and direct access file.
INTRODUCTION TO SIMULATION WITH OMNET++ José Daniel García Sánchez ARCOS Group – University Carlos III of Madrid.
ICS103 Programming in C Lecture 1: Overview of Computers & Programming
Operating-System Structures
Lab6 – Debug Assembly Language Lab
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.
A NoC Generation and Evaluation Framework
Chapter 14 Chapter 14: Server Monitoring and Optimization.
Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 8: Implementing and Managing Printers.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition Chapter 2: Operating-System Structures Modified from the text book.
Operating Systems Concepts 1. A Computer Model An operating system has to deal with the fact that a computer is made up of a CPU, random access memory.
BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng Gunavardhan Kakulapati Laxmikant V. Kale University.
Hands-On Microsoft Windows Server 2008
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
UPC/SHMEM PAT High-level Design v.1.1 Hung-Hsun Su UPC Group, HCS lab 6/21/2005.
1Charm++ Workshop 2009 BigSim Tutorial Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois.
1 Enabling Large Scale Network Simulation with 100 Million Nodes using Grid Infrastructure Hiroyuki Ohsaki Graduate School of Information Sci. & Tech.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Eric Keller, Evan Green Princeton University PRESTO /22/08 Virtualizing the Data Plane Through Source Code Merging.
9 Chapter Nine Compiled Web Server Programs. 9 Chapter Objectives Learn about Common Gateway Interface (CGI) Create CGI programs that generate dynamic.
Chapter 2: Operating-System Structures. 2.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts Chapter 2: Operating-System Structures Operating.
Guide to Linux Installation and Administration, 2e1 Chapter 10 Managing System Resources.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 3: Operating-System Structures System Components Operating System Services.
CMAQ Runtime Performance as Affected by Number of Processors and NFS Writes Patricia A. Bresnahan, a * Ahmed Ibrahim b, Jesse Bash a and David Miller a.
IBM Software Group ® Overview of SA and RSA Integration John Jessup June 1, 2012 Slides from Kevin Cornell December 2008 Have been reused in this presentation.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
Hybrid MPI and OpenMP Parallel Programming
Replay Compilation: Improving Debuggability of a Just-in Time Complier Presenter: Jun Tao.
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,
Performance Monitoring Tools on TCS Roberto Gomez and Raghu Reddy Pittsburgh Supercomputing Center David O’Neal National Center for Supercomputing Applications.
BigSim Tutorial Celso L. Mendes - Ryan Mokos Department of Computer Science University of Illinois at Urbana-Champaign
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
1Charm++ Workshop 2010 The BigSim Parallel Simulation System Gengbin Zheng, Ryan Mokos Charm++ Workshop 2010 Parallel Programming Laboratory University.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Manifold Execution Model and System.
Lab 2 Parallel processing using NIOS II processors
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
MODPI: A parallel MOdel Data Passing Interface for integrating legacy environmental system models A. Dozier, O. David, Y. Zhang, and M. Arabi.
Threaded Programming Lecture 2: Introduction to OpenMP.
Globus Grid Tutorial Part 2: Running Programs Across Multiple Resources.
BigSim Tutorial Presented by Gengbin Zheng and Eric Bohm Charm++ Workshop 2004 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.
BigSim Tutorial Presented by Eric Bohm LACSI Charm++ Workshop 2005 Parallel Programming Laboratory University of Illinois at Urbana-Champaign.
Unit 4: Processes, Threads & Deadlocks June 2012 Kaplan University 1.
Projections - A Step by Step Tutorial By Chee Wai Lee For the 2004 Charm++ Workshop.
NGS computation services: APIs and.
Cliff Addison University of Liverpool NW-GRID Training Event 26 th January 2007 SCore MPI Taking full advantage of GigE.
Debugging Lab Antonio Gómez-Iglesias Texas Advanced Computing Center.
Active-HDL Server Farm Course 11. All materials updated on: September 30, 2004 Outline 1.Introduction 2.Advantages 3.Requirements 4.Installation 5.Architecture.
CHAPTER 3 Router CLI Command Line Interface. Router User Interface User and privileged modes User mode --Typical tasks include those that check the router.
Visual Programming Borland Delphi. Developing Applications Borland Delphi is an object-oriented, visual programming environment to develop 32-bit applications.
Some of the utilities associated with the development of programs. These program development tools allow users to write and construct programs that the.
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
SQL Database Management
Outline Installing Gem5 SPEC2006 for Gem5 Configuring Gem5.
Development Environment
INTRODUCING Adams/CHASSIS
Chapter 2: System Structures
Introduction to ZBOSS Embedded Systems Software Training Center
Introduction to SimpleScalar
Introduction to Operating System (OS)
Chapter 2: System Structures
BigSim: Simulating PetaFLOPS Supercomputers
Carthage ios 8 onwards Dependency manager that streamlines the process of integrating the libraries into the project.
ICS103 Programming in C 1: Overview of Computers And Programming
Emulating Massively Parallel (PetaFLOPS) Machines
SPL – PS1 Introduction to C++.
Presentation transcript:

Using BigSim to Estimate Application Performance Ryan Mokos Parallel Programming Laboratory University of Illinois at Urbana-Champaign October 19, 2010

Outline Overview BigSim Emulator BigSim Simulator Using BigSim to Estimate Application Performance 2

BigSim Built on Charm++ Object-based processor virtualization Virtualized execution environment that allows running large-scale simulations on small-scale systems Runs Charm++ and AMPI applications at large scale 3 Using BigSim to Estimate Application Performance

Performance Prediction Two components: Time to execute blocks of sequential, computation code SEBs = Sequential Execution Blocks Communication time based on a particular network topology 4 Using BigSim to Estimate Application Performance

BigSim Components Emulator Generates traces that capture SEB execution times, dependencies, and messages Simulator (BigNetSim) Trace-driven PDES (Parallel Discrete Event Simulation) Calculates message timing based on the specified network model 5 Using BigSim to Estimate Application Performance

BigSim Architecture 6 Using BigSim to Estimate Application Performance Charm++/AMPI applications Simulation trace logs BigSim Simulator Performance visualization (Projections), Link utilization stats, Terminal output‏ BigSim Emulator AMPI RuntimeCharm++ Runtime POSE

Limitations BigSim does not: Include cycle-accurate/instruction-level simulation But can be integrated with external simulators Predict cache and virtual memory effects Model interference Operating system External job Model non-deterministic applications 7 Using BigSim to Estimate Application Performance

Outline Overview BigSim Emulator BigSim Simulator Using BigSim to Estimate Application Performance 8

Emulator Implemented on Charm++ Libraries link to user application Virtualized execution environment Each physical processor emulates multiple target processors Be careful of increased memory footprint Efficiencies realized NAMD: virt. ratio 128 => 7x memory, 19x run time 9 Using BigSim to Estimate Application Performance

Using the Emulator (64-Bit Linux Example) Convert MPI application to AMPI (or use a Charm++ application) Install emulator Download Charm++ Compile Charm++/AMPI with “bigemulator” option./build AMPI net-linux-x86_64 bigemulator -j8 -O This builds Charm++ and emulator libraries under net-linux-x86_64-bigemulator (work in this directory) 10 Using BigSim to Estimate Application Performance

Using the Emulator (continued) Set parameters in a config file Note: the same topology (x, y, z dimensions and number of worker threads) will be used by the simulator wth = # worker threads = # cores / node Compile the application to be emulated in the -bigemulator directory Run the application with the config file via +bgconfig 11 Using BigSim to Estimate Application Performance

Example – AMPI Cjacobi3D cd charm/net-linux-x86_64-bigemulator/examples/ampi/Cjacobi3D make Modify config file bg_config as desired: x 4 y 2 z 2 Cth 1 wth 8 stacksize timing walltime #timing bgelapse #timing counter cpufactor 1.0 fpfactor 5e-7 traceroot. log yes correct no network bluegene #projections 2, Using BigSim to Estimate Application Performance

Example – AMPI Cjacobi3D (continued) Run emulation of 8 target processors (virtual processors) on 2 physical processors./charmrun +p2./jacobi vp8 +bgconfig bg_config As long as “log yes” is specified in the config file, 3 trace files will be written: bgTrace – summary file bgTrace0 – trace file for the 4 vps on processor 0 bgTrace1 – trace file for the 4 vps on processor 1 13 Using BigSim to Estimate Application Performance

LogAnalyzer Tool for analyzing emulator traces Run as./LogAnalyzer –i (interactive) or./LogAnalyzer –c (good for scripting) Options: Display time line lengths Convert traces to ASCII files Display the number of messages sent and received by each target processor Display the total execution time of all events on each target processor Display the number of packets sent by each target processor Execution time estimation (experimental) 14 Using BigSim to Estimate Application Performance

Skip Points Add to actual application code at places where control is completely given back to processor 0 (e.g., after allreduce, barrier, load balancing, etc.) BgSetStartEvent() Skip points marked in trace files Simulator can execute between skip points Uses: Bypass start-up sequence Simulate only one application step 15 Using BigSim to Estimate Application Performance

Projections Visual tool for analyzing program runs Link the emulated application with –tracemode projections to get projections traces Can be modified by the simulator 16 Using BigSim to Estimate Application Performance

Projections Example – MPI AlltoAll Timeline 17 Using BigSim to Estimate Application Performance

Emulator – Other Features Different levels of fidelity available for predicting performance Wallclock time with cpu scaling factor (already discussed) Manually elapse time with BgElapse() calls Performance counters Instruction-level/cycle-accurate simulation Model-based (time most-used functions and iterpolate to create model) Out-of-core execution when emulation won’t fit in main memory Record/replay subset of traces 18 Using BigSim to Estimate Application Performance

Outline Overview BigSim Emulator BigSim Simulator Using BigSim to Estimate Application Performance 19

Simulator (BigNetSim) PDES simulator built on top of Charm++ Run BigNetSim on emulator traces to get final run results for a particular network model Pre-compiled binaries supplied for this workshop To download and build source code from public repository (does not contain Blue Waters model), see the 2009 Charm++ Workshop BigSim tutorial 20 Using BigSim to Estimate Application Performance

Running BigNetSim on Blue Print Ensure bgTrace files, charmrun, and charmrun.ll are in the same directory as the executable Update charmrun.ll with desired output and error file names Submit job to loadleveler./charmrun +p +n./ E.g.:./charmrun +p 1 +n 1./bigsimulator -linkstats -check Note: there must be a space between +p and +n and their numbers Note: the +p parameter specifies the total number of processors E.g., running on all 16 procs of each of 4 nodes (64 procs total) would be +p 64 +n 4 21 Using BigSim to Estimate Application Performance

Simple Latency Model vs. Blue Waters Model Simple Latency includes processors and nodes and implements the network with an equation: lat + (N / bw) + [cpp * (N / psize)] lat = latency in  s bw = bandwidth in GB/sec cpp = cost per packet in  s psize = packet size in bytes N = number of bytes sent Blue Waters includes processors, nodes, Torrents, and links between the Torrents 22 Using BigSim to Estimate Application Performance

Command-Line Arguments – Simple Latency Bandwidth and latency must be specified: -bw Link bandwidth in GB/s -lat Link latency in  s Other optional arguments: -help Displays all available arguments -cpp Cost per packet in  s -psize Packet size in bytes -bw_in Intra-node bandwidth in GB/s Defaults to -bw value if not specified -lat_in Intra-node latency in  s Defaults to 0.5  s if not specified 23 Using BigSim to Estimate Application Performance

Command-Line Arguments – Simple Latency (continued) -check Checks for unexecuted events at the end of the simulation -cpufactor A constant by which SEB execution times are multiplied; defaults to 1.0 -debuglevel 0: no debug statements 1: high-level debug statements and summary info -projname Sets the name of the projections logs that will be corrected based on network simulation -skip_start Sets the skip point at which simulation execution begins -skip_end Sets the skip point at which simulation execution ends -tproj Generate projections logs based only on network simulation 24 Using BigSim to Estimate Application Performance

Command-Line Arguments – Blue Waters No arguments are required; all are optional: -help Displays all available arguments -check Checks for unexecuted events at the end of the simulation -cpufactor A constant by which SEB execution times are multiplied; defaults to 1.0 -debuglevel 0: no debug statements 1: high-level debug statements and summary info -linkstats Enable link stats for display at the end of the simulation -projname Sets the name of the projections logs that will be corrected based on network simulation -skip_start Sets the skip point at which simulation execution begins -skip_end Sets the skip point at which simulation execution ends -tproj Generate projections logs based only on network simulation 25 Using BigSim to Estimate Application Performance

Command-Line Arguments – Blue Waters (continued) -tracelinkstats Enable tracing of link stats -tracecontention Enable tracing of contention 26 Using BigSim to Estimate Application Performance

BigNetSim Output – Terminal (Text) BgPrintf(char *) statements Added to actual application code “%f” in function call argument converted to committed time during simulation GVT = Global Virtual Time Final simulation virtual time expressed in GVT ticks 1 GVT tick = 1 ns for the provided binaries Link utilization statistics 27 Using BigSim to Estimate Application Performance

BigNetSim Output – Terminal (Text) – Example Charm++: standalone mode (not using charmrun) Charm++> Running on 1 unique compute nodes (8-way SMP). ================= Simulation Configuration ================= Production version: 1.0 (10/13/2010) Simulation start time: Fri Oct 15 13:11: Number of physical PEs: 1 POSE mode: Sequential Network model: Blue Waters... ============================================================ Construction phase complete Initialization phase complete Info> invoking startup task from proc 0... Info> Starting at the beginning of the simulation Info> Running to the end of the simulation Entire first pass sequence took about seconds [0:user_code] #MILC# - WHILE Loop Iterarion Starting at [0:user_code] #MILC# - LL-Fat Starting at Sequential Endtime Approximation: Final link stats [Node 0, Channel 0, LL Link]: ovt: , utilization time: , utilization %: , packets sent: 2290 gvt= Final link stats [Node 0, Channel 11, LR Link]: ovt: , utilization time: , utilization %: , packets sent: 1827 gvt= PE Simulation finished at Program finished. 28 Using BigSim to Estimate Application Performance

BigNetSim Output – Projections Copy emulation Projections logs and sts file into directory with executable Two ways to use: Command-line parameter: -projname Creates a new set of logs by updating the emulation logs Assumes emulation Projections logs are:.*.log Output: -bg.*.log Disadvantage: emulation Projections overhead included Command-line parameter: -tproj Creates a new set of logs from the trace files, ignoring the emulation logs Must first copy.sts file to tproj.sts Output: tproj.*.log Advantage: no emulation Projections overhead included 29 Using BigSim to Estimate Application Performance

Projections – Ring Example 30 Using BigSim to Estimate Application Performance Emulation Simulation: -lat 1 (latency = 1  s) generated with -tproj

BigNetSim Output – Link Stat and Contention Traces (experimental) Enabled on the command line at run time Placed in a unique folder named link_traces_ May significantly increase run time and memory footprint LinkStatTraceAnalyzer tool examines link stat traces for links with high utilization and contention Writes reports listing links in order from most to least utilized 31 Using BigSim to Estimate Application Performance

BigNetSim Validation Network traffic generator tests of the BigNetSim Blue Waters network model give simulation results within a couple percent of those of IBM’s hardware simulator 32 Using BigSim to Estimate Application Performance

BigNetSim Performance Simulation Memory Footprint Estimate (GB) Startup Time (hours) Execution Time (hours) Total Run Time (hours) Sim Lat BW Sim Lat BW Sim Lat BW Sim Lat BW 4k-VP MILC k-VP 3D Jabobi (10x10x10 grid, 3 iters) k-VP NAMD (1M atoms, 8 iters, skip startup) Using BigSim to Estimate Application Performance Examples of sequential simulator performance on Blue Print Parallel performance is comparable to sequential but does not scale well yet outside a single node on Blue Print

BigNetSim – Other Features Other network models (e.g., BlueGene) Transceiver (traffic pattern generator) for testing network models without traces Checkpoint-to-disk to allow restart if hardware on which BigNetSim is running goes down Load balancing 34 Using BigSim to Estimate Application Performance

Additional Resources BigSim manuals: Recent Charm++ Workshop tutorials and talks 2008 BigSim tutorial (bottom of page) BigSim tutorial (bottom of page) BigSim talk (near top of page) PPL for help: 35 Using BigSim to Estimate Application Performance