The BigSim Parallel Simulation System
Gengbin Zheng, Ryan Mokos
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Charm++ Workshop 2010, 4/28/2010

Outline
Overview
BigSim Emulator
BigSim Simulator

Summarizing the State of the Art: Petascale
Very powerful parallel machines exist (Jaguar, Roadrunner, etc.)
Application domains exist that need that kind of power
New generation of applications:
Use sophisticated algorithms
Dynamic adaptive refinement
Multi-scale, multi-physics
Parallel applications are more complex than sequential ones, and their performance is hard to predict without actually running them
Challenge: is it possible to simulate these applications at large scale using small clusters?

BigSim
Why BigSim, and why on Charm++?
Targets large-scale simulation
Object-based processor virtualization provides a virtualized execution environment
Efficient message-passing runtime provided by Charm++
Supports fine-grained decomposition
Portability

BigSim Infrastructure
Emulator:
A virtualized execution environment for Charm++ and MPI applications
No or small changes to MPI application source code
Facilitates code development and debugging
Simulator:
Trace-driven approach
Parallel Discrete Event Simulation
Simple latency and full network contention modeling
Predicts parallel performance at varying levels of resolution

Architecture of BigSim
[Diagram: Charm++/MPI applications run on the AMPI Runtime and BigSim Emulator, which sit on the Charm++ Runtime; the emulator produces simulation trace logs, which feed the BigSim Simulator (built on POSE), whose output drives performance visualization (Projections)]

MPI Alltoall Timeline
[Screenshot: Projections timeline of an MPI all-to-all run]

BigSim Emulator
Emulates the full target machine on existing machines by actually running the parallel program
E.g., NAMD on 256K target processors using 8K cores of the Ranger cluster
Implemented on Charm++, as libraries that link to the user application
Simple architecture abstraction: many multiprocessor (SMP) nodes connected via message passing
Does not emulate at the instruction level

BigSim Emulator: Functional View
[Diagram: each physical processor runs the Converse scheduler and Converse queue, hosting multiple target nodes; each target node has an incoming queue, a node-level queue, and processor-level queues serving its worker processors and communication processors]

Processor Virtualization
[Diagram: user view vs. system view]
Programmer: decomposes the computation into objects
Runtime: maps the computation onto the processors

Major Challenges
Running multiple copies of the code on each processor:
Shared global variables: Charm++ applications already handle this
AMPI: global/static variables handled via runtime techniques and compiler tools (see the sketch below)
E.g., NAMD on 1024 target processors using 8 cores
Simulation time
Memory footprint:
Global read-only variables can be shared
Out-of-core execution
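
To see why shared globals are a problem when many emulated target processors share one OS process, here is a minimal C++ sketch. It is purely illustrative: AMPI privatizes globals with runtime and compiler techniques over user-level threads, whereas this sketch uses OS threads and thread_local just to show the idea.

#include <cstdio>
#include <thread>
#include <vector>

int g_rank = -1;               // shared global: emulated ranks clobber it
thread_local int t_rank = -1;  // privatized: one copy per emulated rank

void emulatedProc(int rank) {
    g_rank = rank;   // races with the other emulated processors
    t_rank = rank;   // safe: each thread sees its own copy
    std::printf("rank %d sees global=%d private=%d\n", rank, g_rank, t_rank);
}

int main() {
    std::vector<std::thread> procs;
    for (int r = 0; r < 4; ++r) procs.emplace_back(emulatedProc, r);
    for (auto& t : procs) t.join();
}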

NAMD Emulation
Only a 19x slowdown
Only a 7x increase in memory

Out-of-core Emulation
Motivation: applications with a large memory footprint that the VM system cannot handle well
Use the hard drive; similar to checkpointing
Message-driven execution: peek at the message queue to see what executes next, and prefetch it (see the sketch below)
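
A minimal sketch of that prefetch idea (hypothetical, not BigSim's implementation): because execution is message-driven, the scheduler can peek at the next queued message, see which emulated processor it targets, and start bringing that processor's state in from disk before it runs.

#include <cstdio>
#include <deque>

struct Message { int targetProc; };

void loadFromDisk(int proc)    { std::printf("load state of proc %d\n", proc); }
void evictToDisk(int proc)     { std::printf("evict state of proc %d\n", proc); }
void execute(const Message& m) { std::printf("execute on proc %d\n", m.targetProc); }

void schedulerLoop(std::deque<Message>& queue) {
    while (!queue.empty()) {
        Message m = queue.front();
        queue.pop_front();
        loadFromDisk(m.targetProc);                  // ensure state is resident
        if (!queue.empty())
            loadFromDisk(queue.front().targetProc);  // prefetch the next target
        execute(m);
        evictToDisk(m.targetProc);                   // free memory when tight
    }
}

int main() {
    std::deque<Message> q{{0}, {1}, {0}};
    schedulerLoop(q);
}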

What is in the Trace Logs?
[Diagram: traces for 2 target processors]
Each SEB (Sequential Execution Block) has (sketched below):
startTime, endTime
Incoming message ID
Outgoing messages
Dependences
Tools for reading bgTrace binary files:
1. charm/example/bigsim/tools/loadlog: converts to a human-readable format
2. charm/example/bigsim/tools/log2proj: converts to trace Projections log files
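
For concreteness, here is an illustrative C++ view of the record each SEB carries, following the bullet list above; the field names and types are assumptions, not the actual bgTrace binary layout.

#include <vector>

struct MessageID { int srcPe; int seq; };  // hypothetical message identifier

struct SEB {
    double startTime;                  // when the block began executing
    double endTime;                    // when it finished
    MessageID incomingMsg;             // message that triggered this block
    std::vector<MessageID> outgoing;   // messages the block sent
    std::vector<int> dependences;      // earlier SEBs that must finish first
};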

BigSim Simulator: BigNetSim
Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is itself built on Charm++
Parallel Discrete Event Simulation
Emulator traces are passed through different network models in BigNetSim to get the final performance results
Details on using BigNetSim: hop2009/slides/tut_BigSim09.ppt; manual.html

POSE
Network layer constructs (NIC, switch, node, etc.) implemented as poser simulation objects
Network data constructs (message, packet, etc.) implemented as event methods on those simulation objects (see the sketch below)
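
A conceptual sketch of the poser idea in plain C++ (POSE declares posers through its own Charm++-based interface; this is not POSE syntax). Each network construct is a simulation object that advances its own virtual time, and each network data construct arrives as a timestamped event; the per-byte cost used here is an arbitrary illustrative number.

#include <cstdio>

struct Packet { int dstPort; int bytes; };  // hypothetical event payload

class Switch {                // would be declared as a poser in POSE
    double ovt = 0.0;         // this object's virtual time
public:
    // Event method: invoked at virtual time ts when a packet arrives.
    void recvPacket(double ts, const Packet& p) {
        if (ts > ovt) ovt = ts;      // advance local virtual time to arrival
        ovt += 0.5e-6 * p.bytes;     // assumed per-byte switching cost
        std::printf("switch: packet to port %d done at ovt=%g s\n",
                    p.dstPort, ovt);
        // ...would then forward the packet as a new event on the next poser
    }
};

int main() {
    Switch s;
    s.recvPacket(1.0e-6, Packet{3, 1024});
}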

Posers
Each poser is a tiny simulation

Performance Prediction
Two components:
Time to execute blocks of sequential, computational code (SEBs = Sequential Execution Blocks)
Communication time, based on a particular network topology

Sequential Time Prediction (Emulator)
Manual: advance processor time using BgElapse() calls in the application code (see the sketch below)
Wallclock time: use a multiplier (scale factor) to account for architecture differences
Performance counters: count instructions with hardware counters; use the expected time of each instruction on the target machine to derive the execution time
Instruction-level simulation (e.g., Mambo): record cycle-accurate execution times for functions; use the interpolation tool to replace SEB times
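
A sketch combining the slide's manual BgElapse() approach with its wallclock scale-factor idea. BgElapse() is the call named on the slide, but its exact signature and the 0.5x scale factor are assumptions for illustration.

#include <chrono>

// Emulator call named on the slide; declaration assumed for illustration.
void BgElapse(double seconds);

static double wallclock() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

void computeStep(double* a, int n) {
    double t0 = wallclock();
    for (int i = 0; i < n; ++i) a[i] = a[i] * 1.0001 + 2.0;
    // Advance this target processor's virtual clock, pretending the target
    // machine runs this kernel 2x faster than the host (scale = 0.5).
    BgElapse(0.5 * (wallclock() - t0));
}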

Sequential Time Prediction (continued)
Model-based (recent work):
Performed after emulation
Determine the application functions responsible for most of the computation time
Run these functions on the target machine, obtaining run times as a function of their parameters to build a model
Feed the emulation traces through an offline modeling tool (like the interpolation tool) to replace SEB times
Generates a corrected set of traces

Communication Time Prediction (Simulator)
Valid for a particular network topology
Generic: Simple Latency model; a formula predicts time from latency and bandwidth parameters (see below)
Specific: BlueGene, Blue Waters, and others
Latency-only option: uses a formula specific to the network
Full contention modeling
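
The slide names only latency and bandwidth parameters; the customary form of such a model (an assumption here, not a formula quoted from BigNetSim) is

T_{\mathrm{msg}}(n) = \alpha + \frac{n}{\beta}

where $n$ is the message size in bytes, $\alpha$ the per-message latency, and $\beta$ the network bandwidth.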

Specific Model (Full Network)
[Diagram: BGnode, BGproc, Net Interface, Switch, Transceiver, Channel]

Generic Model (Simple Latency)
[Diagram: BGnode, BGproc, Net Interface, Switch, Transceiver, Channel]

What We Model
Processors
Nodes
NICs
Switches/hubs
Channels
Packet-level direct and indirect routing
Buffers with a credit scheme (sketched below)
Virtual channels
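
A minimal sketch of credit-based buffer flow control, the general technique the slide names (an assumption about the mechanism, not BigNetSim's code): the sender may inject a packet only while it holds credits, one per free slot in the receiver's buffer, and draining a packet returns a credit upstream.

#include <cstdio>

class CreditLink {
    int credits;  // free buffer slots at the downstream switch
public:
    explicit CreditLink(int bufferSlots) : credits(bufferSlots) {}

    bool trySend() {                     // called by the upstream sender
        if (credits == 0) return false;  // stall: receiver buffer is full
        --credits;
        return true;
    }
    void returnCredit() { ++credits; }   // downstream drained one packet
};

int main() {
    CreditLink link(2);                   // assume a 2-slot downstream buffer
    std::printf("%d\n", link.trySend());  // 1: first slot claimed
    std::printf("%d\n", link.trySend());  // 1: second slot claimed
    std::printf("%d\n", link.trySend());  // 0: stalled, no credits left
    link.returnCredit();                  // a packet drained downstream
    std::printf("%d\n", link.trySend());  // 1: can send again
}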

Other BigNetSim Features
Skip points: set skip points in the application code (e.g., after startup); simulate only between skip points
Transceiver: traffic pattern generator that replaces nodes and processors
Windowing: set the file window size to decrease the memory footprint; can cut the footprint in half or better, depending on trace structure
Checkpoint-to-disk (recent work): saves simulator state at a time or GVT interval so the simulation can be restarted if a crash occurs

BigNetSim Tools
Located in BigNetSim/trunk/tools
Log Analyzer: provides info about a set of traces (number of events per simulated processor, number of messages sent)
Log Transformation (recently completed): produces a new set of traces with remapped objects; useful for testing load-balancing scenarios

BigNetSim Output
BgPrintf() statements: added to the application code; each "%f" is converted to the committed time during simulation (example below)
GVT = Global Virtual Time; each GVT tick = 1/factor seconds, where factor is defined in BigNetSim/trunk/Main/TCsim.h
Link utilization statistics
Projections traces: use the -tproj command-line parameter
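
An example of the BgPrintf() usage the slide describes; the single-format-string declaration is an assumption, so check the BigSim manual for the exact signature.

// Declaration assumed for illustration.
void BgPrintf(const char* format);

void timestep(int iter) {
    // ... computation and communication for this iteration ...
    if (iter % 10 == 0)
        // "%f" is replaced with the committed virtual time at this point.
        BgPrintf("Completed a 10-iteration block at %f\n");
}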

BigNetSim Output Example

Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on 0 0
Info> timing factor e
Info> invoking startup task from proc 0...
[0:RECV_RESUME] Start of major loop at
[0:RECV_RESUME] End of major loop at
Simulation inactive at time:
Final GVT =
Final link stats [Node 0, Channel 0, ### Link]: ovt: , utilization time: , utilization %: , packets sent:
gvt=
Final link stats [Node 0, Channel 3, ### Link]: ovt: , utilization time: , utilization %: , packets sent: 4259
gvt=
PE Simulation finished at
Program finished.

Ring Projections Timeline
[Screenshot: Projections timeline]

BigNetSim Performance
Examples of sequential simulator performance on Blue Print:
4k-VP MILC:
Startup time: 0.7 hours
Execution time: 5.6 hours
Total run time: 6.3 hours
Memory footprint: ~3.1 GB
256k-VP 3D Jacobi (10x10x10 grid, 3 iterations):
Startup time: 0.5 hours
Execution time: 1.5 hours
Total run time: 2.0 hours
Memory footprint: ~20 GB
Still tuning parallel simulator performance

Thank you!
Free download of Charm++ and BigSim:
Send questions and comments to: