SC’13: Hands-on Practical Hybrid Parallel Application Performance Engineering
Introduction to Parallel Performance Engineering
Markus Geimer, Brian Wylie (Jülich Supercomputing Centre)
(with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray)

Performance: an old problem
“The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible.”
Charles Babbage (1791 – 1871), Difference Engine

Today: the “free lunch” is over
■ Moore's law is still in charge, but
  ■ Clock rates no longer increase
  ■ Performance gains only through increased parallelism
■ Optimization of applications more difficult
  ■ Increasing application complexity
    ■ Multi-physics
    ■ Multi-scale
  ■ Increasing machine complexity
    ■ Hierarchical networks / memory
    ■ More CPUs / multi-core
⇒ Every doubling of scale reveals a new bottleneck!

Example: XNS
■ CFD simulation of unsteady flows
■ Developed by CATS / RWTH Aachen
■ Exploits finite-element techniques, unstructured 3D meshes, iterative solution strategies
■ MPI parallel version
  ■ >40,000 lines of Fortran & C
■ DeBakey blood-pump data set (3,714,611 elements)
[Figures: hæmodynamic flow pressure distribution; partitioned finite-element mesh]

XNS wait-state analysis on BG/L (2007)

Performance factors of parallel applications
■ “Sequential” factors
  ■ Computation ⇒ choose the right algorithm, use an optimizing compiler
  ■ Cache and memory ⇒ tough! Only limited tool support; hope the compiler gets it right
  ■ Input / output ⇒ often not given enough attention
■ “Parallel” factors
  ■ Partitioning / decomposition
  ■ Communication (i.e., message passing)
  ■ Multithreading
  ■ Synchronization / locking
  ⇒ more or less understood, good tool support

Tuning basics
■ Successful engineering is a combination of
  ■ The right algorithms and libraries
  ■ Compiler flags and directives
  ■ Thinking !!!
■ Measurement is better than guessing
  ■ To determine performance bottlenecks
  ■ To compare alternatives
  ■ To validate tuning decisions and optimizations
    ⇒ After each step!

However…
■ It's easier to optimize a slow correct program than to debug a fast incorrect one
  ⇒ Nobody cares how fast you can compute a wrong answer...
“We should forget about small efficiencies, say 97% of the time: premature optimization is the root of all evil.”
Charles A. R. Hoare

Performance engineering workflow
■ Preparation: prepare application (with symbols), insert extra code (probes/hooks)
■ Measurement: collection of data relevant to execution performance analysis
■ Analysis: calculation of metrics, identification of performance problems
■ Examination: presentation of results in an intuitive/understandable form
■ Optimization: modifications intended to eliminate/reduce performance problems

The 80/20 rule
■ Programs typically spend 80% of their time in 20% of the code
■ Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application
  ⇒ Know when to stop!
■ Don't optimize what does not matter
  ⇒ Make the common case fast!
“If you optimize everything, you will always be unhappy.”
Donald E. Knuth

Metrics of performance
■ What can be measured?
  ■ A count of how often an event occurs
    ■ E.g., the number of MPI point-to-point messages sent
  ■ The duration of some interval
    ■ E.g., the time spent in these send calls
  ■ The size of some parameter
    ■ E.g., the number of bytes transmitted by these calls
■ Derived metrics
  ■ E.g., rates / throughput
  ■ Needed for normalization
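As a concrete illustration of how a derived metric is obtained from basic counts, durations, and sizes, here is a minimal C sketch; the variable names, the report_send_metrics helper, and the numbers are hypothetical, not part of any particular tool:

    #include <stdio.h>

    /* Hypothetical raw measurements for one process, as a measurement
     * system might have collected them during a run. */
    struct send_stats {
        long long messages;   /* count:    number of MPI point-to-point sends */
        double    seconds;    /* duration: total time spent in these sends    */
        long long bytes;      /* size:     total payload transmitted          */
    };

    /* Derive normalized metrics (rates / throughput) from the raw values. */
    static void report_send_metrics(const struct send_stats *s)
    {
        double msg_rate  = s->messages / s->seconds;        /* messages per second */
        double bandwidth = s->bytes / s->seconds / 1.0e6;   /* MB per second       */
        double avg_size  = (double)s->bytes / s->messages;  /* bytes per message   */

        printf("rate: %.1f msg/s, bandwidth: %.2f MB/s, avg size: %.0f B\n",
               msg_rate, bandwidth, avg_size);
    }

    int main(void)
    {
        struct send_stats s = { 120000, 3.5, 480000000LL };  /* made-up example numbers */
        report_send_metrics(&s);
        return 0;
    }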

Example metrics
■ Execution time
■ Number of function calls
■ CPI
  ■ CPU cycles per instruction
■ FLOPS
  ■ Floating-point operations executed per second
  ■ But what counts as an operation: “math” operations? HW operations? HW instructions? 32-/64-bit? …
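A hardware-counter metric such as CPI can be obtained, for example, with the PAPI library. Below is a minimal sketch assuming PAPI is installed and the preset events PAPI_TOT_CYC and PAPI_TOT_INS are available on the machine (link with -lpapi; error checking omitted):

    #include <stdio.h>
    #include <papi.h>

    static double work(void)
    {
        double s = 0.0;
        for (int i = 1; i < 10000000; i++)
            s += 1.0 / i;
        return s;
    }

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long counts[2];

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles       */
        PAPI_add_event(eventset, PAPI_TOT_INS);   /* total instructions */

        /* Count only around the region of interest. */
        PAPI_start(eventset);
        volatile double result = work();
        PAPI_stop(eventset, counts);

        printf("result=%f cycles=%lld instructions=%lld CPI=%.2f\n",
               result, counts[0], counts[1],
               (double)counts[0] / (double)counts[1]);
        return 0;
    }

A FLOPS rate could be derived analogously from a floating-point preset (e.g., PAPI_FP_OPS, where supported) divided by the elapsed time.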

Execution time
■ Wall-clock time
  ■ Includes waiting time: I/O, memory, other system activities
  ■ In time-sharing environments also the time consumed by other applications
■ CPU time
  ■ Time spent by the CPU to execute the application
  ■ Does not include time the program was context-switched out
    ■ Problem: does not include inherent waiting time (e.g., I/O)
    ■ Problem: portability? What is user time, what is system time?
■ Problem: execution time is non-deterministic
  ■ Use mean or minimum of several runs
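To make the wall-clock vs. CPU time distinction concrete, here is a small sketch assuming a POSIX system (Linux/glibc); the sleep() shows up only in the wall-clock measurement:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Read a clock and return seconds as a double. */
    static double seconds(clockid_t clk)
    {
        struct timespec ts;
        clock_gettime(clk, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double wall0 = seconds(CLOCK_MONOTONIC);
        double cpu0  = seconds(CLOCK_PROCESS_CPUTIME_ID);

        double s = 0.0;
        for (long i = 1; i < 50000000; i++)   /* compute: consumes CPU time            */
            s += 1.0 / i;
        sleep(1);                             /* wait: consumes wall-clock time only   */

        double wall = seconds(CLOCK_MONOTONIC) - wall0;
        double cpu  = seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;

        printf("s=%f wall=%.2fs cpu=%.2fs\n", s, wall, cpu);
        return 0;
    }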

Inclusive vs. Exclusive values
■ Inclusive
  ■ Information of all sub-elements aggregated into single value
■ Exclusive
  ■ Information cannot be subdivided further
Example: the inclusive value of foo covers the whole function body, including the call to bar(); the exclusive value of foo excludes what is spent inside bar():
    int foo() {
      int a;
      a = 1 + 1;
      bar();
      a = a + 1;
      return a;
    }
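In a call-tree profile, exclusive values can be derived from inclusive ones by subtracting the inclusive values of the direct children. A minimal sketch with a hypothetical profile-node layout:

    #include <stdio.h>

    /* Hypothetical call-tree node of a profile. */
    struct node {
        const char  *name;
        double       inclusive_time;   /* time including all callees */
        int          nchildren;
        struct node *children;
    };

    /* Exclusive time = inclusive time minus the inclusive time of the
     * direct children (time spent in the region itself only). */
    static double exclusive_time(const struct node *n)
    {
        double t = n->inclusive_time;
        for (int i = 0; i < n->nchildren; i++)
            t -= n->children[i].inclusive_time;
        return t;
    }

    int main(void)
    {
        struct node bar = { "bar", 2.0, 0, NULL };
        struct node foo = { "foo", 5.0, 1, &bar };
        printf("exclusive(foo) = %.1f s\n", exclusive_time(&foo));  /* prints 3.0 s */
        return 0;
    }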

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing
■ How is performance data analyzed?
  ■ Online
  ■ Post mortem

Sampling
■ Running program is periodically interrupted to take measurement
  ■ Timer interrupt, OS signal, or HWC overflow
  ■ Service routine examines return-address stack
  ■ Addresses are mapped to routines using symbol table information
■ Statistical inference of program behavior
  ■ Not very detailed information on highly volatile metrics
  ■ Requires long-running applications
■ Works with unmodified executables
Example program from the timeline figure (measurements are taken at periodic sample points t1 … t9 while main, foo(0), foo(1), foo(2) execute):
    int main() {
      int i;
      for (i = 0; i < 3; i++)
        foo(i);
      return 0;
    }
    void foo(int i) {
      if (i > 0)
        foo(i - 1);
    }
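For illustration, a minimal sampling sketch on POSIX/Linux: a profiling timer (SIGPROF) periodically interrupts the program and a handler records a sample. A real sampling profiler would read the program counter and return-address stack from the signal context and map addresses to routines via the symbol table; this hypothetical sketch only counts how many samples fall into a marked region:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t in_region   = 0;
    static volatile sig_atomic_t region_hits = 0;
    static volatile sig_atomic_t total_hits  = 0;

    /* Sample handler: invoked on every SIGPROF tick. */
    static void on_sample(int sig)
    {
        (void)sig;
        total_hits++;
        if (in_region)
            region_hits++;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sample;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGPROF, &sa, NULL);

        /* Fire SIGPROF every 10 ms of CPU time consumed by the process. */
        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
        setitimer(ITIMER_PROF, &it, NULL);

        double s = 0.0;
        in_region = 1;
        for (long i = 1; i < 100000000; i++)   /* region of interest */
            s += 1.0 / i;
        in_region = 0;
        for (long i = 1; i < 50000000; i++)    /* other work */
            s += 1.0 / i;

        printf("s=%f  %d of %d samples fell into the region\n",
               s, (int)region_hits, (int)total_hits);
        return 0;
    }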

Instrumentation
■ Measurement code is inserted such that every event of interest is captured directly
  ■ Can be done in various ways
■ Advantage:
  ■ Much more detailed information
■ Disadvantage:
  ■ Processing of source code / executable necessary
  ■ Large relative overheads for small functions
Example program from the previous slide with enter/leave probes inserted (every probe produces a measurement at the corresponding point in time t1 … t14 for main, foo(0), foo(1), foo(2)):
    int main() {
      Enter("main");
      int i;
      for (i = 0; i < 3; i++)
        foo(i);
      Leave("main");
      return 0;
    }
    void foo(int i) {
      Enter("foo");
      if (i > 0)
        foo(i - 1);
      Leave("foo");
    }

Instrumentation techniques
■ Static instrumentation
  ■ Program is instrumented prior to execution
■ Dynamic instrumentation
  ■ Program is instrumented at runtime
■ Code is inserted
  ■ Manually
  ■ Automatically
    ■ By a preprocessor / source-to-source translation tool
    ■ By a compiler
    ■ By linking against a pre-instrumented library / runtime system
    ■ By binary-rewrite / dynamic instrumentation tool
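As an example of compiler-based instrumentation, GCC and Clang insert calls to two hooks at every function entry and exit when code is compiled with -finstrument-functions. The measurement logic in the hooks below is only a sketch; a real tool would record timestamps and map the addresses to names (e.g., with addr2line):

    #include <stdio.h>

    /* Build (example): gcc -finstrument-functions demo.c -o demo */

    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site)
    {
        fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site)
    {
        fprintf(stderr, "leave %p (called from %p)\n", fn, call_site);
    }

    static void work(void) { /* instrumented automatically by the compiler */ }

    int main(void)
    {
        work();
        return 0;
    }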

Critical issues
■ Accuracy
  ■ Intrusion overhead
    ■ Measurement itself needs time and thus lowers performance
  ■ Perturbation
    ■ Measurement alters program behaviour
    ■ E.g., memory access pattern
  ■ Accuracy of timers & counters
■ Granularity
  ■ How many measurements?
  ■ How much information / processing during each measurement?
⇒ Tradeoff: accuracy vs. expressiveness of data

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing
■ How is performance data analyzed?
  ■ Online
  ■ Post mortem

Profiling / Runtime summarization
■ Recording of aggregated information
  ■ Total, maximum, minimum, …
■ For measurements
  ■ Time
  ■ Counts
    ■ Function calls
    ■ Bytes transferred
    ■ Hardware counters
■ Over program and system entities
  ■ Functions, call sites, basic blocks, loops, …
  ■ Processes, threads
⇒ Profile = summarization of events over execution interval
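A sketch of what runtime summarization amounts to: instead of storing every event, only a small aggregate per region is updated on each visit. The record layout and names below are illustrative, not the format of any particular tool:

    #include <float.h>
    #include <stdio.h>

    struct region_profile {
        const char *name;
        long long   visits;       /* how often the region was entered */
        double      total_time;   /* summed over all visits           */
        double      min_time;
        double      max_time;
    };

    /* Update the aggregate whenever the region is left. */
    static void record_visit(struct region_profile *p, double duration)
    {
        p->visits++;
        p->total_time += duration;
        if (duration < p->min_time) p->min_time = duration;
        if (duration > p->max_time) p->max_time = duration;
    }

    int main(void)
    {
        struct region_profile foo = { "foo", 0, 0.0, DBL_MAX, 0.0 };

        /* Pretend three timed visits to region "foo" were measured. */
        record_visit(&foo, 0.12);
        record_visit(&foo, 0.10);
        record_visit(&foo, 0.31);

        printf("%s: visits=%lld total=%.2fs min=%.2fs max=%.2fs\n",
               foo.name, foo.visits, foo.total_time, foo.min_time, foo.max_time);
        return 0;
    }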

Types of profiles
■ Flat profile
  ■ Shows distribution of metrics per routine / instrumented region
  ■ Calling context is not taken into account
■ Call-path profile
  ■ Shows distribution of metrics per executed call path
  ■ Sometimes only distinguished by partial calling context (e.g., two levels)
■ Special-purpose profiles
  ■ Focus on specific aspects, e.g., MPI calls or OpenMP constructs
  ■ Comparing processes/threads
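The difference between the two profile types is essentially the key under which measurements are accumulated. A small sketch (all region names hypothetical) that builds a call-path key from a runtime stack of region names; a flat profile would use only the innermost name:

    #include <stdio.h>
    #include <string.h>

    #define MAX_DEPTH 64

    static const char *stack[MAX_DEPTH];   /* current chain of entered regions */
    static int depth = 0;

    static void region_enter(const char *name) { stack[depth++] = name; }
    static void region_leave(void)             { depth--; }

    /* Concatenate the region stack into a call-path key, e.g. "main/solve/MPI_Send". */
    static void build_callpath_key(char *buf, size_t len)
    {
        buf[0] = '\0';
        for (int i = 0; i < depth; i++) {
            if (i > 0) strncat(buf, "/", len - strlen(buf) - 1);
            strncat(buf, stack[i], len - strlen(buf) - 1);
        }
    }

    int main(void)
    {
        char key[256];

        region_enter("main");
        region_enter("solve");
        region_enter("MPI_Send");

        build_callpath_key(key, sizeof key);
        printf("flat key:      %s\n", stack[depth - 1]);  /* MPI_Send            */
        printf("call-path key: %s\n", key);               /* main/solve/MPI_Send */

        region_leave();
        region_leave();
        region_leave();
        return 0;
    }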

Tracing
■ Recording information about significant points (events) during execution of the program
  ■ Enter / leave of a region (function, loop, …)
  ■ Send / receive a message, …
■ Save information in event record
  ■ Timestamp, location, event type
  ■ Plus event-specific information (e.g., communicator, sender / receiver, …)
■ Abstract execution model on level of defined events
⇒ Event trace = chronologically ordered sequence of event records
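A sketch of what an event record might contain; real trace formats (e.g., OTF2) differ in detail, and the layout below is purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    enum event_type { EV_ENTER, EV_LEAVE, EV_SEND, EV_RECV };

    struct event {
        double          timestamp;   /* when the event happened              */
        uint32_t        location;    /* which process/thread it happened on  */
        enum event_type type;
        union {                      /* event-specific payload               */
            struct { uint32_t region; } enter_leave;            /* region id */
            struct { uint32_t peer; uint32_t tag;
                     uint64_t bytes; } message;                 /* send/recv */
        } data;
    };

    int main(void)
    {
        /* Two example records: process 0 enters region 3, then sends
         * 1024 bytes with tag 42 to process 1. */
        struct event trace[2] = {
            { 0.000153, 0, EV_ENTER, { .enter_leave = { 3 } } },
            { 0.000161, 0, EV_SEND,  { .message = { 1, 42, 1024 } } },
        };

        for (int i = 0; i < 2; i++)
            printf("t=%.6f loc=%u type=%d\n",
                   trace[i].timestamp, trace[i].location, trace[i].type);
        return 0;
    }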

Event tracing
■ Example: process A executes foo(), which sends a message to process B; process B executes bar(), which receives it
    Process A:  void foo() { ... send(B, tag, buf); ... }
    Process B:  void bar() { ... recv(A, tag, buf); ... }
■ Instrumentation inserts calls to the trace monitor around the events of interest
    Process A:  void foo() { trc_enter("foo"); ... trc_send(B); send(B, tag, buf); ... trc_exit("foo"); }
    Process B:  void bar() { trc_enter("bar"); ... recv(A, tag, buf); trc_recv(A); ... trc_exit("bar"); }
■ Each monitor writes a local trace with synchronized timestamps
    Local trace A (1 = foo):  58 ENTER 1,  62 SEND B,  64 EXIT 1,  ...
    Local trace B (1 = bar):  60 ENTER 1,  68 RECV A,  69 EXIT 1,  ...
■ Merging the local traces and unifying their definitions yields the global trace view (1 = foo, 2 = bar)
    58 A ENTER 1,  60 B ENTER 2,  62 A SEND B,  64 A EXIT 1,  68 B RECV A,  69 B EXIT 2,  ...

Tracing vs. Profiling
■ Tracing advantages
  ■ Event traces preserve the temporal and spatial relationships among individual events (⇒ context)
  ■ Allows reconstruction of dynamic application behaviour on any required level of abstraction
  ■ Most general measurement technique
    ■ Profile data can be reconstructed from event traces
■ Disadvantages
  ■ Traces can very quickly become extremely large
  ■ Writing events to file at runtime causes perturbation
  ■ Writing tracing software is complicated
    ■ Event buffering, clock synchronization, ...

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing
■ How is performance data analyzed?
  ■ Online
  ■ Post mortem

Online analysis
■ Performance data is processed during measurement run
  ■ Process-local profile aggregation
  ■ More sophisticated inter-process analysis using
    ■ “Piggyback” messages
    ■ Hierarchical network of analysis agents
■ Inter-process analysis often involves application steering to interrupt and re-configure the measurement

Post-mortem analysis
■ Performance data is stored at end of measurement run
■ Data analysis is performed afterwards
  ■ Automatic search for bottlenecks
  ■ Visual trace analysis
  ■ Calculation of statistics

Example: Time-line visualization
[Figure: the global trace from the event-tracing example (1 = foo, 2 = bar; 58 A ENTER 1, 60 B ENTER 2, 62 A SEND B, 64 A EXIT 1, 68 B RECV A, 69 B EXIT 2, ...) rendered as a time line with one row per process (A, B) showing the time spent in main, foo, and bar]

No single solution is sufficient!
⇒ A combination of different methods, tools and techniques is typically needed!
■ Analysis
  ■ Statistics, visualization, automatic analysis, data mining, ...
■ Measurement
  ■ Sampling / instrumentation, profiling / tracing, ...
■ Instrumentation
  ■ Source code / binary, manual / automatic, ...

Typical performance analysis procedure
■ Do I have a performance problem at all?
  ■ Time / speedup / scalability measurements
■ What is the key bottleneck (computation / communication)?
  ■ MPI / OpenMP / flat profiling
■ Where is the key bottleneck?
  ■ Call-path profiling, detailed basic block profiling
■ Why is it there?
  ■ Hardware counter analysis
  ■ Trace selected parts (to keep trace size manageable)
■ Does the code have scalability problems?
  ■ Load imbalance analysis, compare profiles at various sizes function-by-function
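For the first step, a quick speedup/efficiency table is often enough to decide whether further analysis is worthwhile. A tiny sketch with made-up run times:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical measured run times of the same problem on 1..16 processes. */
        const int    procs[]  = { 1, 2, 4, 8, 16 };
        const double time_s[] = { 100.0, 52.0, 27.5, 15.8, 10.9 };
        const int    n = 5;

        printf("%6s %10s %10s %12s\n", "procs", "time [s]", "speedup", "efficiency");
        for (int i = 0; i < n; i++) {
            double speedup    = time_s[0] / time_s[i];   /* S(p) = T(1) / T(p) */
            double efficiency = speedup / procs[i];      /* E(p) = S(p) / p    */
            printf("%6d %10.1f %10.2f %11.0f%%\n",
                   procs[i], time_s[i], speedup, 100.0 * efficiency);
        }
        return 0;
    }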