SE-292 High Performance Computing Profiling and Performance R. Govindarajan


2 Performance Measurement and Tuning
- Tools help you measure the performance of a program.
- Determining program execution time:
    % time a.out
    real 0m0.019s
    user 0m0.014s
    sys 0m0.002s
- time reports elapsed (real) time, user CPU time, and system CPU time.
- Other tools identify the parts of your program that matter most for performance improvement.
- Concentrate optimization efforts on those parts.

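3 Amdahl's Law
- Which part of the program should you optimize?
- Amdahl's Law: speedup is limited by the part of the program that does not benefit from the optimization.
- In other words, Sp ≤ 1/s, where s is the fraction of execution time that the optimization does not improve.
- This implies that you should concentrate on the part of the program where the most time is spent.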
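As a worked illustration of the bound, the sketch below computes the overall speedup when the part of the program that does benefit is made k times faster. Here s is the fraction of execution time that does not benefit from the optimization. The function name `amdahl_speedup` and the example value s = 0.2 are assumptions chosen for illustration; they do not come from the slides.

```python
def amdahl_speedup(s: float, k: float) -> float:
    """Overall speedup when a fraction (1 - s) of the run time becomes k times faster.

    s is the fraction of execution time that does not benefit from the optimization.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 / (s + (1.0 - s) / k)


# Hypothetical program: 20% of the time is in code the optimization cannot touch.
s = 0.2
for k in (2, 10, 1000):
    print(f"k = {k:5d}: speedup = {amdahl_speedup(s, k):.2f} (bound 1/s = {1 / s:.2f})")
```

However large k becomes, the result stays below 1/s, which is why the untouched part of the program decides how much an optimization can pay off.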

4 Timing
- Timing: measuring the time spent in specific parts of your program.
- Examples of "parts": functions, loops, ...
- Recall the different kinds of time that can be measured: real (wall-clock, elapsed) time versus virtual (CPU) time.
1. Decide which time you are interested in measuring, and at what granularity.
2. Find out which mechanisms are available and what their granularity of measurement is.

5 Timing Mechanisms
- gettimeofday: real time in seconds and microseconds since 00:00 on 1/1/1970.
  - Q: When does a 32-bit seconds value overflow?
- getrusage
- times system call
- High-resolution timers, for example gethrtime.
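The mechanisms above can be tried out from Python on a Unix system, where `time.time()` reports seconds since the epoch, and `resource.getrusage` and `os.times` wrap the system calls of the same names. This is a minimal sketch, not the course's own code: `measure` and `busy_loop` are hypothetical names, `time.perf_counter()` stands in for a high-resolution timer such as gethrtime, and the overflow calculation assumes a signed 32-bit seconds counter.

```python
import datetime
import os
import resource
import time


def measure(work):
    """Run work() once and return elapsed wall-clock, user CPU, and system CPU seconds."""
    wall_start = time.perf_counter()                      # high-resolution timer
    ru_start = resource.getrusage(resource.RUSAGE_SELF)   # CPU time used so far
    work()
    ru_end = resource.getrusage(resource.RUSAGE_SELF)
    wall_end = time.perf_counter()
    return {
        "real": wall_end - wall_start,
        "user": ru_end.ru_utime - ru_start.ru_utime,
        "system": ru_end.ru_stime - ru_start.ru_stime,
    }


def busy_loop(n=200_000):
    """Hypothetical CPU-bound workload used only to have something to time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


timings = measure(busy_loop)
print(timings)
print("seconds since the epoch:", time.time())
print("os.times():", os.times())

# Last second representable in a signed 32-bit seconds counter.
last_second = datetime.datetime.fromtimestamp(2**31 - 1, tz=datetime.timezone.utc)
print("signed 32-bit seconds value overflows after:", last_second.isoformat())
```

For a short workload the user and system values may be reported as zero, which is a direct view of the granularity question raised on the previous slide.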

6 Profiling
- Profiler: a tool that helps you identify the "important" parts of your program, so that you know where to concentrate your optimization efforts.
- Profile: a breakdown of execution time across the different parts of the program.
- Profiling can be done by adding statements to your program (instrumentation), so that data is gathered during execution, written out, and possibly processed later.
- Automation: a profiling tool adds those instructions to your program for you.
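To make the idea of instrumentation concrete, here is a minimal sketch in which a decorator plays the role of the added statements: every call updates an execution count and an accumulated time for the function. The names `instrument`, `profile_data`, `foo`, and `bar` are hypothetical, and this is not how prof or pixie are implemented.

```python
import functools
import time
from collections import defaultdict

# Profile data gathered during execution: function name -> [call count, total seconds].
profile_data = defaultdict(lambda: [0, 0.0])


def instrument(func):
    """Wrap func so that each call updates its execution count and accumulated time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = profile_data[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@instrument
def foo(n):
    return sum(i * i for i in range(n))


@instrument
def bar(n):
    return [foo(n) for _ in range(3)]


bar(1000)
foo(10)

# Process the gathered data after the run. Times include time spent in callees.
for name, (calls, seconds) in sorted(profile_data.items()):
    print(f"{name:>4}: calls = {calls}, seconds = {seconds:.6f}")
```

The counts are exact, while the times include the cost of the instrumentation itself, so the measured program is slightly slower than the original.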

7 Profiling Mechanisms
- Levels of granularity typically supported:
  - Function level
  - Statement level
  - Basic block level: a basic block is a sequence of contiguous instructions in a program with a single entry point (the first instruction in the basic block) and a single exit point (the last instruction in the basic block).
- Two kinds of profile data:
  - Execution time
  - Execution counts
- We will look at examples of profiling mechanisms at the function level and the basic block level.

8 Prof: UNIX Function-Level Profiling
- Usage:
    % cc -p program.c    / generates instrumented a.out
    % a.out              / execution; instrumentation generates data in mon.out
    % prof               / processing of profile data
- The output gives a function-by-function breakdown of execution time.
- Useful for identifying which functions to concentrate optimization efforts on.

9 Output:
    %Time  Seconds  CumSecs  #Calls  Name
                                     _baz
                                     _bar
                                     _foo
                                     …
                                     _main
                                     _strcpy

10 Prof: How it Works
- Instrumentation does three things:
  1. At the entry of each function: increment an execution count for that function.
  2. At program entry: call the profil system call to collect execution times.
  3. At program exit: write the profile data to an output file that prof can process later.
- profil(): execution time profiler. It generates an execution time histogram, from which the execution time in each function is derived.

11 Profil: What it Does
- One of the parameters in the call to profil is a buffer.
  - It is used as an array of counters initialized to 0.
  - Array elements are associated with contiguous regions of program text.
- During execution, the PC value is sampled once every clock tick (default: 10 msec), triggered by a timer interrupt.
  - The corresponding buffer element is incremented.
- Each buffer element is later associated with a function; a time weight of 10 msec per sample is used to estimate CPU times.
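The sampling scheme can be imitated with a small simulation. This is a sketch under stated assumptions, not the real profil interface: the address layout in `FUNCTIONS`, the bin size, and the list of sampled PC values are all made up, and function boundaries are assumed to line up with bin boundaries.

```python
TICK_SECONDS = 0.010  # one sample per clock tick, 10 msec as on the slide

# Hypothetical layout of program text: function name -> (start address, end address).
FUNCTIONS = {
    "main": (0x1000, 0x1040),
    "foo": (0x1040, 0x10C0),
    "bar": (0x10C0, 0x1100),
}
TEXT_BASE = 0x1000
TEXT_END = 0x1100
BIN_SIZE = 0x10  # bytes of program text covered by each counter


def make_buffer():
    """Array of counters, one per contiguous region of program text, all zero."""
    n_bins = (TEXT_END - TEXT_BASE + BIN_SIZE - 1) // BIN_SIZE
    return [0] * n_bins


def record_sample(buffer, pc):
    """What happens on each timer interrupt: bump the counter for the sampled PC."""
    if TEXT_BASE <= pc < TEXT_END:
        buffer[(pc - TEXT_BASE) // BIN_SIZE] += 1


def times_per_function(buffer):
    """Associate each counter with a function and weight each sample by one tick."""
    result = {name: 0.0 for name in FUNCTIONS}
    for index, count in enumerate(buffer):
        bin_start = TEXT_BASE + index * BIN_SIZE
        for name, (start, end) in FUNCTIONS.items():
            if start <= bin_start < end:
                result[name] += count * TICK_SECONDS
    return result


# Made-up PC values, as if observed on seven consecutive clock ticks.
sampled_pcs = [0x1004, 0x1048, 0x1050, 0x10B8, 0x10C4, 0x1050, 0x1090]
buffer = make_buffer()
for pc in sampled_pcs:
    record_sample(buffer, pc)
estimates = times_per_function(buffer)
print(estimates)
```

Five of the seven samples fall inside foo, so it is credited with about 50 msec, even though nothing was measured except where the PC happened to be at each tick.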

12 Using prof
- From how it works, we understand that:
  - Granularity is at best 10 msec.
  - The generated profile can differ across multiple runs of a program with the same input.
  - The profile could be completely wrong: a particular function might just happen to be running each time the timer interrupt occurs.
- Some usage guidelines:
  - Run under light load conditions.
  - Run a few times and see whether the results vary a lot.
  - Note that function execution counts are exact, while execution times are estimates.
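The "completely wrong" case is easy to reproduce in a simulation. The sketch below assumes a hypothetical program that repeats a fixed pattern, 9 msec in f followed by 1 msec in g, and samples it at a fixed interval. All names and numbers are illustrative assumptions, with times kept in microseconds.

```python
def running_function(t_us, schedule):
    """Return the function executing at time t_us for a repeating schedule.

    schedule is a list of (name, duration_us) pairs that repeats forever.
    """
    period = sum(duration for _, duration in schedule)
    offset = t_us % period
    for name, duration in schedule:
        if offset < duration:
            return name
        offset -= duration
    raise AssertionError("unreachable: offset is always inside the period")


def sampled_profile(schedule, total_us, tick_us, first_sample_us):
    """Fraction of samples attributed to each function when sampling every tick_us."""
    counts = {name: 0 for name, _ in schedule}
    for t in range(first_sample_us, total_us, tick_us):
        counts[running_function(t, schedule)] += 1
    n_samples = sum(counts.values())
    return {name: count / n_samples for name, count in counts.items()}


# True split of execution time: f 90%, g 10%.
SCHEDULE = [("f", 9_000), ("g", 1_000)]

# Sampling interval equal to the program's period: every sample lands in g.
aliased = sampled_profile(SCHEDULE, 1_000_000, 10_000, 9_500)
# A sampling interval that does not match the period gives a far better estimate.
unaliased = sampled_profile(SCHEDULE, 1_000_000, 7_000, 9_500)

print("aliased:  ", aliased)
print("unaliased:", unaliased)
```

In the aliased run every sample is charged to g, although g accounts for only 10% of the time. Real programs are rarely this regular, which is why rerunning and comparing profiles is a reasonable sanity check.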

13 Pixie: Basic Block Level Profiling
- Available on MIPS and Alpha machines.
- Usage:
    % cc program.c     / a.out
    % pixie a.out      / instrumented a.out.pixie
    % a.out.pixie      / profile output file
    % prof             / report on profile data
- Output is based on basic block level execution counts.
- Useful for many analyses, for example finding the most frequently executed loops and code that never executes.

14 What is a Basic Block?
- A section of a program that does not cross any conditional branches, loop boundaries, or other transfers of control.
- A sequence of instructions with a single entry point, a single exit point, and no internal branches.
- A sequence of program statements that contains no labels and no branches.
- A basic block can only be executed completely and in sequence.

15 Pixie: How it Works
1. Identification of basic blocks
   - Q: How can basic blocks be identified?
   - Pixie uses heuristics where necessary.
2. Instrumentation
   - Increment a counter for the basic block.
   - On program entry and exit: initialize data structures and write the profile output file.
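One standard textbook answer to the identification question is the "leader" method: the first instruction, every branch target, and every instruction that follows a branch each start a new basic block. The sketch below applies it to a made-up instruction list. The slides do not say that pixie works this way, and the sketch assumes every branch target is known statically, which is not always true of real binaries (indirect jumps, for example) and is presumably where heuristics come in.

```python
def find_basic_blocks(instructions):
    """Split a list of (opcode, target) pairs into basic blocks.

    target is the index of the branch destination, or None for non-branches.
    Returns a list of (start, end) index pairs with end exclusive.
    """
    if not instructions:
        return []
    leaders = {0}  # the first instruction always starts a block
    for index, (_opcode, target) in enumerate(instructions):
        if target is not None:
            leaders.add(target)              # a branch target starts a block
            if index + 1 < len(instructions):
                leaders.add(index + 1)       # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    return list(zip(starts, ends))


# Hypothetical program: sum the integers below n with a simple loop.
PROGRAM = [
    ("li i, 0", None),          # 0
    ("li sum, 0", None),        # 1
    ("bge i, n", 6),            # 2: loop test, exits to instruction 6
    ("add sum, sum, i", None),  # 3
    ("addi i, i, 1", None),     # 4
    ("j", 2),                   # 5: back to the loop test
    ("ret", None),              # 6
]

blocks = find_basic_blocks(PROGRAM)
for start, end in blocks:
    print(f"block [{start}, {end}):", [op for op, _ in PROGRAM[start:end]])
```

Instrumentation then needs only one counter per block, because every instruction in a block executes the same number of times.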

16 How Intrusive are These Mechanisms?
- Issue: does the instrumented program behave enough like the original program?
  - If not, the generated profile might misdirect program optimization efforts.
- Pixie: the instrumented executable can be several times the size of the original.
  - This does not matter; basic block execution counts are still accurate.
- Prof: gathers more than just execution counts.
  - The instrumentation is not very large.

17 Performance Tuning Tools
- Performance counters are provided in hardware.
  - Event-based or sampled counters.
  - They measure various events (e.g., CPU cycles, L1 cache misses, TLB misses, loads, instruction count, ...).
- Counters may be accessible at user level or kernel level.
- They are accessible from the command line (user level) or through performance tools:
    % perfex executable [arguments]
  - perfex accesses the MIPS R10000 counters.

18 Other Tools: VTune
- Use sampling to gain an accurate representation of your software's actual performance, with negligible overhead.
  - Gather CPU snapshots to identify problems such as cache misses.
  - No special builds or instrumentation are required.
- Use call graph profiling to produce a picture of program flow and quickly identify critical functions and call sequences.
  - Gain a high-level, algorithmic view of program execution.

19 Other Tools: Pin
- Uses dynamic instrumentation.
  - Does not need source code, recompilation, or post-linking.
- Programmable instrumentation: provides rich APIs for writing your own instrumentation tools (called Pintools) in C/C++.
- Instrumentation is done on the executable (binary) and can be attached statically or dynamically.
- Launch and instrument an application:
    $ pin -t pintool -- application
  - pin: the instrumentation engine (provided).
  - pintool: the instrumentation tool (write your own or use a provided one).

20 Assignment #2 (contd.)
5. Use any of the performance tuning tools to measure various performance metrics (cache misses, execution time, etc.) and explain the performance of the different versions of the matrix multiplication program. (Due: Oct. 14, 2010)