Copyright 2004 David J. Lilja1 Measurement tools and techniques Fundamental strategies Interval timers Program profiling Tracing Indirect measurement.

Slides:

Advertisements

Similar presentations

SE-292 High Performance Computing Profiling and Performance R. Govindarajan

Advertisements

More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.

Memory Protection: Kernel and User Address Spaces  Background  Address binding  How memory protection is achieved.

Anshul Kumar, CSE IITD CSL718 : VLIW - Software Driven ILP Hardware Support for Exposing ILP at Compile Time 3rd Apr, 2006.

SE-292 High Performance Computing Profiling and Performance R. Govindarajan

CMPT 300: Operating Systems I Dr. Mohamed Hefeeda

Continuously Recording Program Execution for Deterministic Replay Debugging.

Copyright 2004 David J. Lilja1 Performance metrics What is a performance metric? Characteristics of good metrics Standard processor and system metrics.

Copyright 2004 David J. Lilja1 Errors in Experimental Measurements Sources of errors Accuracy, precision, resolution A mathematical model of errors Confidence.

Chapter 101 Virtual Memory Chapter 10 Sections and plus (Skip:10.3.2, 10.7, rest of 10.8)

Operating System Support Focus on Architecture

OS Fall ’ 02 Introduction Operating Systems Fall 2002.

CS 104 Introduction to Computer Science and Graphics Problems

CSCE 121, Sec 200, 507, 508 Fall 2010 Prof. Jennifer L. Welch.

1 Computer System Overview OS-1 Course AA

OS Spring’03 Introduction Operating Systems Spring 2003.

Computer System Overview

Memory Management Virtual Memory Page replacement algorithms

Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.

State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.

Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.1 Introduction.

Virtual Memory and Paging J. Nelson Amaral. Large Data Sets Size of address space: – 32-bit machines: 2 32 = 4 GB – 64-bit machines: 2 64 = a huge number.

Computer Organization and Architecture

Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.

Copyright 2004 David J. Lilja1 Measuring Computer Performance: A Practitioner’s Guide David J. Lilja Electrical and Computer Engineering University of.

OS Spring’04 Introduction Operating Systems Spring 2004.

1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.

Basics of Operating Systems March 4, 2001 Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard.

Virtual Memory.

1 Computer System Overview Chapter 1. 2 n An Operating System makes the computing power available to users by controlling the hardware n Let us review.

Machine Instruction Characteristics

The Functions of Operating Systems Interrupts. Learning Objectives Explain how interrupts are used to obtain processor time. Explain how processing of.

Computer Studies (AL) Memory Management Virtual Memory I.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Time Management.  Time management is concerned with OS facilities and services which measure real time, and is essential to the operation of timesharing.

2Object-Oriented Program Development Using C++ 3 Basic Loop Structures Loops repeat execution of an instruction set Three repetition structures: while,

Chapter 2 Processes and Threads Introduction 2.2 Processes A Process is the execution of a Program More specifically… – A process is a program.

Lecture 2d: Performance Comparison. Quality of Measurement Characteristics of a measurement tool (timer) Accuracy: Absolute difference of a measured value.

CE Operating Systems Lecture 2 Low level hardware support for operating systems.

Operating Systems 1 K. Salah Module 1.2: Fundamental Concepts Interrupts System Calls.

Lecture 2a: Performance Measurement. Goals of Performance Analysis The goal of performance analysis is to provide quantitative information about the performance.

Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.

CE Operating Systems Lecture 2 Low level hardware support for operating systems.

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Lecture 1: Review of Computer Organization

Digital Computer Concept and Practice Copyright ©2012 by Jaejin Lee Control Unit.

Time Management.  Time management is concerned with OS facilities and services which measure real time.  These services include:  Keeping track of.

NETW3005 Virtual Memory. Reading For this lecture, you should have read Chapter 9 (Sections 1-7). NETW3005 (Operating Systems) Lecture 08 - Virtual Memory2.

LECTURE 12 Virtual Memory. VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a “cache”

Digital Computer Concept and Practice Copyright ©2012 by Jaejin Lee Control Unit.

Interrupts and Exception Handling. Execution We are quite aware of the Fetch, Execute process of the control unit of the CPU –Fetch and instruction as.

Copyright © Curt Hill More on Operating Systems Continuation of Introduction.

Embedded Real-Time Systems Processing interrupts Lecturer Department University.

Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,

Advanced Operating Systems CS6025 Spring 2016 Processes and Threads (Chapter 2)

1 Computer System Overview Chapter 1. 2 Operating System Exploits the hardware resources of one or more processors Provides a set of services to system.

Confidence Intervals Cont.

REAL-TIME OPERATING SYSTEMS

Lecture 12 Virtual Memory.

OPERATING SYSTEMS CS3502 Fall 2017

Lecture 2d1: Quality of Measurements

Swapping Segmented paging allows us to have non-contiguous allocations

Morgan Kaufmann Publishers

Real-time Software Design

Processor Fundamentals

Operating Systems.

Architectural Support for OS

Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory

Architectural Support for OS

Page Allocation and Replacement

Presentation transcript:

Copyright 2004 David J. Lilja1 Measurement tools and techniques Fundamental strategies Interval timers Program profiling Tracing Indirect measurement

Copyright 2004 David J. Lilja2 Events Most measurement tools based on events Some predefined change to system state Definition depends on metric being measured Memory reference Disk access Change in a register’s state Network message Processor interrupt

Copyright 2004 David J. Lilja3 Event Classification Count metrics The number of times event X occurs Number of cache misses Number of I/O operations

Copyright 2004 David J. Lilja4 Event Classification Secondary-event metrics Record a value when triggered by some event Record block size for each I/O operation Count number of operations Find average I/O transfer size

Copyright 2004 David J. Lilja5 Event Classification Profiles Characterization of overall behavior Aggregate/big picture view of an application program Time spent in each function

Copyright 2004 David J. Lilja6 Event-Driven Strategies Record necessary information only when selected event occurs Modify system to record event Dump data when program terminates May need intermediate dumps also E.g. simple counter in page fault routine

Copyright 2004 David J. Lilja7 Event-Driven Strategies System overhead Only when the event of interest actually occurs Infrequent events → little perturbation Frequent events → high perturbation No longer “typical” behavior? Perturbation changes system being measured

Copyright 2004 David J. Lilja8 Event-Driven Strategies Inter-event time is unpredictable Depends on when events actually occur Makes it hard to estimate perturbation How long to measure? Event-driven measurement tools → Good for low-frequency events

Copyright 2004 David J. Lilja9 Event-Driven Strategies Counts 8 events exactly +1

Copyright 2004 David J. Lilja10 Tracing Similar to event-driven But record additional system state Event has occurred – count Additional information to uniquely identify event E.g. addresses that cause page faults Overhead Additional memory or disk storage Time to save state Relatively large system perturbation

Copyright 2004 David J. Lilja11 Tracing Counts 8 events plus extra data +1; Addr +1; Addr +1; Addr +1; Addr +1; Addr +1; Addr +1; Addr +1; Addr

Copyright 2004 David J. Lilja12 Sampling Record necessary state at fixed time intervals Overhead Independent of specific event frequency Depends on sampling frequency Misses some events Produces statistical summary May miss infrequent events Each replication will produce different results

Copyright 2004 David J. Lilja13 Sampling Counts 3 events out of 5 samples +1

Copyright 2004 David J. Lilja14 Comparisons Event count TracingSampling Resolution Exact count Detailed info Statistical summary OverheadLowHighConstant Perturbation~ #eventsHighFixed

Copyright 2004 David J. Lilja15 Comparison Event counting Best for low frequency events Required if exact counts needed Sampling Best for high frequency events If statistical summary is adequate Tracing When additional detail is required

Copyright 2004 David J. Lilja16 Indirect Measurements Used when desired metric is not directly accessible Measure one thing directly Derive or deduce desired metric Highly dependent on creativity of performance analyst

Copyright 2004 David J. Lilja17 Interval Timers Fundamental tool of performance measurement Measure execution time of any portion of a program Provide time basis for sampling

Copyright 2004 David J. Lilja18 Interval Timers Actually count clock pulses between two events Event 1 Event 2 x 1 = Counter x 2 = Counter TcTc T e =(x 2 – x 1 )T c

Copyright 2004 David J. Lilja19 Using an Interval Timer Within an application program Start_count = read_timer(); Portion of program to be measured Stop_count = read_timer(); Elapsed_time = (stop_count – start_count) * clock_period;

Copyright 2004 David J. Lilja20 Hardware Timer Clock n-bit counter To CPU input port TcTc T e =(x 2 – x 1 )T c

Copyright 2004 David J. Lilja21 Software Timer T e =(x 2 – x 1 )T c Clock T’ c Prescalar (divide-by-n) Software counter CPU interrupt input TcTc

Copyright 2004 David J. Lilja22 Quantization Errors

Copyright 2004 David J. Lilja23 Quantization Error Timer resolution → quantization error Repeated measurements nT c < T e < (n+1)T c T e rounded to ± one clock tick Completely unpredictable rounding → Want T c to be as small as possible

Copyright 2004 David J. Lilja24 Timer Rollover n-bit counter count = [0, 2 n -1] Rollover = transition from (2 n – 1) → 0 If rollover occurs between start/stop events Then count = (x 2 – x 1 ) < 0 Check for count < 0 Measure again Add 2 n to count

Copyright 2004 David J. Lilja25 Timer Rollover Counter width, n Resolution (T c ) ns655 us43 s58.5 cent 1 us65.5 ms1.2 h5,580 cent 100 us6.55 s5 days585,000 cent 1 ms1.1 min50 days5,850,000 cent

Copyright 2004 David J. Lilja26 Timer Overhead Start_count = read_timer(); Portion of program to be measured Stop_count = read_timer(); Elapsed_time = (stop_count – start_count) * clock_period; To access timer Min of 1 memory read → subroutine call Min of 1 memory write → subroutine call Once at start, again at stop

Copyright 2004 David J. Lilja27 Timer Overhead T1T1 T2T2 T3T3 T4T4 Event begins; Initiate read_timer() Current time actually read Event ends; Initiate read_time() Current time actually read Event being measured begins

Copyright 2004 David J. Lilja28 Timer Overhead T1T1 T2T2 T3T3 T4T4 T 1 = time to read counter value T 2 = time to store counter value T 3 = time of the event we are measuring T 4 = time to read counter value T 4 = T 1

Copyright 2004 David J. Lilja29 Timer Overhead T1T1 T2T2 T3T3 T4T4 T e = event time = T 3 But actually measured T m = T 2 + T 3 + T 4 T e = T m – (T 2 + T 4 ) = T m – (T 1 + T 2 ) Timer overhead = T ovhd = (T 1 + T 2 )

Copyright 2004 David J. Lilja30 Timer Overhead If T e >> T ovhd Ignore the timer overhead If T e ≈ T ovhd Measurements will be highly suspect Potentially large variations in T ovhd Good rule of thumb T e should be x > T ovhd

Copyright 2004 David J. Lilja31 Approximate Measures of Short Intervals How to measure an event that is shorter than the resolution of the clock? Cannot directly measure events with T e < T c Overhead makes it hard to measure even when T e > nT c, n is small integer

Copyright 2004 David J. Lilja32 Approximate Measures of Short Intervals TcTc TeTe TeTe Case 1: Count+1 Case 2: Count+0

Copyright 2004 David J. Lilja33 Approximate Measures of Short Intervals Bernoulli experiment Outcome = +1 with probability p Outcome = +0 with probability (1-p) Equivalent to flipping a biased coin Repeat n times Approximates a binomial distribution Only approximate since each measurement cannot be guaranteed to be independent Usually close enough in practice

Copyright 2004 David J. Lilja34 Approximate Measures of Short Intervals m = number of times Case 1 occurs Count+1 n = total number of measurements Average duration is ratio of m/n Use confidence interval for proportions

Copyright 2004 David J. Lilja35 Example Clock resolution = 10 us n = 8764 measurements m = 467 clock ticks counted 95% confidence interval 10 us ?? Case 1: 467 Case 2: 8297

Copyright 2004 David J. Lilja36 Example Scale by clock period = 10 us 95% chance that measured event is (0.49, 0.58) us

Copyright 2004 David J. Lilja37 Profiling Overall view of program’s execution-time behavior Fraction of total time spent in specific states Fraction of time in each subroutine Fraction of time in OS kernel Fraction of time doing I/O Find bottlenecks, code hot-spots Optimize those sections first

Copyright 2004 David J. Lilja38 Statistical Sampling Select a random subset of a population Gather information on only this subset Extrapolate this information to overall population Results are a statistical summary with corresponding error probabilities

Copyright 2004 David J. Lilja39 PC Sampling Periodically interrupt program at fixed intervals Record appropriate state information in interrupt service routine Post-process to obtain overall profile +1

Copyright 2004 David J. Lilja40 PC Sampling At each interrupt Examine PC on return address stack Use address map to translate this PC to subroutine i Increment array element H[i] Addr map : Subr : Subr : Subr : Subr 4 PC: 4582 Histogram counters: H[3]=H[3]+1

Copyright 2004 David J. Lilja41 PC Sampling

Copyright 2004 David J. Lilja42 PC Sampling n total interrupts Post-processing step H[i]/n = fraction of time executing in subroutine i (H[i]/n) * (interrupt period) = time in each subroutine

Copyright 2004 David J. Lilja43 PC Sampling This is a statistical process Different counts each time the experiment is performed Infer behavior of entire program from small sample Apply confidence intervals to quantify precision of results

Copyright 2004 David J. Lilja44 Example 40 us interrupt 36,128 interrupts in subroutine A Program runs for 10 seconds Time in this subroutine? 90% confidence interval m = 36,128 n = 10 sec / 40 us = 250,000 p = m/n = 0.144

Copyright 2004 David J. Lilja45 Example 90% chance that the program spent % of its time in subroutine A

Copyright 2004 David J. Lilja46 Example 10 ms interrupt 12 interrupts in subroutine A n = 800 samples 8 seconds total execution time Time in this subroutine? 99% confidence interval p = m/n = 0.015

Copyright 2004 David J. Lilja47 Example 99% chance that the program spent ms in subroutine A A pretty wide range! But only <3% of total execution time Start optimizing somewhere else first

Copyright 2004 David J. Lilja48 Reducing the Interval Size Use a lower confidence level Obtain more samples Run program longer May not be possible Increase sample rate May be fixed by system Will increase overhead and perturbation Run multiple times and add samples from each run

Copyright 2004 David J. Lilja49 PC Sampling Interrupts must occur asynchronously w.r.t. any program events Samples must be independent of each other Else over/under-sample events synchronous with interrupt Periodic versus random sampling +1

Copyright 2004 David J. Lilja50 Basic Block Counting Basic block Sequence of instructions with no branches into or out of the block When first instruction is executed, guaranteed that all instructions in block will be executed Single entry, single exit

Copyright 2004 David J. Lilja51 Basic Block Counting Generate a program profile by inserting additional instructions in each block Increment a unique counter each time a block is entered Produces a histogram of program execution Can post-process to find instruction execution frequencies

Copyright 2004 David J. Lilja52 Comparison PC sampling Basic block counting Output Statistical estimate Exact count Overhead Interrupt service routine Extra instructions per block Perturbation Randomly distributed High Repeatability Within statistical variance Perfect

Copyright 2004 David J. Lilja53 Event Tracing Profile shows overall frequency-of-execution behavior Ignores time-ordering of events Program trace Dynamic list of events generated by program Events = anything you want to instrument Sequence of memory addresses I/O blocks accessed Typically used to drive a simulator

Copyright 2004 David J. Lilja54 Trace Generation Application program Compress Uncompress Trace consumer Modify to generate trace

Copyright 2004 David J. Lilja55 Trace Generation Application program Compress Uncompress Trace consumer Online trace consumption Modify to generate trace

Copyright 2004 David J. Lilja56 Trace Generation Source-code modification Allows precise control of what events are traced and what data is recorded Typically a manual process Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja57 Trace Generation Software exceptions HW forces an exception before each instruction Exception routine decodes instruction Store instr type, PC, operand addresses, etc. “Trace” bit in many processors Tremendous slowdown Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja58 Trace Generation Emulation Make a system appear to be something else Modify emulator to generate trace E.g. Java Virtual Machine Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja59 Microcode modification Modify instruction execution directly Allows tracing of all instructions Including operating system Depends on access to lower levels of the processor E.g. Transmeta Crusoe processor Trace Generation Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja60 Trace Generation Compiler modification Insert trace code directly in object file Requires access to the compiler itself Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja61 Trace Generation Compiler modification Insert trace code directly in object file Requires access to the compiler itself Write post-compilation binary editor/rewrite tool Source code Object code Proc Trace Compiler

Copyright 2004 David J. Lilja62 Trace Data Tracing generates a tremendous volume of data Trace 100,000,000 instrs/sec 16 bits of data per event 190 Mbytes of data per second 11 Gbytes per minute Huge perturbations Due to tracing code Time to store trace data

Copyright 2004 David J. Lilja63 Trace Data Compression Standard compression algorithms as trace is written to disk Uncompress when reading Typical reduction 20-70% Tradeoff is compress- uncompress time Application program Compress Uncompress Trace consumer Modify to generate trace

Copyright 2004 David J. Lilja64 Online Trace Consumption Use trace data as it is generated Never stored on disk Multitasking may lead to non-deterministic behavior Repeatability issue Before-and-after comparison tests Difference due to change in system or change in trace? Becomes statistical comparison with n runs Application program Trace consumer Online trace consumption Modify to generate trace

Copyright 2004 David J. Lilja65 Abstract Execution Use higher-level information to intelligently compress trace info Two-step process Compiler-style analysis to find critical subset of trace Store only control flow information sufficient to reconstruct trace later Produce trace-regeneration code for subsequent use of trace

Copyright 2004 David J. Lilja66 Abstract Execution Trace will be either Store only 2 or 3 Combine with compiler- generated control flow graph to regenerate trace Slowdown = 2-10x Compress = x 1. if (i > 5) 2.then a = a + i; 3.else b = b + i; 4. i = i + 1; 1. if (i>5) 2. a=a+i3. b=b+i 4. i=i+1

Copyright 2004 David J. Lilja67 Trace Sampling Save only subsequences of overall trace Drive simulator with samples Results should be statistically similar to driving with complete trace One sample = k consecutive events Sampling interval = P (period) kk P

Copyright 2004 David J. Lilja68 SimPoint Find “representative” program samples Match basic block execution frequencies Clustering tool to automate process Perform detailed timing simulation on only these samples Fast-forward (functional simulation) between samples [Sherwood et al, ASPLOS, 2002]

Copyright 2004 David J. Lilja69 SimPoint Weight each sample’s result by execution frequency to produced overall result Relatively small number (10s) of SimPoints produced ≈3% error in IPC on SPEC

Copyright 2004 David J. Lilja70 SMARTS Uses systematic sampling Fixed sample interval Apply statistical sampling techniques to determine j, k, P k P k Functional simulation Detailed simulation jj

Copyright 2004 David J. Lilja71 Indirect Ad Hoc Techniques Sometimes the desired metric cannot be measured directly Use your creativity to measure one thing and then derive/infer the desired value

Copyright 2004 David J. Lilja72 Example – System Load What is system load? Number of jobs in run queue? Number of jobs actively time-sharing? Fraction of time processor is not in idle loop? Others? How to measure it? Modify OS PC sampling Indirect?

Copyright 2004 David J. Lilja73 Example Let system run for fixed time T Note value of counter Monitor Count n T

Copyright 2004 David J. Lilja74 Example Let system run for fixed time T Compare value of loaded system monitor counter to unloaded system count value Monitor App 1 Count n n/2 T

Copyright 2004 David J. Lilja75 Example Let system run for fixed time T Compare value of loaded system monitor counter to unloaded system count value Monitor App 1 App 2 Monitor Count n n/2 n/3 T

Copyright 2004 David J. Lilja76 Perturbation To obtain more information (higher resolution) → Use more instrumentation points More instrumentation points → Greater perturbation

Copyright 2004 David J. Lilja77 Perturbation Computer performance measurement uncertainty principle Accuracy is inversely proportional to resolution. Resolution Accuracy Low High

Copyright 2004 David J. Lilja78 Perturbation Superposition does not work here Non-linear Non-additive Double instrumentation ≠ double impact on performance Some instrumentation cancels out Some multiplies impact No way to predict!

Copyright 2004 David J. Lilja79 Instrumentation Code Changes memory access patterns Affects memory banking optimizations Generates additional load/store instructions More frequent cache flushes and replacements But may reduce set associativity conflicts Generates more I/O operations Will increase overall execution time More time-sharing context switches Alters virtual memory paging behavior

Copyright 2004 David J. Lilja80 Important Points Event types Simple counts of primary event Secondary events triggered by some primary event Overall profiles

Copyright 2004 David J. Lilja81 Important Points Measurement strategies Event-driven Tracing Sampling Indirect approaches

Copyright 2004 David J. Lilja82 Important Points Interval timers Stopwatch functionality Rollover problem Overhead Quantization errors Statistical measures of short intervals

Copyright 2004 David J. Lilja83 Important Points Profiling PC sampling Statistical view Basic block counting Exact behavior High overhead and perturbation

Copyright 2004 David J. Lilja84 Important Points Trace generation Source code modification Force exceptions Emulation Microcode modification Compiler modification Object code editor Online trace consumption Trace sampling

Copyright 2004 David J. Lilja85 Important Points Indirect measurements when all else fails System load example Perturbations Nobody likes them Have to learn to live with them