Performance Measurement


Performance Analysis Analysis is done with paper and pencil: it needs neither a working program nor a computer. These are among the advantages of performance analysis over performance measurement.

Some Uses of Performance Analysis determine the practicality of an algorithm; predict the run time on a large instance; compare two algorithms that have different asymptotic complexity, e.g., O(n) and O(n^2). If the complexity is O(n^2), we may consider the algorithm practical even for large n. An O(2^n) algorithm is practical only for small n, say n < 40. The worst-case run time of an O(n^2) algorithm quadruples with each doubling of n, so if the worst-case time is 10 sec when n = 100, it will be approximately 40 sec when n = 200.

Reminder

Limitations of Analysis Analysis doesn't account for constant factors, yet a constant factor may dominate. Consider 1000n vs n^2 when we are interested only in n < 1000: the n^2 algorithm is actually faster over that whole range. The constant factor may dominate until some n.

Limitations of Analysis Modern computers have a hierarchical memory organization with different access times for memory at different levels of the hierarchy.

Memory Hierarchy
R (registers, 8-32)   1 cycle
L1 cache (32KB)       2 cycles
L2 cache (~512KB)     10 cycles
MAIN memory (~512MB)  100 cycles
1 cycle to add data that are already in registers; 2 cycles to copy data from L1 cache to a register; 10 cycles to copy from L2 to L1 and a register; 100 cycles to fetch from main memory.

Limitations of Analysis Our analysis doesn't account for this difference in memory access times; programs that do more work may take less time than programs that do less work. Temporal and spatial locality matter! Assume a simple model in which each fetch from main memory brings in exactly one piece of data. A program that performs 100 operations on the same datum makes one fetch at a cost of 100 cycles and performs 100 operations at a cost of 1 cycle each, for a total of 200 cycles. A program that performs a single operation on each of 50 different pieces of data needs 50 * 100 cycles just to fetch the data.

Cache-Aware and Cache-Oblivious Algorithms* Cache-aware algorithms try to minimize cache misses using knowledge of the cache parameters; matrix multiplication is a classic example. Cache-oblivious algorithms try to take advantage of the cache without knowledge of the hardware details. *extra material. Source: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Matrix Multiplication
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
The six loop orders ijk, ikj, jik, jki, kij, kji all yield the same result and perform the same number of operations, but their run times differ: on a contemporary PC, ijk took more than 7x as long as ikj when n = 4K.
Slide source: Professor Sartaj Sahni, Advanced Data Structures course, University of Florida

Performance Measurement / Benchmarking Measure actual time on an actual computer. What do we need?

Performance Measurement Needs
- a programming language, e.g., C++ or Java
- a working program, e.g., insertion sort
- a computer
- a compiler and the options to use, e.g., gcc -O2

Performance Measurement Needs
- data to use for measurement: worst-case data, best-case data, average-case data
- a timing mechanism: a clock

Timing in C++
double clocksPerMillis = double(CLOCKS_PER_SEC) / 1000; // clock ticks per millisecond
clock_t startTime = clock();
// code to be timed comes here
double elapsedMillis = (clock() - startTime) / clocksPerMillis; // elapsed time in milliseconds

Shortcoming The preceding measurement code is acceptable only when the elapsed time is large relative to the accuracy of the clock. Assume the clock is accurate to 100 ticks; then repeat the work enough times to bring the total time to at least 1000 ticks.

More Accurate Timing
clock_t startTime = clock();
long numberOfRepetitions = 0;
do {
   numberOfRepetitions++;
   doSomething();
} while (clock() - startTime < 1000);
double elapsedMillis = (clock() - startTime) / clocksPerMillis;
double timeForCode = elapsedMillis / numberOfRepetitions;

Bad Way To Time
do {
   counter++;
   startTime = clock();
   doSomething();
   elapsedTime += clock() - startTime;
} while (elapsedTime < 1000);
Suppose the clock is updated every 100ms, doSomething takes 99ms, and everything else in the loop takes another 1ms. If the clock is updated exactly at the start of the loop, then it is never updated between the startTime = clock() and elapsedTime += statements, so elapsedTime is always 0.

Accuracy Now the accuracy is 10%. The first reading may be just about to change to startTime + 100, and the second reading (the final value of clock()) may have just changed to finishTime, so finishTime - startTime may be off by 100 ticks.

Accuracy Or the first reading may have just changed to startTime, and the second reading may be about to change to finishTime + 100; again finishTime - startTime is off by 100 ticks.

Accuracy Examining the remaining cases, we get trueElapsedTime = finishTime - startTime +- 100 ticks. To ensure 10% accuracy, require elapsedTime = finishTime - startTime >= 1000 ticks.

Timing in Java Same semantics, slightly different syntax:
long startTime = System.nanoTime();
// code to be measured
System.out.println("Elapsed time: " + (System.nanoTime() - startTime));
Or use a micro-benchmarking framework such as JMH (recommended).

What Went Wrong?
clock_t startTime = clock();
long numberOfRepetitions = 0;
do {
   numberOfRepetitions++;
   insertionSort(a, n);
} while (clock() - startTime < 1000);
double elapsedMillis = (clock() - startTime) / clocksPerMillis;
double timeForCode = elapsedMillis / numberOfRepetitions;
Suppose we are measuring the worst-case run time, so the array a is initially in descending order. After the first iteration the data are sorted, and subsequent iterations no longer start with worst-case data. So the measured time is for one worst-case sort plus numberOfRepetitions - 1 best-case sorts!

The Fix
clock_t startTime = clock();
long numberOfRepetitions = 0;
do {
   numberOfRepetitions++;
   // put code to initialize a[] here
   insertionSort(a, n);
} while (clock() - startTime < 1000);
double elapsedMillis = (clock() - startTime) / clocksPerMillis;
Now the measured time is for numberOfRepetitions worst-case sorts plus numberOfRepetitions initializations. We can measure the time to initialize separately and subtract it. Or, if we are comparing with other sort methods, we could ignore the initialization time, since it is present in the measurements for all sort methods. Or we could argue that for large n the worst-case time of insertion sort, O(n^2), overshadows the O(n) initialization time, and so we may ignore it.

Time-Shared System UNIX: time MyProgram Using System.currentTimeMillis() is a problem, because your program may run for only part of the physical time that elapses. The UNIX command time reports the time for which your program actually ran.

To-do
1. C++: http://www.cplusplus.com/reference/ctime/clock/
1'. Java (JMH): http://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html and http://nitschinger.at/Using-JMH-for-Java-Microbenchmarking/
2. Linux: http://stackoverflow.com/a/556411