Performance Measurement

Performance Measurement

Performance Analysis Paper and pencil.
Don’t need a working computer program or even a computer. These could be considered some of the advantages of performance analysis over performance measurement.

Some Uses Of Performance Analysis
determine practicality of algorithm predict run time on large instance compare 2 algorithms that have different asymptotic complexity e.g., O(n) and O(n2) If the complexity is O(n2), we may consider the algorithm practical even for large n. An O(2n) algorithm is practical only for small n, say n < 40. The worst-case run time of an O(n2) algorithm will quadruple with each doubling of n. So if the worst-case time is 10 sec when n = 100, it will be approximately 40 sec when n = 200.

Reminder

Limitations of Analysis
Doesn’t account for constant factors. but constant factor may dominate 1000n vs n2 and we are interested only in n < 1000 Constant factor may dominate until some n.

Modern computers have a hierarchical memory organization with different access time for memory at different levels of the hierarchy.

Memory Hierarchy MAIN L2 L1 ALU R 8-32 32KB ~512KB ~512MB 1C 2C 10C
1 cycle to add data that are in the registers 2 cycles to copy data from L1 cache to a register 10 cycles to copy from L2 to L1 and register L1 ALU R 8-32 32KB ~512KB ~512MB 1C 2C 10C 100C

Our analysis doesn’t account for this difference in memory access times. Programs that do more work may take less time than those that do less work.  Temporal and Spatial locality matters! A program that does 100 operations on the same data would take less time than A program that performs a single operation on say 50 different pieces of data (assuming, in a simple model) that each time data is fetched from main memory, only one piece of data is fetched. The first program would make one fetch at a cost of 100 cycles and perform 100 operations at a cost of 1 cycle each for a total of 200 cycles. The second program would need 50 * 100 cycles just to fetch the data.

Cache-aware and Cache-oblivious algorithms*
Cache-aware algorithms try to minimize “cache misses” Example: Matrix multiplication Cache-oblivious algorithms try to take advantage of cache without knowledge of hardware details *extra material Source:

Matrix Multiplication
for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) for (int k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j]; ijk, ikj, jik, jki, kij, kji orders of loops yield same result. All perform same number of operations. But run time differs! ijk takes > 7x ikj on modern PC when n = 4K. In fact, on a contemporary PC, ijk took 7x ikj when n = 4K. Slide source: Professor Sartaj Sahni, Advanced Data Structures course, University of Florida

Performance Measurement/Benchmarking
Measure actual time on an actual computer. What do we need?

Performance Measurement Needs
programming language e.g. C++/Java working program Insertion Sort computer compiler and options to use gcc –O2

Performance Measurement Needs
data to use for measurement worst-case data best-case data average-case data timing mechanism --- clock

Timing In C++ double clocksPerMillis = double(CLOCKS_PER_SEC) / 1000;
// clock ticks per millisecond clock_t startTime = clock(); // code to be timed comes here double elapsedMillis = (clock() – startTime) / clocksPerMillis; // elapsed time in milliseconds

Shortcoming Clock accuracy assume 100 ticks
Repeat work many times to bring total time to be >= 1000 ticks Preceding measurement code is acceptable only when the elapsed time is large relative to the accuracy of the clock.

More Accurate Timing clock_t startTime = clock();
long numberOfRepetitions; do { numberOfRepetitions++; doSomething(); } while (clock() - startTime < 1000) double elapsedMillis = (clock()- startTime) / clocksPerMillis; double timeForCode = elapsedMillis/numberOfRepetitions;

Bad Way To Time do { counter++; startTime = clock(); doSomething();
elapsedTime += clock() - startTime; } while (elapsedTime < 1000) Suppose the clock is updated every 100ms and doSomething takes 99ms and it takes another 1ms do do everything else in the loop. If the clock is updated exactly at the start of the loop then it is not updated between the startTime = and elapsedTime += statements. So elapsedTime is always 0.

Accuracy Now accuracy is 10%.
first reading may be just about to change to startTime + 100 second reading (final value of clock())may have just changed to finishTime so finishTime - startTime is off by 100 ticks

Accuracy first reading may have just changed to startTime
second reading may be about to change to finishTime + 100 so finishTime - startTime is off by 100 ticks

Accuracy Examining remaining cases, we get trueElapsedTime =
finishTime - startTime ticks To ensure 10% accuracy, require elapsedTime = finishTime – startTime >= 1000 ticks

Timing in Java Same semantics, slightly different syntax
long startTime = System.nanoTime(); //Code to be measured System.out.println( "Elapsed time: " + (System.nanoTime() - startTime)); or Use micro-benmarking frameworks such as JMH (recommended)

What Went Wrong? clock_t startTime = clock();
long numberOfRepetitions; do { numberOfRepetitions++; insertionSort(a, n); } while (clock() - startTime < 1000) double elapsedMillis = (clock()- startTime) / clocksPerMillis; double timeForCode = elapsedMillis/numberOfRepetitions; Suppose we are measuring worst-case run time. The array a Initially is in descending order. After the first iteration, the data is sorted. Subsequent iterations no longer start with worst-case data. So the measured time is for one Worst-case sort plus counter-1 best-case sorts!

The Fix clock_t startTime = clock(); long numberOfRepetitions; do {
// put code to initialize a[] here insertionSort(a, n) } while (clock() - startTime < 1000) double elapsedMillis = (clock()- startTime) / clocksPerMillis; Now the measured time is for counter number of worst-case sorts plus counter number of initializes. We can measure the time to initialize separately and subtract. Or, if We are to compare with other sort methods, we could ignore the initialize time as it is there in the measurements for all sort methods. Or we could argue that for large n the worst case time to sort using insertion sort (O(n^2)) overshadows the initialize time, which is O(n), and so we may ignore the initialize time.

Time Shared System UNIX time MyProgram
Using System.currentTimeMillis() is a problem, because your program may run for only part of the physical time that has elapsed. The UNIX command time gives the time for which your program ran.

To-do 1. C++: 1’.Java (JMH): 2. Linux:
1’.Java (JMH): 2. Linux:

Performance Measurement

Similar presentations

Presentation on theme: "Performance Measurement"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Performance Measurement

Similar presentations

Presentation on theme: "Performance Measurement"— Presentation transcript:

Similar presentations

About project

Feedback