Performance Measurement

Presentation transcript:

Performance Measurement

A Quantitative Basis for Design
• Parallel programming is an optimization problem.
• Must take into account several factors:
  – execution time
  – scalability
  – efficiency

A Quantitative Basis for Design
• Parallel programming is an optimization problem.
• Must take into account several factors:
• Also must take into account the costs:
  – memory requirements
  – implementation costs
  – maintenance costs, etc.

A Quantitative Basis for Design
• Parallel programming is an optimization problem.
• Must take into account several factors:
• Also must take into account the costs:
• Mathematical performance models are used to assess these costs and predict performance.

Defining Performance
• How do you define parallel performance?
• What do you define it in terms of?
• Consider
  – Distributed databases
  – Image processing pipeline
  – Nuclear weapons testbed

Metrics for Performance
• Efficiency
• Speedup
• Scalability
• Others ...

Some Terms
• s(n,p) = speedup for problem size n on p processors
• o(n) = serial portion of computation
• p(n) = parallel portion of computation
• c(n,p) = time for communication
• Speed1 = o(n) + p(n)
• SpeedP = o(n) + p(n)/p + c(n,p)

Efficiency
$E = \frac{T_1}{p\,T_p}$
• The fraction of time a processor spends doing useful work
• What about when $p\,T_p < T_1$?
  – Does cache make a processor work at 110%?
$E = \frac{o(n) + p(n)}{p \cdot o(n) + p(n) + p \cdot c(n,p)}$
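A minimal sketch of the two efficiency expressions above. The function names and every number are illustrative assumptions, not values from the slides. The second call shows the case the slide asks about: when p·Tp < T1, the computed efficiency exceeds 1.

```python
def efficiency(t1, tp, p):
    """E = T1 / (p * Tp): fraction of processor time spent on useful work."""
    return t1 / (p * tp)


def model_efficiency(o, par, comm, p):
    """Efficiency from the model terms: o = o(n), par = p(n), comm = c(n,p)."""
    return (o + par) / (p * o + par + p * comm)


# Hypothetical timings: 100 s on one processor, 30 s on four processors.
measured = efficiency(100.0, 30.0, 4)

# Hypothetical model terms that reproduce those timings:
# T1 = 4 + 96 = 100 and Tp = 4 + 96/4 + 2 = 30.
modelled = model_efficiency(4.0, 96.0, 2.0, 4)

# Hypothetical superlinear case (for example, the working set now fits in cache).
superlinear = efficiency(100.0, 20.0, 4)

print(measured, modelled, superlinear)
```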

Speedup
$S = \frac{\text{Speed1}}{\text{SpeedP}}$
• What is Speed?
• What algorithm for Speed1?
• What is the work performed?
• How much work?

Speedup (More Detail)
• s(n,p) = speedup for problem size n on p processors
• o(n) = serial portion of computation
• p(n) = parallel portion of computation
• c(n,p) = time for communication
• Speed1 = o(n) + p(n)
• SpeedP = o(n) + p(n)/p + c(n,p)
$\text{Speedup} = \frac{o(n) + p(n)}{o(n) + p(n)/p + c(n,p)}$

More on Speedup
Computation time decreases as we add processors, but communication time increases.
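The trade-off can be sketched with the speedup formula from the previous slide. The communication model used here (cost growing linearly with the processor count) and all constants are assumptions chosen only to make the effect visible.

```python
def speedup(o, par, comm, p):
    """Speedup = (o(n) + p(n)) / (o(n) + p(n)/p + c(n,p))."""
    return (o + par) / (o + par / p + comm)


def comm_time(p, per_processor=0.5):
    """Assumed communication model: cost grows linearly with p."""
    return per_processor * p


# Assumed workload: 1 s of serial work and 100 s of parallelizable work.
counts = [1, 2, 4, 8, 16, 32, 64]
curve = {p: speedup(1.0, 100.0, comm_time(p), p) for p in counts}
best_p = max(curve, key=curve.get)

for p in counts:
    print(f"p={p:3d}  speedup={curve[p]:.2f}")
print("best processor count in this sweep:", best_p)
```

Under these assumptions, speedup rises while the shrinking p(n)/p term dominates, then falls once the growing communication term outweighs it.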

Two kinds of Speedup
• Relative
  – Uses parallel algorithm on 1 processor
  – Most common
  – Useful for determining algorithm scalability
• Absolute
  – Uses best known serial algorithm
  – Eliminates overheads in calculation
  – Useful to express absolute performance
• Story: Prime Number Generation

Amdahl's Law
• Every algorithm has a sequential component.
• Sequential component limits speedup
  – Sequential component = s
  – Maximum speedup = 1/s
• Example: ¾ of the program can be parallelized, ¼ is sequential
  – Suppose each ¼ of the program takes 1 unit of time
  – Maximum speedup = 1 proc time / n proc time = 4/1 = 4 (the parallel ¾ shrinks toward zero as processors are added)

Amdahl’s Law
$\text{Speedup} = \frac{o(n) + p(n)}{o(n) + p(n)/p + c(n,p)} \le \frac{o(n) + p(n)}{o(n) + p(n)/p}$
s = o(n)/(o(n) + p(n)) = the inherently sequential percentage
$\text{Speedup} \le \frac{o(n)/s}{o(n) + o(n)\,(1/s - 1)/p}$
$\text{Speedup} \le \frac{1}{s + (1 - s)/p}$
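A short sketch of the final bound, evaluated for the ¼-sequential example from the previous slide. The function name and the processor counts are illustrative assumptions.

```python
def amdahl_speedup(s, p):
    """Upper bound on speedup: 1 / (s + (1 - s) / p), with s the serial fraction."""
    return 1.0 / (s + (1.0 - s) / p)


s = 0.25  # one quarter of the program is inherently sequential
for p in (1, 2, 4, 16, 256):
    print(f"p={p:4d}  bound={amdahl_speedup(s, p):.3f}")

# The bound approaches 1/s = 4 but never reaches it.
limit = 1.0 / s
```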

Amdahl's Law
[Figure: speedup plotted against the sequential fraction s]

Speedup
• Algorithm A
  – Serial execution time is 10 sec.
  – Parallel execution time is 2 sec.
• Algorithm B
  – Serial execution time is 2 sec.
  – Parallel execution time is 1 sec.
• What if I told you A = B?

Speedup
• Conventional speedup is defined as the reduction in execution time.
• Consider running a problem on a slow parallel computer and on a faster one.
  – Same serial component
  – Speedup will be lower on the faster computer.

Logic
The art of thinking and reasoning in strict accordance with the limitations and incapacities of the human misunderstanding. The basis of logic is the syllogism, consisting of a major and minor premise and a conclusion.

Example
• Major Premise: Sixty men can do a piece of work sixty times as quickly as one man.
• Minor Premise: One man can dig a post-hole in sixty seconds.
• Conclusion: Sixty men can dig a post-hole in one second.

Speedup and Amdahl's Law
• Conventional speedup penalizes faster absolute speed.
• Assuming that the task size stays constant as computing power increases exaggerates the task overhead.
• Scaling the problem size reduces these distortion effects.

Solution
• Gustafson introduced scaled speedup.
• Scale the problem size as you increase the number of processors.
• Calculated in two ways
  – Experimentally
  – Analytical models

Traditional Speedup (Strong Scaling)
$\text{Speedup} = \frac{T_1(N)}{T_P(N)}$
$T_x(y)$ is the time taken to solve a problem of size y on x processors.

Scaled Speedup (Weak Scaling)
$\text{Speedup} = \frac{T_1(PN)}{T_P(PN)}$
• Traditional speedup reduces the work done by each processor as we add processors.
• Scaled speedup keeps the work constant on each processor as we add processors.

Scaled Speedup
$\text{Speedup} \le \frac{o(n) + p(n)}{o(n) + p(n)/p}$
The denominator can be divided into two pieces, serial and parallel:
$s = \frac{o(n)}{o(n) + p(n)/p}$ and $(1 - s) = \frac{p(n)/p}{o(n) + p(n)/p}$
Now solve for o(n) and p(n) respectively:
o(n) = (o(n) + p(n)/p) * s
p(n) = (o(n) + p(n)/p) * (1 – s) * p
Substituting these back into the speedup equation yields
Speedup <= s + (1 – s) * p
and, equivalently,
Speedup <= p + (1 – p) * s
where s is the fraction of time doing serial code = o(n) / t(n,k), and t(n,k) is the time of the parallel program for size n on k processors.
Thus, max speedup with p < k processors is
Speedup <= p + (1 – p) * s
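A small sketch of the scaled-speedup bound, checking that the two forms above agree. The function names and the sample serial fraction are illustrative assumptions. Note that s here is measured on the parallel run, so it is not the same quantity as the s in Amdahl's Law.

```python
def scaled_speedup(s, p):
    """Scaled speedup bound: p + (1 - p) * s, with s the serial fraction of the parallel run."""
    return p + (1 - p) * s


def scaled_speedup_alt(s, p):
    """Equivalent form: s + (1 - s) * p."""
    return s + (1 - s) * p


s = 0.25  # assumed: a quarter of the parallel run time is serial code
for p in (1, 2, 4, 16, 64):
    print(f"p={p:3d}  bound={scaled_speedup(s, p):.2f}")
```

Unlike the Amdahl bound, this one keeps growing linearly with p, because the parallel part of the work is assumed to grow with the machine.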

Traditional Speedup
[Figure: speedup vs. number of processors, showing ideal and measured curves]

Scaled Speedup
[Figure: speedup vs. number of processors, showing the ideal curve and curves for small, medium, and large problems]

Scaled Speedup vs Amdahl’s Law
• Amdahl’s Law determines speedup by taking a serial computation and predicting how quickly it could be done in parallel.
• Scaled speedup begins with a parallel computation and estimates how much faster the parallel computation is than the same computation on a serial processor.
• Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
• Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.

Determining Scaled Speedup
• Time problem size n on 1 processor
• Time problem size 2n on 2 processors
• Time problem size 2n on 1 processor
• Time problem size 4n on 4 processors
• Time problem size 4n on 1 processor
• etc.
• Plot the curve
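The procedure above can be sketched as a small table lookup. The timings below are made-up placeholders, not measurements, and the dictionary layout (size multiplier, processor count) is an assumption of this sketch.

```python
# Keys are (size_multiplier, processors); values are wall-clock seconds.
# All numbers are invented placeholders for illustration only.
timings = {
    (1, 1): 10.0,
    (2, 1): 20.0, (2, 2): 10.8,
    (4, 1): 40.0, (4, 4): 11.9,
}


def scaled_speedups(timings):
    """For each P, return T_1(P*n) / T_P(P*n) from the timing table."""
    result = {}
    for (multiplier, procs), elapsed in timings.items():
        if multiplier == procs:  # the run of size P*n on P processors
            result[procs] = timings[(multiplier, 1)] / elapsed
    return result


curve = scaled_speedups(timings)
for procs in sorted(curve):
    print(f"P={procs}  scaled speedup={curve[procs]:.2f}")
```

The resulting points are what would be plotted against the processor count.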

Performance Measurement
• There is not a perfect way to measure and report performance.
• Wall clock time seems to be the best.
• But how much work do you do?
• Best Bet:
  – Develop a model that fits experimental results.

A Parallel Programming Model
• Goal: Define an equation that predicts execution time as a function of
  – Problem size
  – Number of processors
  – Number of tasks
  – Etc.
$T = f(N, P, \ldots)$

A Parallel Programming Model
• Execution time can be broken up into
  – Computing
  – Communicating
  – Idling
$T = T_{comp} + T_{comm} + T_{idle}$

Computation Time
• Normally depends on problem size
• Also depends on machine characteristics
  – Processor speed
  – Memory system
  – Etc.
• Often, experimentally obtained

Communication Time
• The amount of time spent sending & receiving messages
• Most often is calculated as
  – Cost of sending a single message * #messages
• Single message cost
  – T = startuptime + time_to_send_one_word * #words
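A minimal sketch of this cost model. The startup time and per-word time below are assumed placeholder values, not figures from the slides; the comparison shows why, under this model, one large message is cheaper than many small ones carrying the same data.

```python
def message_time(startup, per_word, words):
    """Single message cost: startuptime + time_to_send_one_word * #words."""
    return startup + per_word * words


def comm_time(startup, per_word, words, messages):
    """Total cost: cost of sending a single message * #messages."""
    return messages * message_time(startup, per_word, words)


STARTUP = 1e-4   # assumed startup time per message, in seconds
PER_WORD = 1e-7  # assumed transfer time per word, in seconds

# The same 100,000 words sent as 1,000 small messages or as one large message.
many_small = comm_time(STARTUP, PER_WORD, words=100, messages=1000)
one_large = comm_time(STARTUP, PER_WORD, words=100_000, messages=1)

print(f"1000 x 100 words: {many_small:.4f} s")
print(f"1 x 100000 words: {one_large:.4f} s")
```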

Idle Time
• Difficult to determine
• This is often the time waiting for a message to be sent to you.
• Can be avoided by overlapping communication and computation.

Finite Difference Example
• Finite Difference Code
• 512 x 512 x 5 Elements
• Nine-point stencil
• Row-wise decomposition
  – Each processor gets n/p*n*z elements
• 16 IBM RS6000 workstations
• Connected via Ethernet
[Figure: grid with dimensions labeled n, x, and z]

Finite Difference Model
• Execution Time (per iteration)
  – ExTime = (Tcomp + Tcomm)/P
• Communication Time (per iteration)
  – Tcomm = 2 (lat + 2*n*z*bw)
• Computation Time
  – Estimate using some sample code

Estimated Performance

Finite Difference Example

What was wrong?
• Ethernet
  – Shared bus
• Change the computation of Tcomm
  – Reduce the bandwidth
  – Scale the message volume by the number of processors sending concurrently.
  – Tcomm = 2 (lat + 2*n*z*bw * P/2)
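The original and revised communication terms can be compared side by side. The grid size (n = 512, z = 5) comes from the example slide; the latency, the per-word cost `bw`, and the computation time are assumed placeholder values, since the slides do not give them. Treating `bw` as a time per word is also an assumption of this sketch.

```python
def t_comm(n, z, lat, bw):
    """Original model: Tcomm = 2 (lat + 2*n*z*bw)."""
    return 2 * (lat + 2 * n * z * bw)


def t_comm_shared_bus(n, z, lat, bw, p):
    """Revised model for a shared bus: Tcomm = 2 (lat + 2*n*z*bw * P/2)."""
    return 2 * (lat + 2 * n * z * bw * p / 2)


def ex_time(t_comp, comm, p):
    """ExTime = (Tcomp + Tcomm) / P, as written on the model slide."""
    return (t_comp + comm) / p


N, Z = 512, 5   # grid dimensions from the example slide
LAT = 1e-3      # assumed message latency, in seconds
BW = 1e-6       # assumed time per word, in seconds
T_COMP = 2.0    # assumed computation time per iteration, in seconds

for p in (2, 4, 8, 16):
    original = ex_time(T_COMP, t_comm(N, Z, LAT, BW), p)
    revised = ex_time(T_COMP, t_comm_shared_bus(N, Z, LAT, BW, p), p)
    print(f"P={p:2d}  original={original:.4f} s  shared bus={revised:.4f} s")
```

With two processors the two models coincide (P/2 = 1); the gap widens as more processors contend for the bus.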

Finite Difference Example

Using analytical models
• Examine the control flow of the algorithm
• Find a general algebraic form for the complexity (execution time).
• Fit the curve with experimental data.
• If the fit is poor, find the missing terms and repeat.
• Calculate the scaled speedup using formula.
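The fitting step might look like the sketch below, which fits an assumed two-term form T ≈ a·N/P + b·P by ordinary least squares. Both the model form and the data are assumptions: the "measurements" are generated synthetically from a = 1, b = 5 so that the fit can be checked, and real timings would replace them.

```python
def fit_two_term(samples):
    """Least-squares fit of t ~ a*(n/p) + b*p. samples is a list of (n, p, t)."""
    sxx = sxy = syy = sxt = syt = 0.0
    for n, p, t in samples:
        x, y = n / p, float(p)
        sxx += x * x
        sxy += x * y
        syy += y * y
        sxt += x * t
        syt += y * t
    # Solve the 2x2 normal equations by Cramer's rule.
    det = sxx * syy - sxy * sxy
    a = (sxt * syy - sxy * syt) / det
    b = (sxx * syt - sxy * sxt) / det
    return a, b


def predict(a, b, n, p):
    return a * n / p + b * p


# Synthetic stand-in for experimental data (NOT real measurements).
N = 512
samples = [(N, p, 1.0 * N / p + 5.0 * p) for p in (1, 2, 4, 8, 16)]

a, b = fit_two_term(samples)
worst_error = max(abs(predict(a, b, n, p) - t) for n, p, t in samples)
print(f"a={a:.3f}  b={b:.3f}  worst residual={worst_error:.2e}")
```

A large worst residual on real data would be the signal, per the list above, that a term is missing from the model.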

Example
• Serial Time = N seconds
• Parallel Time = N/P + 5P seconds
• Let N/P = 128
• Scaled speedup for 4 processors, with problem size 4(128), is:
$\text{Speedup} = \frac{C_1(PN)}{C_P(PN)}$
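A sketch of this calculation, taking the serial and parallel times exactly as the bullets state them. The function names are illustrative, and any constants lost from the slide's original equation are not reconstructed here.

```python
def serial_time(n):
    """Serial Time = N seconds."""
    return float(n)


def parallel_time(n, p):
    """Parallel Time = N/P + 5P seconds."""
    return n / p + 5 * p


P = 4
WORK_PER_PROCESSOR = 128      # N/P = 128
N = WORK_PER_PROCESSOR * P    # scaled problem size: 4(128) = 512

scaled = serial_time(N) / parallel_time(N, P)
print(f"scaled speedup on {P} processors: {scaled:.2f}")
```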

Performance Evaluation
• Identify the data
• Design the experiments to obtain the data
• Report data

Performance Evaluation
• Identify the data
  – Execution time
  – Be sure to examine a range of data points
• Design the experiments to obtain the data
• Report data

Performance Evaluation
• Identify the data
• Design the experiments to obtain the data
  – Make sure the experiment measures what you intend to measure.
  – Remember: Execution time is max time taken.
  – Repeat your experiments many times
  – Validate data by designing a model
• Report data
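A minimal timing harness along these lines, using only the Python standard library. The workload function, the repeat count, and the per-process times are placeholders for illustration; in a real parallel run the per-process times would be collected from each process.

```python
import statistics
import time


def time_once(fn, *args):
    """Wall-clock time of a single call."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def repeat_timing(fn, *args, repeats=10):
    """Repeat the experiment several times and keep every sample."""
    return [time_once(fn, *args) for _ in range(repeats)]


def parallel_execution_time(per_process_times):
    """Execution time of a parallel run is the max over all processes."""
    return max(per_process_times)


def work(n):
    """Placeholder workload (hypothetical)."""
    return sum(i * i for i in range(n))


samples = repeat_timing(work, 100_000, repeats=5)
summary = {
    "min": min(samples),
    "median": statistics.median(samples),
    "max": max(samples),
}
print(summary)

# Hypothetical per-process times from one parallel run.
slowest = parallel_execution_time([1.0, 2.5, 2.0])
```

Reporting the spread (min, median, max) rather than a single number makes run-to-run noise visible.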

Performance Evaluation
• Identify the data
• Design the experiments to obtain the data
• Report data
  – Report all information that affects execution
  – Results should be separate from Conclusions
  – Present the data in an easily understandable format.