Lecture 5: Today’s Topics and Learning Objectives
Quinn Chapter 7
– Predict the performance of parallel programs
– Understand barriers to higher performance


Outline
– General speedup formula
– Amdahl’s Law
– Gustafson-Barsis’s Law
– Karp-Flatt metric (skipped)
– Isoefficiency metric

Speedup Formula
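The formula behind this slide is the standard definition the rest of the lecture uses: speedup is the ratio of sequential to parallel execution time,

ψ(n,p) = (sequential execution time) / (parallel execution time) = T(n,1) / T(n,p)

where T(n,1) is the time of the best sequential algorithm on a problem of size n and T(n,p) is the time of the parallel program on p processors.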

Execution Time Components
– Inherently sequential computations: σ(n)
– Potentially parallel computations: φ(n)
– Communication operations: κ(n,p)

Speedup Expression: the speedup symbol is ψ (“psi”, as in “lopsided”).
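Written out with the three execution-time components defined above, the expression referred to here is

ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

The numerator is the sequential execution time; the denominator is the parallel execution time, in which only the parallelizable work φ(n) is divided among the p processors and the communication cost κ(n,p) is added.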

Plot: φ(n)/p as p grows with n fixed (this term decreases).

(n,p)(n,p)

Plot: φ(n)/p + κ(n,p) as p grows with n fixed (the sum eventually stops decreasing).

Speedup Plot: the curve “elbows out”, i.e. speedup grows with p at first, then levels off and can even decline as overhead dominates.

Efficiency
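The definition behind this slide: efficiency is speedup per processor,

ε(n,p) = ψ(n,p) / p ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p))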

0 < ε(n,p) ≤ 1
– All terms are > 0, so ε(n,p) > 0
– The denominator is at least as large as the numerator, so ε(n,p) ≤ 1

Amdahl’s Law. Let f = σ(n) / (σ(n) + φ(n)), the inherently sequential fraction of the computation. Amdahl’s Law gives a bound on the maximum potential speedup in terms of this sequential fraction (see the formula below).
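With f defined as above, and ignoring the communication term κ(n,p), the bound is

ψ ≤ 1 / (f + (1 - f)/p)

As p → ∞ the right-hand side approaches 1/f, so the sequential fraction alone caps the speedup.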

Example 1 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
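A worked solution using Amdahl’s Law: f = 0.05 and p = 8, so

ψ ≤ 1 / (0.05 + 0.95/8) = 1 / 0.16875 ≈ 5.9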

Example 2 20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
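A worked solution: f = 0.2, and the bound is largest as p → ∞:

ψ ≤ 1 / (0.2 + 0.8/p) → 1 / 0.2 = 5

No matter how many processors are used, the speedup cannot exceed 5.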

Pop Quiz An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
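A worked solution: only the function accounting for 80% of the time can be parallelized, so f = 0.2 and p = 8:

ψ ≤ 1 / (0.2 + 0.8/8) = 1 / 0.3 ≈ 3.3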

Pop Quiz A computer animation program generates a feature movie frame-by-frame. Each frame can be generated independently and is output to its own file. If it takes 99 seconds to render a frame and 1 second to output it, how much speedup can be achieved by rendering the movie on 100 processors?
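A worked solution, under the reading that rendering parallelizes perfectly while the 1 second of output per frame remains effectively sequential: f = 1/100 = 0.01 and p = 100, so

ψ ≤ 1 / (0.01 + 0.99/100) = 1 / 0.0199 ≈ 50

If the per-frame output could also proceed fully in parallel, the speedup could approach 100, so the answer depends on how the I/O is treated.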

Limitations of Amdahl’s Law
– Ignores κ(n,p), so it overestimates the achievable speedup
– Treats problem size as a constant
– Shows only how execution time decreases as the number of processors increases
– Could also be overly pessimistic

The Amdahl Effect
– Typically κ(n,p) has lower complexity than φ(n)/p
– As n increases, φ(n)/p dominates κ(n,p)
– As n increases, speedup potentially increases

Illustration of the Amdahl Effect: a plot of speedup versus processors for n = 100, n = 1,000, and n = 10,000; larger problem sizes yield higher speedup curves.

Another Perspective. We often use faster computers to solve larger problem instances. Let’s treat time as a constant and allow the problem size to increase with the number of processors.

Gustafson-Barsis’s Law. Let s = σ(n) / (σ(n) + φ(n)/p), the fraction of the parallel execution time spent on sequential processing. Like Amdahl’s Law, it gives a bound on the maximum potential speedup, but now in terms of this parallel-execution fraction (see the bound below). We call the result “scaled speedup” because we start from a parallel computation whose problem size is often a function of p.
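With s defined as above, the bound is

ψ ≤ p + (1 - p)·s

Intuitively: a sequential run would take p times the parallel time, minus the p - 1 redundant shares of the serial work.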

Example 1. An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application? Hint: execution on 1 CPU would take 10 times as long, except that 9 of those 10 “copies” of the serial code would not have to be executed.
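A worked solution using Gustafson-Barsis’s Law: s = 0.03 and p = 10, so

ψ ≤ 10 + (1 - 10)(0.03) = 10 - 0.27 = 9.73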

Example 2 What is the maximum fraction of a program’s parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?
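A worked solution: we need 8 + (1 - 8)·s ≥ 7, i.e. 8 - 7s ≥ 7, so s ≤ 1/7 ≈ 0.14. At most about 14% of the parallel execution time can be spent in serial code.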

Pop Quiz A parallel program executing on 32 processors spends 5% of its time in sequential code. What is the scaled speedup of this program?
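A worked solution: s = 0.05 and p = 32, so

ψ ≤ 32 + (1 - 32)(0.05) = 32 - 1.55 = 30.45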

Isoefficiency Metric
– Parallel system: a parallel program executing on a parallel computer
– Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
– A scalable system maintains efficiency as processors are added
– Isoefficiency: a way to measure scalability

Isoefficiency Derivation Steps
– Goal: determine the impact of parallel overhead
– Begin with the speedup formula
– Compute the total amount of overhead
– Assume efficiency remains constant
– Determine the relation between sequential execution time and overhead

Deriving the Isoefficiency Relation
– Determine the total overhead T₀(n,p)
– Substitute the overhead into the speedup equation
– Substitute T(n,1) = σ(n) + φ(n)
– Assume efficiency is constant
– Solve for the isoefficiency relation (written out below)
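Written out, the derivation sketched above goes as follows. The total overhead is T₀(n,p) = (p - 1)·σ(n) + p·κ(n,p), and with T(n,1) = σ(n) + φ(n) the parallel time satisfies p·T(n,p) = T(n,1) + T₀(n,p). Therefore

ε(n,p) ≤ T(n,1) / (T(n,1) + T₀(n,p))

To keep efficiency at a fixed level ε, we need T(n,1) / (T(n,1) + T₀(n,p)) ≥ ε, which rearranges to the isoefficiency relation

T(n,1) ≥ C·T₀(n,p), where C = ε / (1 - ε)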

Scalability Function
– Space is the limiting factor, since to maintain efficiency we must increase the problem size
– Suppose the isoefficiency relation is n ≥ f(p)
– Let M(n) denote the memory required for a problem of size n
– M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency
– We call M(f(p))/p the scalability function

Meaning of the Scalability Function
– To maintain efficiency when increasing p, we must increase n
– The maximum problem size is limited by available memory, which grows linearly in p
– The scalability function shows how memory usage per processor must grow to maintain efficiency
– If the scalability function is a constant, the parallel system is perfectly scalable

Interpreting the Scalability Function: a plot of memory needed per processor versus number of processors for scalability functions C, C·log p, C·p, and C·p·log p; curves that stay within the available memory per processor can maintain efficiency, while those that grow beyond it cannot.

Example 1: Reduction
– Sequential algorithm complexity: T(n,1) = Θ(n)
– Parallel algorithm: computational complexity Θ(n/p), communication complexity Θ(log p)
– Parallel overhead: T₀(n,p) = Θ(p log p)

Reduction (continued)
– Isoefficiency relation: n ≥ C·p·log p
– We ask: to maintain the same level of efficiency, how must n increase when p increases?
– M(n) = n
– The system has good scalability
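Checking the scalability function: with M(n) = n and f(p) = C·p·log p,

M(f(p)) / p = C·p·log p / p = C·log p

Memory per processor needs to grow only logarithmically in p, which is why the system is judged to have good (though not perfect) scalability.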

Example 2: Floyd’s Algorithm
– Sequential time complexity: Θ(n³)
– Parallel computation time: Θ(n³/p)
– Parallel communication time: Θ(n² log p)
– Parallel overhead: T₀(n,p) = Θ(p·n² log p)

Floyd’s Algorithm (continued)
– Isoefficiency relation: n³ ≥ C·p·n²·log p, so n ≥ C·p·log p
– M(n) = n²
– The parallel system has poor scalability
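Checking the scalability function: with M(n) = n² and f(p) = C·p·log p,

M(f(p)) / p = (C·p·log p)² / p = C²·p·log² p

Memory per processor would have to grow faster than linearly in p, which is not sustainable; hence the poor scalability.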

Example 3: Finite Difference
– Sequential time complexity per iteration: Θ(n²)
– Parallel communication complexity per iteration: Θ(n/√p)
– Parallel overhead: T₀(n,p) = Θ(n·√p)

Finite Difference (continued)
– Isoefficiency relation: n² ≥ C·n·√p, so n ≥ C·√p
– M(n) = n²
– This algorithm is perfectly scalable
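Checking the scalability function: with M(n) = n² and f(p) = C·√p,

M(f(p)) / p = (C·√p)² / p = C²·p / p = C²

Memory per processor stays constant as p grows, which is why the algorithm is perfectly scalable.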

Summary (1/3)
– Performance terms: speedup, efficiency
– Model of speedup: serial component, parallel component, communication component

Summary (2/3): What prevents linear speedup?
– Serial operations
– Communication operations
– Process start-up
– Imbalanced workloads
– Architectural limitations

Summary (3/3): Analyzing parallel performance
– Amdahl’s Law
– Gustafson-Barsis’s Law
– Karp-Flatt metric
– Isoefficiency metric