1 Parallel Computing 6: Performance Analysis. Ondřej Jakl, Institute of Geonics, Academy of Sciences of the CR

2 Outline of the lecture
- Performance models
- Execution time: computation, communication, idle
- Experimental studies
- Speedup, efficiency, cost
- Amdahl's and Gustafson's laws
- Scalability: fixed and scaled problem size
- Isoefficiency function

3 Why analysis of (parallel) algorithms
- Common pursuit in the design of parallel programs: maximum speed
  – in fact a tradeoff between performance, simplicity, portability, user friendliness, etc., and also development/maintenance cost
  – higher development cost in comparison with sequential software
- Mathematical performance models of parallel algorithms can help
  – predict performance before implementation (will it improve on an increasing number of processors?)
  – compare design alternatives and make decisions
  – explain barriers to higher performance of existing codes
  – guide optimization efforts
  – i.e. (not unlike a scientific theory) explain existing observations, predict future behaviour, abstract away unimportant details
  – tradeoff between simplicity and accuracy
- For many common algorithms, performance models can be found in the literature, e.g. [Grama 2003] Introduction to Parallel Computing

4 Performance models
- Performance is a multifaceted issue, with application-dependent importance
- Examples of metrics for measuring parallel performance:
  – execution time
  – parallel efficiency
  – memory requirements
  – throughput and/or latency
  – scalability
  – ratio of execution time to system cost
- Performance model: mathematical formalization of a given metric
  – takes into account the parallel system = parallel application + target parallel architecture
- Example: performance model for the parallel execution time T
  T = f(N, P, U, ...)
  where N is the problem size, P the number of processors, U the number of tasks, and "..." stands for other hardware and software characteristics, depending on the level of detail

5 Execution time
- Probably the most important metric, not only in parallel processing
- Simple definition: the time elapsed from when the first processor starts executing the (parallel) program to when the last processor completes the execution
- Parallel execution time can be divided into computation (comp), communication (comm) and idle (idle) times [next slides]
- Execution time T equals the execution time T_i on any (i-th) processor:
  T = T_i = T_i,comp + T_i,comm + T_i,idle
  or, using the sums T_comp, T_comm, T_idle of these times over all P processors,
  T = (T_comp + T_comm + T_idle) / P
- Assumption: one-to-one task-processor mapping, identical processors (= processing elements)
[Figure: execution timeline of processors P1 to P4]
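
A minimal MPI sketch of this decomposition (compute_step() and exchange() below are hypothetical stand-ins for the application's own work): each rank accumulates its T_i,comp and T_i,comm with MPI_Wtime(), the per-rank totals are summed with MPI_Reduce, and rank 0 forms T = (T_comp + T_comm + T_idle) / P. Idle time is simply folded into the communication timer here, since blocking communication waits inside it.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical application hooks, stand-ins for the real work. */
    static void compute_step(void) { /* local number crunching */ }
    static void exchange(void)     { MPI_Barrier(MPI_COMM_WORLD); /* e.g. halo swap */ }

    int main(int argc, char **argv)
    {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        double t_comp = 0.0, t_comm = 0.0, t0;

        for (int iter = 0; iter < 100; iter++) {
            t0 = MPI_Wtime();
            compute_step();                 /* contributes to T_i,comp */
            t_comp += MPI_Wtime() - t0;

            t0 = MPI_Wtime();
            exchange();                     /* contributes to T_i,comm (+ idle wait) */
            t_comm += MPI_Wtime() - t0;
        }

        double sum_comp, sum_comm;          /* Tcomp, Tcomm summed over all P ranks */
        MPI_Reduce(&t_comp, &sum_comp, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_comm, &sum_comm, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)                      /* T = (Tcomp + Tcomm + Tidle) / P */
            printf("T = %.6f s\n", (sum_comp + sum_comm) / P);

        MPI_Finalize();
        return 0;
    }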

6 [Figure: process-time diagram of a real application, generated in XPVM]

7 Computation time
- T_comp: time spent on the computation proper
  – sequential programs are supposed to run only in T_comp
- Depends on:
  – the performance characteristics of the processors and their memory systems
  – the size of the problem N (may be a set of parameters)
  – the number of processors P, in particular if replication of computation is applied
- One cannot assume constant computation time when the number of processors varies

8 Communication time (1)
- T_comm: time spent sending and receiving messages
- A major component of parallel overhead
- Depends on:
  – the size of the message
  – the structure of the interconnection system
  – the mode of the transfer, e.g. store-and-forward, cut-through
- Simple (idealized) timing model: T_msg = t_s + t_w · L
  – t_s .. startup time (latency)
  – L .. message size
  – t_w .. transfer time per data word (or byte; L must be expressed in the same unit)
  – bandwidth (throughput): 1/t_w, the transfer rate, usually quoted in bits per second
[Figure: transfer time vs. message length, showing the startup time and the bandwidth]
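
The model is straightforward to evaluate; the sketch below plugs in assumed values of t_s and t_w (placeholders, not measurements) and prints the predicted transfer time and effective bandwidth for a range of message lengths, showing how latency dominates short messages:

    #include <stdio.h>

    /* Idealized point-to-point model: T_msg = t_s + t_w * L */
    static double t_msg(double ts, double tw, double L) { return ts + tw * L; }

    int main(void)
    {
        const double ts = 50e-6;   /* assumed startup latency: 50 microseconds      */
        const double tw = 1e-9;    /* assumed per-byte transfer time (about 1 GB/s) */

        for (double L = 1.0; L <= 1e6; L *= 100.0) {
            double t = t_msg(ts, tw, L);
            printf("L = %8.0f B   T_msg = %.3e s   effective bandwidth = %8.2f MB/s\n",
                   L, t, L / t / 1e6);
        }
        return 0;
    }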

9 Communication time (2)
- Substantial platform-dependent differences in t_s, t_w; cf. [Foster 1995]
  – measurements necessary (ping-pong test)
  – great impact on the parallelization approach
- Example, IBM SP timings: t_o : t_w : t_s = 1 : 55 : 8333 (t_o is the time of one arithmetic operation)
  – latency dominates with small messages!
- Internode versus intranode communication: the communicating tasks reside on the same vs. different computing nodes
  – intranode communication is generally considered faster; valid e.g. on Ethernet networks
  – on supercomputers the two are often quite comparable
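
The ping-pong test mentioned above can be sketched in a few lines of MPI (run on at least two ranks; real benchmarks sweep many message sizes and discard warm-up rounds). The one-way time for a tiny message approximates t_s, and the slope between two sizes approximates t_w:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Average one-way time for a message of 'len' bytes between ranks 0 and 1. */
    static double pingpong(int rank, char *buf, int len, int reps)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / (2.0 * reps);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { SMALL = 1, LARGE = 1 << 20, REPS = 1000 };
        char *buf = malloc(LARGE);

        double t_small = pingpong(rank, buf, SMALL, REPS);
        double t_large = pingpong(rank, buf, LARGE, REPS);

        if (rank == 0) {       /* t_s ~ small-message time, t_w ~ slope */
            double tw = (t_large - t_small) / (double)(LARGE - SMALL);
            printf("ts ~ %.2e s, tw ~ %.2e s/B (bandwidth ~ %.1f MB/s)\n",
                   t_small, tw, 1.0 / tw / 1e6);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }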

10 Real communication timings
[Figure: measured communication time and bandwidth data of the IBM SP]

11 Idle time
- T_idle: time spent waiting for computation and/or data
- Another component of parallel overhead
- Due to lack of work
  – uneven distribution of work to processors (load imbalance)
  – consequence of synchronization and communication
- Can be reduced by
  – load-balancing techniques
  – overlapping computation and communication
- In practice difficult to determine
  – depends on the order of operations
- Often neglected in performance models

12 Example: timing Jacobi finite differences
- 2-D grid of N x Z points, P processors
- 1-D decomposition into P subgrids of (N/P) x Z points
- Model parameters: t_c .. average computation time at a single grid point, t_s .. latency, t_w .. transfer time per word
- Total computation time, summed over all nodes: T_comp = t_c N Z
- Total communication time, summed over all P processors: T_comm = 2 P (t_s + Z t_w)
- T_idle neglected (structured, synchronous communication)
- Execution time per iteration:
  T = (T_comp + T_comm + T_idle) / P = (t_c N Z + 2 P (t_s + Z t_w) + 0) / P = t_c (N/P) Z + 2 (t_s + Z t_w)   ( = T_i,comp + T_i,comm )
[Figure: the N x Z grid decomposed into strips of width N/P]
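
This per-iteration model can be tabulated directly; the constants below (t_c, t_s, t_w and the grid dimensions) are placeholder values that would in practice come from measurement:

    #include <stdio.h>

    /* Jacobi, 1-D decomposition: T = tc*(N/P)*Z + 2*(ts + Z*tw) per iteration */
    static double jacobi_T(double tc, double ts, double tw,
                           double N, double Z, double P)
    {
        return tc * (N / P) * Z + 2.0 * (ts + Z * tw);
    }

    int main(void)
    {
        const double tc = 1e-8;    /* assumed time per grid point    */
        const double ts = 50e-6;   /* assumed message startup time   */
        const double tw = 1e-8;    /* assumed transfer time per word */
        const double N = 4096.0, Z = 4096.0;

        for (int P = 1; P <= 256; P *= 2)
            printf("P = %3d   T = %.4f s per iteration\n",
                   P, jacobi_T(tc, ts, tw, N, Z, P));
        return 0;
    }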

13 Reducing model complexity
- Idealized multicomputer: no low-level hardware details, e.g. memory hierarchies, network topologies
- Scale analysis: e.g. neglect the one-time initialization step of an iterative algorithm
- Empirical constants for model calibration instead of modelling the details
- Trade-off between model complexity and acceptable accuracy

14 Experimental studies
- Parallel computing is primarily an experimental discipline
- Goals of experimental studies:
  – parameters for performance models (e.g. t_s, t_w in T_comm)
  – comparison of observed and modelled performance
  – calibration of performance models
- Issues in the design of experiments:
  – data to be measured
  – measurement methods and tools
  – accuracy and reproducibility (always repeat to verify!)
- Results often vary considerably between runs; possible causes:
  – a nondeterministic algorithm (e.g. due to random numbers)
  – timer problems (inaccurate, limited resolution)
  – startup and shutdown costs (expensive, system dependent)
  – interference from other programs (even on dedicated processors)
  – communication contention (e.g. on the Ethernet)
  – random resource allocation (if processor nodes are not equivalent)
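
One simple habit that addresses accuracy and reproducibility is sketched below (plain C, with a hypothetical kernel() standing for the code section under test): repeat the timed region several times and report the best time together with the spread, rather than trusting a single run.

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical code section whose execution time is measured. */
    static void kernel(void)
    {
        volatile double s = 0.0;
        for (long i = 0; i < 10000000L; i++) s += (double)i;
    }

    static double now(void)            /* wall-clock time in seconds */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void)
    {
        enum { RUNS = 5 };
        double best = 1e30, worst = 0.0;

        for (int r = 0; r < RUNS; r++) {   /* always repeat to verify */
            double t0 = now();
            kernel();
            double t = now() - t0;
            if (t < best)  best  = t;
            if (t > worst) worst = t;
            printf("run %d: %.4f s\n", r, t);
        }
        printf("best %.4f s, spread %.1f %%\n", best, 100.0 * (worst - best) / best);
        return 0;
    }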

15 Comparative performance metrics
- Execution time is not always convenient
  – varies with problem size
  – comparison with the original sequential code needed
- More adequate measures of parallelization quality:
  – speedup
  – efficiency
  – cost
- The basis for qualitative analysis

16 Speedup
- Quantifies the performance gain achieved by parallelizing a given application over a sequential implementation
- Relative speedup on P processors: S_r = T_1 / T_P
  – T_1 .. execution time on one processor, either of the parallel program or of the original sequential program
  – T_P .. execution time on P (equal) processors
- Absolute speedup on P processors: S = T_1 / T_P
  – T_1 .. execution time of the best-known sequential algorithm
  – T_P .. see above
- S is more objective, S_r is used in practice
  – S_r more or less predicts scalability
- 0 < S <= S_r <= P expected

17 Superlinear speedup [Grama 2003]
- Theoretically, (absolute) speedup can never exceed the number of processors
  – otherwise another sequential algorithm could emulate the parallel run in a shorter time
- In practice S > P is sometimes observed: superlinear speedup
  – a "bonus" of the parallelization effort
- Reasons:
  – the sequential algorithm is not optimal
  – the sequential algorithm is penalized by the hardware, e.g. slower access to data (cache effects)
  – the sequential and parallel algorithms do not perform the same work, e.g. tree search

18 Typical speedup curves [Lin 2009]
[Figure: speedup vs. number of processors for Program 1 and Program 2, with linear and superlinear speedup shown for reference]

19 Efficiency
- A measure of the fraction of time for which a processing element is usefully employed
  – characterizes the effectiveness with which a program uses the resources of a parallel computer
- Relative efficiency on P processors: E_r = S_r / P = T_1 / (P · T_P)
  – S_r .. relative speedup
- Absolute efficiency on P processors: E = S / P
- 0 < E <= E_r <= 1

20 Cost
- Characterizes the amount of work performed by the processors when solving the problem
- Cost on P processors: C = T_P · P = T_1 / E
  – also called the processor-time product
  – the cost of a sequential computation is its execution time
- Cost-optimal parallel system: the cost of solving a problem on the parallel computer is proportional to (matches) the cost (= execution time) of the fastest-known sequential algorithm
  – i.e. efficiency is asymptotically constant, speedup is linear
  – cost optimality implies very good scalability [further slides]
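
Given a measured T_1 and a set of T_P values, speedup, efficiency and cost follow directly from the definitions on the last three slides. The timings in this sketch are invented for illustration only:

    #include <stdio.h>

    int main(void)
    {
        const double T1   = 120.0;                               /* assumed sequential time [s] */
        const double Tp[] = { 120.0, 63.0, 34.0, 19.5, 12.0 };   /* assumed parallel times [s]  */
        const int    P[]  = { 1, 2, 4, 8, 16 };

        printf("  P      Tp       S       E        C\n");
        for (int i = 0; i < 5; i++) {
            double S = T1 / Tp[i];       /* speedup            */
            double E = S / P[i];         /* efficiency         */
            double C = Tp[i] * P[i];     /* cost = Tp*P = T1/E */
            printf("%3d  %6.1f  %6.2f  %6.2f  %7.1f\n", P[i], Tp[i], S, E, C);
        }
        return 0;
    }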

21 Amdahl's law (1)
- Observation: every parallel algorithm has a fraction of operations that must be performed sequentially (the sequential component); that component limits its speedup
- Gene Amdahl (1967): if r_s (0 < r_s <= 1) is the sequential component of the execution time, then the maximal possible speedup achievable on a parallel computer is 1/r_s, no matter how many processors are used
- E.g. if 5% of the computation is serial (r_s = 0.05), then the maximum speedup is 20

22 Amdahl's law (2)
- Proof: let r_p be the parallelizable part of the algorithm, i.e. r_s + r_p = 1. Then T_P, the parallel execution time on P processors, is
  T_P = (r_s + r_p / P) · T_1
- Thus, for the speedup S_P on P processors,
  S_P = T_1 / T_P = 1 / (r_s + r_p / P) = 1 / (r_s + (1 - r_s) / P)
- and
  S_P → 1 / r_s for P → ∞
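
The bound is easy to tabulate; the short sketch below evaluates S_P for the 5% example of the previous slide and shows how it approaches the limit 1/r_s = 20:

    #include <stdio.h>

    /* Amdahl: S_P = 1 / (r_s + (1 - r_s) / P), approaching 1 / r_s as P grows */
    static double amdahl(double rs, double P) { return 1.0 / (rs + (1.0 - rs) / P); }

    int main(void)
    {
        const double rs = 0.05;                     /* 5 % sequential component */
        for (int P = 1; P <= 1024; P *= 4)
            printf("P = %4d   S = %6.2f\n", P, amdahl(rs, P));
        printf("limit 1/rs = %.1f\n", 1.0 / rs);
        return 0;
    }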

23 Amdahl's law (3)
- Had some retarding effect on the development of parallel computing
- Practice showed that Amdahl's reasoning is too pessimistic
  – greater speedups were encountered than Amdahl's law predicted
  – sequential components are usually not inherent; reformulation of the problem may eliminate the bottleneck
  – increasing the problem size may decrease the percentage of the sequential part of the algorithm, as reflected in the newer Gustafson's law [next slide]
- Amdahl's law remains relevant when sequential programs are parallelized incrementally / partially
  – e.g. data-parallel programs with some part not amenable to a data-parallel formulation

24 Gustafson(-Barsis)'s law
- Observation: a larger multicomputer usually allows larger problems to be solved in reasonable time
- John Gustafson (1988): given a parallel program solving a problem of size N using P processors, let r_s denote the sequential component (i.e. (1 – r_s) is the parallelizable component). The maximum speedup S achievable by this program is
  S = P – r_s · (P – 1)
- E.g. if 5% of the computation is sequential (r_s = 0.05), then on 20 processors the maximum speedup is 20 – 0.05 · 19 = 19.05
  – Amdahl's law for the same case predicts only 1 / (0.05 + 0.95/20) ≈ 10.3
- Gustafson: time-constrained scaling, scaled speedup
  – the problem size is an increasing function of the processor count: constant parallel execution time, decreasing serial component
  – Amdahl: constant problem size scaling
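
Comparing the two laws numerically for the same sequential fraction makes the difference in their assumptions visible; the sketch below simply evaluates the formulas from this and the previous slides:

    #include <stdio.h>

    static double amdahl(double rs, double P)    { return 1.0 / (rs + (1.0 - rs) / P); }
    static double gustafson(double rs, double P) { return P - rs * (P - 1.0); }

    int main(void)
    {
        const double rs = 0.05;          /* 5 % sequential component */
        for (int P = 2; P <= 128; P *= 2)
            printf("P = %3d   Amdahl S = %6.2f   Gustafson S = %6.2f\n",
                   P, amdahl(rs, P), gustafson(rs, P));
        return 0;
    }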

25 Quantitative analysis
- Investigates the adaptability of the parallel system to changes in the computing environment
  – problem size, number of processors, communication speed, memory size, etc.
- Based on substituting machine-specific numeric values for the various parameters in performance models
  – caution necessary: performance models are idealizations of complex phenomena
- Most interesting: the ability to utilize an increasing number of processors
  – studied in scalability analysis [next slides]

26 Scalability
- Scalability of a parallel system is a measure of its ability to increase performance (speedup) as the number of processors increases
  – hardware scalability: the parallel computer can incorporate more processors without degrading the communication subsystem
- Naively, one would assume that more processors (automatically) improve performance
- The definition of a scalable parallel program (system) varies in the literature and is often formalized imprecisely
  – e.g. "a parallel system is scalable if the performance is linearly proportional to the number of processors used"

27 Fixed problem size (1)
- Scalability with fixed problem size: dependence of the parallel system performance (execution time, efficiency) on the changing processor count when the problem size (and other machine parameters) are fixed
- The analysis answers questions such as "what is the fastest one can solve the given problem on the given computer?"
[Figures: T vs. P and E vs. P for a fixed problem size]
- Efficiency will generally decrease monotonically with increasing processor count
- Execution time will actually increase after some maximum number of processors is reached

28 Fixed problem size (2) [Quinn 2004]
- Nontrivial parallel algorithm: in reality, for any fixed problem there is an optimum number of processors that minimizes the overall execution time
  – the computation time component T_comp decreases
  – the communication time T_comm (+ idle time T_idle) component increases
  – usually an upper limit on the number of processors that can be usefully employed
- An execution time model aspiring to performance extrapolation (prediction) accommodates a term with P^x, x > 0
- Choosing the problem size is difficult if the processor range is large
  – must provide enough data for large-scale computations
  – data must fit into memory for small-scale computations
- Solution: scaling the problem size with the processor count [next slide]
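
As an illustration only, the sketch below uses a purely hypothetical model of this kind, T(P) = a·N/P + b + c·P, with made-up coefficients; the growing c·P overhead term produces the optimum processor count discussed above (P* = sqrt(a·N/c), from dT/dP = 0):

    #include <stdio.h>
    #include <math.h>

    /* Hypothetical model: computation a*N/P, fixed overhead b, growing overhead c*P */
    static double model_T(double a, double b, double c, double N, double P)
    {
        return a * N / P + b + c * P;
    }

    int main(void)
    {
        const double a = 1e-6, b = 0.01, c = 0.002;   /* assumed coefficients */
        const double N = 1e7;                         /* assumed problem size */

        for (int P = 1; P <= 512; P *= 2)
            printf("P = %3d   T = %.3f s\n", P, model_T(a, b, c, N, P));

        printf("optimum P* ~ %.0f\n", sqrt(a * N / c));
        return 0;
    }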

29 Scaled problem size (1)
- Scalability with scaled problem size: dependence of the parallel system performance on the number of processors when the problem size is allowed to change
- Encouraged by the fact that parallelization is employed not only to solve (fixed-size) problems faster, but also to solve larger problems
  – typically the problem size is increased when moving to more powerful machines with more processors
  – with some problems scaling is not possible (e.g. with functional decomposition)
- Observations:
  – efficiency will often increase with increasing problem size and constant processor count
  – efficiency will generally decrease with increasing processor count [prev. slide]
[Figure: E vs. N for a constant processor count]

30 Scaled problem size (2)
- Larger problems (N) have a higher execution time (T, left plot) and usually a better efficiency (E, right plot) on the same number of processors (P) than smaller ones
[Figures: T vs. P (left) and E vs. P (right) for N = 500 and a larger N]

31 Isoefficiency metric of scalability
- Of particular interest: how must the amount of computation scale with the number of processors to keep the efficiency constant?
- Isoefficiency function of P: gives the growth rate of the problem size N which is necessary to keep E constant with increasing P
  – does not exist for unscalable parallel systems
  – constant efficiency requires T_1 = E (T_P · P) = E (T_comp + T_comm + T_idle)
  – to maintain constant efficiency, the amount of essential computation must increase at the same rate as the overheads
- If the isoefficiency function is O(P), then the parallel system is highly scalable:
  – the amount of computation needs to increase only linearly with respect to P to keep the efficiency constant
  – ex. Jacobi finite differences: for N = O(P), T_1 = t_c Z N >= E (t_c Z N + 2 P (t_s + Z t_w)) continues to hold for a fixed E, thus the problem is highly scalable
[Figure: doubling the grid to 2N points on 2P processors keeps the same strip width N/P]
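
For the Jacobi example this can be checked numerically: the sketch below (reusing the placeholder constants of the earlier model) scales N linearly with P and shows that the efficiency stays essentially constant, which is exactly the isoefficiency statement N = O(P).

    #include <stdio.h>

    /* Jacobi, 1-D decomposition: E = tc*N*Z / (tc*N*Z + 2*P*(ts + Z*tw)) */
    static double efficiency(double tc, double ts, double tw,
                             double N, double Z, double P)
    {
        double T1  = tc * N * Z;                             /* essential computation   */
        double PTp = tc * N * Z + 2.0 * P * (ts + Z * tw);   /* cost including overhead */
        return T1 / PTp;
    }

    int main(void)
    {
        const double tc = 1e-8, ts = 50e-6, tw = 1e-8, Z = 4096.0;

        for (int P = 1; P <= 256; P *= 2) {
            double N = 1024.0 * P;       /* scale the problem size: N = O(P) */
            printf("P = %3d   N = %7.0f   E = %.3f\n",
                   P, N, efficiency(tc, ts, tw, N, Z, P));
        }
        return 0;
    }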

32 Other evaluation methods
- Extrapolation from observations
  – statements like "speedup of 10.8 on 12 processors with problem size 100"
  – a small number of observations in a multidimensional space
  – says little about the quality of the parallel system as a whole
- Asymptotic analysis
  – statements like "algorithm requires O(N log N) time on O(N) processors"
  – deals with large N and P, usually out of the scope of practical interest
  – says nothing about absolute cost
  – usually assumes idealized machine models (e.g. PRAM)
  – more important for theory than for practice

33 Conclusions
- The lecture provides only a "feel and taste" introduction to the analytical modelling of parallel programs
- Good knowledge of the topic is required especially where supercomputing is concerned
  – practical experience from small parallel systems is difficult to extrapolate to large problems targeted at machines with thousands of processors

34 Further study
- Covered to some extent in all textbooks on parallel programming/computing, each from its own specific point of view
- Probably the most profound coverage can be found in [Grama 2003] Introduction to Parallel Computing

36 Comments on the lecture
- To do: find the remaining ??? and red-marked places

37 Lin, p. 77: FLOPS

38 Lin, p. 64: Sources of performance loss