1 Parallel Computing 6: Performance Analysis
Ondřej Jakl, Institute of Geonics, Academy of Sciences of the CR
2 Outline of the lecture
– Performance models
– Execution time: computation, communication, idle
– Experimental studies
– Speedup, efficiency, cost
– Amdahl's and Gustafson's laws
– Scalability: fixed and scaled problem size
– Isoefficiency function
3 Why analysis of (parallel) algorithms
A common goal in the design of parallel programs: maximum speed
– but in fact a tradeoff between performance, simplicity, portability, user friendliness, etc., and also development / maintenance cost
– higher development cost in comparison with sequential software
Mathematical performance models of parallel algorithms can help
– predict performance before implementation (will it improve on an increasing number of processors?)
– compare design alternatives and make decisions
– explain barriers to higher performance of existing codes
– guide optimization efforts
– i.e. (not unlike a scientific theory) explain existing observations, predict future behaviour, abstract away unimportant details
– tradeoff between simplicity and accuracy
For many common algorithms, performance models can be found in the literature
– e.g. [Grama 2003] Introduction to Parallel Computing
4 Performance models
Performance is a multifaceted issue, with application-dependent importance. Examples of metrics for measuring parallel performance:
– execution time
– parallel efficiency
– memory requirements
– throughput and/or latency
– scalability
– ratio of execution time to system cost
Performance model: mathematical formalization of a given metric
– takes into account the parallel application + the target parallel architecture = the parallel system
Ex.: performance model for the parallel execution time T
T = f(N, P, U, ...)
N – problem size, P – number of processors, U – number of tasks, ... – other hw and sw characteristics, depending on the level of detail
5 Execution time
Probably the most important metric, not only in parallel processing.
Simple definition: the time elapsed from when the first processor starts executing the (parallel) program to when the last processor completes the execution.
Parallel execution time can be divided into computation (comp), communication (comm) and idle (idle) times [next slides].
Execution time T equals the execution time Ti on any (i-th) processor:
T = Ti = Ti,comp + Ti,comm + Ti,idle
or, using the sums of times Tcomp, Tcomm, Tidle over all P processors,
T = (Tcomp + Tcomm + Tidle) / P
Assumption: one-to-one task-processor mapping, identical processors ( = processing elements).
[Figure: per-processor timelines P1–P4 over the execution time T]
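The averaged form of the formula is easy to evaluate once the per-processor component times are known. A minimal Python sketch of the bookkeeping; the sample timings are hypothetical values, only to illustrate the formula:

```python
def execution_time(t_comp, t_comm, t_idle):
    """T = (Tcomp + Tcomm + Tidle) / P, with the three lists holding the
    per-processor component times (one entry per processor)."""
    P = len(t_comp)
    return (sum(t_comp) + sum(t_comm) + sum(t_idle)) / P

# hypothetical per-processor timings for P = 4
print(execution_time([4.0, 3.8, 4.1, 3.9], [0.3, 0.4, 0.3, 0.4], [0.2, 0.3, 0.1, 0.2]))
```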
6 Process-time diagram of a real application, generated in XPVM
7 Computation time
Tcomp – time spent on the computation itself
– sequential programs are supposed to run entirely in Tcomp
Depends on:
– the performance characteristics of the processors and their memory systems
– the size of the problem N (may be a set of parameters)
– the number of processors P, in particular if replication of computation is applied
– hence one cannot assume constant computation time when the number of processors varies
8 Communication time (1)
Tcomm – time spent sending and receiving messages; a major component of the parallel overhead.
Depends on:
– the size of the message
– the structure of the interconnection system
– the mode of the transfer (e.g. store-and-forward, cut-through)
Simple (idealized) timing model: Tmsg = ts + tw · L
– ts .. startup time (latency)
– L .. message size in bytes
– tw .. transfer time per data word
– bandwidth (throughput): 1/tw, the transfer rate, usually quoted in bits/sec
[Figure: message transfer time vs. message length, illustrating startup time and bandwidth]
9 Communication time (2)
Substantial platform-dependent differences in ts, tw – cf. [Foster 1995]
– measurements necessary (ping-pong test)
– great impact on the parallelization approach
Ex. IBM SP timings: to : tw : ts = 1 : 55 : 8333 (to .. arithmetic operation time)
– latency dominates with small messages!
Internode versus intranode communication: location of the communicating tasks on the same vs. different computing nodes
– intranode communication is in general considered faster
– valid e.g. on Ethernet networks; on supercomputers the two are often quite comparable
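The ts and tw parameters are typically obtained from such a ping-pong measurement. A minimal sketch, assuming mpi4py and NumPy are available (run with e.g. mpiexec -n 2); the message sizes and repetition count are arbitrary choices:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for nbytes in (8, 1024, 1024 * 1024):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)      # ping
            comm.Recv(buf, source=1)    # pong
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        # one-way message time = half of the averaged round-trip time;
        # fitting these points to Tmsg = ts + tw * L yields ts and tw
        print(f"{nbytes:>8} B: {elapsed / (2 * reps):.6e} s")
```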
10 Real communication timings
[Figure: measured communication time and bandwidth data of the IBM SP]
11 Idle time
Tidle – time spent waiting for computation and/or data; another component of the parallel overhead.
Due to a lack of work
– uneven distribution of work to processors (load imbalance)
– a consequence of synchronization and communication
Can be reduced by
– load-balancing techniques
– overlapping computation and communication
In practice difficult to determine
– depends on the order of operations
Often neglected in performance models
12 Ex.: Timing Jacobi finite differences
A 2-D grid of N x Z points, P processors; 1-D decomposition into P subgrids of (N/P) x Z points.
Model parameters: tc .. average computation time at a single grid point, ts .. latency, tw .. transfer time per word
Total computation time, summed over all nodes: Tcomp = tc N Z
Total communication time, summed over P processors: Tcomm = 2 P (ts + Z tw)
Tidle is neglected (structured, synchronous communication).
Execution time per iteration:
T = (Tcomp + Tcomm + Tidle) / P = (tc N Z + 2 P (ts + Z tw) + 0) / P = tc (N/P) Z + 2 (ts + Z tw) ( = Ti,comp + Ti,comm)
[Figure: the N x Z grid decomposed into strips of (N/P) x Z points]
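The per-iteration model is straightforward to evaluate numerically. A minimal Python sketch; the parameter values in the example are hypothetical, chosen only for illustration (roughly following the 1 : 55 : 8333 ratio quoted earlier), not measured ones:

```python
def jacobi_time_per_iteration(N, Z, P, tc, ts, tw):
    """T = tc*(N/P)*Z + 2*(ts + Z*tw): local computation on an (N/P) x Z
    subgrid plus the exchange of two Z-point boundary strips."""
    return tc * (N / P) * Z + 2 * (ts + Z * tw)

# hypothetical parameters (seconds)
tc, tw, ts = 1e-8, 55e-8, 8333e-8
for P in (1, 4, 16, 64):
    print(P, jacobi_time_per_iteration(N=1024, Z=1024, P=P, tc=tc, ts=ts, tw=tw))
```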
13 Reducing model complexity
Idealized multicomputer
– no low-level hardware details, e.g. memory hierarchies, network topologies
Scale analysis
– e.g. neglect a one-time initialization step of an iterative algorithm
Empirical constants for model calibration instead of modelling details
Trade-off between model complexity and acceptable accuracy
14 Experimental studies
Parallel computing is primarily an experimental discipline.
Goals of experimental studies:
– parameters for performance models (e.g. ts, tw in Tcomm)
– comparison of observed and modelled performance
– calibration of performance models
Design of experiments – issues:
– data to be measured
– measurement methods and tools
– accuracy and reproducibility (always repeat to verify! – see the timing sketch after this slide)
Often considerable variation in the results – possible causes:
– a nondeterministic algorithm (e.g. due to random numbers)
– timer problems (inaccuracy, limited resolution)
– startup and shutdown costs (expensive, system dependent)
– interference from other programs (even on dedicated processors)
– communication contention (e.g. on Ethernet)
– random resource allocation (if processor nodes are not equivalent)
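A minimal sketch of a repeated-timing harness in Python (time.perf_counter is a standard-library timer; repetition makes the run-to-run spread visible). The workload in the usage comment is hypothetical:

```python
import time

def measure(fn, repeats=5):
    """Run fn() several times and return (min, median, max) wall-clock times.
    A large spread hints at interference, contention or timer problems."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[0], samples[len(samples) // 2], samples[-1]

# usage with some hypothetical workload:
# best, median, worst = measure(lambda: my_solver(), repeats=10)
```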
15 Comparative performance metrics
Execution time is not always convenient
– varies with the problem size
– comparison with the original sequential code is needed
More adequate measures of parallelization quality:
– speedup
– efficiency
– cost
Basis for qualitative analysis
16 Speedup
Quantifies the performance gain achieved by parallelizing a given application over a sequential implementation.
Relative speedup on P processors: Sr = T1 / Tp
– T1 .. execution time on one processor: of the parallel program, or of the original sequential program
– Tp .. execution time on P (equal) processors
Absolute speedup on P processors: S = T1 / Tp
– T1 .. execution time of the best-known sequential algorithm
– Tp .. see above
S is more objective; Sr is used in practice
– Sr more or less indicates scalability
0 < S <= Sr <= P is expected
17 Superlinear speedup [Grama 2003]
Theoretically, the (absolute) speedup can never exceed the number of processors
– otherwise another sequential algorithm could emulate the parallel run in a shorter time
In practice S > P is sometimes observed – superlinear speedup
– a "bonus" of the parallelization efforts
Reasons:
– the sequential algorithm is not optimal
– the sequential algorithm is penalized by hardware, e.g. slower access to data (cache effects)
– the sequential and parallel algorithms do not perform the same work, e.g. tree search
18 Typical speedup curves [Lin 2009]
[Figure: speedup of two programs (Program 1, Program 2) vs. processor count, compared to the linear and superlinear speedup lines]
19 Efficiency
Measures the fraction of time for which a processing element is usefully employed
– characterizes the effectiveness with which a program uses the resources of a parallel computer
Relative efficiency on P processors: Er = Sr / P = T1 / (P · Tp)
– Sr .. relative speedup
Absolute efficiency on P processors: E = S / P
0 < E <= Er <= 1
20 Cost
Characterizes the amount of work performed by the processors when solving the problem.
Cost on P processors: C = Tp · P = T1 / E
– also called the processor-time product
– the cost of a sequential computation is its execution time
Cost-optimal parallel system: the cost of solving a problem on the parallel computer is proportional to (matches) the cost ( = execution time) of the fastest-known sequential algorithm
– i.e. the efficiency is asymptotically constant and the speedup is linear
– cost optimality implies very good scalability [further slides]
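The three metrics of the last slides reduce to one-line formulas. A minimal Python sketch; the sample values are an assumption built around the observation quoted later in the lecture ("speedup of 10.8 on 12 processors"), taking T1 = 10.8 s and Tp = 1.0 s:

```python
def speedup(t1, tp):
    return t1 / tp                 # S = T1 / Tp

def efficiency(t1, tp, p):
    return t1 / (p * tp)           # E = S / P = T1 / (P * Tp)

def cost(tp, p):
    return tp * p                  # C = Tp * P = T1 / E

t1, tp, p = 10.8, 1.0, 12
print(speedup(t1, tp), efficiency(t1, tp, p), cost(tp, p))   # 10.8, 0.9, 12.0
```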
21 Amdahl's law (1)
Observation: every parallel algorithm has a fraction of operations that must be performed sequentially (the sequential component); that component limits its speedup.
Gene Amdahl (1967): If rs (0 < rs <= 1) is the sequential component of the execution time, then the maximal possible speedup achievable on a parallel computer is 1/rs, no matter how many processors are used.
E.g. if 5% of the computation is serial (rs = 0.05), then the maximum speedup is 20.
22 Amdahl's law (2)
Proof: Let rp be the parallelizable part of the algorithm, i.e. rs + rp = 1. Then Tp, the parallel execution time on P processors, is at best
Tp = (rs + rp / P) · T1
Thus, for the speedup Sp on P processors it holds that
Sp = T1 / Tp <= 1 / (rs + rp / P)
and, as P grows, Sp approaches the limit 1 / rs.
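A minimal Python sketch of the bound just derived (the processor counts in the loop are arbitrary):

```python
def amdahl_speedup(rs, p):
    """Maximum speedup on p processors with sequential fraction rs."""
    return 1.0 / (rs + (1.0 - rs) / p)

# rs = 0.05: the speedup approaches the limit 1/rs = 20 as p grows
for p in (4, 16, 64, 1024):
    print(f"P = {p:>4}: S <= {amdahl_speedup(0.05, p):.2f}")
```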
23 Amdahl's law (3)
Had a somewhat retarding effect on the development of parallel computing.
Practice showed that Amdahl's reasoning is too pessimistic
– greater speedups were encountered than Amdahl's law predicted
– sequential components are usually not inherent; a reformulation of the problem may eliminate the bottleneck
– increasing the problem size may decrease the percentage of the sequential part of the algorithm; reflected in the newer Gustafson's law [next slide]
Amdahl's law is relevant when sequential programs are parallelized incrementally / partially
– e.g. data-parallel programs with some part not amenable to a data-parallel formulation
24 Gustafson(-Barsis)'s law
Observation: a larger multicomputer usually allows larger problems to be solved in reasonable time.
John Gustafson (1988): Given a parallel program solving a problem of size N using P processors, let rs denote the sequential component (i.e. (1 - rs) is the parallelizable component). The maximum speedup S achievable by this program is
S = P - rs · (P - 1)
E.g. if 5% of the computation is sequential (rs = 0.05), then on 20 processors the maximum speedup is 20 - 0.05 · 19 = 19.05
– Amdahl: 10.26
Gustafson – time-constrained scaling, scaled speedup
– the problem size is an increasing function of the processor count
– constant parallel execution time, decreasing serial component
– Amdahl – constant-problem-size scaling
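A short Python sketch reproducing the comparison above (rs = 0.05, P = 20):

```python
def gustafson_speedup(rs, p):
    """Scaled speedup on p processors with sequential fraction rs."""
    return p - rs * (p - 1)

def amdahl_speedup(rs, p):
    return 1.0 / (rs + (1.0 - rs) / p)

rs, p = 0.05, 20
print(gustafson_speedup(rs, p))   # 19.05 (scaled problem size)
print(amdahl_speedup(rs, p))      # ~10.26 (fixed problem size)
```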
25 Quantitative analysis
Investigates the adaptability of the parallel system to changes in the computing environment
– problem size, number of processors, communication speed, memory size, etc.
Based on the substitution of machine-specific numeric values for the various parameters in the performance models
– caution necessary – performance models are idealizations of complex phenomena
Most interesting: the ability to utilize an increasing number of processors
– studied in scalability analysis [next slides]
26 Scalability
Scalability of a parallel system is a measure of its ability to increase performance (speedup) as the number of processors increases
– hardware scalability: the parallel computer can incorporate more processors without degrading the communication subsystem
Naively, one would assume that more processors (automatically) improve performance.
The definition of a scalable parallel program (system) varies in the literature; formalizations are often imprecise
– e.g. "a parallel system is scalable if the performance is linearly proportional to the number of processors used"
27 Fixed problem size (1)
Scalability with fixed problem size: the dependence of the parallel system performance (execution time, efficiency) on the changing processor count when the problem size (and other machine parameters) are fixed.
The analysis answers questions such as "what is the fastest one can solve the given problem on the given computer?"
[Figures: T vs. P and E vs. P] Efficiency will generally decrease monotonically with increasing processor count; execution time should actually increase after reaching some maximum number of processors.
28 Fixed problem size (2) [Quinn 2004]
Nontrivial parallel algorithm: in reality, for any fixed problem there is an optimum number of processors that minimizes the overall execution time
– the computation time component Tcomp decreases
– the communication time component Tcomm (+ idle time Tidle) increases
– there is usually an upper limit on the number of processors that can be usefully employed
An execution time model aspiring to performance extrapolation (prediction) accommodates a term with P^x, x > 0 (see the sketch after this slide).
Choosing the problem size is difficult if the processor range is large
– it must provide enough data for large-scale computations
– the data must fit into memory for small-scale computations
Solution: scaling the problem size with the processor count [next slide]
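A minimal sketch of such a fixed-size model and of locating the optimum processor count. The model form T(P) = a + b/P + c·P (i.e. x = 1) and the coefficient values are assumptions chosen purely for illustration:

```python
def modelled_time(P, a, b, c):
    """Generic fixed-size model: a (serial part) + b/P (parallelizable part)
    + c*P (overhead term growing with P, the 'P**x' term with x = 1)."""
    return a + b / P + c * P

def optimal_P(a, b, c, p_max=4096):
    """Processor count minimizing the modelled execution time."""
    return min(range(1, p_max + 1), key=lambda P: modelled_time(P, a, b, c))

# hypothetical coefficients: 1 s serial, 400 s parallelizable, 0.01 s/processor overhead
print(optimal_P(a=1.0, b=400.0, c=0.01))   # continuous optimum is sqrt(b/c) = 200
```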
29 Scaled problem size (1)
Scalability with scaled problem size: the dependence of parallel system performance on the number of processors when the problem size is allowed to change.
Encouraged by the fact that parallelization is employed not only to solve (fixed-size) problems faster, but also to solve larger problems
– typically the problem size is increased when moving to more powerful machines with more processors
– with some problems scaling is not possible (e.g. with functional decomposition)
Observations:
– efficiency will often increase with increasing problem size and constant processor count
– efficiency will generally decrease with increasing processor count [prev. slide]
30 Scaled problem size (2)
Larger problems (N) have a higher execution time (T, left) and usually a better efficiency (E, right) on the same number of processors (P) than smaller ones.
[Figures: T vs. P and E vs. P for N = 500 and N = 1000]
31 Isoefficiency metric of scalability
Of particular interest: how must the amount of computation scale with the number of processors to keep the efficiency constant?
The isoefficiency function gives the growth rate of the problem size N which is necessary to keep E constant with increasing P
– it does not exist for unscalable parallel systems
T1 = E · (Tp · P) = E · (Tcomp + Tcomm + Tidle)
– to maintain constant efficiency, the amount of essential computation must increase at the same rate as the overheads
If the isoefficiency function is O(P), then the parallel system is highly scalable:
– the amount of computation needs to increase only linearly with respect to P to keep the efficiency constant
– ex. Jacobi finite differences: for N = O(P), T1 = tc Z N satisfies T1 = E · (tc Z N + 2 P (ts + Z tw)) with a constant E, thus the problem is highly scalable
[Figure: the grid of 2N x Z points on 2P processors; the subgrid width N/P is unchanged]
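A short Python check of the Jacobi claim: scaling N linearly with P keeps the modelled efficiency constant. The parameter values are hypothetical, chosen only to make the effect visible:

```python
def jacobi_efficiency(N, Z, P, tc, ts, tw):
    """E = T1 / (P * Tp) for the Jacobi model of slide 12."""
    t1 = tc * N * Z                                      # sequential time
    parallel_cost = tc * N * Z + 2 * P * (ts + Z * tw)   # P * Tp
    return t1 / parallel_cost

# hypothetical parameters, in units of one arithmetic operation
tc, tw, ts, Z = 1.0, 55.0, 8333.0, 64
for P in (2, 8, 32, 128):
    N = 1000 * P                     # N = O(P): isoefficiency scaling
    print(P, round(jacobi_efficiency(N, Z, P, tc, ts, tw), 3))   # constant ~0.73
```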
32 Other evaluation methods
Extrapolation from observations
– statements like "speedup of 10.8 on 12 processors with problem size 100"
– a small number of observations in a multidimensional space says little about the quality of the parallel system as a whole
Asymptotic analysis
– statements like "the algorithm requires O(N log N) time on O(N) processors"
– deals with large N and P, usually out of the scope of practical interest
– says nothing about absolute cost
– usually assumes idealized machine models (e.g. PRAM)
– more important for theory than for practice
33 Conclusions
The lecture provides only a "feel and taste" introduction to the analytical modelling of parallel programs.
Good knowledge is required especially where supercomputing is concerned
– practical experience from small parallel systems is difficult to extrapolate to large problems targeted at machines with thousands of processors
34 Further study
Covered to some extent in all textbooks on parallel programming/computing
– each with its own specific point of view
The most profound coverage can probably be found in [Grama 2003] Introduction to Parallel Computing
36 Comments on the lecture: find further ??? and red spots
37 Lin, p. 77: FLOPS
38 Lin, p. 64: Sources of performance loss