COMP60611 Fundamentals of Parallel and Distributed Systems, Lecture 4: An Approach to Performance Modelling (Len Freeman, Graham Riley, October 2010)


1 COMP60611 Fundamentals of Parallel and Distributed Systems
Lecture 4: An Approach to Performance Modelling
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester
October 2010

2 Overview
Aims of performance modelling:
– Allows the comparison of algorithms.
– Gives an indication of the scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change: “complexity analysis of parallel algorithms”.
– Enables reasoned choices at the design stage.
Overview of an approach to performance modelling:
– Based on the approach of Foster and Grama et al.
– Targets a generic multicomputer (a model of message-passing).
Limitations.
A worked example:
– Vector sum reduction (i.e. compute the sum of the elements of a vector).
Summary.

3 Aims of performance modelling
In this lecture we will look at modelling the performance of algorithms that compute a result:
– Issues of correctness are relatively straightforward.
We are interested in questions such as:
– How long will an algorithm take to execute?
– How much memory is required (though we will not consider this in detail here)?
– Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?
– How do the performances of different algorithms compare?
Typically, we focus on one phase of a computation at a time:
– e.g. assume start-up and initialisation have been done, or that these phases have been modelled separately.

4 An approach to performance modelling
Based on a generic multicomputer (see next slide).
Defined in terms of tasks that undertake computation and communicate with other tasks as necessary:
– A task may be an agglomeration of smaller tasks.
Assumes a simple, but realistic, approach to communication between tasks:
– Based on channels that connect pairs of tasks.
Seeks an analytical expression for execution time ($T$) as a function of (at least) the problem size ($N$) and the number of processors ($P$), and often also the number of tasks ($U$).

5 A generic multicomputer
[Figure: several nodes, each a CPU with its own memory, connected by an interconnect.]

6 Task-channel model
Tasks execute concurrently:
– The number of tasks can vary during execution.
A task encapsulates a sequential program and local memory.
Tasks are connected by channels to other tasks:
– Channels are input or output channels.
In addition to reading from, and writing to, local memory, a task can:
– Send messages on output channels.
– Receive messages on input channels.
– Create new tasks.
– Terminate.

7 Task-channel model
A channel connecting two tasks acts as a message queue.
A send operation is asynchronous: it completes immediately.
– Sends are considered to be ‘free’ (take zero time)(?!).
A receive operation is synchronous: execution of a task is blocked until a message is available.
– Receives may cause waiting (idling) time and take a finite time to complete (as data is transmitted from one task to another).
Channels can be created dynamically.
Tasks can be mapped to physical processors in various ways.
– The mapping does not affect the semantics of the program, but it may well affect performance.
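The channel semantics can be sketched with Python's standard library. This is only an illustration, not part of the lecture: a `queue.Queue` stands in for a channel, `put` for the asynchronous send and `get` for the blocking receive. The names `producer`, `consumer` and `run_demo`, and the `None` end-of-stream sentinel, are assumptions of this sketch.

```python
import queue
import threading


def producer(out_channel, values):
    """Task that sends each value on its output channel, then a sentinel."""
    for v in values:
        out_channel.put(v)  # asynchronous send: returns without waiting for the receiver
    out_channel.put(None)   # end-of-stream marker (an assumption of this sketch)


def consumer(in_channel, results):
    """Task that receives until the sentinel arrives and records the sum."""
    total = 0
    while True:
        msg = in_channel.get()  # synchronous receive: blocks until a message is available
        if msg is None:
            break
        total += msg
    results.append(total)


def run_demo(values):
    """Connect two tasks by one channel and return the sum seen by the consumer."""
    channel = queue.Queue()  # unbounded FIFO: the channel acts as a message queue
    results = []
    tasks = [
        threading.Thread(target=producer, args=(channel, values)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results[0]


if __name__ == "__main__":
    print(run_demo([1, 2, 3, 4]))
```

Both tasks here run as threads in one process; mapping them to different processors would change the timing but not the result, which is the point made about mappings on the slide.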

8 Specifics of performance modelling
Assume a processor is either computing, communicating or idling.
Thus, the total execution time can be found as the sum of the time spent in each activity for any particular processor ($j$):
$$T = T^j_{comp} + T^j_{comm} + T^j_{idle}$$
Or as the sum of each activity over all processors divided by the number of processors ($P$):
$$T = \frac{1}{P}\left(\sum_{i=0}^{P-1} T^i_{comp} + \sum_{i=0}^{P-1} T^i_{comm} + \sum_{i=0}^{P-1} T^i_{idle}\right)$$
– These aggregate totals are often easier to calculate.
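A minimal sketch of the two ways of computing the execution time, assuming per-processor times are available as plain lists. The function names and the numbers are made up for illustration.

```python
def execution_time_on_processor(t_comp, t_comm, t_idle):
    """Execution time seen by one processor: compute + communicate + idle."""
    return t_comp + t_comm + t_idle


def execution_time_aggregate(comp, comm, idle):
    """Execution time from totals over all processors, divided by P."""
    p = len(comp)
    return (sum(comp) + sum(comm) + sum(idle)) / p


if __name__ == "__main__":
    # Hypothetical per-processor times (arbitrary units) for P = 4.
    comp = [6, 4, 5, 5]
    comm = [1, 2, 2, 1]
    idle = [1, 2, 1, 2]
    for j in range(4):
        print(j, execution_time_on_processor(comp[j], comm[j], idle[j]))
    print(execution_time_aggregate(comp, comm, idle))
```

Every processor is busy or idle for the whole run, so each per-processor sum equals the aggregate value.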

9 Definitions

10 Cost of messages
A simple model of the cost of a message is:
$$T_{msg} = t_s + t_w L$$
where:
– $T_{msg}$ is the time to receive a message,
– $t_s$ is the start-up cost of receiving a message,
– $t_w$ is the cost per word (s/word); $1/t_w$ is the bandwidth (words/s),
– $L$ is the number of words in the message.

11 Cost of messages
Thus, $T_{comm}$ is the sum of all message times:
$$T_{comm} = \sum_{\text{messages}} T_{msg}$$
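The message cost model translates directly into code. A minimal sketch, with illustrative function names and made-up parameter values (real values of the start-up and per-word costs would have to be measured on the target machine):

```python
def message_time(num_words, t_s, t_w):
    """Time to receive one message of num_words words: start-up cost plus per-word cost."""
    return t_s + t_w * num_words


def communication_time(message_lengths, t_s, t_w):
    """Total communication time: the sum of the individual message times."""
    return sum(message_time(length, t_s, t_w) for length in message_lengths)


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units: t_s = 100, t_w = 2 per word.
    print(message_time(10, t_s=100, t_w=2))
    print(communication_time([10, 20], t_s=100, t_w=2))
```

With a large start-up cost relative to the per-word cost, one long message is cheaper than many short ones carrying the same data.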

12 Limitations of the Model
The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:
– Cache behaviour,
– The impact of network architecture,
– Issues of competition for bandwidth.
The basic model can be extended to cope with any/all of these complicating factors.
Experience with real performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.

13 Performance metrics: Speed-up and Efficiency
Define relative speed-up as the ratio of the execution time of the parallelised algorithm on one processor to the corresponding time on $P$ processors:
$$S_{rel} = \frac{T_1}{T_P}$$
Define relative efficiency as:
$$E_{rel} = \frac{S_{rel}}{P} = \frac{T_1}{P\,T_P}$$
This is a measure of the time that processors spend doing useful work (i.e., the time spent doing useful work divided by total time on all $P$ processors).
It characterises the effectiveness of an algorithm on a system, for any problem size and any number of processors.

14 Absolute performance metrics
Relative speed-up can be misleading! (Why?)
Define absolute speed-up (efficiency) with reference to the sequential time, $T_{ref}$, of an implementation of the best known algorithm for the problem-at-hand:
$$S_{abs} = \frac{T_{ref}}{T_P}, \qquad E_{abs} = \frac{S_{abs}}{P} = \frac{T_{ref}}{P\,T_P}$$
Note: the best known algorithm may take an approach to solving the problem different to that of the parallel algorithm.
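One common reason relative speed-up misleads is that the parallel algorithm run on one processor can be slower than the best sequential algorithm, which flatters the relative figure. The sketch below uses made-up timings to show the gap; the function names are illustrative.

```python
def speedup(t_baseline, t_parallel):
    """Speed-up of a run taking t_parallel against a baseline time."""
    return t_baseline / t_parallel


def efficiency(t_baseline, t_parallel, p):
    """Efficiency: speed-up divided by the number of processors."""
    return speedup(t_baseline, t_parallel) / p


if __name__ == "__main__":
    # Hypothetical timings (seconds).
    t_1 = 120.0    # parallel algorithm on one processor
    t_ref = 100.0  # best known sequential algorithm
    t_p = 20.0     # parallel algorithm on P processors
    p = 8
    print("relative:", speedup(t_1, t_p), efficiency(t_1, t_p, p))
    print("absolute:", speedup(t_ref, t_p), efficiency(t_ref, t_p, p))
```

With these numbers the relative speed-up is 6 (efficiency 0.75) but the absolute speed-up is only 5 (efficiency 0.625).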

15 Scalability and Isoefficiency
What is meant by scalability?
– Scalability applies to an algorithm executing on a parallel machine, not simply to an algorithm!
How does an algorithm behave for a fixed problem size as the number of processors used increases?
– Known as strong scaling.
How does an algorithm behave as the problem size changes in addition to changing the number of processors?
A key insight is to look at how efficiency changes.

16 Efficiency and Strong scaling
Typically, for a fixed problem size $N$, the efficiency of an algorithm decreases as $P$ increases (compare with ‘brush’ diagrams). Why?
– Overheads typically do not get smaller as $P$ increases. They remain ‘fixed’ (e.g. Amdahl fraction) or, worse, they may grow with $P$ (e.g. the number of communications may grow, as in an all-to-all comms pattern).
Recall that:
$$E_{abs} = \frac{T_{ref}}{P\,T_P}$$

17 Efficiency and Strong scaling
$P\,O_P$ is the total overhead in the system.
$T_{ref}$ represents the useful work in the algorithm.
At some point, with fixed $N$, efficiency $E_{abs}$ (i.e. how well each processor is being utilised) will drop below an acceptable threshold, say 50%(?).
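One reading consistent with this slide is that the total processor time splits into useful work plus overhead, $P\,T_P = T_{ref} + P\,O_P$, which gives $E_{abs} = T_{ref} / (T_{ref} + P\,O_P)$. The sketch below assumes that relation, and further assumes a per-processor overhead $O_P$ that stays constant as $P$ grows (the ‘fixed’ case). Function names and numbers are illustrative only.

```python
def efficiency(t_ref, overhead_per_processor, p):
    """E_abs = T_ref / (T_ref + P * O_P), assuming P * T_P = T_ref + P * O_P."""
    return t_ref / (t_ref + p * overhead_per_processor)


def largest_p_above(t_ref, overhead_per_processor, threshold=0.5, p_max=1_000_000):
    """Largest P <= p_max whose efficiency is still at or above the threshold (0 if none)."""
    best = 0
    for p in range(1, p_max + 1):
        if efficiency(t_ref, overhead_per_processor, p) >= threshold:
            best = p
        else:
            break  # with a constant positive O_P, efficiency only falls as P grows
    return best


if __name__ == "__main__":
    # Hypothetical values: 1000 units of useful work, 10 units of overhead per processor.
    for p in (1, 10, 100, 1000):
        print(p, round(efficiency(1000, 10, p), 3))
    print(largest_p_above(1000, 10))
```

With these values the 50% threshold is reached at $P = T_{ref}/O_P = 100$; overheads that grow with $P$ would reach it sooner.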

18 Scalability
No ‘real’ algorithm scales ‘forever’ on a fixed problem size on a ‘real’ computer.
Even ‘embarrassingly’ parallel algorithms will have a limit on the number of processors they can use:
– for example, the point where, with a fixed $N$, there is only one ‘element’ to be operated on by each processor.
So we seek another approach to scalability which applies as both problem size $N$ and the number of processors $P$ change.

19 Definition of Scalability: Isoefficiency
An algorithm can be said to (iso)scale if, for a given parallel system, a specific level of efficiency can be maintained by changing the problem size, $N$, appropriately as $P$ increases.
Not all algorithms isoscale!
– e.g. a vector reduction where $N = P$ (see later).
This approach is called scaled problem analysis.
The function (of $P$) describing how the problem size $N$ must change as $P$ increases to maintain a specified efficiency is known as the isoefficiency function.
Isoscaling does not apply to all problems:
– e.g. weather modelling, where increasing problem size (resolution) is not always an option,
– or image processing with a fixed number of pixels.

20 Weak scaling
An alternative approach is to keep the problem size per processor fixed as $P$ increases (total problem size $N$ increases linearly with $P$) and see how the efficiency is affected:
– This is known as weak scaling (as opposed to strong scaling).
Summary: strong scaling, weak scaling and isoefficiency are three approaches to understanding the scalability of parallel systems (algorithm + machine).
We will look at an example shortly, but first we need a way of comparing functions, e.g. performance functions and efficiency functions.
These concepts will also be explored further in lab exercise 2.

21 Comparison of functions: asymptotic analysis
Performance models are generally functions of problem size ($N$) and the number of processors ($P$).
We need a relatively easy way to compare models (functions) as $N$ and $P$ vary:
– Model A is ‘at most’ as fast or as big as model B;
– Model A is ‘at least’ as fast or as big as model B;
– Model A is ‘equal’ in performance/size to model B.
We will see a similar need when comparing efficiencies and in considering scalability.
These are all examples of comparing functions.
We are often interested in asymptotic behaviour, i.e. the behaviour as some key parameter (e.g. $N$ or $P$) increases towards infinity.

22 Comparing functions: example
From ‘Introduction to Parallel Computing’, Grama.
Consider three functions:
$$A(t) = 1000t, \qquad B(t) = 100t + 20t^2, \qquad C(t) = 25t^2$$
– Think of the functions as modelling the distance travelled by three cars from time $t = 0$. One car has fixed speed and the others are accelerating (car C makes a standing start, i.e. zero initial speed).

23 Graphically
[Figure: plot of the three functions A(t), B(t) and C(t).]

24 We can see that:
– For $t > 45$, $B(t)$ is always greater than $A(t)$.
– For $t > 20$, $C(t)$ is always greater than $B(t)$.
– For $t > 0$, $C(t)$ is always less than $1.25 \cdot B(t)$.
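The three crossover claims can be checked numerically. The sketch below assumes the functions of Grama's example, $A(t) = 1000t$, $B(t) = 100t + 20t^2$ and $C(t) = 25t^2$, which reproduce the thresholds quoted on the slide. The helper `first_time_always_greater` is illustrative and only inspects integer values of $t$ up to a finite limit.

```python
def A(t):
    return 1000 * t             # fixed speed


def B(t):
    return 100 * t + 20 * t * t  # initial speed plus acceleration


def C(t):
    return 25 * t * t            # standing start, higher acceleration


def first_time_always_greater(f, g, t_max=1000):
    """Smallest integer t0 with f(t) > g(t) for every integer t in (t0, t_max].

    Returns t_max itself if f(t_max) <= g(t_max).
    """
    t0 = t_max
    for t in range(t_max, 0, -1):  # scan downwards from t_max to 1
        if f(t) > g(t):
            t0 = t - 1
        else:
            break
    return t0


if __name__ == "__main__":
    print(first_time_always_greater(B, A))                        # B overtakes A
    print(first_time_always_greater(C, B))                        # C overtakes B
    print(first_time_always_greater(lambda t: 1.25 * B(t), C))    # 1.25*B stays above C
```

The three results are 45, 20 and 0, matching the slide (at $t = 45$ and $t = 20$ the respective pairs are exactly equal).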

25 Introducing ‘big-Oh’ notation
It is often useful to express a bound on the growth of a particular function in terms of a simpler function.
For example, since $B(t)$ is always greater than $A(t)$ for $t > 45$, we can express the relation between $A(t)$ and $B(t)$ using the $O$ (Omicron or ‘big-oh’) notation:
$$A(t) = O(B(t))$$
meaning $A(t)$ is “at most” $B(t)$ beyond some value of $t$.
Formally, given functions $f(x)$ and $g(x)$, $f(x) = O(g(x))$ if there exist positive constants $c$ and $x_0$ such that $f(x) \le c\,g(x)$ for all $x \ge x_0$ [Definition from JaJa, not Grama: more transparent].

26 From this definition, we can see:
– $A(t) = O(t^2)$ (“at most”),
– $B(t) = O(t^2)$ (“at most” or “of the order $t^2$”),
– Also, $A(t) = O(t)$ (“at most” or “of the order $t$”),
– Finally, $C(t) = O(t^2)$ too.
Informally, big-Oh can be used to identify the simplest function that bounds (above) a more complex function, as the parameter gets (asymptotically) bigger.
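Each big-Oh claim needs witness constants $c$ and $x_0$. The sketch below tests candidate witnesses over a finite range of integers, which is supporting evidence rather than a proof. It assumes the functions of Grama's example ($A(t) = 1000t$, $B(t) = 100t + 20t^2$, $C(t) = 25t^2$); the helper name and the chosen constants are illustrative.

```python
def A(t):
    return 1000 * t


def B(t):
    return 100 * t + 20 * t * t


def C(t):
    return 25 * t * t


def bounded_above(f, g, c, x0, x_max=10_000):
    """True if f(x) <= c * g(x) for every integer x with x0 <= x <= x_max."""
    return all(f(x) <= c * g(x) for x in range(x0, x_max + 1))


def square(t):
    return t * t


def linear(t):
    return t


if __name__ == "__main__":
    print(bounded_above(A, square, c=1, x0=1000))   # A(t) = O(t^2)
    print(bounded_above(B, square, c=21, x0=100))   # B(t) = O(t^2)
    print(bounded_above(A, linear, c=1000, x0=1))   # A(t) = O(t)
    print(bounded_above(C, square, c=25, x0=1))     # C(t) = O(t^2)
    print(bounded_above(B, linear, c=1000, x0=1))   # fails: B outgrows 1000*t
```

The last check fails for this particular $c$; since $B(t)/t$ grows without bound, no larger constant would rescue it, so $B(t)$ is not $O(t)$.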

27 Theta and Omega
There are two other useful symbols:
– Omega ($\Omega$), meaning “at least”;
– Theta ($\Theta$), meaning “equals” or “goes as”.
For formal definitions, see, for example, ‘An Introduction to Parallel Algorithms’ by JaJa or ‘Highly Parallel Computing’ by Almasi and Gottlieb.
Note that the definitions in Grama are a little misleading!
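For reference, the usual definitions can be stated in the same style as the big-Oh definition. The wording below is a standard formulation, not quoted from either of the cited books.

```latex
% Omega: f grows at least as fast as g (a lower bound)
f(x) = \Omega(g(x)) \iff \exists\, c > 0,\ x_0 > 0 \ \text{such that}\ f(x) \ge c\, g(x) \ \text{for all}\ x \ge x_0

% Theta: f is bounded above and below by g (same order of growth)
f(x) = \Theta(g(x)) \iff f(x) = O(g(x)) \ \text{and}\ f(x) = \Omega(g(x))
```

Equivalently, $f(x) = \Omega(g(x))$ exactly when $g(x) = O(f(x))$.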

28 Performance modelling example
The following slides develop performance models for the example of a vector (sum) reduction. The models are then used to support basic scalability analysis.
Consider two parallel systems:
– First, a binary tree-based vector sum when the number of elements ($N$) is equal to the number of processors ($P$), $N = P$.
– Second, the case when $N \gg P$.
Develop performance models:
– Compare the models,
– Consider scalability.

29 Vector Sum Reduction
Assume that:
– $N = P$, and
– $N$ is a power of 2.
Propagate intermediate values through a binary tree:
– Takes $\log_2 N$ steps (one processor is busy with work and communication on each step; the other processors have some idle time).
Each step involves the communication of a single word (cost $t_s + t_w$) and a single addition (cost $t_c$). Thus:
$$T_P = (t_s + t_w + t_c)\,\log_2 N$$
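A sequential simulation of the tree reduction makes the step count concrete. This is only a sketch: the list comprehension stands in for the pairwise communication that would happen between processors, and the names `tree_sum` and `modelled_time` are illustrative.

```python
def tree_sum(values):
    """Pairwise (binary-tree) reduction; returns (total, number_of_steps).

    The length of values must be a power of 2, one value per 'processor'.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    level = list(values)
    steps = 0
    while len(level) > 1:
        # One tree step: each odd-indexed partner 'sends' one word to its
        # even-indexed neighbour, which performs a single addition.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


def modelled_time(n, t_s, t_w, t_c):
    """Model: log2(N) steps, each costing t_s + t_w + t_c (N a power of 2)."""
    steps = n.bit_length() - 1  # log2(n) for a power of 2
    return steps * (t_s + t_w + t_c)


if __name__ == "__main__":
    print(tree_sum(list(range(8))))
    print(modelled_time(8, t_s=10, t_w=1, t_c=1))
```

For eight values the simulation takes three steps, matching $\log_2 8$.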

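30 Vector Sum Reduction
Speedup (dropping the constant cost factors, which matches the figures below):
$$S_{abs} = \frac{T_{ref}}{T_P} \approx \frac{N}{\log_2 N}$$
Speedup is ‘poor’ (but monotonically increasing):
– If $N = 128$, $S_{abs}$ is ~18 ($E = S/P$ = ~0.14, i.e. 14%),
– If $N = 1024$, $S_{abs}$ is ~100 ($E$ = ~0.1),
– If $N$ = 1M, $S_{abs}$ is ~52,000 ($E$ = ~0.05),
– If $N$ = 1G, $S_{abs}$ is ~35M ($E$ = ~0.033).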

31 Vector sum scalability
Efficiency:
$$E_{abs} = \frac{S_{abs}}{P} \approx \frac{N}{P \log_2 N}$$
But $N = P$ in this case, so:
$$E_{abs} \approx \frac{1}{\log_2 P}$$
Strong scaling is not ‘good’, as we have seen ($E \ll 0.5$).
Efficiency is monotonically decreasing:
– Reaches the 50% point, $E = 0.5$, when $\log_2 P = 2$, i.e. when $P = 4$.
This does not isoscale either!
– $E$ gets smaller as $P$ (hence $N$) increases, and $P$ and $N$ must change together.
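The speed-up and efficiency figures quoted for the $N = P$ case are consistent with $S_{abs} \approx N/\log_2 N$ and $E_{abs} \approx 1/\log_2 N$, i.e. with the constant cost factors dropped. That simplification is inferred from the quoted numbers and the 50% point at $P = 4$, not stated explicitly. A short sketch under that assumption, with illustrative function names:

```python
import math


def speedup_n_equals_p(n):
    """S_abs ~ N / log2(N) for the tree reduction with N = P (constants dropped)."""
    return n / math.log2(n)


def efficiency_n_equals_p(n):
    """E_abs = S_abs / P ~ 1 / log2(N) when N = P."""
    return 1 / math.log2(n)


if __name__ == "__main__":
    for n in (128, 1024, 2**20, 2**30):
        print(n, round(speedup_n_equals_p(n)), round(efficiency_n_equals_p(n), 3))
```

This gives speed-ups of about 18, 102, 52,429 and 35.8 million, with efficiencies of roughly 0.143, 0.1, 0.05 and 0.033.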

32 Vector Sum Reduction
When $N \gg P$, each processor can be allocated $N/P$ elements.
Each processor sums its local elements in a first phase.
A binary tree sum of size $P$ is then performed to sum the partial results.
The performance model is:
$$T_P = t_c\left(\frac{N}{P} - 1\right) + (t_s + t_w + t_c)\,\log_2 P$$
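The two phases can be simulated sequentially to count the work each contributes. This sketch assumes $P$ is a power of 2 and divides $N$ exactly, and it counts $N/P - 1$ additions per processor in the local phase. The names `two_phase_sum` and `modelled_time` are illustrative.

```python
def two_phase_sum(values, p):
    """Local sums on p 'processors', then a binary-tree sum of the partial results.

    Returns (total, local_additions_per_processor, tree_steps).
    """
    n = len(values)
    if p <= 0 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    if n == 0 or n % p:
        raise ValueError("len(values) must be a positive multiple of p")
    chunk = n // p
    # Phase 1: each processor sums its own N/P elements (N/P - 1 additions).
    partial = [sum(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    local_additions = chunk - 1
    # Phase 2: binary-tree reduction of the P partial sums (log2 P steps).
    tree_steps = 0
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
        tree_steps += 1
    return partial[0], local_additions, tree_steps


def modelled_time(n, p, t_s, t_w, t_c):
    """Model: t_c * (N/P - 1) for the local phase plus (t_s + t_w + t_c) * log2(P)."""
    tree_steps = p.bit_length() - 1  # log2(p) for a power of 2
    return t_c * (n // p - 1) + (t_s + t_w + t_c) * tree_steps


if __name__ == "__main__":
    print(two_phase_sum(list(range(16)), 4))
    print(modelled_time(16, 4, t_s=10, t_w=1, t_c=1))
```

For 16 elements on 4 processors each processor performs 3 local additions and the tree takes 2 steps.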

33 Scalability: strong scaling?
Speedup (again dropping the constant cost factors):
$$S_{abs} \approx \frac{N}{N/P + \log_2 P} = \frac{1}{1/P + (\log_2 P)/N}$$
Strong scaling??
For a given problem size $N$ ($\gg P$), the $(\log_2 P)/N$ term is always ‘small’, so speedup will fall off ‘slowly’.
$P$ is, of course, limited by the value of $N$, but we are considering the case where $N \gg P$.
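A strong-scaling table for the $N \gg P$ reduction can be generated from a unit-cost version of the model, taking $T_{ref} \approx N$ and $T_P \approx N/P + \log_2 P$. Dropping the cost constants is an assumption of this sketch, and the function names are illustrative.

```python
import math


def model_time(n, p):
    """Unit-cost model time on p processors: local phase N/P plus tree phase log2(P)."""
    return n / p + math.log2(p)


def strong_scaling(n, processor_counts):
    """Rows of (P, speed-up, efficiency) for a fixed problem size n."""
    t_ref = n  # unit-cost sequential time
    rows = []
    for p in processor_counts:
        s = t_ref / model_time(n, p)
        rows.append((p, s, s / p))
    return rows


if __name__ == "__main__":
    for p, s, e in strong_scaling(2**20, [2**k for k in range(0, 11)]):
        print(f"P={p:5d}  S={s:9.2f}  E={e:.4f}")
```

With $N = 2^{20}$, efficiency stays above 0.99 up to $P = 1024$, so speed-up falls away from ideal only slowly while $N \gg P$.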

34 Scalability: Isoscaling
Efficiency:
$$E_{abs} = \frac{S_{abs}}{P} \approx \frac{1}{1 + (P \log_2 P)/N}$$
Now, we can always achieve a required efficiency on $P$ processors by a suitable choice of $N$.

35 Scalability: Isoscaling
For example, for 50% efficiency, choose $N = P \log_2 P$.
Or, for efficiencies > 50%, choose $N > P \log_2 P$.
– As $N$ gets larger on a given $P$, $E$ gets closer to 1!
– The ‘good’ parallel phase ($N/P$ work) dominates the $\log_2 P$ phase as $N$ gets larger, leading to relatively good (iso)scalability.
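Under the unit-cost model $E \approx 1/(1 + (P \log_2 P)/N)$, solving for $N$ gives an isoefficiency function $N = \frac{E}{1-E}\,P \log_2 P$. The sketch below assumes that model (cost constants dropped) and $P \ge 2$; the function names are illustrative.

```python
import math


def efficiency(n, p):
    """Unit-cost efficiency of the two-phase reduction: 1 / (1 + P*log2(P)/N), n > 0."""
    return 1 / (1 + p * math.log2(p) / n)


def problem_size_for(p, target):
    """Approximate smallest N reaching the target efficiency on p >= 2 processors."""
    if p < 2:
        raise ValueError("p must be at least 2")
    if not 0 < target < 1:
        raise ValueError("target efficiency must lie strictly between 0 and 1")
    # Rounding up; floating-point error may overshoot the true minimum by one.
    return math.ceil(target / (1 - target) * p * math.log2(p))


if __name__ == "__main__":
    for p in (4, 64, 1024):
        n_half = problem_size_for(p, 0.5)
        n_high = problem_size_for(p, 0.9)
        print(p, n_half, efficiency(n_half, p), n_high, round(efficiency(n_high, p), 3))
```

For 50% efficiency this returns $N = P \log_2 P$ (10240 for $P = 1024$). The required $N$ grows only slightly faster than linearly in $P$, which is why this system isoscales relatively well.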

36 Summary of performance modelling
Performance modelling provides insight into the behaviour of parallel systems (parallel algorithms on parallel machines).
Modelling allows the comparison of algorithms and gives insight into their potential scalability.
Two forms of scalability:
– Strong scaling (fixed problem size $N$ as $P$ varies). There is always a limit to strong scaling for real algorithms (e.g. a value of $P$ at which efficiency falls below an acceptable limit).
– Isoscaling (the ability to maintain a specified level of efficiency by changing $N$ as $P$ varies). Not all parallel systems isoscale.
Asymptotic analysis makes comparison easier, but BEWARE the constants!
Weak scaling is related to isoscaling: aim to maintain a fixed problem size per processor as $P$ changes and look at the effect on efficiency.

