COMP60611 Fundamentals of Parallel and Distributed Systems
Lecture 4: An Approach to Performance Modelling
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester
October 2010

Overview
Aims of performance modelling:
–Allows the comparison of algorithms.
–Gives an indication of the scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change – “complexity analysis of parallel algorithms”.
–Enables reasoned choices at the design stage.
Overview of an approach to performance modelling:
–Based on the approach of Foster and Grama et al.
–Targets a generic multicomputer (a model of message-passing).
Limitations.
A worked example:
–Vector sum reduction (i.e. compute the sum of the elements of a vector).
Summary.

Aims of performance modelling
In this lecture we will look at modelling the performance of algorithms that compute a result;
–Issues of correctness are relatively straightforward.
We are interested in questions such as:
–How long will an algorithm take to execute?
–How much memory is required (though we will not consider this in detail here)?
–Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?
–How do the performances of different algorithms compare?
Typically, we focus on one phase of a computation at a time;
–e.g. assume that start-up and initialisation have been done, or that these phases have been modelled separately.

An approach to performance modelling
Based on a generic multicomputer (see next slide).
Defined in terms of tasks that undertake computation and communicate with other tasks as necessary;
–A task may be an agglomeration of smaller tasks.
Assumes a simple, but realistic, approach to communication between tasks:
–Based on channels that connect pairs of tasks.
Seeks an analytical expression for execution time (T) as a function of (at least) the problem size (N), the number of processors (P) and, often, the number of tasks (U).

A generic multicomputer
A set of nodes, each comprising a CPU and local memory, connected by an interconnect.

Task-channel model
Tasks execute concurrently;
–The number of tasks can vary during execution.
A task encapsulates a sequential program and local memory.
Tasks are connected by channels to other tasks;
–Channels are input or output channels.
In addition to reading from, and writing to, local memory a task can:
–Send messages on output channels.
–Receive messages on input channels.
–Create new tasks.
–Terminate.

Task-channel model
A channel connecting two tasks acts as a message queue.
A send operation is asynchronous: it completes immediately;
–Sends are considered to be ‘free’ (take zero time)(?!).
A receive operation is synchronous: execution of a task is blocked until a message is available;
–Receives may cause waiting (idling) time and take a finite time to complete (as data is transmitted from one task to another).
Channels can be created dynamically.
Tasks can be mapped to physical processors in various ways;
–The mapping does not affect the semantics of the program, but it may well affect performance.
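A minimal sketch of the task-channel abstraction in Python (the class and function names are hypothetical; threads stand in for tasks and a queue stands in for a channel, so sends return immediately and receives block):

```python
import threading
import queue

class Channel:
    """A channel is a message queue connecting a sending task to a receiving task."""
    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        # Asynchronous send: enqueue and return immediately (modelled as 'free').
        self._q.put(msg)

    def receive(self):
        # Synchronous receive: block until a message is available.
        return self._q.get()

def producer(out_ch, data):
    # A task with local state that writes results to its output channel.
    for x in data:
        out_ch.send(x)
    out_ch.send(None)  # sentinel: no more messages

def consumer(in_ch, result):
    # A task that blocks on its input channel and accumulates a local sum.
    total = 0
    while (msg := in_ch.receive()) is not None:
        total += msg
    result.append(total)

if __name__ == "__main__":
    ch, result = Channel(), []
    tasks = [threading.Thread(target=producer, args=(ch, range(10))),
             threading.Thread(target=consumer, args=(ch, result))]
    for t in tasks: t.start()
    for t in tasks: t.join()
    print(result[0])  # 45
```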

Specifics of performance modelling
Assume that, at any instant, a processor is either computing, communicating or idling.
Thus, the total execution time can be found as the sum of the time spent in each activity on any particular processor (j):

$T = T^j_{comp} + T^j_{comm} + T^j_{idle}$

or as the sum of each activity over all processors divided by the number of processors (P):

$T = \frac{1}{P}\left(\sum_{j=0}^{P-1} T^j_{comp} + \sum_{j=0}^{P-1} T^j_{comm} + \sum_{j=0}^{P-1} T^j_{idle}\right)$

–These aggregate totals are often easier to calculate.
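A small sketch (the per-processor timings are invented, purely illustrative) showing that the per-processor sum and the aggregate-average form give the same total execution time:

```python
# Hypothetical per-processor breakdown of time (seconds) into
# computation, communication and idling for P = 4 processors.
t_comp = [4.0, 3.5, 3.8, 4.2]
t_comm = [0.5, 0.7, 0.6, 0.4]
t_idle = [0.5, 0.8, 0.6, 0.4]
P = len(t_comp)

# Total time seen by one particular processor (processor 0 here).
T_from_one_processor = t_comp[0] + t_comm[0] + t_idle[0]

# Aggregate form: sum each activity over all processors, divide by P.
T_from_aggregates = (sum(t_comp) + sum(t_comm) + sum(t_idle)) / P

print(T_from_one_processor, T_from_aggregates)  # 5.0 5.0
```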

Definitions (notation used in the following models: T comp, T comm, T idle; message parameters t_s and t_w; time per arithmetic operation t_c).

Cost of messages
A simple model of the cost of a message is:

$T_{msg} = t_s + t_w L$

where:
–$T_{msg}$ is the time to receive a message,
–$t_s$ is the start-up cost of receiving a message,
–$t_w$ is the cost per word (s/word); $1/t_w$ is the bandwidth (words/s),
–$L$ is the number of words in the message.

Cost of messages
Thus, $T^j_{comm}$ is the sum of all message times on processor j:

$T^j_{comm} = \sum_{i} (t_s + t_w L_i)$,

where the sum runs over the messages received by processor j and $L_i$ is the length of the i-th message.
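A sketch of this cost model in Python (the parameter values are assumed, for illustration only):

```python
def message_time(L, t_s=1e-5, t_w=1e-8):
    """Time to receive one message of L words: T_msg = t_s + t_w * L.

    t_s: start-up cost (s), t_w: per-word cost (s/word) -- illustrative values.
    """
    return t_s + t_w * L

def comm_time(message_lengths, t_s=1e-5, t_w=1e-8):
    """T_comm for one processor: sum of the times of all messages it receives."""
    return sum(message_time(L, t_s, t_w) for L in message_lengths)

# Example: three messages of 1K, 10K and 100K words.
print(comm_time([1_000, 10_000, 100_000]))  # ~1.14e-03 s
```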

Limitations of the model
The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:
–Cache behaviour,
–The impact of network architecture,
–Issues of competition for bandwidth.
The basic model can be extended to cope with any or all of these complicating factors.
Experience with performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.

Performance metrics: speed-up and efficiency
Define relative speed-up as the ratio of the execution time of the parallelised algorithm on one processor to the corresponding time on P processors:

$S_{rel} = \frac{T_1}{T_P}$

Define relative efficiency as:

$E_{rel} = \frac{S_{rel}}{P} = \frac{T_1}{P\,T_P}$

This is a measure of the time that processors spend doing useful work (i.e., the time spent doing useful work divided by the total time on all P processors). It characterises the effectiveness of an algorithm on a system, for any problem size and any number of processors.

Absolute performance metrics
Relative speed-up can be misleading! (Why?)
Define absolute speed-up (and efficiency) with reference to the sequential time, $T_{ref}$, of an implementation of the best known algorithm for the problem at hand:

$S_{abs} = \frac{T_{ref}}{T_P}$,   $E_{abs} = \frac{S_{abs}}{P} = \frac{T_{ref}}{P\,T_P}$

Note: the best known algorithm may take an approach to solving the problem different from that of the parallel algorithm.
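A sketch computing both sets of metrics from measured times (the timing values are invented for illustration):

```python
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, P):
    return speedup(t_serial, t_parallel) / P

# Hypothetical measurements (seconds).
T_1   = 100.0   # parallel code run on 1 processor
T_ref = 80.0    # best known sequential algorithm
T_P   = 10.0    # parallel code on P processors
P     = 16

print("relative speed-up:", speedup(T_1, T_P))          # 10.0
print("relative efficiency:", efficiency(T_1, T_P, P))  # 0.625
print("absolute speed-up:", speedup(T_ref, T_P))        # 8.0
print("absolute efficiency:", efficiency(T_ref, T_P, P))# 0.5
```

Note how, in this made-up example, the relative metrics flatter the parallel code compared with the absolute ones, which is exactly why relative speed-up can mislead.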

Scalability and isoefficiency
What is meant by scalability?
–Scalability applies to an algorithm executing on a parallel machine (a parallel system), not simply to an algorithm!
How does an algorithm behave for a fixed problem size as the number of processors used increases?
–Known as strong scaling.
How does an algorithm behave as the problem size changes in addition to changing the number of processors?
A key insight is to look at how efficiency changes.

Efficiency and strong scaling
Typically, for a fixed problem size N, the efficiency of an algorithm decreases as P increases (compare with the ‘brush’ diagrams). Why?
–Overheads typically do not get smaller as P increases. They remain fixed (e.g. the Amdahl fraction), or, worse, they may grow with P (e.g. the number of communications may grow, as in an all-to-all communication pattern).
Recall that:

$E_{abs} = \frac{T_{ref}}{P\,T_P}$

Efficiency and strong scaling
Writing the total overhead in the system as $T_O = P\,T_P - T_{ref}$, the efficiency becomes

$E_{abs} = \frac{T_{ref}}{P\,T_P} = \frac{1}{1 + T_O / T_{ref}}$

$T_{ref}$ represents the useful work in the algorithm; $T_O$ is everything else (communication and idling).
At some point, with fixed N, the efficiency $E_{abs}$ (i.e. how well each processor is being utilised) will drop below an acceptable threshold – say, 50%(?)
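A sketch of how efficiency decays under strong scaling when the total overhead grows with P (the overhead model used here, T_O proportional to P, and the constants are assumptions for illustration only):

```python
T_ref = 100.0   # useful (sequential) work, seconds -- illustrative
t_ovh = 0.1     # assumed overhead contribution per processor, seconds

for P in [1, 2, 4, 8, 16, 64, 256, 1024]:
    T_O = P * t_ovh                 # total overhead grows with P
    T_P = (T_ref + T_O) / P         # fixed problem size shared among P processors
    E   = T_ref / (P * T_P)         # = 1 / (1 + T_O / T_ref)
    print(f"P={P:5d}  E_abs={E:.3f}")
# Efficiency falls from ~1.0 towards (and below) 0.5 as P grows.
```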

Scalability
No ‘real’ algorithm scales ‘forever’ on a fixed problem size on a ‘real’ computer. Even ‘embarrassingly parallel’ algorithms have a limit on the number of processors they can use;
–for example, the point where, with fixed N, there is only one ‘element’ to be operated on by each processor.
So we seek another approach to scalability, one which applies as both the problem size N and the number of processors P change.

Definition of scalability – isoefficiency
An algorithm can be said to (iso)scale if, for a given parallel system, a specific level of efficiency can be maintained by changing the problem size, N, appropriately as P increases. Not all algorithms isoscale!
–e.g. a vector reduction where N = P (see later).
This approach is called scaled problem analysis.
The function (of P) describing how the problem size N must change as P increases in order to maintain a specified efficiency is known as the isoefficiency function.
Isoscaling does not apply to all problems;
–e.g. weather modelling, where increasing the problem size (resolution) is not always an option,
–or image processing with a fixed number of pixels.

Weak scaling
An alternative approach is to keep the problem size per processor fixed as P increases (so the total problem size N increases linearly with P) and see how the efficiency is affected;
–This is known as weak scaling (as opposed to strong scaling).
Summary: strong scaling, weak scaling and isoefficiency are three approaches to understanding the scalability of parallel systems (algorithm + machine); a small sketch contrasting strong and weak scaling follows below.
We will look at an example shortly, but first we need a way of comparing functions, e.g. performance functions and efficiency functions.
These concepts will also be explored further in lab exercise 2.
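A sketch contrasting strong and weak scaling for a simple assumed cost model (perfectly divisible work of N·t_c plus a log2(P) communication term; all constants are invented for illustration):

```python
import math

t_c, t_s, t_w = 1e-9, 1e-6, 1e-8   # assumed machine constants (seconds)

def T_parallel(N, P):
    # Assumed model: evenly divided work plus a tree-style communication term.
    return (N / P) * t_c + math.log2(P) * (t_c + t_s + t_w)

def efficiency(N, P):
    return (N * t_c) / (P * T_parallel(N, P))

N0 = 1_000_000          # problem size per processor for weak scaling
for P in [2, 8, 32, 128]:
    strong = efficiency(N0, P)       # strong scaling: total N fixed at N0
    weak   = efficiency(N0 * P, P)   # weak scaling: N grows so N/P stays N0
    print(f"P={P:4d}  strong E={strong:.3f}  weak E={weak:.3f}")
# Strong-scaling efficiency decays with P; weak-scaling efficiency stays close to 1.
```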

Comparison of functions – asymptotic analysis
Performance models are generally functions of the problem size (N) and the number of processors (P).
We need a relatively easy way to compare models (functions) as N and P vary:
–Model A is ‘at most’ as fast or as big as model B;
–Model A is ‘at least’ as fast or as big as model B;
–Model A is ‘equal’ in performance/size to model B.
We will see a similar need when comparing efficiencies and in considering scalability.
These are all examples of comparing functions. We are often interested in asymptotic behaviour, i.e. the behaviour as some key parameter (e.g. N or P) increases towards infinity.

Comparing functions – an example
From ‘Introduction to Parallel Computing’, Grama et al.
Consider three functions, A(t), B(t) and C(t);
–think of the functions as modelling the distance travelled by three cars from time t = 0. One car (A) travels at a fixed speed and the others are accelerating (car C makes a standing start, i.e. has zero initial speed).

Graphically: (a plot of A(t), B(t) and C(t) against t).

We can see that:
–For t > 45, B(t) is always greater than A(t).
–For t > 20, C(t) is always greater than B(t).
–For t > 0, C(t) is always less than 1.25·B(t).

Introducing ‘big-Oh’ notation
It is often useful to express a bound on the growth of a particular function in terms of a simpler function.
For example, since for t > 45, B(t) is always greater than A(t), we can express the relation between A(t) and B(t) using the Ο (Omicron, or ‘big-oh’) notation:

A(t) = O(B(t)),

meaning A(t) is “at most” B(t) beyond some value of t.
Formally, given functions f(x) and g(x), f(x) = O(g(x)) if there exist positive constants c and x_0 such that f(x) ≤ c·g(x) for all x ≥ x_0.
[Definition from JaJa, not Grama – it is more transparent.]

From this definition, we can see:
–A(t) = O(t^2) (“at most”),
–B(t) = O(t^2) (“at most”, or “of the order t^2”),
–also, A(t) = O(t) (“at most”, or “of the order t”),
–finally, C(t) = O(t^2) too.
Informally, big-Oh can be used to identify the simplest function that bounds (from above) a more complex function, as the parameter gets (asymptotically) bigger.

Theta and Omega
There are two other useful symbols:
–Omega (Ω), meaning “at least”;
–Theta (Θ), meaning “equals” or “goes as”.
For formal definitions, see, for example, ‘An Introduction to Parallel Algorithms’ by JaJa, or ‘Highly Parallel Computing’ by Almasi and Gottlieb.
Note that the definitions in Grama are a little misleading! A sketch of the standard definitions is given below.
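For reference, a sketch of the standard definitions in the same style as the big-Oh definition quoted above (these are the usual textbook statements, not taken verbatim from the slides):

```latex
% Sketch of the standard asymptotic definitions.
\[
\begin{aligned}
f(x) = O(g(x))      &\iff \exists\, c>0,\ x_0>0 \ \text{such that}\ f(x) \le c\,g(x)\ \text{for all}\ x \ge x_0,\\
f(x) = \Omega(g(x)) &\iff \exists\, c>0,\ x_0>0 \ \text{such that}\ f(x) \ge c\,g(x)\ \text{for all}\ x \ge x_0,\\
f(x) = \Theta(g(x)) &\iff f(x) = O(g(x)) \ \text{and}\ f(x) = \Omega(g(x)).
\end{aligned}
\]
```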

Performance modelling example
The following slides develop performance models for the example of a vector (sum) reduction. The models are then used to support a basic scalability analysis.
Consider two parallel systems:
–First, a binary-tree-based vector sum when the number of elements (N) is equal to the number of processors (P), i.e. N = P.
–Second, the case where N >> P.
Develop performance models;
–Compare the models,
–Consider scalability.

Vector Sum Reduction
Assume that:
–N = P, and
–N is a power of 2.
Propagate intermediate values through a binary tree;
–This takes log2(N) steps (on each step one processor is busy with work and communication; the other processors have some idle time).
Each step involves the communication of a single word (cost t_s + t_w) and a single addition (cost t_c). Thus:

$T_P = \log_2 N \,(t_c + t_s + t_w)$
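A sketch of this model in Python (the machine constants t_c, t_s and t_w are assumed values, purely for illustration):

```python
import math

t_c = 1e-9   # time per addition (s)      -- assumed
t_s = 1e-6   # message start-up time (s)  -- assumed
t_w = 1e-8   # time per word (s/word)     -- assumed

def T_tree_sum(N):
    """Binary-tree vector sum with N = P (N a power of two):
    log2(N) steps, each costing one addition plus one single-word message."""
    return math.log2(N) * (t_c + t_s + t_w)

def T_serial(N):
    """Reference time: summing N elements sequentially takes N - 1 additions."""
    return (N - 1) * t_c

for N in [128, 1024, 2**20]:
    print(f"N=P={N:>8}  T_P={T_tree_sum(N):.3e} s  T_ref={T_serial(N):.3e} s")
```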

Vector Sum Reduction
Speed-up:

$S_{abs} = \frac{(N-1)\,t_c}{\log_2 N\,(t_c + t_s + t_w)} \approx \frac{N}{\log_2 N}$

(the figures below use the approximation, i.e. they treat the communication costs as negligible).
Speed-up is ‘poor’ (but monotonically increasing):
–If N = 128, S_abs is ~18 (E = S/P = ~0.14, i.e. 14%),
–If N = 1024, S_abs is ~100 (E = ~0.1),
–If N = 1M, S_abs is ~52,000 (E = ~0.05),
–If N = 1G, S_abs is ~35M (E = ~0.035).

Vector sum scalability
Efficiency:

$E_{abs} = \frac{S_{abs}}{P} \approx \frac{N}{P \log_2 N}$

But N = P in this case, so:

$E_{abs} \approx \frac{1}{\log_2 P}$

Strong scaling is not ‘good’, as we have seen (E << 0.5). Efficiency is monotonically decreasing;
–It reaches the 50% point, E = 0.5, when log2(P) = 2, i.e. when P = 4.
This algorithm does not isoscale either!
–E gets smaller as P (and hence N) increases, and P and N must change together.
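A sketch reproducing the quoted figures under the same simplification (communication costs ignored, so S ≈ N / log2(N) and, with N = P, E ≈ 1 / log2(P)):

```python
import math

def speedup_tree(N):
    # Simplified model with communication costs ignored: S ~= N / log2(N).
    return N / math.log2(N)

def efficiency_tree(N):
    # With N = P processors: E = S / P ~= 1 / log2(P).
    return speedup_tree(N) / N

for N in [128, 1024, 2**20, 2**30]:
    print(f"N=P={N:>12}  S~{speedup_tree(N):,.0f}  E~{efficiency_tree(N):.3f}")
# Matches the slide: S ~18, ~100, ~52,000, ~35M; E ~0.14, ~0.1, ~0.05, ~0.03.
```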

Vector Sum Reduction
When N >> P, each processor can be allocated N/P elements.
Each processor sums its local elements in a first phase.
A binary-tree sum of size P is then performed to combine the P partial results.
The performance model is:

$T_P = \frac{N}{P}\,t_c + \log_2 P\,(t_c + t_s + t_w)$
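A sketch of this two-phase model (same assumed machine constants as before):

```python
import math

t_c, t_s, t_w = 1e-9, 1e-6, 1e-8   # assumed machine constants (seconds)

def T_two_phase(N, P):
    """N >> P vector sum: local summation of N/P elements,
    then a binary-tree combination of the P partial sums."""
    local = (N / P) * t_c
    tree  = math.log2(P) * (t_c + t_s + t_w)
    return local + tree

N = 10_000_000
for P in [1, 16, 256, 1024]:
    T_P = T_two_phase(N, P)
    S   = (N * t_c) / T_P
    print(f"P={P:5d}  T_P={T_P:.3e} s  S={S:8.1f}  E={S/P:.3f}")
# Speed-up stays close to P while N/P dominates the log2(P) term.
```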

Scalability – strong scaling?
Speed-up:

$S_{abs} = \frac{N\,t_c}{\frac{N}{P}t_c + \log_2 P\,(t_c + t_s + t_w)} = \frac{P}{1 + \frac{P\log_2 P}{N}\cdot\frac{t_c + t_s + t_w}{t_c}}$

Strong scaling? For a given problem size N (>> P), the $\frac{P\log_2 P}{N}$ term is always ‘small’, so the speed-up falls off only ‘slowly’.
P is, of course, limited by the value of N… but we are considering the case where N >> P.

Scalability – isoscaling
Efficiency:

$E_{abs} = \frac{S_{abs}}{P} = \frac{1}{1 + \frac{P\log_2 P}{N}\cdot\frac{t_c + t_s + t_w}{t_c}}$

Now we can always achieve a required efficiency on P processors by a suitable choice of N.

Scalability – isoscaling
For example, for 50% efficiency choose N such that

$\frac{P\log_2 P}{N}\cdot\frac{t_c + t_s + t_w}{t_c} = 1$, i.e. $N = P\log_2 P\,\frac{t_c + t_s + t_w}{t_c}$;

or, for efficiencies > 50%, choose a larger N.
–As N gets larger on a given P, E gets closer to 1!
–The ‘good’ parallel phase (the N/P work) dominates the log2(P) phase as N gets larger, leading to relatively good (iso)scalability.
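A sketch of the corresponding isoefficiency calculation: given a target efficiency and P, solve the efficiency expression above for N (same assumed constants as before):

```python
import math

t_c, t_s, t_w = 1e-9, 1e-6, 1e-8   # assumed machine constants (seconds)

def N_required(P, E_target):
    """Smallest N keeping E >= E_target for the two-phase vector sum:
    E = 1 / (1 + (P*log2(P)/N) * (t_c+t_s+t_w)/t_c)  =>  solve for N."""
    overhead_ratio = (t_c + t_s + t_w) / t_c
    return P * math.log2(P) * overhead_ratio * E_target / (1.0 - E_target)

for P in [16, 256, 1024]:
    print(f"P={P:5d}  N for 50% efficiency ~ {N_required(P, 0.5):,.0f}")
# N must grow roughly as P*log2(P) -- the isoefficiency function for this system.
```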

Summary of performance modelling
Performance modelling provides insight into the behaviour of parallel systems (parallel algorithms running on parallel machines).
Modelling allows the comparison of algorithms and gives insight into their potential scalability.
Two forms of scalability:
–Strong scaling (fixed problem size N as P varies);
–There is always a limit to strong scaling for real algorithms (e.g. a value of P at which efficiency falls below an acceptable limit).
–Isoscaling (the ability to maintain a specified level of efficiency by changing N as P varies);
–Not all parallel systems isoscale.
Asymptotic analysis makes comparison easier, but BEWARE the constants!
Weak scaling is related to isoscaling – the aim is to maintain a fixed problem size per processor as P changes and look at the effect on efficiency.