COMP60621 Fundamentals of Parallel and Distributed Systems

COMP60621 Fundamentals of Parallel and Distributed Systems
Lecture 5: Introduction to Performance Modelling
John Gurd, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester

Overview (Lectures 6 & 7)
- Aims of performance modelling:
  - Allows the comparison of algorithms.
  - Gives an indication of the scalability of an algorithm on a machine (a parallel system) as both the problem size and the number of processors change ("complexity analysis of parallel algorithms").
  - Enables reasoned choices at the design stage.
- Overview of an approach to performance modelling:
  - Based on the approach of Foster and Grama et al.
  - Targets a generic multicomputer (a model of message passing).
- Limitations.
- A worked example: vector sum reduction (compute the sum of the elements of a vector).
- Summary.

Aims of Performance Modelling
- In this and the next lecture we will look at modelling the performance of algorithms that compute a result; issues of correctness are relatively straightforward.
- We are interested in questions such as:
  - How long will an algorithm take to execute?
  - How much memory is required? (We will not consider this in detail.)
  - Does the algorithm scale as we vary the number of processors and/or the problem size? What does scaling mean?
  - How do the performances of different algorithms compare?
- Typically, we focus on one phase of a computation at a time, e.g. we assume that start-up and initialisation have been done, or that these phases have been modelled separately.

An Approach to Performance Modelling
- Based on a generic multicomputer (see next slide).
- Defined in terms of tasks that undertake computation and communicate with other tasks as necessary. A task may be an agglomeration of smaller (sub)tasks.
- Assumes a simple, but realistic, approach to communication between tasks, based on channels that connect pairs of tasks.
- Seeks an analytical expression for execution time T as a function of (at least) the problem size N and the number of processors P (and, often, the number of tasks U):

  T = f(N, P, U, ...)
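To give a feel for what such an expression looks like, the sketch below evaluates a plausible model for the vector sum reduction mentioned in the overview. This is illustrative only, not the model developed in the lecture: it assumes tc seconds per addition, P tasks each summing N/P elements locally, and a binary-tree combination taking log2(P) steps of one message plus one addition each.

```c
/* Hedged sketch of an analytical model T(N, P) for a vector sum reduction.
 * The form of the model and all parameter values are assumptions, not
 * taken from the lecture.  Compile with -lm for log2(). */
#include <math.h>
#include <stdio.h>

/* T(N, P) = (N/P - 1)*tc + log2(P)*(ts + tw + tc)   (model, not measurement) */
static double t_model(long N, int P, double tc, double ts, double tw)
{
    return ((double)N / P - 1.0) * tc + log2((double)P) * (ts + tw + tc);
}

int main(void)
{
    double tc = 1.0e-9, ts = 1.0e-5, tw = 1.0e-8;   /* illustrative values */
    long   N  = 1000000;

    for (int P = 1; P <= 64; P *= 2)
        printf("P = %2d -> predicted T = %.6e s\n", P, t_model(N, P, tc, ts, tw));
    return 0;
}
```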

A Generic Multicomputer
[Figure: a set of nodes, each consisting of a CPU and its local memory, connected by an interconnect.]

Task-channel Model
- Tasks execute concurrently. The number of tasks can vary during execution.
- A task encapsulates a sequential program and local memory.
- Tasks are connected by channels to other tasks. From the point of view of a task, a channel is either an input or an output channel.
- In addition to reading from, and writing to, its local memory, a task can:
  - send messages on output channels,
  - receive messages on input channels,
  - create new tasks,
  - terminate.

Task-channel Model
- A channel connecting two tasks acts as a message queue.
- A send operation is asynchronous: it completes immediately. Sends are considered to be 'free' (they take zero time)(?!).
- A receive operation is synchronous: execution of a task is blocked until a message is available. Receives may cause waiting (idle time) and take a finite time to complete (as data is transmitted from one task to another).
- Channels can be created dynamically (also taking zero time!).
- Tasks can be mapped to physical processors in various ways. The mapping does not affect the semantics of the program, but it may affect performance.
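As an illustration (not part of the lecture material), the task-channel behaviour maps fairly naturally onto message passing as provided by, for example, MPI: a non-blocking MPI_Isend returns immediately, much like the model's 'free' asynchronous send, while a blocking MPI_Recv behaves like the model's synchronous receive and may introduce idle time. The two-task sketch below assumes a working MPI installation and exactly two processes.

```c
/* Sketch of two "tasks" linked by one "channel", expressed with MPI.
 * Run with two processes, e.g.: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* producer task */
        MPI_Request req;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* the send returns immediately; task 0 can keep computing here */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {               /* consumer task */
        int received;
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);      /* blocks: possible idle time */
        printf("task 1 received %d\n", received);
    }

    MPI_Finalize();
    return 0;
}
```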

Specifics of Performance Modelling
- Assume a processor is either computing, communicating or idling.
- Thus, the total execution time T can be found either as the sum of the time spent in each activity on any particular processor j:

  T = Tcomp(j) + Tcomm(j) + Tidle(j)

  or as the sum of each activity over all processors, divided by the number of processors P:

  T = (1/P) * (sum over i of Tcomp(i) + sum over i of Tcomm(i) + sum over i of Tidle(i))

- Such aggregate totals are often easier to calculate.
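A small numerical sketch of the two equivalent views, using invented per-processor timings (every processor is assumed to finish at the same wall-clock time, so the per-processor and aggregate calculations agree):

```c
/* Sketch: the two equivalent ways of computing total execution time T.
 * All timing values are illustrative, not from the lecture. */
#include <stdio.h>

#define P 4

int main(void)
{
    /* hypothetical times (seconds) spent computing, communicating, idling */
    double t_comp[P] = {3.0, 2.8, 3.1, 2.9};
    double t_comm[P] = {0.5, 0.6, 0.4, 0.5};
    double t_idle[P] = {0.5, 0.6, 0.5, 0.6};

    /* view 1: sum of the three activities on any one processor j */
    int j = 0;
    double T_j = t_comp[j] + t_comm[j] + t_idle[j];

    /* view 2: aggregate over all processors, divided by P */
    double sum = 0.0;
    for (int i = 0; i < P; i++)
        sum += t_comp[i] + t_comm[i] + t_idle[i];
    double T_avg = sum / P;

    printf("T (processor %d) = %.2f s, T (aggregate/P) = %.2f s\n", j, T_j, T_avg);
    return 0;
}
```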

Definitions

Cost of Messages
- A simple model of the cost (in time) of a message is:

  Tmsg = ts + tw * L

  where:
  - Tmsg is the time to receive a message,
  - ts is the start-up cost (in time) of receiving a message,
  - tw is the cost (in time) per word (s/word), so 1/tw is the bandwidth (words/s),
  - L is the number of words in the message.

Cost of Communication
- Thus, the total communication time, Tcomm, is the sum of all the message times:

  Tcomm = sum of Tmsg over all messages received
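The following sketch simply evaluates the two formulae above with illustrative (not measured) values of ts and tw and a handful of hypothetical message sizes; note how small messages are dominated by the start-up cost ts.

```c
/* Sketch of Tmsg = ts + tw*L and of Tcomm as the sum of message times.
 * ts, tw and the message lengths are illustrative values only. */
#include <stdio.h>

static double t_msg(double ts, double tw, long L)
{
    return ts + tw * (double)L;   /* time to receive one L-word message */
}

int main(void)
{
    double ts = 1.0e-5;                   /* assumed start-up cost: 10 us        */
    double tw = 1.0e-8;                   /* assumed per-word cost: 1e8 words/s  */
    long   msgs[] = {1, 1000, 1000000};   /* hypothetical message sizes (words)  */
    double t_comm = 0.0;

    for (int i = 0; i < 3; i++) {
        double t = t_msg(ts, tw, msgs[i]);
        printf("L = %7ld words -> Tmsg = %.6e s\n", msgs[i], t);
        t_comm += t;                      /* Tcomm = sum of all message times    */
    }
    printf("Tcomm = %.6e s\n", t_comm);
    return 0;
}
```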

Limitations of the Model
- The (basic) model presented in this lecture ignores the hierarchical nature of the memory of real computer systems:
  - cache behaviour,
  - the impact of network architecture,
  - issues of competition for bandwidth (contention).
- The basic model can be extended to cope with any or all of these complicating factors.
- Experience with real performance analysis on real systems helps the designer to choose when and what extra modelling might be helpful.

Relative Performance Metrics: Speed-up and Efficiency
- Define relative speed-up as the ratio of the execution time of the parallel algorithm on one processor to the corresponding time on P processors:

  Srel = T1 / TP

- Define relative efficiency as:

  Erel = Srel / P = T1 / (P * TP)

- The latter is a fractional measure of the time that processors spend doing useful work (i.e., the time it takes to do all the necessary useful work divided by the total time on all P processors). It characterises the efficiency of an algorithm on a system, for any given problem size and any number of processors.
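A quick numerical illustration of these two definitions, using invented run times (T1 and TP are not measurements):

```c
/* Sketch: relative speed-up and efficiency from run times.
 * T1 and TP are invented numbers, used only to show the arithmetic. */
#include <stdio.h>

int main(void)
{
    double T1 = 100.0;      /* parallel algorithm on 1 processor (s)  */
    double TP = 8.0;        /* the same algorithm on P processors (s) */
    int    P  = 16;

    double S_rel = T1 / TP;             /* relative speed-up              */
    double E_rel = S_rel / P;           /* relative efficiency, 0..1      */

    printf("S_rel = %.2f, E_rel = %.2f\n", S_rel, E_rel);  /* 12.50, 0.78 */
    return 0;
}
```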

Absolute Performance Metrics
- Relative speed-up can be misleading! (Why?)
- Define absolute speed-up (and absolute efficiency) with reference to the execution time, Tref, of an implementation of the best known sequential algorithm for the problem at hand:

  Sabs = Tref / TP,   Eabs = Tref / (P * TP)

- Note: the best known sequential algorithm may solve the problem in a fashion that is significantly different to that of the parallel algorithm.

Overhead
- Another way of viewing this is to look at the difference between an ideal parallel execution time and the time actually observed (usually longer).
- The ideal is simply the time for the best known sequential algorithm divided by the number of processors. (Why?)
- The difference between the observed time and the ideal is termed the execution time overhead, OP, which is the average overhead time per processor:

  OP = TP - Tref / P
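Continuing with invented numbers, the sketch below computes absolute speed-up, absolute efficiency and the overhead per processor; Tref, TP and P are illustrative values only.

```c
/* Sketch: absolute speed-up, absolute efficiency and overhead per
 * processor.  Tref (best known sequential algorithm) and TP are
 * invented values, used only to show the arithmetic. */
#include <stdio.h>

int main(void)
{
    double Tref = 96.0;     /* best known sequential algorithm (s)      */
    double TP   = 8.0;      /* parallel algorithm on P processors (s)   */
    int    P    = 16;

    double S_abs = Tref / TP;             /* absolute speed-up          */
    double E_abs = Tref / (P * TP);       /* absolute efficiency        */
    double O_P   = TP - Tref / P;         /* overhead per processor (s) */

    printf("S_abs = %.2f, E_abs = %.2f, O_P = %.2f s\n", S_abs, E_abs, O_P);
    /* expected output: S_abs = 12.00, E_abs = 0.75, O_P = 2.00 s */
    return 0;
}
```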

Summary
- We have introduced two views of (approaches to) performance modelling: the task-channel model, and performance metrics (relative and absolute).
- Usually the task-channel model, which reflects the underlying hardware activity, is used to develop formulae for the gross execution times that appear in the performance metrics.
- Relative performance metrics can be misleading, so we prefer to use absolute performance metrics.