2a.1 Evaluating Parallel Programs
Cluster Computing, UNC-Charlotte, B. Wilkinson.

2a.2 Sequential execution time, t_s: estimate it by counting the computational steps of the best sequential algorithm.
Parallel execution time, t_p: in addition to the number of computational steps, t_comp, we also need to estimate the communication overhead, t_comm:
t_p = t_comp + t_comm

2a.3 Computational Time
Count the number of computational steps. When more than one process executes simultaneously, count the computational steps of the most complex process. In general, t_comp is a function of n and p, i.e. t_comp = f(n, p).
Often the computation time is broken down into parts, so that t_comp = t_comp1 + t_comp2 + t_comp3 + ...
The analysis is usually done assuming that all processors are identical and operate at the same speed.

2a.4 Communication Time
Many factors affect the communication time, including the network structure. As a first approximation, use
t_comm = t_startup + n t_data
where t_startup is the startup time, essentially the time to send a message with no data, assumed constant, and t_data is the transmission time to send one data word, also assumed constant; there are n data words.
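As a minimal sketch of this model (my own illustration, not code from the slides), the C function below evaluates t_comm for a given message size; the constants T_STARTUP and T_DATA are hypothetical placeholders that would be replaced by measured values on a real system:

    #include <stdio.h>

    /* Hypothetical machine constants, in units of one computational step;
       measured values would replace these on a real system. */
    #define T_STARTUP 1000.0   /* startup (latency) cost per message */
    #define T_DATA      50.0   /* cost per data word transmitted     */

    /* First-approximation communication time: t_comm = t_startup + n * t_data */
    double comm_time(long n_words)
    {
        return T_STARTUP + (double)n_words * T_DATA;
    }

    int main(void)
    {
        printf("t_comm for 1 word:     %.0f steps\n", comm_time(1));
        printf("t_comm for 1000 words: %.0f steps\n", comm_time(1000));
        return 0;
    }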

2a.5 Idealized Communication Time
[Figure: idealized communication time plotted against the number of data items, n; the intercept on the time axis is the startup time.]
This equation for the communication time ignores the fact that the source and destination may not be directly linked in a real system, so the message may have to pass through intermediate nodes. It also assumes that the overhead incurred by including information other than data in the packet is constant and can be folded into the startup time.

2a.6 Final communication time, t_comm
The final communication time is the summation of the communication times of all sequential messages from one process, i.e. t_comm = t_comm1 + t_comm2 + t_comm3 + ...
The communication patterns of all processes are assumed to be the same and to take place together, so that only one process need be considered.
Both t_startup and t_data are measured in units of one computational step, so that t_comp and t_comm can be added together to obtain the parallel execution time, t_p.

2a.7 Communication Time of Broadcast/Gather
If the broadcast is done over a single shared wire, as in classic Ethernet, the time complexity is O(1) for a single data item and O(w) for w data items.
If a binary tree is used as the underlying network structure with a 1-to-N fan-out broadcast, what is the communication cost for p final destinations (the leaf nodes) when the message holds w data items?
We assume that the left and right children receive the message from their parent sequentially, but at each level the different parent nodes send out the message at the same time.

2a.8 1-to-N fan-out Broadcast
t_comm = 2 (log p)(t_startup + w t_data)
The cost depends on the number of levels and the number of nodes at each level. For a binary tree with p final destinations at the leaf level there are log p levels, and the factor of 2 reflects each parent sending to its two children one after the other.
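A similar sketch for the tree-broadcast model, reusing the hypothetical T_STARTUP and T_DATA constants from the previous example; rounding log2(p) up for values of p that are not powers of two is my own assumption (compile with -lm for the math library):

    #include <math.h>
    #include <stdio.h>

    #define T_STARTUP 1000.0   /* hypothetical startup cost per message */
    #define T_DATA      50.0   /* hypothetical cost per data word       */

    /* Binary-tree 1-to-N fan-out broadcast: each parent sends to its two
       children one after the other, so each of the log2(p) levels costs
       2 * (t_startup + w * t_data). */
    double bcast_time(int p, long w)
    {
        double levels = ceil(log2((double)p));
        return 2.0 * levels * (T_STARTUP + (double)w * T_DATA);
    }

    int main(void)
    {
        printf("broadcast to 8 leaves, 100 words: %.0f steps\n", bcast_time(8, 100));
        return 0;
    }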

2a.9 Benchmark Factors
With t_s, t_comp, and t_comm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation:
Speedup factor S = t_s / t_p = t_s / (t_comp + t_comm)
Computation/communication ratio = t_comp / t_comm
Both are functions of the number of processors, p, and the number of data elements, n.

2a.10 These factors give an indication of the scalability of the parallel solution as the number of processors and the problem size increase. The computation/communication ratio highlights the effect of communication as the problem size and system size grow. We want computation, not communication, to be the dominant factor: if the ratio grows as n increases, communication becomes negligible and adding more processors can improve performance.

2a.11 Example
Adding n numbers using two computers, each adding n/2 numbers. The numbers are initially held in Computer 1.
1. Computer 1 sends n/2 numbers to Computer 2: t_comm = t_startup + (n/2) t_data
2. Both computers add up their n/2 numbers: t_comp = n/2
3. Computer 2 sends its result back to Computer 1: t_comm = t_startup + t_data
4. Computer 1 adds the two partial sums: t_comp = 1

2a.12 Overall
t_comm = 2 t_startup + (n/2 + 1) t_data = O(n)
t_comp = n/2 + 1 = O(n)
Computation/communication ratio = O(1)
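A minimal MPI sketch of this two-process example (my own illustrative code, not taken from the text); process 0 holds all n numbers, sends half to process 1, and finally adds the two partial sums. Run it with two processes, for example mpirun -np 2 ./a.out.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000                     /* illustrative problem size */

    int main(int argc, char *argv[])
    {
        int rank;
        double partial = 0.0, remote = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double numbers[N];
            for (int i = 0; i < N; i++) numbers[i] = 1.0;   /* dummy data */

            /* Send n/2 numbers: t_startup + (n/2) t_data */
            MPI_Send(&numbers[N/2], N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

            /* Add up the local n/2 numbers: n/2 steps */
            for (int i = 0; i < N/2; i++) partial += numbers[i];

            /* Receive the result back: t_startup + t_data */
            MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Add the partial sums: 1 step */
            printf("sum = %f\n", partial + remote);
        } else if (rank == 1) {
            double half[N/2];
            MPI_Recv(half, N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < N/2; i++) partial += half[i];
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }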

2a.13 Another problem
Computation time complexity = O(n^2)
Communication time complexity = O(n)
Computation/communication ratio = O(n)

2a.14 Cost
Cost = (execution time) x (number of processors)
Cost of the sequential computation = t_s
Cost of the parallel computation = t_p x p
Cost-optimal algorithm: one whose parallel computation cost is proportional to the sequential computation cost, i.e.
Cost = t_p x p = k x t_s, where k is a constant.

2a.15 Example
Suppose t_s = O(n log n) for the best sequential algorithm, where n is the number of data items and p the number of processors.
The parallel algorithm is cost-optimal if t_p = O((n log n)/p) = O((n/p) log n).
It is not cost-optimal if, for example, t_p = O(n^2/p).
In general, a parallel algorithm is cost-optimal if its parallel time complexity multiplied by the number of processors equals the sequential time complexity.

2a.16 Evaluating Programs: Measuring the Execution Time
Time-complexity analysis gives insight into a parallel algorithm and is useful for comparing different algorithms, but we also want to know how the algorithm actually performs on a real system. We can measure the elapsed time in seconds between two points in the code using system calls such as clock(), time(), gettimeofday(), or MPI_Wtime(). For example:
time(&t1);
/* ... code being timed ... */
time(&t2);
elapsed_time = difftime(t2, t1);
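A complete, compilable version of this timing pattern, as a sketch (the loop is only a placeholder for the code section being measured); note that time() has one-second resolution, so gettimeofday() or MPI_Wtime() would normally be preferred for short intervals:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t t1, t2;
        double elapsed_time, x = 0.0;

        time(&t1);                              /* first time stamp                */
        for (long i = 0; i < 100000000L; i++)   /* placeholder for the code        */
            x += (double)i;                     /* section actually being measured */
        time(&t2);                              /* second time stamp               */

        elapsed_time = difftime(t2, t1);        /* elapsed wall-clock seconds      */
        printf("elapsed time = %.0f s (x = %g)\n", elapsed_time, x);
        return 0;
    }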

2a.17 Communication Time by the Ping-Pong Method
The point-to-point communication time of a specific system can be found using the ping-pong method. One process, p0, sends a message to another process, say p1. Immediately upon receiving the message, p1 sends it back to p0. The round-trip time is divided by two to obtain an estimate of the one-way communication time. For example, at p0:
time(&t1);
send(&x, p1);
recv(&x, p1);
time(&t2);
elapsed_time = 0.5 * difftime(t2, t1);
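A hedged MPI sketch of the same measurement, using MPI_Wtime() for finer resolution and averaging over many round trips; the repetition count and the message size are illustrative choices of mine:

    #include <mpi.h>
    #include <stdio.h>

    #define REPS  1000     /* number of round trips to average over */
    #define WORDS 1024     /* message size in doubles               */

    int main(int argc, char *argv[])
    {
        int rank;
        double buf[WORDS] = {0.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {            /* p0: send ping, wait for pong */
                MPI_Send(buf, WORDS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, WORDS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* p1: echo the message back    */
                MPI_Recv(buf, WORDS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, WORDS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t2 = MPI_Wtime();

        if (rank == 0)
            printf("estimated one-way time: %g s\n", 0.5 * (t2 - t1) / REPS);

        MPI_Finalize();
        return 0;
    }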

2a.18 Profiling
A profile of a program is a histogram or graph showing the time spent in different parts of the program, or the number of times particular source statements are executed. It helps identify hot spots, that is, places in the program that are visited many times during execution. These places are the first candidates for optimization.
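As a toy illustration of the counting idea only (not a real profiling tool), a program can be instrumented by hand with per-region counters and a crude text histogram printed at the end; the regions and counts below are entirely made up:

    #include <stdio.h>

    #define NREGIONS 3
    static long counts[NREGIONS];              /* execution count per region */

    static void work(int region) { counts[region]++; }

    int main(void)
    {
        for (int i = 0; i < 1000; i++) {
            work(0);                           /* region 0: every iteration     */
            if (i % 10 == 0)  work(1);         /* region 1: 1 time in 10        */
            if (i % 100 == 0) work(2);         /* region 2: 1 time in 100       */
        }

        /* Crude text histogram: one '*' per 10 executions of each region. */
        for (int r = 0; r < NREGIONS; r++) {
            printf("region %d (%5ld): ", r, counts[r]);
            for (long j = 0; j < counts[r] / 10; j++) putchar('*');
            putchar('\n');
        }
        return 0;
    }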

2a.19 Program Profile Histogram
[Figure: program profile histogram; the horizontal axis is the statement number or region of the program.]