CS453 Lecture 3

 A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size).  The asymptotic runtime of a sequential program is identical on any serial platform.  The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.  An algorithm must therefore be analyzed in the context of the underlying platform.  A parallel system is a combination of a parallel algorithm and an underlying platform. Ananth Grama, Purdue University, W. Lafayette

 A number of performance measures are intuitive.  Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel assembly. But how does this scale when the number of processors is changed or the program is ported to another machine altogether?  How much faster is the parallel version? This raises the obvious follow-up question – what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look faster? Ananth Grama, Purdue University, W. Lafayette

 If I use two processors, shouldn't my program run twice as fast?  No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance. The execution profile of a hypothetical parallel program executing on eight processing elements. Profile indicates times spent performing computation (both essential and excess), communication, and idling. Ananth Grama, Purdue University, W. Lafayette

 Inter-process interactions: Processors working on any non-trivial parallel problem will need to talk to each other.  Idling: Processes may idle because of load imbalance, synchronization, or serial components.  Excess Computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication. Ananth Grama, Purdue University, W. Lafayette

 Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.  The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution.  We denote the serial runtime by T_S and the parallel runtime by T_P. Ananth Grama, Purdue University, W. Lafayette

 Let T_all be the total time collectively spent by all the processing elements.  T_S is the serial time.  Observe that T_all - T_S is then the total time spent by all processors combined in non-useful work. This is called the total overhead.  The total time collectively spent by all the processing elements is T_all = p T_P (p is the number of processors).  The overhead function (T_o) is therefore given by T_o = p T_P - T_S. Ananth Grama, Purdue University, W. Lafayette
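A minimal Python sketch of the overhead function (the names total_overhead, t_parallel, and t_serial are illustrative, not from the slides):

    def total_overhead(p, t_parallel, t_serial):
        """Overhead T_o = p * T_P - T_S: total time spent on non-useful work."""
        return p * t_parallel - t_serial

    # Example: 8 processors, T_P = 5 s, T_S = 30 s.
    # T_all = 40 s of combined work, of which 10 s is overhead.
    print(total_overhead(8, 5.0, 30.0))  # 10.0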

 What is the benefit from parallelism?  Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements. Ananth Grama, Purdue University, W. Lafayette

 Consider the problem of adding n numbers by using n processing elements.  If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors. Ananth Grama, Purdue University, W. Lafayette

Computing the global sum of 16 partial sums using 16 processing elements. Σ_i^j denotes the sum of the numbers with consecutive labels from i to j. Ananth Grama, Purdue University, W. Lafayette

T_P = Θ(log n)  We know that T_S = Θ(n)  Speedup S is given by S = Θ(n / log n) Ananth Grama, Purdue University, W. Lafayette
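As a hedged illustration, the following Python snippet sequentially simulates the tree-based summation above and confirms that n (a power of two) values are combined in log2 n parallel rounds; the function name tree_sum and the simulation itself are illustrative, not part of the original algorithm's implementation:

    import math

    def tree_sum(values):
        """Simulate the binary-tree reduction: each round halves the number of
        active processing elements, so the round count is the parallel step count."""
        n = len(values)
        assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
        partial = list(values)
        rounds = 0
        while len(partial) > 1:
            # One parallel step: element i receives its partner's partial sum and adds.
            partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
            rounds += 1
        return partial[0], rounds

    total, steps = tree_sum(list(range(16)))
    print(total, steps, steps == int(math.log2(16)))  # 120 4 True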

 For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.  For the purpose of computing speedup, we always consider the best sequential program as the baseline. Ananth Grama, Purdue University, W. Lafayette

 Consider the problem of parallel bubble sort.  The serial time for bubble sort is 150 seconds.  The parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds.  The speedup would appear to be 150/40 = 3.75.  But is this really a fair assessment of the system?  What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system. Ananth Grama, Purdue University, W. Lafayette
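A tiny Python check of the two baselines (the 150 s, 40 s, and 30 s figures are from the slide; the helper name is illustrative):

    def speedup(t_serial, t_parallel):
        # Speedup relative to a chosen serial baseline.
        return t_serial / t_parallel

    print(speedup(150.0, 40.0))  # 3.75 -- versus serial bubble sort (a weak baseline)
    print(speedup(30.0, 40.0))   # 0.75 -- versus serial quicksort (the fair baseline)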

 Speedup can be as low as 0 (the parallel program never terminates).  Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources.  A speedup greater than p is possible only if each processing element spends less than time T_S / p solving the problem.  In this case, a single processor could be time-sliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup. Ananth Grama, Purdue University, W. Lafayette

 Efficiency is a measure of the fraction of time for which a processing element is usefully employed.  Mathematically, it is given by E = S / p = T_S / (p T_P).  Following the bounds on speedup, efficiency can be as low as 0 and as high as 1. Ananth Grama, Purdue University, W. Lafayette

 The speedup of adding n numbers on n processing elements is given by S = Θ(n / log n).  Efficiency is given by E = S / n = Θ(1 / log n). Ananth Grama, Purdue University, W. Lafayette
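A minimal Python sketch (unit-cost operations assumed; the efficiency helper is illustrative) showing how the efficiency of this scheme falls off as 1 / log2 n:

    import math

    def efficiency(t_serial, t_parallel, p):
        """E = S / p = T_S / (p * T_P)."""
        return t_serial / (p * t_parallel)

    # Adding n numbers on p = n processing elements: T_S ~ n, T_P ~ log2 n.
    for n in (16, 1024, 2**20):
        print(n, efficiency(n, math.log2(n), n))  # 0.25, 0.1, 0.05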

 Cost is the product of parallel runtime and the number of processing elements used (p x T_P).  Cost reflects the sum of the time that each processing element spends solving the problem.  A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.  Since E = T_S / (p T_P), for cost-optimal systems, E = Θ(1).  Cost is sometimes referred to as work or processor-time product. Ananth Grama, Purdue University, W. Lafayette

Consider the problem of adding n numbers on n processors.  We have T_P = log n (for p = n).  The cost of this system is given by p T_P = n log n.  Since the serial runtime of this operation is Θ(n), the algorithm is not cost-optimal. Ananth Grama, Purdue University, W. Lafayette
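A hedged numeric illustration in Python (helper names are mine): the ratio of the parallel cost p * T_P to the serial cost grows like log2 n here, so the system cannot be cost-optimal.

    import math

    def parallel_cost(p, t_parallel):
        # Cost (work, processor-time product): p * T_P.
        return p * t_parallel

    # Adding n numbers on p = n processors: T_P ~ log2 n, T_S ~ n.
    for n in (2**10, 2**20, 2**30):
        ratio = parallel_cost(n, math.log2(n)) / n  # (n log n) / n = log n
        print(n, ratio)  # 10.0, 20.0, 30.0 -- grows with n, hence not cost-optimal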

Consider a sorting algorithm that uses n processing elements to sort the list in time (log n)^2.  Since the serial runtime of a (comparison-based) sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively.  The p T_P product of this algorithm is n (log n)^2.  This algorithm is not cost-optimal, but only by a factor of log n.  If p < n, assigning n tasks to p processors gives T_P = n (log n)^2 / p.  The corresponding speedup of this formulation is p / log n.  This speedup goes down as the problem size n is increased for a given p! Ananth Grama, Purdue University, W. Lafayette
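A quick Python sketch of the last point above (purely numeric, unit-cost comparisons assumed; the function is illustrative): for a fixed p, the speedup of this scaled-down formulation behaves like p / log2 n and shrinks as n grows.

    import math

    def speedup_scaled_sort(n, p):
        t_serial = n * math.log2(n)                # best comparison-based serial sort
        t_parallel = n * math.log2(n) ** 2 / p     # T_P = n (log n)^2 / p
        return t_serial / t_parallel               # simplifies to p / log2(n)

    p = 64
    for n in (2**10, 2**16, 2**20):
        print(n, speedup_scaled_sort(n, p))  # 6.4, 4.0, 3.2 -- decreasing in n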

 Often, using fewer processors improves performance of parallel systems.  Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.  A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down processors.  Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.  The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity. Ananth Grama, Purdue University, W. Lafayette

 Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2.  Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors.  Each of the p processors is now assigned n / p virtual processors.  The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements.  Subsequent log n - log p steps do not require any communication. Ananth Grama, Purdue University, W. Lafayette
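A rough Python model of the slide's naive scaled-down scheme (a sketch under my own assumption that the communication-free phase costs about n/p time on the busiest processor; names are illustrative):

    import math

    def naive_scaled_down_time(n, p):
        """Per the slide: the first log2(p) steps are simulated in (n/p)*log2(p)
        unit steps; the remaining log2(n) - log2(p) steps need no communication
        and cost roughly n/p in total (assumption: the active work halves each step)."""
        simulated = (n // p) * int(math.log2(p))
        local = n // p
        return simulated + local

    n, p = 2**20, 2**6
    t_p = naive_scaled_down_time(n, p)
    # 114688 7.0 -- the cost/serial ratio is about log2(p) + 1, matching a Θ(n log p) cost
    print(t_p, p * t_p / n)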

 The overall parallel execution time of this parallel system is Θ((n / p) log p).  The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal. Ananth Grama, Purdue University, W. Lafayette