CS 420 Design of Algorithms Analytical Models of Parallel Algorithms

Analytical Models of Parallel Algorithms Remember: minimize parallel overhead

Sources of Overhead Interprocess interactions Almost any nontrivial (non-embarrassingly parallel) algorithm requires interprocess interaction. This is overhead relative to the serial algorithm that achieves the same solution. Remember: decomposition and mapping

Sources of Overhead Idling Idling processes in an algorithm = a net loss in aggregate computational performance, i.e., not squeezing as much performance out of the parallel algorithm as (maybe) possible = overhead

Sources of Overhead Excess Computation The best existing serial algorithm may not be readily or efficiently parallelizable – perhaps you can't just evenly divide the serial algorithm into p parallel pieces. Each parallel task may require additional computation (relative to the corresponding work in the serial algorithm) – recall: redundant computation = excess computation = overhead

Performance Metrics Execution Time Serial Runtime = total elapsed time (wall time) from the beginning to the end of execution of the serial program on a single PE. Parallel Runtime = total elapsed time (wall time) from the beginning of the parallel computation to the end of the parallel computation. Ts = Serial Runtime Tp = Parallel Runtime
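A minimal sketch (not from the slides) of how these wall-clock times might be measured; serial_solve and parallel_solve are hypothetical placeholders for the programs being timed.

```python
import time

def wall_time(fn, *args):
    """Return the elapsed wall-clock time of one call to fn, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

# Hypothetical usage (the two solver functions are placeholders):
# Ts = wall_time(serial_solve, problem)        # serial runtime on a single PE
# Tp = wall_time(parallel_solve, problem, p)   # start to end of the parallel computation
```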

Performance Metrics Execution time – As a baseline… from a theoretical perspective, Ts is often based on the best available serial algorithm for solving a given problem… not necessarily on the serial version of the parallel algorithm. From a practical perspective… sometimes the serial and parallel programs are based on the same algorithm, and sometimes you want to know how the parallel algorithm compares to its serial counterpart.

Performance Metrics Total Parallel Overhead Need to represent the total parallel overhead as an overhead function; it will be a function of things like work size (w) and number of PEs (p). Total parallel overhead = total parallel runtime (Tp) times the number of PEs (p), minus the serial runtime (Ts) of the best available serial algorithm for the same problem: To = pTp - Ts

Performance Metrics Speedup Usually we parallelize an algorithm to speed things up… … therefore, the obvious question is "how much did it speed things up?" Speedup = runtime of the serial algorithm (Ts) divided by the runtime of the parallel algorithm (Tp), or… S = Ts/Tp, or S = Θ(Ts/Tp), for a given number of PEs (p) and a given problem size
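A minimal sketch (not from the slides) turning the overhead and speedup definitions from the last two slides into code; the timings passed in are assumed to be measured elsewhere.

```python
def total_overhead(ts, tp, p):
    """Total parallel overhead: To = p*Tp - Ts."""
    return p * tp - ts

def speedup(ts, tp):
    """Speedup: S = Ts / Tp."""
    return ts / tp

# Example with made-up timings: Ts = 10.0 s serial, Tp = 3.0 s on p = 4 PEs
print(total_overhead(10.0, 3.0, 4))   # 2.0 s of aggregate overhead
print(speedup(10.0, 3.0))             # ~3.33x speedup
```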

Performance Metrics Speedup – for example… Adding up n numbers with n PEs Serial algorithm requires n steps – communicate a number, add, communicate the sum, add, … Parallel algorithm – each even PE communicates its number to its lower neighbor, the neighbor adds the two numbers and passes the sum on… … a binary tree

Performance Metrics Example: adding n numbers with n PEs Ts = n Tp = log n So… S = n/log n, or S = Θ(n/log n) If n = 16, then Ts = 16 and Tp = log 16 = 4, so S = 16/4 = 4
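A minimal sketch (not from the slides) of the binary-tree reduction described above; it counts the parallel steps and assumes n is a power of two.

```python
import math

def tree_reduce(values):
    """Pairwise (binary-tree) reduction; returns (sum, number of parallel steps).
    Each pass combines all pairs at once, so it counts as one parallel step."""
    steps = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

n = 16
total, tp = tree_reduce(list(range(n)))
ts = n                        # serial cost assumed on the slide: Ts = n
print(tp == math.log2(n))     # True: Tp = log n = 4 steps
print(ts / tp)                # S = 16/4 = 4.0
```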

Performance Metrics Speedup In theory S cannot be greater than the number of PEs (p) But this does occur… When it does, it is called superlinear speedup

Performance Metrics Superlinear Speedup Why does this happen? Poor serial algorithm design Maybe parallelization removed bottlenecks present in the serial program – I/O contention, for example

Performance Metrics Superlinear Speedup Cache Effects Distributing a problem in smaller pieces may improve the cache hit rate and, therefore, improve the overall performance of the algorithm by more than a factor of the number of PEs. For example…

Performance Metrics Superlinear Speedup – Cache effects From A. Grama, et al., 2003 Suppose your serial algorithm has a cache hit rate of 80%, and you have a cache latency of 2 ns and a memory latency of 100 ns. Then, the effective memory access time is 2 × 0.8 + 100 × 0.2 = 21.6 ns. If the algorithm is memory bound, with one FLOP per memory access, then the algorithm runs at about 46.3 MFLOPS

Performance Metrics Superlinear Speedup – Cache effects Now suppose you parallelize this problem on two PEs, so Wp = W/2. Now you have remote data access to deal with; assume each remote memory access requires 400 ns (much slower than local memory and cache) …continued…

Performance Metrics Superlinear Speedup – Cache effects This algorithm requires remote memory access for only 20% of its cache misses: since Wp is smaller, the cache hit rate rises to 90%… … 8% of accesses go to local memory and 2% to remote memory. Average memory access time = 2 × 0.9 + 100 × 0.08 + 400 × 0.02 = 17.8 ns Each PE therefore runs at about 56.2 MFLOPS, giving a total execution rate (2 PEs) of about 112.4 MFLOPS So… S = 112.4/46.3 ≈ 2.43 (superlinear speedup)
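A quick sketch (not from the slides) re-deriving the numbers above; latencies are in nanoseconds and one FLOP per memory access is assumed.

```python
# Serial: 80% cache hits (2 ns), 20% DRAM (100 ns)
serial_access_ns = 2 * 0.80 + 100 * 0.20                   # 21.6 ns
serial_mflops = 1e3 / serial_access_ns                     # ~46.3 MFLOPS (one FLOP per access)

# Two PEs: 90% cache hits, 8% local DRAM, 2% remote memory (400 ns)
parallel_access_ns = 2 * 0.90 + 100 * 0.08 + 400 * 0.02    # 17.8 ns
per_pe_mflops = 1e3 / parallel_access_ns                   # ~56.2 MFLOPS per PE
total_mflops = 2 * per_pe_mflops                           # ~112.4 MFLOPS on 2 PEs

print(round(total_mflops / serial_mflops, 2))              # 2.43 -> superlinear (> p = 2)
```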

Performance Metrics Superlinear Speedup from Exploratory Decomposition Recall that exploratory decomposition is useful for finding solutions where the problem space is defined as a tree of alternatives… and the goal is to find the correct node in the tree.

Performance Metrics Superlinear Speedup from Exploratory Decomposition (Blue node in the figure = the solution) Use a depth-first search algorithm and assume the time to visit a node and test it for the solution = x Serial algorithm: Ts = 12x Parallel algorithm with p = 2: Tp = 3x S = 12x/3x = 4 > p, i.e., superlinear speedup
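A minimal sketch illustrating how splitting a depth-first search between two PEs can reach the solution after far fewer node visits than the serial search. The tree on the slide is only shown graphically, so the tree below is hypothetical; it is built so the counts match the slide (12 serial visits vs. 3 parallel visits).

```python
SOLUTION = "G"   # hypothetical label for the solution node

def visits_until_found(node):
    """Left-first DFS; returns (nodes visited, whether the solution was found)."""
    stack, visited = [node], 0
    while stack:
        label, children = stack.pop()
        visited += 1
        if label == SOLUTION:
            return visited, True
        stack.extend(reversed(children))   # preserve left-to-right search order
    return visited, False

# Hypothetical tree: an 8-node left subtree with no solution, and a right
# subtree whose leftmost grandchild is the solution.
left = ("L", [("L1", [("L3", []), ("L4", [])]),
              ("L2", [("L5", []), ("L6", []), ("L7", [])])])
right = ("R", [("R1", [(SOLUTION, []), ("R3", [])]), ("R2", [])])
root = ("root", [left, right])

ts, _ = visits_until_found(root)    # serial DFS: 12 visits, i.e. Ts = 12x
tp, _ = visits_until_found(right)   # PE searching the right subtree: 3 visits, Tp = 3x
print(ts, tp, ts / tp)              # 12 3 4.0 -> S = 4 with only p = 2
```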

Performance Metrics Efficiency – a measure of how fully the algorithm utilizes processing resources E = S/p Ideally Speedup (S) equals p, and therefore Efficiency (E) = 1 Not typical because of overhead: usually S < p, and 0 < E < 1 Remember: adding n numbers on n PEs gives E = (n/log n)/n, or 1/(log n)
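Continuing the earlier sketch (not from the slides), efficiency falls out of the same quantities; the adding-n-numbers case reproduces E = 1/(log n).

```python
import math

def efficiency(ts, tp, p):
    """Efficiency: E = S / p = Ts / (p * Tp)."""
    return (ts / tp) / p

# Adding n numbers on n PEs: Ts = n, Tp = log n, p = n
for n in (16, 64, 256):
    print(n, round(efficiency(n, math.log2(n), n), 3), round(1 / math.log2(n), 3))
    # the last two columns match: E = 1/(log n)
```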

Performance Metrics Scalability – does the algorithm scale? Scalability – how well does the algorithm scale as the number of PEs grows, or… how well does it scale as the size of the problem grows? What does S do as you increase p? What does S do as you increase w?

Performance Metrics Scalability – another way to look at it Scalability – can you maintain a constant E as you vary p or w? In other words, how does E = f(w, p) behave?
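A hypothetical illustration (not from the slides) of treating E as a function of (w, p): it reuses E = Ts/(p·Tp) with a made-up cost model Tp = w/p + c·log p, where c is an assumed per-step overhead constant.

```python
import math

def efficiency_model(w, p, c=2.0):
    """E = Ts / (p * Tp) under an assumed cost model:
    Ts = w, Tp = w/p + c*log2(p); c is a made-up overhead constant."""
    tp = w / p + c * math.log2(p)
    return w / (p * tp)

# E drops as p grows for a fixed w, and recovers when w grows along with p.
for w in (1_000, 10_000, 100_000):
    print(w, [round(efficiency_model(w, p), 2) for p in (2, 8, 32, 128)])
```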

The End