Lecture 3 – Parallel Performance Theory - 1
Parallel Computing CIS 410/510
Department of Computer and Information Science, University of Oregon

Outline
• Performance scalability
• Analytical performance measures
• Amdahl’s law and Gustafson-Barsis’ law

What is Performance?
• In computing, performance is defined by two factors:
  ❍ Computational requirements (what needs to be done)
  ❍ Computing resources (what it costs to do it)
• Computational problems translate to requirements
• Computing resources interplay and trade off
• Resources for a solution include hardware, time, energy … and ultimately money
• Performance ~ 1 / (resources for solution)

Why do we care about Performance?
• Performance itself is a measure of how well the computational requirements can be satisfied
• We evaluate performance to understand the relationships between requirements and resources
  ❍ Decide how to change “solutions” to target objectives
• Performance measures reflect decisions about how and how well “solutions” are able to satisfy the computational requirements

“The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible.”
Charles Babbage (1791–1871)

What is Parallel Performance?
• Here we are concerned with performance issues when using a parallel computing environment
  ❍ Performance with respect to parallel computation
• Performance is the raison d’être for parallelism
  ❍ Parallel performance versus sequential performance
  ❍ If the “performance” is not better, parallelism is not necessary
• Parallel processing includes the techniques and technologies necessary to compute in parallel
  ❍ Hardware, networks, operating systems, parallel libraries, languages, compilers, algorithms, tools, …
• Parallelism must deliver performance
  ❍ How? How well?

Performance Expectation (Loss)
• If each processor is rated at k MFLOPS and there are p processors, should we see k × p MFLOPS performance?
• If it takes 100 seconds on 1 processor, shouldn’t it take 10 seconds on 10 processors?
• Several causes affect performance
  ❍ Each must be understood separately
  ❍ But they interact with each other in complex ways
    ◆ The solution to one problem may create another
    ◆ One problem may mask another
• Scaling (system, problem size) can change conditions
• Need to understand the performance space

Embarrassingly Parallel Computations
• An embarrassingly parallel computation is one that can be obviously divided into completely independent parts that can be executed simultaneously
  ❍ In a truly embarrassingly parallel computation there is no interaction between separate processes
  ❍ In a nearly embarrassingly parallel computation results must be distributed and collected/combined in some way
• Embarrassingly parallel computations have the potential to achieve maximal speedup on parallel platforms
  ❍ If it takes T time sequentially, there is the potential to achieve T/P time running in parallel with P processors
  ❍ What would cause this not always to be the case?
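
To make the idea concrete, here is a minimal Python sketch (not from the lecture) of a nearly embarrassingly parallel computation: each work item is processed independently and only the final gathering of results requires coordination. The function `square`, the input size, and the choice of 4 worker processes are arbitrary illustrations.

```python
# A nearly embarrassingly parallel computation: every element is
# squared independently; only collecting the results at the end
# requires any coordination between workers.
from multiprocessing import Pool

def square(x):
    # Completely independent work: no communication with other tasks.
    return x * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(processes=4) as pool:          # P = 4 workers
        results = pool.map(square, data)     # scatter work, gather results
    print(sum(results))
```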

Scalability
• A program can scale up to use many processors
  ❍ What does that mean?
• How do you evaluate scalability?
• How do you evaluate scalability goodness?
• Comparative evaluation
  ❍ If we double the number of processors, what should we expect?
  ❍ Is scalability linear?
• Use a parallel efficiency measure
  ❍ Is efficiency retained as problem size increases?
• Apply performance metrics

Performance and Scalability
• Evaluation
  ❍ Sequential runtime (T_seq) is a function of
    ◆ problem size and architecture
  ❍ Parallel runtime (T_par) is a function of
    ◆ problem size and parallel architecture
    ◆ # processors used in the execution
  ❍ Parallel performance is affected by
    ◆ algorithm + architecture
• Scalability
  ❍ The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem

Performance Metrics and Formulas
• T_1 is the execution time on a single processor
• T_p is the execution time on a p-processor system
• S(p) (S_p) is the speedup: S(p) = T_1 / T_p
• E(p) (E_p) is the efficiency: E(p) = S_p / p
• Cost(p) (C_p) is the cost: C(p) = p × T_p
• A parallel algorithm is cost-optimal when
  ❍ parallel cost = sequential time (C_p = T_1, E_p = 100%)
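
As a quick illustration (not part of the slides), these hypothetical helper functions compute the three metrics from measured runtimes; the sample values T_1 = 100 s and T_8 = 15 s are made up.

```python
# Speedup, efficiency, and cost as defined above.
def speedup(t1, tp):
    """S(p) = T_1 / T_p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p."""
    return speedup(t1, tp) / p

def cost(tp, p):
    """C(p) = p * T_p."""
    return p * tp

# Hypothetical measurements: T_1 = 100 s, T_8 = 15 s on p = 8 processors.
t1, tp, p = 100.0, 15.0, 8
print(speedup(t1, tp))        # ~6.67x
print(efficiency(t1, tp, p))  # ~0.83, i.e. 83% -- not cost-optimal
print(cost(tp, p))            # 120 processor-seconds vs. T_1 = 100 s
```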

Amdahl’s Law (Fixed Size Speedup)
• Let f be the fraction of a program that is sequential
  ❍ 1 - f is the fraction that can be parallelized
• Let T_1 be the execution time on 1 processor
• Let T_p be the execution time on p processors
• S_p is the speedup
  S_p = T_1 / T_p
      = T_1 / (f·T_1 + (1 - f)·T_1 / p)
      = 1 / (f + (1 - f) / p)
• As p → ∞, S_p → 1 / f
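
A small sketch of the formula (following the derivation above, with f as the sequential fraction; the value f = 0.05 is an arbitrary example):

```python
# Amdahl's law: S_p = 1 / (f + (1 - f) / p), f = sequential fraction.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

# Even a small sequential fraction caps the achievable speedup at 1/f.
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
# With f = 0.05 the speedup approaches, but never exceeds, 1/0.05 = 20.
```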

Amdahl’s Law and Scalability
• Scalability
  ❍ The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
• When does Amdahl’s Law apply?
  ❍ When the problem size is fixed
  ❍ Strong scaling (as p → ∞, S_p → S_∞ = 1 / f)
  ❍ The speedup bound is determined by the fraction of sequential execution time in the computation, not the number of processors!!!
  ❍ Uhh, this is not good … Why?
  ❍ Perfect efficiency is hard to achieve
• See the original paper by Amdahl on the course webpage

Gustafson-Barsis’ Law (Scaled Speedup)
• We are often interested in larger problems when scaling
  ❍ How big a problem can be run (HPC Linpack)
  ❍ Constrain the problem size by the parallel time
• Assume the parallel time is kept constant
  ❍ T_p = C = (f_seq + f_par) × C, where f_seq + f_par = 1
  ❍ f_seq is the fraction of T_p spent in sequential execution
  ❍ f_par is the fraction of T_p spent in parallel execution
• What is the execution time on one processor?
  ❍ Let C = 1; then T_s = f_seq + p(1 - f_seq) = 1 + (p - 1)·f_par
• What is the speedup in this case?
  ❍ S_p = T_s / T_p = T_s / 1 = f_seq + p(1 - f_seq) = 1 + (p - 1)·f_par
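
A matching sketch of the scaled-speedup formula (same caveats as the Amdahl sketch above; the value f_par = 0.95 is an arbitrary example):

```python
# Gustafson-Barsis' law: S_p = f_seq + p * f_par = 1 + (p - 1) * f_par,
# where f_seq + f_par = 1 are fractions of the *parallel* runtime.
def gustafson_speedup(f_par, p):
    return 1.0 + (p - 1) * f_par

for p in (2, 8, 64, 1024):
    print(p, gustafson_speedup(0.95, p))
# With f_par = 0.95 the scaled speedup keeps growing with p
# (about 972.85 at p = 1024) instead of saturating at 1/f_seq = 20.
```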

Gustafson-Barsis’ Law and Scalability
• Scalability
  ❍ The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
• When does Gustafson’s Law apply?
  ❍ When the problem size can increase as the number of processors increases
  ❍ Weak scaling (S_p = 1 + (p - 1)·f_par)
  ❍ The speedup function includes the number of processors!!!
  ❍ Can maintain or increase parallel efficiency as the problem scales
• See the original paper by Gustafson on the course webpage

Amdahl versus Gustafson-Barsis (figures)

DAG Model of Computation
• Think of a program as a directed acyclic graph (DAG) of tasks
  ❍ A task cannot execute until all of its inputs are available
  ❍ These come from the outputs of earlier executing tasks
  ❍ The DAG shows the task dependencies explicitly
• Think of the hardware as consisting of workers (processors)
• Consider a greedy scheduler of the DAG tasks onto workers
  ❍ No worker is idle while there are still tasks to execute

Work-Span Model
• T_P = time to run with P workers
• T_1 = work
  ❍ Time for serial execution
    ◆ execution of all tasks by 1 worker
  ❍ Sum of the work of all tasks
• T_∞ = span
  ❍ Time along the critical path
• Critical path
  ❍ The sequence of task executions (path) through the DAG that takes the longest time to execute
  ❍ Assumes an infinite number of workers is available

Work-Span Example
• Let each task take 1 unit of time
• The DAG at the right has 7 tasks
• T_1 = 7
  ❍ All tasks have to be executed
  ❍ Tasks are executed in a serial order
  ❍ Can they execute in any order?
• T_∞ = 5
  ❍ Time along the critical path
  ❍ In this case, it is the longest path length of any task order that maintains the necessary dependencies
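
The slide’s DAG itself is not reproduced in this transcript, so the sketch below builds a hypothetical 7-task DAG with unit task times whose longest dependency chain has length 5, consistent with T_1 = 7 and T_∞ = 5, and computes both quantities with a longest-path traversal.

```python
# Hypothetical 7-task DAG with unit-time tasks; the longest chain is
# A -> B -> C -> D -> E, so T_1 = 7 (total work) and T_inf = 5 (span).
predecessors = {
    "A": [], "B": ["A"], "C": ["B", "F"], "D": ["C", "G"],
    "E": ["D"], "F": [], "G": [],
}

def work_and_span(preds, time=lambda task: 1):
    """Return (T_1, T_inf): total work and critical-path length."""
    finish = {}                        # memoized earliest finish times
    def longest_to(task):              # longest path ending at `task`
        if task not in finish:
            ready = max((longest_to(p) for p in preds[task]), default=0)
            finish[task] = ready + time(task)
        return finish[task]
    t1 = sum(time(task) for task in preds)
    t_inf = max(longest_to(task) for task in preds)
    return t1, t_inf

print(work_and_span(predecessors))     # (7, 5)
```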

Lower/Upper Bound on Greedy Scheduling
• Suppose we only have P workers
• We can write a work-span formula to derive a lower bound on T_P
  ❍ max(T_1 / P, T_∞) ≤ T_P
• T_∞ is the best possible execution time
• Brent’s Lemma derives an upper bound
  ❍ Capture the additional cost of executing the other tasks not on the critical path
  ❍ Assume this can be done without overhead
  ❍ T_P ≤ (T_1 - T_∞) / P + T_∞

Consider Brent’s Lemma for 2 Processors
• T_1 = 7
• T_∞ = 5
• T_2 ≤ (T_1 - T_∞) / P + T_∞
      ≤ (7 - 5) / 2 + 5
      ≤ 6
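
The same numbers fall out of a direct implementation of the two bounds; this short sketch (a continuation of the hypothetical example above) evaluates max(T_1/P, T_∞) and Brent’s upper bound.

```python
# Work-span bounds for a greedy schedule on P workers:
#   max(T_1 / P, T_inf)  <=  T_P  <=  (T_1 - T_inf) / P + T_inf
def greedy_bounds(t1, t_inf, p):
    lower = max(t1 / p, t_inf)
    upper = (t1 - t_inf) / p + t_inf     # Brent's lemma
    return lower, upper

print(greedy_bounds(7, 5, 2))            # (5.0, 6.0) -> 5 <= T_2 <= 6
```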

Amdahl was an optimist! (figure)

Estimating Running Time
• Scalability requires that T_∞ be dominated by T_1
  ❍ T_P ≈ T_1 / P + T_∞ if T_∞ << T_1
• Increasing work hurts parallel execution proportionately
• The span impacts scalability, even for finite P
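
A small numeric check (with made-up work and span values T_1 = 10,000 and T_∞ = 100) shows how the estimate behaves as P grows and when the span term starts to dominate:

```python
# Compare the estimate T_P ~= T_1 / P + T_inf with Brent's upper bound.
t1, t_inf = 10_000, 100                  # hypothetical work and span
for p in (10, 100, 1000):
    estimate = t1 / p + t_inf
    brent = (t1 - t_inf) / p + t_inf
    print(p, estimate, brent)
# While T_inf << T_1 / P the two agree closely; once T_1 / P shrinks
# toward T_inf, the span term dominates and speedup flattens out.
```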

Parallel Slack
• Sufficient parallelism implies linear speedup

Asymptotic Complexity
• Time complexity of an algorithm summarizes how the execution time grows with input size
• Space complexity summarizes how the memory requirements grow with input size
• The standard work-span model considers only computation, not communication or memory
• Asymptotic complexity is a strong indicator of performance on large-enough problem sizes and reveals an algorithm’s fundamental limits

Definitions for Asymptotic Notation
• Let T(N) denote the execution time of an algorithm on input size N
• Big O notation
  ❍ T(N) is a member of O(f(N)) means that T(N) ≤ c · f(N) for some positive constant c and all sufficiently large N
• Big Omega notation
  ❍ T(N) is a member of Ω(f(N)) means that T(N) ≥ c · f(N) for some positive constant c and all sufficiently large N
• Big Theta notation
  ❍ T(N) is a member of Θ(f(N)) means that c_1 · f(N) ≤ T(N) ≤ c_2 · f(N) for some positive constants c_1 and c_2 and all sufficiently large N
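
A short worked example (not on the slide) shows how the Θ definition is used in practice: for T(N) = 3N² + 5N, the constants c_1 = 3 and c_2 = 4 witness T(N) ∈ Θ(N²).

```latex
% Worked example: T(N) = 3N^2 + 5N is in \Theta(N^2).
% Since 5N \le N^2 whenever N \ge 5,
\[
  3N^2 \;\le\; 3N^2 + 5N \;\le\; 3N^2 + N^2 = 4N^2
  \qquad \text{for all } N \ge 5,
\]
\[
  \text{so } T(N) \in \Theta(N^2) \text{ with } c_1 = 3 \text{ and } c_2 = 4.
\]
```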