Complexity Measures for Parallel Computation


Several possible models!
- Execution time and parallelism: Work / Span Model
- Total cost of moving data: Communication Volume Model
- Detailed models that try to capture time for moving data:
  - Latency / Bandwidth Model (for message-passing)
  - Cache Memory Model (for hierarchical memory)
- Other detailed models we won't discuss: LogP, UMH, …

Work / Span Model (from Multithreaded Programming in Cilk, Lecture 1, July 13, 2006)
- tp = execution time on p processors
- t1 = work
- t∞ = span*
- WORK LAW: tp ≥ t1/p
- SPAN LAW: tp ≥ t∞
* Also called critical-path length or computational depth.
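A minimal numeric sketch of the two laws (the work and span values below are made up for illustration): the best possible tp on p processors is max(t1/p, t∞).

```python
def time_lower_bound(work, span, p):
    """Best possible running time on p processors under the Work and Span Laws."""
    return max(work / p, span)  # tp >= t1/p and tp >= t_inf

# Hypothetical computation: work t1 = 1000 units, span t_inf = 50 units.
t1, t_inf = 1000, 50
for p in (1, 4, 16, 64, 256):
    print(f"p = {p:3d}: tp >= {time_lower_bound(t1, t_inf, p):.1f}")
# Beyond p = t1/t_inf = 20 processors, the span term dominates the bound.
```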

Series Composition (subcomputation A followed by B)
- Work: t1(A∪B) = t1(A) + t1(B)
- Span: t∞(A∪B) = t∞(A) + t∞(B)

Parallel Composition (A and B run in parallel)
- Work: t1(A∪B) = t1(A) + t1(B)
- Span: t∞(A∪B) = max{t∞(A), t∞(B)}
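To see how these composition rules are applied, here is a minimal sketch (a made-up divide-and-conquer computation, not one from the slides) that computes work and span recursively: a problem of size n splits into two halves solved in parallel, followed by a unit-cost combine step.

```python
def work_and_span(n):
    """Work and span of a hypothetical divide-and-conquer computation:
    two halves run in parallel (parallel composition), then a unit-cost
    combine step runs after them (series composition)."""
    if n <= 1:
        return 1, 1                      # base case: one unit of work, depth 1
    w_half, s_half = work_and_span(n // 2)
    # Parallel composition of the two halves:
    w_par = w_half + w_half              # work adds
    s_par = max(s_half, s_half)          # span is the max
    # Series composition with the combine step (1 unit of work, 1 of span):
    return w_par + 1, s_par + 1

w, s = work_and_span(1024)
print(f"work = {w}, span = {s}, parallelism = {w / s:.0f}")
# For n = 1024 this gives work = 2n - 1 and span = log2(n) + 1.
```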

Speedup
Def. t1/tp = speedup on p processors.
If t1/tp = Θ(p), we have linear speedup;
  = p, we have perfect linear speedup;
  > p, we have superlinear speedup (which is not possible in this model, because of the Work Law tp ≥ t1/p).

Parallelism
Because the Span Law requires tp ≥ t∞, the maximum possible speedup is
t1/t∞ = (potential) parallelism = the average amount of work per step along the span.
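A small sketch (with made-up numbers) of how the two laws cap achievable speedup: on p processors speedup is at most min(p, t1/t∞), so it saturates at the parallelism.

```python
def max_speedup(t1, t_inf, p):
    """Upper bound on speedup: the Work Law limits it to p,
    the Span Law limits it to t1/t_inf (the parallelism)."""
    return min(p, t1 / t_inf)

t1, t_inf = 1000, 50          # hypothetical work and span
for p in (2, 8, 20, 100):
    print(f"p = {p:3d}: speedup <= {max_speedup(t1, t_inf, p):.0f}")
# Speedup saturates at the parallelism t1/t_inf = 20, no matter how many
# processors are added.
```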

Communication Volume Model
- Network of p processors, each with local memory
- Message-passing
- Communication volume (v): total size (in words) of all messages passed during the computation
- Broadcasting one word costs volume p (actually, p-1)
- No explicit accounting for communication time
- Thus, can't really model parallel efficiency or speedup; for that, we'd use the latency/bandwidth model (see next slide)
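A minimal sketch of counting volume under this model (the distribution scheme and sizes are made-up examples, not from the slides):

```python
def broadcast_volume(words, p):
    """Volume of broadcasting `words` words from one processor to the
    other p-1 processors: each of the p-1 recipients gets every word."""
    return words * (p - 1)

def block_distribute_volume(n, p):
    """Volume of scattering an n-word array from processor 0 so that each
    of the other p-1 processors receives one block of n/p words."""
    return (p - 1) * (n // p)

p, n = 64, 1_000_000
print("broadcast 1 word: ", broadcast_volume(1, p))      # p-1 = 63
print("broadcast n words:", broadcast_volume(n, p))
print("scatter n words:  ", block_distribute_volume(n, p))
```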

Complexity Measures for Parallel Computation
Problem parameters:
- n: index of problem size
- p: number of processors
Algorithm parameters:
- tp: running time on p processors
- t1: time on 1 processor = sequential time = "work"
- t∞: time on unlimited processors = critical-path length = "span"
- v: total communication volume
Performance measures:
- speedup s = t1 / tp
- efficiency e = t1 / (p*tp) = s / p
- (potential) parallelism pp = t1 / t∞
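The performance measures above are simple ratios; a minimal helper (with hypothetical measured times) that computes them:

```python
def performance_measures(t1, tp, t_inf, p):
    """Speedup, efficiency, and potential parallelism from measured times."""
    s = t1 / tp                 # speedup
    e = t1 / (p * tp)           # efficiency = s / p
    pp = t1 / t_inf             # (potential) parallelism
    return s, e, pp

# Hypothetical measurements: t1 = 120 s sequentially, tp = 10 s on p = 16, span 2 s.
s, e, pp = performance_measures(t1=120.0, tp=10.0, t_inf=2.0, p=16)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}, parallelism = {pp:.0f}")
# -> speedup = 12.0, efficiency = 0.75, parallelism = 60
```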

Laws of Parallel Complexity
- Work law: tp ≥ t1 / p
- Span law: tp ≥ t∞
- Amdahl's law: if a fraction f, between 0 and 1, of the work must be done sequentially, then speedup ≤ 1 / f
- Exercise: prove Amdahl's law from the span law.
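A quick numeric check of Amdahl's law (the sequential fraction and processor counts below are made up; the exercise of deriving the law from the span law is left as stated):

```python
def amdahl_speedup(f, p):
    """Speedup on p processors when a fraction f of the work is sequential
    and the remaining (1-f) parallelizes perfectly."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.10                       # 10% of the work is sequential
for p in (2, 8, 64, 1024):
    print(f"p = {p:4d}: speedup = {amdahl_speedup(f, p):.2f}")
print(f"limit as p -> infinity: {1.0 / f:.1f}")   # the 1/f bound from the slide
```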

Detailed complexity measures for data movement I: Latency/Bandwidth Model
Moving data between processors by message-passing.
Machine parameters:
- α or tstartup: latency (message startup time, in seconds)
- β or tdata: inverse bandwidth (in seconds per word)
- Between nodes of Triton, α ≈ 2.2 × 10⁻⁶ and β ≈ 6.4 × 10⁻⁹
Time to send & receive or broadcast a message of w words: α + w*β
- tcomm: total communication time
- tcomp: total computation time
Total parallel time: tp = tcomp + tcomm
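A minimal sketch plugging in the Triton numbers from the slide, showing why a few large messages beat many small ones (the message sizes are made-up examples):

```python
ALPHA = 2.2e-6     # latency in seconds (Triton, from the slide)
BETA = 6.4e-9      # inverse bandwidth in seconds per word (Triton, from the slide)

def message_time(words, alpha=ALPHA, beta=BETA):
    """Time to send and receive one message of `words` words: alpha + w*beta."""
    return alpha + words * beta

n = 1_000_000                                   # total words to move
one_big = message_time(n)                       # one message of n words
many_small = 1000 * message_time(n // 1000)     # 1000 messages of n/1000 words each
print(f"one message of {n} words:      {one_big * 1e3:.2f} ms")
print(f"1000 messages of {n // 1000} words: {many_small * 1e3:.2f} ms")
# The many-small case pays the latency alpha 1000 times.
```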

Detailed complexity measures for data movement II: Cache Memory Model
Moving data between cache and memory on one processor:
- Assume just two levels in the memory hierarchy, fast and slow
- All data initially in slow memory
- m = number of memory elements (words) moved between fast and slow memory
- tm = time per slow memory operation
- f = number of arithmetic operations
- tf = time per arithmetic operation, tf << tm
- q = f / m = average number of flops per slow memory access
Minimum possible time = f * tf, when all data is in fast memory
Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
Larger q means time closer to the minimum f * tf
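A minimal sketch of the formula above (the machine parameters and q values are made-up illustrations; a real analysis would derive q from the algorithm itself):

```python
T_F = 1e-9        # time per flop (hypothetical machine)
T_M = 100e-9      # time per slow-memory access (hypothetical machine)

def cache_model_time(f, q, tf=T_F, tm=T_M):
    """Actual time = f*tf*(1 + (tm/tf)*(1/q)); the minimum possible time is f*tf."""
    return f * tf * (1.0 + (tm / tf) / q)

f = 1e9                         # one billion flops
for q in (1, 2, 10, 100, 1000):
    t = cache_model_time(f, q)
    print(f"q = {q:5d}: time = {t:.3f} s  (minimum f*tf = {f * T_F:.3f} s)")
# Only when q greatly exceeds tm/tf = 100 does the time approach the minimum.
```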