Programming for Performance CS433 Spring 2001 Laxmikant Kale.

2 Causes of performance loss If each processor is rated at k MFLOPS and there are p processors, why don't we see k*p MFLOPS of performance? –There are several causes –Each must be understood separately –But they interact with each other in complex ways: the solution to one problem may create another, and one problem may mask another, which then manifests itself under other conditions (e.g., increased p).

3 Causes Sequential: cache performance Communication overhead Algorithmic overhead (“extra work”) Speculative work Load imbalance (Long) Critical paths Bottlenecks

4 Algorithmic overhead Parallel algorithms may have a higher operation count Example: parallel prefix (also called "scan") –How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++) B[i] = B[i-1] + A[i];

5 Parallel Prefix: continued How to do this operation in parallel? –Seems inherently sequential –Recursive doubling algorithm –Operation count: log(P) * N A better algorithm: –Take the blocking of data into account –Each processor calculates its local sum, then participates in a parallel algorithm to get the sum of everything to its left, and then adds that to all its elements –N + log(P) + N: a doubling of the op. count
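For concreteness, here is a minimal sketch of the block-based algorithm just described, in C with MPI (my own illustration, not code from the course). It assumes each processor already owns a contiguous block of n elements in a[]; MPI_Exscan supplies exactly the "sum of everything to my left" that the slide refers to.

#include <mpi.h>

/* Block-based parallel prefix: about n + n local additions per processor plus
   an O(log P) combining step, matching the N + log(P) + N count on the slide. */
void block_prefix(const double *a, double *b, int n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Phase 1: local prefix sums. */
    b[0] = a[0];
    for (int i = 1; i < n; i++)
        b[i] = b[i-1] + a[i];

    /* Phase 2: sum of all elements owned by processors to my left. */
    double my_sum = b[n-1], left_sum = 0.0;
    MPI_Exscan(&my_sum, &left_sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 3: add that offset to every local element. (On rank 0 there is
       nothing to the left, and MPI_Exscan leaves its receive buffer undefined.) */
    if (rank > 0)
        for (int i = 0; i < n; i++)
            b[i] += left_sum;
}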

6 Bottleneck Consider the "primes" program (or the "pi" program) –What happens when we run it on 1000 PEs? How to eliminate bottlenecks: –Two structures are useful in most such cases: Spanning trees: organize processors in a tree Hypercube-based dimensional exchange
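To make the spanning-tree idea concrete: instead of every PE sending its partial result (e.g., its count of primes) straight to PE 0, partial results are combined pairwise up a tree in O(log P) steps. A hand-rolled sketch in C/MPI follows (my illustration; MPI_Reduce provides the same functionality internally):

#include <mpi.h>

/* Combine per-processor partial counts up a binomial spanning tree rooted at
   rank 0, so no single processor receives more than O(log P) messages. */
long tree_reduce_count(long my_count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    long total = my_count;
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {
            /* Internal node at this level: receive a child's subtotal, if any. */
            int child = rank + step;
            if (child < size) {
                long partial;
                MPI_Recv(&partial, 1, MPI_LONG, child, 0, comm, MPI_STATUS_IGNORE);
                total += partial;
            }
        } else {
            /* Leaf at this level: pass my subtotal to my parent and stop. */
            MPI_Send(&total, 1, MPI_LONG, rank - step, 0, comm);
            break;
        }
    }
    return total;   /* the grand total is meaningful only on rank 0 */
}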

7 Communication overhead Components: –per message and per byte –sending, receiving and network –capacity constraints Grainsize analysis: –How much computation per message –Computation-to-communication ratio

8 Communication overhead examples Usually, must reorganize data or work to reduce communication Combining communication also helps Examples:

9 Communication overhead Communication delay: the time interval between the send on one processor and the receipt on another: time = a + b*N for an N-byte message Communication overhead: the time a processor is held up (both the sender and the receiver are held up): again of the form a + b*N Typical values: a: on the order of microseconds; b: 2-10 ns per byte
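Plugging in representative numbers (a = 10 µs and b = 2 ns/byte, the values used in the molecular dynamics example later in the deck; the message sizes are illustrative):

\[ T(N) = a + bN: \qquad T(100\ \text{bytes}) \approx 10 + 0.2 = 10.2\ \mu\text{s}, \qquad T(10^6\ \text{bytes}) \approx 10 + 2000 \approx 2000\ \mu\text{s}. \]

Short messages are dominated by the per-message term a, and long messages by the per-byte term b; this is the alpha-vs-beta distinction discussed a few slides below.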

10 Grainsize control A simple definition of grainsize: –Amount of computation per message –Problem: does not distinguish short messages from long ones A more realistic measure: –Computation-to-communication ratio –computation time / (a + b*N) for one message

11 Example: matrix multiplication How to parallelize this?
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)      /* C[i][j] assumed initialized to 0 */
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

12 A simple algorithm: Distribute A by rows, B by columns –So, any processor can request a row of A and get it (in two messages); same for a column of B –Distribute the work of computing each element of C using some load balancing scheme So it works even on machines with varying processor capabilities (e.g., timeshared clusters) –What is the computation-to-communication ratio? For each object: 2N ops, 2 messages with N bytes each; ratio = 2N*t_op / (2a + 2b*N) = 2N*0.01 / (2*10 + 2*0.002*N) (times in microseconds)
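Evaluating the slide's expression at an illustrative problem size, N = 1000 (the 0.01 µs per op, 10 µs per message, and 0.002 µs per byte figures are the ones already on the slide):

\[ \frac{2N \cdot 0.01}{2\cdot 10 + 2\cdot 0.002\,N} = \frac{20\ \mu\text{s}}{24\ \mu\text{s}} \approx 0.83, \]

i.e., each object spends roughly as much time communicating as computing; this motivates the blocked algorithm on the next slide.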

13 A better algorithm: Store A as a collection of row-bunches –each bunch stores g rows –Same for B's columns Each object now computes a g x g section of C Computation-to-communication ratio: –2*g*g*N ops –2 messages, g*N bytes each –alpha ratio: 2*g*g*N / 2 = g*g*N; beta ratio: g
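Spelling out the grainsize measure of slide 10 for this blocked decomposition (my arithmetic, with the same t_op, alpha, and beta as before):

\[ \text{ratio} = \frac{2g^2N\,t_{op}}{2\alpha + 2gN\beta} \;\approx\; \frac{g\,t_{op}}{\beta} \quad \text{once } gN\beta \gg \alpha, \]

so the ratio grows roughly linearly with the bunch size g: larger blocks amortize both the per-message and the per-byte costs.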

14 Alpha vs beta The per-message cost is significantly larger than the per-byte cost –by a factor of several thousand –So, several optimizations are possible that trade off a larger beta cost in return for a smaller alpha cost –I.e., send fewer messages –Applications of this idea: message combining, and complex communication patterns (each-to-all, ...), examined in the last two lectures

15 Programming for performance: steps Select/design a parallel algorithm Decide on the decomposition Select a load balancing strategy Plan the communication structure Examine synchronization needs –global synchronizations, critical paths

16 Design Philosophy: Parallel Algorithm design: –Ensure good performance (total op count) –Generate sufficient parallelism –Avoid/minimize “extra work” Decomposition: –Break into many small pieces: Smallest grain that sufficiently amortizes overhead

17 Design principles: contd. Load balancing –Select a static, dynamic, or quasi-dynamic strategy Measurement-based vs. prediction-based load estimation –Principle: it is acceptable to let a processor idle, but avoid overloading any one (think about why) Reduce communication overhead –Algorithmic reorganization (change the mapping) –Message combining –Use efficient communication libraries

18 Design principles: Synchronization Eliminate unnecessary global synchronization –If T(i,j) is the time spent in the i'th phase on the j'th PE: With synchronization: sum over i of ( max over j of T(i,j) ) Without: max over j of ( sum over i of T(i,j) ) Critical paths: –Look for long chains of dependences Draw timeline pictures with the dependences
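A compact way to state the two expressions on the slide (my notation), which also shows why removing unnecessary barriers can only help:

\[ T_{\text{sync}} \;=\; \sum_i \max_j T(i,j) \;\ge\; \max_j \sum_i T(i,j) \;=\; T_{\text{no-sync}}, \]

with equality only when the same processor is the slowest in every phase; otherwise, letting phases overlap allows delays on different processors to hide one another.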

19 Diagnosing performance problems Tools: –Back-of-the-envelope (i.e., simple) analysis –Post-mortem analysis with performance logs Visualization of performance data Automatic analysis Phase-by-phase analysis (a program may have many phases) –What to measure: load distribution, (communication) overhead, idle time Their averages, max/min, and variances Profiling: time spent in individual modules/subroutines

20 Diagnostic techniques Tell-tale signs: –max load >> average, and the number of PEs above the average is >> 1: load imbalance –max load >> average, and the number of PEs above the average is ~ 1: possible bottleneck (if there is a dependence) –profile shows the total time in routine f increasing with the number of PEs: algorithmic overhead –Communication overhead: obvious

21 Communication Optimization Example problem from an earlier lecture: Molecular Dynamics –Each processor, assumed to house just one cell, needs to send 26 short messages to "neighboring" processors –Assume, for each send and each receive: alpha = 10 us, beta = 2 ns/byte –Time spent (notice: 26 sends and 26 receives): 26 * 2 * 10 us = 520 us –If there is more than one cell on each PE, multiply this number accordingly! –Can this be improved? How?

22 Message combining If there are multiple cells per processor: –Neighbors of a cell may be on the same neighboring processor –Neighbors of two different cells may be on the same processor –So, combine messages going to the same processor

23 Communication Optimization I Take advantage of the structure of the communication, and do the communication in stages: –If my coordinates are (x,y,z): Send to (x+1,y,z) anything that goes to (x+1,*,*) Send to (x-1,y,z) anything that goes to (x-1,*,*) Wait for the messages from the x neighbors, then send the y neighbors a combined message (and then likewise in z) –A total of 6 messages instead of 26 –Apparently a longer critical path
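A skeleton of this staged, dimension-by-dimension exchange in C/MPI (my sketch, not the course code). It assumes a 3-D Cartesian communicator and, for simplicity, fixed-size buffers; a real implementation would pack the actual particle data and, between stages, repack anything that still has to travel in a later dimension.

#include <mpi.h>

#define BUFBYTES 4096   /* illustrative fixed message size */

/* Six messages per processor in total: one exchange with the -1 neighbor and
   one with the +1 neighbor in each of the three dimensions. 'cart' must be a
   communicator created with MPI_Cart_create over a 3-D grid. */
void staged_exchange(MPI_Comm cart,
                     char *send_plus, char *send_minus,
                     char *recv_plus, char *recv_minus)
{
    for (int dim = 0; dim < 3; dim++) {          /* x, then y, then z */
        int minus, plus;
        MPI_Cart_shift(cart, dim, 1, &minus, &plus);

        /* Everything headed in the +dim direction goes to the +1 neighbor,
           everything headed in the -dim direction to the -1 neighbor. */
        MPI_Sendrecv(send_plus,  BUFBYTES, MPI_BYTE, plus,  0,
                     recv_minus, BUFBYTES, MPI_BYTE, minus, 0,
                     cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_minus, BUFBYTES, MPI_BYTE, minus, 1,
                     recv_plus,  BUFBYTES, MPI_BYTE, plus,  1,
                     cart, MPI_STATUS_IGNORE);

        /* A real implementation would now merge the received data into the
           send buffers for the next dimension; that forwarding step is what
           makes the 6-message total possible. */
    }
}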

24 Communication Optimization II Send all migrating atoms to processor 0 –Let processor 0 sort them out and send 1 message to each processor –Works OK if the number of processors is small –Otherwise, processor 0 becomes a bottleneck

25 Communication Optimization III Generalized problem: –Each-to-all, individualized messages –Apply all previously learned techniques

26 Example: each-to-all communication –Each processor wants to send N bytes, a distinct message, to each other processor –Simple implementation: alpha*P + N*beta*P –Typical values?
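With the per-message and per-byte costs used earlier in the deck (alpha = 10 µs, beta = 2 ns/byte) and illustrative values P = 1000 processors and N = 100 bytes (my choices, not from the slides), the simple implementation costs each processor roughly

\[ \alpha P + N\beta P = 1000\cdot 10\ \mu\text{s} + 1000\cdot 100\cdot 0.002\ \mu\text{s} \approx 10.2\ \text{ms}, \]

almost all of it per-message (alpha) cost; this is exactly the regime where the combining and staging techniques of the preceding slides pay off.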

27 Intro to Load Balancing Example: 500 processors, units of work What should the objective of load balancing be?
