Introduction to Parallel Programming

Presentation transcript:

Introduction to Parallel Programming
Language notation: message passing
5 parallel algorithms of increasing complexity:
– Matrix multiplication
– Successive overrelaxation
– All-pairs shortest paths
– Linear equations
– Search problem

Message Passing
SEND (destination, message)
– blocking: wait until message has arrived
– non blocking: continue immediately
RECEIVE (source, message)
RECEIVE-FROM-ANY (message)
– blocking: wait until message is available
– non blocking: test if message is available
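The four send/receive variants can be tried out on a single machine. The sketch below is only an illustration, not the notation used on the slides: the `Channel` class, its `blocking` flags, and the use of `None` to signal "no message available" are assumptions made for this example, built on Python's standard `queue` and `threading` modules.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way mailbox between two threads."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, message, blocking=False):
        # Each message carries an event that the receiver sets on arrival.
        arrived = threading.Event()
        self._q.put((message, arrived))
        if blocking:
            arrived.wait()          # blocking SEND: wait until message has arrived

    def receive(self, blocking=True):
        try:
            # blocking RECEIVE waits; non-blocking RECEIVE only tests.
            message, arrived = self._q.get(block=blocking)
        except queue.Empty:
            return None             # assumption: None means "nothing available"
        arrived.set()
        return message


if __name__ == "__main__":
    ch = Channel()
    print(ch.receive(blocking=False))   # nothing sent yet
    ch.send("row 1")                    # non-blocking send returns at once
    print(ch.receive())                 # blocking receive gets the message
```

A blocking send only returns once some other thread has received the message, so it must be paired with a receiver running concurrently.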

Parallel Matrix Multiplication
Given two N x N matrices A and B
Compute C = A x B
$C_{ij} = A_{i1} B_{1j} + A_{i2} B_{2j} + \dots + A_{iN} B_{Nj}$

Sequential Matrix Multiplication
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
        }
The order of the operations is overspecified
Every element of C can be computed independently, in parallel

Parallel Algorithm 1
Each processor computes 1 element of C
Requires N^2 processors
Each processor needs 1 row of A and 1 column of B as input

Parallel Algorithm 1
Master (processor 0):
    int p = 1;                              /* next slave to hand a job to */
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            SEND(p++, A[i,*], B[*,j], i, j);
    for (x = 1; x <= N*N; x++) {
        RECEIVE_FROM_ANY(&result, &i, &j);
        C[i,j] = result;
    }
Slaves:
    int Aix[N], Bxj[N], Cij;
    RECEIVE(0, &Aix, &Bxj, &i, &j);
    Cij = 0;
    for (k = 1; k <= N; k++)
        Cij += Aix[k] * Bxj[k];
    SEND(0, Cij, i, j);

Parallel Algorithm 2
Each processor computes 1 row (N elements) of C
Requires N processors
Each processor needs the entire B matrix and 1 row of A as input

Parallel Algorithm 2
Master (processor 0):
    for (i = 1; i <= N; i++)
        SEND(i, A[i,*], B[*,*], i);
    for (x = 1; x <= N; x++) {
        RECEIVE_FROM_ANY(&result, &i);
        C[i,*] = result[*];
    }
Slaves:
    int Aix[N], B[N,N], C[N];
    RECEIVE(0, &Aix, &B, &i);
    for (j = 1; j <= N; j++) {
        C[j] = 0;
        for (k = 1; k <= N; k++)
            C[j] += Aix[k] * B[k,j];        /* row i of A times column j of B */
    }
    SEND(0, C[*], i);

Problem: need larger granularity
So far, each parallel task needs as much communication as computation
Assumption: N >> P (i.e. we solve a large problem)
Assign many rows to each processor

Parallel Algorithm 3
Each processor computes N/P rows of C
Each processor needs the entire B matrix and N/P rows of A as input

Parallel Algorithm 3
Master (processor 0):
    int result[N/nprocs, N];
    int inc = N/nprocs;                     /* number of rows per cpu */
    int lb = 1;
    for (i = 1; i <= nprocs; i++) {
        SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
        lb += inc;
    }
    for (x = 1; x <= nprocs; x++) {
        RECEIVE_FROM_ANY(&result, &lb);
        for (i = 1; i <= N/nprocs; i++)
            C[lb+i-1, *] = result[i, *];
    }
Slaves:
    int A[N/nprocs, N], B[N,N], C[N/nprocs, N];
    RECEIVE(0, &A, &B, &lb, &ub);
    for (i = 1; i <= ub-lb+1; i++)          /* local row index */
        for (j = 1; j <= N; j++) {
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
        }
    SEND(0, C[*,*], lb);
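The row-block scheme of Algorithm 3 can be sketched in Python. This is an illustration under stated assumptions rather than the deck's implementation: a thread pool stands in for the slave processors, the names `multiply_rows` and `parallel_matmul` are made up for this example, and indices are 0-based. As on the slide, `nprocs` is assumed to divide N.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Work of one slave: multiply its block of rows of A by all of B."""
    n_inner, n_cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(n_inner)) for j in range(n_cols)]
            for row in a_rows]


def parallel_matmul(a, b, nprocs):
    """Master: hand out N/nprocs rows per worker and collect the row blocks."""
    n = len(a)
    if n % nprocs != 0:
        raise ValueError("this sketch assumes nprocs divides N")
    inc = n // nprocs                       # number of rows per worker
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        # map() returns the blocks in submission order, so no lb tag is needed
        blocks = list(pool.map(lambda lb: multiply_rows(a[lb:lb + inc], b),
                               range(0, n, inc)))
    return [row for block in blocks for row in block]


if __name__ == "__main__":
    print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], nprocs=2))
```

In CPython, threads do not speed up pure-Python arithmetic; the sketch shows the decomposition, not the performance.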

Comparison
If N >> P, algorithm 3 will have low communication overhead
Its grain size is large

Algorithm   Parallelism (#jobs)   Communication per job    Computation per job   Ratio (comp/comm)
1           N^2                   N + N + 1                N                     O(1)
2           N                     N + N^2 + N              N^2                   O(1)
3           P                     N^2/P + N^2 + N^2/P      N^3/P                 O(N/P)
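To get a feel for the ratios in the comparison, the per-job costs can be evaluated for concrete N and P. The sketch below takes the communication and computation counts from the comparison of the three algorithms; the function names and the sample values N = 1000, P = 10 are assumptions chosen for illustration.

```python
def costs(n, p):
    """Per-job communication and computation counts for algorithms 1 to 3."""
    return {
        1: {"jobs": n * n, "comm": n + n + 1, "comp": n},
        2: {"jobs": n, "comm": n + n * n + n, "comp": n * n},
        3: {"jobs": p, "comm": n * n / p + n * n + n * n / p, "comp": n ** 3 / p},
    }


def ratio(n, p, algorithm):
    """Computation divided by communication for one job."""
    c = costs(n, p)[algorithm]
    return c["comp"] / c["comm"]


if __name__ == "__main__":
    for alg in (1, 2, 3):
        print(alg, round(ratio(1000, 10, alg), 2))
```

For these sample values the ratio stays below 1 for algorithms 1 and 2 and is on the order of N/P for algorithm 3.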

Discussion
Matrix multiplication is trivial to parallelize
Getting good performance is a problem
Need right grain size
Need large input problem

Successive Overrelaxation (SOR)
Iterative method for solving Laplace equations
Repeatedly updates elements of a grid
    float G[1:N, 1:M], Gnew[1:N, 1:M];
    for (step = 0; step < NSTEPS; step++) {
        for (i = 2; i < N; i++)             /* update grid */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
        G = Gnew;
    }

SOR example

Parallelizing SOR
Domain decomposition on the grid
Each processor owns N/P rows
Need communication between neighbors to exchange elements at processor boundaries

SOR example partitioning

Communication scheme
Each CPU communicates with its left and right neighbor (if it exists)

Parallel SOR
    float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
    for (step = 0; step < NSTEPS; step++) {
        if (cpuid > 0)        SEND(cpuid-1, G[lb]);         /* send 1st row left */
        if (cpuid < nprocs-1) SEND(cpuid+1, G[ub]);         /* send last row right */
        if (cpuid > 0)        RECEIVE(cpuid-1, G[lb-1]);    /* receive from left */
        if (cpuid < nprocs-1) RECEIVE(cpuid+1, G[ub+1]);    /* receive from right */
        for (i = MAX(lb, 2); i <= MIN(ub, N-1); i++)        /* update my rows, skip grid boundary */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
        G = Gnew;
    }
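The boundary-row exchange can be checked against the sequential version with a small single-process simulation. This is a sketch under assumptions, not the deck's code: the update function `f` is not defined on the slides, so a relaxed average of the four neighbors is assumed here, the "processors" are simulated one after another, indices are 0-based, and `nprocs` is assumed to divide the number of rows.

```python
def f(c, up, down, left, right, omega=1.0):
    """Assumed update rule: relax the cell towards the average of its neighbors."""
    return c + omega * ((up + down + left + right) / 4.0 - c)


def sor_sequential(grid, nsteps):
    g = [row[:] for row in grid]
    n, m = len(g), len(g[0])
    for _ in range(nsteps):
        new = [row[:] for row in g]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                new[i][j] = f(g[i][j], g[i-1][j], g[i+1][j], g[i][j-1], g[i][j+1])
        g = new
    return g


def sor_partitioned(grid, nsteps, nprocs):
    n, m = len(grid), len(grid[0])
    inc = n // nprocs                                   # rows owned per processor
    blocks = [[row[:] for row in grid[p*inc:(p+1)*inc]] for p in range(nprocs)]
    for _ in range(nsteps):
        # Exchange phase: every processor obtains a copy of its neighbors' edge rows.
        above = [blocks[p-1][-1][:] if p > 0 else None for p in range(nprocs)]
        below = [blocks[p+1][0][:] if p < nprocs - 1 else None for p in range(nprocs)]
        # Compute phase: each processor updates only the rows it owns.
        for p in range(nprocs):
            own = blocks[p]
            padded = [above[p]] + own + [below[p]]      # padded[r+1] is own row r
            new = [row[:] for row in own]
            for r in range(inc):
                if p * inc + r in (0, n - 1):           # fixed grid boundary rows
                    continue
                for j in range(1, m - 1):
                    new[r][j] = f(padded[r+1][j], padded[r][j], padded[r+2][j],
                                  padded[r+1][j-1], padded[r+1][j+1])
            blocks[p] = new
    return [row for block in blocks for row in block]


if __name__ == "__main__":
    grid = [[1.0] * 6] + [[0.0] * 6 for _ in range(7)]  # hot top edge, cold elsewhere
    print(sor_partitioned(grid, 20, 4) == sor_sequential(grid, 20))
```

Because each simulated processor sees exactly the neighbor rows from the previous step, the partitioned run performs the same arithmetic as the sequential one.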

Performance of SOR
Communication and computation during each iteration:
– Each processor sends/receives 2 messages with M reals
– Each processor computes N/P * M updates
The algorithm will have good performance if
– Problem size is large: N >> P
– Message exchanges can be done in parallel

All-Pairs Shortest Paths (ASP)
Given a graph G with a distance table C:
C[i,j] = length of direct path from node i to node j
Compute length of shortest path between any two nodes in G

Floyd's Sequential Algorithm
Basic step:
    for (k = 1; k <= N; k++)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);

Parallelizing ASP
Distribute rows of C among the P processors
During iteration k, each processor executes
    C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
on its own rows i, so it needs these rows and row k
Before iteration k, the processor owning row k sends it to all the others

Parallel ASP Algorithm
    int lb, ub;                             /* lower/upper bound for this CPU */
    int rowK[N], C[lb:ub, N];               /* pivot row; matrix */
    for (k = 1; k <= N; k++) {
        if (k >= lb && k <= ub) {           /* do I have it? */
            rowK = C[k,*];
            for (p = 1; p <= nproc; p++)    /* broadcast row */
                if (p != myprocid)
                    SEND(p, rowK);
        } else {
            RECEIVE_FROM_ANY(&rowK);        /* receive row */
        }
        for (i = lb; i <= ub; i++)          /* update my rows */
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
    }

Performance Analysis ASP
Per iteration:
– 1 CPU sends P-1 messages with N integers
– Each CPU does N/P x N comparisons
Communication/computation ratio is small if N >> P

... but is the algorithm correct?

Non-FIFO Message Ordering
Row 2 may be received before row 1

FIFO Ordering
Row 5 may be received before row 4 (messages from different senders can still overtake each other)

Correctness
Problems:
– Asynchronous non-FIFO SEND
– Messages from different senders may overtake each other
Solutions:
– Synchronous SEND (less efficient)
– Barrier at the end of outer loop (extra communication)
– Order incoming messages (requires buffering)
– RECEIVE (cpu, msg) (more complicated)
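The "order incoming messages" solution can be sketched with a small simulation. Everything here is an assumption made for illustration: the function names, the random delivery order that models a non-FIFO network, the 0-based indices, and the block distribution of rows. Each simulated CPU tags pivot rows with their iteration number k and buffers rows that arrive early.

```python
import random

INF = float("inf")


def floyd(dist):
    """Sequential Floyd, used as the reference result."""
    n = len(dist)
    c = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = min(c[i][j], c[i][k] + c[k][j])
    return c


def parallel_asp(dist, nprocs, seed=0):
    """Row-partitioned ASP where pivot rows are delivered in random order."""
    rng = random.Random(seed)
    n = len(dist)
    owner = [k * nprocs // n for k in range(n)]         # block distribution
    rows = [{i: dist[i][:] for i in range(n) if owner[i] == p} for p in range(nprocs)]
    step = [0] * nprocs                                 # current iteration per CPU
    inbox = [dict() for _ in range(nprocs)]             # buffered rows, keyed by k
    network = []                                        # in-flight (dest, k, row)
    while any(s < n for s in step):
        for p in range(nprocs):
            k = step[p]
            if k == n:
                continue
            if owner[k] == p:                           # I own row k: broadcast it
                row_k = rows[p][k][:]
                network.extend((q, k, row_k) for q in range(nprocs) if q != p)
            elif k in inbox[p]:                         # take exactly row k
                row_k = inbox[p].pop(k)
            else:
                continue                                # row k not here yet: wait
            for row in rows[p].values():                # update my rows
                for j in range(n):
                    row[j] = min(row[j], row[k] + row_k[j])
            step[p] += 1
        if network:                                     # deliver one random message
            dest, k, row = network.pop(rng.randrange(len(network)))
            inbox[dest][k] = row                        # buffer it under its tag k
    result = [None] * n
    for p in range(nprocs):
        for i, row in rows[p].items():
            result[i] = row
    return result


if __name__ == "__main__":
    graph = [[0, 3, INF, 7], [8, 0, 2, INF], [5, INF, 0, 1], [2, INF, INF, 0]]
    print(parallel_asp(graph, 2) == floyd(graph))
```

Keying the buffer by k means a row that overtakes an earlier one simply waits in the inbox until its iteration comes up. The sketch assumes zero diagonal entries and non-negative edge lengths.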

Linear equations
Linear equations:
$a_{1,1} x_1 + a_{1,2} x_2 + \dots + a_{1,n} x_n = b_1$
$\vdots$
$a_{n,1} x_1 + a_{n,2} x_2 + \dots + a_{n,n} x_n = b_n$
Matrix notation: Ax = b
Problem: compute x, given A and b
Linear equations have many important applications
Practical applications need huge sets of equations

Solving a system of linear equations
Two phases:
– Upper-triangularization -> U x = y
– Back-substitution -> x
Most computation time is in upper-triangularization
Upper-triangular matrix:
– U[i,i] = 1
– U[i,j] = 0 if i > j

Sequential Gaussian elimination
    for (k = 1; k <= N; k++) {
        for (j = k+1; j <= N; j++)
            A[k,j] = A[k,j] / A[k,k];
        y[k] = b[k] / A[k,k];
        A[k,k] = 1;
        for (i = k+1; i <= N; i++) {
            for (j = k+1; j <= N; j++)
                A[i,j] = A[i,j] - A[i,k] * A[k,j];
            b[i] = b[i] - A[i,k] * y[k];
            A[i,k] = 0;
        }
    }
Converts Ax = b into Ux = y
Sequential algorithm uses 2/3 N^3 operations
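Both phases can be written out in a few lines of Python. The first function follows the elimination loop above with 0-based indices; the back-substitution function is not spelled out on the slides and is added here as an assumption. Like the slide code, this sketch does no pivoting, so every pivot A[k,k] is assumed to be nonzero.

```python
def upper_triangularize(a, b):
    """Convert Ax = b into Ux = y, with U unit upper triangular."""
    n = len(a)
    a = [row[:] for row in a]               # work on copies
    b = b[:]
    y = [0.0] * n
    for k in range(n):
        pivot = a[k][k]                     # assumed nonzero (no pivoting)
        for j in range(k + 1, n):
            a[k][j] /= pivot
        y[k] = b[k] / pivot
        a[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
            b[i] -= a[i][k] * y[k]
            a[i][k] = 0.0
    return a, y


def back_substitute(u, y):
    """Solve Ux = y from the last row upwards."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))
    return x


if __name__ == "__main__":
    u, y = upper_triangularize([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
    print(back_substitute(u, y))            # approximately [0.8, 1.4]
```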

Parallelizing Gaussian elimination
Row-wise partitioning scheme
– Each CPU gets one row (striping)
– Execute one (outer-loop) iteration at a time
Communication requirement:
– During iteration k, CPUs $P_{k+1} \dots P_{n-1}$ need part of row k
– This row is stored on CPU $P_k$ -> need partial broadcast (multicast)

Communication

Performance problems
Communication overhead (multicast)
Load imbalance
– CPUs $P_0 \dots P_k$ are idle during iteration k
In general, number of CPUs is less than n
Choice between block-striped and cyclic-striped distribution
– Block-striped distribution has high load imbalance
– Cyclic-striped distribution has less load imbalance
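The difference between the two distributions can be estimated with a simple work count. The model below is an assumption made for illustration: it only counts the inner-loop updates of rows below the pivot row, ignores communication and the normalization of the pivot row, and uses 0-based row numbers. The function names are hypothetical.

```python
def block_owner(n, nprocs):
    """Block-striped: consecutive rows go to the same CPU."""
    return lambda i: i * nprocs // n


def cyclic_owner(nprocs):
    """Cyclic-striped: rows are dealt out round-robin."""
    return lambda i: i % nprocs


def work_per_cpu(n, nprocs, owner):
    """Count elimination updates per CPU over all iterations k."""
    work = [0] * nprocs
    for k in range(n):
        for i in range(k + 1, n):           # only rows below the pivot row are updated
            work[owner(i)] += n - k - 1     # length of the inner j-loop
    return work


def imbalance(work):
    """Busiest CPU relative to the average (1.0 means perfectly balanced)."""
    return max(work) / (sum(work) / len(work))


if __name__ == "__main__":
    n, nprocs = 64, 4
    print("block :", round(imbalance(work_per_cpu(n, nprocs, block_owner(n, nprocs))), 3))
    print("cyclic:", round(imbalance(work_per_cpu(n, nprocs, cyclic_owner(nprocs))), 3))
```

Under this model the CPU holding the last block of rows stays busy until the end, while the CPU holding the first block runs out of work early; the cyclic distribution spreads the late rows over all CPUs.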

Block-striped distribution

Cyclic-striped distribution

A Search Problem
Given an array A[1..N] and an item x, check if x is present in A
    int present = false;
    for (i = 1; !present && i <= N; i++)
        if (A[i] == x)
            present = true;

Parallel Search on 2 CPUs
    int lb, ub;
    int A[lb:ub];
    for (i = lb; i <= ub; i++) {
        if (A[i] == x) {
            print("Found item");
            SEND(1-cpuid);                  /* send other CPU empty message */
            exit();
        }
        /* check message from other CPU: */
        if (NONBLOCKING_RECEIVE(1-cpuid))
            exit();
    }

Performance Analysis
How much faster is the parallel program than the sequential program for N=100?
1. if x not present => factor 2
2. if x present in A[1..50] => factor 1
3. if A[51] = x => factor 51
4. if A[75] = x => factor 3
In case 2 the parallel program does more work than the sequential program => search overhead
In cases 3 and 4 the parallel program does less work => negative search overhead
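The four factors follow from counting comparisons. The sketch below assumes a simplified cost model that is not stated on the slides: one time unit per comparison, no communication cost, CPU 0 searching the first half and CPU 1 the second half, and at most one occurrence of x. Positions are 1-based as on the slides, and `None` stands for "x not present".

```python
def sequential_steps(n, pos):
    """Comparisons until the sequential loop stops."""
    return n if pos is None else pos


def parallel_steps(n, pos):
    """Comparisons until both CPUs have stopped (2 CPUs, N/2 elements each)."""
    half = n // 2
    if pos is None:
        return half                 # both CPUs scan their whole half
    if pos <= half:
        return pos                  # CPU 0 finds x and stops CPU 1
    return pos - half               # CPU 1 finds x and stops CPU 0


def speedup(n, pos):
    return sequential_steps(n, pos) / parallel_steps(n, pos)


if __name__ == "__main__":
    for pos in (None, 30, 51, 75):
        print(pos, speedup(100, pos))
```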

Discussion
Several kinds of performance overhead
– Communication overhead
– Load imbalance
– Search overhead
Making algorithms correct is nontrivial
– Message ordering

Designing Parallel Algorithms
Source: Designing and Building Parallel Programs (Ian Foster, 1995)
– Partitioning
– Communication
– Agglomeration
– Mapping

Figure 2.1 from Foster's book

Partitioning
Domain decomposition
– Partition the data
– Partition computations on data (owner-computes rule)
Functional decomposition
– Divide computations into subtasks
– E.g. search algorithms

Communication
Analyze data dependencies between partitions
Use communication to transfer data
Many forms of communication, e.g.
– Local communication with neighbors (SOR)
– Global communication with all processors (ASP)
– Synchronous (blocking) communication
– Asynchronous (non blocking) communication

Agglomeration
Reduce communication overhead by
– increasing granularity
– improving locality

Mapping
On which processor to execute each subtask?
– Put concurrent tasks on different CPUs
– Put frequently communicating tasks on same CPU?
– Avoid load imbalances

Summary
Hardware and software models
Example applications
– Matrix multiplication - Trivial parallelism (independent tasks)
– Successive overrelaxation - Neighbor communication
– All-pairs shortest paths - Broadcast communication
– Linear equations - Load balancing problem
– Search problem - Search overhead
Designing parallel algorithms