Parallel Programming in C with MPI and OpenMP

Presentation transcript:

Parallel Programming in C with MPI and OpenMP, by Michael J. Quinn

Chapter 11: Matrix Multiplication

Outline
- Sequential algorithms
  - Iterative, row-oriented
  - Recursive, block-oriented
- Parallel algorithms
  - Rowwise block striped decomposition
  - Cannon's algorithm

Iterative, Row-oriented Algorithm
Computes C = AB as a series of inner product (dot product) operations: each element of C is the dot product of a row of A and a column of B.
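As a point of reference, a minimal C sketch of that loop nest (row-major n × n arrays; the function name and layout are illustrative, not Quinn's code):

```c
/* Iterative, row-oriented multiplication, C = A * B.
   Element c[i][j] is the dot product of row i of A and column j of B. */
void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)       /* inner (dot) product */
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}
```

Note that the innermost loop walks down a column of B with stride n, which is why the cache behavior discussed on the next slides matters.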

Performance as n Increases
[plot omitted: measured performance of the row-oriented algorithm, which falls off once n grows large]

Reason: Matrix B Gets Too Big for Cache
Computing a row of C requires accessing every element of B.

Block Matrix Multiplication
Partition A, B, and C into square sub-blocks and apply the usual multiplication formula at the block level:
- Replace scalar multiplication with matrix multiplication
- Replace scalar addition with matrix addition
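In symbols, a sketch for a 2 × 2 partition (any square partitioning works the same way):

```latex
\[
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
\begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix},
\qquad
C_{ij} = \sum_{k} A_{ik} B_{kj}
\]
```

with matrix products and matrix sums on the right-hand side.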

Recurse Until B Small Enough
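A hedged C sketch of the recursion (assumptions: n is a power of two, C is zeroed by the caller, and the THRESHOLD constant and stride-based block views are illustrative choices, not Quinn's code):

```c
#define THRESHOLD 32  /* base-case block size; tune so the B block fits in cache */

/* Recursive, block-oriented multiply-accumulate: C += A * B.
   Each argument is an n x n block inside a larger row-major array
   whose full row length (leading dimension) is ld. */
void matmul_rec(int n, int ld, const double *a, const double *b, double *c)
{
    if (n <= THRESHOLD) {                    /* B block small enough: iterate */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i*ld + j] += a[i*ld + k] * b[k*ld + j];
        return;
    }
    int h = n / 2;                           /* split each matrix into 2 x 2 blocks */
    const double *a00 = a, *a01 = a + h, *a10 = a + h*ld, *a11 = a + h*ld + h;
    const double *b00 = b, *b01 = b + h, *b10 = b + h*ld, *b11 = b + h*ld + h;
    double       *c00 = c, *c01 = c + h, *c10 = c + h*ld, *c11 = c + h*ld + h;

    /* Cij += Ai0 * B0j + Ai1 * B1j, exactly the block formula above */
    matmul_rec(h, ld, a00, b00, c00);  matmul_rec(h, ld, a01, b10, c00);
    matmul_rec(h, ld, a00, b01, c01);  matmul_rec(h, ld, a01, b11, c01);
    matmul_rec(h, ld, a10, b00, c10);  matmul_rec(h, ld, a11, b10, c10);
    matmul_rec(h, ld, a10, b01, c11);  matmul_rec(h, ld, a11, b11, c11);
}
```

On the full matrices this would be called as matmul_rec(n, n, a, b, c).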

Comparing Sequential Performance
[plot omitted: performance of the two sequential algorithms; per the summary, the block-oriented algorithm holds up better as n increases]

First Parallel Algorithm
Partitioning:
- Divide matrices into rows
- Each primitive task has corresponding rows of all three matrices
Communication:
- Each task must eventually see every row of B
- Organize tasks into a ring

First Parallel Algorithm (cont.)
Agglomeration and mapping:
- Fixed number of tasks, each requiring the same amount of computation
- Regular communication among tasks
- Strategy: assign each process a contiguous group of rows

Communication of B
[four animation frames omitted: at each step, every process multiplies its rows of A by the block of B it currently holds, then the blocks of B rotate one position around the ring; a C/MPI sketch of this exchange follows]
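A hedged sketch of the rowwise block-striped algorithm in C with MPI (assumptions: p divides n, each process already holds its n/p rows of a, b, and a zeroed c, and all names are illustrative):

```c
#include <mpi.h>

/* Each process owns n/p consecutive rows of A, B, and C.
   The rows of B rotate around a ring of processes; after p steps
   every process has multiplied its rows of A against all of B. */
void ring_matmul(int n, const double *a, double *b, double *c, MPI_Comm comm)
{
    int p, id;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &id);
    int rows = n / p;                        /* rows per process (assume p | n) */
    int succ = (id + 1) % p, pred = (id + p - 1) % p;

    for (int step = 0; step < p; step++) {
        /* The block of B held now was originally owned by process
           (id + step) mod p, so it pairs with columns kk.. of A. */
        int kk = ((id + step) % p) * rows;
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < rows; k++)
                for (int j = 0; j < n; j++)
                    c[i*n + j] += a[i*n + kk + k] * b[k*n + j];

        /* Pass our block of B to the predecessor, take the successor's. */
        MPI_Sendrecv_replace(b, rows * n, MPI_DOUBLE,
                             pred, 0, succ, 0, comm, MPI_STATUS_IGNORE);
    }
}
```

Sending to the predecessor and receiving from the successor means that after step s a process holds the block of B originally owned by process (id + s) mod p, which is what the kk offset accounts for.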

Complexity Analysis
- Algorithm has p iterations
- During each iteration a process multiplies an (n/p) × (n/p) block of A by an (n/p) × n block of B: Θ(n³/p²)
- Total computation time: Θ(n³/p)
- Each process ends up passing (p−1)n²/p = Θ(n²) elements of B
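Written out, the same counts are:

```latex
\[
T_{\mathrm{comp}} = p \cdot \Theta\!\left(\frac{n}{p}\cdot\frac{n}{p}\cdot n\right)
                  = \Theta\!\left(\frac{n^{3}}{p}\right),
\qquad
(p-1)\,\frac{n^{2}}{p} = \Theta(n^{2}) \ \text{elements of } B \text{ per process.}
\]
```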

Weakness of Algorithm 1
- Blocks of B being manipulated have p times more columns than rows
- Each process must access every element of matrix B
- Ratio of computations per communication is poor: only 2n/p

Parallel Algorithm 2 (Cannon's Algorithm)
- Associate a primitive task with each matrix element
- Agglomerate tasks responsible for a square (or nearly square) block of C
- Computation-to-communication ratio rises to n/√p

Elements of A and B Needed to Compute a Process's Portion of C
[side-by-side figure omitted: under Algorithm 1 a process needs its rows of A plus all of B, while under Cannon's algorithm it needs only one block-row of A and one block-column of B]

Blocks Must Be Aligned
[figure omitted: "Before" and "After" panels showing the block layout prior to and following the initial alignment]

Blocks Need to Be Aligned
Each triangle in the figure represents a matrix block; only same-color triangles should be multiplied together.
[figure omitted: the unaligned grid, with blocks A00..A33 and B00..B33 in their initial positions]

Rearrange Blocks
- Block Aij cycles left i positions
- Block Bij cycles up j positions
[figure omitted: the grid after alignment, with each process holding a matching pair of A and B blocks]
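With q = √p blocks per dimension, the alignment leaves process (i, j) holding A block (i, (i+j) mod q) and B block ((i+j) mod q, j); each later one-step shift advances that shared inner index, so

```latex
\[
C_{ij} \;=\; \sum_{s=0}^{q-1} A_{i,\,(i+j+s) \bmod q}\; B_{(i+j+s) \bmod q,\; j}
       \;=\; \sum_{k=0}^{q-1} A_{ik} B_{kj},
\]
```

with every inner index k visited exactly once.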

Consider Process P1,2
[four animation frames omitted, steps 1 through 4: after alignment P1,2 multiplies the A and B blocks it currently holds, then receives the next A block from the right and the next B block from below; over the four steps it accumulates every product A1,k × Bk,2 needed for C1,2]
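Putting the alignment and the per-step shifts together, a hedged C/MPI sketch of Cannon's algorithm (assumptions: p is a perfect square q², each process holds one m × m block with m = n/√p, c starts zeroed, and the helper names are illustrative):

```c
#include <mpi.h>

/* Inner kernel: C += A * B on local m x m blocks. */
static void block_multiply_add(int m, const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < m; j++)
                c[i*m + j] += a[i*m + k] * b[k*m + j];
}

/* Cycle a local block disp steps along one torus dimension
   (negative disp = toward lower coordinates, i.e. left or up). */
static void shift(double *buf, int m, MPI_Comm grid, int dim, int disp)
{
    int src, dst;
    if (disp == 0) return;
    MPI_Cart_shift(grid, dim, disp, &src, &dst);
    MPI_Sendrecv_replace(buf, m * m, MPI_DOUBLE,
                         dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
}

void cannon(int m, double *a, double *b, double *c, MPI_Comm comm)
{
    int p, q, rank, coords[2];
    MPI_Comm grid;
    MPI_Comm_size(comm, &p);
    for (q = 1; q * q < p; q++) ;            /* q = sqrt(p), assumed exact */
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    shift(a, m, grid, 1, -coords[0]);        /* align: block Aij cycles left i */
    shift(b, m, grid, 0, -coords[1]);        /* align: block Bij cycles up j   */

    for (int step = 0; step < q; step++) {
        block_multiply_add(m, a, b, c);
        shift(a, m, grid, 1, -1);            /* A one block left for next step */
        shift(b, m, grid, 0, -1);            /* B one block up for next step   */
    }
    MPI_Comm_free(&grid);
}
```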

Complexity Analysis
- Algorithm has √p iterations
- During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2))
- Computational complexity: Θ(n³/p)
- During each iteration a process sends and receives two blocks of size (n/√p) × (n/√p)
- Communication complexity: Θ(n²/√p)
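Equivalently:

```latex
\[
T_{\mathrm{comp}} = \sqrt{p}\cdot\Theta\!\left(\left(\frac{n}{\sqrt{p}}\right)^{3}\right)
                  = \Theta\!\left(\frac{n^{3}}{p}\right),
\qquad
T_{\mathrm{comm}} = \sqrt{p}\cdot\Theta\!\left(\left(\frac{n}{\sqrt{p}}\right)^{2}\right)
                  = \Theta\!\left(\frac{n^{2}}{\sqrt{p}}\right).
\]
```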

Summary
- Considered two sequential algorithms:
  - Iterative, row-oriented algorithm
  - Recursive, block-oriented algorithm
  - The second has a better cache hit rate as n increases
- Developed two parallel algorithms:
  - First based on rowwise block striped decomposition
  - Second (Cannon's) based on checkerboard block decomposition
  - The second algorithm is scalable, while the first is not