Parallel Matrix Operations using MPI CPS 5401 Fall 2014 Shirley Moore, Instructor November 3, 2014

Matrix Algorithms: Introduction Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data decomposition. Most algorithms use one- or two-dimensional block, cyclic, or block-cyclic partitionings.

Matrix-Vector Multiplication We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y. The serial algorithm requires $n^2$ multiplications and additions.
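For concreteness, a minimal serial version in C might look as follows (row-major storage; the function name and signature are illustrative, not from the slides):

```c
/* Serial y = A*x for a dense n x n matrix A stored in row-major order. */
void matvec_serial(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* n multiply-adds per row */
        y[i] = sum;
    }
}
```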

Matrix-Vector Multiplication: Rowwise 1-D Partitioning The n x n matrix is partitioned among p processors, with each processor storing n/p complete rows of the matrix. The n x 1 vector x is distributed such that each process owns n/p of its elements.

Matrix-Vector Multiplication: Rowwise 1-D Partitioning Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.

Matrix-Vector Multiplication: Rowwise 1-D Partitioning Consider the case when p < n and we use block 1-D partitioning. Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p. The all-to-all broadcast takes place among p processes and involves messages of size n/p. This is followed by n/p local dot products. Thus, the parallel run time of this procedure is $T_P = \frac{n^2}{p} + t_s \log p + t_w n$, where $t_s$ is the message startup time and $t_w$ the per-word transfer time: the all-to-all broadcast costs $t_s \log p + t_w n$, and the $n/p$ dot products of length $n$ cost $n^2/p$.
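A minimal MPI sketch of this procedure, assuming p divides n and that each process already holds its n/p rows in A_local and its n/p vector elements in x_local (all names are illustrative; the all-to-all broadcast of the vector is expressed with MPI_Allgather):

```c
#include <mpi.h>
#include <stdlib.h>

/* Rowwise 1-D y = A*x: gather the full vector, then do local dot products. */
void matvec_rowwise(int n, const double *A_local, const double *x_local,
                    double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast: every process ends up with the full vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

    /* n/p local dot products of length n. */
    for (int i = 0; i < nlocal; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }
    free(x);
}
```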

Matrix-Vector Multiplication: 2-D Partitioning The n x n matrix is partitioned among p processors such that each processor owns $n^2/p$ elements. The n x 1 vector x is distributed only in the last column of processors.

Matrix-Vector Multiplication: 2-D Partitioning Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, $p = n^2$ if the matrix size is n x n.

Matrix-Vector Multiplication: 2-D Partitioning We must first align the vector with the matrix appropriately. The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. The second step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous broadcasts among all processes in the column. Finally, the result vector is computed by performing an all-to-one reduction along the rows (the partial products in row i sum to block i of y).

Matrix-Vector Multiplication: 2-D Partitioning When using fewer than $n^2$ processes, each process owns an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of the matrix. The vector is distributed in portions of $n/\sqrt{p}$ elements in the last process-column only. In this case, the message sizes for the alignment, broadcast, and reduction are all $n/\sqrt{p}$. The computation is a product of an $(n/\sqrt{p}) \times (n/\sqrt{p})$ submatrix with a vector of length $n/\sqrt{p}$.

Matrix-Vector Multiplication: 2-D Partitioning The first alignment step takes time $t_s + t_w n/\sqrt{p}$. The broadcasts and reductions each take time $(t_s + t_w n/\sqrt{p}) \log \sqrt{p}$. Local matrix-vector products take time $n^2/p$. The total time is $T_P \approx \frac{n^2}{p} + t_s \log p + t_w \frac{n}{\sqrt{p}} \log p$.
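These steps can be sketched with MPI communicators. This is a simplified sketch, not the course's reference code: it assumes p is a perfect square whose square root divides n, that ranks are laid out row-major on the grid, that the initial alignment has already been done (x_blk holds valid data on the diagonal processes and is a scratch buffer elsewhere), and all names are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

/* 2-D partitioned y = A*x on a q x q grid, q = sqrt(p). */
void matvec_2d(int n, const double *A_blk, double *x_blk, double *y_blk,
               MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)sqrt((double)p);            /* process grid is q x q */
    int nb = n / q;                          /* block size */
    int row = rank / q, col = rank % q;      /* row-major rank layout assumed */

    MPI_Comm col_comm, row_comm;
    MPI_Comm_split(comm, col, row, &col_comm);  /* same-column processes */
    MPI_Comm_split(comm, row, col, &row_comm);  /* same-row processes */

    /* Step 2: diagonal process (col, col) broadcasts its piece of x
       down its column; its rank within col_comm equals col. */
    MPI_Bcast(x_blk, nb, MPI_DOUBLE, col, col_comm);

    /* Local product of the block with the vector piece. */
    double *partial = calloc(nb, sizeof(double));
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            partial[i] += A_blk[i * nb + j] * x_blk[j];

    /* Step 3: all-to-one reduction along the row; the result block of y
       lands in the last process-column. */
    MPI_Reduce(partial, y_blk, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(partial);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&row_comm);
}
```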

Matrix-Matrix Multiplication Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B. The serial complexity is $O(n^3)$. We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels in the parallel algorithms. A useful concept here is that of block operations: an n x n matrix A can be regarded as a $q \times q$ array of blocks $A_{i,j}$ ($0 \le i, j < q$) such that each block is an $(n/q) \times (n/q)$ submatrix. In this view, we perform $q^3$ block multiplications, each involving $(n/q) \times (n/q)$ matrices.
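A serial C sketch of this block view, assuming q divides n and that the caller zero-initializes C (names are illustrative):

```c
/* Blocked C += A*B: n x n matrices viewed as a q x q grid of
   (n/q) x (n/q) blocks; q^3 block multiplications in total. */
void matmul_blocked(int n, int q, const double *A, const double *B, double *C)
{
    int nb = n / q;                              /* block dimension */
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C[bi][bj] += A[bi][bk] * B[bk][bj] (one block multiply) */
                for (int i = bi * nb; i < (bi + 1) * nb; i++)
                    for (int j = bj * nb; j < (bj + 1) * nb; j++) {
                        double sum = 0.0;
                        for (int k = bk * nb; k < (bk + 1) * nb; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += sum;
                    }
}
```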

Matrix-Matrix Multiplication Consider two n x n matrices A and B partitioned into p blocks $A_{i,j}$ and $B_{i,j}$ ($0 \le i, j < \sqrt{p}$) of size $(n/\sqrt{p}) \times (n/\sqrt{p})$ each. Process $P_{i,j}$ initially stores $A_{i,j}$ and $B_{i,j}$ and computes block $C_{i,j}$ of the result matrix. Computing submatrix $C_{i,j}$ requires all submatrices $A_{i,k}$ and $B_{k,j}$ for $0 \le k < \sqrt{p}$. All-to-all broadcast the blocks of A along rows and the blocks of B along columns, then perform the local submatrix multiplications, as in the sketch below.
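A hedged MPI sketch of this simple algorithm (assumptions: p is a perfect square, q = sqrt(p) divides n, ranks are laid out row-major on the grid, and each process starts with its blocks in A_blk and B_blk; the all-to-all broadcasts are expressed with MPI_Allgather over row and column communicators, and all names are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

/* Simple (memory-hungry) parallel C = A*B on a q x q process grid. */
void matmul_simple(int n, const double *A_blk, const double *B_blk,
                   double *C_blk, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)sqrt((double)p);
    int nb = n / q, bsz = nb * nb;
    int row = rank / q, col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);
    MPI_Comm_split(comm, col, row, &col_comm);

    /* All-to-all broadcasts: afterwards process (i,j) holds the blocks
       A_{i,0..q-1} (gathered along its row) and B_{0..q-1,j} (gathered
       along its column). */
    double *Arow = malloc(q * bsz * sizeof(double));
    double *Bcol = malloc(q * bsz * sizeof(double));
    MPI_Allgather(A_blk, bsz, MPI_DOUBLE, Arow, bsz, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_blk, bsz, MPI_DOUBLE, Bcol, bsz, MPI_DOUBLE, col_comm);

    /* C_{i,j} = sum over k of A_{i,k} * B_{k,j}, computed locally. */
    for (int i = 0; i < bsz; i++) C_blk[i] = 0.0;
    for (int k = 0; k < q; k++) {
        const double *Ak = Arow + k * bsz, *Bk = Bcol + k * bsz;
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double sum = 0.0;
                for (int t = 0; t < nb; t++)
                    sum += Ak[i * nb + t] * Bk[t * nb + j];
                C_blk[i * nb + j] += sum;
            }
    }
    free(Arow); free(Bcol);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}
```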

Matrix-Matrix Multiplication The two broadcasts take time $2(t_s \log \sqrt{p} + t_w (n^2/p)(\sqrt{p} - 1))$. The computation requires $\sqrt{p}$ multiplications of $(n/\sqrt{p}) \times (n/\sqrt{p})$-sized submatrices, which takes time $n^3/p$. The parallel run time is approximately $T_P = \frac{n^3}{p} + t_s \log p + 2 t_w \frac{n^2}{\sqrt{p}}$. A major drawback of the algorithm is that it is not memory optimal: each process stores $\sqrt{p}$ blocks of A and of B, so the total memory over all processes grows as $\Theta(n^2 \sqrt{p})$ rather than $\Theta(n^2)$.

Matrix-Matrix Multiplication: Cannon's Algorithm In this algorithm, we schedule the computations of the processes of the ith row such that, at any given time, each process is using a different block $A_{i,k}$. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh $A_{i,k}$ after each rotation.

Matrix-Matrix Multiplication: Cannon's Algorithm Communication steps in Cannon's algorithm on 16 processes.

Matrix-Matrix Multiplication: Cannon's Algorithm Align the blocks of A and B so that each process can multiply its local submatrices. This is done by shifting all submatrices $A_{i,j}$ to the left (with wraparound) by i steps and all submatrices $B_{i,j}$ up (with wraparound) by j steps. Perform a local block multiplication. Then each block of A moves one step left and each block of B moves one step up (again with wraparound). Perform the next block multiplication, add it to the partial result, and repeat until all $\sqrt{p}$ blocks have been multiplied; a sketch follows.
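A compact MPI sketch of Cannon's algorithm using a periodic Cartesian topology (assumptions: p is a perfect square, q = sqrt(p) divides n, and c_blk is zeroed by the caller; the local_matmul helper and all other names are illustrative). MPI_Cart_shift on a periodic grid computes the wraparound source and destination ranks, so each rotation is a single MPI_Sendrecv_replace.

```c
#include <mpi.h>
#include <math.h>

/* Local block multiply: c += a*b for nb x nb row-major blocks. */
static void local_matmul(int nb, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++)
                c[i * nb + j] += a[i * nb + k] * b[k * nb + j];
}

/* Cannon's C = A*B on a q x q torus, one block of each matrix per process. */
void cannon(int n, double *a_blk, double *b_blk, double *c_blk, MPI_Comm comm)
{
    int p, rank, coords[2], src, dst;
    MPI_Comm_size(comm, &p);
    int q = (int)sqrt((double)p);
    int nb = n / q, bsz = nb * nb;

    /* Periodic q x q grid, so wraparound shifts come for free. */
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Initial alignment: row i of A shifts left by i steps,
       column j of B shifts up by j steps. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    /* q compute-and-shift steps: multiply, then rotate A left and B up. */
    for (int step = 0; step < q; step++) {
        local_matmul(nb, a_blk, b_blk, c_blk);
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(a_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(b_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}
```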