CS 484. Dense Matrix Algorithms There are two types of Matrices Dense (Full) Sparse We will consider matrices that are Dense Square.



Mapping Matrices How do we partition a matrix for parallel processing? There are two basic ways Striped partitioning Block partitioning

Striped Partitioning [Figure: two panels. Block striping: each of P0, P1, P2, P3 owns one contiguous stripe. Cyclic striping: stripes are assigned to P0, P1, P2, P3 in repeating order.]
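A minimal sketch of the two striping schemes, assuming row-wise stripes over n rows and p processors. The function names are hypothetical and are not part of the slides.

```python
def block_stripe_owner(i, n, p):
    """Processor that owns row i when n rows are cut into p contiguous stripes."""
    rows_per_proc = -(-n // p)  # ceiling division, so the last stripe may be short
    return i // rows_per_proc


def cyclic_stripe_owner(i, p):
    """Processor that owns row i when rows are assigned in repeating order."""
    return i % p


n, p = 8, 4
print([block_stripe_owner(i, n, p) for i in range(n)])
print([cyclic_stripe_owner(i, p) for i in range(n)])
```

Block striping keeps neighbouring rows on the same processor, while cyclic striping spreads them out, which matters once rows become inactive during elimination.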

Block Partitioning [Figure: two panels. Block checkerboard: the matrix is cut into contiguous blocks owned by P0 to P3. Cyclic checkerboard: blocks are assigned to P0 to P7 in a repeating pattern.]
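A minimal sketch of checkerboard ownership, assuming a square q x q processor grid numbered row by row and a matrix order n that q divides evenly. These assumptions and the function names are illustrative only; the figure on the slide may use a different grid shape.

```python
def block_checkerboard_owner(i, j, n, q):
    """Rank of the processor owning element (i, j) under block checkerboarding."""
    b = n // q  # edge length of one block, assuming q divides n
    return (i // b) * q + (j // b)


def cyclic_checkerboard_owner(i, j, q):
    """Rank of the processor owning element (i, j) under cyclic checkerboarding."""
    return (i % q) * q + (j % q)


n, q = 4, 2
for i in range(n):
    print([block_checkerboard_owner(i, j, n, q) for j in range(n)],
          [cyclic_checkerboard_owner(i, j, q) for j in range(n)])
```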

Block vs. Striped Partitioning Scalability? Striping is limited to n processors Checkerboard is limited to n x n processors Complexity? Striping is easy Block could introduce more dependencies

Dense Matrix Algorithms Transposition Matrix - Vector Multiplication Matrix - Matrix Multiplication Solving Systems of Linear Equations Gaussian Elimination

Matrix Transposition The transpose of A is $A^T$ such that $A^T[i,j] = A[j,i]$. All elements below the diagonal move above the diagonal and vice versa. If we assume unit time to exchange and one element per processor: Transpose takes $(n^2 - n)/2$

Transpose Consider case where each processor has more than one element. Hypothesis: The transpose of the full matrix can be done by first sending the multiple element messages to their destination and then transposing the contents of the message.
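The hypothesis above can be checked with a small serial sketch: each b x b block is moved whole to its mirrored position and then transposed locally. The helper names and the assumption that b divides n are illustrative, not from the slides.

```python
def transpose(m):
    """Plain element-wise transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def block_transpose(a, b):
    """Transpose an n x n matrix by moving b x b blocks, then transposing each block."""
    n = len(a)
    out = [[None] * n for _ in range(n)]
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            # The "message": block whose top-left corner is (bi, bj).
            block = [row[bj:bj + b] for row in a[bi:bi + b]]
            # The destination transposes the received block locally...
            local = transpose(block)
            # ...and stores it at the mirrored block position (bj, bi).
            for r in range(b):
                out[bj + r][bi:bi + b] = local[r]
    return out


a = [[r * 4 + c for c in range(4)] for r in range(4)]
print(block_transpose(a, 2) == transpose(a))
```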

Transpose (Striped Partitioning)

Transpose (Block Partitioning)

Matrix Multiplication

One Dimensional Decomposition Each processor "owns" black portion. To compute the owned portion of the answer, each processor requires all of A. $T = (P - 1)\left(t_s + t_w \frac{N^2}{P}\right)$

Two Dimensional Decomposition Requires less data per processor Algorithm can be performed stepwise. Fox’s algorithm

Broadcast an A submatrix to the other processors in row. Compute Rotate the B submatrix upwards

Algorithm
Set B' = B_local
for j = 0 to sqrt(P) - 1
    in each row i, the [(i + j) mod sqrt(P)]th task broadcasts A' = A_local to the other tasks in the row
    accumulate A' * B'
    send B' to upward neighbor
done
[Communication cost formula garbled in this transcript; its surviving terms are T, t_s, t_w, N, P and log.]
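A serial sketch of the broadcast-and-rotate scheme, assuming a q x q task grid with q = sqrt(P) and a matrix order that q divides. The broadcast and the upward send are simulated with ordinary list operations, and all helper names are hypothetical.

```python
def matmul(x, y):
    """Naive dense product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m, q):
    """Cut a square matrix into a q x q grid of equal blocks."""
    bs = len(m) // q
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(q)] for i in range(q)]


def join(blocks):
    """Inverse of split: glue a grid of blocks back into one matrix."""
    return [sum((blk[r] for blk in brow), [])
            for brow in blocks for r in range(len(brow[0]))]


def fox(a_blocks, b_blocks):
    q = len(a_blocks)
    bs = len(a_blocks[0][0])
    c = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    b = [row[:] for row in b_blocks]  # B' starts as the local B block
    for step in range(q):
        for i in range(q):
            a_bcast = a_blocks[i][(i + step) % q]  # row-wise broadcast of A'
            for j in range(q):
                c[i][j] = matadd(c[i][j], matmul(a_bcast, b[i][j]))
        # Every task sends B' to its upward neighbour (with wrap-around).
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
print(join(fox(split(a, 2), split(b, 2))) == matmul(a, b))
```

The sketch runs q stages so that every task accumulates all q block products of its row and column.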

Cannon’s Algorithm Broadcasting a submatrix to all who need it is costly. Suggestion: Shift both submatrices. $T = 2(\sqrt{P} - 1)\left(t_s + t_w \frac{N^2}{P}\right)$

Blocks Need to Be Aligned
A00 B00 | A01 B01 | A02 B02 | A03 B03
A10 B10 | A11 B11 | A12 B12 | A13 B13
A20 B20 | A21 B21 | A22 B22 | A23 B23
A30 B30 | A31 B31 | A32 B32 | A33 B33
Each triangle represents a matrix block. Only same-color triangles should be multiplied.

Rearrange Blocks [Figure: the 4 x 4 grid of blocks A00 to A33 and B00 to B33 after alignment.] Block Aij cycles left i positions. Block Bij cycles up j positions.

Consider Process P1,2, Steps 1 to 4 [Figure: four panels, one per step. Between consecutive steps the row-1 blocks A10, A11, A12, A13 cycle left by one position and the column-2 blocks B02, B12, B22, B32 cycle up by one position, so P1,2 receives a new pair of blocks to multiply at each step.]
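A serial sketch of the shifting scheme described above, assuming for brevity that every process holds a single element rather than a block; the same shifts apply unchanged to blocks. The function name is hypothetical.

```python
def cannon_multiply(a, b):
    """Multiply two q x q matrices with Cannon-style shifts, one element per process."""
    q = len(a)
    # Initial alignment: A(i, j) cycles left i positions, B(i, j) cycles up j positions.
    a_cur = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b_cur = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Every process multiplies the pair it currently holds.
        for i in range(q):
            for j in range(q):
                c[i][j] += a_cur[i][j] * b_cur[i][j]
        # Then A cycles left by one and B cycles up by one.
        a_cur = [[a_cur[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_cur = [[b_cur[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c


print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After alignment, the pair held by any process always shares the same inner index, which is why only nearest-neighbour shifts are needed from then on.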

Complexity Analysis Algorithm has $\sqrt{p}$ iterations. During each iteration process multiplies two $(n/\sqrt{p}) \times (n/\sqrt{p})$ matrices: $\Theta(n^3 / p^{3/2})$. Computational complexity: $\Theta(n^3 / p)$. During each iteration process sends and receives two blocks of size $(n/\sqrt{p}) \times (n/\sqrt{p})$. Communication complexity: $\Theta(n^2 / \sqrt{p})$

Divide and Conquer
[App Apq; Aqp Aqq] x [Bpp Bpq; Bqp Bqq] = [P0 + P1, P2 + P3; P4 + P5, P6 + P7]
P0 = App * Bpp
P1 = Apq * Bqp
P2 = App * Bpq
P3 = Apq * Bqq
P4 = Aqp * Bpp
P5 = Aqq * Bqp
P6 = Aqp * Bpq
P7 = Aqq * Bqq
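A recursive sketch of the eight-product block scheme, assuming square matrices whose order is a power of two. Names are illustrative; the eight recursive calls are independent of one another, which is what makes the scheme attractive for parallel execution.

```python
def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def quadrants(m):
    """Split a matrix into its pp, pq, qp, qq quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def dc_matmul(a, b):
    """Divide-and-conquer product; the order of a and b must be a power of two."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    app, apq, aqp, aqq = quadrants(a)
    bpp, bpq, bqp, bqq = quadrants(b)
    cpp = add(dc_matmul(app, bpp), dc_matmul(apq, bqp))  # P0 + P1
    cpq = add(dc_matmul(app, bpq), dc_matmul(apq, bqq))  # P2 + P3
    cqp = add(dc_matmul(aqp, bpp), dc_matmul(aqq, bqp))  # P4 + P5
    cqq = add(dc_matmul(aqp, bpq), dc_matmul(aqq, bqq))  # P6 + P7
    top = [left + right for left, right in zip(cpp, cpq)]
    bottom = [left + right for left, right in zip(cqp, cqq)]
    return top + bottom


print(dc_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```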

Systems of Linear Equations A linear equation in n variables has the form $a_0 x_0 + a_1 x_1 + \dots + a_{n-1} x_{n-1} = b$. A set of linear equations is called a system. A solution exists for a system iff the solution satisfies all equations in the system. Many scientific and engineering problems take this form.

Solving Systems of Equations Many such systems are large. Thousands of equations and unknowns.
$a_{0,0} x_0 + a_{0,1} x_1 + \dots + a_{0,n-1} x_{n-1} = b_0$
$a_{1,0} x_0 + a_{1,1} x_1 + \dots + a_{1,n-1} x_{n-1} = b_1$
$\vdots$
$a_{n-1,0} x_0 + a_{n-1,1} x_1 + \dots + a_{n-1,n-1} x_{n-1} = b_{n-1}$

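Solving Systems of Equations A linear system of equations can be represented in matrix form, Ax = b:
$$\begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$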

Solving Systems of Equations Solving a system of linear equations is done in two steps: Reduce the system to upper-triangular form. Use back-substitution to find the solution. These steps are performed on the system in matrix form. Gaussian Elimination, etc.

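Solving Systems of Equations Reduce the system to upper-triangular form. Use back-substitution.
$$\begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ 0 & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{n-1,n-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$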
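Once the system is upper-triangular, back-substitution solves for the unknowns from the last equation upward. A minimal sketch, assuming a nonzero diagonal; the function name is illustrative.

```python
def back_substitute(u, b):
    """Solve u x = b for an upper-triangular u with nonzero diagonal entries."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of the unknowns that are already solved.
        known = sum(u[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / u[i][i]
    return x


print(back_substitute([[2, 1], [0, 4]], [5, 8]))
```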

Reducing the System Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1, reducing the corresponding coefficients to zero. This is done by subtracting an appropriate multiple of the kth equation from each of the equations k+1 to n-1.

Procedure GaussianElimination(A, b, y)
for k = 0 to n - 1
    /* Division Step */
    for j = k + 1 to n - 1
        A[k,j] = A[k,j] / A[k,k]
    endfor
    y[k] = b[k] / A[k,k]
    A[k,k] = 1
    /* Elimination Step */
    for i = k + 1 to n - 1
        for j = k + 1 to n - 1
            A[i,j] = A[i,j] - A[i,k] * A[k,j]
        endfor
        b[i] = b[i] - A[i,k] * y[k]
        A[i,k] = 0
    endfor
endfor
end
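A direct Python transcription of the procedure, offered as a sketch. It assumes every pivot A[k,k] is nonzero, since the slides do not mention pivoting, and it works on copies instead of overwriting its inputs.

```python
def gaussian_elimination(a, b):
    """Reduce A x = b to a unit upper-triangular system U x = y; returns (U, y)."""
    n = len(a)
    a = [[float(v) for v in row] for row in a]  # work on copies
    b = [float(v) for v in b]
    y = [0.0] * n
    for k in range(n):
        pivot = a[k][k]  # assumed nonzero: no pivoting is performed
        # Division step: scale row k so that its diagonal entry becomes 1.
        for j in range(k + 1, n):
            a[k][j] /= pivot
        y[k] = b[k] / pivot
        a[k][k] = 1.0
        # Elimination step: remove x[k] from every later equation.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
            b[i] -= a[i][k] * y[k]
            a[i][k] = 0.0
    return a, y


u, y = gaussian_elimination([[2, 1], [4, 5]], [5, 13])
print(u, y)
```

Because the diagonal of the result is all ones, back-substitution on (U, y) needs no divisions.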

Parallelizing Gaussian Elim. Use domain decomposition: rowwise striping. Division step requires no communication. Elimination step requires a one-to-all broadcast for each equation. No agglomeration. Initially map one row to each processor.

Communication Analysis Consider the algorithm step by step Division step requires no communication Elimination step requires one-to-all bcast only bcast to other active processors only bcast active elements Final computation requires no communication.

Communication Analysis One-to-all broadcast: $\log_2 q$ communications, with q = n - k - 1 active processors. Message size: q active processors, so q elements required. $T = (t_s + t_w q)\log_2 q$

Computation Analysis Division step q divisions Elimination step q multiplications and subtractions Assuming equal time --> 3q operations

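Computation Analysis In each step, the active processor set is reduced by one, resulting in: CompTime = [summation over the steps k of a term in (n - k), with a closed form of the shape n(n ± 1)/2; the summation limits and signs are lost in this transcript]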

Can we do better? Previous version is synchronous and parallelism is reduced at each step. Pipeline the algorithm Run the resulting algorithm on a linear array of processors. Communication is nearest-neighbor Results in O(n) steps of O(n) operations

Pipelined Gaussian Elim. Basic assumption: A processor does not need to wait until all processors have received a value to proceed. Algorithm If processor p has data for other processors, send the data to processor p+1 If processor p can do some computation using the data it has, do it. Otherwise, wait to receive data from processor p-1

Conclusion Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve best performance. Pipelined algorithms work best on a linear array of processors. Or something that can be linearly mapped Would it be better to block partition? How would it affect the algorithm?

Row Ordering When dealing with a sparse matrix, sometimes operations can cause a zero space in the matrix to become non-zero

Nested Dissection Ordering Complete these slides using notes in the black binder.