Dense Linear Algebra
Sathish Vadhiyar

Gaussian Elimination - Review

Version 1: for each column i, zero it out below the diagonal by adding multiples of row i to later rows

for i = 1 to n-1                         /* for each column i */
    for j = i+1 to n                     /* for each row j below row i */
        /* add a multiple of row i to row j */
        for k = i to n
            A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)
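For example, one elimination step (i = 1, j = 2) on a 2x2 matrix:

    | 2  1 |    multiplier A(2,1)/A(1,1) = 4/2 = 2, so row 2 := row 2 - 2 * row 1
    | 4  5 |

    | 2  1 |
    | 0  3 |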

Gaussian Elimination - Review

Version 2 - Remove A(j, i)/A(i, i) from the inner loop

for i = 1 to n-1
    for j = i+1 to n
        m = A(j, i) / A(i, i)
        for k = i to n
            A(j, k) = A(j, k) - m * A(i, k)

Gaussian Elimination - Review

Version 3 - Don't compute what we already know

for i = 1 to n-1
    for j = i+1 to n
        m = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - m * A(i, k)

Gaussian Elimination - Review

Version 4 - Store the multipliers m below the diagonal

for i = 1 to n-1
    for j = i+1 to n
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)
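A minimal C sketch of Version 4 (a plain translation of the pseudocode above, not code from the slides), assuming the matrix is stored row-major in a flat array a of length n*n with 0-based indices, and no pivoting:

#include <stddef.h>

/* In-place Gaussian elimination (Version 4): on return, the multipliers (the
 * unit lower triangular factor L without its unit diagonal) sit below the
 * diagonal of a, and the upper triangular factor U on and above it. */
void ge_version4(size_t n, double *a)
{
    for (size_t i = 0; i + 1 < n; i++) {            /* for each column i        */
        for (size_t j = i + 1; j < n; j++) {        /* rows below the diagonal  */
            a[j*n + i] /= a[i*n + i];               /* store multiplier in place */
            for (size_t k = i + 1; k < n; k++)      /* update trailing row j    */
                a[j*n + k] -= a[j*n + i] * a[i*n + k];
        }
    }
}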

GE - Runtime
- Divisions: 1 + 2 + ... + (n-1) = n²/2 (approx.)
- Multiplications / subtractions: 1² + 2² + ... + (n-1)² = n³/3 - n²/2 (approx.)
- Total: 2n³/3 (approx.)
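These counts follow from the standard closed forms: step i of Version 4 performs (n-i) divisions and (n-i)² multiply-subtract pairs, so

divisions        = Σ (n-i) for i = 1..n-1  = n(n-1)/2        ≈ n²/2
multiplications  = Σ (n-i)² for i = 1..n-1 = (n-1)n(2n-1)/6  ≈ n³/3   (and the same for subtractions)
total flops      ≈ 2n³/3 for large n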

Parallel GE
1st step - 1-D block partitioning: the n columns are divided into p contiguous blocks of n/p columns each, one block per processor.

1D block partitioning - Steps
1. Divisions: n²/2
2. Broadcast (of the multiplier column at each step): x·log(p) + y·log(p-1) + z·log(p-2) + ... + log 1 < n² log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + ... + 1·1 = n³/p (approx.)
Total runtime: < n²/2 + n² log p + n³/p
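As an illustration only (not the slides' code), a skeleton of this 1-D block-column algorithm in C with MPI; it assumes n is divisible by p and that each rank stores its n/p columns contiguously, column by column (local[jloc*n + i] holds global A(i, rank*nloc + jloc), 0-based, no pivoting):

#include <mpi.h>
#include <stdlib.h>

void ge_1d_block(int n, double *local, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int nloc = n / p;                                 /* columns owned per rank  */
    double *mult = malloc((size_t)n * sizeof *mult);  /* multipliers of step k   */

    for (int k = 0; k < n - 1; k++) {
        int owner = k / nloc;                         /* rank holding column k   */
        if (rank == owner) {
            int jloc = k - owner * nloc;
            for (int i = k + 1; i < n; i++) {         /* divisions: multipliers  */
                local[jloc*n + i] /= local[jloc*n + k];
                mult[i] = local[jloc*n + i];
            }
        }
        /* broadcast the multiplier column to all other ranks */
        MPI_Bcast(mult + k + 1, n - k - 1, MPI_DOUBLE, owner, comm);

        /* each rank updates only its own columns with global index > k */
        for (int jloc = 0; jloc < nloc; jloc++) {
            int jglob = rank * nloc + jloc;
            if (jglob <= k) continue;
            for (int i = k + 1; i < n; i++)           /* multiplications, subtractions */
                local[jloc*n + i] -= mult[i] * local[jloc*n + k];
        }
    }
    free(mult);
}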

2-D block partitioning
To speed up the divisions.
[Figure: the matrix distributed across a P x Q grid of processors]

2D block partitioning - Steps
1. Broadcast of A(k,k): log Q
2. Divisions: n²/Q (approx.)
3. Broadcast of multipliers: x·log(P) + y·log(P-1) + z·log(P-2) + ... = (n²/Q)·log P
4. Multiplications and subtractions: n³/(PQ) (approx.)
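The broadcasts in steps 1 and 3 travel along a single processor column or row of the grid. One way to set this up, sketched here as an assumption rather than the slides' method, is to split the world communicator into per-row and per-column communicators:

#include <mpi.h>

/* Split the world communicator for a logical P x Q process grid, assuming
 * ranks are numbered row-major: rank = my_row * Q + my_col. */
void make_grid_comms(MPI_Comm world, int Q, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(world, &rank);
    int my_row = rank / Q;    /* grid row of this rank    */
    int my_col = rank % Q;    /* grid column of this rank */

    /* Ranks sharing a grid row (or column) land in the same communicator, so a
     * pivot element or a block of multipliers can be broadcast within a row in
     * log Q steps, or within a column in log P steps, with one MPI_Bcast. */
    MPI_Comm_split(world, my_row, my_col, row_comm);
    MPI_Comm_split(world, my_col, my_row, col_comm);
}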

Problem with block partitioning for GE
Once a processor's block is finished, that processor remains idle for the rest of the execution. Solution?

Onto cyclic
The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm. Moving to cyclic distributions:
- cyclic: load balance
- 1-D block-cyclic: load balance and block operations, but the column factorization is a bottleneck
- 2-D block-cyclic: has everything
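For illustration, the mapping from a global column index to its owning process under a 1-D block-cyclic distribution (the block size nb and process count p are generic parameters, not values from the slides):

/* Owner of global column j under a 1-D block-cyclic distribution with block
 * size nb over p processes: columns are dealt out in chunks of nb, round-robin.
 * nb = 1 gives the purely cyclic distribution; nb = n/p gives back 1-D block. */
int column_owner(int j, int nb, int p)
{
    return (j / nb) % p;
}

/* Local column index of global column j on its owning process. */
int local_column(int j, int nb, int p)
{
    return (j / (nb * p)) * nb + j % nb;
}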

Block cyclic
- Having blocks in a processor can lead to block-based operations (block matrix multiply etc.)
- Block-based operations lead to high performance
- Operations can be split into 3 categories based on the number of operations per memory reference
- These categories are referred to as the BLAS levels

Basic Linear Algebra Subroutines (BLAS) - 3 levels of operations
The memory hierarchy is exploited more efficiently by the higher-level BLAS:

BLAS level                 Example                          Memory refs   Flops   Flops / memory ref
Level-1 (vector)           y = y + ax,  z = y·x             3n            2n      2/3
Level-2 (matrix-vector)    y = y + Ax,  A = A + alpha·x·yᵀ  n²            2n²     2
Level-3 (matrix-matrix)    C = C + AB                       4n²           2n³     n/2
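Concretely, the three levels correspond to routines such as AXPY, GEMV, and GEMM. A small sketch using the standard CBLAS interface (the array sizes and leading dimensions here are illustrative assumptions):

#include <cblas.h>

void blas_levels_demo(int n, double alpha,
                      double *x, double *y,             /* vectors of length n */
                      double *A, double *B, double *C)  /* n x n, row-major    */
{
    /* Level 1 (vector): y = y + alpha*x -- O(n) flops on O(n) data            */
    cblas_daxpy(n, alpha, x, 1, y, 1);

    /* Level 2 (matrix-vector): y = A*x + y -- O(n^2) flops on O(n^2) data     */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 1.0, y, 1);

    /* Level 3 (matrix-matrix): C = A*B + C -- O(n^3) flops on O(n^2) data, so
     * blocked algorithms can reuse each entry roughly n/2 times from cache    */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 1.0, C, n);
}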

Gaussian Elimination - Review

Version 4 - Store the multipliers m below the diagonal

for i = 1 to n-1
    for j = i+1 to n
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)

What GE really computes:

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i, i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

These are BLAS1 (vector scaling) and BLAS2 (rank-1 update) operations.

Converting BLAS2 to BLAS3
- Use blocking to obtain optimized matrix multiplies (BLAS3)
- Matrix multiplies arise from delayed updates: save several updates to the trailing matrix, then apply them all at once as a single matrix multiply

Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)

for ib = 1 to n-1 step b                /* process matrix b columns at a time */
    end = ib + b - 1
    /* Apply the BLAS2 version of GE to get A(ib:n, ib:end) factored.
       Let LL denote the strictly lower triangular portion of A(ib:end, ib:end) + I */
    A(ib:end, end+1:n) = LL⁻¹ * A(ib:end, end+1:n)      /* update next b rows of U */
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                   /* apply delayed updates with a single matrix multiply */

[Figure: the current block of b columns; completed parts of L and U; blocks A(ib:end, ib:end), A(end+1:n, ib:end), and the trailing matrix A(end+1:n, end+1:n)]
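A compact C sketch of this blocked, right-looking factorization using CBLAS for the BLAS3 steps; row-major storage, no pivoting, and n divisible by the block size b are simplifying assumptions, so this is an illustration of the idea rather than the slides' code:

#include <cblas.h>

/* Unblocked (BLAS2-style) GE on the n-row, b-column panel starting at column ib. */
static void panel_factor(int n, int ib, int b, double *A)
{
    for (int i = ib; i < ib + b; i++)
        for (int j = i + 1; j < n; j++) {
            A[j*n + i] /= A[i*n + i];                 /* multiplier, stored in L */
            for (int k = i + 1; k < ib + b; k++)      /* update only the panel   */
                A[j*n + k] -= A[j*n + i] * A[i*n + k];
        }
}

void blocked_lu(int n, int b, double *A)
{
    for (int ib = 0; ib < n; ib += b) {
        int end = ib + b;             /* one past the last column of the block  */
        panel_factor(n, ib, b, A);
        if (end >= n) break;

        /* A(ib:end, end:n) = LL^{-1} * A(ib:end, end:n): triangular solve with
         * the unit lower triangular LL held in A(ib:end, ib:end).              */
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                    b, n - end, 1.0, &A[ib*n + ib], n, &A[ib*n + end], n);

        /* Delayed updates applied as one matrix multiply:
         * A(end:n, end:n) -= A(end:n, ib:end) * A(ib:end, end:n)               */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n - end, n - end, b,
                    -1.0, &A[end*n + ib], n, &A[ib*n + end], n,
                    1.0, &A[end*n + end], n);
    }
}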

GE: Miscellaneous
GE with Partial Pivoting
- 1-D block-column partitioning: which is better, column or row pivoting?
  - Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1). The exchange information is passed to the other processes by piggybacking on the multiplier information.
  - Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
- 2-D block partitioning: the pivot search can be restricted to a limited number of columns.
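A sequential sketch of the pivot search and row exchange that partial pivoting inserts before each elimination step (illustrative only; row-major n x n array, 0-based indices):

#include <math.h>

/* Before eliminating column i: find the entry of largest magnitude in column i
 * at or below the diagonal, then swap that row with row i so that all
 * multipliers |A(j,i)/A(i,i)| stay <= 1. */
void pivot_column(int n, double *A, int i)
{
    int p = i;
    for (int j = i + 1; j < n; j++)          /* search over O(n-i-1) candidates */
        if (fabs(A[j*n + i]) > fabs(A[p*n + i]))
            p = j;
    if (p != i)                              /* exchange rows i and p           */
        for (int k = 0; k < n; k++) {
            double t = A[i*n + k];
            A[i*n + k] = A[p*n + k];
            A[p*n + k] = t;
        }
}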

Triangular Solve - Unit upper triangular matrix
- Sequential complexity: O(n²)
- Complexity of the parallel algorithm with 2-D block partitioning (√P x √P processor grid): O(n²)/√P
- Thus (parallel GE / parallel TS) < (sequential GE / sequential TS)
- Overall (GE + TS): O(n³/P)
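For reference, the sequential solve with a unit upper triangular matrix (all diagonal entries equal to 1) is a short back substitution, which is where the O(n²) count comes from; a minimal sketch:

/* Solve U*x = b in place (x overwrites b), where U is unit upper triangular and
 * stored row-major in the n x n array U.  About n^2/2 multiply-subtract pairs,
 * i.e. O(n^2) work. */
void unit_upper_solve(int n, const double *U, double *b)
{
    for (int i = n - 1; i >= 0; i--)
        for (int k = i + 1; k < n; k++)
            b[i] -= U[i*n + k] * b[k];   /* diagonal is 1, so no division needed */
}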