Dense Linear Algebra Sathish Vadhiyar

Gaussian Elimination - Review

Version 1: for each column i, zero it out below the diagonal by adding multiples of row i to later rows.

    for i = 1 to n-1          /* for each column i */
        for j = i+1 to n      /* for each row j below row i: add a multiple of row i to row j */
            for k = i to n
                A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)

[Figure: matrix with pivot A(i,i); row i, column i, and the trailing submatrix marked]

Gaussian Elimination - Review

Version 2: remove the invariant A(j, i)/A(i, i) from the inner loop.

    for i = 1 to n-1          /* for each column i */
        for j = i+1 to n      /* for each row j below row i */
            m = A(j, i) / A(i, i)
            for k = i to n
                A(j, k) = A(j, k) - m * A(i, k)

Gaussian Elimination - Review

Version 3: don't compute what we already know - the entries below the diagonal in column i become zero, so start k at i+1.

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - m * A(i, k)

Gaussian Elimination - Review

Version 4: store the multipliers m below the diagonal.

    for i = 1 to n-1
        for j = i+1 to n
            A(j, i) = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - A(j, i) * A(i, k)
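For concreteness, a minimal NumPy sketch of Version 4 (the function name is ours; there is no pivoting, so it assumes every pivot A(i, i) is nonzero):

    import numpy as np

    def gaussian_elimination_v4(A):
        """Version 4 of GE: in place, multipliers stored below the diagonal."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n - 1):              # for each column i
            for j in range(i + 1, n):       # for each row j below row i
                A[j, i] = A[j, i] / A[i, i]            # store the multiplier
                for k in range(i + 1, n):              # update the rest of row j
                    A[j, k] = A[j, k] - A[j, i] * A[i, k]
        return A    # L (strictly below the diagonal) and U (on/above it), packed together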

GE - Runtime

- Divisions: 1 + 2 + ... + (n-1) = n^2/2 (approx.)
- Multiplications and subtractions: 1^2 + 2^2 + ... + (n-1)^2 = n^3/3 - n^2/2 (approx.) of each
- Total: about 2n^3/3 flops
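A quick way to sanity-check these counts is to tally the operations directly (a small sketch; the helper name is ours):

    def ge_operation_counts(n):
        # At step i there are (n - i) rows below the diagonal: one division each,
        # and (n - i) multiply-subtract pairs per row over columns i+1..n.
        divisions = sum(n - i for i in range(1, n))               # 1 + 2 + ... + (n-1) ~ n^2/2
        multiplications = sum((n - i) ** 2 for i in range(1, n))  # ~ n^3/3, matched by subtractions
        return divisions, multiplications

    d, m = ge_operation_counts(1000)
    print(d, 1000**2 / 2)           # ~4.995e5 vs 5.0e5
    print(2 * m, 2 * 1000**3 / 3)   # total flops ~ 2n^3/3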

Parallel GE

First step: 1-D block partitioning along columns - the n columns are divided into p contiguous blocks, one block of n/p columns per processor.

1-D block partitioning - Steps

1. Divisions: n^2/2
2. Broadcasts: x log p + y log(p-1) + z log(p-2) + ... + log 1 < n^2 log p
3. Multiplications and subtractions: (n-1)n/p + (n-2)n/p + ... + 1 x 1 = n^3/p (approx.)

Runtime: < n^2/2 + n^2 log p + n^3/p
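A minimal mpi4py sketch of these three steps (our own illustration, not the course's code): for simplicity every rank generates the same diagonally dominant matrix (so no pivoting is needed) and keeps only its block of n/p columns; a real code would distribute the data instead.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()
    n = 8 * p                                         # assume n divisible by p
    cols = range(rank * n // p, (rank + 1) * n // p)  # this rank's contiguous column block

    rng = np.random.default_rng(0)                    # same seed -> same matrix everywhere
    A = rng.random((n, n)) + n * np.eye(n)            # diagonally dominant
    Aloc = A[:, cols.start:cols.stop].copy()          # store only the local columns

    for i in range(n - 1):
        owner = i // (n // p)                         # rank holding column i
        if rank == owner:
            jl = i - cols.start
            Aloc[i+1:, jl] /= Aloc[i, jl]             # step 1: divisions (multipliers)
            m = Aloc[i+1:, jl].copy()
        else:
            m = None
        m = comm.bcast(m, root=owner)                 # step 2: broadcast the multipliers
        for jl, k in enumerate(cols):                 # step 3: update local columns right of i
            if k > i:
                Aloc[i+1:, jl] -= m * Aloc[i, jl]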

2-D block partitioning

Used to speed up the divisions: the matrix is partitioned among a P x Q grid of processors, so the column of multipliers is split across P processors instead of being computed serially by a single column owner.

2-D block partitioning - Steps

1. Broadcast of A(k,k): log Q
2. Divisions: n^2/Q (approx.)
3. Broadcast of multipliers: x log P + y log(P-1) + z log(P-2) + ... = (n^2/Q) log P
4. Multiplications and subtractions: n^3/PQ (approx.)

Problem with block partitioning for GE

- Once a block is finished, the corresponding processor remains idle for the rest of the execution
- Solution?

Onto cyclic

The block partitioning algorithms waste processor cycles - there is no load balancing throughout the algorithm. Moving to cyclic distributions (a mapping sketch follows below):

- cyclic: load balance
- 1-D block-cyclic: load balance and block operations, but column factorization is a bottleneck
- 2-D block-cyclic: has everything
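The difference between the distributions is easiest to see as a column-to-process map (a small sketch; the values of n, p, b are arbitrary):

    n, p, b = 16, 4, 2
    block        = [i // (n // p) for i in range(n)]  # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
    cyclic       = [i % p for i in range(n)]          # [0,1,2,3, 0,1,2,3, ...]
    block_cyclic = [(i // b) % p for i in range(n)]   # [0,0, 1,1, 2,2, 3,3, 0,0, ...]
    # With the block map, process 0 goes idle once column 3 has been eliminated;
    # the cyclic maps keep every process busy until the end.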

Block cyclic

- Having blocks in a processor enables block-based operations (block matrix multiply, etc.)
- Block-based operations lead to high performance
- Operations can be split into 3 categories based on the number of operations per memory reference
- These are referred to as BLAS levels

Basic Linear Algebra Subroutines (BLAS) - 3 levels of operations

The memory hierarchy is exploited most efficiently by the higher-level BLAS:

    Level                     Example operations                Memory refs   Flops   Flops / memory ref
    Level-1 (vector)          y = y + ax;  z = y . x            3n            2n      2/3
    Level-2 (matrix-vector)   y = y + Ax;  A = A + (alpha)xy^T  n^2           2n^2    2
    Level-3 (matrix-matrix)   C = C + AB                        4n^2          2n^3    n/2
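In NumPy terms, one representative operation from each level looks like this (a sketch; the ratios in the comments are the ones from the table above):

    import numpy as np

    n = 1024
    x, y = np.random.rand(n), np.random.rand(n)
    A, B, C = np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n)
    alpha = 2.0

    y = y + alpha * x                 # Level-1 axpy: 2n flops / 3n refs   -> ratio 2/3
    z = y @ x                         # Level-1 dot product
    y = y + A @ x                     # Level-2 gemv: 2n^2 flops / ~n^2 refs -> ratio 2
    A = A + alpha * np.outer(x, y)    # Level-2 rank-1 update (ger)
    C = C + A @ B                     # Level-3 gemm: 2n^3 flops / 4n^2 refs -> ratio n/2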

Gaussian Elimination - Review

Version 4 - store multipliers m below the diagonal (as above):

    for i = 1 to n-1
        for j = i+1 to n
            A(j, i) = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - A(j, i) * A(i, k)

What GE really computes:

    for i = 1 to n-1
        A(i+1:n, i) = A(i+1:n, i) / A(i, i)
        A(i+1:n, i+1:n) -= A(i+1:n, i) * A(i, i+1:n)

These are BLAS1 and BLAS2 operations.

[Figure: the finished part of the matrix holds the multipliers; the current step scales column A(i+1:n, i) and updates the trailing submatrix A(i+1:n, i+1:n) using row A(i, i+1:n)]
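The "what GE really computes" form translates directly into NumPy (a sketch of ours, still without pivoting): the column scaling is a BLAS1 operation and the trailing update is a BLAS2 rank-1 update.

    import numpy as np

    def lu_unblocked(A):
        """In-place LU without pivoting, written as GE really computes it."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # BLAS1: scale the column
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # BLAS2: rank-1 update
        return A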

Converting BLAS2 to BLAS3

- Use blocking to obtain optimized matrix multiplies (BLAS3)
- Perform the matrix multiplies via delayed updates: save several updates to the trailing matrix, then apply them together in the form of a single matrix multiply

Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)

    for ib = 1 to n-1 step b    /* process matrix b columns at a time */
        end = ib + b - 1        /* last column of the current block */
        /* Apply the BLAS2 version of GE to get A(ib:n, ib:end) factored.
           Let LL denote the strictly lower triangular portion of
           A(ib:end, ib:end) plus the identity (i.e., with unit diagonal). */
        A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)    /* update next b rows of U */
        A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                              - A(end+1:n, ib:end) * A(ib:end, end+1:n)
            /* apply delayed updates with a single matrix multiply */

[Figure: completed parts of L and U; the current panel A(ib:end, ib:end), the block column A(end+1:n, ib:end), and the trailing matrix A(end+1:n, end+1:n)]
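A NumPy sketch of the blocked algorithm (our own, without pivoting; b is the block size): the panel is factored with the unblocked routine, the next b rows of U come from a unit-lower-triangular solve with LL, and the trailing matrix receives one delayed matrix-multiply update.

    import numpy as np

    def lu_blocked(A, b=32):
        """Right-looking blocked LU (no pivoting): delayed rank-b updates via GEMM."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for ib in range(0, n - 1, b):
            end = min(ib + b, n)                   # panel covers columns ib .. end-1
            for i in range(ib, min(end, n - 1)):   # BLAS2 GE on the panel A[ib:n, ib:end]
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                for i in range(ib, end):           # A12 = LL^-1 A12 by forward substitution
                    A[i, end:] -= A[i, ib:i] @ A[ib:i, end:]
                # The delayed updates, applied with a single matrix multiply (BLAS3)
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A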

Parallel GE with 2-D block-cyclic and pivoting (Steps)

    for ib = 1 to n-1 step b
        end = min(ib+b-1, n)
        for i = ib to end
            (1) find pivot row k, column broadcast

Parallel GE with 2-D block-cyclic (Steps)

            (2) swap rows k and i in the block column, broadcast row k
            (3) A(i+1:n, i) /= A(i, i)

Parallel GE in ScaLAPACK (Steps)

            (4) A(i+1:n, i+1:end) -= A(i+1:n, i) * A(i, i+1:end)
        end for

Parallel GE in ScaLAPACK (Steps)

        (5) communicate swap information to the left and right
        (6) apply all row swaps to the other columns

Parallel GE in ScaLAPACK (Steps)

        (7) broadcast LL right
        (8) A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)

Parallel GE in ScaLAPACK (Steps)

        (9) broadcast A(ib:end, end+1:n) down
        (10) broadcast A(end+1:n, ib:end) right
        (11) eliminate A(end+1:n, end+1:n)

GE: Miscellaneous - GE with Partial Pivoting

1-D block-column partitioning: which is better, column pivoting or row pivoting?
- Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1). The exchange information is passed to the other processes by piggybacking on the multiplier broadcast.
- Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).

2-D block partitioning: the pivot search can be restricted to a limited number of columns.
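For reference, adding partial pivoting to the unblocked sketch means one column search and one row swap per step (a sketch of ours, sequential):

    import numpy as np

    def lu_partial_pivot(A):
        """Unblocked LU with partial pivoting: returns packed LU and the pivot order."""
        A = A.astype(float).copy()
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            k = i + np.argmax(np.abs(A[i:, i]))   # pivot search in column i: O(n - i)
            if k != i:
                A[[i, k], :] = A[[k, i], :]       # exchange entire rows i and k
                piv[[i, k]] = piv[[k, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A, piv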

Triangular Solve - Unit Upper Triangular Matrix

- Sequential complexity: O(n^2)
- Complexity of the parallel algorithm with 2-D block partitioning (P^0.5 x P^0.5 grid): O(n^2)/P^0.5
- Thus (parallel GE / parallel TS) < (sequential GE / sequential TS): the solve parallelizes less well than the factorization
- Overall (GE + TS): O(n^3/P)
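For reference, the sequential O(n^2) solve for a unit upper triangular system (a sketch; the function name is ours):

    import numpy as np

    def unit_upper_solve(U, b):
        """Back substitution for Ux = b with U unit upper triangular: O(n^2) work."""
        n = len(b)
        x = np.asarray(b, dtype=float).copy()
        for i in range(n - 1, -1, -1):
            x[i] -= U[i, i+1:] @ x[i+1:]   # diagonal is 1, so no division needed
        return x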

2-D partitioning with pipelining - demonstration of the advancing front

[Figure: animation frames showing communication and computation steps advancing diagonally across the processor grid]

2-D Partitioning with Pipelining

- Communication and computation move across the matrix as a "front"
- The completion time is only O(n)
- This "front" or wave propagation is used in other algorithms as well

Parallel solution of linear systems with special matrices: Tridiagonal Matrices

    | a1 h1             |   | x1 |   | b1 |
    | g2 a2 h2          |   | x2 |   | b2 |
    |    g3 a3 h3       | * | x3 | = | b3 |
    |        ...        |   | .. |   | .. |
    |          gn an    |   | xn |   | bn |

In general, row i reads: g_i x_(i-1) + a_i x_i + h_i x_(i+1) = b_i

Substituting for x_(i-1) and x_(i+1) in terms of {x_(i-2), x_i} and {x_i, x_(i+2)} respectively gives:

    G_i x_(i-2) + A_i x_i + H_i x_(i+2) = B_i

Tridiagonal Matrices

After the substitution, each row i couples x_(i-2), x_i, and x_(i+2):

    | A1    H1             |   | x1 |   | B1 |
    |    A2    H2          |   | x2 |   | B2 |
    | G3    A3    H3       | * | x3 | = | B3 |
    |    G4    A4    H4    |   | .. |   | .. |
    |         ...          |   | .. |   | .. |
    |          Gn       An |   | xn |   | Bn |

Reordering (grouping the even- and odd-indexed unknowns):

Tridiagonal Matrices

After reordering, the system decouples into two independent tridiagonal systems of half the size - one over the even-indexed unknowns and one over the odd-indexed unknowns:

    | A2 H2          |   | x2 |   | B2 |        | A1 H1             |   | x1   |   | B1   |
    | G4 A4 H4       |   | x4 |   | B4 |        | G3 A3 H3          |   | x3   |   | B3   |
    |     ...        | * | .. | = | .. |  and   |       ...         | * | ..   | = | ..   |
    |       Gn An    |   | xn |   | Bn |        |      Gn-1 An-1    |   | xn-1 |   | Bn-1 |

Tridiagonal Systems

- Thus the problem of size n has been split into independent even and odd subproblems of size n/2
- This is odd-even reduction
- For parallelization, each process can divide the problem into subproblems of smaller size and solve the subproblems recursively
- This is the divide-and-conquer technique
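A recursive sketch of odd-even (cyclic) reduction (our own illustration; arrays use 0-based indexing with the boundary conventions g[0] = h[n-1] = 0):

    import numpy as np

    def odd_even_reduction(g, a, h, b):
        """Solve a tridiagonal system by odd-even reduction.
        Row i: g[i]*x[i-1] + a[i]*x[i] + h[i]*x[i+1] = b[i], with g[0] = h[n-1] = 0."""
        n = len(a)
        if n == 1:
            return np.array([b[0] / a[0]])
        ev = range(0, n, 2)                       # even-indexed equations survive
        G = np.zeros(len(ev)); H = np.zeros(len(ev))
        A = np.array(a, dtype=float)[::2].copy()
        B = np.array(b, dtype=float)[::2].copy()
        for k, i in enumerate(ev):
            if i > 0:                             # eliminate x[i-1] using row i-1
                al = -g[i] / a[i - 1]
                G[k] = al * g[i - 1]
                A[k] += al * h[i - 1]
                B[k] += al * b[i - 1]
            if i + 1 < n:                         # eliminate x[i+1] using row i+1
                be = -h[i] / a[i + 1]
                H[k] = be * h[i + 1]
                A[k] += be * g[i + 1]
                B[k] += be * b[i + 1]
        x = np.zeros(n)
        x[::2] = odd_even_reduction(G, A, H, B)   # recurse on the half-size system
        for i in range(1, n, 2):                  # back-substitute the odd unknowns
            nxt = h[i] * x[i + 1] if i + 1 < n else 0.0
            x[i] = (b[i] - g[i] * x[i - 1] - nxt) / a[i]
        return x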

Tridiagonal Systems - Parallelization

- At each stage, one representative process of the domain of processes is chosen
- This representative performs the odd-even reduction of a problem of size i into two problems of size i/2
- The two subproblems are then distributed to 2 representatives

[Figure: recursion tree of problem sizes n -> n/2 -> n/4 -> n/8]