Parallel Matrix Multiply

Slides:

Advertisements

Similar presentations

Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Advertisements

Parallel Matrix Operations using MPI CPS 5401 Fall 2014 Shirley Moore, Instructor November 3,

Numerical Algorithms ITCS 4/5145 Parallel Computing UNC-Charlotte, B. Wilkinson, 2009.

Numerical Algorithms Matrix multiplication

CS 140 : Matrix multiplication Linear algebra problems Matrix multiplication I : cache issues Matrix multiplication II: parallel issues Thanks to Jim Demmel.

CSE5304—Project Proposal Parallel Matrix Multiplication Tian Mi.

9/12/2007CS194 Lecture1 Shared Memory Hardware: Case Study in Matrix Multiplication Kathy Yelick

Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers Chapter 11: Numerical Algorithms Sec 11.2: Implementing.

CS 240A : Matrix multiplication Matrix multiplication I : parallel issues Matrix multiplication II: cache issues Thanks to Jim Demmel and Kathy Yelick.

Numerical Algorithms • Matrix multiplication

CS267 L20 Dense Linear Algebra II.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 20: Dense Linear Algebra - II James Demmel

1 Friday, October 20, 2006 “Work expands to fill the time available for its completion.” -Parkinson’s 1st Law.

02/21/2007CS267 Lecture DLA11 CS 267 Dense Linear Algebra: Parallel Matrix Multiplication James Demmel

CS267 Dense Linear Algebra I.1 Demmel Fa 2002 CS 267 Applications of Parallel Computers Dense Linear Algebra James Demmel

02/09/2006CS267 Lecture 81 CS 267 Dense Linear Algebra: Parallel Matrix Multiplication James Demmel

CS267 Dense Linear Algebra I.1 Demmel Fa 2001 CS 267 Applications of Parallel Computers Dense Linear Algebra James Demmel

Chapter 5, CLR Textbook Algorithms on Grids of Processors.

Topic Overview One-to-All Broadcast and All-to-One Reduction

Design of parallel algorithms Matrix operations J. Porras.

Dense Matrix Algorithms CS 524 – High-Performance Computing.

1 Matrix Addition, C = A + B Add corresponding elements of each matrix to form elements of result matrix. Given elements of A as a i,j and elements of.

CS 240A: Complexity Measures for Parallel Computation.

1 Tuesday, October 31, 2006 “Data expands to fill the space available for storage.” -Parkinson’s Law.

Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan.

Basic Communication Operations Based on Chapter 4 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis and Vipin Kumar These.

Computer Science and Engineering Parallel and Distributed Processing CSE 8380 February 8, 2005 Session 8.

CSE 260 – Parallel Processing UCSD Fall 2006 A Performance Characterization of UPC Presented by – Anup Tapadia Fallon Chen.

1 Minimizing Communication in Numerical Linear Algebra Case Study: Matrix Multiply Jim Demmel EECS & Math Departments, UC Berkeley.

CS 140 : Matrix multiplication Warmup: Matrix times vector: communication volume Matrix multiplication I: parallel issues Matrix multiplication II: cache.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M

Matrix Multiply Methods. Some general facts about matmul High computation to communication hides multitude of sins Many “symmetries” meaning that the.

CS240A: Conjugate Gradients and the Model Problem.

3/02/2010CS267 Lecture 131 CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication James Demmel

Inverse of a Matrix Multiplicative Inverse of a Matrix For a square matrix A, the inverse is written A -1. When A is multiplied by A -1 the result is the.

Programming for Performance CS433 Spring 2001 Laxmikant Kale.

Empirical Optimization. Context: HPC software Traditional approach  Hand-optimized code: (e.g.) BLAS  Problem: tedious to write by hand Alternatives:

Complexity Measures for Parallel Computation. Problem parameters: nindex of problem size pnumber of processors Algorithm parameters: t p running time.

PARALLEL COMPUTATION FOR MATRIX MULTIPLICATION Presented By:Dima Ayash Kelwin Payares Tala Najem.

09/13/2012CS4230 CS4230 Parallel Programming Lecture 8: Dense Linear Algebra and Locality Optimizations Mary Hall September 13,

Parallel and Distributed Simulation Deadlock Detection & Recovery: Performance Barrier Mechanisms.

Programming for Performance Laxmikant Kale CS 433.

CSc 8530 Matrix Multiplication and Transpose By Jaman Bhola.

Numerical Algorithms Chapter 11.

Optimizing Cache Performance in Matrix Multiplication

Optimizing Cache Performance in Matrix Multiplication

CS 267 Dense Linear Algebra: Parallel Matrix Multiplication

Parallel Programming in C with MPI and OpenMP

James Demmel CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication James Demmel

BLAS: behind the scenes

Parallel Matrix Multiplication and other Full Matrix Algorithms

Kathy Yelick CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply Kathy Yelick

Complexity Measures for Parallel Computation

James Demmel CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication James Demmel

Unit-2 Divide and Conquer

CS 140 : Matrix multiplication

Parallel Matrix Operations

Numerical Algorithms • Parallelizing matrix multiplication

Optimizing MMM & ATLAS Library Generator

James Demmel CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication James Demmel

Parallel Matrix Multiplication and other Full Matrix Algorithms

James Demmel CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication James Demmel

Parallel Programming in C with MPI and OpenMP

MATLAB HPCS Extensions

Complexity Measures for Parallel Computation

Matrix Addition and Multiplication

To accompany the text “Introduction to Parallel Computing”,

Matrix Addition, C = A + B Add corresponding elements of each matrix to form elements of result matrix. Given elements of A as ai,j and elements of B as.

Parallel Programming in C with MPI and OpenMP

CS 140 : Matrix multiplication

Presentation transcript:

Parallel Matrix Multiply Computing C = C + A*B Using basic algorithm: 2*n3 flops Variables are: Data layout Structure of communication Schedule of communication Simple model for analyzing algorithm performance: tcomm = communication time = tstartup + (#words * tdata) = latency + ( #words * 1/(words/second) ) = a + w*b

Latency Bandwidth Model Network of fixed number p of processors fully connected each with local memory Latency (a) accounts for varying performance with number of messages Inverse bandwidth (b) accounts for performance varying with volume of data Parallel efficiency: e(p) = serial time / (p * parallel time) perfect speedup  e(p) = 1

MatMul with 2D Layout Consider processors in 2D grid (physical or logical) Processors can send along rows and columns Assume p is square s x s grid p(0,0) p(0,1) p(0,2) p(0,0) p(0,1) p(0,2) p(0,0) p(0,1) p(0,2) = * p(1,0) p(1,1) p(1,2) p(1,0) p(1,1) p(1,2) p(1,0) p(1,1) p(1,2) p(2,0) p(2,1) p(2,2) p(2,0) p(2,1) p(2,2) p(2,0) p(2,1) p(2,2)

SUMMA Algorithm SUMMA = Scalable Universal Matrix Multiply Slightly less efficient than Cannon, but simpler and easier to generalize Presentation from van de Geijn and Watts www.netlib.org/lapack/lawns/lawn96.ps Similar ideas appeared many times Used in practice in PBLAS = Parallel BLAS www.netlib.org/lapack/lawns/lawn100.ps

SUMMA * k is a single row or column or a block of b rows or columns J B(k,J) k * = I C(I,J) A(I,k) I, J represent all rows, columns owned by a processor k is a single row or column or a block of b rows or columns C(I,J) = C(I,J) + Sk A(I,k)*B(k,J) Assume a pr by pc processor grid (pr = pc = 4 above) Need not be square

SUMMA * = For k=0 to n-1 … or n/b-1 where b is the block size J B(k,J) k * = I C(I,J) A(I,k) For k=0 to n-1 … or n/b-1 where b is the block size … = # cols in A(I,k) and # rows in B(k,J) for all I = 1 to pr … in parallel owner of A(I,k) broadcasts it to whole processor row for all J = 1 to pc … in parallel owner of B(k,J) broadcasts it to whole processor column Receive A(I,k) into Acol Receive B(k,J) into Brow C( myproc , myproc ) = C( myproc , myproc) + Acol * Brow

SUMMA performance To simplify analysis only, assume s = sqrt(p) For k=0 to n/b-1 for all I = 1 to s … s = sqrt(p) owner of A(I,k) broadcasts it to whole processor row … time = log s *( a + b * b*n/s), using a tree for all J = 1 to s owner of B(k,J) broadcasts it to whole processor column Receive A(I,k) into Acol Receive B(k,J) into Brow C( myproc , myproc ) = C( myproc , myproc) + Acol * Brow … time = 2*(n/s)2*b Total time = 2*n3/p + a * log p * n/b + b * log p * n2 /s

SUMMA performance Total time = 2*n3/p + a * log p * n/b + b * log p * n2 /s Parallel Efficiency = 1/(1 + a * log p * p / (2*b*n2) + b * log p * s/(2*n) ) ~Same b term as Cannon, except for log p factor log p grows slowly so this is ok Latency (a) term can be larger, depending on b When b=1, get a * log p * n As b grows to n/s, term shrinks to a * log p * s (log p times Cannon) Temporary storage grows like 2*b*n/s Can change b to tradeoff latency cost with memory