Parallel Matrix Multiply

Parallel Matrix Multiply
- Computing C = C + A*B
- Using the basic algorithm: 2*n^3 flops
- Variables are:
  - Data layout
  - Structure of communication
  - Schedule of communication
- Simple model for analyzing algorithm performance:
  tcomm = communication time
        = tstartup + (#words * tdata)
        = latency + #words / bandwidth
        = α + w*β
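
As a minimal illustration of the cost model above, the sketch below evaluates α + w*β for two message sizes; the α and β values are made-up placeholders, not measurements of any machine.

```c
/* Minimal sketch of the latency-bandwidth cost model t_comm = alpha + w*beta.
   The alpha/beta values below are illustrative placeholders only. */
#include <stdio.h>

static double comm_time(double alpha, double beta, double nwords) {
    return alpha + nwords * beta;   /* alpha + w*beta */
}

int main(void) {
    double alpha = 1e-6;   /* assumed latency per message, seconds */
    double beta  = 1e-9;   /* assumed time per word, seconds (1/bandwidth) */
    printf("1 word:    %g s\n", comm_time(alpha, beta, 1.0));
    printf("1e6 words: %g s\n", comm_time(alpha, beta, 1e6));
    return 0;
}
```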

Latency-Bandwidth Model
- Network of a fixed number p of processors, fully connected, each with local memory
- Latency (α) accounts for performance varying with the number of messages
- Inverse bandwidth (β) accounts for performance varying with the volume of data
- Parallel efficiency: e(p) = serial time / (p * parallel time)
  - Perfect speedup: e(p) = 1
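
A correspondingly small sketch of the efficiency definition; the timings are illustrative assumptions, not measurements.

```c
/* Sketch of parallel efficiency e(p) = serial_time / (p * parallel_time).
   The timings below are made-up examples. */
#include <stdio.h>

static double efficiency(double serial_time, int p, double parallel_time) {
    return serial_time / (p * parallel_time);
}

int main(void) {
    double t1 = 100.0;   /* assumed serial time, seconds */
    printf("ideal:    e = %.2f\n", efficiency(t1, 16, t1 / 16.0));        /* 1.00 */
    printf("overhead: e = %.2f\n", efficiency(t1, 16, t1 / 16.0 + 2.0));  /* < 1  */
    return 0;
}
```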

MatMul with 2D Layout
- Consider processors in a 2D grid (physical or logical)
- Processors can send along rows and columns
- Assume p is a perfect square, arranged as an s x s grid
- [Figure: C = A * B, with each matrix block-distributed over the same 3x3 grid of processors p(0,0) … p(2,2); processor p(I,J) owns block (I,J) of C, A, and B]
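
One way to realize such a logical grid with MPI is to split the world communicator into row and column communicators, so each process can later broadcast along its row or column. The sketch below is a minimal version of that setup; the square-grid check and all variable names are illustrative, not taken from the slides.

```c
/* Sketch: map MPI ranks onto a logical s x s grid and build row/column
   communicators. Assumes a perfect-square number of processes. */
#include <math.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int s = (int)(sqrt((double)p) + 0.5);
    if (s * s != p) {
        if (rank == 0) fprintf(stderr, "need a square number of processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int myrow = rank / s;   /* grid coordinates of this process */
    int mycol = rank % s;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);  /* same grid row    */
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);  /* same grid column */

    printf("rank %d -> p(%d,%d)\n", rank, myrow, mycol);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```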

SUMMA Algorithm
- SUMMA = Scalable Universal Matrix Multiply
- Slightly less efficient than Cannon's algorithm, but simpler and easier to generalize
- Presentation from van de Geijn and Watts: www.netlib.org/lapack/lawns/lawn96.ps
- Similar ideas appeared many times
- Used in practice in PBLAS = Parallel BLAS: www.netlib.org/lapack/lawns/lawn100.ps

SUMMA
- [Figure: block C(I,J) of C is updated by multiplying block column A(I,k) of A with block row B(k,J) of B]
- I, J represent all rows and columns owned by a processor
- k is a single row or column, or a block of b rows or columns
- C(I,J) = C(I,J) + Σ_k A(I,k)*B(k,J)
- Assume a pr by pc processor grid (pr = pc = 4 in the figure); the grid need not be square
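
The update C(I,J) = C(I,J) + Σ_k A(I,k)*B(k,J) is ordinary blocked matrix multiply; the serial reference below spells out that summation. Row-major storage and a block size b that divides n are assumptions made only for this sketch.

```c
/* Serial reference for the blocked update C(I,J) += sum_k A(I,k)*B(k,J).
   Matrices are n x n, row-major; b is the block size (assumed to divide n).
   This only illustrates the summation the parallel algorithm distributes;
   it is not a tuned kernel. */
void blocked_matmul(int n, int b, const double *A, const double *B, double *C) {
    for (int I = 0; I < n; I += b)              /* block row of C    */
        for (int J = 0; J < n; J += b)          /* block column of C */
            for (int k = 0; k < n; k += b)      /* the sum over k    */
                for (int i = I; i < I + b; i++)
                    for (int j = J; j < J + b; j++) {
                        double cij = C[i * n + j];
                        for (int kk = k; kk < k + b; kk++)
                            cij += A[i * n + kk] * B[kk * n + j];
                        C[i * n + j] = cij;
                    }
}
```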

SUMMA
- [Figure: C(I,J) is updated with A(I,k)*B(k,J), as on the previous slide]

For k = 0 to n-1 … or n/b-1, where b is the block size
                             … = # cols in A(I,k) and # rows in B(k,J)
    for all I = 1 to pr … in parallel
        owner of A(I,k) broadcasts it to whole processor row
    for all J = 1 to pc … in parallel
        owner of B(k,J) broadcasts it to whole processor column
    Receive A(I,k) into Acol
    Receive B(k,J) into Brow
    C(myproc, myproc) = C(myproc, myproc) + Acol * Brow
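
A hedged MPI sketch of this loop is given below. It assumes a square s x s grid with the row/column communicators from the grid-setup sketch above, n divisible by s, a block size b dividing n/s, and block-distributed local matrices Aloc, Bloc, Cloc; all names are made up for the example, and this is not PBLAS code.

```c
/* Sketch of the SUMMA loop with MPI. Assumptions:
   - square s x s process grid; row_comm/col_comm built as in the grid-setup
     sketch, so a process's rank in row_comm is mycol and in col_comm is myrow;
   - n divisible by s, block size b dividing nloc = n/s;
   - each process owns nloc x nloc row-major blocks Aloc, Bloc, Cloc;
   - Acol and Brow are caller-provided buffers of nloc*b doubles each. */
#include <mpi.h>
#include <string.h>

void summa(int n, int b, int s, int myrow, int mycol,
           MPI_Comm row_comm, MPI_Comm col_comm,
           const double *Aloc, const double *Bloc, double *Cloc,
           double *Acol, double *Brow)
{
    int nloc = n / s;                    /* side of each local block */
    for (int k = 0; k < n; k += b) {
        int owner = k / nloc;            /* grid column owning A(:,k..k+b-1);
                                            same index owns B(k..k+b-1,:) as a grid row */
        int koff = k % nloc;

        /* Owner of A(I,k) packs its b local columns into Acol, then
           broadcasts Acol along its processor row. */
        if (mycol == owner)
            for (int i = 0; i < nloc; i++)
                memcpy(&Acol[i * b], &Aloc[i * nloc + koff], b * sizeof(double));
        MPI_Bcast(Acol, nloc * b, MPI_DOUBLE, owner, row_comm);

        /* Owner of B(k,J) packs its b local rows into Brow, then
           broadcasts Brow along its processor column. */
        if (myrow == owner)
            memcpy(Brow, &Bloc[koff * nloc], b * nloc * sizeof(double));
        MPI_Bcast(Brow, b * nloc, MPI_DOUBLE, owner, col_comm);

        /* Local rank-b update: Cloc += Acol (nloc x b) * Brow (b x nloc). */
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++) {
                double cij = Cloc[i * nloc + j];
                for (int kk = 0; kk < b; kk++)
                    cij += Acol[i * b + kk] * Brow[kk * nloc + j];
                Cloc[i * nloc + j] = cij;
            }
    }
}
```

The two broadcasts per step are the only communication; everything else is a local rank-b update, which could be handed to a local BLAS routine instead of the inner loops shown here.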

SUMMA Performance
- To simplify the analysis only, assume s = sqrt(p)

For k = 0 to n/b-1
    for all I = 1 to s … s = sqrt(p)
        owner of A(I,k) broadcasts it to whole processor row
        … time = log s * (α + β * b*n/s), using a tree
    for all J = 1 to s
        owner of B(k,J) broadcasts it to whole processor column
    Receive A(I,k) into Acol
    Receive B(k,J) into Brow
    C(myproc, myproc) = C(myproc, myproc) + Acol * Brow
        … time = 2*(n/s)^2*b

Total time = 2*n^3/p + α * log p * n/b + β * log p * n^2/s
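
The sketch below evaluates this total-time model for a few processor counts; n, b, α, and β are illustrative assumptions, expressed in flop-times so that the 2*n^3/p term is directly comparable.

```c
/* Evaluate the SUMMA time model from the slide,
     T(n,p,b) = 2*n^3/p + alpha*log2(p)*n/b + beta*log2(p)*n^2/sqrt(p),
   for a few processor counts. alpha (latency) and beta (inverse bandwidth)
   are in flop-time units; all values are illustrative, not measured. */
#include <math.h>
#include <stdio.h>

static double summa_time(double n, double p, double b, double alpha, double beta) {
    return 2.0 * n * n * n / p
         + alpha * log2(p) * n / b
         + beta  * log2(p) * n * n / sqrt(p);
}

int main(void) {
    double n = 4096, b = 64, alpha = 1e4, beta = 10;
    for (double p = 4; p <= 1024; p *= 4)
        printf("p = %4.0f  T = %.3g flop-times\n", p, summa_time(n, p, b, alpha, beta));
    return 0;
}
```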

SUMMA Performance
- Total time = 2*n^3/p + α * log p * n/b + β * log p * n^2/s
- Parallel efficiency = 1 / (1 + α * log p * p/(2*b*n^2) + β * log p * s/(2*n))
- ~Same β term as Cannon, except for the log p factor
  - log p grows slowly, so this is OK
- Latency (α) term can be larger, depending on b
  - When b = 1, it is α * log p * n
  - As b grows to n/s, the term shrinks to α * log p * s (log p times Cannon's)
- Temporary storage grows like 2*b*n/s
- Can change b to trade off latency cost against memory
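
A short sketch of the last point, the latency/memory tradeoff as b varies; as before, the numbers are made up for illustration.

```c
/* Sweep the block size b to show the tradeoff on the slide: the latency
   term alpha*log2(p)*n/b shrinks from alpha*log2(p)*n (b = 1) down to
   alpha*log2(p)*s (b = n/s), while per-process temporary storage grows
   like 2*b*n/s words. alpha, n, and p are assumed values. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 4096, p = 64, alpha = 1e4;   /* illustrative, flop-time units */
    double s = sqrt(p);
    for (double b = 1; b <= n / s; b *= 8) {
        double latency_term = alpha * log2(p) * n / b;
        double temp_words   = 2.0 * b * n / s;
        printf("b = %4.0f  latency term = %10.3g  temp storage = %6.0f words\n",
               b, latency_term, temp_words);
    }
    return 0;
}
```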