CS 267 Applications of Parallel Processors
Lecture 13: Parallel Matrix Multiply
Kathy Yelick
http://www.cs.berkeley.edu/~dmartin/cs267

Outline
- Recap
- Sources of large dense linear systems
- BLAS for linear algebra
- Parallel matrix multiply

Model overview
- Work-depth
- PRAM
- Latency/bandwidth model: α is the one-time cost per message (latency), β is the per-byte cost of communication. We use this model today.
- LogP model (correction: the gap should be greater than the overhead; more on this with parallel sorting)
- Topology-specific models

Dense Linear Algebra in Electromagnetics

Computational Electromagnetics
- Developed during the 1980s, driven by defense applications: determine the RCS (radar cross section) of an airplane, reduce the signature of a plane (stealth technology).
- Other applications are antenna design and medical equipment.
- Two fundamental numerical approaches: MOM, the method of moments (frequency domain), and finite differences (time domain).

Computational Electromagnetics
- Discretize the surface into triangular facets using standard modeling tools.
- The amplitudes of the currents on the surface are the unknowns.
- The integral equation is discretized into a set of linear equations.
(image: Northwestern Univ. Computational Electromagnetics Laboratory, http://nueml.ece.nwu.edu/)

Computational Electromagnetics (MOM)
After discretization, the integral equation has the form Z J = V, where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.
(see Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta," IEEE Supercomputing '92, pp. 538-542)

Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations (RHS)
D) computing the fields scattered from the object

Analysis of MOM for Parallel Implementation

Task         Work     Parallelism        Parallel Speed
Fill         O(n^2)   embarrassing       low
Factor       O(n^3)   moderately diff.   very high
Solve        O(n^2)   moderately diff.   high
Field Calc.  O(n)     embarrassing       high

For most scientific applications the biggest gain in performance is from parallelism within each task.

Results for Parallel Implementation on Delta

Task         Time (hours)
Fill         9.20
Factor       8.25
Solve        2.17
Field Calc.  0.12

The problem solved was for a matrix of size 48,672. (The world record in 1991.)

Current Records for Solving Dense Systems

Year    System Size   Machine
1950's  O(100)
1991    55,296        CM-2
1992    75,264        Intel
1993    75,264        Intel
1994    76,800        CM-5
1995    128,600       Paragon XP
1996    215,000       ASCI Red (Tflop)

source: Alan Edelman, http://www-math.mit.edu/~edelman/records.html

Sources of large dense linear systems
- Not many basic factorizations outside CEM.
- Large dense eigenproblems are used in chemistry.
- Alternatives are often debated: the choice of algorithm in existing codes is not always the result of careful planning and design. It may reflect the state of the art at the time, or may be purely coincidental.

Solving Large Dense Linear Systems
Gaussian elimination to solve Ax=b, where A is a dense matrix:
- Add multiples of each row to subsequent rows in order to create zeros below the diagonal.
- End up with an upper triangular matrix U.
- Solve the linear system with U by back substitution, starting with the last variable.
Solving these systems uses basic vector and matrix operations called the BLAS.
(see Demmel, http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html)
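A minimal sequential sketch of this process in NumPy (no pivoting, for illustration only; a real implementation would use partial pivoting and the BLAS, as discussed below):

    import numpy as np

    def gaussian_eliminate(A, b):
        """Reduce Ax=b to Ux=c by adding row multiples to zero out entries below the diagonal."""
        A, b = A.astype(float).copy(), b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):             # for each column k
            for i in range(k + 1, n):      # eliminate entries below the diagonal
                m = A[i, k] / A[k, k]      # no pivoting: assumes A[k, k] != 0
                A[i, k:] -= m * A[k, k:]
                b[i] -= m * b[k]
        return A, b                        # upper triangular U and updated right-hand side c

    def back_substitute(U, c):
        """Solve Ux=c by substitution, starting with the last variable."""
        n = len(c)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (c[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x

    A = np.array([[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    U, c = gaussian_eliminate(A, b)
    print(np.allclose(A @ back_substitute(U, c), b))   # True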

Parallel Matrix Multiply

Parallel Matrix Multiply
Computing C = C + A*B
Using the basic algorithm: 2*n^3 flops
Variables are:
- Data layout
- Topology of machine
- Scheduling of communication
Use of performance models for algorithm design
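For reference, a NumPy sketch of the basic sequential algorithm being parallelized; the triple loop performs n^3 multiply-add pairs, i.e. 2*n^3 flops:

    import numpy as np

    def matmul_basic(A, B, C):
        """C = C + A*B by the basic triple loop: 2*n^3 flops for n-by-n matrices."""
        n = A.shape[0]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]   # one multiply + one add
        return C

    n = 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))
    print(np.allclose(matmul_basic(A, B, C), A @ B))   # True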

1D Layout
- Assume matrices are n x n and n is divisible by p.
- A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i)).
- B(i,j) is the n/p by n/p subblock of B(i) in rows j*n/p through (j+1)*n/p.
(figure: the block columns owned by processors p0 through p7)

Matrix Multiply: 1D Layout on Bus
The algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + Σ_{j=0}^{p-1} A(j)*B(j,i)
- First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time.
- Second, consider a bus-connected machine with broadcast: one processor may send to many in a single step.
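A small serial NumPy check of this block formula under the 1D layout (the block-indexing helpers below are my own convention for illustration, not part of the lecture's notation):

    import numpy as np

    n, p = 8, 4                      # n divisible by p
    w = n // p                       # width of each block column
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))

    def col_block(M, i):             # M(i): the n by n/p block column "owned" by processor i
        return M[:, i*w:(i+1)*w]

    def sub_block(M, i, j):          # the n/p by n/p subblock of M(i) lying in row block j
        return M[j*w:(j+1)*w, i*w:(i+1)*w]

    # Each "processor" i accumulates C(i) += A(j) * (subblock of B(i) in row block j), summed over j.
    for i in range(p):
        for j in range(p):
            C[:, i*w:(i+1)*w] += col_block(A, j) @ sub_block(B, i, j)

    print(np.allclose(C, A @ B))     # True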

MatMul on 1D Bus without Broadcast
Naïve algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j    // message passing
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
            barrier
Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: α + β*n^2/p    // approximately

Naïve MatMul (continued)
Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: α + β*n^2/p    // approximately
Only one pair of processors (i and j) is active on any iteration, and of those, only one is doing computation => the algorithm is almost entirely serial.
Running time:
    (p*(p-1) + 1)*computation + p*(p-1)*communication
    ≈ 2*n^3 + p^2*α + p*n^2*β
This is worse than the serial time and grows with p.

Better MatMul on a Bus
Remove the barrier and send A(i) multiple times:
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to myproc-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
    for i = 0 to p-1 except myproc
        send A(myproc) to processor i
    for i = myproc+1 to p-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
The program is indeterminate: sends/receives may happen in different orders.

Performance of “Better” Algorithm
Intuitively, if a computation step is sufficiently large compared to communication, efficiency will be good.
    communication: α + n*(n/p)*β
    computation: 2*n^3/p^2
Assume computation > communication.
In the timeline on the next slide, i-j represents communication from processor i to processor j, and iC represents a computation step by processor i.

Performance of “Better” Algorithm
Timeline of execution:
(timeline figure: 0C, 0-1, 0-2, 0-3, ..., 0-(p-1), overlapped with 1C, 1-0, 2C, 2-0, ..., (p-1)C)
- If computation <= (p-2)*communication, there are no bubbles in the pipeline, and
    Time = p*(p-1)*communication + 2*computation
    Time <= (p^2 + p - 4)*communication
- If communication = computation/(p-2), this is close to the lower bound of 2*n^3/p.
- If communication is faster, there is only a small impact on performance.

MatMul: 1D Layout and Broadcast
Modify the previous algorithm so that each send of A(i) is a broadcast (assumed to take 1 communication step).
    Time is now 2*n^3/p + p*α + n^2*β
Again, we require n >> p for good efficiency.
p times less communication time; broadcast helps performance (as expected).
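A quick sketch of this time model as a function (the machine parameters used below are illustrative placeholders, not measurements of any real machine):

    def time_1d_broadcast(n, p, alpha, beta, flop_time):
        """Model estimate: 2*n^3/p flops, p messages costing alpha, n^2 words costing beta."""
        return 2 * n**3 / p * flop_time + p * alpha + n**2 * beta

    # Illustrative parameters only: 10 ns/flop, 10 us latency, 10 ns/word.
    for p in (4, 16, 64):
        t = time_1d_broadcast(n=4096, p=p, alpha=1e-5, beta=1e-8, flop_time=1e-8)
        print(p, round(t, 3), "seconds (model estimate)")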

MatMul with 2D Layout
Consider the processors in a 2D grid (physical or logical):
    p(0,0) p(0,1) p(0,2)
    p(1,0) p(1,1) p(1,2)
    p(2,0) p(2,1) p(2,2)

Cannon’s Algorithm
    C(i,j) = C(i,j) + Σ_{k=0}^{s-1} A(i,k)*B(k,j)
Algorithm (s = sqrt(p)):
    for i = 0 to s-1            // skew A
        left-circular-shift row i of A by i
        so that A(i,j) is overwritten by A(i, (j+i) mod s)
    for i = 0 to s-1            // skew B
        up-circular-shift column i of B by i
        so that B(i,j) is overwritten by B((i+j) mod s, j)
    for k = 0 to s-1
        for i = 0 to s-1 and j = 0 to s-1
            C(i,j) = C(i,j) + A(i,j)*B(i,j)
        left-circular-shift each row of A by 1
        up-circular-shift each column of B by 1
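A serial NumPy simulation of Cannon's algorithm on an s-by-s block grid (np.roll stands in for the circular shifts that would be inter-processor communication; the matrix and grid sizes are arbitrary examples):

    import numpy as np

    def cannon(A, B, s):
        """Simulate Cannon's algorithm: A, B are n-by-n with n divisible by s."""
        n = A.shape[0]
        w = n // s
        # View the matrices as s-by-s grids of w-by-w blocks: Xb[i, j] is block (i, j).
        Ab = A.reshape(s, w, s, w).transpose(0, 2, 1, 3).copy()
        Bb = B.reshape(s, w, s, w).transpose(0, 2, 1, 3).copy()
        Cb = np.zeros_like(Ab)
        # Skew: shift row i of A left by i, column j of B up by j.
        for i in range(s):
            Ab[i] = np.roll(Ab[i], -i, axis=0)
        for j in range(s):
            Bb[:, j] = np.roll(Bb[:, j], -j, axis=0)
        # s rounds of local multiply-accumulate followed by unit circular shifts.
        for _ in range(s):
            for i in range(s):
                for j in range(s):
                    Cb[i, j] += Ab[i, j] @ Bb[i, j]
            Ab = np.roll(Ab, -1, axis=1)    # shift each row of A left by 1
            Bb = np.roll(Bb, -1, axis=0)    # shift each column of B up by 1
        return Cb.transpose(0, 2, 1, 3).reshape(n, n)

    n, s = 6, 3
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    print(np.allclose(cannon(A, B, s), A @ B))   # True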

Communication in Cannon

BLAS Review

Fast linear algebra kernels: BLAS
- Simple linear algebra kernels such as matrix-matrix multiply.
- More complicated algorithms can be built from these basic kernels.
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subprograms (BLAS).
- Early agreement on a standard interface (~1980) led to portable libraries for vector and shared memory parallel machines.
- On distributed memory, there is a less-standard interface called the PBLAS.

Level 1 BLAS
Operate on vectors or pairs of vectors; perform O(n) operations; return either a vector or a scalar.
- saxpy: y(i) = a * x(i) + y(i), for i = 1 to n. The s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex.
- sscal: y = a * x, for scalar a and vectors x, y.
- sdot: computes s = Σ_{i=1}^{n} x(i)*y(i).
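For concreteness, a NumPy sketch of the same Level 1 operations (plain array code standing in for the BLAS routines named in the comments; the vector length is an arbitrary example):

    import numpy as np

    n, a = 5, 3.0
    x, y = np.random.rand(n), np.random.rand(n)

    y = a * x + y          # saxpy/daxpy: y <- a*x + y (O(n) work)
    x = a * x              # sscal/dscal-style scaling by a scalar
    s = np.dot(x, y)       # sdot/ddot: s = sum_i x(i)*y(i)
    print(s)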

Level 2 BLAS
Operate on a matrix and a vector; return a matrix or a vector; O(n^2) operations.
- sgemv: matrix-vector multiply, y = y + A*x, where A is m-by-n, x is n-by-1, and y is m-by-1.
- sger: rank-one update, A = A + y*x', i.e., A(i,j) = A(i,j) + y(i)*x(j), where A is m-by-n, y is m-by-1, x is n-by-1, and x' is x transpose.
- strsv: triangular solve, solves y = T*x for x, where T is triangular.
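A NumPy/SciPy sketch of the Level 2 operations (scipy.linalg.solve_triangular stands in for the triangular-solve kernel; the shapes are arbitrary examples):

    import numpy as np
    from scipy.linalg import solve_triangular

    m, n = 4, 3
    A = np.random.rand(m, n)
    x, y = np.random.rand(n), np.random.rand(m)

    y = y + A @ x                        # sgemv/dgemv: y <- y + A*x, O(m*n) work
    A = A + np.outer(y, x)               # sger/dger: rank-one update A <- A + y*x'
    T = np.triu(np.random.rand(n, n)) + n * np.eye(n)   # well-conditioned upper triangular T
    rhs = T @ x
    print(np.allclose(solve_triangular(T, rhs), x))     # strsv/dtrsv-style solve of T*x = rhs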

Level 3 BLAS
Operate on pairs or triples of matrices, returning a matrix; complexity is O(n^3).
- sgemm: matrix-matrix multiplication, C = C + A*B, where C is m-by-n, A is m-by-k, and B is k-by-n.
- strsm: multiple triangular solve, solves Y = T*X for X, where T is a triangular matrix and X is a rectangular matrix.
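A similar sketch for Level 3 (again with solve_triangular standing in for the multiple triangular solve; shapes are arbitrary):

    import numpy as np
    from scipy.linalg import solve_triangular

    m, k, n = 4, 3, 5
    A, B = np.random.rand(m, k), np.random.rand(k, n)
    C = np.random.rand(m, n)

    C = C + A @ B                                       # sgemm/dgemm: C <- C + A*B, O(m*k*n) work
    T = np.tril(np.random.rand(m, m)) + m * np.eye(m)   # well-conditioned lower triangular T
    X = np.random.rand(m, n)
    Y = T @ X
    print(np.allclose(solve_triangular(T, Y, lower=True), X))   # strsm/dtrsm-style solve of T*X = Y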

Performance of BLAS
(figure: performance curves for Level 3, Level 2, and Level 1 BLAS)

Performance of BLAS
- The BLAS are specially optimized by the vendor; the Sun BLAS uses features of the UltraSPARC.
- Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1: the top speed is that of the BLAS3.
- Algorithms like Gaussian elimination are organized so that they use BLAS3.