Matrix Multiplication
Instructor: Dr. Sushil K. Prasad
Presented by: R. Jayampathi Sampath



Outline
– Introduction
– Hypercube Interconnection Network
– The Parallel Algorithm
– Matrix Transposition
– Communication Efficient Matrix Multiplication on Hypercubes (the paper)

Introduction
Matrix multiplication is an important algorithm design problem in parallel computation.
Matrix multiplication on the hypercube:
– the diameter is small
– the degree is log(p)
The straightforward RAM algorithm for matrix multiplication requires O(n^3) time.
Sequential algorithm:

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }
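As an executable reference, the same triple loop can be sketched in Python (an illustrative translation of the slide's C-style pseudocode):

```python
def matmul(a, b):
    """Straightforward O(n^3) matrix multiplication: returns c = a * b
    for two n x n matrices given as lists of lists."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t = 0
            for k in range(n):
                t = t + a[i][k] * b[k][j]
            c[i][j] = t
    return c
```

Each of the n^2 output entries costs n multiply-adds, which is where the O(n^3) bound comes from.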

Hypercube Interconnection Network

Hypercube Interconnection Network (contd.)
The formal specification of a hypercube interconnection network:
– Let p = 2^g processors be available.
– Let i and i^(b) be two integers whose binary representations differ only in position b.
– Specifically, if i_{g-1} ... i_{b+1} i_b i_{b-1} ... i_0 is the binary representation of i, then i_{g-1} ... i_{b+1} i_b' i_{b-1} ... i_0 is the binary representation of i^(b), where i_b' is the complement of bit i_b.
A g-dimensional hypercube interconnection network is formed by connecting each processor P_i to P_{i^(b)} by a two-way link, for all 0 <= b <= g-1.
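A minimal sketch of this definition (illustrative Python, not from the slides): complementing bit b of i is an XOR with 2^b, and linking every processor to its g neighbors yields the hypercube.

```python
def neighbor(i, b):
    """Index i^(b): complement bit b of i's binary representation."""
    return i ^ (1 << b)

def hypercube_edges(g):
    """Two-way links of a g-dimensional hypercube on p = 2**g processors:
    each P_i is connected to P_{i^(b)} for every bit position b."""
    p = 1 << g
    return {(min(i, neighbor(i, b)), max(i, neighbor(i, b)))
            for i in range(p) for b in range(g)}
```

Each processor has degree g = log p, and the network has g * 2^(g-1) links in total.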

The Parallel Algorithm
Example (parallel algorithm): two 2 x 2 matrices A and B, n = 2 = 2^1, number of processors N = n^3 = 2^3 = 8; each position (i,j,k) uses one bit per coordinate.
– Initial step: A(0,j,k) and B(0,j,k) hold the input elements.
– Step 1.1: A(0,j,k) and B(0,j,k) -> processors (i,j,k), where 1 <= i <= n-1.
– Step 1.2: A(i,j,i) -> processors (i,j,k), where 0 <= k <= n-1.
– Step 1.3: B(i,i,k) -> processors (i,j,k), where 0 <= j <= n-1.

The Parallel Algorithm (contd.)
Step 2 and Step 3 (illustrated by figures on the original slide).

The Parallel Algorithm (contd.)
Implementation of the straightforward RAM algorithm on the hypercube:
– The multiplication of two n x n matrices A, B, where n = 2^q.
– Use a hypercube with N = n^3 = 2^{3q} processors.
– Each processor P_r occupies position (i,j,k), where r = i*n^2 + j*n + k for 0 <= i,j,k <= n-1.
– If the binary representation of r is r_{3q-1} r_{3q-2} ... r_{2q} r_{2q-1} ... r_q r_{q-1} ... r_0, then the binary representations of i, j, k are r_{3q-1} ... r_{2q}, r_{2q-1} ... r_q, and r_{q-1} ... r_0, respectively.

The Parallel Algorithm (contd.)
Example (positioning):
– The multiplication of two 2 x 2 matrices A, B, where n = 2 = 2^1, so q = 1.
– Use a hypercube with N = n^3 = 2^{3q} = 8 processors.
– Each processor P_r occupies position (i,j,k), where r = i*2^2 + j*2 + k for 0 <= i,j,k <= 1.
– If the binary representation of r is r_2 r_1 r_0, then the binary representations of i, j, k are r_2, r_1, r_0, respectively.
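The positioning can be checked with a short sketch (illustrative Python): with n = 2^q, the coordinates i, j, k of processor P_r are just the three q-bit fields of r.

```python
def position(r, q):
    """Coordinates (i, j, k) of processor P_r on the N = 2**(3q) hypercube:
    the high, middle, and low q-bit fields of r."""
    n = 1 << q
    return (r >> (2 * q), (r >> q) & (n - 1), r & (n - 1))

def rank(i, j, k, q):
    """Inverse mapping: r = i*n^2 + j*n + k with n = 2**q."""
    n = 1 << q
    return i * n * n + j * n + k
```

For q = 1 this reproduces the slide's example: r = 5 = (101)_2 sits at position (1, 0, 1).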

The Parallel Algorithm (contd.)
– All processors with the same index value in one of the i, j, k coordinates form a hypercube with n^2 processors.
– All processors with the same index values in two of the coordinates form a hypercube with n processors.
– Each processor has three registers A_r, B_r, and C_r, also denoted A(i,j,k), B(i,j,k), and C(i,j,k).

The Parallel Algorithm (contd.)
Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_ji and b_ik.
– 1.1 Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1.
– 1.2 Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_ji for 0 <= k <= n-1.
– 1.3 Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_ik for 0 <= j <= n-1.
Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k).
Step 3: The sum C(0,j,k) = Σ_i C(i,j,k) for 0 <= i <= n-1 is computed for 0 <= j,k <= n-1.
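The three steps can be verified by simulating each processor's registers with array entries (an illustrative Python sketch; communication is replaced by direct assignment, so only the data-movement pattern is modeled, not its cost):

```python
def hypercube_matmul(a, b):
    """Simulate the n^3-processor algorithm. After step 1, processor
    (i,j,k) holds A(i,j,k) = a[j][i] and B(i,j,k) = b[i][k]; step 2 forms
    one local product per processor; step 3 sums along i. Returns a * b."""
    n = len(a)
    # Step 1 (after 1.1-1.3): A(i,j,k) = a_ji, B(i,j,k) = b_ik
    A = [[[a[j][i] for k in range(n)] for j in range(n)] for i in range(n)]
    B = [[[b[i][k] for k in range(n)] for j in range(n)] for i in range(n)]
    # Step 2: C(i,j,k) = A(i,j,k) * B(i,j,k)
    C = [[[A[i][j][k] * B[i][j][k] for k in range(n)] for j in range(n)]
         for i in range(n)]
    # Step 3: C(0,j,k) = sum over i (a log n reduction on the hypercube)
    return [[sum(C[i][j][k] for i in range(n)) for k in range(n)]
            for j in range(n)]
```

The (j,k) entry of the result is Σ_i a_ji * b_ik = c_jk, matching Step 3.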

The Parallel Algorithm (contd.)
Analysis:
– Steps 1.1, 1.2, 1.3, and 3 each consist of q constant-time iterations.
– Step 2 requires constant time.
– So T(n^3) = O(q) = O(log n).
– Cost = p * T(p) = O(n^3 log n).
– Not cost optimal.

Example: n = 4 = 2^2, number of processors N = n^3 = 4^3 = 64; each position (i,j,k) uses two bits per coordinate. Copies of data initially in A(0,j,k) and B(0,j,k).

Example (n = 4, N = 64): A(0,j,k) and B(0,j,k) are sent to processors in positions (i,j,k), where 1 <= i <= n-1.


Example (n = 4, N = 64), senders of A: copies of data in A(i,j,i) are sent to processors in positions (i,j,k), where 0 <= k <= n-1.

Example (n = 4, N = 64), senders of B: copies of data in B(i,i,k) are sent to processors in positions (i,j,k), where 0 <= j <= n-1.

Matrix Transposition
– The number of processors used is N = n^2 = 2^{2q}, and processor P_r occupies position (i,j), where r = i*n + j for 0 <= i,j <= n-1.
– Initially, processor P_r holds element a_ij of matrix A, where r = i*n + j.
– Upon termination, processor P_s holds element a_ij, where s = j*n + i.

Matrix Transposition (contd.)
A recursive interpretation of the algorithm:
– Divide the matrix into four (n/2) x (n/2) sub-matrices.
– In the first level of recursion, the elements of the bottom-left sub-matrix are swapped with the corresponding elements of the top-right sub-matrix; the other two sub-matrices are untouched.
– The same step is then applied to each of the four (n/2) x (n/2) sub-matrices.
– This continues until 2 x 2 matrices are transposed.
Analysis:
– The algorithm consists of q constant-time iterations, so T(n) = O(log n).
– Cost = n^2 log n, which is not cost optimal: the RAM transposes an n x n matrix with n(n-1)/2 operations, swapping a_ij with a_ji for all i < j.
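The recursive interpretation can be sketched as follows (illustrative Python, assuming n is a power of two): swap the bottom-left and top-right quadrants element-wise, then recurse into all four quadrants.

```python
def transpose(m):
    """Recursively transpose the n x n matrix m in place, n a power of 2."""
    def rec(r, c, size):
        # Transpose the size x size sub-matrix with top-left corner (r, c).
        if size == 1:
            return
        h = size // 2
        # Swap the top-right quadrant with the bottom-left quadrant.
        for i in range(h):
            for j in range(h):
                m[r + i][c + h + j], m[r + h + i][c + j] = \
                    m[r + h + i][c + j], m[r + i][c + h + j]
        # Recurse into the four (size/2) x (size/2) quadrants.
        for dr, dc in ((0, 0), (0, h), (h, 0), (h, h)):
            rec(r + dr, c + dc, h)
    rec(0, 0, len(m))
```

There are log n levels of recursion, and on the hypercube each level is a constant-time exchange, matching T(n) = O(log n).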

Matrix Transposition (contd.)
Example: transposing a 4 x 4 matrix.

    A = | 1 e b 2 |        A^T = | 1 c h 3 |
        | c f d g |              | e f x z |
        | h x v y |              | b d v w |
        | 3 z w 4 |              | 2 g y 4 |

Outline
– 2-D Diagonal Algorithm
– The 3-D Diagonal Algorithm

2-D Diagonal Algorithm
Step 1 (a 4 x 4 processor grid): matrix A is partitioned into column blocks A_{*0}, A_{*1}, A_{*2}, A_{*3} and B into row blocks B_{0*}, B_{1*}, B_{2*}, B_{3*}; diagonal processor (i,i) holds A_{*i} and B_{i*}.

2-D Diagonal Algorithm (contd.)
– Step 2: each diagonal processor (i,i) broadcasts A_{*i} along its row, so every processor (i,j) holds A_{*i} while B_{i*} remains on the diagonal.
– Step 3: each diagonal processor (i,i) sends block B_{ij} to processor (i,j) (a one-to-all personalized broadcast along the row); processor (i,j) then computes the product A_{*i} B_{ij}.
– Step 4: the partial results are combined along each column j, yielding the blocks C_{0j}, C_{1j}, C_{2j}, C_{3j} of C.

2-D Diagonal Algorithm (contd.)
– The above algorithm can be extended to a 3-D mesh embedded in a hypercube, with A_{*i} and B_{i*} initially distributed along the third dimension z.
– Processor p_iik holds the sub-blocks A_ki and B_ik.
– The one-to-all personalized broadcast of B_{i*} is then replaced by a point-to-point communication of B_ik from p_iik to p_kik.
– This is followed by a one-to-all broadcast of B_ik from p_kik along the z direction.

The 3-D Diagonal Algorithm
– Consider a hypercube consisting of p processors, which can be visualized as a 3-D mesh of size p^(1/3) x p^(1/3) x p^(1/3).
– Matrices A and B are partitioned into p^(2/3) blocks, with p^(1/3) blocks along each dimension.
– Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, processor p_iik containing the blocks A_ki and B_ki.

The 3-D Diagonal Algorithm (contd.)
The algorithm consists of three phases:
– Phase 1: point-to-point communication of B_ki by p_iik to p_ikk.
– Phase 2: one-to-all broadcast of the blocks of A along the x direction, and of the newly acquired blocks of B along the z direction. Now processor p_ijk has the blocks A_kj and B_ji, and each processor calculates the product of its blocks of A and B.
– Phase 3: reduction by adding the result sub-matrices along the z direction.

The 3-D Diagonal Algorithm (contd.)
Analysis:
– Phase 1: passing messages of size n^2 / p^(2/3) requires time (log ∛p)(t_s + t_w n^2 / p^(2/3)), where t_s is the start-up time for message sending and t_w is the time to send one word from a processor to its neighbor.
– Phase 2: takes twice as much time as Phase 1.
– Phase 3: can be completed in the same amount of time as Phase 1.
– Overall, the algorithm takes (4/3) log p message start-ups and (n^2 / p^(2/3))(4/3 log p) word-transfer time.

Bibliography
– Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall.
– Gupta, H. & Sadayappan, P., "Communication Efficient Matrix Multiplication on Hypercubes," Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994.
– Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill, 1997.