High Performance Computing 1 Numerical Linear Algebra An Introduction.


High Performance Computing 1 Levels of multiplication
–vector-vector: a[i]*b[i]
–matrix-vector: A[i,j]*b[j]
–matrix-matrix: A[i,k]*B[k,j]

High Performance Computing 1 Matrix-Matrix
for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    C[i,j] = 0.0;
    for (k=0; k<n; k++)
      C[i,j] = C[i,j] + A[i,k]*B[k,j];
  }
Note: O(n^3) work
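The pseudocode above translates directly into a runnable C program. The sketch below is illustrative (the matrix size n = 512 and the clock()-based timing are arbitrary choices, not from the slides); it reports a rough Mflop/s rate using the 2*n^3 multiply-add operation count behind the O(n^3) note above.

/* Naive triple-loop matrix multiply with a rough Mflop/s measurement. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    int n = 512;                                   /* illustrative size */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    clock_t t0 = clock();
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];    /* C[i,j] += A[i,k]*B[k,j] */
            C[i*n + j] = sum;
        }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("n=%d: %.2f s, %.1f Mflop/s, C[0][0]=%g\n",
           n, secs, 2.0 * n * n * n / (secs * 1e6), C[0]);
    free(A); free(B); free(C);
    return 0;
}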

High Performance Computing 1 Block matrix
Serial version – same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication.
Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2x2 multiplications.
Allows for parallelism.
Can get O(n^2.8) work.
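A minimal C sketch of the recursive scheme, assuming n is a power of two and row-major storage (the function name recmm and the base-case cutoff are illustrative choices, not from the slides). It performs the eight half-size products of the 2x2 block formula; Strassen's variant gets by with seven, which is where the O(n^2.8) bound comes from.

/* C += A*B for n-by-n blocks stored row-major with leading dimension ld. */
void recmm(int n, int ld, const double *A, const double *B, double *C)
{
    if (n <= 64) {                            /* base case: plain triple loop */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    int h = n / 2;                            /* split each matrix into 2x2 blocks */
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double       *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;

    /* eight half-size products; the four C blocks are independent,
       so they could be computed in parallel */
    recmm(h, ld, A11, B11, C11);  recmm(h, ld, A12, B21, C11);
    recmm(h, ld, A11, B12, C12);  recmm(h, ld, A12, B22, C12);
    recmm(h, ld, A21, B11, C21);  recmm(h, ld, A22, B21, C21);
    recmm(h, ld, A21, B12, C22);  recmm(h, ld, A22, B22, C22);
}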

High Performance Computing 1 Pipeline method
Pump data left to right and top to bottom:
recv(&A, P[i,j-1]);
recv(&B, P[i-1,j]);
C = C + A*B;
send(&A, P[i,j+1]);
send(&B, P[i+1,j]);
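In MPI terms, the body of the pipeline loop for the process holding C[i,j] might look like the sketch below. This is a sketch only, under stated assumptions: left/right/up/down are the neighbor ranks (MPI_PROC_NULL at the grid edges), bsz-by-bsz blocks are stored row-major, and the staggered injection of the first A and B blocks by the edge processes is omitted; pipeline_step is an assumed helper name, not from the slides.

#include <mpi.h>

/* One pipeline step on process (i,j): pull A from the left and B from above,
   accumulate C += A*B, then pass A to the right and B downward. */
void pipeline_step(double *a, double *b, double *c, int bsz,
                   int left, int right, int up, int down, MPI_Comm grid)
{
    MPI_Recv(a, bsz*bsz, MPI_DOUBLE, left, 0, grid, MPI_STATUS_IGNORE);
    MPI_Recv(b, bsz*bsz, MPI_DOUBLE, up,   1, grid, MPI_STATUS_IGNORE);

    for (int i = 0; i < bsz; i++)             /* C = C + A*B on the local blocks */
        for (int k = 0; k < bsz; k++)
            for (int j = 0; j < bsz; j++)
                c[i*bsz + j] += a[i*bsz + k] * b[k*bsz + j];

    MPI_Send(a, bsz*bsz, MPI_DOUBLE, right, 0, grid);
    MPI_Send(b, bsz*bsz, MPI_DOUBLE, down,  1, grid);
}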

High Performance Computing 1 Pipeline method
(Diagram: a 4x4 grid of processes, with process [i,j] holding C[i,j]; rows A[0,*] through A[3,*] stream in from the left, and columns B[*,0] through B[*,3] stream in from the top.)

High Performance Computing 1 Pipeline method
A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.

High Performance Computing 1 A sense of speed – vector ops
Loop  Flops per pass  Operation per pass               Operation
1     2               v1(i) = v1(i) + a*v2(i)          update
2     8               v1(i) = v1(i) + Σ sk*vk(i)       4-fold vector update
3     1               v1(i) = v1(i) / v2(i)            divide
4     2               v1(i) = v1(i) + s*v2(ind(i))     update + gather
5     2               v1(i) = v2(i) - v3(i)*v1(i-1)    bidiagonal
6     2               s = s + v1(i)*v2(i)              inner product

High Performance Computing 1 A sense of speed – vector ops
(Table: measured r_∞ and n_1/2 for each of the loops above, on a J90 with cft77 (100 nsec clock) and on one T90 processor with cft77 (2.2 nsec clock).)

High Performance Computing 1 Observations
Simple do loops are not effective.
Cache and the memory hierarchy are the bottlenecks.
For better performance,
–combine loops
–minimize memory transfer

High Performance Computing 1 LINPACK
A library of subroutines to solve linear algebra problems; for example, LU decomposition and the accompanying system solve (dgefa and dgesl, respectively).
In turn, built on the BLAS.
See netlib.org.
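dgefa and dgesl have modern LAPACK equivalents; the sketch below solves a small system through LAPACKE_dgesv (LU factorization plus solve in one call), assuming a LAPACKE installation such as the one shipped with OpenBLAS. This shows the current interface rather than the original LINPACK calling sequence.

#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double A[4] = { 4.0, 3.0,              /* 2x2 system, row-major */
                    6.0, 3.0 };
    double b[2] = { 10.0, 12.0 };          /* right-hand side */
    lapack_int ipiv[2];

    /* factor A = P*L*U and solve A*x = b; the solution overwrites b */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);   /* expect (1, 2) */
    return (int)info;
}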

High Performance Computing 1 BLAS
Basic Linear Algebra Subprograms: a library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
The naming convention is not unlike other libraries – the first letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in level 3.
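As a sketch of what one call from each level looks like through the C interface (CBLAS, as shipped with OpenBLAS or ATLAS, assumed available here), with the names following the convention just described (D for double precision):

#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};   /* 3x3 identity, row-major */
    double B[9] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
    double C[9] = {0};

    /* Level 1 (vector-vector): y = 2*x + y          (DAXPY) */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    /* Level 2 (matrix-vector): y = 1.0*A*x + 1.0*y  (DGEMV) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 1.0, y, 1);

    /* Level 3 (matrix-matrix): C = 1.0*A*B + 0.0*C  (DGEMM) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

    printf("y = (%g, %g, %g), C[0][0] = %g\n", y[0], y[1], y[2], C[0]);
    return 0;
}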

High Performance Computing 1 BLAS Level 1

High Performance Computing 1 BLAS Level 2

High Performance Computing 1 BLAS Level 3

High Performance Computing 1 How efficient is the BLAS?
         Routine  load/store       float ops  refs/ops
level 1  SAXPY    3N               2N         3/2
level 2  SGEMV    MN + N + 2M      2MN        ~1/2
level 3  SGEMM    2MN + MK + KN    2MNK       ~2/N

High Performance Computing 1 Matrix-vector
read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
  read row i of A into fast memory
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory
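A runnable C version of this pseudocode (row-major A, so row i is streamed through contiguously; matvec is an illustrative name):

/* y = y + A*x for an n-by-n matrix A stored row-major */
void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];                 /* keep y(i) in a register */
        for (int j = 0; j < n; j++)
            yi += A[i*n + j] * x[j];      /* y(i) = y(i) + A(i,j)*x(j) */
        y[i] = yi;
    }
}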

High Performance Computing 1 Matrix-vector
m = # slow memory refs = n^2 + 3n
f = # arithmetic ops = 2n^2
q = f/m ≈ 2
Matrix-vector multiply is limited by slow memory.

High Performance Computing 1 Matrix-matrix

High Performance Computing 1 Matrix Multiply – unblocked
for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory

High Performance Computing 1 Matrix Multiply unblocked
Number of slow memory references for unblocked matrix multiply:
m = n^3      read each column of B n times
  + n^2      read each row of A once
  + 2*n^2    read and write each element of C once
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ≈ 2 for large n – no improvement over matrix-vector multiply.

High Performance Computing 1 Matrix Multiply blocked
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.
for i = 1 to N
  for j = 1 to N
    read block C(i,j) into fast memory
    for k = 1 to N
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    write block C(i,j) back to slow memory
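A minimal C sketch of this blocked loop nest, assuming n is a multiple of the blocksize b, row-major storage, and C initialized by the caller (blocked_mm is an illustrative name):

void blocked_mm(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                /* C(ii,jj) block += A(ii,kk) block * B(kk,jj) block */
                for (int i = ii; i < ii + b; i++)
                    for (int k = kk; k < kk + b; k++)
                        for (int j = jj; j < jj + b; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}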

High Performance Computing 1 Matrix Multiply blocked
Number of slow memory references for blocked matrix multiply:
m = N*n^2    read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2    read each block of A N^3 times
  + 2*n^2    read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ≈ n/N = b for large n.

High Performance Computing 1 Matrix Multiply blocked
So we can improve performance by increasing the blocksize b.
This can be much faster than matrix-vector multiply (q = 2).
Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ≈ b <= sqrt(M/3).
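As a worked example (the cache size is an assumed, illustrative figure): a 256 KB cache holds M = 32768 doubles, so b <= sqrt(32768/3) ≈ 104; blocks of roughly 100-by-100 keep all three operands resident and give q ≈ 100, compared with q ≈ 2 for the unblocked loop.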

High Performance Computing 1 More on BLAS
Industry standard interface (evolving).
Vendors and others supply optimized implementations.
History:
–BLAS1 (1970s): vector operations: dot product, saxpy; m = 2*n, f = 2*n, q ~ 1 or less
–BLAS2 (mid 1980s): matrix-vector operations; m = n^2, f = 2*n^2, q ~ 2; less overhead, somewhat faster than BLAS1

High Performance Computing 1 More on BLAS
–BLAS3 (late 1980s): matrix-matrix operations (matrix-matrix multiply, etc.); m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
Good algorithms use BLAS3 when possible (e.g. LAPACK).

High Performance Computing 1 BLAS on an IBM RS6000/590
(Plot: performance of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors); peak speed = 266 Mflops.)