Matrix Operations on the GPU CIS 665: GPU Programming and Architecture TA: Joseph Kider.


Matrix Operations (thanks to…)
Slide information sources:
- Suresh Venkatasubramanian, CIS 700 – Matrix Operations lectures
- Fast Matrix Multiplies Using Graphics Hardware, by Larsen and McAllister
- Dense Matrix Multiplication, by Ádám Moravánszky
- Cache and Bandwidth Aware Matrix Multiplication on the GPU, by Hall, Carr, and Hart
- Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication, by Fatahalian, Sugerman, and Hanrahan
- Linear Algebra Operators for GPU Implementation of Numerical Algorithms, by Krüger and Westermann

Overview: 3 Basic Linear Algebra Operations
- Vector-Vector Operations: c = a · b
- Matrix-Matrix Operations: C = A + B, D = A * B, E = E⁻¹
- Matrix-Vector Operations: y = Ax

Efficiency/Bandwidth Issues
GPU algorithms are severely bandwidth limited!
- Minimize texture fetches
- Effective cache bandwidth is the bottleneck: no algorithm can read data from texture much faster than the cache allows, so reduce how often you fetch at all

Vector-Vector Operations
Inner Product Review
An inner product on a vector space V over a field K (which must be either the field R of real numbers or the field C of complex numbers) is a function ⟨·,·⟩ : V×V → K such that, for all α in K and all u, v, w in V, the following properties hold:
1. ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩
2. ⟨αv, w⟩ = α⟨v, w⟩ (linearity constraints)
3. ⟨v, w⟩ = the complex conjugate of ⟨w, v⟩ (conjugate symmetry)
4. ⟨v, v⟩ ≥ 0 (positive definite)

Vector-Vector Operations
Inner Product Review
A vector space together with an inner product on it is called an inner product space. Examples include:
1. The real numbers R, where the inner product is given by ⟨x, y⟩ = xy
2. The Euclidean space R^n, where the inner product is given by the dot product: c = a · b = a_1 b_1 + a_2 b_2 + … + a_n b_n = ∑ a_i b_i
3. The vector space of real functions on a closed domain [a, b], where ⟨f, g⟩ = ∫ f g dx

Vector-Vector Operations
Dot Product: Technique 1 (optimized for memory)
- Store each vector as a 1D texture (a and b)
- In the i-th rendering pass, render a single point at coordinates (0,0) which carries a single texture coordinate i
- The fragment program uses i to index into the two textures and returns the value s + a_i * b_i (s is the running sum maintained over the previous i-1 passes):
  c_0 = 0
  c_1 = c_0 + a_0 * b_0
  c_2 = c_1 + a_1 * b_1
  …
A per-pass fragment-program sketch is given below.
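A minimal Cg sketch of the per-pass fragment program for Technique 1 (the texture names, the running-sum texture, and the exact way the pass index is passed in are illustrative assumptions, not the original code):

    // One pass of Technique 1: s = s + a[i] * b[i].
    // vecA and vecB hold the two vectors as 1D textures; runningSum is the
    // ping-pong texture holding the sum from the previous i-1 passes.
    float main(float i : TEXCOORD0,              // index of this pass
               uniform sampler1D vecA,
               uniform sampler1D vecB,
               uniform sampler1D runningSum) : COLOR
    {
        float s  = tex1D(runningSum, 0.0);       // previous partial sum
        float ai = tex1D(vecA, i);
        float bi = tex1D(vecB, i);
        return s + ai * bi;                      // new partial sum c_i
    }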

Vector-Vector Operations
Dot Product: Technique 1: Problems?
- We cannot read and write the location where s is stored in a single pass, so we need a ping-pong trick (alternating between two textures) to maintain s accurately
- Takes n passes, but requires only a fixed number of texture locations (1 unit of memory)
- Does not take advantage of the 2D spatial texture caches on the GPU that the rasterizer is optimized for
- Limited length of 1D textures, especially on older cards

Vector-Vector Operations Dot Product: Technique 2 (optimized for passes) - Wrap a and b as 2D textures

Dot Product: Technique 2
- Multiply the two 2D textures by rendering a single quad; the result 2D texture (c) holds the pointwise products (a fragment-program sketch is given below)
- Add the elements of the result texture (c) together to get the scalar
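A minimal Cg sketch of the pointwise-multiply pass (texture names are assumptions):

    // c[i,j] = a[i,j] * b[i,j]; run once over a full-texture quad.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D vecA,           // a wrapped into a 2D texture
               uniform sampler2D vecB) : COLOR   // b wrapped the same way
    {
        return tex2D(vecA, texcoord) * tex2D(vecB, texcoord);
    }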

Vector-Vector Operations
Adding up a texture's elements to a scalar value:
- Additive blending, or
- A parallel reduction algorithm (log n passes)

    // Example fragment program for performing one reduction pass.
    // Note: with a normalized sampler2D the offsets below must be scaled by
    // the texel size (1/width, 1/height); they are written here in texel units.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D img) : COLOR
    {
        float a, b, c, d;
        a = tex2D(img, texcoord);
        b = tex2D(img, texcoord + float2(0, 1));
        c = tex2D(img, texcoord + float2(1, 0));
        d = tex2D(img, texcoord + float2(1, 1));
        return (a + b + c + d);   // each output texel sums a 2x2 block of inputs
    }

Matrix-Matrix Operations
- Store matrices as 2D textures
- Addition is now a trivial fragment program / additive blend (see the sketch below)
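For illustration, a minimal Cg fragment program for C = A + B (texture names are assumptions):

    // Each fragment adds the corresponding elements of A and B.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D matA,
               uniform sampler2D matB) : COLOR
    {
        return tex2D(matA, texcoord) + tex2D(matB, texcoord);
    }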

Matrix-Matrix Operations
Matrix Multiplication Review
In general: (AB)_ij = ∑_r a_ir b_rj
Naïve O(n³) CPU algorithm:
  for i = 1 to n
    for j = 1 to n
      C[i,j] = ∑_k A[i,k] * B[k,j]

Matrix-Matrix Operations
GPU Matrix Multiplication: Technique 1
- Express the multiplication of two matrices as dot products of matrix rows and columns
- Compute matrix C by: for each cell c_ij, take the dot product of row i of matrix A with column j of matrix B

Matrix-Matrix Operations
GPU Matrix Multiplication: Technique 1
  Pass 1: Output = a_x1 * b_1y
  Pass 2: Output = Output_1 + a_x2 * b_2y
  …
  Pass k: Output = Output_(k-1) + a_xk * b_ky
Uses: n passes
Uses: N = n² space
A sketch of pass k as a fragment program is given below.
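A Cg sketch of one pass (the texture names and the convention that texcoord.x indexes the output column and texcoord.y the output row are assumptions):

    // Pass k: Output = Output_(k-1) + a_xk * b_ky, for every output element.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D matA,
               uniform sampler2D matB,
               uniform sampler2D accum,       // ping-ponged result of pass k-1
               uniform float     k) : COLOR   // column of A / row of B (normalized)
    {
        float prev = tex2D(accum, texcoord);
        float a = tex2D(matA, float2(k, texcoord.y));   // element a_xk
        float b = tex2D(matB, float2(texcoord.x, k));   // element b_ky
        return prev + a * b;
    }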

Matrix-Matrix Operations
GPU Matrix Multiplication: Technique 2: Blocking
Instead of making one computation per pass, compute multiple additions per pass in the fragment program:
  Pass 1: Output = a_x1 * b_1y + a_x2 * b_2y + … + a_xb * b_by
  …
Passes = n / blocksize
Now there is a tradeoff between passes and program size/fetches (see the blocked sketch below).
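A blocked variant, again as a hedged Cg sketch (BLOCK, texelSize, and the uniform names are assumptions; a real kernel would pick a block size that fits the hardware's instruction and fetch limits):

    #define BLOCK 4   // number of multiply-adds folded into one pass

    // One blocked pass: accumulate BLOCK products a_xk * b_ky.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D matA,
               uniform sampler2D matB,
               uniform sampler2D accum,              // result of the previous passes
               uniform float     kStart,             // first k of this block (normalized)
               uniform float     texelSize) : COLOR  // 1.0 / n
    {
        float sum = tex2D(accum, texcoord);
        for (int i = 0; i < BLOCK; i++) {
            float k = kStart + i * texelSize;
            sum += tex2D(matA, float2(k, texcoord.y)) *
                   tex2D(matB, float2(texcoord.x, k));
        }
        return sum;
    }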

Matrix-Matrix Operations
GPU Matrix Multiplication: Technique 3
Modern fragment shaders operate on 4-component vectors, so 4 multiplications can be executed simultaneously:
  (1) output = v1.abgr * v2.ggab
This is issued as a single GPU instruction and is numerically equivalent to the following 4 instructions being executed in parallel:
  (2) output.r = v1.a * v2.g
      output.g = v1.b * v2.g
      output.b = v1.g * v2.a
      output.a = v1.r * v2.b
In v1.abgr the color channels are referenced in arbitrary order; this is referred to as swizzling. In v2.ggab the color channel g is referenced multiple times; this is referred to as smearing.

Matrix-Matrix Operations
GPU Matrix Multiplication: Technique 3: Smearing/Swizzling
Up until now we have used only 1 channel, the red component, to store the data. Why not store data across all the channels (RGBA) and compute 4 values at a time?
The matrix multiplication can then be expressed as follows: suppose we have 2 large matrices A and B, w.l.o.g. with dimensions that are powers of 2, where A_11, A_12, … are sub-matrices of 2^(i-1) rows/columns. A packed-channel sketch is given below.
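As an illustrative sketch only (not the exact kernel from the cited papers), suppose every RGBA texel of A and B packs a 2x2 sub-block as (r, g, b, a) = (m11, m12, m21, m22). One pass of the block product can then use swizzling and smearing like this:

    // One pass over 2x2 sub-blocks packed into RGBA texels.
    // a = (a11, a12, a21, a22), b = (b11, b12, b21, b22); the two swizzled
    // multiply-adds below form the 2x2 block product a * b.
    float4 main(float2 texcoord : TEXCOORD0,
                uniform sampler2D matA,
                uniform sampler2D matB,
                uniform sampler2D accum,
                uniform float     k) : COLOR    // current block row/column (normalized)
    {
        float4 a = tex2D(matA, float2(k, texcoord.y));  // 2x2 block of A
        float4 b = tex2D(matB, float2(texcoord.x, k));  // 2x2 block of B
        float4 c = tex2D(accum, texcoord);              // previous partial sums
        c += a.rrbb * b.rgrg;   // (a11,a11,a21,a21) * (b11,b12,b11,b12)
        c += a.ggaa * b.baba;   // (a12,a12,a22,a22) * (b21,b22,b21,b22)
        return c;
    }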

Matrix-Matrix Operations
Note on notation: C(r) = A(r) * B(r) is used to denote the channels.
Example:
So now the final matrix multiplication can be expressed recursively by:

Matrix-Matrix Operations
Efficiency/Bandwidth Issues
- The problem with matrix multiplication is that each input contributes to multiple outputs: O(n) of them
- Arithmetic performance is limited by cache bandwidth
- Multipass algorithms tend to be more cache friendly
- 2 types of bandwidth:
  - External bandwidth: CPU → GPU data transfers, limited by the AGP or PCI Express bus
  - Internal bandwidth (black box): reads from and writes to textures tend to be expensive
- Back-of-the-envelope calculation: ((2 texture lookups) * blocksize + 2 (previous-pass lookup)) * (precision) * (n²)
  (2*32 + 2)(32)(1024) = 4 GB of data being thrown around

GPU Benchmarks
[Chart: peak arithmetic rate in GFLOPS for Pentium IV, ATI 9800, ATI X800, and GeForce 7800]

Previous Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for P4 3 GHz, GeForce 5900 Ultra, and Radeon 9800 XT]

Next Generation GPUs
[Chart: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for P4 3 GHz, GeForce 6800 Ultra, and Radeon X800 XT PE]

Matrix-Vector Operations
Matrix-Vector Operation Review
Example 1:
Example 2:

Matrix-Vector Operations
Technique 1: Just use a dense matrix multiply
  Pass 1: Output = a_x1 * b_11 + a_x2 * b_21 + … + a_xb * b_b1
  …
Passes = n / blocksize

Matrix-Vector Operations
Technique 2: Sparse Banded Matrices (A * x = y)
A band matrix is a sparse matrix whose nonzero elements are confined to diagonal bands.
Algorithm:
- Convert the diagonal bands to vectors
- Convert the (N) vectors to 2D textures, padding with 0 if they do not fill the texture completely

Matrix-Vector Operations
Technique 2: Sparse Banded Matrices (continued)
- Convert the multiplication vector (x) to a 2D texture
- Pointwise multiply the (N) diagonal textures with the (shifted) x texture
- Add the (N) resulting textures to form a single 2D texture
- Unwrap the 2D texture for the final answer
A per-diagonal fragment-program sketch is given below.
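A heavily simplified Cg sketch of one per-diagonal pass (the texture names and the shift uniform are assumptions, and the shifted lookup ignores the row-wrap bookkeeping that a real implementation of the wrapped vector needs):

    // Multiply one diagonal band with the (shifted) vector x and add it
    // to the running result y.
    float main(float2 texcoord : TEXCOORD0,
               uniform sampler2D diag,            // one diagonal band, wrapped to 2D
               uniform sampler2D vecX,            // x, wrapped to 2D the same way
               uniform sampler2D accum,           // partial y from earlier passes
               uniform float2    shift) : COLOR   // texture-space offset of the band
    {
        float y = tex2D(accum, texcoord);
        float d = tex2D(diag,  texcoord);
        float x = tex2D(vecX,  texcoord + shift); // shifted lookup into x
        return y + d * x;
    }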

Matrix-Vector Operations
Technique 3: Sparse Matrices
Create a texture lookup scheme