CUDA Linear Algebra Library and Next Generation

CUDA Linear Algebra Library and Next Generation
Yukai Hung (a0934147@gmail.com)
Department of Mathematics, National Taiwan University

Sparse Matrix-Vector Multiplication

Sparse Matrix-Vector Multiplication
The dense approach is wasteful
- unclear how to map the work onto parallel processors
- irregular element accesses to global memory

Sparse Matrix-Vector Multiplication
Sparse matrix storage formats, ordered roughly from structured to unstructured:
- DIA: diagonal format
- ELL: ELLPACK format
- CSR: compressed sparse row format
- HYB: hybrid format
- COO: coordinate format
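
To make the formats concrete, here is a small illustrative example (not from the original slides) showing the same 4x4 matrix stored in CSR and COO arrays; the array names are chosen for illustration only.

/* a 4x4 example matrix with 7 nonzeros:
   [ 1 0 2 0 ]
   [ 0 3 0 0 ]
   [ 4 0 5 6 ]
   [ 0 0 0 7 ]                                                  */

/* CSR: packed values and column indices, plus one offset per row */
int   csr_rowptr[] = { 0, 2, 3, 6, 7 };
int   csr_colidx[] = { 0, 2, 1, 0, 2, 3, 3 };
float csr_values[] = { 1, 2, 3, 4, 5, 6, 7 };

/* COO: an explicit (row, column, value) triple for every nonzero */
int   coo_rows[]   = { 0, 0, 1, 2, 2, 2, 3 };
int   coo_cols[]   = { 0, 2, 1, 0, 2, 3, 3 };
float coo_values[] = { 1, 2, 3, 4, 5, 6, 7 };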

Sparse Matrix-Vector Multiplication
Diagonal (DIA) format
- the diagonals should be mostly populated
- high parallelism: map one thread to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
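
The one-thread-per-row DIA kernel can be sketched as below. This is an illustrative reconstruction, not code from the slides: data is assumed to hold the diagonals with num_rows entries each, laid out so that consecutive threads read consecutive addresses, which is where the coalescing comes from.

__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int d = 0; d < num_diags; d++) {
            int col = row + offsets[d];        //offset of this diagonal
            if (col >= 0 && col < num_cols)    //stay inside the matrix
                dot += data[d * num_rows + row] * x[col];
        }
        y[row] = dot;
    }
}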

Sparse Matrix-Vector Multiplication
ELLPACK (ELL) format
- again assigns one thread to compute one row
- but load imbalance across rows hurts parallel efficiency
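
A matching one-thread-per-row ELL sketch (again an illustrative reconstruction): every row is padded to max_cols_per_row entries and the arrays are stored column-major, so each thread loops over the same trip count, which is exactly where the load imbalance comes from when row lengths vary.

__global__ void spmv_ell(int num_rows, int max_cols_per_row,
                         const int* indices, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int n = 0; n < max_cols_per_row; n++) {
            float val = data[n * num_rows + row];
            int   col = indices[n * num_rows + row];
            if (val != 0.0f)                   //skip the padding entries
                dot += val * x[col];
        }
        y[row] = dot;
    }
}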

Sparse Matrix-Vector Multiplication
Coordinate (COO) format
- insensitive to the sparsity pattern, but slower than ELLPACK
- assigns one thread to one element and combines the results from all elements in a row to produce one output element
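
A deliberately simplified one-thread-per-element COO sketch is shown below; it combines per-row contributions with an atomic add (float atomicAdd requires a Fermi-class device), whereas the production kernel described by Bell and Garland combines them with a segmented reduction instead.

__global__ void spmv_coo(int num_nonzeros,
                         const int* rows, const int* cols,
                         const float* values,
                         const float* x, float* y)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < num_nonzeros)                      //one thread per nonzero
        atomicAdd(&y[rows[n]], values[n] * x[cols[n]]);
}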

Sparse Matrix-Vector Multiplication
Hybrid (HYB) format
- combines the regular ELLPACK format (for the typical entries in each row) with the flexible COO format (for the exceptional entries)

Sparse Matrix-Vector Multiplication
Property comparison (fixed number of nonzeros, variable matrix size)

Format        Granularity      Coalescing
DIA           thread/row       full
ELL           thread/row       full
CSR (scalar)  thread/row       rare
CSR (vector)  warp/row         partial
COO           thread/nonzero   full
HYB           thread/row       full

Sparse Matrix-Vector Multiplication
Sparse matrices and parallel efficiency: ELLPACK format
- one thread per row is efficient for memory accesses
Sparse matrices and load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure
Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if you regularize the common case

Sparse Matrix-Vector Multiplication
Performance comparison (benchmark charts not reproduced in this transcript)

Linear Algebra Library

Linear Algebra Library
CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available on a single device; not implemented for multiple devices
CUFFT: CUDA Fast Fourier Transform Library
- uses divide-and-conquer algorithms for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- complex-to-complex transforms can substitute for real-to-complex
- power-of-two problem sizes give the best performance
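
As a minimal CUFFT usage sketch (assembled for illustration, not taken from the slides): a 1024-point in-place complex-to-complex forward transform on data that already resides on the device.

#include <cufft.h>

void fft_example(cufftComplex* d_signal)     //device pointer, 1024 points
{
    cufftHandle plan;
    cufftPlan1d(&plan, 1024, CUFFT_C2C, 1);  //one 1D C2C transform
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  //in-place
    cufftDestroy(plan);
}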

Linear Algebra Library
CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix sum (scan), sorting, and data reduction
- stream compaction and random number generation

Linear Algebra Library
CUDPP: CUDA Data Parallel Primitives Library
Performance comparison with a multicore CPU (chart not reproduced)

Linear Algebra Library
CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functionality with interfaces for several languages
- linear system solvers
- least squares solvers
- orthogonal factorizations
- symmetric eigenproblems
- non-symmetric eigenproblems
- singular value decompositions
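
A hedged sketch of the CULA host interface follows; the header name and exact entry points may differ between CULA releases, but culaSgesv mirrors LAPACK's sgesv, solving Ax=b in single precision.

#include <cula.h>   /* assumed header name */

void solve_system(int n, culaFloat* A, culaInt* ipiv, culaFloat* b)
{
    culaInitialize();                    //bind a GPU to the CULA runtime
    culaSgesv(n, 1, A, n, ipiv, b, n);   //LU-factorize A and solve Ax=b
    culaShutdown();                      //release the runtime
}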

Linear Algebra Library
CULA benchmark charts (not reproduced): double precision QR and LU factorization, symmetric eigenvalue problem, and singular value decomposition

Linear Algebra Library
MAGMA: Matrix Algebra on GPU and Multicore Architecture
- an open-source project to develop a dense linear algebra library, similar to the basic linear algebra packages, but for heterogeneous and hybrid architectures built from manycore CPUs and GPUs
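
A hedged sketch of MAGMA's LAPACK-style interface (signatures vary across MAGMA releases, so treat the exact names below as an assumption): a hybrid CPU+GPU LU factorization of a host-resident matrix.

#include "magma.h"   /* assumed header name for this MAGMA release */

void factorize(magma_int_t m, magma_int_t n, float* A, magma_int_t lda,
               magma_int_t* ipiv)
{
    magma_int_t info;
    magma_init();                              //initialize MAGMA
    magma_sgetrf(m, n, A, lda, ipiv, &info);   //hybrid CPU+GPU LU
    magma_finalize();
}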

Linear Algebra Library
MAGMA benchmark charts (not reproduced): single precision QR factorization, double precision matrix-matrix multiplication, solving Ax=b with LU factorization, and single precision Cholesky factorization

Linear Algebra Library
Thrust
- a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL); its flexible, high-level interface greatly enhances productivity

Linear Algebra Library

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(int argc, char** argv)
{
    //allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    //generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    //allocate device memory and transfer the data
    thrust::device_vector<float> dvec = hvec;

    //manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    //sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    //sort all data on the device (radix sort for primitive types)
    thrust::sort(dvec.begin(), dvec.end());

    //transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());

    return 0;
}

Linear Algebra Library

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>

//a trivial device kernel operating on the raw pointer
__global__ void kernel(int* data, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) data[i] += 1;   //placeholder computation
}

int main(int argc, char** argv)
{
    //create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    //copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    //alternative method to construct directly from the host range
    //thrust::device_vector<int> dvec(hlist.begin(), hlist.end());

    //obtain a raw pointer to the device memory
    int* dpointer = thrust::raw_pointer_cast(&dvec[0]);

    //launch the device kernel function
    const int blocksize = 256;
    const int blocknum  = (int)((dvec.size() + blocksize - 1) / blocksize);
    kernel<<<blocknum, blocksize>>>(dpointer, (int)dvec.size());

    //no explicit cudaFree: the device_vector releases its own storage
    //when it goes out of scope
    return 0;
}

Linear Algebra Library
CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- cusp provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems with iterative methods
- cusp is implemented on top of the Thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, Supercomputing 2009

Linear Algebra Library
CUSP: Generic Parallel Algorithms for Sparse Matrix Computations (overview figure not reproduced)

Linear Algebra Library
Matrix formats
- cusp natively supports several sparse matrix formats
- cusp makes it easy to transfer sparse matrix data between host and device and to convert between sparse matrix formats

//allocate storage for a CSR matrix on the host
//with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

//allocate device storage and transfer from host to device memory
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

//convert the CSR matrix format to the HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;

Linear Algebra Library
Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and biconjugate gradient stabilized (BiCGSTAB)

//matrix-vector multiplication y = A*x
cusp::multiply(A, x, y);

//sparse matrix transpose
cusp::transpose(A, At);

//conjugate gradient solver for A*x = b
cusp::krylov::cg(A, x, b);

//biconjugate gradient stabilized solver
cusp::krylov::bicgstab(A, x, b);

Linear Algebra Library

#include <cusp/hyb_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/precond/diagonal.h>
#include <cusp/krylov/cg.h>
#include <cusp/io/matrix_market.h>

int main(int argc, char** argv)
{
    typedef float               ValueType;
    typedef cusp::device_memory MemorySpace;

    //create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int, ValueType, MemorySpace> A;

    //load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    //allocate storage for the solution x and right-hand side b
    cusp::array1d<ValueType, MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType, MemorySpace> b(A.num_rows, 1);

    //set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b, 100, 1e-6);

    //set up the diagonal (Jacobi) matrix preconditioner
    cusp::precond::diagonal<ValueType, MemorySpace> M(A);

    //solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A, x, b, monitor, M);

    return 0;
}

Linear Algebra Library
OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct linear solver via SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES

Linear Algebra Library
ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold (ILUT)
GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level 3 BLAS kernels, based on OpenCL

Next Generation Architecture

Next Generation Architecture
The next GPU generation architecture is called Fermi (die photo and block diagram not reproduced)

Next Generation Architecture
Third generation Streaming Multiprocessor
- dual thread/warp scheduler
- 32 processors per SM
- double precision at 50% of single precision throughput (8x faster than GT200)
- 4 special function units
- 64 KB of on-chip RAM for shared memory and configurable L1 cache

Next Generation Architecture
Second generation Parallel Thread Execution (PTX)
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instruction for both single and double precision
- newly designed 32-bit integer ALU with extended precision operations

Next Generation Architecture
Improved memory system
- the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- an L1 cache for each multiprocessor improves bandwidth and reduces latency
- a unified 768 KB L2 cache provides coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2x faster than GDDR3

Next Generation Architecture
GigaThread Hardware Scheduler
- hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching to support concurrent kernel execution

Next Generation Architecture
GigaThread Hardware Scheduler: concurrent kernel execution plus faster context switching (diagram not reproduced)

Next Generation Architecture
GigaThread Hardware Scheduler
- dual DMA engines for simultaneous data transfer, fully overlapping data movement with CPU and GPU processing time
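
On the programming-model side, the dual DMA engines are exercised through CUDA streams; the sketch below (illustrative, with a hypothetical kernel) copies one chunk while computing on another, assuming the host buffer is page-locked so the copy can be asynchronous.

#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];                 //placeholder computation
}

void overlap(const float* h_next, float* d_next,   //chunk being copied
             const float* d_cur, float* d_out,     //chunk being computed
             size_t bytes, dim3 grid, dim3 block)
{
    cudaStream_t copy_stream, exec_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    //the copy runs on a DMA engine in one stream ...
    cudaMemcpyAsync(d_next, h_next, bytes,
                    cudaMemcpyHostToDevice, copy_stream);

    //... while the SMs execute a kernel in the other stream
    scale<<<grid, block, 0, exec_stream>>>(d_cur, d_out);

    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(exec_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(exec_stream);
}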

Next Generation Architecture
Third generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit and floating-point unit
- floating-point arithmetic upgraded from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision

Next Generation Architecture
What is NEW in the floating-point operations?
- fused multiply-add instructions are supported for both single and double precision
- the original multiply-add truncates the extra digits of the product A x B before adding C; fused multiply-add retains all digits of the product and rounds only once, after the addition
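
The difference is observable from host C code with the standard math library. In this small illustration the values are chosen so the product needs more than 24 significand bits: the separately rounded multiply-add cancels to zero, while fmaf keeps the low-order term (assuming the compiler does not itself contract the expression into an FMA).

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + 1.0f / 4096.0f;     /* 1 + 2^-12    */
    float c = -(1.0f + 1.0f / 2048.0f);  /* -(1 + 2^-11) */

    /* a*a = 1 + 2^-11 + 2^-24 exactly, but the float product
       rounds away the 2^-24 term before c is added */
    float mad = a * a + c;               /* two roundings -> 0      */
    float fma = fmaf(a, a, c);           /* one rounding  -> 2^-24  */

    printf("mad = %g, fma = %g\n", mad, fma);
    return 0;
}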

Next Generation Architecture
What is NEW in the floating-point operations?
- support for subnormal numbers in both single and double precision; these are the small numbers that lie between zero and the smallest normalized number of a given floating-point number system
- prior generations flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty

Next Generation Architecture
Third generation Streaming Multiprocessor
- 16 load/store units allow source and destination addresses to be calculated for 16 threads per cycle
- 32 single precision FMA units
- 16 double precision FMA units

Next Generation Architecture
Third generation Streaming Multiprocessor: double precision application performance (chart not reproduced)

Next Generation Architecture
Third generation Streaming Multiprocessor: two warp schedulers and instruction dispatch units (diagram not reproduced)

Next Generation Architecture
Third generation Streaming Multiprocessor
- the dual warp scheduler allows two warps to be issued and executed concurrently across the 32 cores

Next Generation Architecture
Third generation Streaming Multiprocessor: two warp schedulers and instruction dispatch units, 64 KB of configurable shared memory and L1 cache (diagram not reproduced)

Next Generation Architecture
64 KB of configurable shared memory and L1 cache, split either as
- 48 KB shared memory and 16 KB L1 cache
- 16 KB shared memory and 48 KB L1 cache
(chart of radix sort performance using shared memory not reproduced)
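
With the CUDA runtime, the split is requested per kernel through cudaFuncSetCacheConfig; a minimal sketch with a hypothetical kernel name:

#include <cuda_runtime.h>

__global__ void mykernel(float* data) { /* ... */ }

void configure(void)
{
    //prefer 48 KB shared memory / 16 KB L1 cache for this kernel
    cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared);

    //or prefer 16 KB shared memory / 48 KB L1 cache instead
    //cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);
}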

Next Generation Architecture
Unified memory address space
- combines the three separate address spaces (local, shared, and global) for load and store instructions
- this feature enables Fermi to support full C++ programs: virtual functions, function pointers, new and delete for objects, and try/catch exception handling

Next Generation Architecture
Summary table (not reproduced)

Next Generation Architecture
Scheduler bottleneck (G80-style graphics pipeline diagram, from host and input assembler through vertex/pixel thread issue and work distribution, not reproduced)

Next Generation Architecture new bottleneck old bottleneck 57

Reference
- Mark Harris http://www.markmark.net/
- Wei-Chao Chen http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php