FPGA vs. GPU for Sparse Matrix Vector Multiply Yan Zhang, Yasser H. Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina Heterogeneous and Reconfigurable Computing Group This material is based upon work supported by the National Science Foundation under Grant Nos. CCF and CCF
Sparse Matrix Vector Multiplication SpMV is used as a kernel in many methods –Iterative Principal Component Analysis (PCA) –Matrix decomposition: LU, SVD, Cholesky, QR, etc. –Iterative linear system solvers: CG, BCG, GMRES, Jacobi, etc. –Other matrix operations
Talk Outline GPU –Microarchitecture & Memory Hierarchy Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations
NVIDIA GT200 Microarchitecture Many-core architecture –24 or 30 on-chip Streaming Multiprocessors (SMs) –8 Scalar Processors (SPs) per SM –Each SP can issue up to four threads –Warp: group of 32 threads sharing a common control path
GPU Memory Hierarchy Off-chip device memory –On board –Host and GPU exchange I/O data –GPU stores state data On-chip memories –A large set of 32-bit registers per processor –Shared memory –Constant cache (read only) –Texture cache (read only) [Diagram: multiprocessors 1…n, each with constant and texture caches, backed by constant memory, texture memory, and device memory]
GPU Utilization and Throughput Metrics CUDA Profiler used to measure –Occupancy Ratio of active warps to the maximum number of active warps per SM Limiting Factors: –Number of registers –Amount of shared memory –Instruction count required by the threads Not an accurate indicator of SM utilization –Instruction Throughput Ratio of achieved instruction rate to peak instruction rate Limiting Factors: –Memory latency –Bank conflicts on shared memory –Inactive threads within a warp caused by thread divergence
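As an illustration of how these limiting factors interact, here is a small host-side occupancy estimate for a hypothetical kernel; the per-SM limits below are assumed GT200-class values, and the kernel's register and shared-memory usage is made up for the example.

// Hedged sketch: occupancy is bounded by whichever per-SM resource runs out
// first (registers, shared memory, or the resident-warp limit).
#include <cstdio>
#include <algorithm>

int main() {
    // Assumed GT200-class per-SM limits.
    const int regs_per_sm  = 16384;
    const int smem_per_sm  = 16384;   // bytes
    const int max_warps    = 32;      // 1024 resident threads / 32 threads per warp

    // Hypothetical kernel and launch configuration.
    const int threads_per_block = 128;
    const int regs_per_thread   = 16;
    const int smem_per_block    = 4096;   // bytes

    int warps_per_block = threads_per_block / 32;
    int by_regs  = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem  = smem_per_sm / smem_per_block;
    int by_warps = max_warps / warps_per_block;
    int blocks   = std::min(by_regs, std::min(by_smem, by_warps));

    double occupancy = (double)(blocks * warps_per_block) / max_warps;
    printf("resident blocks per SM = %d, occupancy = %.2f\n", blocks, occupancy);
    return 0;
}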
Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations
Sparse Matrix Sparse matrices can be very large but contain few non-zero elements SpMV: Ax = b Need a special storage format –Compressed Sparse Row (CSR): three arrays, val, col, and ptr
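For reference, a minimal host-side sketch (not from the slides) of the three CSR arrays and a straightforward SpMV loop; the array names val, col, and ptr follow the slide, while the struct and function names are illustrative.

// Minimal CSR storage and sequential SpMV (y = A*x), for reference only.
#include <vector>

struct CsrMatrix {
    int n;                       // number of rows
    std::vector<double> val;     // non-zero values, stored row by row
    std::vector<int>    col;     // column index of each non-zero
    std::vector<int>    ptr;     // ptr[r]..ptr[r+1]-1 index row r's non-zeros (size n+1)
};

void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int r = 0; r < A.n; ++r) {
        double dot = 0.0;
        for (int i = A.ptr[r]; i < A.ptr[r + 1]; ++i)
            dot += A.val[i] * x[A.col[i]];   // one multiply-accumulate per non-zero
        y[r] = dot;
    }
}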
GPU SpMV State of the art –NVIDIA Research (Nathan Bell) –Ohio State University and IBM (Rajesh Bordawekar), built on top of NVIDIA's SpMV CSR kernel with memory-management optimizations added In general, performance depends on effective use of the GPU memories
OSU/IBM SpMV Matrix stored in device memory –Zero padding: elements per row padded to a multiple of sixteen Input vector held in the SM's texture cache Shared memory stores the output vector Extracting global memory bandwidth –Instruction and variable alignment necessary (fulfilled by built-in types) –Global memory accesses by all threads of a half-warp are coalesced into a single transaction of 32, 64, or 128 bytes
Analysis Each thread reads 1/16th of the non-zero elements in a row Accessing device memory (128-byte interface): –val array: 16 threads read 16 x 8 bytes = 128 bytes –col array: 16 threads read 16 x 4 bytes = 64 bytes Occupancy achieved for all matrices was 1.0 –Each thread uses a sufficiently small number of registers and amount of shared memory –Each SM is able to execute the maximum number of threads possible Instruction throughput ratio: up to 0.886
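A hedged CUDA sketch of the access pattern described above: one half-warp (16 threads) per row, with rows zero-padded to a multiple of 16 non-zeros so that val/col reads coalesce into the 128-byte and 64-byte transactions noted above. This illustrates the pattern, not the OSU/IBM kernel itself; for simplicity the input vector x is read from global memory rather than the texture cache, and the kernel assumes a block size of 128 threads.

// Illustrative CUDA kernel: one half-warp (16 threads) per matrix row.
// Launch with blockDim.x == 128. Rows are assumed zero-padded so each row
// holds a multiple of 16 non-zeros.
__global__ void spmv_halfwarp(int n_rows,
                              const double* val,   // non-zeros, padded per row
                              const int*    col,   // column index per non-zero
                              const int*    ptr,   // row pointers (size n_rows+1)
                              const double* x,     // input vector
                              double*       y)     // output vector
{
    __shared__ double partial[128];                   // one slot per thread
    int tid  = threadIdx.x;
    int lane = tid & 15;                              // lane within the half-warp
    int row  = (blockIdx.x * blockDim.x + tid) / 16;  // one row per half-warp
    if (row >= n_rows) return;

    double sum = 0.0;
    // Lane k reads non-zeros k, k+16, k+32, ... of the row: 16 consecutive
    // threads touch 16 consecutive val/col entries, so the loads coalesce.
    for (int i = ptr[row] + lane; i < ptr[row + 1]; i += 16)
        sum += val[i] * x[col[i]];

    // Warp-synchronous reduction of the 16 partial sums (threads of a
    // half-warp execute in lockstep on the GT200; volatile keeps the
    // shared-memory traffic visible between steps).
    volatile double* vp = partial;
    vp[tid] = sum;
    if (lane < 8) vp[tid] += vp[tid + 8];
    if (lane < 4) vp[tid] += vp[tid + 4];
    if (lane < 2) vp[tid] += vp[tid + 2];
    if (lane < 1) vp[tid] += vp[tid + 1];
    if (lane == 0) y[row] = vp[tid];
}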
Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations
SpMV FPGA Implementation Commonly implemented architecture (from the literature) –Multipliers followed by a binary tree of adders followed by an accumulator –Values delivered serially to the accumulator –For a set of n values, n-1 additions are required to reduce them Problem –Accumulation of floating-point values is an iterative procedure [Diagram: multipliers fed by matrix values M1, M2 and vector values V1, V2, feeding the accumulator]
The Reduction Problem [Diagram: the basic accumulator architecture, a floating-point adder pipeline with a feedback loop, versus the required design, which adds a reduction circuit with memory and control for the partial sums]
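To make the problem concrete, here is a toy software model (an illustration, not the hardware): with an adder pipeline of depth 3, naively feeding the "running sum" back every cycle actually maintains three interleaved partial sums, which still have to be combined into one result.

// Toy host-side model of why a pipelined floating-point adder complicates
// accumulation: a new input cannot be added to the running sum every cycle,
// because the previous sum has not yet left the pipeline. Feeding the adder
// each cycle therefore produces ALPHA independent partial sums.
#include <cstdio>

const int ALPHA = 3;                       // adder pipeline depth

int main() {
    double inputs[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double partial[ALPHA] = {0, 0, 0};     // one interleaved partial sum per stage

    // Cycle-by-cycle: input i enters the adder together with the value
    // emerging from the pipeline that cycle, i.e. partial[i % ALPHA].
    for (int i = 0; i < 8; ++i)
        partial[i % ALPHA] += inputs[i];

    // The accumulator is left with ALPHA partial sums, not one result:
    for (int s = 0; s < ALPHA; ++s)
        printf("partial[%d] = %g\n", s, partial[s]);   // 1+4+7, 2+5+8, 3+6
    // A reduction circuit's job is to coalesce these into a single value
    // without stalling the pipeline between input sets.
    return 0;
}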
Previous Reduction Circuit Implementations
Group         FPGA             Reduc'n logic   Reduc'n BRAM   D.p. adder speed   Accumulator speed
Prasanna '07  Virtex2 Pro100   DSA             3              170 MHz            142 MHz
Prasanna '07  Virtex2 Pro100   SSA             6              170 MHz            165 MHz
Gerards '08   Virtex 4         Rule Based      9              324 MHz            200 MHz
We need a better architecture: a feedback reduction circuit –Simple and resource efficient –Reduces the performance gap between the adder and the accumulator –Moves logic outside the feedback loop
A Close Look at Floating-Point Addition IEEE 754 adder pipeline (assume a 4-bit significand): –Compare exponents –De-normalize the smaller value –Add mantissas –Round –Re-normalize [Worked example on the slide: a value scaled by 2^21 is aligned to a value scaled by 2^24, added, and rounded to a 2^24 result]
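For reference, a software walk-through of these stages on two single-precision values (signs, rounding detail, and special cases omitted to keep the steps visible); the input numbers are illustrative and are not the slide's example.

// Illustrative walk-through of the adder pipeline stages for IEEE 754
// binary32, positive inputs only, no rounding or special-case handling.
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    float a = 87.25f, b = 1.375f;

    uint32_t ua, ub;
    std::memcpy(&ua, &a, 4);
    std::memcpy(&ub, &b, 4);

    // Stage 1: compare exponents.
    int ea = (ua >> 23) & 0xFF, eb = (ub >> 23) & 0xFF;
    uint32_t ma = (ua & 0x7FFFFF) | 0x800000;   // restore implicit leading 1
    uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

    // Stage 2: de-normalize (right-shift) the smaller value.
    int d = ea - eb;
    if (d >= 0) mb >>= d; else { ma >>= -d; ea = eb; }

    // Stage 3: add mantissas.
    uint32_t m = ma + mb;

    // Stages 4/5: re-normalize if the sum overflowed one bit (rounding omitted).
    if (m & 0x1000000) { m >>= 1; ea += 1; }

    uint32_t ur = ((uint32_t)ea << 23) | (m & 0x7FFFFF);
    float r;
    std::memcpy(&r, &ur, 4);
    printf("%g + %g = %g\n", a, b, r);          // prints 88.625
    return 0;
}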
Base Conversion Idea: –Shift both inputs to the left by the amount specified in the low-order bits of their exponents –Reduces the size of the exponent, but requires a wider adder Example: –Base-8 conversion: exp = 10110 (value x 2^22 => ~5.7 million) Shift the significand to the left by 6 bits… exp = 10 (87.25 x 2^(8*2) => ~5.7 million)
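A hedged sketch of the idea for a toy format: the low-order exponent bits become a left shift of the significand, leaving a shorter exponent and a wider significand. The field widths and the back-computed significand value below are assumptions, chosen only to reproduce the slide's 87.25 x 2^(8*2) example.

// Illustrative model of base conversion: the low-order LOW_BITS of the
// exponent are folded into a left shift of the significand, so the remaining
// exponent is shorter at the cost of a wider adder.
#include <cstdio>
#include <cstdint>

const int LOW_BITS = 3;   // exponent bits folded into the shift (2^3 = 8)

struct Converted {
    uint64_t wide_significand;   // significand after the left shift
    uint32_t high_exponent;      // remaining, shortened exponent
};

// value = significand * 2^exponent == wide_significand * 2^(8 * high_exponent)
Converted base_convert(uint64_t significand, uint32_t exponent) {
    uint32_t shift = exponent & ((1u << LOW_BITS) - 1);  // low-order exponent bits
    Converted c;
    c.wide_significand = significand << shift;            // de-normalize to the left
    c.high_exponent    = exponent >> LOW_BITS;
    return c;
}

int main() {
    // The slide's example: exponent 10110 (22), value ~5.7 million.
    // Significand held in 1.8 fixed point: 0x15D / 256 = 1.36328125 (assumed).
    Converted c = base_convert(0x15D, 22);
    // wide_significand = 0x15D << 6 = 22336, i.e. 87.25 in the same 1.8 fixed
    // point; high_exponent = 2, so the value is 87.25 * 2^(8*2) ~ 5.7 million.
    printf("%llu * 2^(8*%u)\n", (unsigned long long)c.wide_significand,
           c.high_exponent);
    return 0;
}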
Accumulator Design [Diagram: accumulator datapath with a preprocess stage, the adder feedback loop, and a post-process stage; adder pipeline depth α = 3]
Reduction Circuit Designed a novel reduction circuit –Lightweight, taking advantage of the shallow adder pipeline Requires –One input buffer –One output buffer –An eight-state FSM controller
Three-Stage Reduction Architecture [Animation across several slides: values B1…B8 of one input set stream from the input buffer into the three-stage "adder" pipeline; partial sums such as B2+B3 and B1+B4 are formed and recombined over successive cycles into B2+B3+B6, B1+B4+B7, and B5+B8, which drain to the output buffer as the first value C1 of the next set enters]
Reduction Circuit Configurations Four "configurations": A, B, C, D Deterministic control sequence, triggered by a set change: –D, A, C, B, A, B, B, C, B/D Minimum set size: α⌈lg α + 1⌉ − 1 –For an adder pipeline depth of 3, the minimum set size is 8
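As a quick sanity check, the expression can be evaluated for a few pipeline depths; the parenthesization α·(⌈lg α⌉ + 1) − 1 used below is our reading of the slide's formula, and it reproduces the stated minimum of 8 for α = 3.

// Quick check of the minimum set size alpha * (ceil(lg alpha) + 1) - 1.
#include <cstdio>
#include <cmath>

int min_set_size(int alpha) {
    int ceil_lg = (int)std::ceil(std::log2((double)alpha));
    return alpha * (ceil_lg + 1) - 1;
}

int main() {
    for (int alpha = 2; alpha <= 8; ++alpha)
        printf("alpha = %d -> minimum set size = %d\n", alpha, min_set_size(alpha));
    // alpha = 3 -> 8, matching the slide.
    return 0;
}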
New SpMV Architecture Designed around the reduction circuit's minimum-set-size limitation –Delete the adder binary tree –Replicate accumulators –Schedule data to process multiple dot products in parallel
Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations
Performance Figures Table columns: Matrix, Order/dimensions, nz, Avg. nz/row, GPU Mem. BW (GB/s), GPU GFLOPs, FPGA GFLOPs (at 8.5 GB/s) Test matrices: TSOPF_RS_b162_c, E40r, Simon/olafu, Garon/garon, Mallya/lhr11c, Hollinger/mark3jac020sc, Bai/dw, YCheng/psse, GHS_indef/ncvxqp
Performance Comparison If FPGA memory bandwidth were scaled, by adding multipliers/accumulators, to match the GPU memory bandwidth for each matrix separately Table columns: GPU Mem. BW (GB/s), FPGA Mem. BW (GB/s) –Required scaling of the FPGA design per matrix: x6, x6, x6, x5, x4, x3, x3, x3, x3
Conclusions Presented a state-of-the-art GPU implementation of SpMV Presented a new SpMV architecture for FPGAs –Based on a novel accumulator architecture GPUs, at present, perform better than FPGAs for SpMV –Due to available memory bandwidth FPGAs have the potential to outperform GPUs –They need more memory bandwidth
Acknowledgement Dr. Jason Bakos Yan Zhang, Tiffany Mintz, Zheming Jin, Yasser Shalabi, Rishabh Jain National Science Foundation Questions?? Thank You!!
Performance Analysis Xilinx Virtex-2 Pro100 –Includes everything related to the accumulator (LUT-based adder)