FPGA vs. GPU for Sparse Matrix Vector Multiply
Yan Zhang, Yasser H. Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos
Dept. of Computer Science and Engineering, University of South Carolina
Heterogeneous and Reconfigurable Computing Group
This material is based upon work supported by the National Science Foundation under Grant Nos. CCF and CCF

Sparse Matrix Vector Multiplication
SpMV used as a kernel in many methods
–Iterative Principal Component Analysis (PCA)
–Matrix decomposition: LU, SVD, Cholesky, QR, etc.
–Iterative linear system solvers: CG, BCG, GMRES, Jacobi, etc.
–Other matrix operations

Talk Outline
GPU
–Microarchitecture & Memory Hierarchy
Sparse Matrix Vector Multiplication on GPU
Sparse Matrix Vector Multiplication on FPGA
Analysis of FPGA and GPU Implementations

NVIDIA GT200 Microarchitecture
Many-core architecture
–24 or 30 on-chip Streaming Multiprocessors (SMs)
–8 Scalar Processors (SPs) per SM
–Each SP can issue up to four threads
–Warp: a group of 32 threads sharing a common control path

GPU Memory Hierarchy
Off-chip device memory
–On board
–Host and GPU exchange I/O data here
–GPU stores state data here
On-chip memories
–A large set of 32-bit registers per processor
–Shared memory
–Constant cache (read only)
–Texture cache (read only)
(Diagram: multiprocessors 1 through n, each with constant and texture caches, backed by constant memory, texture memory, and device memory)
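As a purely illustrative aside (not code from the talk), the CUDA snippet below shows how these memory spaces appear to a kernel; the kernel, the variable names, and the use of the pre-Fermi texture-reference API are our own assumptions.

    // Hypothetical kernel touching each memory space in the hierarchy above.
    __constant__ float scale_factor;                    // constant memory (read-only, cached per SM)
    texture<float, 1, cudaReadModeElementType> lut_tex; // read-only data cached by the texture unit

    __global__ void memory_spaces_demo(const float *d_in, float *d_out, int n)
    {
        __shared__ float tile[256];        // on-chip shared memory (launch with 256 threads/block)
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i and v live in registers
        float v = (i < n) ? d_in[i] : 0.0f;              // d_in/d_out reside in off-chip device memory
        tile[threadIdx.x] = v;
        __syncthreads();
        if (i < n)
            d_out[i] = tile[threadIdx.x] * scale_factor + tex1Dfetch(lut_tex, i);
        // Host side (not shown): cudaMemcpyToSymbol for scale_factor, cudaBindTexture for lut_tex.
    }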

GPU Utilization and Throughput Metrics
CUDA Profiler used to measure:
–Occupancy
  Ratio of active warps to the maximum number of active warps per SM
  Limiting factors: number of registers, amount of shared memory, instruction count required by the threads
  Not an accurate indicator of SM utilization
–Instruction throughput
  Ratio of achieved instruction rate to peak instruction rate
  Limiting factors: memory latency, bank conflicts on shared memory, inactive threads within a warp caused by thread divergence
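The occupancy figure can be reproduced with a simple model. The sketch below is our own back-of-the-envelope calculation, assuming GT200-class (compute capability 1.3) per-SM limits; it ignores register and warp allocation granularity, so treat it as an approximation rather than the profiler's actual algorithm.

    #include <algorithm>
    #include <cstdio>

    // Assumed GT200-class (compute capability 1.3) per-SM limits.
    const int MAX_WARPS_PER_SM  = 32;          // 1024 threads
    const int MAX_BLOCKS_PER_SM = 8;
    const int REGS_PER_SM       = 16384;
    const int SMEM_PER_SM       = 16 * 1024;   // bytes

    // Occupancy = active warps / maximum active warps per SM.
    float occupancy(int threads_per_block, int regs_per_thread, int smem_per_block)
    {
        int warps_per_block = (threads_per_block + 31) / 32;
        int by_warps = MAX_WARPS_PER_SM / warps_per_block;
        int by_regs  = regs_per_thread ? REGS_PER_SM / (regs_per_thread * threads_per_block)
                                       : MAX_BLOCKS_PER_SM;
        int by_smem  = smem_per_block ? SMEM_PER_SM / smem_per_block : MAX_BLOCKS_PER_SM;
        int blocks   = std::min(std::min(by_warps, by_regs), std::min(by_smem, MAX_BLOCKS_PER_SM));
        return (float)(blocks * warps_per_block) / MAX_WARPS_PER_SM;
    }

    int main() {
        // e.g. 256 threads/block, 16 registers/thread, 2 KB shared memory/block -> 1.00
        printf("occupancy = %.2f\n", occupancy(256, 16, 2048));
        return 0;
    }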

Talk Outline
GPU
–Memory Hierarchy & Microarchitecture
Sparse Matrix Vector Multiplication on GPU
Sparse Matrix Vector Multiplication on FPGA
Analysis of FPGA and GPU Implementations

Sparse Matrix
Sparse matrices can be very large but contain few non-zero elements
SpMV: Ax = b
Need a special storage format
–Compressed Sparse Row (CSR)
(Figure: example matrix with its val, col, and ptr arrays)
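To make the CSR layout and the val/col/ptr arrays concrete, here is a minimal host-side sketch; the type and function names are ours, not from the talk, and it is the plain serial reference rather than either accelerator implementation.

    #include <vector>

    // Minimal CSR container; the array names mirror the slide (val, col, ptr).
    struct CsrMatrix {
        int n;                    // number of rows
        std::vector<double> val;  // non-zero values, stored row by row
        std::vector<int>    col;  // column index of each non-zero
        std::vector<int>    ptr;  // ptr[i]..ptr[i+1]-1 index row i's non-zeros (size n+1)
    };

    // Reference serial SpMV: b = A*x.
    void spmv_csr(const CsrMatrix &A, const std::vector<double> &x, std::vector<double> &b)
    {
        for (int i = 0; i < A.n; ++i) {
            double dot = 0.0;
            for (int j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
                dot += A.val[j] * x[A.col[j]];
            b[i] = dot;
        }
    }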

GPU SpMV Multiplication
State of the art
–NVIDIA Research (Nathan Bell)
–Ohio State University and IBM (Rajesh Bordawekar)
  Built on top of NVIDIA's SpMV CSR kernel
  Memory management optimizations added
In general, performance depends on effective use of GPU memories

OSU/IBM SpMV
Matrix stored in device memory
–Zero padding: elements per row padded to a multiple of sixteen
Input vector held in the SM's texture cache
Shared memory stores the output vector
Extracting global memory bandwidth
–Instruction and variable alignment necessary (fulfilled by built-in vector types)
–Global memory accesses by all threads of a half-warp are coalesced into a single transaction of 32, 64, or 128 bytes
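The kernel below is a hedged sketch of the half-warp-per-row CSR style described above, in the spirit of the NVIDIA CSR kernel that the OSU/IBM work builds on; it is not the authors' code. It assumes rows are already zero-padded to a multiple of 16 non-zeros, that the input vector has been bound on the host to the texture reference x_tex as int2 pairs (a common trick for double precision on GT200), and that blocks are launched with 128 threads.

    // Texture reference for the input vector; bound on the host with cudaBindTexture.
    texture<int2, 1, cudaReadModeElementType> x_tex;

    __device__ double fetch_x(int j)
    {
        int2 v = tex1Dfetch(x_tex, j);          // doubles are fetched as int2 pairs
        return __hiloint2double(v.y, v.x);
    }

    __global__ void spmv_csr_halfwarp(int n_rows, const int *ptr, const int *col,
                                      const double *val, double *y)
    {
        __shared__ volatile double sums[128];   // one slot per thread; blockDim.x == 128
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 15;            // position within the half-warp
        int row  = tid / 16;                    // one row per half-warp of 16 threads

        if (row < n_rows) {
            double sum = 0.0;
            for (int j = ptr[row] + lane; j < ptr[row + 1]; j += 16)
                sum += val[j] * fetch_x(col[j]);            // coalesced val/col reads
            sums[threadIdx.x] = sum;

            // Half-warp reduction in shared memory (half-warps execute in lockstep on GT200).
            if (lane < 8) sums[threadIdx.x] += sums[threadIdx.x + 8];
            if (lane < 4) sums[threadIdx.x] += sums[threadIdx.x + 4];
            if (lane < 2) sums[threadIdx.x] += sums[threadIdx.x + 2];
            if (lane == 0) y[row] = sums[threadIdx.x] + sums[threadIdx.x + 1];
        }
    }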

Analysis
Each thread reads 1/16th of the non-zero elements in a row
Accessing device memory (128-byte interface):
–val array: 16 threads read 16 x 8 bytes = 128 bytes
–col array: 16 threads read 16 x 4 bytes = 64 bytes
Occupancy achieved for all matrices was ONE
–Each thread uses a sufficiently small number of registers and amount of shared memory
–Each SM is capable of executing the maximum number of threads possible
Instruction throughput ratio: up to 0.886
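Those per-access byte counts translate into a bandwidth bound on SpMV throughput. The short calculation below is our own estimate, not a number from the talk: every non-zero moves 12 bytes of matrix data (8 B of val, 4 B of col) and contributes 2 flops, so peak GFLOPs is roughly bandwidth x 2/12, ignoring vector, ptr, and result traffic; the 100 GB/s figure is hypothetical.

    #include <cstdio>

    int main() {
        double bw_gb_s      = 100.0;          // hypothetical sustained device-memory bandwidth
        double bytes_per_nz = 8.0 + 4.0;      // val (double) + col (int) per non-zero
        double flops_per_nz = 2.0;            // one multiply and one add
        double gflops_bound = bw_gb_s * flops_per_nz / bytes_per_nz;
        printf("~%.1f GFLOPs upper bound at %.0f GB/s\n", gflops_bound, bw_gb_s);
        return 0;
    }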

Talk Outline
GPU
–Memory Hierarchy & Microarchitecture
Sparse Matrix Vector Multiplication on GPU
Sparse Matrix Vector Multiplication on FPGA
Analysis of FPGA and GPU Implementations

SpMV FPGA Implementation
Generally implemented architecture (from the literature)
–Multipliers followed by a binary tree of adders, followed by an accumulator
–Values delivered serially to the accumulator
–For a set of n values, n-1 additions are required to reduce
Problem
–Accumulation of floating-point values is an iterative procedure
(Diagram: multipliers with inputs M1/V1 and M2/V2 feeding the adder tree and accumulator)

The Reduction Problem
(Diagrams: the basic accumulator architecture, a pipelined adder with a feedback loop, produces multiple partial sums; the required design adds a reduction circuit with memory and control logic to combine them)
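The core difficulty is easy to see in software. The toy model below is our illustration, not the paper's hardware: it accumulates a serial stream through an alpha-stage pipelined adder by feeding the adder's output back to its input, and because a new value enters every cycle while a result takes alpha cycles to emerge, the stream splits into alpha independent partial sums that still have to be reduced afterwards.

    #include <vector>
    #include <cstdio>

    int main() {
        const int alpha = 3;                       // adder pipeline depth
        std::vector<double> partial(alpha, 0.0);   // one running sum per pipeline slot
        double stream[] = {1, 2, 3, 4, 5, 6, 7, 8};

        for (int i = 0; i < 8; ++i)
            partial[i % alpha] += stream[i];       // value i can only join the sum that
                                                   // re-emerges from the pipeline alpha
                                                   // cycles later
        for (int k = 0; k < alpha; ++k)
            printf("partial sum %d = %g\n", k, partial[k]);
        // Extra additions (the reduction circuit's job) are needed to combine these.
        return 0;
    }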

Previous Reduction Circuit Implementations
Group          FPGA             Reduc'n Logic   Reduc'n BRAM   D.p. adder speed   Accumulator speed
Prasanna '07   Virtex2 Pro100   DSA             3              170 MHz            142 MHz
Prasanna '07   Virtex2 Pro100   SSA             6              170 MHz            165 MHz
Gerards '08    Virtex 4         Rule Based      9              324 MHz            200 MHz
We need a better architecture: a feedback reduction circuit
–Simple and resource efficient
Reduce the performance gap between the adder and the accumulator
–Move logic outside the feedback loop

A Close Look at Floating Point Addition
IEEE 754 adder pipeline (assume a 4-bit significand):
–Compare exponents
–De-normalize the smaller value
–Add mantissas
–Round
–Re-normalize
(Figure: worked example adding two values with exponents near 2^21 and 2^24)
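The stages above can be mimicked in a few lines of code. The sketch below is a deliberately simplified, positive-only adder with a 4-bit significand like the slide's example; it truncates instead of applying IEEE round-to-nearest-even and omits signs and subnormals, so it only illustrates why the compare/de-normalize/add/round/re-normalize sequence makes accumulation an iterative, multi-cycle operation.

    #include <cstdio>

    struct Fp { unsigned mant; int exp; };        // value = mant * 2^exp, 8 <= mant <= 15

    Fp fp_add(Fp a, Fp b)
    {
        // 1. Compare exponents (ensure a holds the larger one).
        if (a.exp < b.exp) { Fp t = a; a = b; b = t; }
        int diff = a.exp - b.exp;

        // 2. De-normalize (align) the smaller value by shifting its mantissa right.
        unsigned small = (diff < 32) ? (b.mant >> diff) : 0;

        // 3. Add mantissas.
        unsigned sum = a.mant + small;
        int exp = a.exp;

        // 4./5. Round (truncation here) and re-normalize back to 4 mantissa bits.
        while (sum > 15) { sum >>= 1; ++exp; }
        Fp r = { sum, exp };
        return r;
    }

    int main() {
        Fp a = { 13, 20 };                        // 13 * 2^20
        Fp b = { 9,  24 };                        // 9  * 2^24
        Fp c = fp_add(a, b);
        printf("%u * 2^%d\n", c.mant, c.exp);     // prints 9 * 2^24: the smaller operand
        return 0;                                 // vanishes after alignment and truncation
    }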

Base Conversion
Idea:
–Shift both inputs to the left by the amount specified in the low-order bits of their exponents
–Reduces the size of the exponent, requires a wider adder
Example (base-8 conversion):
–1.36328125, exp = 10110 (1.36328125 x 2^22 => ~5.7 million)
–Shift to the left by 6 bits (the value of the low-order exponent bits)...
–87.25, exp = 10 (87.25 x 2^(8*2) => ~5.7 million)
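A small software model of the conversion step, using the slide's numbers, is shown below; the struct and function names are ours, and the parameters are assumptions for illustration rather than anything taken from the paper's hardware.

    #include <cstdio>

    struct Converted { double mant; int exp_hi; };    // value = mant * 2^(exp_hi * 8)

    // Fold the low-order exponent bits into the mantissa with a left shift,
    // leaving a shorter exponent that counts in coarser steps.
    Converted to_wide_base(double mant, int exp2, int low_bits /* 3 for the base-8 example */)
    {
        int shift = exp2 & ((1 << low_bits) - 1);     // value of the low-order exponent bits
        Converted c;
        c.mant   = mant * (double)(1u << shift);      // wider, de-normalized mantissa
        c.exp_hi = exp2 >> low_bits;                  // remaining high-order exponent bits
        return c;
    }

    int main() {
        // Slide's example: 1.36328125 * 2^22 (~5.7 million); low 3 exponent bits = 6.
        Converted c = to_wide_base(1.36328125, 22, 3);
        printf("%.2f * 2^(%d*8) = %.0f\n", c.mant, c.exp_hi,
               c.mant * (double)(1u << (c.exp_hi * 8)));   // 87.25 * 2^16 = 5718016
        return 0;
    }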

Accumulator Design
(Diagram: preprocess stage, adder feedback loop, and post-process stage; adder pipeline depth α = 3)

Reduction Circuit
Designed a novel reduction circuit
–Lightweight: takes advantage of the shallow adder pipeline
Requires
–One input buffer
–One output buffer
–An eight-state FSM controller

Three-Stage Reduction Architecture
(Animation: input values B1 through B8 stream into the three-stage "adder" pipeline; using only the input and output buffers, they are folded on the fly into three partial sums, B2+B3+B6, B1+B4+B7, and B5+B8, after which the next input set, beginning with C1, enters the pipeline)

Reduction Circuit Configurations
Four "configurations"
Deterministic control sequence, triggered by a set change:
–D, A, C, B, A, B, B, C, B/D
Minimum set size: α⌈lg α + 1⌉ − 1
–Minimum set size for an adder pipeline depth of 3 is 8
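As a quick sanity check on the formula, the snippet below evaluates the minimum set size for a few adder depths; it is only an illustration of the stated expression, not code from the design.

    #include <cmath>
    #include <cstdio>

    // Minimum set size = alpha * ceil(lg(alpha) + 1) - 1, where alpha is the adder pipeline depth.
    int min_set_size(int alpha)
    {
        return alpha * (int)std::ceil(std::log2((double)alpha) + 1.0) - 1;
    }

    int main() {
        for (int alpha = 2; alpha <= 5; ++alpha)
            printf("alpha = %d -> minimum set size = %d\n", alpha, min_set_size(alpha));
        return 0;                   // alpha = 3 gives 8, matching the slide
    }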

New SpMV Architecture
Built around the limitations of the reduction circuit
–Delete the binary adder tree
–Replicate the accumulators
–Schedule data to process multiple dot products in parallel
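A toy software model of the scheduling idea follows; it is our own sketch with dummy values, showing only that round-robin interleaving of row products across replicated accumulators lets several dot products proceed in parallel while each accumulator still receives its values serially.

    #include <vector>
    #include <cstdio>

    int main() {
        const int n_acc = 4;                             // replicated accumulators
        // products[r] holds the val*x terms of row r (dummy values here).
        std::vector<std::vector<double> > products(n_acc, std::vector<double>(6));
        for (int r = 0; r < n_acc; ++r)
            for (int j = 0; j < 6; ++j)
                products[r][j] = 0.5 * (r + 1);

        std::vector<double> acc(n_acc, 0.0);
        for (int j = 0; j < 6; ++j)            // each "cycle" delivers one term per row,
            for (int r = 0; r < n_acc; ++r)    // time-multiplexed across the accumulators
                acc[r] += products[r][j];

        for (int r = 0; r < n_acc; ++r)
            printf("row %d dot product = %g\n", r, acc[r]);
        return 0;
    }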

Talk Outline
GPU
–Memory Hierarchy & Microarchitecture
Sparse Matrix Vector Multiplication on GPU
Sparse Matrix Vector Multiplication on FPGA
Analysis of FPGA and GPU Implementations

Performance Figures
Columns: Matrix | Order/dimensions | nz | Avg. nz/row | GPU Mem. BW (GB/s) | GPU GFLOPs | FPGA GFLOPs (8.5 GB/s)
Test matrices: TSOPF_RS_b162_c, E40r, Simon/olafu, Garon/garon, Mallya/lhr11c, Hollinger/mark3jac020sc, Bai/dw, YCheng/psse, GHS_indef/ncvxqp
(Per-matrix numeric values not preserved in this transcript)

Performance Comparison
If the FPGA memory bandwidth were scaled, by adding multipliers/accumulators, to match the GPU memory bandwidth for each matrix separately:
Columns: GPU Mem. BW (GB/s) | FPGA Mem. BW (GB/s)
Required scaling per matrix: x6, x6, x6, x5, x4, x3, x3, x3, x3
(Per-matrix bandwidth values not preserved in this transcript)

Conclusions
Presented a state-of-the-art GPU implementation of SpMV
Presented a new SpMV architecture for FPGAs
–Based on a novel accumulator architecture
GPUs at present perform better than FPGAs for SpMV
–Due to the available memory bandwidth
FPGAs have the potential to outperform GPUs
–They need more memory bandwidth

Acknowledgements
Dr. Jason Bakos
Yan Zhang, Tiffany Mintz, Zheming Jin, Yasser Shalabi, Rishabh Jain
National Science Foundation
Questions? Thank you!

Performance Analysis
Xilinx Virtex-2 Pro 100
–Includes everything related to the accumulator (LUT-based adder)