A Sparse Matrix Personality for the Convey HC-1. Krishna K. Nagar and Jason D. Bakos, Dept. of Computer Science and Engineering, University of South Carolina.


A Sparse Matrix Personality for the Convey HC-1
Krishna K. Nagar, Jason D. Bakos
Heterogeneous and Reconfigurable Computing Lab (HeRC), Dept. of Computer Science and Engineering, University of South Carolina
This material is based upon work supported by the National Science Foundation under Grant Nos. CCF and CCF

Introduction
– Convey HC-1: a turnkey reconfigurable computer
– Personality: a configuration of the user-programmable FPGAs that works within the HC-1's execution and programming model
– This paper introduces a sparse matrix personality for the Convey HC-1

Outline
Convey HC-1
– Overview of system
– Shared coherent memory model
– High-performance coprocessor memory
Personality design for sparse matrix-vector multiply
– Indirect addressing of vector data
– Streaming double precision reduction architecture
Results and comparison with NVIDIA Tesla

Convey HC-1
– Host: Intel server motherboard with an Intel Xeon in socket 1 and 24 GB DRAM
– Coprocessor: attaches through socket 2; contains 4 Virtex-5 LX330 application engines (AEs) and 16 GB of scatter-gather DDR2 DRAM

Coprocessor Memory System
– 16 GB of scatter-gather DIMMs behind eight memory controllers (MC0–MC7); each AE connects to all eight MCs through a full crossbar
– Address space is partitioned across all 16 DIMMs
– High-performance memory: organized in 1024 banks; crossbar parallelism gives 80 GB/s aggregate bandwidth; relaxed memory model
– Smallest contiguous unit of data that can be read at full bandwidth: 512 bytes

HC-1 Execution Model
– Host: application code is invoked and input data is written to host memory (the input blocks are held exclusive by the host)
– Host: invokes the coprocessor, sending it pointers to the input and output memory blocks, then blocks until the coprocessor finishes
– Coprocessor: reads the contents of the input blocks (they become shared), updates coprocessor memory, writes results to an output memory block, and signals completion
– Host: reads the results from the output memory block
– Memory blocks move among the invalid, exclusive, and shared coherence states as ownership passes between host and coprocessor
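To make this flow concrete, here is a minimal host-side sketch. The function names are hypothetical placeholders introduced only for illustration; they are not Convey's actual API.

```c
#include <stddef.h>

/* Hypothetical placeholder calls, used only to illustrate the execution model
 * described above; these are NOT Convey's real library functions. */
extern void coproc_invoke_spmv(const double *val, const int *col, const int *ptr,
                               const double *x, double *b,
                               size_t nz, size_t nrows);
extern void coproc_wait(void);

void spmv_via_coprocessor(const double *val, const int *col, const int *ptr,
                          const double *x, double *b, size_t nz, size_t nrows)
{
    /* 1. Input data (val, col, ptr, x) has already been written into memory by the host. */

    /* 2. The host invokes the coprocessor, passing pointers to the memory
     *    blocks; ownership of those blocks migrates to the coprocessor.       */
    coproc_invoke_spmv(val, col, ptr, x, b, nz, nrows);

    /* 3. The host blocks until the coprocessor has written the results to b.  */
    coproc_wait();

    /* 4. The host can now read the result vector b (the block is shared again). */
}
```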

Convey Licensed Personalities
– Soft-core vector processor
  – Includes a corresponding vectorizing compiler
  – Supports single and double precision
  – Supports hardware instructions for transcendentals and random number generation
– Smith-Waterman sequence alignment personality
– No sparse matrix personality

Outline
Convey HC-1
– Overview of system
– Shared coherent memory
– High-performance coprocessor memory
Personality design for sparse matrix-vector multiply
– Indirect addressing of vector data
– Streaming double precision reduction architecture
Results and comparison with NVIDIA CUSPARSE on Tesla and Fermi

Sparse Matrix Representation
– Sparse matrices can be very large but contain few non-zero elements
– Compressed formats are often used, e.g. Compressed Sparse Row (CSR), which stores a matrix as three arrays: val (the non-zero values), col (the column index of each value), and ptr (the offset in val/col at which each row begins)
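As a concrete illustration, here are the CSR arrays for a small hypothetical 4 x 4 matrix (this example matrix is ours, not the one shown on the original slide):

```c
/* Hypothetical 4x4 example matrix:
 *
 *   | 5 0 0 1 |
 *   | 0 8 0 0 |
 *   | 0 0 3 0 |
 *   | 0 6 0 7 |
 */
double val[] = {5, 1, 8, 3, 6, 7};   /* non-zero values, stored row by row     */
int    col[] = {0, 3, 1, 2, 1, 3};   /* column index of each non-zero value    */
int    ptr[] = {0, 2, 3, 4, 6};      /* start of each row in val/col;          */
                                     /* ptr[i+1] - ptr[i] = non-zeros in row i */
```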

Sparse Matrix-Vector Multiply
– Code for Ax = b, streaming over the non-zero elements (the accumulation into b[row] is a recurrence/reduction, and x[col[i]] is an indirect access):
    row = 0, b[0] = 0.0
    for i = 0 to number_of_nonzero_elements - 1 do
        if i == ptr[row+1] then row = row + 1, b[row] = 0.0
        b[row] = b[row] + val[i] * x[col[i]]
    end
– Low arithmetic intensity (roughly 1 FLOP per 10 bytes)
– NVIDIA GPUs achieve only 0.6% to 6% of their peak double precision performance with CSR SpMV (N. Bell and M. Garland, "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors," Proc. Supercomputing 2009)
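A minimal runnable C version of the same loop (a sketch; it zeroes the whole output vector first so that empty rows are handled, which the slide's pseudocode glosses over):

```c
#include <stddef.h>

/* CSR sparse matrix-vector multiply, b = A*x, streaming over the non-zeros in
 * the same order as the slide's pseudocode. nz is the number of non-zero
 * elements and nrows the number of matrix rows. */
void spmv_csr(size_t nz, size_t nrows,
              const double *val, const int *col, const int *ptr,
              const double *x, double *b)
{
    for (size_t r = 0; r < nrows; r++)
        b[r] = 0.0;                          /* also covers empty rows */

    size_t row = 0;
    for (size_t i = 0; i < nz; i++) {
        while (row + 1 < nrows && i == (size_t)ptr[row + 1])
            row++;                           /* advance past finished (or empty) rows */
        b[row] += val[i] * x[col[i]];        /* indirect access into the vector x */
    }
}
```

With the hypothetical CSR arrays above and a length-4 vector x, spmv_csr(6, 4, val, col, ptr, x, b) computes b = Ax.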

Indirect Addressing
– In b[row] = b[row] + val[i] * x[col[i]], the val/col stream is read contiguously from coprocessor memory, but x[col[i]] is a data-dependent, indirect access into the vector
– Vector elements are therefore served from a 64 KB vector cache backed by coprocessor memory

Data Stream
– A matrix cache fetches contiguous matrix (val/col) data from coprocessor memory
– A 4096-bit shifter loads matrix data from the cache in parallel and delivers it serially to the multiply-accumulate datapath of the PE
– Column indices look up the vector elements in the vector cache
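A small software sketch of the shifter idea: a wide line is loaded in one step and then drained one entry at a time. The 16-byte value-plus-index entry layout is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the 4096-bit shifter: one 512-byte line is loaded in parallel
 * from the matrix cache, then its entries are delivered serially, one per
 * "cycle". The value-plus-column entry layout is an assumption. */
#define LINE_BYTES 512

typedef struct { double val; int32_t col; int32_t pad; } Entry;   /* 16 bytes   */
#define ENTRIES_PER_LINE (LINE_BYTES / sizeof(Entry))              /* 32 entries */

typedef struct {
    Entry  line[ENTRIES_PER_LINE];
    size_t next;                     /* next entry to shift out */
} Shifter;

/* Parallel load: copy one full line from the matrix cache into the shifter. */
static void shifter_load(Shifter *s, const void *cache_line)
{
    memcpy(s->line, cache_line, LINE_BYTES);
    s->next = 0;
}

/* Serial delivery: hand one entry to the MAC per call; returns 0 when empty. */
static int shifter_pop(Shifter *s, Entry *out)
{
    if (s->next >= ENTRIES_PER_LINE)
        return 0;                    /* line exhausted, a new load is needed */
    *out = s->line[s->next++];
    return 1;
}
```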

Top-Level Design and Scaling
– Each AE instantiates eight processing elements (PE1–PE8), each containing its own multiplier, accumulator (reduction circuit), and shifter
– The PEs share a 64 KB matrix cache and stream vector data from coprocessor memory

Outline
Convey HC-1
– Overview of system
– Shared coherent memory
– High-performance coprocessor memory
Personality design for sparse matrix-vector multiply
– Indirect addressing of vector data
– Streaming double precision reduction architecture
Results and comparison with NVIDIA CUSPARSE on Tesla and Fermi

The Reduction Problem
– Ideally, new values arrive from the multiplier every clock cycle
– Partial sums of different accumulation sets become intermixed in the deeply pipelined adder
  – Data hazard: a new value cannot simply be added to a partial sum that is still in flight in the adder pipeline
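To put a rough number on the hazard (a sketch; the 14-cycle latency is the adder depth used later in this design):

```c
/* Cost of reducing one accumulation set (row) of n values with a 14-stage
 * pipelined adder:
 *   - naive: wait for each result before issuing the next add -> serial chain
 *   - ideal: issue roughly one addition per cycle by keeping the pipeline full */
#define ADDER_LATENCY 14

static long naive_cycles(long n) { return (n - 1) * ADDER_LATENCY; }  /* stall-and-wait  */
static long ideal_cycles(long n) { return (n - 1) + ADDER_LATENCY; }  /* fully pipelined */
```

Keeping the pipeline full requires interleaving partial sums from different rows, which is exactly what the reduction circuit on the following slides schedules dynamically.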

Resolving the Reduction Problem
– Custom architecture to dynamically schedule concurrent reduction operations
– Compared against prior streaming reduction circuits (Prasanna et al., Gerards et al.) in terms of adders, reduction BRAMs, and control memory; this work uses a single adder

Our Approach
– Built around a 14-stage double precision adder
– Rule-based approach
  – Governs the routing of incoming values and the adder output
  – Decides the inputs to the adder
  – Rules are applied based on the current state of the system
– Goals: maximize adder utilization; minimize the required number of buffers
– A software model was used to design the rules and find the required number of buffers

Reduction Circuit
– Adder inputs are chosen by comparing the row IDs of the incoming value, the buffered values, and the adder output; incoming values that cannot be consumed immediately are held in buffers fed by a FIFO
  – Rule 1: buf_n.rowID = adderOut.rowID → add the buffered value to the adder output
  – Rule 2: buf_i.rowID = buf_j.rowID → add the two buffered values
  – Rule 3: input.rowID = adderOut.rowID → add the incoming value to the adder output
  – Rule 4: buf_n.rowID = input.rowID → add the incoming value to the buffered value
  – Rule 5: addIn1 = input, addIn2 = 0
  – Rule 5 (special case): addIn1 = adderOut, addIn2 = 0
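Below is a minimal software-model sketch of how these rules can be applied in priority order each cycle, in the spirit of the software model mentioned on the previous slide. The buffer count and data layout are assumptions; whatever the rules do not consume in a cycle would be written into the buffers.

```c
#include <stdbool.h>

#define NBUF 8   /* assumed buffer count; the real number comes from the authors' model */

typedef struct { double val; int row; bool valid; } Value;

/* Choose the two adder operands for this cycle from the incoming value, the
 * value leaving the adder, and the buffered values, applying Rules 1-5 in
 * priority order. Consumed sources are marked invalid; anything left over
 * must be buffered by the caller. Returns false if the adder gets no work. */
static bool choose_adder_inputs(Value *input, Value *adderOut, Value buf[NBUF],
                                Value *a, Value *b)
{
    /* Rule 1: a buffered value matches the adder output. */
    if (adderOut->valid)
        for (int n = 0; n < NBUF; n++)
            if (buf[n].valid && buf[n].row == adderOut->row) {
                *a = buf[n]; *b = *adderOut;
                buf[n].valid = adderOut->valid = false;
                return true;
            }
    /* Rule 2: two buffered values match each other. */
    for (int i = 0; i < NBUF; i++)
        for (int j = i + 1; j < NBUF; j++)
            if (buf[i].valid && buf[j].valid && buf[i].row == buf[j].row) {
                *a = buf[i]; *b = buf[j];
                buf[i].valid = buf[j].valid = false;
                return true;
            }
    /* Rule 3: the incoming value matches the adder output. */
    if (input->valid && adderOut->valid && input->row == adderOut->row) {
        *a = *input; *b = *adderOut;
        input->valid = adderOut->valid = false;
        return true;
    }
    /* Rule 4: a buffered value matches the incoming value. */
    if (input->valid)
        for (int n = 0; n < NBUF; n++)
            if (buf[n].valid && buf[n].row == input->row) {
                *a = buf[n]; *b = *input;
                buf[n].valid = input->valid = false;
                return true;
            }
    /* Rule 5: add the incoming value to zero to keep the adder busy. */
    if (input->valid) {
        *a = *input; *b = (Value){0.0, input->row, true};
        input->valid = false;
        return true;
    }
    /* Rule 5, special case: recirculate the adder output added to zero. */
    if (adderOut->valid) {
        *a = *adderOut; *b = (Value){0.0, adderOut->row, true};
        adderOut->valid = false;
        return true;
    }
    return false;   /* nothing to issue; insert a pipeline bubble */
}
```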

Outline
Convey HC-1
– Overview of system
– Shared coherent memory
– High-performance coprocessor memory
Personality design for sparse matrix-vector multiply
– Indirect addressing of vector data
– Streaming double precision reduction architecture
Results and comparison with NVIDIA CUSPARSE on Tesla

SpMV on GPU
– GPUs are widely used for accelerating scientific applications
– GPUs generally have more memory bandwidth than FPGAs, so they do better on computations with low arithmetic intensity
– Target: NVIDIA Tesla S1070
  – Contains four Tesla T10 GPUs
  – Each GPU has 50% more memory bandwidth than all 4 AEs of the Convey HC-1 combined
– Implementation uses NVIDIA's CUDA CUSPARSE library
  – Supports sparse BLAS routines for various sparse matrix representations, including CSR
  – Can run a single SpMV computation on only one GPU

Experimental Results
– Test matrices from Matrix Market and the UFL sparse matrix collection
– Throughput (FLOPS) = 2 × nz / execution time, where nz is the number of non-zero elements

  Matrix     Application                     Rows
  dw8192     Electromagnetics                8192
  t2d_q9     Structural                      9801
  epb1       Thermal                         14734
  raefsky1   Computational fluid dynamics    3242
  psmigr_2   Economics                       3140
  torso2     2D model of a torso             115967
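Stated as a small helper (a sketch): each non-zero contributes one multiply and one add, so

```c
/* Throughput in GFLOPS: 2 floating-point operations (multiply + add) per non-zero. */
static double spmv_gflops(long long nz, double seconds)
{
    return 2.0 * (double)nz / seconds / 1e9;
}
```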

Performance Comparison
– Speedup of the HC-1 SpMV personality over the GPU implementation across the test matrices (bar chart): 2.7x, 3.2x, 1.5x, 1.4x, and 0.4x

Test Matrices
– Sparsity plots of torso2 and epb1

Final Word
– Conclusions
  – Described an SpMV personality tailored for the Convey HC-1, built around a new streaming reduction circuit architecture
  – The FPGA implementation outperforms the GPU
  – Custom architectures have the potential to achieve high performance for kernels with low arithmetic intensity
– Future Work
  – Analyze design tradeoffs between the vector cache and the functional units
  – Improve vector cache performance
  – Multi-GPU implementation

About Us
Heterogeneous and Reconfigurable Computing Lab, The University of South Carolina
Visit us at
Thank You!

Resource Utilization (per AE)

  PEs                      Slices   BRAM              DSP48E
  4 per AE (16 overall)    50%      146 / 288 (50%)   48 / 192 (25%)
  8 per AE (32 overall)    73%      210 / 288 (73%)   96 / 192 (50%)

Set ID Tracking Mechanism
– Three dual-ported memories, each with an associated counter
  – Counter 1 always increments for the set ID of each incoming value
  – Counter 2 always decrements for the set ID of each adder input
  – Counter 3 decrements when the number of active values associated with a set ID reaches one
– The read port outputs the current count for the queried set ID
– A set is completely reduced, and its result can be output, when count1 + count2 + count3 = 1
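As a simplified software analogue (a sketch, not the three-memory hardware scheme above): track the number of in-flight partial values per set and flag the set as fully reduced once a single value remains and no more inputs will arrive.

```c
#include <stdbool.h>

#define MAX_SETS 1024   /* assumed bound on concurrently tracked sets */

typedef struct {
    int  active[MAX_SETS];          /* partial values of this set currently in flight */
    bool done_arriving[MAX_SETS];   /* no more inputs will arrive for this set        */
} SetTracker;

/* A new value of this set enters the reduction circuit. */
static void on_input(SetTracker *t, int set)        { t->active[set]++; }

/* Two values of the same set were paired as adder inputs: 2 in, 1 out. */
static void on_pairwise_add(SetTracker *t, int set) { t->active[set]--; }

/* The matrix stream has moved past this row, so no more products will arrive. */
static void on_last_input(SetTracker *t, int set)   { t->done_arriving[set] = true; }

/* The set is fully reduced when exactly one value remains and none will arrive. */
static bool set_complete(const SetTracker *t, int set)
{
    return t->done_arriving[set] && t->active[set] == 1;
}
```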