
Sparse Matrix-Vector Multiply on the Texas Instruments C6678 Digital Signal Processor
Yang Gao and Dr. Jason D. Bakos
Application-Specific Systems, Architectures, and Processors (ASAP) 2013

TI C6678 vs. Competing Coprocessors

                                    NVIDIA Tesla K20X   Intel Xeon Phi 5110P   TI C66x
Peak single precision performance   3.95 Tflops         2.12 Tflops            128 Gflops
Memory bandwidth                    250 GB/s            320 GB/s               12.8 GB/s
Power                               225 W               --                     10 W
Primary programming model           CUDA/OpenCL         OpenMP                 None (OpenMP/OpenCL in development)

Why the C6678?
Unique architectural features:
- 8 symmetric VLIW cores with SIMD instructions, up to 16 flops/cycle
- No shared last-level cache; 4 MB of on-chip shared RAM
- L1D and L2 can be configured as cache, scratchpad, or both
- DMA engine for loading/flushing scratchpads in parallel with computation
Power efficiency:
- At 45 nm, achieves an ideal 12.8 SP Gflops/Watt
- Intel Phi (22 nm) is 9.4 Gflops/Watt; NVIDIA K20X (28 nm) is 17.6 Gflops/Watt
Fast on-chip interfaces for potential scalability:
- 4 x RapidIO (SRIO) 2.1: 20 Gb/s
- 1 x Ethernet: 1 Gb/s
- 2 x PCIe 2.0: 10 Gb/s
- HyperLink: 50 Gb/s

C66 Platforms

Software Pipelining
- VLIW architectures require explicit use of the functional units
- The C66 compiler uses software pipelining to maximize FU utilization
- A conditional in the loop body prevents software pipelining and lowers utilization
[Figure: a regular loop vs. a software-pipelined loop, showing prolog, kernel, and epilog stages filling ALU1-ALU3 over time]
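To make the effect concrete, here is a minimal C sketch of our own (not from the slides): the first loop has a straight-line body the compiler can software-pipeline; the data-dependent branch in the second breaks the steady-state kernel.

/* Straight-line body: the C66 compiler can overlap loads, multiplies,
 * and stores from different iterations across the VLIW units. */
void scale(float *restrict y, const float *restrict val,
           const float *restrict x, float alpha, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * val[i] * x[i];
}

/* A data-dependent branch in the body prevents the compiler from
 * forming a steady-state software-pipelined kernel. */
void accumulate(float *y, const float *val, const float *x,
                const int *ptr, float alpha, int n)
{
    int row = 0;
    for (int i = 0; i < n; i++) {
        if (i == ptr[row + 1])
            row++;
        y[row] += alpha * val[i] * x[i];
    }
}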

Sparse Matrices
- We evaluate the C66 against competing processors using a sparse matrix-vector multiply (SpMV) kernel
- GPUs achieve only 0.6% to 6% of their peak performance on CSR SpMV
- Sparse matrices can be very large but contain few non-zero elements
- Compressed formats are therefore used, e.g. Compressed Sparse Row (CSR): val stores the non-zero values in row order, col stores their column indices, and ptr stores the offset in val at which each row begins
[Figure: an example matrix with val = (1 -1 -3 -2 5 4 6 -4 2 7 8 -5) and its corresponding col and ptr arrays]
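To make the layout concrete, here is a small CSR example of our own (the slide's original matrix did not survive extraction):

/* CSR encoding of a small 3x4 example matrix (ours, for illustration):
 *   [ 1  0 -1  0 ]
 *   [ 0  5  0  4 ]
 *   [ 0  0  2  7 ]
 * val holds the non-zeros in row order, col their column indices, and
 * ptr[r] the offset in val where row r begins, so row r spans
 * val[ptr[r]] .. val[ptr[r+1]-1] and ptr has rows+1 entries. */
static const float val[] = { 1.0f, -1.0f, 5.0f, 4.0f, 2.0f, 7.0f };
static const int   col[] = { 0, 2, 1, 3, 2, 3 };
static const int   ptr[] = { 0, 2, 4, 6 };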

Sparse Matrix-Vector Multiply
Code for y = αAx + βy:

row = 0
for i = 0 to number_of_nonzero_elements do
    if i == ptr[row+1] then row = row + 1; y[row] *= β    (conditional execution)
    y[row] = y[row] + α * A[i] * x[col[i]]                (reduction; indirect indexing of x)
end

- Low arithmetic intensity (~3 flops / 24 bytes)
- Memory-bound kernel
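In plain C, the slide's single-loop formulation looks roughly like this (a sketch; it assumes the trailing rows contain non-zeros so every y entry gets scaled):

/* y = alpha*A*x + beta*y over a CSR matrix, following the slide's
 * single-pass formulation: one loop over the nnz non-zeros with a
 * running row counter advanced at each row boundary. */
void spmv_csr(float *y, const float *val, const int *col,
              const int *ptr, const float *x,
              float alpha, float beta, int nnz)
{
    int row = 0;
    y[0] *= beta;                     /* scale the first row up front */
    for (int i = 0; i < nnz; i++) {
        while (i == ptr[row + 1]) {   /* row boundary (while skips empty rows) */
            row++;
            y[row] *= beta;           /* scale y once per row */
        }
        y[row] += alpha * val[i] * x[col[i]];
    }
}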

Naïve Implementation
[Diagram: per-core loop; the val, col, ptr, x, and y arrays are accessed through on-chip buffers, with y written back to memory]

for each non-zero element i assigned to the current core:
    if ptr[row] == i then
        row = row + 1
        y[row] = y[row] * β
    end if
    Acc = α * val[i] * x[col[i]]
    y[row] = y[row] + Acc

0.55 Gflops; 60.4% of cycles were uncovered memory latency

Double Buffer and DMA
[Diagram: val and col blocks streamed from SDRAM into double buffers in L2 by the DMA engine, while the DSP's product loop computes α * val * x with x held in an on-chip buffer]

0.78 Gflops; 28.8% of cycles were uncovered memory latency
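A ping-pong sketch of the idea; dma_load_async() and dma_wait() are hypothetical stand-ins for the EDMA calls, which the slides do not detail:

#include <stddef.h>

/* Hypothetical DMA wrappers (assumed for illustration, not TI's API). */
void dma_load_async(void *dst, const void *src, size_t nbytes);
void dma_wait(void);

#define BLOCK 1024

static float val_buf[2][BLOCK];   /* double buffers in on-chip RAM */
static int   col_buf[2][BLOCK];

/* Product loop with double buffering: while the DSP computes on block
 * p, the DMA engine fills block 1-p with the next val/col chunk from
 * SDRAM. Assumes nnz is a multiple of BLOCK to keep the sketch short. */
void product_loop(float *prod, const float *x, float alpha,
                  const float *val_sdram, const int *col_sdram, int nnz)
{
    int p = 0;
    dma_load_async(val_buf[p], val_sdram, BLOCK * sizeof(float));
    dma_load_async(col_buf[p], col_sdram, BLOCK * sizeof(int));
    for (int base = 0; base < nnz; base += BLOCK) {
        dma_wait();                          /* block p has arrived */
        if (base + BLOCK < nnz) {            /* prefetch the next block */
            dma_load_async(val_buf[1 - p], val_sdram + base + BLOCK,
                           BLOCK * sizeof(float));
            dma_load_async(col_buf[1 - p], col_sdram + base + BLOCK,
                           BLOCK * sizeof(int));
        }
        for (int i = 0; i < BLOCK; i++)      /* compute on block p */
            prod[base + i] = alpha * val_buf[p][i] * x[col_buf[p][i]];
        p = 1 - p;                           /* swap ping and pong */
    }
}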

Loop Unroll and Predicated Instructions
- The accumulate loop is manually unrolled by 8
- Predicated instructions replace the if-statements in the assembly

Acc = Acc + prod[i]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
Acc = Acc + prod[i+1]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
...
Acc = Acc + prod[i+K]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β + Acc end if

1.63 Gflops; 50.1% of cycles were uncovered memory latency
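In C, the branchy row check can be written so the compiler emits predicated instructions instead of branches; the following is only our approximation (the authors unroll by 8 and predicate at the assembly level; we show 4 in C and assume nnz % 4 == 0 with no empty rows):

/* Unrolled, predicated accumulate loop (sketch). Expressing the
 * boundary test as a 0/1 flag lets the C66 compiler turn the row
 * update into predicated instructions rather than a taken branch. */
void accumulate_unrolled(float *y, const float *prod, const int *ptr,
                         float beta, int nnz)
{
    int row = 0;
    float acc = 0.0f;
    for (int i = 0; i < nnz; i += 4) {
        for (int k = 0; k < 4; k++) {           /* fully unrolled by the compiler */
            int hit = (i + k == ptr[row + 1]);  /* row boundary reached? */
            if (hit) {                          /* predicated, not branched */
                y[row] = y[row] * beta + acc;
                acc = 0.0f;
                row++;
            }
            acc += prod[i + k];
        }
    }
    y[row] = y[row] * beta + acc;               /* close the final row */
}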

Loop Fission
[Diagram: the product loop reads the val, col, and x buffers and writes the prod buffer; the accumulate loop reads the prod and ptr buffers and writes y back]

Product loop:    prod[i] = α * val[i] * x[col[i]]
Accumulate loop: Acc = Acc + prod[i]
                 if ptr[row] == i then row = row + 1; y[row] = y[row] * β + Acc end if

2.08 Gflops; 36.6% of cycles were uncovered memory latency
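Sketched in C, fission yields a branch-free product loop that software-pipelines well, plus a separate accumulate loop that carries the row bookkeeping (our sketch; assumes no empty rows):

/* Loop fission: phase 1 is a pure streaming loop with no control
 * flow; phase 2 folds the products into y, advancing the row pointer
 * at each boundary. */
void spmv_fissioned(float *y, const float *val, const int *col,
                    const int *ptr, const float *x, float *prod,
                    float alpha, float beta, int nnz)
{
    /* Product loop: prod[i] = alpha * val[i] * x[col[i]] */
    for (int i = 0; i < nnz; i++)
        prod[i] = alpha * val[i] * x[col[i]];

    /* Accumulate loop: reduce prod into y row by row */
    int row = 0;
    float acc = 0.0f;
    for (int i = 0; i < nnz; i++) {
        if (i == ptr[row + 1]) {        /* row boundary */
            y[row] = y[row] * beta + acc;
            acc = 0.0f;
            row++;
        }
        acc += prod[i];
    }
    y[row] = y[row] * beta + acc;       /* final row */
}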

Adaptive Row Pointer
[Diagram: accumulate loop with ptr, y, and prod buffers]

if ptr[row+1] - ptr[row] > K then      (long row: skip the boundary checks)
    Acc = Acc + prod[i]
    Acc = Acc + prod[i+1]
    ...
    Acc = Acc + prod[i+K]
else
    Acc = Acc + prod[i]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
    Acc = Acc + prod[i+1]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
    ...
    Acc = Acc + prod[i+K]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
end if
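A C sketch of the adaptive variant (ours, shown with K = 4 and assuming no empty rows): when the current row still has at least K unconsumed products, take the check-free fast path; otherwise fall back to the per-element check.

/* Adaptive row pointer: long rows take an unchecked, unrolled fast
 * path; short rows use the predicated per-element check. */
enum { K = 4 };

void accumulate_adaptive(float *y, const float *prod, const int *ptr,
                         float beta, int nnz)
{
    int row = 0;
    float acc = 0.0f;
    int i = 0;
    while (i < nnz) {
        if (ptr[row + 1] - i >= K) {          /* >= K products left in row */
            acc += prod[i] + prod[i + 1]      /* no boundary checks */
                 + prod[i + 2] + prod[i + 3];
            i += K;
        } else {
            if (i == ptr[row + 1]) {          /* boundary: close the row */
                y[row] = y[row] * beta + acc;
                acc = 0.0f;
                row++;
            }
            acc += prod[i++];
        }
    }
    y[row] = y[row] * beta + acc;             /* close the final row */
}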

Test Environment

                                        i5 650 (MKL)   GTX680 (cuSPARSE)   GTX650 Ti   C66
Architecture                            Clarkdale      Kepler              Kepler      Shannon
Process (nm)                            32             28                  28          45
Memory throughput (GB/s)                21             192.3               86.4        12.8
TDP (W)                                 73             195                 110         10
Single precision performance (Gflops)   26             3090                1425        128

[Diagram: power measurement setup; a WT500 power analyzer monitors both the 110 V socket feeding the PSU and the 12 V supply to the EVM board]

Test Matrices
- Synthetic N-diagonal matrices, N ranging from 3 (tri-diagonal) to 501
- Real-world matrices from the University of Florida sparse matrix collection and Matrix Market

Example banded matrix:
2  4  0  0  0  0
5  4  7  0  0  0
0  6  2  4  0  0
0  0  3 10  1  0
0  0  0  4  6  8
0  0  0  0  2 12

SpMV Performance, N-diagonal Matrices
[Plot] Generally, the C66 achieves ~2/3 of the CPU's performance

SpMV Gflops/Watt, N-diagonal Matrices
[Plot] The C66 is equivalent to the GPUs when N > 51

Gflops/Watt for Non-synthetic Matrices
[Plot] C66 power efficiency also scales with density for real-world matrices

Memory Efficiency

AI = (9 * rows * n + 8 * rows + 2) / (12 * (2 * rows * n + n + 2 * rows + 1)) ops/byte

For N = 151 and rows = 208326, the memory-bound performance estimate is AI * 12.8 GB/s.
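Evaluating the bound with those parameters (our arithmetic, shown as a check):

AI = (9 * 208326 * 151 + 8 * 208326 + 2) / (12 * (2 * 208326 * 151 + 151 + 2 * 208326 + 1))
   = 284781644 / 759975072
   ≈ 0.375 ops/byte

Estimated bound ≈ 0.375 ops/byte * 12.8 GB/s ≈ 4.8 Gflops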

Memory Efficiency

Next Generation: Keystone II
- 28 nm process
- Doubles the caches
- Increases memory bandwidth by 125%

            C66 (C6678)   66AK2H12 "Keystone II"
CPU         n/a           4 x ARM A15
DSP cores   8             8
DSP L2      512 KB        1024 KB
DDR3        64-bit        2 x 72-bit
Process     45 nm         28 nm
Power       10 W          ?

Conclusions
The TI DSP is a promising coprocessor technology for HPC.

Advantages:
- Unique architectural features that facilitate automated parallelization (easier to program?)
- Inherently power-efficient microarchitecture: equivalent to modern GPUs and the Phi despite an older process technology
- Advanced memory system for memory-bound kernels: simultaneous DMA and caching to match the access pattern of individual arrays
- Advanced on-chip interfaces for efficient scalability; large-scale multi-DSP platforms already exist

Looking forward, Keystone II will:
- Improve efficiency and memory performance (cache + bandwidth)
- Add on-board host CPUs to facilitate runtimes for multi-DSP scaling

Q & A

Arithmetic Intensity (appendix)
[Diagram: accumulate loop with per-buffer costs: prod buffer 4 bytes (2 ops for α * val * x), ptr buffer 4 bytes, y buffer 4/rows bytes (1 op), plus the y write-back; val, col, and x buffers feed the product]

With n = average number of non-zero elements per row:

AI = ( 3 ops / (8 + 4/rows) bytes + (1/n) * (2 ops / 12 bytes) ) / (1 + 1/n)