
Sparse Matrix-Vector Multiply on the Texas Instruments C6678 Digital Signal Processor
Yang Gao and Dr. Jason D. Bakos
Application-Specific Systems, Architectures, and Processors (ASAP) 2013

TI C6678 vs. Competing Coprocessors

                                    NVIDIA Tesla K20X   Intel Xeon Phi 5110P   TI C66x
Peak single precision performance   3.95 Tflops         2.12 Tflops            128 Gflops
Memory bandwidth                    250 GB/s            320 GB/s               12.8 GB/s
Power                               225 W               --                     10 W
Primary programming model           CUDA/OpenCL         OpenMP                 None (OpenMP/OpenCL in development)

Why the C6678?
Unique architectural features:
- 8 symmetric VLIW cores with SIMD instructions, up to 16 flops/cycle
- No shared last-level cache; 4 MB of on-chip shared RAM
- L1D and L2 can be configured as cache, scratchpad, or both
- DMA engine for loading/flushing scratchpads in parallel with computation
Power efficiency:
- At 45 nm, achieves an ideal 12.8 SP Gflops/Watt
- Intel Phi (22 nm) is 9.4 Gflops/Watt; NVIDIA K20X (28 nm) is 17.6 Gflops/Watt
Fast on-chip interfaces for potential scalability:
- 4 x RapidIO (SRIO) 2.1: 20 Gb/s
- 1 x Ethernet: 1 Gb/s
- 2 x PCIe 2.0: 10 Gb/s
- HyperLink: 50 Gb/s

C66 Platforms

Software Pipelining
- VLIW architectures require explicit use of the functional units
- The C66 compiler uses software pipelining to maximize FU utilization
- A conditional in the loop body prevents software pipelining and lowers utilization
[Figure: a regular loop vs. a software-pipelined loop, showing prolog, kernel, and epilog stages filling ALU1-ALU3 over time]
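To make the effect concrete, here is a minimal C sketch of our own (not from the slides): the first loop has a straight-line body the compiler can software-pipeline; the data-dependent branch in the second breaks the steady-state kernel.

/* Straight-line body: the C66 compiler can overlap loads, multiplies,
 * and stores from different iterations across the VLIW units. */
void scale(float *restrict y, const float *restrict val,
           const float *restrict x, float alpha, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * val[i] * x[i];
}

/* A data-dependent branch in the body prevents the compiler from
 * forming a steady-state software-pipelined kernel. */
void accumulate(float *y, const float *val, const float *x,
                const int *ptr, float alpha, int n)
{
    int row = 0;
    for (int i = 0; i < n; i++) {
        if (i == ptr[row + 1])
            row++;
        y[row] += alpha * val[i] * x[i];
    }
}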

Sparse Matrices
- We evaluate the C66 against competing processors using a sparse matrix-vector multiply (SpMV) kernel
- GPUs achieve only 0.6% to 6% of their peak performance on CSR SpMV
- Sparse matrices can be very large but contain few non-zero elements
- Compressed formats are therefore used, e.g. Compressed Sparse Row (CSR): val stores the non-zero values in row order, col stores their column indices, and ptr stores the offset in val at which each row begins
[Figure: an example matrix with val = (1 -1 -3 -2 5 4 6 -4 2 7 8 -5) and its corresponding col and ptr arrays]
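To make the layout concrete, here is a small CSR example of our own (the slide's original matrix did not survive extraction):

/* CSR encoding of a small 3x4 example matrix (ours, for illustration):
 *   [ 1  0 -1  0 ]
 *   [ 0  5  0  4 ]
 *   [ 0  0  2  7 ]
 * val holds the non-zeros in row order, col their column indices, and
 * ptr[r] the offset in val where row r begins, so row r spans
 * val[ptr[r]] .. val[ptr[r+1]-1] and ptr has rows+1 entries. */
static const float val[] = { 1.0f, -1.0f, 5.0f, 4.0f, 2.0f, 7.0f };
static const int   col[] = { 0, 2, 1, 3, 2, 3 };
static const int   ptr[] = { 0, 2, 4, 6 };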

Sparse Matrix-Vector Multiply
Code for y = αAx + βy:

row = 0
for i = 0 to number_of_nonzero_elements do
    if i == ptr[row+1] then row = row + 1; y[row] *= β    (conditional execution)
    y[row] = y[row] + α * A[i] * x[col[i]]                (reduction; indirect indexing of x)
end

- Low arithmetic intensity (~3 flops / 24 bytes)
- Memory-bound kernel
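In plain C, the slide's single-loop formulation looks roughly like this (a sketch; it assumes the trailing rows contain non-zeros so every y entry gets scaled):

/* y = alpha*A*x + beta*y over a CSR matrix, following the slide's
 * single-pass formulation: one loop over the nnz non-zeros with a
 * running row counter advanced at each row boundary. */
void spmv_csr(float *y, const float *val, const int *col,
              const int *ptr, const float *x,
              float alpha, float beta, int nnz)
{
    int row = 0;
    y[0] *= beta;                     /* scale the first row up front */
    for (int i = 0; i < nnz; i++) {
        while (i == ptr[row + 1]) {   /* row boundary (while skips empty rows) */
            row++;
            y[row] *= beta;           /* scale y once per row */
        }
        y[row] += alpha * val[i] * x[col[i]];
    }
}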

Naïve Implementation
[Diagram: per-core loop; the val, col, ptr, x, and y arrays are accessed through on-chip buffers, with y written back to memory]

for each non-zero element i assigned to the current core:
    if ptr[row] == i then
        row = row + 1
        y[row] = y[row] * β
    end if
    Acc = α * val[i] * x[col[i]]
    y[row] = y[row] + Acc

0.55 Gflops; 60.4% of cycles were uncovered memory latency

Double Buffer and DMA
[Diagram: val and col blocks streamed from SDRAM into double buffers in L2 by the DMA engine, while the DSP's product loop computes α * val * x with x held in an on-chip buffer]

0.78 Gflops; 28.8% of cycles were uncovered memory latency
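A ping-pong sketch of the idea; dma_load_async() and dma_wait() are hypothetical stand-ins for the EDMA calls, which the slides do not detail:

#include <stddef.h>

/* Hypothetical DMA wrappers (assumed for illustration, not TI's API). */
void dma_load_async(void *dst, const void *src, size_t nbytes);
void dma_wait(void);

#define BLOCK 1024

static float val_buf[2][BLOCK];   /* double buffers in on-chip RAM */
static int   col_buf[2][BLOCK];

/* Product loop with double buffering: while the DSP computes on block
 * p, the DMA engine fills block 1-p with the next val/col chunk from
 * SDRAM. Assumes nnz is a multiple of BLOCK to keep the sketch short. */
void product_loop(float *prod, const float *x, float alpha,
                  const float *val_sdram, const int *col_sdram, int nnz)
{
    int p = 0;
    dma_load_async(val_buf[p], val_sdram, BLOCK * sizeof(float));
    dma_load_async(col_buf[p], col_sdram, BLOCK * sizeof(int));
    for (int base = 0; base < nnz; base += BLOCK) {
        dma_wait();                          /* block p has arrived */
        if (base + BLOCK < nnz) {            /* prefetch the next block */
            dma_load_async(val_buf[1 - p], val_sdram + base + BLOCK,
                           BLOCK * sizeof(float));
            dma_load_async(col_buf[1 - p], col_sdram + base + BLOCK,
                           BLOCK * sizeof(int));
        }
        for (int i = 0; i < BLOCK; i++)      /* compute on block p */
            prod[base + i] = alpha * val_buf[p][i] * x[col_buf[p][i]];
        p = 1 - p;                           /* swap ping and pong */
    }
}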

Loop Unroll and Predicated Instructions
- The accumulate loop is manually unrolled by 8
- Predicated instructions replace the if-statements in the assembly

Acc = Acc + prod[i]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
Acc = Acc + prod[i+1]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
...
Acc = Acc + prod[i+K]
if ptr[row] == i then row = row + 1; y[row] = y[row] * β + Acc end if

1.63 Gflops; 50.1% of cycles were uncovered memory latency
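In C, the branchy row check can be written so the compiler emits predicated instructions instead of branches; the following is only our approximation (the authors unroll by 8 and predicate at the assembly level; we show 4 in C and assume nnz % 4 == 0 with no empty rows):

/* Unrolled, predicated accumulate loop (sketch). Expressing the
 * boundary test as a 0/1 flag lets the C66 compiler turn the row
 * update into predicated instructions rather than a taken branch. */
void accumulate_unrolled(float *y, const float *prod, const int *ptr,
                         float beta, int nnz)
{
    int row = 0;
    float acc = 0.0f;
    for (int i = 0; i < nnz; i += 4) {
        for (int k = 0; k < 4; k++) {           /* fully unrolled by the compiler */
            int hit = (i + k == ptr[row + 1]);  /* row boundary reached? */
            if (hit) {                          /* predicated, not branched */
                y[row] = y[row] * beta + acc;
                acc = 0.0f;
                row++;
            }
            acc += prod[i + k];
        }
    }
    y[row] = y[row] * beta + acc;               /* close the final row */
}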

Loop Fission
[Diagram: the product loop reads the val, col, and x buffers and writes the prod buffer; the accumulate loop reads the prod and ptr buffers and writes y back]

Product loop:    prod[i] = α * val[i] * x[col[i]]
Accumulate loop: Acc = Acc + prod[i]
                 if ptr[row] == i then row = row + 1; y[row] = y[row] * β + Acc end if

2.08 Gflops; 36.6% of cycles were uncovered memory latency
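Sketched in C, fission yields a branch-free product loop that software-pipelines well, plus a separate accumulate loop that carries the row bookkeeping (our sketch; assumes no empty rows):

/* Loop fission: phase 1 is a pure streaming loop with no control
 * flow; phase 2 folds the products into y, advancing the row pointer
 * at each boundary. */
void spmv_fissioned(float *y, const float *val, const int *col,
                    const int *ptr, const float *x, float *prod,
                    float alpha, float beta, int nnz)
{
    /* Product loop: prod[i] = alpha * val[i] * x[col[i]] */
    for (int i = 0; i < nnz; i++)
        prod[i] = alpha * val[i] * x[col[i]];

    /* Accumulate loop: reduce prod into y row by row */
    int row = 0;
    float acc = 0.0f;
    for (int i = 0; i < nnz; i++) {
        if (i == ptr[row + 1]) {        /* row boundary */
            y[row] = y[row] * beta + acc;
            acc = 0.0f;
            row++;
        }
        acc += prod[i];
    }
    y[row] = y[row] * beta + acc;       /* final row */
}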

Adaptive Row Pointer
[Diagram: accumulate loop with ptr, y, and prod buffers]

if ptr[row+1] - ptr[row] > K then      (long row: skip the boundary checks)
    Acc = Acc + prod[i]
    Acc = Acc + prod[i+1]
    ...
    Acc = Acc + prod[i+K]
else
    Acc = Acc + prod[i]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
    Acc = Acc + prod[i+1]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
    ...
    Acc = Acc + prod[i+K]
    if ptr[row] == i then row = row + 1; y[row] = y[row] * β end if
end if
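A C sketch of the adaptive variant (ours, shown with K = 4 and assuming no empty rows): when the current row still has at least K unconsumed products, take the check-free fast path; otherwise fall back to the per-element check.

/* Adaptive row pointer: long rows take an unchecked, unrolled fast
 * path; short rows use the predicated per-element check. */
enum { K = 4 };

void accumulate_adaptive(float *y, const float *prod, const int *ptr,
                         float beta, int nnz)
{
    int row = 0;
    float acc = 0.0f;
    int i = 0;
    while (i < nnz) {
        if (ptr[row + 1] - i >= K) {          /* >= K products left in row */
            acc += prod[i] + prod[i + 1]      /* no boundary checks */
                 + prod[i + 2] + prod[i + 3];
            i += K;
        } else {
            if (i == ptr[row + 1]) {          /* boundary: close the row */
                y[row] = y[row] * beta + acc;
                acc = 0.0f;
                row++;
            }
            acc += prod[i++];
        }
    }
    y[row] = y[row] * beta + acc;             /* close the final row */
}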

Test Environment

                                        i5 650 (MKL)   GTX680 (cuSPARSE)   GTX650 Ti   C66
Architecture                            Clarkdale      Kepler              Kepler      Shannon
Process (nm)                            32             28                  28          45
Memory throughput (GB/s)                21             192.3               86.4        12.8
TDP (W)                                 73             195                 110         10
Single precision performance (Gflops)   26             3090                1425        128

[Diagram: power measurement setup; a WT500 power analyzer monitors both the 110 V socket feeding the PSU and the 12 V supply to the EVM board]

Test Matrices
- Synthetic N-diagonal matrices, N ranging from 3 (tri-diagonal) to 501
- Real-world matrices from the University of Florida sparse matrix collection and Matrix Market

Example banded matrix:
2  4  0  0  0  0
5  4  7  0  0  0
0  6  2  4  0  0
0  0  3 10  1  0
0  0  0  4  6  8
0  0  0  0  2 12

SpMV Performance, N-diagonal Matrices
[Plot] Generally, the C66 achieves ~2/3 of the CPU's performance

SpMV Gflops/Watt, N-diagonal Matrices
[Plot] The C66 is equivalent to the GPUs when N > 51

Gflops/Watt for Non-synthetic Matrices
[Plot] C66 power efficiency also scales with density for real-world matrices

Memory Efficiency

AI = (9 * rows * n + 8 * rows + 2) / (12 * (2 * rows * n + n + 2 * rows + 1)) ops/byte

For N = 151 and rows = 208326, the memory-bound performance estimate is AI * 12.8 GB/s.
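Evaluating the bound with those parameters (our arithmetic, shown as a check):

AI = (9 * 208326 * 151 + 8 * 208326 + 2) / (12 * (2 * 208326 * 151 + 151 + 2 * 208326 + 1))
   = 284781644 / 759975072
   ≈ 0.375 ops/byte

Estimated bound ≈ 0.375 ops/byte * 12.8 GB/s ≈ 4.8 Gflops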

Memory Efficiency

Next Generation: Keystone II
- 28 nm process
- Doubles the caches
- Increases memory bandwidth by 125%

            C66 (C6678)   66AK2H12 "Keystone II"
CPU         n/a           4 x ARM A15
DSP cores   8             8
DSP L2      512 KB        1024 KB
DDR3        64-bit        2 x 72-bit
Process     45 nm         28 nm
Power       10 W          ?

Conclusions
The TI DSP is a promising coprocessor technology for HPC.

Advantages:
- Unique architectural features that facilitate automated parallelization (easier to program?)
- Inherently power-efficient microarchitecture: equivalent to modern GPUs and the Phi despite an older process technology
- Advanced memory system for memory-bound kernels: simultaneous DMA and caching to match the access pattern of individual arrays
- Advanced on-chip interfaces for efficient scalability; large-scale multi-DSP platforms already exist

Looking forward, Keystone II will:
- Improve efficiency and memory performance (cache + bandwidth)
- Add on-board host CPUs to facilitate runtimes for multi-DSP scaling

Q & A

Arithmetic Intensity (appendix)
[Diagram: accumulate loop with per-buffer costs: prod buffer 4 bytes (2 ops for α * val * x), ptr buffer 4 bytes, y buffer 4/rows bytes (1 op), plus the y write-back; val, col, and x buffers feed the product]

With n = average number of non-zero elements per row:

AI = ( 3 ops / (8 + 4/rows) bytes + (1/n) * (2 ops / 12 bytes) ) / (1 + 1/n)