Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor. Yang Gao, Fan Zhang and Dr. Jason D. Bakos. 2014 IEEE High Performance Extreme Computing Conference (HPEC '14).

Presentation transcript:

Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor. Yang Gao, Fan Zhang and Dr. Jason D. Bakos. 2014 IEEE High Performance Extreme Computing Conference (HPEC '14)

Heterogeneous Computing on Embedded Platforms
GPU/FPGA/DSP (KeyStone) and ARM + embedded GPU/FPGA/DSP (KeyStone II):
- Embedded GPUs: PowerVR SGX 544MP3 (Exynos 5 Octa), PowerVR G6430 (Apple A7), GK20A Kepler (Tegra K1 / Jetson)
- FPGA: Zynq-7000 (7Z045)
- DSP: KeyStone II (TCI6638)

Key Features of KeyStone II
- In-order, statically scheduled VLIW processor
- 8 symmetric VLIW cores with SIMD instructions, up to 16 flops/cycle per core
- No shared last-level cache
- 6 MB on-chip shared RAM (MSMC)
- L1D and L2 can each be configured as cache, scratchpad, or a mix of both
- DMA engine for loading/flushing scratchpads in parallel with computation
- Designed for high performance at low power

VLIW and Software Pipelining

    for i = 0 to n
        $1 = load A[i]      ; load:  3 cycles
        $2 = add $2, $1     ; add:   3 cycles
        A[i] = store $2     ; store: 1 cycle
    endfor

Executed naively, each iteration of this loop takes 7 cycles. With software pipelining, a static scheduler overlaps the load (L), add (A), and store (S) units of successive iterations, so an iteration completes every 3 cycles. An out-of-order processor achieves the same overlap with a dynamic scheduler; the VLIW relies entirely on the compiler.

Software pipelining restrictions:
- Not friendly to conditional code inside the loop body
- Limited by the number of available registers
- Relies on the compiler to perform this low-level optimization
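As a concrete illustration, here is a minimal sketch of a loop written so TI's C6000 compiler can software-pipeline it; the restrict qualifier and the MUST_ITERATE pragma are standard hints to that compiler, but treat the exact pragma arguments here as assumptions to check against the toolchain.

    /* Sketch: a reduction loop the compiler can software-pipeline.
     * restrict promises no aliasing; MUST_ITERATE gives a trip-count
     * lower bound and multiple, enabling a tighter static schedule. */
    float sum_array(const float *restrict a, int n)
    {
        float sum = 0.0f;
        int i;
        #pragma MUST_ITERATE(8, , 8)   /* assumes n >= 8 and n % 8 == 0 */
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }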

DSP Memory Hierarchy
Each core has L1P and L1D (32 KB each) plus 1024 KB of L2; the 6144 KB MSMC SRAM is shared by all cores. L1 and L2 can each be split between SPRAM and cache:

    Level   Size (KB)   Configurable cache capacity (KB)
    L1P     32          0/4/8/16/32
    L1D     32          0/4/8/16/32
    L2      1024        0/32/64/128/256/512/1024
    MSMC    6144        0 (SPRAM only)
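To make the split concrete, the sketch below shows how a program might leave half of L1D as cache and claim the rest, plus all of L2, as scratchpad at startup. The function and enum names follow the pattern of TI's Chip Support Library (csl_cacheAux.h) but are assumptions here; verify them against the actual CSL headers.

    #include <ti/csl/csl_cacheAux.h>   /* assumed CSL header */

    void configure_on_chip_memory(void)
    {
        /* Assumed CSL-style calls: keep 16 KB of L1D as cache,
         * freeing the other 16 KB as scratchpad (SPRAM). */
        CACHE_setL1DSize(CACHE_L1_16KCACHE);
        /* Use all 1024 KB of L2 as scratchpad: no L2 cache. */
        CACHE_setL2Size(CACHE_0KCACHE);
    }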

SPM Method and Double Buffering
The EDMA engine and the DSP CorePac work on two scratchpad buffers in parallel: while the core computes on buffer 0, the EDMA loads buffer 1, and the two then swap roles. After the initial load, data movement is hidden behind computation:

    EDMA:         load 0 | load 1    | load 0    | ...
    DSP CorePac:         | compute 0 | compute 1 | compute 0 | ...
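A minimal sketch of this double-buffering pattern in C; edma_start_copy() and edma_wait() are hypothetical wrappers standing in for the real EDMA driver API.

    #include <stddef.h>

    #define CHUNK 4096
    extern void edma_start_copy(void *dst, const void *src, size_t bytes);
    extern void edma_wait(const void *dst);
    extern void compute(const float *data, int n);

    static float buf[2][CHUNK];   /* two scratchpad-resident buffers */

    void process_stream(const float *ddr_src, int nchunks)
    {
        int cur = 0;
        edma_start_copy(buf[cur], ddr_src, sizeof buf[0]);
        for (int c = 0; c < nchunks; c++) {
            edma_wait(buf[cur]);                     /* chunk c has arrived  */
            if (c + 1 < nchunks)                     /* prefetch chunk c + 1 */
                edma_start_copy(buf[cur ^ 1],
                                ddr_src + (size_t)(c + 1) * CHUNK,
                                sizeof buf[0]);
            compute(buf[cur], CHUNK);                /* overlaps the copy    */
            cur ^= 1;                                /* swap buffers         */
        }
    }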

Sparse Matrices
We evaluated the KeyStone II using an SpMV kernel. Sparse matrices can be very large but contain few non-zero elements, so compressed formats are used, e.g. Compressed Sparse Row (CSR): val holds the non-zero values in row order, col holds the column index of each value, and ptr holds the offset in val/col at which each row starts.
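For instance, the 6x6 tri-diagonal example matrix shown later in the deck encodes in CSR as follows (zero-based indexing assumed):

    /* CSR encoding of the 6x6 tri-diagonal matrix from the Matrices slide. */
    float val[16] = {2,4,  5,4,7,  6,2,4,  3,10,1,  4,6,8,  2,12};
    int   col[16] = {0,1,  0,1,2,  1,2,3,  2,3,4,   3,4,5,  4,5};
    int   ptr[7]  = {0, 2, 5, 8, 11, 14, 16};  /* row i spans [ptr[i], ptr[i+1]) */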

Sparse Matrix-Vector Multiply Code for y = Ax + y row = 0 for i = 0 to number_of_nonzero_elements do if i == ptr[row+1] then row=row+1, y[row]*=beta; y[row] = y[row] + alpha * A[i] * x[col[i]] end reduction indirect indexing Low arithmetic intensity (~3 flops / 24 bytes ) Memory bound kernel 8 conditional execution

SpMV Implementation on the DSP
Each core is assigned a contiguous range of the non-zero elements, and the kernel is split in two by loop fission:

Product loop: for each assigned element i, compute prod[i] = alpha * val[i] * x[col[i]] into a scratchpad product buffer. The val and col arrays are streamed from DDR into scratchpad buffers by DMA with double buffering; x is read through the cache.

Accumulation loop: walk the product buffer, accumulating acc = acc + prod[i]; whenever i reaches ptr[row+1], finish the row with y[row] = y[row]*beta + acc and advance to the next row. The ptr and y arrays are streamed through DMA-fed circular buffers.

Further optimizations: the accumulation loop is manually unrolled (acc += prod[i], prod[i+1], ..., prod[i+k]); an adaptive row-pointer test (is ptr[row+1] - ptr[row] > K?) selects between the unrolled and rolled paths; the row-boundary conditionals are compiled to predicated instructions; and the inner loops are hand-tuned in assembly.
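A simplified C sketch of the loop fission (DMA buffering, unrolling, and predication omitted; prod is assumed to live in scratchpad, and the caller closes the final row after the last chunk):

    /* Fissioned SpMV over the element range [lo, hi) assigned to one core.
     * row_io/acc_io carry the current row and partial sum across chunks.
     * Assumes no empty rows, matching the slide's pseudocode. */
    void spmv_fissioned(int lo, int hi, const float *val, const int *col,
                        const int *ptr, const float *x, float *y,
                        float alpha, float beta, float *prod,
                        int *row_io, float *acc_io)
    {
        /* Product loop: pure streaming, software-pipelines well. */
        for (int i = lo; i < hi; i++)
            prod[i - lo] = alpha * val[i] * x[col[i]];

        /* Accumulation loop: per-row reduction with conditional row advance. */
        int row = *row_io;
        float acc = *acc_io;
        for (int i = lo; i < hi; i++) {
            if (i == ptr[row + 1]) {          /* crossed a row boundary */
                y[row] = y[row] * beta + acc; /* close the finished row */
                row++;
                acc = 0.0f;
            }
            acc += prod[i - lo];
        }
        *row_io = row;
        *acc_io = acc;
    }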

Buffer Location and Performance
Design question: which on-chip memory should each of the val, col, ptr, y, and prod buffers map to? The candidate locations are L1 SPRAM, L2 SPRAM, MSMC, and cache.

Matrices
Synthetic N-diagonal matrices with N = 3 (tri-diagonal) and N = 151. Tri-diagonal example:

    2,  4,  0,  0,  0,  0
    5,  4,  7,  0,  0,  0
    0,  6,  2,  4,  0,  0
    0,  0,  3, 10,  1,  0
    0,  0,  0,  4,  6,  8
    0,  0,  0,  0,  2, 12

Buffer Location and Performance
(Table: for matrices with 3 and 151 non-zeros per row, measured Gflops and normalized performance for the best, median, and worst placements of the val/col/ptr/y/prod buffers, compared against an all-cache configuration.)
Key: L1 = level-1 SPRAM, L2 = level-2 SPRAM, S = MSMC, C = cache. Results are normalized to the all-cache configuration, and "worst" means the worst among the configurations that use SPRAM.

Matrices (continued)
Beyond the synthetic N-diagonal matrices (N = 3, 151), real-world test matrices were drawn from the University of Florida sparse matrix collection and Matrix Market.

Testing Platforms
- Intel i7 3770K (MKL): Ivy Bridge; no SPRAM; TDP 77 W
- NVIDIA GTX 680 (cuSparse): Kepler; 64 KB/core SPRAM [1]; TDP 195 W
- NVIDIA Tegra K1 (cuSparse): Kepler; 64 KB/core SPRAM [1]; TDP ~10 W
- TI 6638K2K (this work): KeyStone II; 32/1024/768 KB/core SPRAM [2]; single-precision peak 172.8 Gflops (DSP) + 44.8 Gflops (ARM); TDP ~15 W
[1] Register file / allocable shared memory or L1
[2] L1 SRAM / L2 SRAM / MSMC per core

Close Comparison to NVIDIA Tegra K1

                      Tegra K1                   TI 6638K2K
    Architecture      Cortex-A15 + Kepler GPU    Cortex-A15 + KeyStone II DSP
    CPU               4 cores, 2.2 GHz           4 cores, 1.4 GHz
                      32K+32K L1, 2M L2          32K+32K L1, 4M L2
    Co-processor      192 CUDA cores             8 DSP cores, 1.35 GHz
                      331 Gflops (SP peak)       172.8 Gflops (SP peak)
    Memory bandwidth  17.1 GB/s (DDR3-2133)      12.8 GB/s (DDR3-1600)
    Power             ~10 W                      ~15 W

Performance Comparison vs. NVIDIA Tegra K1

Performance Comparison

Kernel/Memory Efficiency
efficiency = observed performance / (arithmetic intensity × memory bandwidth)
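As a worked example using numbers from earlier slides: SpMV's arithmetic intensity is roughly 3 flops per 24 bytes, so on the 6638K2K's 12.8 GB/s of memory bandwidth the denominator (a roofline-style performance bound) is

    (3 flops / 24 bytes) × 12.8 GB/s = 0.125 flops/byte × 12.8 GB/s = 1.6 Gflops

An observed 0.8 Gflops would therefore count as 50% efficiency (0.8 is an illustrative figure, not a measurement from the paper).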

Conclusion
- Significant performance gains from loop tuning and scratchpad (SPM) optimizations.
- Despite the mobile GPU's higher theoretical SP performance and memory bandwidth, the DSP outperforms it with our SpMV implementation.
- SPM buffer allocation/mapping could be decided automatically by a model of the data access pattern.

Q & A