Presentation transcript:

OuterSPACE: An Outer Product based Sparse Matrix Multiplication Accelerator
Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti†, Hun-Seok Kim, David Blaauw, Trevor Mudge and Ronald Dreslinski
University of Michigan, Ann Arbor; †Arizona State University
HPCA 2018, 28 February 2018

Overview: Introduction, Algorithm, Architecture, Evaluation, Conclusion

The Big Data Problem
Big data is collected from many sources: sensor feeds, social media, scientific experiments. The challenge is that this data is inherently sparse, while architecture research has previously focused on improving compute. Sparse matrix computation is a key example of a memory-bound workload: GPUs achieve ~100 GFLOPS on dense matrix multiplication but only ~100 MFLOPS on sparse. Many fundamental problems reduce to two dominant kernels (graphs, for example, become adjacency matrices):
Sparse matrix-matrix multiplication (SpGEMM): breadth-first search, algebraic multigrid methods
Sparse matrix-vector multiplication (SpMV): PageRank, support vector machines, ML-based text analytics
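For concreteness, a minimal sketch of the two kernels using SciPy's sparse CSR format; the dimensions, densities, and data layout here are illustrative assumptions, not the accelerator's representation.

```python
# Illustrative only: the two dominant sparse kernels, expressed with SciPy.
import scipy.sparse as sp

N = 1 << 12                                      # 4096 x 4096, arbitrary size
A = sp.random(N, N, density=1e-3, format='csr')
B = sp.random(N, N, density=1e-3, format='csr')
x = sp.random(N, 1, density=1e-2, format='csr')

C = A @ B        # SpGEMM: sparse matrix-sparse matrix multiplication
y = A @ x        # SpMV:   sparse matrix-(sparse) vector multiplication

print(C.nnz, y.nnz)   # outputs stay sparse; useful work scales with non-zeros
```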

Inner Product based Matrix Multiplication
$c_{i,j} = \sum_{k=0}^{N-1} a_{i,k} \times b_{k,j}$

Outer Product based Matrix Multiplication
$\mathbf{C} = \sum_{i=0}^{N-1} \mathbf{C}_i = \sum_{i=0}^{N-1} \mathbf{a}_i \mathbf{b}_i$, where $\mathbf{a}_i$ is the $i$-th column of A and $\mathbf{b}_i$ is the $i$-th row of B
+ No index matching
+ Low reuse distance

Comparison of the Approaches
Inner product: entries need to be index-matched before multiplication. Reuse distance is high; each column of B is reloaded after a row of A is multiplied with all columns of B.
Outer product: all non-zero pairs belonging to a column of A and a row of B produce meaningful outputs. Reuse distance is small; a column of A and a row of B are multiplied once and never used again.
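A minimal sketch of this contrast, assuming a dict-based sparse layout chosen purely for illustration (not the format used by OuterSPACE, MKL, or CUSP):

```python
# Inner vs. outer product on sparse operands (illustrative data layout).
from collections import defaultdict

def inner_product_entry(A_row, B_col):
    """One output entry c[i][j]: the non-zeros of row i of A and column j of B
    must be index-matched before any multiplication happens."""
    return sum(a * B_col[k] for k, a in A_row.items() if k in B_col)

def outer_product(A_col, B_row):
    """Column i of A times row i of B: every non-zero pair contributes to the
    partial-product matrix C_i, with no index matching and no operand reuse."""
    C_i = defaultdict(dict)
    for r, a in A_col.items():          # a = A[r][i]
        for c, b in B_row.items():      # b = B[i][c]
            C_i[r][c] = a * b
    return C_i
```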

Overview: Introduction, Algorithm, Architecture, Evaluation, Conclusion

Outer Product Implementation
Figure: The outer product algorithm implemented in a system with parallel processing units.
Multiply phase: processing units multiply an element of a column of A with a row of B; each is issued as a single "multiply task": A[i][j] x Brow[j].
Merge phase: a processing unit merges all the partial products pertaining to one row of C, using a modified mergesort-based approach to minimize memory traffic; there is no data sharing between processing units during merge. A functional sketch of the two phases follows.
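Under the same illustrative dict-of-columns / dict-of-rows layout as before (the hardware's compressed storage format and task scheduling are not modeled here):

```python
# Two-phase outer-product SpGEMM (functional sketch, not a hardware model).
def multiply_phase(A_cols, B_rows, N):
    """Emit unsorted (column, value) partial products, bucketed by output row."""
    partial = [[] for _ in range(N)]              # one bucket per row of C
    for i in range(N):                            # one outer product per index i
        for r, a in A_cols.get(i, {}).items():    # "multiply task": A[r][i] x row i of B
            for c, b in B_rows.get(i, {}).items():
                partial[r].append((c, a * b))     # streamed out, merged later
    return partial

def merge_phase(partial):
    """Sort each row's partial products by column and combine duplicates."""
    C_rows = []
    for bucket in partial:                        # rows are independent: no data sharing
        bucket.sort(key=lambda p: p[0])           # stand-in for the mergesort-based merge
        row = []
        for c, v in bucket:
            if row and row[-1][0] == c:
                row[-1] = (c, row[-1][1] + v)     # same column index: accumulate
            else:
                row.append((c, v))
        C_rows.append(row)
    return C_rows
```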

Performance on Traditional Hardware: CPU
Figure: Performance of the outer product algorithm against Intel MKL on the CPU (x-axis: matrix dimension); each matrix has 1 million uniformly distributed non-zeros.
Evaluated the outer product algorithm against MKL on a 6-core Broadwell CPU. The algorithm puts high pressure on the memory system: N partial-product matrices of size N×N are streamed out to main memory during the multiply phase and loaded back in during merge, and the merge implementation involves no sharing, leading to cache thrashing. Performance is bottlenecked by the restricted cache hierarchy and compute parallelism.

Performance on Traditional Hardware: GPU
Figure: Performance of the outer product algorithm against CUSP on the GPU (x-axis: matrix dimension); each matrix has 1 million uniformly distributed non-zeros.
Evaluated the outer product algorithm against CUSP on a K40 GPU. The multiply phase streams and processes data much faster than the CPU implementation, scaling roughly linearly with decreasing density, but the merge phase suffers from a much lower total throughput, caused by control divergence between threads within a warp while executing the conditional branches of the merge. Performance is bottlenecked by the SIMD nature of warps; the proposed solution is an SPMD architecture.

Overview: Introduction, Algorithm, Architecture, Evaluation, Conclusion

OuterSPACE: Multiply Phase
Figure: OuterSPACE architecture for the multiply phase.
SPMD-style Processing Elements (PEs), high-speed crossbars, non-coherent caches with request coalescing, and an HBM interface.
Local Control Processor (LCP): streams instructions into the PEs.
Central Control Processor (CCP): work scheduling and memory management.

OuterSPACE: Multiply Phase
Figure: OuterSPACE architecture for the multiply phase, annotated with a column of A and a row of B.

OuterSPACE: Multiply Phase
Figure: Reconfigured architecture for the multiply phase.
L0 caches contain the elements of rows of B that are shared between PEs.
L1 caches act as "victim" caches, holding hot data that may get evicted from L0 when multiple PEs work simultaneously on different rows.

OuterSPACE: Merge Phase
Figure: Reconfigured architecture for the merge phase (with example partial rows).
The L0 cache-crossbar blocks are reconfigured to accommodate the change in data access pattern: L0 is split into smaller private caches and private scratchpads.
A PE-pair merges all the partial products belonging to one row of C and works on each final row independently of the other PE-pairs; thus, the merge phase involves no data sharing among PEs.

OuterSPACE: Merge Phase
Figure: Reconfigured architecture for the merge phase.
Half of the PEs are turned off and the rest work in pairs to merge outer products: a "fetcher" PE initiates loads to fetch the partial products, while a "sorter" PE sorts the previously fetched partial products and merges them upon collision. Turning off PEs saves power and regulates bandwidth allocation. With no synchronization, no coherence is needed. A software analogy of the pairing is sketched below.
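Function names here are hypothetical, and each partial-product list is assumed to already be sorted by column index:

```python
# Fetcher/sorter split for merging one row of C (illustrative analogy only).
import heapq

def fetcher(partial_lists):
    """Stream in the partial-product lists for one row of C
    (in hardware: the fetcher PE issues loads into the private scratchpad)."""
    for lst in partial_lists:
        yield lst

def sorter(fetched_lists):
    """k-way merge of (col, val) lists, accumulating on column-index collisions
    (in hardware: the sorter PE's job)."""
    merged = []
    for c, v in heapq.merge(*fetched_lists, key=lambda p: p[0]):
        if merged and merged[-1][0] == c:
            merged[-1] = (c, merged[-1][1] + v)   # collision: same output column
        else:
            merged.append((c, v))
    return merged

# One output row: row = sorter(fetcher(partial_lists_for_that_row))
```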

Overview: Introduction, Algorithm, Architecture, Evaluation, Conclusion

Simulation Methodology
Modeled the PEs, the cache-crossbar hierarchy and HBM using gem5, with trace-based simulation: memory instruction traces are generated offline and fed to the PEs in the simulator.
Table: gem5 simulation parameters
Processing Element (PE): 1.5 GHz, 64-entry requestQ; 16 PEs/tile for multiply, 8 PEs/tile for merge
L0 cache/scratchpad (SPM): 16 kB, 4-way per tile for multiply; 2 kB cache + 2 kB SPM per tile for merge
L1 cache: 4 kB, 2-way
Crossbar: 16x16 and 4x4, swizzle-switch based
Main memory: HBM 2.0, 8000 MB/s per channel, 16 channels
Evaluated against state-of-the-art SpGEMM packages: Intel MKL on a 6-core Intel Broadwell CPU, and NVIDIA cuSPARSE & CUSP on a K40 GPU.

Performance of Matrix-Matrix Multiplication
Figure: Evaluation of SpGEMM on UFL SuiteSparse and SNAP matrices.

Greater speedups are seen for irregular matrices, since the inner product incurs a large number of comparisons. MKL performs poorly on power-law graphs. Uneven non-zero distribution increases merge time.

Performance of Matrix-Vector Multiplication
Figure: Speedups for multiplication with a vector of varying density (r); each matrix has 1 million uniformly distributed non-zeros.
Figure: Performance scaling (runtime in ms versus matrix dimension) for multiplication with a dense vector.
There is a ~10x gain in speedup for every 10x reduction in vector density for SpM-SpV multiplication. The outer product method only accesses the matrix columns that match the indices of the non-zero elements of the vector, eliminating all redundant accesses to the matrix; the algorithm therefore scales with the number of non-zeros rather than with the density or dimensions of the matrix. A sketch of this access pattern follows.
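A minimal sketch, again with an illustrative dict-of-columns layout (an assumption for illustration, not the accelerator's format):

```python
# Outer-product SpM-SpV: only columns of A matched by non-zeros of x are read.
def outer_product_spmv(A_cols, x_nonzeros):
    """A_cols[i] = {row: value} for column i of A; x_nonzeros = {i: x_i}."""
    y = {}
    for i, xi in x_nonzeros.items():              # skip every zero entry of x
        for r, a in A_cols.get(i, {}).items():    # touch only the matching column of A
            y[r] = y.get(r, 0.0) + a * xi         # scale and accumulate into y
    return y
```

Work scales with the non-zeros that actually pair up, which is why a 10x drop in vector density yields roughly a 10x speedup.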

Power and Area
Table: Power and area estimates for OuterSPACE in 32 nm.
Total chip area for OuterSPACE: ~87 mm² at a 24 W power budget. An average throughput of 2.9 GFLOPS gives 126 MFLOPS/W, whereas the GPU achieves an average of 67 MFLOPS at 85 W on the UFL/SNAP matrices, roughly a 150x improvement in power efficiency for OuterSPACE. For context, the same GPUs reach ~100 GFLOPS on dense, compute-bound matrix multiplication versus ~100 MFLOPS on these sparse workloads.

Overview: Introduction, Algorithm, Architecture, Evaluation, Conclusion

Summary
Explored the outer product approach for SpGEMM and discovered inefficiencies in existing hardware that lead to sub-optimal performance.
Designed a custom architecture following an SPMD paradigm that reconfigures to support different data access patterns; it minimizes memory accesses through efficient use of the cache-crossbar hierarchy.
Demonstrated acceleration of SpGEMM, a prominent memory-bound kernel, evaluating the outer product algorithm on artificial and real-world workloads.
OuterSPACE achieves speedups of 7.9-14.0x over commercial SpGEMM libraries on the CPU and GPU.

Questions?