1 University of Michigan, Ann Arbor
OuterSPACE: An Outer Product based Sparse Matrix Multiplication Accelerator
28 February 2018
Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti†, Hun-Seok Kim, David Blaauw, Trevor Mudge and Ronald Dreslinski
University of Michigan, Ann Arbor    †Arizona State University
HPCA 2018

2 Overview Introduction Algorithm Architecture Evaluation Conclusion


4 The Big Data Problem
Big data is collected from many sources: sensor feeds, social media, scientific experiments
Challenge: much of this data is sparse
Architecture research has traditionally focused on improving compute, but sparse matrix computation is a key example of a memory-bound workload
GPUs achieve ~100 GFLOPS on dense matrix multiplication vs. ~100 MFLOPS on sparse matrix multiplication
Two dominant kernels:
Sparse matrix-matrix multiplication (SpGEMM): breadth-first search, algebraic multigrid methods
Sparse matrix-vector multiplication (SpMV): PageRank, support vector machines, ML-based text analytics
Graphs are stored as sparse adjacency matrices, so many fundamental graph problems reduce to these kernels (see the sketch below)
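To make the graph-to-matrix connection concrete, here is a minimal sketch (my own illustration, not material from the talk) of a graph stored as a sparse adjacency matrix, with one breadth-first-search frontier expansion expressed as a sparse matrix-vector product using scipy.sparse:

```python
import numpy as np
import scipy.sparse as sp

# Toy directed graph stored as a sparse adjacency matrix (CSR).
# Edges: 0->1, 0->2, 1->3, 2->3
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 3, 3])
A = sp.csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

# One BFS frontier expansion is a sparse matrix-vector product:
# non-zeros of the result mark vertices reachable in one step.
frontier = sp.csr_matrix(np.array([[1.0], [0.0], [0.0], [0.0]]))  # start at vertex 0
next_frontier = A.T @ frontier
print(next_frontier.nonzero()[0])  # -> [1 2]
```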

5 Inner Product based Matrix Multiplication
$c_{i,j} = \sum_{k=0}^{N-1} a_{i,k} \times b_{k,j}$
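For reference, the sketch below (my own Python/scipy illustration, not the talk's code) spells out the inner-product formulation for sparse inputs: every output element requires index-matching the non-zeros of a row of A against a column of B, which is exactly the overhead discussed on the following slides:

```python
import numpy as np
import scipy.sparse as sp

def inner_product_spgemm(A, B):
    """Naive inner-product SpGEMM: every output c[i, j] requires
    index-matching the non-zeros of row i of A with column j of B."""
    A, B = sp.csr_matrix(A), sp.csc_matrix(B)
    out = sp.lil_matrix((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        a_cols = A.indices[A.indptr[i]:A.indptr[i + 1]]      # column indices of row i of A
        a_vals = A.data[A.indptr[i]:A.indptr[i + 1]]
        for j in range(B.shape[1]):
            b_rows = B.indices[B.indptr[j]:B.indptr[j + 1]]  # row indices of column j of B
            b_vals = B.data[B.indptr[j]:B.indptr[j + 1]]
            # Index matching: only indices k present in BOTH lists contribute.
            common, ia, ib = np.intersect1d(a_cols, b_rows, return_indices=True)
            if common.size:
                out[i, j] = np.dot(a_vals[ia], b_vals[ib])
    return out.tocsr()
```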

6 Outer Product based Matrix Multiplication
$\mathbf{C} = \sum_{i=0}^{N-1} \mathbf{C}_i = \sum_{i=0}^{N-1} \mathbf{a}_i \mathbf{b}_i$, where $\mathbf{a}_i$ is the $i$-th column of $\mathbf{A}$ and $\mathbf{b}_i$ is the $i$-th row of $\mathbf{B}$

7 Outer Product based Matrix Multiplication
+ No index matching
+ Low reuse distance
$\mathbf{C} = \sum_{i=0}^{N-1} \mathbf{C}_i = \sum_{i=0}^{N-1} \mathbf{a}_i \mathbf{b}_i$
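A minimal software sketch of this formulation (my own illustration with scipy.sparse; OuterSPACE realizes it in hardware): column i of A and row i of B are each read once, every non-zero pair contributes to the output, and the rank-1 partial products C_i are merged into C:

```python
import scipy.sparse as sp

def outer_product_spgemm(A, B):
    """Outer-product SpGEMM: pair column i of A with row i of B.
    Every non-zero pair produces a useful partial product (no index
    matching), and each column/row pair is read exactly once."""
    A, B = sp.csc_matrix(A), sp.csr_matrix(B)
    C = sp.csr_matrix((A.shape[0], B.shape[1]))
    for i in range(A.shape[1]):
        a_i = A.getcol(i)      # i-th column of A (m x 1, sparse)
        b_i = B.getrow(i)      # i-th row of B    (1 x n, sparse)
        C = C + a_i @ b_i      # rank-1 partial product C_i, merged into C
    return C
```
Accumulating each C_i into C, as the last line of the loop does, is exactly the merge work that OuterSPACE splits off into its dedicated merge phase.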

8 Comparison of the Approaches
Inner product: entries need to be index-matched before multiplication; reuse distance is high, since each column of B is reloaded after a row of A has been multiplied with all columns of B
Outer product: every non-zero pair from a column of A and the corresponding row of B produces a meaningful output; reuse distance is small, since a column of A and a row of B are multiplied once and never used again

9 Overview Introduction Algorithm Architecture Evaluation Conclusion

10 Outer Product Implementation
Figure: Implementation of the outer product algorithm in a system with parallel Processing Units

11 Outer Product Implementation
Figure: The outer product algorithm implementation
Multiply phase
Processing units multiply an element of a column of A with the corresponding row of B
Each operation is issued as a single "multiply task": A[i][j] x Brow[j]
Merge phase
A processing unit merges all partial products pertaining to one row of C
Uses a modified mergesort-based approach to minimize memory traffic; there is no data sharing between processing units during the merge (a software sketch of both phases follows)
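The following is a rough software analogue of the two phases (my own sketch, not the authors' code or the hardware behavior): the multiply phase turns each A[i][j] x Brow[j] task into a sorted partial row of C, and the merge phase combines the sorted partial rows of each output row, accumulating entries that collide on the same column index:

```python
from collections import defaultdict
import heapq

def multiply_phase(A_cols, B_rows):
    """A_cols[j]: list of (i, value) non-zeros in column j of A.
    B_rows[j]:  list of (k, value) non-zeros in row j of B, sorted by k.
    Each (A[i][j], B_row[j]) pair is one 'multiply task' producing a sorted
    partial row of C; partial rows are grouped by their output row i."""
    partial_rows = defaultdict(list)              # i -> list of partial rows
    for j, col in A_cols.items():
        brow = B_rows.get(j, [])
        for i, a_val in col:
            partial_rows[i].append([(k, a_val * b_val) for k, b_val in brow])
    return partial_rows

def merge_phase(partial_rows):
    """Merge the sorted partial rows of each output row, accumulating
    entries that collide on the same column index (mergesort-like)."""
    C = {}
    for i, rows in partial_rows.items():
        merged = {}
        for k, v in heapq.merge(*rows):           # streams entries in column order
            merged[k] = merged.get(k, 0.0) + v
        C[i] = sorted(merged.items())
    return C
```
In OuterSPACE, the multiply tasks are distributed across processing elements and the per-row merges are handled by independent PE pairs, as described in the architecture section.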

12 Performance on Traditional Hardware: CPU
Each matrix has 1 million uniformly distributed non-zeros
Figure: Performance of the outer product algorithm against Intel MKL on the CPU (x-axis: matrix dimension)
Evaluated the outer product algorithm against Intel MKL on a Broadwell CPU
The outer product algorithm puts high pressure on the memory system: N sparse N×N partial-product matrices are streamed out to main memory during the multiply phase and loaded back in during the merge phase
The merge phase implementation involves no sharing, leading to cache thrashing

13 Performance on Traditional Hardware: CPU
Each matrix has 1 million uniformly distributed non-zeros
Takeaway: performance is bottlenecked by the rigid cache hierarchy and limited compute parallelism
Figure: Performance of the outer product algorithm against Intel MKL on the CPU (x-axis: matrix dimension)

14 Performance on Traditional Hardware: GPU
Each matrix has 1 million uniformly distributed non-zeros
Figure: Performance of the outer product algorithm against CUSP on the GPU (x-axis: matrix dimension)
Evaluated the outer product algorithm against CUSP on a K40 GPU
The multiply phase streams and processes data much faster than the CPU implementation, scaling roughly linearly with decreasing density
The merge phase suffers from a much lower total throughput, caused by control divergence between threads within a warp while executing the conditional branches of the merge code

15 Performance on Traditional Hardware: GPU
Each matrix has 1 million uniformly distributed non-zeros
Takeaway: performance is bottlenecked by the SIMD nature of GPU warps
Figure: Performance of the outer product algorithm against CUSP on the GPU (x-axis: matrix dimension)
Solution: an SPMD architecture!

16 Overview Introduction Algorithm Architecture Evaluation Conclusion

17 OuterSPACE: Multiply Phase
Figure: OuterSPACE architecture for the multiply phase
SPMD-style Processing Elements (PEs), high-speed crossbars and non-coherent caches with request coalescing, backed by an HBM interface
Local Control Processor (LCP): streams instructions in to the PEs
Central Control Processor (CCP): handles work scheduling and memory management

18 OuterSPACE: Multiply Phase
Figure: OuterSPACE architecture for the multiply phase, annotated with the column of A and row of B being processed by the PEs

19 OuterSPACE: Multiply Phase
Figure: Reconfigured architecture for the multiply phase
L0 caches hold elements of the rows of B that are shared between PEs
L1 caches act as "victim" caches, holding hot data that may be evicted from L0 when multiple PEs work simultaneously on different rows

20 OuterSPACE: Merge Phase
Figure: Reconfigured architecture for the merge phase (the figure shows example partial rows being merged)
The L0 cache-crossbar blocks are reconfigured to accommodate the change in data access pattern: L0 is split into smaller private caches and private scratchpads
A PE-pair merges all the partial products belonging to one row of C and works on each final row independently of the other PE-pairs
Thus, the merge phase involves no data sharing among PEs

21 OuterSPACE: Merge Phase
Figure: Reconfigured architecture for the merge phase
Half of the PEs are turned off and the rest work in pairs to merge the outer products
A "fetcher" PE initiates loads to fetch the partial products
A "sorter" PE sorts the previously-fetched partial products and accumulates entries that collide on the same column index
Turning off PEs saves power and regulates bandwidth allocation
No synchronization between PE-pairs means no coherence is needed (a software sketch of the fetcher/sorter pairing follows)
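Below is a minimal software analogue of the fetcher/sorter pairing (my own illustration; the real fetcher and sorter are hardware PEs operating on private caches and scratchpads): the fetcher streams partial-product entries for one output row ahead of the sorter, which maintains a sorted row and accumulates on column-index collisions:

```python
import bisect

def fetcher(partial_rows):
    """Streams (col, val) entries of the partial products for one output
    row, emulating the 'fetcher' PE issuing loads ahead of the sorter."""
    for row in partial_rows:
        for col, val in row:
            yield col, val

def sorter(stream):
    """Consumes the fetched stream and keeps a sorted output row,
    accumulating values whenever two entries collide on a column index."""
    cols, vals = [], []
    for col, val in stream:
        pos = bisect.bisect_left(cols, col)
        if pos < len(cols) and cols[pos] == col:   # collision: accumulate
            vals[pos] += val
        else:                                      # new column: insert in sorted order
            cols.insert(pos, col)
            vals.insert(pos, val)
    return list(zip(cols, vals))

# Example: two partial products belonging to the same output row of C
row_c = sorter(fetcher([[(0, 1.0), (3, 2.0)], [(3, 4.0), (5, 1.0)]]))
print(row_c)   # -> [(0, 1.0), (3, 6.0), (5, 1.0)]
```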

22 Overview Introduction Algorithm Architecture Evaluation Conclusion

23 Simulation Methodology
Modeled the PEs, the cache-crossbar hierarchy and HBM using gem5
Trace-based simulation: memory instruction traces are generated offline and fed to the PEs in the simulator (a sketch of this flow follows)
Table: gem5 simulation parameters
Processing Element (PE): 1.5 GHz, 64-entry request queue; 16 PEs/tile for multiply, 8 PEs/tile for merge
L0 cache/scratchpad (SPM): 16 kB, 4-way per tile for multiply; 2 kB cache + 2 kB SPM per tile for merge
L1 cache: 4 kB, 2-way
Crossbars: 16x16 and 4x4, swizzle-switch based
Main memory: HBM 2.0, 8000 MB/s per channel, 16 channels
Evaluated against state-of-the-art SpGEMM packages: Intel MKL on a 6-core Intel Broadwell CPU, NVIDIA cuSPARSE and CUSP on a K40 GPU
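The talk does not show the trace format, so the sketch below is a hypothetical illustration of the general idea of trace-based simulation: per-PE memory operations for the multiply phase are generated offline and later replayed against the simulated memory hierarchy. The addresses, base pointers and ('LD'/'ST', address) tuple format are all invented for illustration:

```python
# Hypothetical offline trace generation for the multiply phase.
# The real trace format used with gem5 is not shown in the talk; this only
# illustrates replaying precomputed memory operations per PE.
def multiply_phase_trace(a_col, b_row, pe_id, base_a=0x1000_0000,
                         base_b=0x2000_0000, base_c=0x3000_0000, elem=8):
    """a_col: list of (row_index, offset_in_A) for one column of A.
    b_row: list of (col_index, offset_in_B) for the matching row of B.
    Returns a list of ('LD'/'ST', address) tuples for one PE."""
    trace = []
    for _, b_off in b_row:                 # stream the shared row of B once
        trace.append(('LD', base_b + b_off * elem))
    c_off = 0
    for _, a_off in a_col:                 # one multiply task per element of the column
        trace.append(('LD', base_a + a_off * elem))
        for _ in b_row:                    # store the produced partial row of C_i
            trace.append(('ST', base_c + (pe_id << 20) + c_off * elem))
            c_off += 1
    return trace
```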

24 Performance of Matrix-Matrix Multiplication
Figure: Evaluation of SpGEMM on UFL SuiteSparse and SNAP matrices

25 Performance of Matrix-Matrix Multiplication
Figure: Evaluation of SpGEMM on UFL SuiteSparse and SNAP matrices (colored cells in the matrix thumbnails represent non-zeros)
Greater speedups for irregular matrices, where the inner product approach incurs a large number of comparisons
MKL performs poorly on power-law graphs
An uneven non-zero distribution increases merge time

26 Performance of Matrix-Vector Multiplication
Figure: Speedups for multiplication with a vector of varying density (r)
Figure: Performance scaling for multiplication with a dense vector (x-axis: matrix dimension, y-axis: time in ms)
Each matrix has 1 million uniformly distributed non-zeros
~10x gain in speedup for every 10x reduction in vector density for SpM-SpV multiplication
The outer product method only accesses matrix columns that match the indices of the non-zero elements of the vector, eliminating all redundant accesses to the matrix (see the sketch below)
The outer product algorithm scales with the number of non-zeros rather than with the density or dimensions of the matrix
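A minimal sketch of outer-product SpM-SpV multiplication (my own illustration with scipy.sparse): only the columns of A indexed by the non-zeros of the vector are ever read, so the work scales with the non-zero count rather than with the matrix density or dimension:

```python
import numpy as np
import scipy.sparse as sp

def outer_product_spmspv(A, x):
    """Outer-product sparse matrix * sparse vector: only the columns of A
    whose index matches a non-zero of x are ever touched, so the work
    scales with the number of non-zeros rather than the matrix dimension."""
    A = sp.csc_matrix(A)
    y = np.zeros(A.shape[0])
    for j, x_j in zip(x.indices, x.data):      # non-zeros of the sparse vector
        start, end = A.indptr[j], A.indptr[j + 1]
        rows, vals = A.indices[start:end], A.data[start:end]
        y[rows] += x_j * vals                  # scaled column j of A
    return y

# Example: 1000 x 1000 matrix, vector with 1% non-zeros
A = sp.random(1000, 1000, density=1e-3, format='csc', random_state=0)
x = sp.random(1000, 1, density=0.01, format='csc', random_state=1)
y = outer_product_spmspv(A, x)
```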

27 Table: Power and area estimates for OuterSPACE in 32 nm
Total chip area for OuterSPACE: ~87 mm², with a 24 W power budget
Average throughput of 2.9 GFLOPS at an efficiency of 126 MFLOPS/W
The GPU draws far more average power on the UFL/SNAP matrices: OuterSPACE delivers roughly a 150x power-efficiency improvement over the GPU

28 Overview Introduction Algorithm Architecture Evaluation Conclusion

29 Summary Explored the outer product approach for SpGEMM
Discovered inefficiencies in existing hardware that lead to sub-optimal performance
Designed a custom architecture following an SPMD paradigm that reconfigures itself to support different data access patterns
The architecture minimizes memory accesses through efficient use of the cache-crossbar hierarchy
Demonstrated acceleration of SpGEMM, a prominent memory-bound kernel, and evaluated the outer product algorithm on artificial and real-world workloads
OuterSPACE achieves speedups of roughly an order of magnitude over commercial SpGEMM libraries on the CPU (Intel MKL) and GPU (cuSPARSE, CUSP)

30 Questions?

