Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel (1), Oreste Villa (2), Sriram Krishnamoorthy (2), Antonino Tumeo (2) and Xiaoming Li (1)
(1) University of Delaware, (2) Pacific Northwest National Laboratory
September 24th, 2010

Overview Introduction Cluster level Node level Results Conclusion Future Work 2

Overview Introduction Cluster level Node level Results Conclusion Future Work 3

Sparse Matrix-Matrix Multiply - Challenges
The efficient implementation of sparse matrix-matrix multiplication on HPC systems poses several challenges:
Large size of the input matrices, e.g. 10^6 × 10^6 with 30×10^6 nonzero elements
Compressed representation
Partitioning
Density of the output matrices
Load balancing: large differences in density and computation times
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection.

Sparse Matrix-Matrix Multiply
Cross-cluster implementation: partitioning, data distribution, load balancing, communication/scaling, result handling.
In-node implementation: multiple efficient SpGEMM algorithms, CPU/GPU implementation, double buffering, exploiting heterogeneity.
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection.

Overview Introduction Cluster level Node level Results Conclusion Future Work 6

Sparse Matrix-Matrix Multiply - Cluster level
Blocking: the block size depends on the sparsity of the input matrices and the number of processing elements; NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements.
Data layout: which format and ordering allow easy and fast access.
Communication and storage are implemented using Global Arrays (GA), which offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.
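As a rough illustration of the blocking rule above (NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements), the sketch below picks a tile grid from the matrix dimensions and the number of processing elements. The function name, the oversubscription factor and the square-grid choice are illustrative assumptions, not taken from the talk.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative sketch: choose a (square) tile grid so that the number of
// tiles well exceeds the number of processing elements, giving the dynamic
// load balancer enough tasks to even out density differences.
struct TileGrid { int numBlocksX, numBlocksY, tileRows, tileCols; };

TileGrid chooseTileGrid(int rows, int cols, int numProcessingElements,
                        int oversubscription = 16) {   // assumed factor
    int minTiles = numProcessingElements * oversubscription;
    int perSide  = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(minTiles))));
    TileGrid g;
    g.numBlocksX = std::min(perSide, cols);                    // never more blocks than columns
    g.numBlocksY = std::min(perSide, rows);
    g.tileCols   = (cols + g.numBlocksX - 1) / g.numBlocksX;   // ceiling division
    g.tileRows   = (rows + g.numBlocksY - 1) / g.numBlocksY;
    return g;
}
```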

Sparse Matrix-Matrix Multiply - Data representation and Tiling
(Figure: blocked matrices A, B and C, with C = A×B.)
Blocked matrix representation: each block is stored in CSR* form, with data(), col() and row() arrays.
*CSR: Compressed Sparse Row
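For reference, a single CSR tile can be described by the three arrays hinted at by the data()/col()/row() notation on the slide. The sketch below (field names and the float value type are assumptions) shows one possible layout:

```cpp
#include <vector>

// One tile of the blocked matrix in Compressed Sparse Row (CSR) form.
// For a tile with nRows rows and nnz nonzero elements:
//   data[k] holds the value of the k-th nonzero (stored row by row),
//   col[k]  holds its column index inside the tile,
//   row[i]  is the index into data/col where row i starts; row[nRows] == nnz.
struct CsrTile {
    int nRows = 0;
    int nCols = 0;
    std::vector<float> data;
    std::vector<int>   col;
    std::vector<int>   row;   // size nRows + 1
};

// Example use: number of nonzeros in row i of a tile.
inline int rowNnz(const CsrTile& t, int i) { return t.row[i + 1] - t.row[i]; }
```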

Sparse Matrix-Matrix Multiply - Data representation and Tiling
Matrix A: the single CSR tiles are stored serialized into the GA space (the stream holds data, col and row of Tile 0, then Tile 2, and so on; empty tiles are skipped).
Tile sizes and offsets are stored in a 2D array.
Tiles with 0 nonzero elements are not represented in the GA dataset.
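A possible way to picture the serialization step: variable-length CSR payloads packed back to back, with a dense 2D table of offsets and sizes on the side. The flat byte buffer below stands in for the GA dataset, the code reuses the CsrTile sketch above, and all names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Per-tile metadata kept in a dense 2D table (one entry per block),
// while the variable-length CSR payloads are packed back to back.
struct TileInfo {
    long offset = -1;   // start of the tile's serialized bytes; -1 means empty tile
    long bytes  =  0;   // serialized size (0 for tiles with no nonzeros)
};

// Append one CSR tile (see the CsrTile sketch above) to the flat buffer and
// record where it landed. Empty tiles are skipped entirely, matching the
// "tiles with 0 nonzero elements are not represented" rule from the slide.
long serializeTile(const CsrTile& t, std::vector<char>& flat, TileInfo& info) {
    if (t.data.empty()) { info = TileInfo{}; return 0; }
    info.offset = static_cast<long>(flat.size());
    auto append = [&](const void* p, std::size_t n) {
        const char* c = static_cast<const char*>(p);
        flat.insert(flat.end(), c, c + n);
    };
    append(t.row.data(),  t.row.size()  * sizeof(int));
    append(t.col.data(),  t.col.size()  * sizeof(int));
    append(t.data.data(), t.data.size() * sizeof(float));
    info.bytes = static_cast<long>(flat.size()) - info.offset;
    return info.bytes;
}
```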

Sparse Matrix-Matrix Multiply - Data representation and Tiling
Matrix B: tiles are serialized in a transposed way.
Depending on the algorithm used to calculate the single tiles, the data within the tiles can be stored transposed or not transposed.
For the Gustavson algorithm the representation of the data in the tiles themselves is not transposed.

Sparse Matrix-Matrix Multiply - Tasking and Data Movement
Each block in C represents a task (tasks 0 … N-1).
Nodes grab tasks, and any additional data they need, whenever they have computational power available.
Results are stored locally; the metadata of the result blocks held by each node is then distributed to determine the offsets of the tiles in the GA space, and the tiles are put into the GA space in the right order.
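One way to picture the task-grabbing loop is a shared counter that every node bumps atomically; faster nodes naturally take more tasks, which is where the dynamic load balancing comes from. In the sketch below the counter is a process-local atomic and the helpers are hypothetical stand-ins; in the real system the counter would live in the globally shared space (e.g. on top of an atomic read-increment primitive such as the one Global Arrays provides):

```cpp
#include <atomic>

// Process-local stand-in for a cluster-wide task counter; in the real system
// the counter lives in the globally shared space.
std::atomic<long> g_nextTask{0};
long grabNextTask() { return g_nextTask.fetch_add(1); }

// Hypothetical helpers standing in for the real data movement and compute.
void fetchInputsFor(long /*task*/)     {}   // pull the required stripes of A and B
void computeTile(long /*task*/)        {}   // run the CPU or GPU SpGEMM kernel
void storeResultLocally(long /*task*/) {}   // keep the result tile on this node

void workerLoop(long totalTasks) {
    // Each node keeps grabbing the next unclaimed output tile while it has
    // computational power available; faster nodes simply end up with more tasks.
    for (long task = grabNextTask(); task < totalTasks; task = grabNextTask()) {
        fetchInputsFor(task);
        computeTile(task);
        storeResultLocally(task);   // metadata is exchanged afterwards to place tiles in GA space
    }
}
```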

Sparse Matrix-Matrix Multiply - Tasking and Data Movement
Each node fetches the data needed by the task it handles: e.g. for task/tile 5 the node has to load the data of stripe s_a = 1 of A and stripe s_b = 0 of B.
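Assuming the output tiles are numbered row-major over the block grid of C (an assumption about the numbering, not stated explicitly on the slide), the stripes a task needs follow directly from its index:

```cpp
// For output tile `task` in an assumed row-major numbering of C's block grid,
// sa is the row stripe of A and sb is the column stripe of B that the node
// must fetch before it can compute the tile.
struct StripePair { int sa; int sb; };

StripePair stripesForTask(long task, int numBlocksX) {
    StripePair s;
    s.sa = static_cast<int>(task / numBlocksX);
    s.sb = static_cast<int>(task % numBlocksX);
    return s;
}

// e.g. with numBlocksX == 5, task 5 would map to sa = 1 and sb = 0,
// matching the example on the slide.
```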

Sparse Matrix-Matrix Multiply - Next Step: Locality-aware Tasking
Assign tasks depending on how the global array is distributed over the cluster.
The task queue should be aware of what data is already available in a node and, based on that, assign the follow-up task.
(In the figure: tasks that reuse the stripes already loaded for task 5 should have a higher priority to be assigned to the node that handled task 5.)

Overview Introduction Cluster level Node level Results Conclusion Future Work 14

Sparse Matrix-Matrix Multiply - Gustavson
The algorithm is based on the equation c_i = Σ_{v : a_iv ≠ 0} a_iv · b_v, where c_i and b_v denote the i-th row of C and the v-th row of B: the i-th row of C is a linear combination of the rows v of B for which a_iv is nonzero. A has the dimensions p×q and B the dimensions q×r.
(Figure: example matrices A and B in CSR form - data(), col(), row() arrays - and the accumulation of row i = 1 of C from the contributions v = 1, 3, 4.)
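A compact host-side sketch of the Gustavson scheme for a pair of CSR tiles, using a dense accumulator for one output row at a time; it reuses the CsrTile sketch above and is a generic illustration rather than the authors' code:

```cpp
#include <vector>

// C = A * B for two CSR tiles, one output row at a time (Gustavson):
// row i of C is the linear combination of the rows v of B for which a_iv
// is nonzero, accumulated in a dense scratch row and then compressed.
CsrTile spgemmGustavson(const CsrTile& A, const CsrTile& B) {
    CsrTile C;
    C.nRows = A.nRows;
    C.nCols = B.nCols;
    C.row.assign(A.nRows + 1, 0);
    std::vector<float> dense(B.nCols, 0.0f);   // dense accumulator for one row of C
    std::vector<char>  touched(B.nCols, 0);

    for (int i = 0; i < A.nRows; ++i) {
        // Accumulate: for every nonzero a_iv, add a_iv * (row v of B).
        for (int k = A.row[i]; k < A.row[i + 1]; ++k) {
            int   v   = A.col[k];
            float aiv = A.data[k];
            for (int j = B.row[v]; j < B.row[v + 1]; ++j) {
                dense[B.col[j]] += aiv * B.data[j];
                touched[B.col[j]] = 1;
            }
        }
        // Compress the dense row back into CSR form and reset the scratch space.
        for (int c = 0; c < B.nCols; ++c) {
            if (touched[c]) {
                C.data.push_back(dense[c]);
                C.col.push_back(c);
                dense[c]   = 0.0f;
                touched[c] = 0;
            }
        }
        C.row[i + 1] = static_cast<int>(C.data.size());
    }
    return C;
}
```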

Sparse Matrix-Matrix Multiply - Gustavson
In the CUDA implementation:
Each result row c_i is handled by the 16 threads of a half-warp (1/2W).
For each nonzero element a_iv in A, one half-warp performs the multiplications with row b_v in parallel.
The results are kept in dense form until all calculations are complete; then the results get compressed on the device.
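A simplified CUDA sketch of the half-warp scheme: each group of 16 threads owns one result row and accumulates it densely in global memory; the final compression step is omitted. It illustrates the idea using the raw CSR arrays from the earlier sketches and is not the authors' kernel:

```cuda
// One half-warp (16 threads) per result row: the half-warp walks the nonzeros
// a_iv of row i of A serially, and its 16 lanes stride over row v of B in
// parallel, accumulating into a dense row of C kept in global memory.
// denseC must hold nRowsA * nColsB floats and be zero-initialized beforehand.
__global__ void spgemmRowPerHalfWarp(const int* aRow, const int* aCol, const float* aData,
                                     const int* bRow, const int* bCol, const float* bData,
                                     float* denseC, int nRowsA, int nColsB)
{
    const int lane = threadIdx.x & 15;                               // 0..15 within the half-warp
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 4;   // one row of C per half-warp
    if (row >= nRowsA) return;

    float* cRow = denseC + static_cast<size_t>(row) * nColsB;
    for (int k = aRow[row]; k < aRow[row + 1]; ++k) {                // serial over row i of A
        const int   v   = aCol[k];
        const float aiv = aData[k];
        for (int j = bRow[v] + lane; j < bRow[v + 1]; j += 16) {     // lanes split row v of B
            cRow[bCol[j]] += aiv * bData[j];   // distinct columns per lane: no atomics needed
        }
    }
}

// Launch sketch: 128 threads per block = 8 half-warps = 8 result rows per block.
// spgemmRowPerHalfWarp<<<(nRowsA * 16 + 127) / 128, 128>>>(/* ... */);
```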

Overview Introduction Cluster level Node level Results Conclusion Future Work 17

Sparse Matrix-Matrix Multiply – Case Study
Midsize matrix from the University of Florida Sparse Matrix Collection*: a 2D/3D problem of size 72,000 × 72,000 with 715,634 nonzero elements, blocked into 5041 tiles.
Multiplying the matrix with itself.
*Darker colors represent higher densities of nonzero elements.

Sparse Matrix-Matrix Multiply - Results 19 Scaling of SpGEMM with the different approaches

Sparse Matrix-Matrix Multiply - Results 20

Sparse Matrix-Matrix Multiply - Results Even inside a node where different compute elements are used the load balancing mechanism still performs well The processes using the CUDA devices here completing almost 5x more tasks than the pure CPU processes. 21

Overview Introduction Cluster level Node level Results Conclusion Future Work 22

Sparse Matrix-Matrix Multiply
We presented a parallel framework using a co-design approach which takes into account the characteristics of the selected application (here SpGEMM) and of the underlying hardware (a heterogeneous cluster).
The difficulties of using static partitioning approaches show that a global load balancing method is needed.
Different optimized implementations of the Gustavson algorithm are presented and used depending on the available compute element.
For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.

Overview Introduction Cluster level Node level Results Conclusion Future Work 24

Future Work – General Tasking Framework for Heterogeneous GPU Clusters
More general task definition; more flexibility in input and output data definition.
Exploring the limits imposed on tasks by a heterogeneous system.
A feedback loop during execution that allows more efficient assignment of tasks.
Introducing heterogeneous execution on GPU and CPU in one process/core.
Locality-aware task queue(s) and work stealing.
Task reinsertion or generation at the node level.

Thank you Questions? 26