Presentation transcript:

Copyright 2011, Data Mining Research Laboratory Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, Srinivasan Parthasarathy and P. Sadayappan Department of Computer Science and Engineering The Ohio State University

Copyright 2011, Data Mining Research Laboratory Outline
– Motivation and Background
– Single- and Multi-GPU SpMV Optimizations
– Automatic Parameter Tuning and Performance Modeling
– Conclusions

Copyright 2011, Data Mining Research Laboratory Introduction
Sparse Matrix-Vector Multiplication (SpMV)
– y = Ax, where A is a sparse matrix and x is a dense vector.
– Dominant cost when solving large-scale linear systems or eigenvalue problems with iterative methods.
Focus of much research
– Scientific applications, e.g. the finite element method
– Graph mining algorithms: PageRank, Random Walk with Restart, HITS
– Industrial-strength efforts: CPUs and clusters (e.g. Vuduc, Yelick et al. 2009), GPUs (e.g. NVIDIA 2010)
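As a baseline for the optimizations that follow, a minimal CUDA sketch of y = Ax with A in the common CSR format, one thread per row; the names are illustrative and this is not the authors' code.

```cuda
// Minimal CSR SpMV sketch: y = A*x, one thread per row (illustrative baseline).
// rowPtr has nRows+1 entries; colIdx/val hold the nonzeros of A row by row.
__global__ void spmv_csr_scalar(int nRows,
                                const int* rowPtr, const int* colIdx,
                                const float* val, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];   // gather from x: poor locality on power-law graphs
        y[row] = sum;
    }
}
// Launch (sketch): spmv_csr_scalar<<<(nRows + 255) / 256, 256>>>(nRows, rowPtr, colIdx, val, x, y);
```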

Copyright 2011, Data Mining Research Laboratory Why GPUs?
– High performance: the GPU is roughly 10x faster
– High memory bandwidth: 180 GB/s vs. <40 GB/s
– High productivity: CUDA (now) vs. OpenGL (before)
(GFLOPS and GB/s comparison charts omitted.) [ Source: www-sop.inria.fr/nachos ]

Copyright 2011, Data Mining Research Laboratory Problem Statement and Challenges
Can we improve upon industrial-strength efforts for computing SpMV on matrices representing large power-law graphs on the GPU?
– Does it yield end-to-end improvements in graph mining applications (e.g. PageRank)?
Challenges
– Need to balance load: power-law nature of graphs
– Need to coalesce memory accesses
– Need to avoid conditional divergence: the SIMD architecture prefers that threads follow identical control flow through branching instructions.
– Need to handle large matrices
(Power-law degree distribution plot omitted.) [ Source: Wikipedia ]

Copyright 2011, Data Mining Research Laboratory Background: CUDA Architecture
Programming model (logical hierarchy):
– Grid
– Block
– Thread
– Kernel
[ Source: NVIDIA CUDA guide ]

Copyright 2011, Data Mining Research Laboratory Background: CUDA Architecture
Hardware (physical):
– A set of multiprocessors
– A warp = 32 threads that concurrently run the same instruction
– Conditional divergence: parallel threads should follow identical control flow to avoid a performance penalty.
Memory system
– Global memory: accesses should be coalesced
– Texture cache: 6–8 KB of texture cache per multiprocessor
[ Source: NVIDIA CUDA guide ]
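A toy kernel (purely illustrative, not from the talk) showing two of the behaviours this slide names: a coalesced read and a divergent branch.

```cuda
// Two of the pitfalls named on this slide, in miniature (illustrative only).
__global__ void access_patterns(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Coalesced: thread k of a warp reads word k of a contiguous segment,
    // so the warp's 32 loads combine into a few wide memory transactions.
    float v = in[i];

    // Divergent: threads of the same warp take different branches, so the
    // warp executes both paths serially (the penalty the slide warns about).
    if (i % 2 == 0)
        out[i] = v * 2.0f;
    else
        out[i] = v + 1.0f;
}
```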

Copyright 2011, Data Mining Research Laboratory Outline
– Motivation and Background
– Single- and Multi-GPU SpMV Optimizations
– Automatic Parameter Tuning and Performance Modeling
– Conclusions

Copyright 2011, Data Mining Research Laboratory Single GPU Optimizations I
Problem I: each row accesses random entries of the vector x -- poor locality.
Solution: tile matrix A and vector x through the texture cache.
Problem II: full tiling is not always beneficial (power-law matrices).
Solution: partial tiling (parameterized), after reordering columns by length.
– The texture cache size is not published; we estimate it at 250 KB (= 64,000 columns of x).
– Note that the entire vector x cannot fit in the texture cache.
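A rough host-side sketch of the partial-tiling idea, assuming a hypothetical per-tile kernel spmv_tile and the slide's 250 KB / 64,000-column cache estimate; the texture-reference API shown matches the CUDA 3.x generation used in this work, and none of this is the authors' implementation.

```cuda
// Host-side sketch of partial column tiling (illustrative; not the authors' code).
// texXSlice caches the slice of x that the current tile references.
texture<float, 1, cudaReadModeElementType> texXSlice;

// Stand-in for a tuned per-tile kernel: multiplies the tile's nonzeros by the
// cached x slice and accumulates into y (body omitted in this sketch).
__global__ void spmv_tile(int nRows, int colBegin, float* y) { /* elided */ }

const int COLS_PER_TILE = 64000;   // ~250 KB of x entries, per the slide's estimate

void spmv_partially_tiled(int nRows, int nCols, const float* d_x, float* d_y)
{
    int nTiles = (nCols + COLS_PER_TILE - 1) / COLS_PER_TILE;
    for (int t = 0; t < nTiles; ++t) {
        int colBegin = t * COLS_PER_TILE;
        int tileCols = (nCols - colBegin < COLS_PER_TILE) ? nCols - colBegin : COLS_PER_TILE;
        // Bind only the slice of x this tile touches, so its reads stay in the texture cache.
        cudaBindTexture(0, texXSlice, d_x + colBegin, tileCols * sizeof(float));
        spmv_tile<<<(nRows + 255) / 256, 256>>>(nRows, colBegin, d_y);
        cudaUnbindTexture(texXSlice);
    }
}
```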

Copyright 2011, Data Mining Research Laboratory Single GPU Optimizations II
Problem III: imbalance in row length.
Solution: composite storage (sketched below).
– Row-major storage performs well on long rows (1 warp per row).
– Column-major storage performs well on short rows (1 thread per row).
– Partition rows into workloads of similar size, padded with zeros.
  Workloads with long rows are stored in row-major order.
  Workloads with many short rows are stored in column-major order.
– Workload size: parameterized.
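The two layouts behind the composite scheme, sketched as separate kernels (simplified and illustrative; the actual implementation packs both layouts into one composite structure of equal-sized, zero-padded workloads).

```cuda
// Long rows, row-major: one warp per row; lanes stride across the row and
// reduce their partial sums in shared memory.
__global__ void spmv_warp_per_row(int nRows, const int* rowPtr, const int* colIdx,
                                  const float* val, const float* x, float* y)
{
    __shared__ volatile float partial[256];          // assumes 256 threads per block
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp <-> one row
    int lane   = threadIdx.x & 31;
    if (warpId >= nRows) return;

    float sum = 0.0f;
    for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
        sum += val[j] * x[colIdx[j]];
    partial[threadIdx.x] = sum;

    // Warp-synchronous tree reduction (fine on this era's GPUs; newer
    // architectures would need __syncwarp()).
    for (int off = 16; off > 0; off >>= 1)
        if (lane < off) partial[threadIdx.x] += partial[threadIdx.x + off];
    if (lane == 0) y[warpId] = partial[threadIdx.x];
}

// Short rows padded to length k and stored column-major (ELL-style):
// one thread per row; consecutive threads read consecutive addresses, so
// accesses coalesce. Here val/colIdx are the k*nRows padded arrays.
__global__ void spmv_thread_per_row(int nRows, int k, const int* colIdx,
                                    const float* val, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.0f;
    for (int j = 0; j < k; ++j)                       // padded entries contribute 0
        sum += val[j * nRows + row] * x[colIdx[j * nRows + row]];
    y[row] = sum;
}
```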

Copyright 2011, Data Mining Research Laboratory Empirical Results on NVIDIA Tesla GPU Power-law matrices

Copyright 2011, Data Mining Research Laboratory Results: PageRank
– CPU baseline: Vuduc, Yelick et al. 2009
– GPU (NVIDIA 2010): up to 16.5x over the CPU
– GPU (Tile-Composite): up to 30x over the CPU, up to 2x over the NVIDIA GPU kernels

Copyright 2011, Data Mining Research Laboratory Multi-GPU SpMV
Problem IV: handling large matrices.
Challenge: PCI-Express bandwidth limitation (max 8 GB/s).
Solution: processing on multiple GPUs (sketched below).
– Partition the matrix by rows and distribute the work to different GPUs in a cluster.
– SK2005 dataset: 50 million nodes, 2 billion edges.
– 75% parallel efficiency; 1.5x improvement over the NVIDIA kernels.
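A hedged sketch of the row-partitioned multi-GPU step using MPI and CUDA (illustrative names; the local kernel is elided and uneven last blocks are ignored for brevity).

```cuda
// Each MPI rank owns a contiguous block of rows, computes its slice of
// y = A*x on its GPU, then all ranks exchange slices so the full vector
// is available for the next iteration (e.g. the next PageRank step).
#include <mpi.h>
#include <cuda_runtime.h>

void distributed_spmv_step(int rank, int nRanks, int nRows,
                           float* d_yLocal, float* h_yFull)
{
    int rowsPerRank = nRows / nRanks;              // assume nRows divisible, for brevity

    // Local SpMV over this rank's rows (single-GPU kernel, elided):
    // spmv_csr_scalar<<<grid, block>>>(rowsPerRank, ..., d_yLocal);

    // Copy the local slice back over PCI-Express (the bandwidth bottleneck the
    // slide points out), then gather everyone's slice on every rank.
    cudaMemcpy(h_yFull + rank * rowsPerRank, d_yLocal,
               rowsPerRank * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  h_yFull, rowsPerRank, MPI_FLOAT, MPI_COMM_WORLD);
}
```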

Copyright 2011, Data Mining Research Laboratory Outline
– Motivation and Background
– Single- and Multi-GPU SpMV Optimizations
– Automatic Parameter Tuning and Performance Modeling
– Conclusions

Copyright 2011, Data Mining Research Laboratory Automatic Parameter Tuning
Two parameters in our approach:
1. Number of tiles: when do we stop partial tiling? Stop when tiling brings no further memory-reuse benefit.
2. Workload size in a tile: how do we partition a tile?

Copyright 2011, Data Mining Research Laboratory Automatic Parameter Tuning
Performance modeling
– Offline component: map a workload shape to a performance number.
  Enables parameter search space pruning.
  Dataset-independent; a one-time cost per hardware platform.
– Online component: given all the workloads of a matrix tile, take their average performance as the predicted performance.
(Illustration omitted: four warps on a streaming multiprocessor, with workload shapes 1x64, 2x32, 32x2, and 64x1 measured at 6, 4, 3, and 1 GFLOPS respectively.)
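A small host-side sketch of the two-part model (hypothetical names and types; the example table entries are the shapes and GFLOPS values quoted on the slide).

```cuda
// Offline: a one-time, dataset-independent calibration maps a workload shape
// (rows x columns, e.g. 1x64, 2x32, 32x2, 64x1) to measured GFLOPS.
// Online: a tile is predicted as the average over its workloads' table entries.
#include <map>
#include <utility>
#include <vector>

using Shape = std::pair<int, int>;            // (rows, cols) of one workload
std::map<Shape, double> offline_gflops = {    // filled once per GPU by benchmarking
    {{1, 64}, 6.0}, {{2, 32}, 4.0}, {{32, 2}, 3.0}, {{64, 1}, 1.0}
};

double predict_tile_gflops(const std::vector<Shape>& workloads)
{
    double sum = 0.0;
    for (const Shape& w : workloads) sum += offline_gflops.at(w);
    return workloads.empty() ? 0.0 : sum / workloads.size();
}
```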

Copyright 2011, Data Mining Research Laboratory Automatic Parameter Tuning Results
The performance model can also be used to predict SpMV performance directly.

Copyright 2011, Data Mining Research Laboratory Outline
– Motivation and Background
– Single- and Multi-GPU SpMV Optimizations
– Automatic Parameter Tuning and Performance Modeling
– Conclusions

Copyright 2011, Data Mining Research Laboratory Take Home Messages
Architecture-conscious SpMV optimizations for graph mining kernels (e.g. PageRank, RWR, HITS) on the GPU
– Highlight I: orders-of-magnitude improvement over the best CPU implementations.
– Highlight II: 2x improvement over industrial-strength implementations from NVIDIA and others.
PCI-Express bandwidth is the limiting factor for processing large graphs
– Multiple GPUs can handle large web-graph data.
Auto-tuning leads to a non-parametric solution!
– It also enables accurate performance modeling.

Copyright 2011, Data Mining Research Laboratory Acknowledgment: grants from NSF
– CAREER-IIS
– RI-CNS
– CCF
– IIS
Thank you for your attention! Questions?

Copyright 2011, Data Mining Research Laboratory Backup slides

Copyright 2011, Data Mining Research Laboratory SpMV Kernel on Unstructured (Non-Power-Law) Matrices

Copyright 2011, Data Mining Research Laboratory Performance Prediction

Copyright 2011, Data Mining Research Laboratory Dataset

Copyright 2011, Data Mining Research Laboratory Hardware Details
– CPU: AMD Opteron X2 with 8 GB RAM
– GPU: NVIDIA Tesla C1060 with 30 multiprocessors, 240 cores, and 4 GB global memory
– MPI-based cluster with 1 CPU and 2 GPUs per node
– CUDA version 3.0

Copyright 2011, Data Mining Research Laboratory Sorting Cost
Sorting is used to restructure the columns and rows of the matrix. When the row or column lengths follow a power-law distribution, they can be sorted very efficiently (see the sketch below):
– The lengths in the long tail of the power-law distribution can be sorted with a bucket sort in linear time.
– Only the remaining lengths need a comparison sort.
Furthermore, this cost is amortized by the iterative calls to the SpMV kernel.
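A host-side sketch of that sorting step (illustrative; the cutoff and function names are assumptions, not the authors' code).

```cuda
// Rows with length <= tailCutoff (the power-law tail) are ordered by a linear-time
// bucket sort; only the few very long rows fall back to a comparison sort.
#include <algorithm>
#include <vector>

std::vector<int> sort_rows_by_length(const std::vector<int>& rowLen, int tailCutoff)
{
    std::vector<std::vector<int>> buckets(tailCutoff + 1);
    std::vector<int> heavy;                       // the few rows longer than the cutoff
    for (int r = 0; r < (int)rowLen.size(); ++r)
        (rowLen[r] <= tailCutoff ? buckets[rowLen[r]] : heavy).push_back(r);

    std::sort(heavy.begin(), heavy.end(),
              [&](int a, int b) { return rowLen[a] > rowLen[b]; });

    std::vector<int> order(heavy);                // longest rows first,
    for (int len = tailCutoff; len >= 0; --len)   // then the bucketed tail, descending
        order.insert(order.end(), buckets[len].begin(), buckets[len].end());
    return order;
}
```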

Copyright 2011, Data Mining Research Laboratory Parameter search space pruning for workload size
– Lower bound: the longest row in a tile, because a single row cannot be partitioned.
– Upper bound: the total number of non-zeros in a tile divided by the maximum number of available warps (960 on the Tesla GPU), so that the available resources are fully utilized.
– The workload size must be an integer multiple of the longest row, so the first workload is a rectangle.
A sketch of these rules follows.
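A small helper (hypothetical, host side) that enumerates the pruned candidate workload sizes under the three rules above; the 960-warp figure is the slide's Tesla number.

```cuda
#include <vector>

std::vector<long long> candidate_workload_sizes(long long nnzInTile, long long longestRow,
                                                long long maxWarps = 960)
{
    long long lower = longestRow;             // a single row cannot be split
    long long upper = nnzInTile / maxWarps;   // keep every available warp busy
    if (upper < lower) upper = lower;         // degenerate tiles: keep at least one candidate

    std::vector<long long> candidates;
    // Only integer multiples of the longest row are considered,
    // so the first workload is always a full rectangle.
    for (long long s = lower; s <= upper; s += longestRow)
        candidates.push_back(s);
    return candidates;
}
```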

Copyright 2011, Data Mining Research Laboratory Data Mining Applications
Given a directed graph G = (V, E) with adjacency matrix A:
PageRank
– W is the row normalization of A; c = 0.85; U is an n-by-n matrix with all elements set to 1/n.
Random Walk with Restart (RWR): given a query node i, compute the relevance score of every other node to node i.
– W is the column normalization of A; c = 0.9; the i-th element of the restart vector is 1, all others are 0.
HITS: each web page is assigned an authority score and a hub score.
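The equations on this slide were images in the original transcript; the standard formulations they correspond to, written in the slide's notation, are reconstructed below (a reconstruction, not copied from the source).

```latex
% Standard formulations (reconstruction; the slide's own equations were images).
\mathbf{p}   = \bigl(c\,W^{T} + (1-c)\,U\bigr)\,\mathbf{p}           % PageRank: W row-normalized, c = 0.85
\mathbf{r}_i = c\,W\,\mathbf{r}_i + (1-c)\,\mathbf{e}_i              % RWR: W column-normalized, c = 0.9, e_i the restart vector
\mathbf{a}   = A^{T}\,\mathbf{h}, \qquad \mathbf{h} = A\,\mathbf{a}  % HITS: authority and hub scores, renormalized each iteration
```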

Copyright 2011, Data Mining Research Laboratory PageRank

Copyright 2011, Data Mining Research Laboratory Random Walk with Restart

Copyright 2011, Data Mining Research Laboratory HITS

Copyright 2011, Data Mining Research Laboratory Limitations of Previous Work
NVIDIA's SpMV library is based on different storage formats of the matrix A.
– CSR: CSR (scalar) kernel, CSR-vector kernel, and the optimized CSR-vector of Baskaran et al.
– CSR (scalar): imbalanced workload among threads, non-coalesced memory accesses.
– CSR-vector: on many short rows, most threads of a warp are wasted.

Copyright 2011, Data Mining Research Laboratory Limitation of Previous Work – COO kernel
– Each warp works on one interval of non-zeros; warps run in parallel.
– Within one warp, threads perform a binary reduction and must check whether the two operands come from the same row.
– COO: thread divergence and low thread-level parallelism.

Copyright 2011, Data Mining Research Laboratory Limitation of Previous Work – ELL kernel
– Requires that row lengths be bounded by a small number k; rows shorter than k are padded with zeros (see the packing sketch below).
– The data and index matrices are stored in column-major order; each thread works on one row.
– HYB kernel: ELL + COO.
– ELL: the long rows of power-law matrices cannot be bounded by a small k.
– HYB: the ELL part covers only a small fraction of the computation, the COO part is slow, and increasing the ELL fraction introduces memory overhead.
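A host-side sketch of ELL packing (illustrative names, not the library's code) that makes the memory blow-up concrete: every row is padded out to the longest row length k and stored column-major, so a single very long power-law row inflates the whole structure.

```cuda
#include <algorithm>
#include <vector>

struct EllMatrix {
    int nRows, k;
    std::vector<float> val;   // k * nRows entries, column-major, zero-padded
    std::vector<int>   col;   // matching column indices (0 for padding)
};

EllMatrix pack_ell(int nRows, const std::vector<int>& rowPtr,
                   const std::vector<int>& colIdx, const std::vector<float>& v)
{
    int k = 0;
    for (int r = 0; r < nRows; ++r) k = std::max(k, rowPtr[r + 1] - rowPtr[r]);

    EllMatrix m{nRows, k, std::vector<float>((size_t)k * nRows, 0.0f),
                          std::vector<int>((size_t)k * nRows, 0)};
    for (int r = 0; r < nRows; ++r)
        for (int j = rowPtr[r], slot = 0; j < rowPtr[r + 1]; ++j, ++slot) {
            m.val[(size_t)slot * nRows + r] = v[j];   // column-major: coalesced device reads
            m.col[(size_t)slot * nRows + r] = colIdx[j];
        }
    return m;
}
```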