
Exploiting Recent SIMD Architectural Advances for Irregular Applications
Linchuan Chen, Peng Jiang and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, USA

Motivation - Applications
Irregular data accesses are very common:
- Graph applications
- Unstructured grids
- Particle simulations
- Sparse matrix operations
- And more ...

Motivation - SIMD Hardware Evolution
More powerful parallelism:
- Wider SIMD lanes
- Massive MIMD cores
Challenging to exploit for applications with irregular data access:
- Memory access locality
- Write conflicts

Intel Xeon Phi
Large-scale parallelism:
- 61 cores, each supporting 4 hyper-threads
Wide SIMD lanes:
- 512-bit lane width = 16 floats / 8 doubles
More flexible SIMD programming with gather/scatter:
- Enables programming for irregular memory access (see the sketch below)
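As a concrete illustration (not from the slides), a gather reads 16 arbitrarily indexed floats in one operation. The function and variable names here are hypothetical; the first-generation Xeon Phi uses the IMCI intrinsic variants, but the AVX-512 forms shown have the same shape:

```cpp
#include <immintrin.h>

// Minimal sketch of a gather-based indirect read:
// y[i] = x[idx[i]] for 16 floats at once.
void gather16(const int* idx, const float* x, float* y) {
    __m512i vi = _mm512_loadu_si512(idx);        // 16 indices
    __m512  vx = _mm512_i32gather_ps(vi, x, 4);  // x[idx[0..15]], scale = sizeof(float)
    _mm512_storeu_ps(y, vx);
}
```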

Intel Xeon Phi - Gather/Scatter Performance
- SIMD access locality is influenced by the access range
- Gather/scatter performance falls significantly below load/store once the access range exceeds 256

Contributions
A general optimization methodology:
- Based on a sparse matrix view, which makes data access pattern identification easy
The steps:
- Data locality enhancement through matrix tiling
- Data access pattern identification
- Write conflict removal at both SIMD and MIMD levels
Subclasses studied:
- Irregular reductions
- Graph algorithms
- SpMM

Sparse Matrix View of Irregular Reductions
Sequential computation steps:
1. Read node elements
2. Do computation
3. Write to the reduction array

Parallel Optimization – Irregular Reductions
Step 1: Matrix tiling
- The sparse matrix is tiled into squares (tiles)
- A tile is stored in COO format (rows, cols, vals); see the layout sketch below
- Each tile is exclusively processed by one thread in SIMD
Step 2: Access pattern identification
- Writes happen in both the row and column directions
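A possible per-tile layout, assuming the COO arrays described above; the struct and field names are assumptions, and padding/aligned allocation (needed for aligned SIMD loads) is omitted for brevity:

```cpp
#include <cstdint>
#include <vector>

// One tile after matrix tiling: non-zeros kept in three parallel COO
// arrays so row IDs, column IDs, and values can each be read with a
// single SIMD load per group of 16.
struct Tile {
    std::vector<int32_t> rows;  // local row IDs of non-zeros
    std::vector<int32_t> cols;  // local column IDs of non-zeros
    std::vector<float>   vals;  // non-zero values (edge/interaction data)
};
```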

Optimization – Irregular Reductions
Step 3: Conflict removal
Conflicts:
- SIMD level: different lanes might write to the same locations
- MIMD level: different threads might write to the same locations
Conflict removal at the SIMD level – grouping:
- Divide the non-zeros in each tile into conflict-free groups
- In each conflict-free group, every two elements have distinct row IDs and distinct column IDs
Conflict removal at the MIMD level:
- Group tiles in the same way
- In each parallel step, all threads process one conflict-free tile group
A possible grouping policy is sketched below.
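The slides do not specify the exact grouping algorithm; a minimal greedy policy that satisfies the stated property (distinct row IDs and distinct column IDs within a group) could look like this:

```cpp
#include <unordered_set>
#include <vector>

// Greedy conflict-free grouping sketch: place each non-zero into the
// first group in which both its row ID and its column ID are unused.
std::vector<std::vector<int>> group_conflict_free(
        const std::vector<int>& rows, const std::vector<int>& cols) {
    std::vector<std::vector<int>> groups;               // non-zero indices per group
    std::vector<std::unordered_set<int>> usedR, usedC;  // IDs already taken per group
    for (int e = 0; e < (int)rows.size(); ++e) {
        size_t g = 0;
        while (g < groups.size() &&
               (usedR[g].count(rows[e]) || usedC[g].count(cols[e]))) ++g;
        if (g == groups.size()) {                       // open a new group
            groups.emplace_back(); usedR.emplace_back(); usedC.emplace_back();
        }
        groups[g].push_back(e);
        usedR[g].insert(rows[e]);
        usedC[g].insert(cols[e]);
    }
    return groups;
}
```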

Optimization – Irregular Reductions
Execution (SIMD processing of a conflict-free group over the rid, cid, val arrays):
1. Load row IDs and column IDs, using aligned loads
2. Load non-zero values, using an aligned load
3. Gather node data according to row IDs
4. Gather node data according to column IDs
5. Compute in SIMD
6. Update the reduction array using row IDs, with scatter
7. Update the reduction array using column IDs, with scatter
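A minimal sketch of one such SIMD step, assuming AVX-512-style intrinsics and 64-byte-aligned arrays. The force-like computation and all names (process_group, node, red) are illustrative assumptions, not the authors' code:

```cpp
#include <immintrin.h>

// One SIMD step over a conflict-free group of 16 non-zeros in COO layout.
// rid/cid/val must be 64-byte aligned; node[] holds per-node input data,
// red[] is the reduction array. Distinct row IDs and column IDs within the
// group make the gather-update-scatter sequences collision-free.
void process_group(const int* rid, const int* cid, const float* val,
                   const float* node, float* red) {
    __m512i vr = _mm512_load_epi32(rid);            // step 1: aligned load of row IDs
    __m512i vc = _mm512_load_epi32(cid);            //         and column IDs
    __m512  vv = _mm512_load_ps(val);               // step 2: non-zero values
    __m512  na = _mm512_i32gather_ps(vr, node, 4);  // step 3: gather via row IDs
    __m512  nb = _mm512_i32gather_ps(vc, node, 4);  // step 4: gather via column IDs
    __m512  f  = _mm512_mul_ps(vv, _mm512_sub_ps(na, nb));  // step 5: illustrative compute
    __m512  ra = _mm512_i32gather_ps(vr, red, 4);   // step 6: read-modify-write via rows
    _mm512_i32scatter_ps(red, vr, _mm512_add_ps(ra, f), 4);
    __m512  rb = _mm512_i32gather_ps(vc, red, 4);   // step 7: read-modify-write via columns
    _mm512_i32scatter_ps(red, vc, _mm512_sub_ps(rb, f), 4);
}
```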

Sparse Matrix View of Graph Algorithms
Sequential computation steps:
1. Load values of source vertices
2. Load values of edges
3. Compute
4. Update destination vertices

Optimization – Graph Algorithms
Locality enhancement:
- Tiling, same as for irregular reductions
Access pattern:
- Different from irregular reductions: writes go in only one direction
- This influences the grouping policy
Conflict removal:
- Conflict-free grouping only needs to consider column IDs (see the relaxation sketch below)
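For instance, a Bellman-Ford-style relaxation over a conflict-free group (distinct destination IDs only) could look like the following; the function and array names are assumptions, and arrays are taken to be 64-byte aligned:

```cpp
#include <immintrin.h>

// One SIMD relaxation step over a conflict-free group of 16 edges.
// src/dst are vertex IDs, w the edge weights, dist the distance array.
// Only the destinations are written, so grouping by column (destination)
// IDs alone makes the gather-min-scatter sequence safe.
void relax_group(const int* src, const int* dst, const float* w, float* dist) {
    __m512i vs = _mm512_load_epi32(src);
    __m512i vd = _mm512_load_epi32(dst);
    __m512  vw = _mm512_load_ps(w);
    __m512  ds   = _mm512_i32gather_ps(vs, dist, 4);  // read source distances
    __m512  cand = _mm512_add_ps(ds, vw);             // candidate distances
    __m512  dd   = _mm512_i32gather_ps(vd, dist, 4);  // current destination distances
    _mm512_i32scatter_ps(dist, vd, _mm512_min_ps(dd, cand), 4);  // relax
}
```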

SpMM
Sparse matrix-matrix multiplication: C = A × B
- A, B, C are in CSR format
Straightforward SIMD implementation:
- Load a vector of elements from A
- For each element from A, load a vector of elements from B
- Conduct a SIMD multiplication
- Store results to a hash table, using column IDs as keys
Drawbacks:
- Very irregular memory access
- Needs unpacking of vector data from B

Optimization – SpMM
Storage format (see the container sketch below):
- Partition the sparse matrix into 4 x 4 tiles
- Each tile containing non-zero(s) is stored as a dense matrix
- Tiles containing non-zero(s) are indexed in CSR format
SIMD multiplication is then performed between tiles.
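One way to realize this "CSR of dense tiles" container; the struct and field names are assumptions, not the authors' layout:

```cpp
#include <vector>

// CSR at tile granularity: each non-empty 4x4 tile is a dense 16-float
// block stored contiguously, indexed exactly like scalar CSR but with
// tile-rows and tile-columns instead of element rows and columns.
struct TiledCSR {
    std::vector<int>   rowptr;  // per tile-row offsets into colidx/tiles
    std::vector<int>   colidx;  // tile-column index of each stored tile
    std::vector<float> tiles;   // 16 floats per tile, row-major, contiguous
};
```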

Tile-wise Multiplication: Conflict-free SIMD Processing
In each iteration:
1. Load Va from A; gather Vb from B using an index mapping such that elements in the same position in Va and Vb have the same color
2. Conduct a SIMD multiplication to get the result Vr
3. Conduct a permutation (with gather) on Vr to align its elements with the corresponding elements (same color) in the partial reduction result Vc, and add Vr to Vc
4. Locate the result tile Tc in the hash table for the corresponding row
5. Conduct a SIMD add to reduce Vc into Tc
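A hedged sketch of the idea for two 4x4 tiles held in single 512-bit registers (element (i,j) in lane 4i+j, row-major). The constant index vectors stand in for the slide's "index mapping" and permutation; in a real implementation they would be precomputed once rather than rebuilt per call:

```cpp
#include <immintrin.h>

// Accumulate the product of two 4x4 dense tiles into Vc:
// c[i][j] += sum_k a[i][k] * b[k][j], entirely in SIMD registers.
inline __m512 tile_madd(__m512 Va, __m512 Vb, __m512 Vc) {
    for (int k = 0; k < 4; ++k) {
        int ia[16], ib[16];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                ia[4 * i + j] = 4 * i + k;  // lane (i,j) gets a[i][k]
                ib[4 * i + j] = 4 * k + j;  // lane (i,j) gets b[k][j]
            }
        __m512 pa = _mm512_permutexvar_ps(_mm512_loadu_si512(ia), Va);
        __m512 pb = _mm512_permutexvar_ps(_mm512_loadu_si512(ib), Vb);
        Vc = _mm512_fmadd_ps(pa, pb, Vc);   // c[i][j] += a[i][k] * b[k][j]
    }
    return Vc;
}
```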

MIMD Level Parallelism
C = A × B:
- Treat each tile as an element
- Divide the workload among the threads based on the rows of A
- For each tile in A, scale it (in SIMD) with the elements in the corresponding row of B
- Store the scaling results to a hash table (in SIMD)
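A structural sketch of this level, assuming OpenMP and reusing tile_madd from the earlier sketch. All names are hypothetical, and per-row std::unordered_map accumulators stand in for the hash table of result tiles:

```cpp
#include <immintrin.h>
#include <unordered_map>
#include <vector>

__m512 tile_madd(__m512 Va, __m512 Vb, __m512 Vc);  // from the earlier tile sketch

// Tile-rows of A are distributed across threads; each thread owns its
// output tile-rows of C, so MIMD-level write conflicts cannot occur.
// Compile with -std=c++17 or later for aligned allocation of __m512.
void spmm_mimd(int ntilerows,
               const std::vector<int>& Arow, const std::vector<int>& Acol,
               const std::vector<__m512>& Atile,
               const std::vector<int>& Brow, const std::vector<int>& Bcol,
               const std::vector<__m512>& Btile,
               std::vector<std::unordered_map<int, __m512>>& C) {
    #pragma omp parallel for schedule(dynamic)
    for (int ti = 0; ti < ntilerows; ++ti) {
        for (int p = Arow[ti]; p < Arow[ti + 1]; ++p) {
            int tk = Acol[p];                    // tile-column in A = tile-row in B
            for (int q = Brow[tk]; q < Brow[tk + 1]; ++q) {
                int tj = Bcol[q];
                __m512& Vc = C[ti][tj];          // partial tile; zero-initialized on first use
                Vc = tile_madd(Atile[p], Btile[q], Vc);  // 4x4 SIMD multiply-add
            }
        }
    }
}
```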

Results
Platform:
- Intel Xeon Phi SE10P coprocessor
- 61 cores, 1.1 GHz, 4 hyper-threads per core
- 8 GB GDDR5
- Intel ICC 13.1.0, with -O3 enabled
Applications:
- Irregular reductions: Moldyn, Euler
- Graph algorithms: Bellman-Ford, PageRank
- SpMM: both float and double used

Irregular Reductions
Single thread: different execution approaches
- Serial: single-thread scalar, row-by-row processing
- Serial Tiling (optimal tile size): scalar, on tiled data
- SIMD Naive: row-by-row processing, in SIMD
- SIMD Tiling (ours, optimal tile size): SIMD processing, on tiled data
(Figure: per-approach speedups of 9.05x, 7.47x, 2.09x, and 2.01x are shown on the chart.)

Irregular Reductions
Overall speedup (over single-thread scalar):
- Moldyn: up to 290x speedup, with 45-3.0r, 1024 tile size, 80 threads
- Euler: up to 64x speedup, with kron_g500-logn19, 2048 tile size, 200 threads

Graph Algorithms
Single thread: different execution approaches
(Figure: the chart shows up to a 7.4x speedup.)

Graph Algorithms
MIMD + SIMD performance:
- Bellman-Ford: up to 467x speedup, with soc-Pokec, 512 tile size, 200 threads
- PageRank: up to 341x speedup, with kron_g500-logn19, 128 tile size, 244 threads

SpMM
Single-thread performance (clustered datasets)
SIMD Tiling speedups (float / double):
- Over Serial: 6.58x ~ 8.07x / 4.75x ~ 7.70x
- Over MKL: 1.5x ~ 2.89x / 1.01x ~ 1.97x

SpMM
Overall performance: max speedups over Serial (float / double)
- Our Approach (solid lines): 392.49x / 334.34x
- MKL (dashed lines): 139.68x / 156.23x

Conclusions
An effective and general methodology for SIMDizing irregular applications:
- Irregular reductions
- Graph algorithms
- SpMM
Sparse matrix view:
- Makes optimization easy
- Common steps can be applied to different subclasses
Performance:
- High efficiency in both SIMD and MIMD utilization

Thanks for your attention! Q?
Linchuan Chen chen.1996@osu.edu
Peng Jiang jiang.952@osu.edu
Gagan Agrawal agrawal@cse.ohio-state.edu