Peng Jiang, Linchuan Chen, and Gagan Agrawal


Reusing Data Reorganization for Efficient SIMD Parallelization of Dynamic Irregular Applications Peng Jiang, Linchuan Chen, and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University, USA

Motivation - Applications
Irregular data access is very common:
- Static irregular: Bellman-Ford, sparse matrix operations, ...
- Dynamic irregular: vertex-centric BFS, SSSP, ...
- Adaptive dynamic irregular: particle simulation, ...

Motivation - SIMD Hardware Evolution
More powerful SIMD features:
- Wider SIMD lanes
- Gather/scatter operations enable irregular memory accesses
- Mask data types and operations enable computation on selected SIMD lanes
Challenging to exploit for applications with irregular memory access:
- Memory access locality
- Write conflicts
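The masked gather/scatter semantics the slide refers to can be illustrated with a scalar emulation (a sketch only; real code would use intrinsics such as AVX-512's `_mm512_mask_i32gather_ps` - the function names below are illustrative):

```python
def masked_gather(memory, indices, mask, default=0.0):
    """Load memory[indices[i]] into lane i wherever mask[i] is set."""
    return [memory[idx] if m else default for idx, m in zip(indices, mask)]

def masked_scatter(memory, indices, mask, values):
    """Store values[i] to memory[indices[i]] wherever mask[i] is set."""
    for idx, m, v in zip(indices, mask, values):
        if m:
            memory[idx] = v

mem = [10.0, 20.0, 30.0, 40.0]
lanes = masked_gather(mem, [3, 0, 2, 1], [True, False, True, True])
print(lanes)  # [40.0, 0.0, 30.0, 20.0]
```

The mask is what lets a vectorized loop operate on only a subset of lanes, which is exactly what irregular applications need when only some edges are active.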

Intel Xeon Phi - Gather/Scatter Performance
- SIMD access locality is influenced by the access range
- Gather/scatter performance falls significantly below load/store at access ranges above 256

Our Previous Work
Exploiting Recent SIMD Architectural Advances for Irregular Applications (CGO'16)
A general optimization methodology:
- Based on a sparse matrix view, which makes data access patterns easy to identify
Three steps (tiling and grouping):
- Data locality enhancement through matrix tiling
- Data access pattern identification
- Write conflict removal at both the SIMD and MIMD levels
Aims at static irregular applications: Bellman-Ford, PageRank, SPMM
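A simplified sketch of the tiling-and-grouping idea (hedged: the details of the CGO'16 method differ, and the tile size and SIMD width here are arbitrary). Edges are first tiled by destination range to improve locality, then packed into groups of up to SIMD width such that no two edges in a group write the same destination, so a group can be processed in one SIMD step without write conflicts:

```python
SIMD_WIDTH = 4
TILE = 8  # destinations per tile (assumed tile size)

def tile_and_group(edges):
    """edges: list of (src, dst). Returns conflict-free groups of edges."""
    tiles = {}
    for e in edges:
        tiles.setdefault(e[1] // TILE, []).append(e)  # tile by destination
    groups = []
    for _, tile_edges in sorted(tiles.items()):
        pending = list(tile_edges)
        while pending:
            group, used_dsts, rest = [], set(), []
            for e in pending:
                # pack edge unless its destination is already in this group
                if e[1] not in used_dsts and len(group) < SIMD_WIDTH:
                    group.append(e)
                    used_dsts.add(e[1])
                else:
                    rest.append(e)
            groups.append(group)
            pending = rest
    return groups

groups = tile_and_group([(0, 1), (2, 1), (3, 5), (4, 9)])
# Every group is conflict-free: no repeated destination inside a group.
```

Because destinations within a group are distinct, the reduction writes of a group can be done with a SIMD scatter with no lane writing the same location twice.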

Sparse Matrix View of Irregular Reductions
Sequential computation steps:
- Read node elements
- Do the computation
- Write to the reduction array
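The three sequential steps above can be sketched as a minimal scalar irregular reduction over an edge list (the interaction term is illustrative; each edge of the sparse matrix reads its endpoints and accumulates into a reduction array through indirect, possibly conflicting writes):

```python
def irregular_reduction(edges, node_vals, n_nodes):
    reduction = [0.0] * n_nodes
    for u, v in edges:
        # 1. read node elements
        a, b = node_vals[u], node_vals[v]
        # 2. do the computation (a simple interaction term for illustration)
        contrib = a * b
        # 3. write to the reduction array (indirect accesses)
        reduction[u] += contrib
        reduction[v] += contrib
    return reduction

res = irregular_reduction([(0, 1), (1, 2)], [1.0, 2.0, 3.0], 3)
print(res)  # [2.0, 8.0, 6.0]
```

The indirect writes in step 3 are what make naive vectorization unsafe: two SIMD lanes may target the same reduction slot.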

Challenges
Dynamic irregular applications:
- In some vertex-centric graph applications (SSSP, BFS), the topology of the graph is fixed, but the memory access patterns differ across iterations
Adaptive irregular applications:
- In particle simulation (molecular dynamics), edges can be removed and new edges added; the edge set changes adaptively
Pre-reorganizing the data alone is not effective, because the memory access patterns change

Contributions
An approach to efficiently vectorizing (adaptive) dynamic irregular applications:
- Based on the tiling and grouping method
- Reuses the data reorganization
The steps:
- Pre-reorganize the data with tiling and grouping
- Manage the active memory accesses in the reorganized data during execution
- Apply SIMD to the computation on the managed data
Subsets studied:
- Vertex-centric graph algorithms with dynamic memory access
- Particle simulation with adaptive dynamic memory access

Utilizing SIMD in Vertex-Centric Graph Processing
Dynamic memory access pattern (vertex-centric SSSP):
- Only the frontier vertices are active in each iteration
- The memory access pattern changes across iterations
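A scalar sketch of frontier-driven vertex-centric SSSP makes the dynamic pattern concrete: only edges out of frontier vertices are touched, and the frontier is different every iteration (names and the adjacency layout are illustrative):

```python
import math

def frontier_sssp(n, adj, src):
    """adj: dict vertex -> list of (neighbor, weight) pairs."""
    dist = [math.inf] * n
    dist[src] = 0.0
    frontier = {src}
    while frontier:
        next_frontier = set()
        for u in frontier:                  # only active (frontier) vertices
            for v, w in adj.get(u, []):     # dynamic, irregular edge accesses
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    next_frontier.add(v)    # relaxed vertices form next frontier
        frontier = next_frontier
    return dist

d = frontier_sssp(4, {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: [(3, 1.0)]}, 0)
print(d)  # [0.0, 1.0, 2.0, 3.0]
```

The set of edges visited in one iteration is exactly what the index data structure of the following slides must track.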

Utilizing SIMD in Vertex-Centric Graph Processing
Our approach:
- Manage the active memory accesses in an index data structure
- Reuse the tiled and grouped data to apply SIMD

Utilizing SIMD in Vertex-Centric Graph Processing
The index data structure stores:
- The ordered edges (Index)
- The grouped edges (Xcoord, Ycoord)
- A mapping from ordered edges to grouped edges (Pos)

Utilizing SIMD in Vertex-Centric Graph Processing
Search-and-Activate:
- The active edges are located in the ordered index with binary search
- The positions of these edges in the grouped edge array are obtained from the mapping
- These edges are activated in the grouped array, ready for SIMD processing
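The Search-and-Activate step can be sketched as follows (the exact layout of Index, Pos, and the active mask is an assumption based on the slide): each active edge id is located in the sorted Index by binary search, and Pos maps it to its slot in the grouped edge array, where it is flagged active.

```python
import bisect

def search_and_activate(index, pos, active_edges, n_grouped):
    """index: sorted edge ids; pos[i]: slot of index[i] in the grouped array."""
    active_mask = [False] * n_grouped
    for e in active_edges:
        i = bisect.bisect_left(index, e)      # binary search in ordered index
        if i < len(index) and index[i] == e:
            active_mask[pos[i]] = True        # activate in the grouped array
    return active_mask

# Ordered edges 10, 20, 30 live at grouped slots 2, 0, 1 respectively.
mask = search_and_activate([10, 20, 30], [2, 0, 1], [20, 30], 3)
print(mask)  # [True, True, False]
```

The resulting mask lines up with the grouped (conflict-free) layout, so the SIMD kernel can run over the pre-reorganized data using masked operations.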

SIMD Execution for Adaptive Dynamic Applications
- The active edges in these applications change adaptively (e.g., molecular dynamics)
- Only 5% of the total edges change between rebuilds

SIMD Execution for Adaptive Dynamic Applications
Our approach:
- Pre-reorganize the initial edges (by tiling and grouping)
- Every time the edges are rebuilt, deactivate the removed edges and add the newly built ones
- Incrementally maintain the grouped edges
- Apply SIMD to the computation

SIMD Execution for Adaptive Dynamic Applications
The index data structure collection stores multiple:
- Ordered edge arrays (Index)
- Grouped edge arrays (Xcoord, Ycoord)
- Mappings from ordered edges to grouped edges (Pos)
A new collection entry is added whenever the edges are rebuilt

SIMD Execution for Adaptive Dynamic Applications
Search-and-Update:
- The active edges are searched in the multiple ordered indices with binary search
- If an edge exists, its position is obtained from the mapping and it is activated in the grouped edge array
- All newly built edges are added to a new ordered index, grouped, and given their own mapping
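The Search-and-Update step can be sketched like this (the structure of a collection entry is an assumption; here each entry holds a sorted index, a Pos mapping, and an active mask). Surviving edges are looked up across the existing ordered indices and reactivated; edges found nowhere are appended as a new collection entry:

```python
import bisect

def search_and_update(collections, rebuilt_edges):
    """collections: list of dicts with 'index' (sorted ids), 'pos', 'mask'."""
    for c in collections:
        c["mask"] = [False] * len(c["mask"])   # deactivate everything first
    new_edges = []
    for e in rebuilt_edges:
        found = False
        for c in collections:
            i = bisect.bisect_left(c["index"], e)
            if i < len(c["index"]) and c["index"][i] == e:
                c["mask"][c["pos"][i]] = True  # reactivate a surviving edge
                found = True
                break
        if not found:
            new_edges.append(e)                # edge seen for the first time
    if new_edges:                              # new entry for rebuilt edges
        new_edges.sort()
        collections.append({"index": new_edges,
                            "pos": list(range(len(new_edges))),
                            "mask": [True] * len(new_edges)})
    return collections

cols = [{"index": [10, 20], "pos": [0, 1], "mask": [True, True]}]
search_and_update(cols, [20, 35])
# Edge 10 is deactivated, 20 stays active, 35 goes into a new collection entry.
```

Since only a small fraction of edges changes per rebuild, most of the tiled and grouped data is reused as-is, which is the point of the incremental scheme.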

Results
Platform:
- Intel Xeon Phi SE10P coprocessor: 61 cores, 1.1 GHz, 4 hyper-threads per core, 8 GB GDDR5
- Intel ICC 13.1.0, -O3 enabled
Applications:
- Graph algorithms: BFS, SSSP, SSWP, WCC, TC
- Molecular dynamics: Moldyn, MiniMD

Graph Algorithms
Single-thread performance, speedup from SIMD (figure): 7.19x, 4.23x

Molecular Dynamics
Overhead of neighbor rebuilding (time unit: seconds)
- Even faster than Non-Group, because neighbor rebuilding is processed in tiles

Molecular Dynamics
Overall performance with a single thread (figure): 4.43x, 3.44x

Conclusions
- An effective approach for SIMDizing (adaptive) dynamic irregular applications: vertex-centric graph processing and particle simulation
- Efficient management of dynamic memory accesses: the index data structure and the Search-and-Activate procedure
- Performance: high efficiency in SIMD utilization

Thanks for your attention! Q?
Peng Jiang jiang.952@osu.edu
Linchuan Chen chen.1996@osu.edu
Gagan Agrawal agrawal@cse.ohio-state.edu