Structure-driven Optimizations for Amorphous Data-parallel Programs


Structure-driven Optimizations for Amorphous Data-parallel Programs
Mario Méndez-Lojo¹, Donald Nguyen¹, Dimitrios Prountzos¹, Xin Sui¹, M. Amber Hassaan¹, Milind Kulkarni², Martin Burtscher¹, Keshav Pingali¹
¹ The University of Texas at Austin (USA)   ² Purdue University (USA)

Irregular algorithms
– Operate on pointer-based data structures like graphs: mesh refinement, minimum spanning tree, max-flow, …
– Plenty of available parallelism [Kulkarni et al., PPoPP’09]
– Baseline Galois system [Kulkarni et al., PLDI’07]
  – uses speculation to exploit this parallelism
  – may have high overheads for some algorithms
– Solution explored in the paper: exploit algorithmic structure to reduce the overheads of the baseline system
– We will show:
  – common algorithmic structures
  – optimizations that exploit those structures
  – performance results

Operator formulation of algorithms
– Algorithm = repeated application of an operator to a graph
  – active node: a node where computation is needed
  – activity: the application of the operator to an active node; may add/remove nodes from the graph
  – neighborhood: the set of nodes and edges read/written to perform the activity; can be distinct from the node’s neighbors in the graph
– Focus: algorithms in which the order of activities does not matter
– Amorphous data-parallelism: parallel execution of activities, subject to neighborhood constraints
[Figure: graph with active nodes i1…i5 and their neighborhoods highlighted]
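To make the operator formulation concrete, here is a minimal sequential sketch in Java. The Operator interface and run loop are illustrative stand-ins, not the actual Galois API:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Minimal sketch of the operator formulation: an algorithm is the
    // repeated application of an operator to active nodes in a workset.
    interface Operator<N> {
        // Apply the operator at an active node; the activity may create new work.
        void apply(N active, Deque<N> workset);
    }

    class OperatorLoop {
        static <N> void run(Deque<N> initialWork, Operator<N> op) {
            Deque<N> workset = new ArrayDeque<>(initialWork);
            while (!workset.isEmpty()) {
                N active = workset.poll();  // unordered: any active node will do
                op.apply(active, workset);  // activity touches only its neighborhood
            }
        }
    }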

Delaunay mesh refinement
– Iterative refinement to remove badly shaped triangles:

    add initial bad triangles to workset
    while workset is not empty:
        pick a bad triangle
        find its cavity
        retriangulate the cavity
        add new bad triangles to workset

– Multiple valid solutions
– Parallelism:
  – bad triangles whose cavities do not overlap can be processed in parallel
  – the parallelism depends on runtime values, so compilers cannot find it
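As a concrete sequential rendering of this pseudocode, here is a Java sketch; Mesh, Triangle, and Cavity are hypothetical types standing in for a real mesh data structure, not the actual Galois benchmark code:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Hypothetical mesh types, just enough to express the refinement loop.
    interface Triangle { boolean isBad(); }
    interface Cavity {}
    interface Mesh {
        List<Triangle> badTriangles();
        boolean contains(Triangle t);
        Cavity cavityAround(Triangle bad);       // reads the neighborhood
        List<Triangle> retriangulate(Cavity c);  // rewrites it, returns new triangles
    }

    class Refinement {
        static void refine(Mesh mesh) {
            Deque<Triangle> workset = new ArrayDeque<>(mesh.badTriangles());
            while (!workset.isEmpty()) {
                Triangle bad = workset.poll();
                if (!mesh.contains(bad))
                    continue;  // already removed by an earlier retriangulation
                Cavity cavity = mesh.cavityAround(bad);
                for (Triangle t : mesh.retriangulate(cavity))
                    if (t.isBad())
                        workset.add(t);  // new work is discovered at runtime
            }
        }
    }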

Baseline execution model
– Parallel execution model:
  – shared memory
  – optimistic execution of Galois iterators
– Implementation: threads get active nodes from the workset and apply the operator to them
– Neighborhood independence:
  – each node/edge has an associated token
  – graph operations acquire tokens on the nodes they read/write
  – token owned by another thread → conflict → activity rolled back
– A variety of software TLS/TM
[Figure: threads running “main() … for node: workset …” against a concurrent graph with active nodes i1…i5]
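The token discipline can be pictured with plain try-locks. The sketch below is a simplified illustration, not the real Galois runtime: Element, Activity, and ConflictException are hypothetical names, and the undo log the real system keeps for rollback is omitted.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.locks.ReentrantLock;

    // Every graph element carries a token, held for the whole activity;
    // failing to acquire one signals a conflict and the activity rolls back.
    class Element {
        final ReentrantLock token = new ReentrantLock();
    }

    class ConflictException extends RuntimeException {}

    class Activity {
        private final Deque<Element> acquired = new ArrayDeque<>();

        void acquire(Element e) {
            if (!e.token.tryLock())            // token owned by another thread?
                throw new ConflictException(); // -> conflict -> roll back and retry
            acquired.push(e);
        }

        void releaseAll() {                    // on commit or abort
            while (!acquired.isEmpty())
                acquired.pop().token.unlock();
        }
    }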

Sources of overhead
– Dynamic assignment of work: the centralized workset requires synchronization
– Enforcing neighborhood constraints: acquiring/releasing tokens on the neighborhood
– Copying data for rollbacks: needed when an activity modifies a graph element
– Aborted activities: the work is wasted, plus the activity must be rolled back
[Figure: activity timeline marking reads (R) and writes (W) over the course of an activity]

Proposed optimizations
– The baseline execution model is very general
  – many irregular algorithms do not need its full generality
  – “optimize the common case”
– Identify general-purpose optimizations and evaluate their performance impact
– Optimizations:
  – cautious
  – one-shot
  – iteration coalescing

Cautious
– Algorithmic structure: the operator reads all elements of its neighborhood before modifying any of them
  – conflicts are detected before any modification occurs
– Overheads eliminated:
  – enforcing neighborhood constraints: token acquisition is unnecessary after the first modification
  – copying data for rollbacks
– Examples: Delaunay refinement, Boruvka minimum spanning tree, etc.
[Figure: activity timeline in which all reads (R) precede all writes (W)]
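Building on the Element/Activity sketch above, a cautious operator factors into a read-only phase that acquires every token, followed by a write phase that can no longer abort; readNeighborhood and write are hypothetical helpers:

    import java.util.List;

    // Sketch of a cautious operator: all tokens are acquired during the
    // read-only phase, so once the first write happens the activity can no
    // longer conflict and needs no undo copies.
    class CautiousOperator {
        void apply(Element active, Activity act) {
            List<Element> hood = readNeighborhood(active, act);
            // Past this point the activity cannot abort:
            for (Element e : hood)
                write(e);  // no backup copies needed
        }

        List<Element> readNeighborhood(Element active, Activity act) {
            List<Element> hood = List.of(active); // placeholder: a real operator walks the graph
            for (Element e : hood)
                act.acquire(e);  // the only place a conflict can arise
            return hood;
        }

        void write(Element e) { /* modify the graph element */ }
    }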

One-shot
– Algorithmic structure: the neighborhood can be predicted before the activity begins
– Overheads eliminated:
  – enforcing neighborhood constraints: tokens only need to be acquired when the activity starts
  – copying data for rollbacks
  – aborted activities waste little computation
– Examples: preflow-push, survey propagation, stencil codes like Jacobi, etc.
[Figure: activity timeline in which all tokens are acquired before any reads (R) or writes (W)]
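A one-shot operator goes one step further: the neighborhood is computable from the active node alone, so every token is claimed before any work is done, and an abort at that point wastes almost nothing. Again a hedged sketch using the Element/Activity types from above, with predictNeighborhood and write as hypothetical helpers:

    import java.util.List;

    // Sketch of a one-shot operator: the whole neighborhood is known up
    // front, so the only possible abort point comes before any computation.
    class OneShotOperator {
        void apply(Element active, Activity act) {
            List<Element> hood = predictNeighborhood(active); // known up front
            for (Element e : hood)
                act.acquire(e);  // an abort here costs almost nothing
            for (Element e : hood)
                write(e);        // conflict-free, no rollback copies
        }

        // Hypothetical: e.g. a stencil's neighborhood is fixed by the grid.
        List<Element> predictNeighborhood(Element active) { return List.of(active); }
        void write(Element e) { /* modify the graph element */ }
    }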

Iteration coalescing
– Iteration coalescing = data-centric loop chunking
  – place new active nodes in thread-local worksets
  – release tokens only on abort/commit
– Algorithmic structure:
  – activities generate new active nodes
  – the same token is acquired many times across related activities
– Benefits:
  – dynamic assignment of work: less contention
  – enforcing neighborhood constraints: locality, since the thread probably already owns the token
– Drawback: the number of tokens held by a thread increases, which raises the conflict ratio
[Figure: two activity timelines (reads R, writes W) coalesced into a single chunk]
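A sketch of iteration coalescing, reusing the Element/Activity types from above: newly created active nodes go into a thread-local workset, and tokens are released once per chunk rather than once per activity, so related activities reuse tokens the thread already owns. CoalescedOperator and ChunkRunner are hypothetical names:

    import java.util.ArrayDeque;
    import java.util.Deque;

    interface CoalescedOperator {
        // May push newly activated nodes onto the thread-local workset.
        void apply(Element active, Activity act, Deque<Element> local);
    }

    class ChunkRunner {
        void runChunk(Element seed, CoalescedOperator op) {
            Deque<Element> local = new ArrayDeque<>(); // thread-local workset
            Activity chunk = new Activity();           // tokens span the whole chunk
            local.add(seed);
            try {
                while (!local.isEmpty())
                    op.apply(local.poll(), chunk, local);
            } finally {
                chunk.releaseAll();  // released only on abort/commit of the chunk
            }
        }
    }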

Evaluation
– Experiments on Niagara (8 cores, 2 threads/core)
– Average % improvement over baseline:
– Delaunay refinement:
  – cautious → 15%
  – cautious + coalescing → 18%
  – one-shot not applicable
  – max speedup: 8.4x
– Boruvka (minimum spanning tree):
  – cautious → 22%
  – one-shot → 8%
  – coalescing has no impact
  – max speedup: 2.7x

Evaluation (continued)
– Preflow-push (max flow):
  – cautious → 33%
  – one-shot → 44%
  – one-shot + coalescing → 59%
  – max speedup: 1.12x
– Survey propagation (SAT):
  – baseline times out
  – one-shot → 28% over cautious
  – max speedup: 1.04x

Conclusions
– There is structure in irregular algorithms
– Our optimizations exploit this structure:
  – cautious
  – one-shot
  – iteration coalescing
– The evaluation confirms the importance of reducing the overheads of speculation
– Other optimizations waiting to be discovered?

Thank you! धन्यवाद!
Slides available at