Synthesizing Concurrent Graph Data Structures: a Case Study
Roman Manevich, Rashid Kaleem, Keshav Pingali
University of Texas at Austin

vision
- Problem: how to utilize parallel hardware
- Programming model for parallel applications
  - High-level language for parallelism
  - Program in terms of sequential semantics
  - Choose tuning parameters for better performance
- Decouple semantics from implementation
  - Compiler synthesizes parallel code
- Correctness guarantees
  - Avoids usual pitfalls: deadlocks, data races, etc.
  - For any value of tuning parameters

this talk
The vision above, instantiated as follows:
- Parallelizing graph algorithms
- Implementing concurrent graph data structures
- Relational algebra
- Relation decomposition and tiling
- Autograph generates Java code
- Linearizability
- Speculation support: abstract locks + undos

context
Graph algorithms are ubiquitous: computational biology, social networks, computer graphics.

organization
- Speculative parallelism background
  - Speculative parallelization via Galois
  - Data structures for speculative parallelism
- Autograph
  - Specifying relational data structures
  - Optimizations
- Empirical evaluation
  - Outperform library data structures by up to 2x

minimum spanning tree problem
(figure: an example weighted undirected graph with nodes a-g)

(figure: the minimum spanning tree of the example graph)

Boruvka’s algorithm
Build MST bottom-up:
  repeat {
    pick arbitrary node ‘a’
    merge with lightest neighbor ‘lt’
    add edge ‘a-lt’ to MST
  } until graph is a single node
(figure: the example graph before and after one step, in which ‘a’ is merged with its lightest neighbor to form the contracted node ‘a,c’)
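The contraction loop above can be sketched directly in Java. This is a minimal sequential rendering of the slide's pseudocode on an adjacency-map graph, not Autograph output or the Galois implementation; the class and method names are illustrative.

```java
import java.util.*;

// Sequential sketch of the Boruvka variant on the slide: repeatedly pick a
// node, merge it with its lightest neighbor, and record that edge in the MST.
// The graph is an undirected adjacency map kept symmetric.
public class BoruvkaSketch {
    // adj.get(u).get(v) = weight of edge u-v
    static Map<Integer, Map<Integer, Integer>> adj = new HashMap<>();

    static void addEdge(int u, int v, int w) {
        adj.computeIfAbsent(u, k -> new HashMap<>()).merge(v, w, Math::min);
        adj.computeIfAbsent(v, k -> new HashMap<>()).merge(u, w, Math::min);
    }

    public static void main(String[] args) {
        // small example graph (distinct weights)
        int[][] edges = {{0,1,5},{0,2,2},{1,3,4},{2,3,7},{3,4,1},{4,5,6}};
        for (int[] e : edges) addEdge(e[0], e[1], e[2]);

        long mstWeight = 0;
        while (adj.size() > 1) {
            int a = adj.keySet().iterator().next();        // pick arbitrary node 'a'
            int lt = -1, minW = Integer.MAX_VALUE;
            for (Map.Entry<Integer, Integer> e : adj.get(a).entrySet()) {
                if (e.getValue() < minW) { minW = e.getValue(); lt = e.getKey(); }
            }
            if (lt == -1) { adj.remove(a); continue; }     // isolated node (defensive)
            mstWeight += minW;                             // add edge 'a-lt' to MST
            // merge 'lt' into 'a': redirect lt's edges to a, keeping lighter parallel edges
            for (Map.Entry<Integer, Integer> e : adj.remove(lt).entrySet()) {
                int n = e.getKey(), w = e.getValue();
                adj.get(n).remove(lt);
                if (n != a) {
                    adj.get(a).merge(n, w, Math::min);
                    adj.get(n).merge(a, w, Math::min);
                }
            }
        }
        System.out.println("MST weight = " + mstWeight);   // 18 for this input
    }
}
```

With distinct edge weights, the lightest edge incident to any node (or contracted component) always belongs to the MST by the cut property, which is why merging with the lightest neighbor is safe.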

parallelism in Boruvka
Same pseudocode as above; the figure shows the example graph again, asking which merge steps could be performed in parallel.

non-conflicting iterations
Same pseudocode as above; the figure highlights two merges whose neighborhoods do not overlap, so they can proceed in parallel.

non-conflicting iterations
Same pseudocode as above; the figure shows the graph after both non-conflicting merges, with contracted nodes ‘a,c’ and ‘f,g’.

conflicting iterations
Same pseudocode as above; the figure highlights two merges whose neighborhoods overlap, so they conflict and cannot run in parallel.

Amorphous data-parallelism
- Algorithm = repeated application of operator to graph
  - Active node: node where computation is needed
  - Activity: application of operator to active node
  - Neighborhood: sub-graph read/written to perform activity
- Unordered algorithms: active nodes can be processed in any order
- Parallel execution of activities, subject to neighborhood constraints
- Neighborhoods unknown at compile time, so use speculation
(figure: a graph with three activities i1, i2, i3 and their neighborhoods)

optimistic parallelization in Galois
- Programming model
  - Client code has sequential semantics
  - Library of concurrent data structures
- Parallel execution model: thread-level speculation (TLS)
  - Activities executed speculatively
- Conflict detection
  - Each node has an associated exclusive lock
  - Graph operations acquire locks on accessed nodes
  - Lock owned by another thread means conflict, so the iteration rolls back
(figure: activities i1, i2, i3 and their neighborhoods)
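To make the conflict-detection bullets concrete, here is a simplified sketch of per-node exclusive ownership. This is not the Galois API; the class names, the acquire/release methods, and the exception-based conflict signal are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified illustration of the conflict-detection scheme on the slide:
// every graph node carries an exclusive lock tagged with the owning
// iteration; touching a node owned by another iteration signals a conflict.
class ConflictException extends RuntimeException {}

class SpecNode {
    private final AtomicReference<Object> owner = new AtomicReference<>(null);

    // Acquire this node for the given iteration, or flag a conflict.
    void acquire(Object iteration) {
        Object cur = owner.get();
        if (cur == iteration) return;                        // already ours
        if (cur == null && owner.compareAndSet(null, iteration)) return;
        throw new ConflictException();                       // owned by someone else
    }

    void release(Object iteration) {
        owner.compareAndSet(iteration, null);
    }
}
```

An activity would call acquire on every node in its neighborhood before touching it; on ConflictException the runtime runs the activity's undo actions, releases its locks, and retries the activity later.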

concurrent data structure contract
- Linearizability [Herlihy & Wing TOPLAS'90]
  - Method calls should appear to execute atomically
  - Synchronization w.r.t. the concrete data structure
- Support speculation [Pingali et al. PLDI'07] [Herlihy & Koskinen PPoPP'08]
  - Methods acquire abstract locks: synchronization w.r.t. the abstract data type
  - Methods should register undo actions for rollback
(Data-race freedom, deadlock freedom, non-blocking methods)
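The sketch below illustrates the second half of the contract, abstract locks plus undo logging, in plain Java. The Iteration and NodeSet classes and their methods are hypothetical names for illustration; they are not the Galois or Autograph API, and the sketch elides the ownership check and the synchronization a real implementation needs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the contract: a method records an abstract lock on
// the element it touches and registers an undo action so the iteration can
// be rolled back on conflict.
class Iteration {
    final Set<Object> abstractLocks = new HashSet<>();   // elements this iteration owns
    final Deque<Runnable> undoLog = new ArrayDeque<>();  // inverse actions, newest first

    void registerUndo(Runnable undo) { undoLog.push(undo); }

    void rollback() {
        while (!undoLog.isEmpty()) undoLog.pop().run();  // undo in reverse order
        abstractLocks.clear();
    }
}

class NodeSet {
    private final Set<Object> nodes = new HashSet<>();

    // remove(node): lock the abstract element, mutate, and log the inverse.
    boolean remove(Object node, Iteration it) {
        it.abstractLocks.add(node);                      // abstract lock on 'node'
        boolean removed = nodes.remove(node);
        if (removed) it.registerUndo(() -> nodes.add(node));
        return removed;
    }
}
```

On conflict, the runtime calls rollback(), which replays the logged inverse operations in reverse order, restoring the abstract state the iteration had observed.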

library graph data structure
(figure: the library "set of nodes" representation: an array indexed by thread id (0-3) of dummy header records, each heading a linked list of node records a, b, f, c, d, e connected by next pointers, each carrying in_flag=1)
Observation: Boruvka only removes nodes.

customized graph data structure
(figure: the customized "set of nodes": a single linked list a -> b -> f -> c -> d -> e with in_flag=1 on every node; remove(d) is about to execute)

customized graph data structure
(figure: after remove(d), node d remains linked but its flag is flipped to in_flag=0, while the other nodes keep in_flag=1)
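Since Boruvka only removes nodes, removal can be implemented as a lazy flag flip, as the two figures above show. The following is a minimal single-threaded sketch of that idea; the class and field names are illustrative, and the real generated tile additionally handles concurrency, abstract locks, and undo.

```java
// Sketch of lazy removal over a singly linked node set: remove(d) finds d's
// record and clears its in-flag instead of unlinking it.
class SetNode<T> {
    final T item;
    boolean inFlag = true;            // in_flag=1 means the element is still in the set
    SetNode<T> next;
    SetNode(T item, SetNode<T> next) { this.item = item; this.next = next; }
}

class LazyRemovalSet<T> {
    private SetNode<T> head;

    void add(T item) { head = new SetNode<>(item, head); }   // single-threaded sketch only

    boolean remove(T item) {
        for (SetNode<T> n = head; n != null; n = n.next) {
            if (n.inFlag && n.item.equals(item)) { n.inFlag = false; return true; }
        }
        return false;
    }

    boolean contains(T item) {
        for (SetNode<T> n = head; n != null; n = n.next) {
            if (n.inFlag && n.item.equals(item)) return true;
        }
        return false;
    }
}
```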

organization
- Speculative parallelism background
  - Speculative parallelization via Galois
  - Data structures for speculative parallelism
- Autograph
  - Specifying relational data structures
  - Optimizations
- Empirical evaluation
  - Outperform library data structures by up to 2x

high-level spec at a glance
Semantics:
  Structure
    nodes : rel(node)
    edges : rel(src, dst, wt)
    FD {src, dst} → {wt}
    FK src → node
    FK dst → node
  Methods
    edgeExists : contains(src, dst)
    removeNode : remove(node)
    findMin : map(src, out dst, out wt) {
      if (wt < minWeight) { lt = dst; minWeight = wt; } }
    ... other methods ...
Implementation:
  Decomposition
    nodes : Set(node)
    edges : List(
      edgesOut : Map(src, succ : Map(dst, wt)),
      edgesIn  : Map(dst, pred : Set(src)) )
  Tiling
    edges    tile ListTile
    nodes    tile AttArrLinkedSet
    edgesOut tile AttMap
    edgesIn  tile AttMap
    succ     tile DualArrayMap
    pred     tile ArraySet

specifying a graph for Boruvka
Structure
  nodes : rel(node)
  edges : rel(src, dst, wt)
  FD {src, dst} → {wt}
  FK src → node
  FK dst → node

relational representation of graph
(Structure spec as on the previous slide.) The example graph, with each undirected edge stored in both directions:

nodes: a, b, c, d, e, f

edges (src, dst, wt):
  a b 5    b a 5
  a c 2    c a 2
  b d 4    d b 4
  c d 7    d c 7
  d e 1    e d 1
  e f 6    f e 6

specifying methods
Methods
  edgeExists : contains(src, dst)
  removeNode : remove(node)
  findMin : map(src, out dst, out wt) {
    if (wt < minWeight) { lt = dst; minWeight = wt; } }
  ... other methods ...
(Structure and relations as on the previous slides.)
Can we implement this efficiently?
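Taking the relational view literally, with edges as a flat list of (src, dst, wt) tuples, gives a correct but slow implementation: every method scans the whole relation. The sketch below is illustrative only (not Autograph output); it just makes visible the inefficiency that motivates the decomposition step.

```java
import java.util.ArrayList;
import java.util.List;

// Naive rendering of the relational spec: the edges relation as a flat
// tuple list, so contains/findMin are linear scans over all tuples.
class EdgeTuple {
    final String src, dst;
    final int wt;
    EdgeTuple(String src, String dst, int wt) { this.src = src; this.dst = dst; this.wt = wt; }
}

class RelationalGraph {
    final List<String> nodes = new ArrayList<>();
    final List<EdgeTuple> edges = new ArrayList<>();

    boolean edgeExists(String src, String dst) {        // O(|edges|) scan
        for (EdgeTuple e : edges)
            if (e.src.equals(src) && e.dst.equals(dst)) return true;
        return false;
    }

    String findMin(String src) {                        // O(|edges|) scan
        String lt = null;
        int minWeight = Integer.MAX_VALUE;
        for (EdgeTuple e : edges)
            if (e.src.equals(src) && e.wt < minWeight) { minWeight = e.wt; lt = e.dst; }
        return lt;
    }
}
```

Both methods cost O(|edges|) per call; the decomposition on the next slides replaces these scans with map lookups.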

decomposing relations
(Structure and Methods as above.)
Decomposition: the edges relation is decomposed into nested sub-relations:
  nodes : Set(node)
  edges : List(
    edgesOut : Map(src, succ : Map(dst, wt)),
    edgesIn  : Map(dst, pred : Set(src)) )

decomposed representation
Decomposition
  nodes : Set(node)
  edges : List(
    edgesOut : Map(src, succ : Map(dst, wt)),
    edgesIn  : Map(dst, pred : Set(src)) )
Methods: edgeExists, removeNode, findMin, ... (as above)
For the example graph:
  nodes = {a, b, c, d, e, f}
  edgesOut: a→{b:5, c:2}   b→{a:5, d:4}   c→{a:2, d:7}   d→{b:4, c:7, e:1}   e→{d:1, f:6}   f→{e:6}
  edgesIn:  a→{b, c}   b→{a, d}   c→{a, d}   d→{b, c, e}   e→{d, f}   f→{e}
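A plain-Java rendering of this decomposition (sequential, with no locking or undo support) might look like the sketch below. The class name and method bodies are illustrative, not code generated by Autograph, but they follow the edgesOut/edgesIn split and the method signatures on the slide.

```java
import java.util.*;

// Sequential sketch of the decomposed graph: nodes as a set, edges split
// into edgesOut (src -> (dst -> wt)) and edgesIn (dst -> set of src).
// Assumes addNode was called for every edge endpoint.
class DecomposedGraph<N> {
    final Set<N> nodes = new HashSet<>();
    final Map<N, Map<N, Integer>> edgesOut = new HashMap<>(); // src -> succ : dst -> wt
    final Map<N, Set<N>> edgesIn = new HashMap<>();           // dst -> pred : set of src

    void addNode(N n) {
        nodes.add(n);
        edgesOut.putIfAbsent(n, new HashMap<>());
        edgesIn.putIfAbsent(n, new HashSet<>());
    }

    void addEdge(N src, N dst, int wt) {                      // FD {src, dst} -> {wt}
        edgesOut.get(src).put(dst, wt);
        edgesIn.get(dst).add(src);
    }

    // edgeExists : contains(src, dst)
    boolean edgeExists(N src, N dst) {
        return edgesOut.get(src).containsKey(dst);
    }

    // findMin : map over the out-edges of 'src', keeping the lightest destination
    N findMin(N src) {
        N lt = null;
        int minWeight = Integer.MAX_VALUE;
        for (Map.Entry<N, Integer> e : edgesOut.get(src).entrySet()) {
            if (e.getValue() < minWeight) { minWeight = e.getValue(); lt = e.getKey(); }
        }
        return lt;
    }

    // removeNode : remove(node); the FK constraints mean its edges go too
    void removeNode(N n) {
        for (N dst : edgesOut.remove(n).keySet()) edgesIn.get(dst).remove(n);
        for (N src : edgesIn.remove(n)) edgesOut.get(src).remove(n);
        nodes.remove(n);
    }
}
```

findMin now only touches succ(src), and removeNode uses edgesIn to find the predecessors whose succ maps must be updated, which are exactly the access paths the decomposition was chosen to support.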

findMin(a)
(The next several slides step through the decomposed representation above: findMin(a) follows edgesOut to a's successor map succ(a) = {b:5, c:2} and iterates over its (dst, wt) entries, keeping the lightest one, here c with weight 2.)

findMin(a): abstract locks
(During this traversal the method acquires abstract locks on the elements of the neighborhood it reads, node a and its incident edges, so that a concurrent activity touching the same neighborhood is detected as a conflict.)

“tiles”: concretizing sub-relations
(Structure, Methods, and Decomposition as above.)
Tiling: each sub-relation in the decomposition is mapped to a concrete "tile" implementation:
  edges    tile ListTile
  nodes    tile AttArrLinkedSet
  edgesOut tile AttMap
  edgesIn  tile AttMap
  succ     tile DualArrayMap
  pred     tile ArraySet

nodes tile AttArrLinkedSet
(figure: the nodes sub-relation implemented by the library-style AttArrLinkedSet tile: an array indexed by thread id (0-3) of dummy header records, each heading a linked list of node records a, b, f, c, d, e with next pointers and an in_flag=1 attribute)

nodes tile AttLinkedSet
(figure: the customized AttLinkedSet tile: a single linked list a -> b -> f -> c -> d -> e with an in_flag attribute per node, matching the lazy-removal structure shown earlier)

optimizations
- Customizing tiles
  - Customize nodes set for concurrent deletions
  - Customize successor/predecessor maps for primitive types (see the sketch below)
  - Customize map operations
- Inlining
- Selecting relevant attributes
- Handling auxiliary state
- Loop fusion for read-only operations
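As an example of the "customize successor/predecessor maps for primitive types" point, a small map over primitive int keys and values can be stored in two parallel arrays, avoiding boxing and pointer-chasing. This is a generic sketch of that idea, not the DualArrayMap tile named in the talk, whose actual implementation is not shown here.

```java
import java.util.Arrays;

// Sketch of a map specialized for primitive int keys and values, kept in
// two parallel arrays and searched by linear scan.
class IntIntArrayMap {
    private int[] keys = new int[4];
    private int[] vals = new int[4];
    private int size = 0;

    private int indexOf(int key) {
        for (int i = 0; i < size; i++) if (keys[i] == key) return i;
        return -1;
    }

    void put(int key, int val) {
        int i = indexOf(key);
        if (i >= 0) { vals[i] = val; return; }
        if (size == keys.length) {                 // grow both arrays together
            keys = Arrays.copyOf(keys, size * 2);
            vals = Arrays.copyOf(vals, size * 2);
        }
        keys[size] = key;
        vals[size] = val;
        size++;
    }

    int getOrDefault(int key, int dflt) {
        int i = indexOf(key);
        return i >= 0 ? vals[i] : dflt;
    }

    void remove(int key) {
        int i = indexOf(key);
        if (i < 0) return;
        size--;
        keys[i] = keys[size];                      // swap-with-last removal
        vals[i] = vals[size];
    }
}
```

For the small per-node successor maps that arise in graphs like these, a linear scan over a compact array is typically cheaper than a hash map of boxed Integers.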

organization
- Speculative parallelism background
  - Speculative parallelization of graph algorithms
  - Data structures for speculative parallelism
- Autograph
  - Specifying relational data structures
  - Optimizations
- Empirical evaluation
- Related work + conclusion

experiments
- Specified graph data structures
- Used Autograph to generate Java code
- Compared:
  - Generated data structures
  - Library data structures (from Galois)
  - Hand-written parallel benchmarks
- Show relative effect of different optimizations

Boruvka: running times comparison (chart)

Boruvka: running times comparison (chart)

Boruvka: effect of optimizations (chart)

Delaunay mesh refinement: times (chart)

Single-source shortest path: times (chart)

writing graph applications yesterday
(figure: the stack before Autograph: a Joe programmer writes the graph application against a concurrent data structure library (Morph Graph, LC Graph, Set, Map, ...) written by a concurrency expert, on top of the Galois runtime maintained by expert programmers; the result is correct, but its efficiency is questionable because the library structures are not customizable)

writing graph applications today
(figure: the stack with Autograph: a Joe programmer writes the graph application; a Joe++ programmer writes a data structure specification, from which Autograph generates the data structure implementation on top of the Galois runtime; the result is correct, customizable, and shows speedups over the library data structures)

Thanks! Download Galois from the Galois project website. Expect Autograph in the next Galois release.