1 Operator Formulation of Irregular Algorithms
Keshav Pingali, University of Texas, Austin


2 Irregular applications
Organized around large, pointer-based data structures such as graphs, trees, and grids:
– Artificial intelligence: Bayesian inference; SAT solvers: WalkSAT, survey propagation
– Computational biology: protein homology networks
– Finite-element methods: mesh generation, refinement, partitioning; sparse linear solvers
– Data mining: clustering
– Social networks: finding communities
– Simulation: event-driven simulation, Petri nets
– N-body methods: Barnes-Hut, FMM
Galois project:
– (Wirth) algorithms + data structures = programs
– understand patterns of parallelism and locality in irregular algorithms
– build systems for exploiting this parallelism on multicore processors

3 High-level message
There is a lot of parallelism in irregular algorithms:
– amorphous data-parallelism (ADP)
Compile-time analysis cannot find ADP in most irregular algorithms:
– points-to analysis, shape analysis, …
– therefore, optimistic parallel execution is necessary to find ADP in some irregular applications
– but speculation is not needed for most irregular algorithms
There is a lot of structure in irregular algorithms:
– it can be exploited to optimize parallel execution
– regular algorithms become a special case of irregular algorithms
Dependence graphs are the wrong abstraction for irregular algorithms:
– data-centric abstractions are crucial
– rethink algorithms: the operator formulation

4 Operator formulation of algorithms
Algorithm = repeated application of an operator to a graph:
– active element: node or edge where computation is needed
– neighborhood: set of nodes and edges read/written to perform an activity; usually distinct from the node's neighbors in the graph
– ordering: order in which active elements must be executed in a sequential implementation
  – any order, or
  – a problem-dependent order
Amorphous data-parallelism:
– parallel execution of activities, subject to neighborhood and ordering constraints
[Figure: activities i1 … i5 on a graph; legend: active node, neighborhood]
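The operator formulation can be sketched as a sequential worklist loop. The sketch below is illustrative Python, not Galois code: the graph representation, the min-label operator, and all function names are assumptions chosen to show how applying an operator to an active node reads/writes a neighborhood and may create new active elements.

```python
from collections import deque

def apply_operator(graph, labels, node):
    """Hypothetical operator: lower a node's label to the minimum over its
    neighborhood (label propagation), returning newly activated elements."""
    neighborhood = graph[node]              # nodes read/written by this activity
    best = min(labels[n] for n in neighborhood | {node})
    newly_active = []
    if best < labels[node]:
        labels[node] = best
        newly_active = list(neighborhood)   # neighbors may need reprocessing
    return newly_active

def run(graph, labels, initial_active):
    """Algorithm = repeated application of the operator to active elements.
    This operator is unordered: any processing order reaches the same fixpoint."""
    worklist = deque(initial_active)
    while worklist:
        node = worklist.popleft()
        worklist.extend(apply_operator(graph, labels, node))
    return labels
```

On the 3-node path graph `{0: {1}, 1: {0, 2}, 2: {1}}` with labels `{0: 5, 1: 3, 2: 9}`, every node converges to label 3 regardless of the order in which the worklist is drained, illustrating don't-care non-determinism for unordered operators.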

5 Delaunay Mesh Refinement
Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate the cavity;  // may create new bad triangles
  }

Don't-care non-determinism:
– the final mesh depends on the order in which bad triangles are processed
– applications do not care which mesh is produced
Data structure:
– graph in which nodes represent triangles and edges represent triangle adjacencies
Parallelism:
– bad triangles whose cavities do not overlap can be processed in parallel
– parallelism depends on runtime values, so compilers cannot find it
– (Miller et al.) at runtime, repeatedly build an interference graph and find maximal independent sets for parallel execution

DMR with an explicit worklist:

  Mesh m = /* read in mesh */
  WorkList wl;
  wl.add(m.badTriangles());
  while (true) {
    if (wl.empty()) break;
    Element e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);  // determine new cavity
    c.expand();
    c.retriangulate();         // re-triangulate region
    m.update(c);               // update mesh
    wl.add(c.badTriangles());
  }

6 Event-driven simulation
Stations communicate by sending messages with time-stamps on FIFO channels.
Stations have internal state that is updated when a message is processed.
Messages must be processed in time-order at each station.
Data structure:
– messages in an event queue, sorted in time order
Parallelism:
– activities created in the future may interfere with current activities
– Jefferson time warp:
  – a station can fire when it has an incoming message on any edge
  – requires rollback if a speculative conflict is detected
– Chandy-Misra-Bryant:
  – conservative event-driven simulation
  – requires null messages to avoid deadlock
[Figure: example network of stations A, B, C exchanging time-stamped messages]
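The sequential baseline for event-driven simulation is a global priority queue drained in time order. This is a minimal Python sketch under assumed rules (unit forwarding delay, a `max_time` horizon, a message-count "state"); real simulators order per station and have richer state transitions.

```python
import heapq

def simulate(events, neighbors, max_time=10):
    """Drain time-stamped messages in global time order.
    events: list of (timestamp, station); neighbors: station -> stations
    it forwards a message to. The forwarding rule is an illustrative assumption."""
    heap = list(events)
    heapq.heapify(heap)                     # event queue, sorted in time order
    state = {}                              # per-station count of processed messages
    trace = []                              # processing order, for inspection
    while heap:
        t, station = heapq.heappop(heap)    # earliest pending message
        trace.append((t, station))
        state[station] = state.get(station, 0) + 1
        for nxt in neighbors.get(station, []):
            if t + 1 <= max_time:           # forward with unit delay
                heapq.heappush(heap, (t + 1, nxt))
    return state, trace
```

Because the heap always yields the globally earliest message, the trace is non-decreasing in time; the parallelization question on the slide is exactly when two pops of this queue could have been processed concurrently without violating per-station time order.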

7 Galois programming model (PLDI 2007)
Joe programmers:
– sequential, object-oriented model
– Galois set iterators for iterating over unordered and ordered sets of active elements:

  for each e in Set S do B(e)
  – evaluate B(e) for each element in set S
  – no a priori order on iterations
  – set S may get new elements during execution

  for each e in OrderedSet S do B(e)
  – evaluate B(e) for each element in set S
  – perform iterations in the order specified by the OrderedSet
  – set S may get new elements during execution

Stephanie programmers:
– Galois concurrent data structure library
(Wirth) Algorithms + Data structures = Programs

DMR using Galois iterators:

  Mesh m = /* read in mesh */
  Set ws;
  ws.add(m.badTriangles());      // initialize ws
  for each tr in Set ws do {     // unordered Set iterator
    if (tr no longer in mesh) continue;
    Cavity c = new Cavity(tr);
    c.expand();
    c.retriangulate();
    m.update(c);
    ws.add(c.badTriangles());    // add new bad triangles
  }

8 Galois parallel execution model
Parallel execution model:
– shared memory
– optimistic execution of Galois iterators
Implementation:
– the master thread begins execution of the program
– when it encounters an iterator, worker threads help by executing iterations concurrently
– barrier synchronization at the end of the iterator
Independence of neighborhoods:
– software TLS/TM variety
– logical locks on nodes and edges
Ordering constraints for the ordered-set iterator:
– execute iterations out of order but commit in order
– cf. out-of-order CPUs
[Figure: master thread running the Joe program over a concurrent data structure, with activities i1 … i5]
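The "execute out of order, commit in order" rule for ordered-set iterators works like a CPU reorder buffer: an iteration's result becomes visible only once every earlier iteration has committed. This is an abstract Python sketch of just that retirement logic; `completion_order` is a hypothetical permutation recording when each iteration happens to finish.

```python
def commit_in_order(completion_order):
    """Iterations finish in an arbitrary (speculative) order, but retire
    strictly in iteration-index order, like a reorder buffer."""
    done = set()            # iterations that have finished executing
    committed = []          # order in which results become visible
    next_to_commit = 0
    for i in completion_order:          # completions arrive out of order
        done.add(i)
        # retire every consecutive iteration that is now finished
        while next_to_commit in done:
            committed.append(next_to_commit)
            next_to_commit += 1
    return committed
```

For example, completions `[2, 0, 1, 3]` still commit as `[0, 1, 2, 3]`: iteration 2 finishes first but must wait, buffered, until 0 and 1 retire.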

9 ParaMeter parallelism profiles: DMR
Input mesh:
– produced by Triangle (Shewchuk)
– 550K triangles
– roughly half are badly shaped
Available parallelism:
– how many non-conflicting triangles can be expanded at each time step?
Parallelism intensity:
– what fraction of the total number of bad triangles can be expanded at each step?

10 Structural analysis of irregular algorithms
Irregular algorithms can be classified along three axes:
– topology: general graph, grid, tree
– operator: morph (refinement, coarsening, general), local computation, reader
– ordering: unordered, ordered; active elements identified topology-driven or data-driven
Examples:
– Jacobi: topology: grid; operator: local computation; ordering: unordered
– DMR, graph reduction: topology: graph; operator: morph; ordering: unordered
– Event-driven simulation: topology: graph; operator: local computation; ordering: ordered

11 Cautious operators (PPoPP 2010)
Cautious operator implementation:
– reads all the elements in its neighborhood before modifying any of them
– (e.g.) Delaunay mesh refinement
Algorithm structure:
– cautious operator + unordered active elements
Optimization: optimistic execution without buffering:
– grab locks on elements during the read phase
  – conflict: someone else holds the lock, so release your locks
– once the update phase begins, no new locks will be acquired
  – update in place without making copies (zero-buffering)
– note: this is not two-phase locking
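The zero-buffering protocol for cautious operators can be sketched abstractly. This is an illustrative Python model, not Galois internals: `locks` is a dictionary standing in for per-element logical locks, and the activity/element names are hypothetical. The key point is that a conflict during the read phase needs no undo, because nothing has been modified yet.

```python
def try_activity(locks, me, neighborhood, update):
    """Cautious-operator execution: lock the whole neighborhood during the
    read phase; on conflict, release and abort (nothing to roll back);
    then perform the update phase in place, with no copies."""
    acquired = []
    for elem in neighborhood:               # read phase: lock every element read
        owner = locks.get(elem)
        if owner is not None and owner != me:
            for a in acquired:              # conflict: back out; no undo needed,
                del locks[a]                # since no element was modified
            return False
        locks[elem] = me
        acquired.append(elem)
    update()                                # update phase: in place, zero-buffering
    for a in acquired:                      # commit: release logical locks
        del locks[a]
    return True
```

Note how this differs from two-phase locking: the phase boundary is fixed by the cautious structure of the operator itself (all reads before any write), not by a generic lock-acquisition discipline.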

12 Eliminating speculation
Coordinated execution of activities is possible if we can build the dependence graph:
– Run-time scheduling: cautious operator + unordered active elements
  – execute all activities partially to determine their neighborhoods
  – create an interference graph and find an independent set of activities
  – execute the independent set in parallel without synchronization
– Just-in-time scheduling: local computation + topology-driven
  – (e.g.) tree walks, sparse MVM (inspector-executor)
– Compile-time scheduling: previous case + graph known at compile time (e.g. grid or clique)
  – make all scheduling decisions at compile time
Speculation is needed if:
– the operator implementation is not cautious, or
– the algorithm is unordered and can create new active nodes dynamically
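The run-time scheduling case can be sketched in a few lines of Python. This is a simplified sketch, not the Galois scheduler: neighborhoods are given up front as plain sets (standing in for the partial-execution inspection step), and the independent set is built greedily, so it is maximal but not maximum.

```python
def schedule_rounds(neighborhoods):
    """Run-time scheduling for a cautious, unordered operator:
    repeatedly pick a maximal set of activities with pairwise-disjoint
    neighborhoods; each round can then run in parallel, speculation-free."""
    pending = dict(neighborhoods)           # activity -> set of touched elements
    rounds = []
    while pending:
        taken, touched = [], set()
        for act, nbhd in pending.items():
            if touched.isdisjoint(nbhd):    # no interference within this round
                taken.append(act)
                touched |= nbhd
        for act in taken:
            del pending[act]
        rounds.append(taken)
    return rounds
```

For neighborhoods `{'a': {1, 2}, 'b': {2, 3}, 'c': {4}}`, activities `a` and `c` share no elements and form round one, while `b` interferes with `a` on element 2 and is deferred to round two.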

13 Summary
(Wirth) Algorithms + Data structures = Programs:
– a deep understanding of the structure in algorithms and data structures is critical for successful parallelization
Wrong abstractions:
– dependence graphs, dataflow graphs, etc.
Right abstraction:
– the operator formulation of algorithms
– key notions: active nodes, neighborhoods, ordering
– exposes key properties of algorithms:
  – amorphous data-parallelism
  – structure in algorithms and data structures