ParaMeter: A profiling tool for amorphous data-parallel programs Donald Nguyen University of Texas at Austin.

Slides:



Advertisements
Similar presentations
Routing Complexity of Faulty Networks Omer Angel Itai Benjamini Eran Ofek Udi Wieder The Weizmann Institute of Science.
Advertisements

Greedy Algorithms.
Efficient access to TIN Regular square grid TIN Efficient access to TIN Let q := (x, y) be a point. We want to estimate an elevation at a point q: 1. should.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture10.
1 Maximal Independent Set. 2 Independent Set (IS): In a graph G=(V,E), |V|=n, |E|=m, any set of nodes that are not adjacent.
Algorithm Strategies Nelson Padua-Perez Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Network Correlated Data Gathering With Explicit Communication: NP- Completeness and Algorithms R˘azvan Cristescu, Member, IEEE, Baltasar Beferull-Lozano,
Galois System Tutorial Donald Nguyen Mario Méndez-Lojo.
Structure-driven Optimizations for Amorphous Data-parallel Programs 1 Mario Méndez-Lojo 1 Donald Nguyen 1 Dimitrios Prountzos 1 Xin Sui 1 M. Amber Hassaan.
The of Parallelism in Algorithms Keshav Pingali The University of Texas at Austin Joint work with D.Nguyen, M.Kulkarni, M.Burtscher, A.Hassaan, R.Kaleem,
Galois Performance Mario Mendez-Lojo Donald Nguyen.
1 Spanning Trees Lecture 20 CS2110 – Spring
CS Dept, City Univ.1 Low Latency Broadcast in Multi-Rate Wireless Mesh Networks LUO Hongbo.
High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.
Online Data Gathering for Maximizing Network Lifetime in Sensor Networks IEEE transactions on Mobile Computing Weifa Liang, YuZhen Liu.
On the Task Assignment Problem : Two New Efficient Heuristic Algorithms.
More Graph Algorithms Weiss ch Exercise: MST idea from yesterday Alternative minimum spanning tree algorithm idea Idea: Look at smallest edge not.
CSE 550 Computer Network Design Dr. Mohammed H. Sqalli COE, KFUPM Spring 2007 (Term 062)
Approximation Algorithms Motivation and Definitions TSP Vertex Cover Scheduling.
CPSC 411, Fall 2008: Set 4 1 CPSC 411 Design and Analysis of Algorithms Set 4: Greedy Algorithms Prof. Jennifer Welch Fall 2008.
A Lightweight Infrastructure for Graph Analytics Donald Nguyen Andrew Lenharth and Keshav Pingali The University of Texas at Austin.
TECH Computer Science Graph Optimization Problems and Greedy Algorithms Greedy Algorithms  // Make the best choice now! Optimization Problems  Minimizing.
Minimum Spanning Trees What is a MST (Minimum Spanning Tree) and how to find it with Prim’s algorithm and Kruskal’s algorithm.
Graph Algorithms. Overview Graphs are very general data structures – data structures such as dense and sparse matrices, sets, multi-sets, etc. can be.
Graph Algorithms. Overview Graphs are very general data structures – data structures such as dense and sparse matrices, sets, multi- sets, etc. can be.
Shortest Path Algorithms. Kruskal’s Algorithm We construct a set of edges A satisfying the following invariant:  A is a subset of some MST We start with.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Graph Partitioning Donald Nguyen October 24, 2011.
Network Aware Resource Allocation in Distributed Clouds.
A Shape Analysis for Optimizing Parallel Graph Programs Dimitrios Prountzos 1 Keshav Pingali 1,2 Roman Manevich 2 Kathryn S. McKinley 1 1: Department of.
Agenda Project discussion Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]pdf.
Fundamentals of Algorithms MCS - 2 Lecture # 7
De-Nian Young Ming-Syan Chen IEEE Transactions on Mobile Computing Slide content thanks in part to Yu-Hsun Chen, University of Taiwan.
Prims’ spanning tree algorithm Given: connected graph (V, E) (sets of vertices and edges) V1= {an arbitrary node of V}; E1= {}; //inv: (V1, E1) is a tree,
1 Keshav Pingali University of Texas, Austin Introduction to parallelism in irregular algorithms.
1 Keshav Pingali University of Texas, Austin Operator Formulation of Irregular Algorithms.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
Tao Lin Chris Chu TPL-Aware Displacement- driven Detailed Placement Refinement with Coloring Constraints ISPD ‘15.
Implementing Parallel Graph Algorithms Spring 2015 Implementing Parallel Graph Algorithms Lecture 2: Introduction Roman Manevich Ben-Gurion University.
CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.
SPANNING TREES Lecture 21 CS2110 – Fall Nate Foster is out of town. NO 3-4pm office hours today!
Distributed Data Gathering Scheduling in Multi-hop Wireless Sensor Networks for Improved Lifetime Subhasis Bhattacharjee and Nabanita Das International.
Vasilis Syrgkanis Cornell University
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
CSEP 521 Applied Algorithms Richard Anderson Winter 2013 Lecture 3.
Roman Manevich Rashid Kaleem Keshav Pingali University of Texas at Austin Synthesizing Concurrent Graph Data Structures: a Case Study.
1 Low Latency Multimedia Broadcast in Multi-Rate Wireless Meshes Chun Tung Chou, Archan Misra Proc. 1st IEEE Workshop on Wireless Mesh Networks (WIMESH),
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
Greedy Algorithms. Zhengjin,Central South University2 Review: Dynamic Programming Summary of the basic idea: Optimal substructure: optimal solution to.
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
1 Ch18. The Greedy Methods. 2 BIRD’S-EYE VIEW Enter the world of algorithm-design methods In the remainder of this book, we study the methods for the.
Mesh Generation, Refinement and Partitioning Algorithms Xin Sui The University of Texas at Austin.
Prof. Yu-Chee Tseng Department of Computer Science
Parallel Graph Algorithms
Graphs Representation, BFS, DFS
Greedy Technique.
Maximal Independent Set
COMP 6/4030 ALGORITHMS Prim’s Theorem 10/26/2000.
CS 3343: Analysis of Algorithms
Topological Sort (topological order)
Short paths and spanning trees
Graphs & Graph Algorithms 2
Haim Kaplan and Uri Zwick
CSE 373 Data Structures and Algorithms
Sungho Kang Yonsei University
CSE 373: Data Structures and Algorithms
CSE 373: Data Structures and Algorithms
Algorithms CSCI 235, Spring 2019 Lecture 35 Graphs IV
CSE 373: Data Structures and Algorithms
Presentation transcript:

ParaMeter: A profiling tool for amorphous data-parallel programs Donald Nguyen University of Texas at Austin

“Available Parallelism” How many active nodes can be processed in parallel over time Profile the algorithm, not the system – Disregard communication/synchronization costs, runtime overheads and locality – Rough upper bound on parallelism Expose common high-level structure between algorithms 2

3 Measuring Parallelism Represent program as DAG – Nodes: operations – Edges: dependencies Execution strategy – Operations take unit time – Execute greedily Profile of #operations executed each step Cannot build DAG in general for amorphous data-parallel programs ?

Outline 1.Example: spanning tree 2.Demo of ParaMeter 3.Computing parallelism profiles 4.Algorithm classification 4

Example: Spanning Tree Given an unweighted graph and a starting node, construct a spanning tree 5

Programming with Galois Galois Iterators – Ordered and unordered foreach iterator Galois Data Structures – Graphs, Sets, Maps, etc. 6

Programming with Galois Algorithm Choose v from worklist For each neighbor u … If u not in ST, add edge, mark u and add u to worklist Data Structures Graph Spanning tree (set of edges) Graph g = … Node root = … root.in = true Worklist wl = {root} Set st = { } foreach v: wl foreach u: v.neigh() if not u.in u.in = true st.add((v, u)) wl.add(u) 7

Where’s the Parallelism? Active nodes are nodes on the frontier of ST Neighborhood is immediate neighbors of active node If neighborhoods are small, we can expand ST at several active nodes at once 8

How Much Parallelism is There? If we assume an infinite number of processors? – and every activity takes unit time – and perfect knowledge of neighborhood and ordering constraints How many steps would it take to finish? How many activities can we execute in a step? 9

10 A Guess

Demo: Spanning Tree 11

How Does It Work? 12

Simplified Measurement Represent program as graph – No notion of ordering Execution strategy – Choose independent sets of activities – Different choices  different profiles 13 Conflict Graph parallelism step

Greedy Scheduling Finding schedule to maximize parallelism is NP-hard Heuristic: schedule greedily – Attempt to maximize activities in current step – Choose maximal independent set in conflict graph 14

15 Measurement Conflict Graph Conflict graph changes during execution – New work generated – New conflicts Solution: execute in stages, recalculate conflict graph after each stage

16 Incremental Execution Conflict Graph Conflict graph changes during execution – New work generated – New conflicts Solution: execute in stages, recalculate conflict graph after each stage

ParaMeter Execution Strategy While work left 1.Generate conflict graph for current worklist 2.Execute maximal independent set of nodes 3.Add newly generated work to worklist 17

Outline 1.Example: spanning tree 2.Demo of ParaMeter 3.Computing parallelism profiles 4.Algorithm classification 18

Spanning tree is a refinement morph Typical profile – Little parallelism to start, as ST gets refined – Most parallelism in the middle, as more activities become independent – Little parallelism at the end, as algorithm runs out of work Are there similar trends for other refinement algorithms? 19

Delaunay Mesh Refinement Active nodes: bad triangles Neighborhoods: cavities Refinement-like algorithm – As bad triangles get fixed, mesh gets larger – Cavity sizes stay roughly the same – As mesh grows, cavities less likely to overlap 20

Refine 550K triangle mesh, initially 50% bad 21

Agglomerative Clustering Build a dendrogram by clustering points according to distance Two points cluster if each is the other’s nearest neighbor Dendrogram is built bottom-up 22

Agglomerative Clustering Expect parallelism to match “bushiness” of dendrogram Dendrogram built bottom-up – Coarsening morph – Parallelism should decrease as tree gets connected 23

Cluster 100K randomly generated points 24

Kruskal’s Algorithm Compute minimal spanning tree Ordered algorithm – Active nodes must be processed in some order Execute nodes like out- of-order processor ParaMeter tracks when an activity executes not when it retires

100x100 grid with random weights 26

topology-driven topology operator ordering morph local computation reader general graph grid tree unordered ordered refinement coarsening general data-driven Delaunay Mesh RefinementDelaunay TriangulationSpanning Tree 27

28 topology-driven topology operator ordering morph local computation reader general graph grid tree unordered ordered refinement coarsening general data-driven Agglomerative ClusteringBorvuka’s Algorithm

topology-driven topology operator ordering morph local computation reader general graph grid tree unordered ordered refinement coarsening general data-driven Survey Propagation Single-source Shortest Path Preflow-push 29 Ordered vs. Unordered: a Comparison of Parallelism and Work- Efficiency in Irregular Algorithms, M. Amber Hassaan, Martin Burtscher and Keshav Pingali

Thanks 30

31

32