Structure-driven Optimizations for Amorphous Data-parallel Programs
Mario Méndez-Lojo¹, Donald Nguyen¹, Dimitrios Prountzos¹, Xin Sui¹, M. Amber Hassaan¹, Milind Kulkarni², Martin Burtscher¹, Keshav Pingali¹
¹ The University of Texas at Austin (USA)   ² Purdue University (USA)
Irregular algorithms
Operate on pointer-based data structures like graphs
– mesh refinement, minimum spanning tree, max-flow, …
Plenty of available parallelism [Kulkarni et al., PPoPP’09]
Baseline Galois system [Kulkarni et al., PLDI’07]
– uses speculation to exploit this parallelism
– may have high overheads for some algorithms
Solution explored in the paper
– exploit algorithmic structure to reduce the overheads of the baseline system
We will show:
– common algorithmic structures
– optimizations that exploit those structures
– performance results
Operator formulation of algorithms
Algorithm = repeated application of an operator to a graph
– active node: node where computation is needed
– activity: application of the operator to an active node; can add/remove nodes of the graph
– neighborhood: set of nodes and edges read/written to perform the activity; can be distinct from the active node’s neighbors in the graph
Focus: algorithms in which the order of activities does not matter
Amorphous data-parallelism
– parallel execution of activities, subject to neighborhood constraints
[Figure: graph with active nodes i1–i5, each surrounded by its neighborhood]
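As a concrete (if simplified) illustration, here is a minimal Java sketch of the operator formulation; Operator and WorksetDriver are illustrative names, not the Galois API:

import java.util.Deque;
import java.util.List;

interface Operator<N> {
    // One activity: apply the operator to an active node, possibly
    // modifying the graph; returns any newly activated nodes.
    List<N> apply(N activeNode);
}

final class WorksetDriver {
    // Sequential reference semantics: since the order of activities does
    // not matter, any scheduling policy for the workset is valid.
    static <N> void run(Deque<N> workset, Operator<N> op) {
        while (!workset.isEmpty()) {
            N active = workset.poll();
            workset.addAll(op.apply(active));
        }
    }
}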
Delaunay mesh refinement
Iterative refinement to remove badly shaped triangles:
  add initial bad triangles to workset
  while workset is not empty:
    pick a bad triangle
    find its cavity
    retriangulate cavity
    add new bad triangles to workset
Multiple valid solutions
Parallelism:
– bad triangles whose cavities do not overlap can be processed in parallel
– parallelism is dependent on runtime values, so compilers cannot find this parallelism
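A hedged Java rendering of the loop above; Mesh, Triangle, and Cavity are assumed interfaces standing in for the geometric kernels, not code from the paper:

import java.util.Deque;
import java.util.List;

interface Triangle { boolean isBad(); }
interface Cavity {}
interface Mesh {
    boolean contains(Triangle t);
    Cavity findCavity(Triangle bad);         // region invalidated by fixing 'bad'
    List<Triangle> retriangulate(Cavity c);  // returns the newly created triangles
}

final class Refinement {
    static void refine(Mesh mesh, Deque<Triangle> workset) {
        while (!workset.isEmpty()) {
            Triangle bad = workset.poll();
            if (!mesh.contains(bad)) continue;  // removed by an earlier retriangulation
            Cavity cavity = mesh.findCavity(bad);
            for (Triangle t : mesh.retriangulate(cavity))
                if (t.isBad()) workset.add(t);  // new bad triangles go back on the workset
        }
    }
}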
Baseline execution model
Parallel execution model
– shared memory
– optimistic execution of Galois iterators
Implementation
– threads get active nodes from the workset
– apply the operator to them
Neighborhood independence
– each node/edge has an associated token
– graph operations → acquire tokens on read/written nodes
– token owned by another thread → conflict → activity rolled back
– a variety of software TLS/TM
[Figure: program with a Galois iterator (“for node : workset”) running against a concurrent graph with active nodes i1–i5]
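A sketch of the per-element tokens used for conflict detection; GraphElement and ConflictException are illustrative names, and the abort is assumed to be implemented by unwinding the activity with an exception:

import java.util.concurrent.atomic.AtomicReference;

final class ConflictException extends RuntimeException {}

final class GraphElement {
    // null = free; otherwise the thread currently owning this element.
    private final AtomicReference<Thread> token = new AtomicReference<>();

    // Called before every read or write of this element.
    void acquire() {
        Thread me = Thread.currentThread();
        if (token.get() == me) return;           // already own the token
        if (!token.compareAndSet(null, me))      // held by another thread
            throw new ConflictException();       // → conflict → activity rolled back
    }

    void release() { token.compareAndSet(Thread.currentThread(), null); }
}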
Sources of overhead
Dynamic assignment of work
– the centralized workset requires synchronization
Enforcing neighborhood constraints
– acquiring/releasing tokens on the neighborhood
Copying data for rollbacks
– when an activity modifies a graph element
Aborted activities
– work is wasted, and the activity must be rolled back
[Figure: activity timeline marking its read (R) and write (W) operations]
Proposed optimizations
The baseline execution model is very general
– many irregular algorithms do not need its full generality
– “optimize the common case”
Identify general-purpose optimizations and evaluate their performance impact
Optimizations:
– cautious
– one-shot
– iteration coalescing
Cautious
Algorithmic structure: the operator reads all elements of its neighborhood before modifying any of them
– conflicts are detected before modifications occur
Overheads removed:
– enforcing neighborhood constraints: token acquisition is unnecessary after the first modification
– copying data for rollbacks: no undo copies are needed, since the activity cannot abort once it starts writing
Examples: Delaunay refinement, Boruvka minimum spanning tree, etc.
[Figure: activity timeline in which all reads (R) precede the first write (W)]
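A sketch of a cautious activity, reusing the GraphElement tokens from the earlier sketch; the point is the two-phase shape, not any specific operator:

final class CautiousActivity {
    // Phase 1: read (and acquire) the entire neighborhood. Any conflict
    // throws here, before a single modification has been made, so no
    // undo copies are ever taken.
    static void run(GraphElement[] neighborhood) {
        for (GraphElement e : neighborhood)
            e.acquire();
        // Phase 2: modify. The activity now owns its whole neighborhood
        // and can no longer abort, so token checks and rollback logging stop.
        for (GraphElement e : neighborhood)
            modify(e);
    }

    private static void modify(GraphElement e) { /* hypothetical graph mutation */ }
}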
One-shot
Algorithmic structure: the neighborhood can be predicted before the activity begins
Overheads removed:
– enforcing neighborhood constraints: token acquisition is only necessary when the activity starts
– copying data for rollbacks: no undo copies are needed
– aborted activities waste little computation
Examples: preflow-push, survey propagation, stencil codes like Jacobi, etc.
[Figure: activity timeline in which all read (R) and write (W) tokens are acquired at the start]
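One-shot, in the same illustrative style: the predicted neighborhood is acquired up front, so an aborted activity wastes almost no computation and nothing has to be copied for rollback:

final class OneShotActivity {
    static void run(GraphElement[] predictedNeighborhood) {
        // Acquire every token before doing any work: a conflict aborts the
        // activity here, when it has computed (and must undo) nothing.
        for (GraphElement e : predictedNeighborhood)
            e.acquire();
        // From here on the activity runs to completion without conflicts.
        doWork(predictedNeighborhood);
    }

    private static void doWork(GraphElement[] neighborhood) { /* hypothetical computation */ }
}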
Iteration coalescing
Algorithmic structure:
– activities generate new active nodes
– the same token is acquired many times across related activities
Iteration coalescing = data-centric loop chunking
– place new active nodes in thread-local worksets
– release tokens only on abort/commit
Benefits:
– dynamic assignment of work: less contention
– enforcing neighborhood constraints: the thread probably owns the token already (locality)
Drawback: the number of tokens held by a thread increases
– higher conflict ratio
[Figure: consecutive activity timelines reusing the same read/write tokens]
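A sketch of data-centric loop chunking on top of the earlier sketches (Operator, GraphElement, ConflictException); CoalescingWorker is an illustrative name, and rollback of graph state on abort is elided:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class CoalescingWorker<N> {
    private final Deque<N> localWorkset = new ArrayDeque<>();   // thread-local work
    private final Set<GraphElement> heldTokens = new HashSet<>();

    // Operators acquire elements through the worker so that all held
    // tokens can be released together at commit/abort.
    void acquire(GraphElement e) {
        e.acquire();
        heldTokens.add(e);
    }

    // Runs a whole chunk of related activities; tokens acquired by one
    // activity are kept, so later activities usually reuse them (locality).
    void runChunk(N firstActive, Operator<N> op) {
        localWorkset.add(firstActive);
        try {
            while (!localWorkset.isEmpty())
                localWorkset.addAll(op.apply(localWorkset.poll()));
            releaseAll();   // commit: release tokens only at the end of the chunk
        } catch (ConflictException conflict) {
            localWorkset.clear();
            releaseAll();   // abort: release everything (graph rollback elided)
        }
    }

    private void releaseAll() {
        for (GraphElement g : heldTokens) g.release();
        heldTokens.clear();
    }
}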
Evaluation
Experiments on Niagara (8 cores, 2 threads/core); average % improvement over the baseline
Delaunay refinement
– cautious → 15%
– cautious + coalescing → 18%
– one-shot not applicable
– max speedup: 8.4x
Boruvka minimum spanning tree
– cautious → 22%
– one-shot → 8%
– coalescing has no impact
– max speedup: 2.7x
Evaluation
Experiments on Niagara (8 cores, 2 threads/core); average % improvement over the baseline
Preflow-push (max flow)
– cautious → 33%
– one-shot → 44%
– one-shot + coalescing → 59%
– max speedup: 1.12x
Survey propagation (SAT)
– baseline times out
– one-shot → 28% over cautious
– max speedup: 1.04x
Conclusions
There is structure in irregular algorithms
Our optimizations exploit this structure
– cautious
– one-shot
– iteration coalescing
The evaluation confirms the importance of reducing the overheads of speculation
– other optimizations waiting to be discovered?
Thank you! (धन्यवाद — “thank you” in Hindi)
Slides available at http://www.ices.utexas.edu/~marioml