1
Galois Performance
Mario Mendez-Lojo, Donald Nguyen
2
Overview
Galois system is a test bed for exploring optimizations
– Safe but not fast out of the box
Important optimizations
– Select the least transactional overhead
– Select the right scheduling
– Select the appropriate data structure
Quantify the optimizations on applications
3
Algorithms
Irregular algorithms are classified along three axes:
– Topology: general graph, grid, tree
– Operator: morph, local computation, reader
– Ordering: unordered, ordered
Applications studied
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push
4
Methodology
Figure: time versus number of threads, broken down into Idle, Serial, GC, and Compute
Abort ratio: aborted iterations / total iterations
GC options: UseParallelGC, UseParallelOldGC, NewRatio=1
5
Terms
Base – default scheduling, default graph
Serial – Galois classes replaced by classes without concurrency control
Speedup – measured against the best mean performance of a serial variant
Throughput – number of serial iterations / time
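The terms above reduce to simple ratios. The following is a minimal sketch of how they could be computed, not code from the Galois distribution: the class name `PerfMetrics` and its method names are hypothetical, and the only data used are the Barnes-Hut times reported on the results slide.

```java
import java.util.Locale;

// Hypothetical helper class; not part of Galois.
public final class PerfMetrics {
    private PerfMetrics() {}

    /** Speedup: best serial time divided by the parallel time. */
    public static double speedup(double bestSerialMs, double parallelMs) {
        return bestSerialMs / parallelMs;
    }

    /** Throughput: number of serial iterations divided by elapsed time (iterations per ms). */
    public static double throughput(long serialIterations, double elapsedMs) {
        return serialIterations / elapsedMs;
    }

    /** Abort ratio: aborted iterations divided by total iterations. */
    public static double abortRatio(long abortedIterations, long totalIterations) {
        return totalIterations == 0 ? 0.0 : (double) abortedIterations / totalIterations;
    }

    public static void main(String[] args) {
        // Serial and best parallel times from the Barnes-Hut results slide.
        double s = speedup(10271, 1553);
        System.out.println(String.format(Locale.ROOT, "Barnes-Hut speedup: %.1fX", s));
    }
}
```

Dividing by the best serial variant, rather than by the one-thread parallel run, keeps the speedup from being inflated by concurrency-control overhead.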
6
Numbers
Runtime
– Last of 5 runs in the same VM
– Ignores the time to read and construct the initial graph
Other statistics
– Last of 5 runs
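The measurement protocol above can be sketched as a small timing harness. This is an illustrative assumption about how such a measurement might be coded, not the authors' harness: `LastRunTimer` and `timeLastRunMs` are hypothetical names, and the workload passed in is expected to exclude graph reading and construction.

```java
// Hypothetical timing harness; not part of Galois.
public final class LastRunTimer {
    private LastRunTimer() {}

    /**
     * Runs the task the given number of times in this JVM and returns the
     * wall-clock time of the last run in milliseconds. Earlier runs act as
     * warm-up for the JIT compiler and the heap.
     */
    public static long timeLastRunMs(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long lastMs = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            lastMs = (System.nanoTime() - start) / 1_000_000;
        }
        return lastMs;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real measurement would run the algorithm
        // on a graph that was built before timing started.
        long ms = timeLastRunMs(() -> {
            double sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println("unreachable");
            }
        }, 5);
        System.out.println("last run: " + ms + " ms");
    }
}
```

Reporting the last run rather than the mean avoids counting warm-up iterations that are dominated by JIT compilation.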
7
Test Environment
2 x Xeon X5570 (4 cores, 2.93 GHz)
Java 1.6.0_0-b11
Linux 2.6.24-27 x86_64
20 GB heap size
8
BARNES-HUT
Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field
9
Barnes-Hut
N-body algorithm
– Oct-tree acceleration structure
– Serial: tree build, center of mass, particle update
– Parallel: force computation
Structure
– Reader on tree
Variants
– Splash2, Reader Galois
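As an illustration of the serial center-of-mass phase listed above, the sketch below summarizes a tree bottom-up so that each internal node stores the total mass and mass-weighted position of its subtree. It is a generic textbook version under assumed names (`CenterOfMass`, `Node`, `summarize`), not the Galois or Splash2 code, and it uses a plain child list instead of a real oct-tree with eight fixed slots.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Galois or Splash2 implementation.
public final class CenterOfMass {
    private CenterOfMass() {}

    /** A tree node: leaves are bodies, internal nodes summarize their children. */
    public static final class Node {
        public double mass, x, y, z;
        public final List<Node> children = new ArrayList<>();

        public Node() {}

        public Node(double mass, double x, double y, double z) {
            this.mass = mass;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Post-order traversal: fill in mass and center of mass of every internal node. */
    public static void summarize(Node node) {
        if (node.children.isEmpty()) {
            return; // a leaf already holds its own mass and position
        }
        double m = 0, sx = 0, sy = 0, sz = 0;
        for (Node c : node.children) {
            summarize(c);
            m += c.mass;
            sx += c.mass * c.x;
            sy += c.mass * c.y;
            sz += c.mass * c.z;
        }
        node.mass = m;
        if (m > 0) {
            node.x = sx / m;
            node.y = sy / m;
            node.z = sz / m;
        }
    }
}
```

Once the tree is summarized, the force computation only reads it, which is why that phase can run in parallel as a reader on the tree.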
10
Reader Optimization
Before: child = octree.getNeighbor(nn, 1);
After: child = octree.getNeighbor(nn, 1, MethodFlag.NONE);
11
ParaMeter Profile
12
Barnes-Hut Results
100,000 points, 1 time step
Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X
14
Barnes-Hut Scalability
16
DELAUNAY MESH REFINEMENT
17
Delaunay Mesh Refinement
Refine “bad” triangles
– Maintained in a worklist
Structure
– Cautious operator on graph
Variants
– Flag optimized, locallifo
base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
18
Cautious Optimization
Before:
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);
After:
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);
No need to save undo information
Only check conflicts up to the first write
19
LIFO Optimization
Before: GaloisRuntime.foreach(..., Priority.defaultOrder());
After: GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
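To make the scheduling policy concrete, here is a conceptual sketch of a worklist that hands out global chunks in FIFO order while each thread works through its own items newest-first. It is not the Galois `ChunkedFIFO` or `LIFO` implementation: the class `LocalLifoWorklist` and its methods are hypothetical, and termination detection across threads is deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of "chunked FIFO globally, LIFO locally".
public final class LocalLifoWorklist<T> {
    // Shared queue of chunks, served oldest-first.
    private final Queue<List<T>> globalChunks = new ConcurrentLinkedQueue<>();
    // Per-thread stack of items, served newest-first.
    private final ThreadLocal<Deque<T>> local =
            ThreadLocal.withInitial(() -> new ArrayDeque<T>());

    /** Splits the initial work into chunks and enqueues them in order. */
    public void addInitial(List<T> items, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            globalChunks.add(new ArrayList<>(items.subList(i, end)));
        }
    }

    /** Work created while processing an item goes on the calling thread's stack. */
    public void push(T item) {
        local.get().push(item);
    }

    /** Returns the newest local item; refills from the oldest global chunk when empty. */
    public T poll() {
        Deque<T> mine = local.get();
        if (mine.isEmpty()) {
            List<T> chunk = globalChunks.poll();
            if (chunk == null) {
                return null; // no local work and no global chunks left
            }
            for (T item : chunk) {
                mine.push(item);
            }
        }
        return mine.pop();
    }
}
```

In mesh refinement, the triangles created by fixing one bad triangle are likely to be close to it, so processing them immediately on the same thread tends to reuse data that is still in cache.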
20
ParaMeter Profile
21
DMR Results
0.5M triangles, 0.25M bad triangles
Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X
23
PREFLOW-PUSH
24
Preflow-push
Max-flow algorithm
– Nodes push flow downhill
Structure
– Cautious, local computation
Variants
– Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)
25
Local Computation Optimization
graph =...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();
26
ParaMeter Profile
27
Preflow-push Results
From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm
C: 11450 ms
Java: 30234 ms
Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X
28
Preflow-push Scalability
30
What performance did we expect?
Figure: time versus number of threads, broken down into Idle, Serial, GC, parallel Compute, Mis-speculation, and Synchronization, …; the categories are annotated as Measured, Indirectly, or Error
31
What performance did we expect?
Naïve: $r(x) = t_1 / x$
Amdahl: $r(x) = t_p / x + t_s$
where $t_1 = t_p + t_s$ and $t_s = t_{idle} + t_{gc} + t_{serial}$
Simple: $r(x) = \frac{t_p \cdot (i_x / i_1)}{x} + t_s$
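The three models above can be evaluated directly. The sketch below is an illustration under assumptions: the class `ExpectedRuntime` and its method names are hypothetical, and `ix` and `i1` are assumed to be the iteration counts observed with x threads and with one thread, which the slide does not spell out. The inputs in `main` are made-up values chosen only to show the arithmetic, not measurements from the talk.

```java
import java.util.Locale;

// Hypothetical helper class; not part of Galois.
public final class ExpectedRuntime {
    private ExpectedRuntime() {}

    /** Naive model: all of the one-thread time t1 scales with x threads. */
    public static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: only the parallel part tp scales; ts = idle + gc + serial stays fixed. */
    public static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /**
     * Simple model: the parallel part is first inflated by the ratio of
     * iterations at x threads (ix) to iterations at one thread (i1),
     * which accounts for extra work such as aborted iterations.
     */
    public static double simple(double tp, double ts, double ix, double i1, int x) {
        return tp * (ix / i1) / x + ts;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: tp = 80, ts = 20, 20% more iterations at 4 threads.
        double tp = 80, ts = 20;
        int x = 4;
        System.out.println(String.format(Locale.ROOT, "naive:  %.2f", naive(tp + ts, x)));
        System.out.println(String.format(Locale.ROOT, "amdahl: %.2f", amdahl(tp, ts, x)));
        System.out.println(String.format(Locale.ROOT, "simple: %.2f", simple(tp, ts, 120, 100, x)));
    }
}
```

With $i_x = i_1$ the simple model reduces to Amdahl, and with $t_s = 0$ Amdahl reduces to the naive model, so each model refines the previous one.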
32
Barnes-Hut
33
Delaunay Mesh Refinement
34
Preflow-push
35
Summary
Many profitable optimizations
– Selecting among method flags, worklists, graph variants
Open topics
– Automation
– Static, dynamic and performance analysis
– Efficient ordered algorithms