Galois Performance
Mario Mendez-Lojo, Donald Nguyen
Overview
The Galois system is a test bed for exploring optimizations
– Safe but not fast out of the box
Important optimizations
– Select the least transactional overhead
– Select the right scheduling
– Select the appropriate data structure
Quantify optimizations on applications
Algorithms
Irregular algorithms, classified by
– Topology: general graph, grid, tree
– Operator: morph, local computation, reader
– Ordering: unordered, ordered
Applications
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push
Methodology
[Figure: time vs. threads, broken down into Idle, Serial, GC, and Compute]
Abort ratio: aborted iterations / total iterations
GC options
– UseParallelGC
– UseParallelOldGC
– NewRatio=1
Terms
Base – Default scheduling, default graph
Serial – Galois classes replaced by classes with no concurrency control
Speedup – Measured against the best mean performance of a serial variant
Throughput – Number of serial iterations / time
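The ratios defined on the Methodology and Terms slides can be sketched as plain arithmetic. This is a minimal illustration, not Galois code: the class `RunMetrics` and its method names are hypothetical, and the choice of milliseconds as the time unit is an assumption.

```java
/** Hypothetical helper for the ratios on the slides; not part of Galois. */
final class RunMetrics {
    private RunMetrics() {}

    /** Abort ratio: aborted iterations divided by total iterations. */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0 || abortedIterations < 0) {
            throw new IllegalArgumentException("iteration counts out of range");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup: best mean serial time divided by the parallel time. */
    static double speedup(double bestSerialMeanMs, double parallelMs) {
        if (parallelMs <= 0) {
            throw new IllegalArgumentException("parallel time must be positive");
        }
        return bestSerialMeanMs / parallelMs;
    }

    /** Throughput: number of serial iterations per unit of time (here, per ms). */
    static double throughput(long serialIterations, double timeMs) {
        if (timeMs <= 0) {
            throw new IllegalArgumentException("time must be positive");
        }
        return serialIterations / timeMs;
    }
}
```

Dividing by the best serial variant rather than by the one-thread parallel run keeps the speedup from being inflated by concurrency-control overhead in the baseline.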
Numbers
Runtime
– Last of 5 runs in the same VM
– Ignores time to read and construct the initial graph
Other statistics
– Last of 5 runs
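The measurement protocol above (five runs in one VM, report the last, exclude graph construction) might look like the following harness. It is a sketch under assumptions: `LastOfFiveTimer`, `setup`, and `body` are hypothetical names, and the authors' actual harness is not shown in the slides.

```java
/** Hypothetical timing harness; illustrates "last of 5 runs in the same VM". */
final class LastOfFiveTimer {
    static final int RUNS = 5;

    private LastOfFiveTimer() {}

    /**
     * Runs setup (untimed, e.g. reading and building the input graph) and then
     * body (timed) RUNS times in this VM; returns the elapsed nanoseconds of
     * the last run only, so earlier runs serve as JIT warm-up.
     */
    static long timeLastRunNanos(Runnable setup, Runnable body) {
        long elapsed = 0L;
        for (int run = 0; run < RUNS; run++) {
            setup.run();                      // excluded from the measurement
            long start = System.nanoTime();
            body.run();                       // the measured computation
            elapsed = System.nanoTime() - start;
        }
        return elapsed;
    }
}
```

Reporting the last run rather than the mean means the number reflects steady-state, JIT-compiled code.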
Test Environment
2 x Xeon X5570 (4 cores, 2.93 GHz)
Java 1.6.0_0-b11
Linux x86_64
20 GB heap size
BARNES-HUT
[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]
Barnes-Hut
N-body algorithm
– Oct-tree acceleration structure
– Serial: tree build, center of mass, particle update
– Parallel: force computation
Structure
– Reader on tree
Variants
– Splash2, Reader Galois
Reader Optimization
Before:
  child = octree.getNeighbor(nn, 1);
After:
  child = octree.getNeighbor(nn, 1, MethodFlag.NONE);
ParaMeter Profile
[Figure: ParaMeter profile]
Barnes-Hut Results
100,000 points, 1 time step
Best serial: base
Serial time: ms
Best parallel time: 1553 ms
Best speedup: 6.6X
Barnes-Hut Scalability
[Figure: scalability plot]
DELAUNAY MESH REFINEMENT
Delaunay Mesh Refinement
Refine “bad” triangles
– Maintained in worklist
Structure
– Cautious operator on graph
Variants
– Flag optimized, locallifo
base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
Cautious Optimization
Before:
  mesh.contains(item);
  ...
  mesh.remove(preNodes.get(i));
  ...
  mesh.add(node);
After:
  mesh.contains(item, MethodFlag.CHECK_CONFLICT);
  ...
  mesh.remove(preNodes.get(i), MethodFlag.NONE);
  ...
  mesh.add(node, MethodFlag.NONE);
No need to save undo information
Only check conflicts up to the first write
LIFO Optimization
Before:
  GaloisRuntime.foreach(..., Priority.defaultOrder());
After:
  GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
ParaMeter Profile
[Figure: ParaMeter profile]
DMR Results
0.5M triangles, 0.25M bad triangles
Best serial: locallifo.flagopt
Serial time: ms
Best parallel time: 3745 ms
Best speedup: 4.5X
PREFLOW-PUSH
Preflow-push
Max-flow algorithm
– Nodes push flow downhill
Structure
– Cautious, local computation
Variants
– Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)
Local Computation Optimization
  graph = ...
  b = new LocalComputationGraph.ObjectGraphBuilder();
  graph = b.from(graph).create()
ParaMeter Profile
[Figure: ParaMeter profile]
Preflow-push Results
From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
C: ms
Java: ms
Best serial: lc.flagopt
Serial time: ms
Best parallel time: ms
Best speedup: 3.1X
Preflow-push Scalability
[Figure: scalability plot]
What performance did we expect?
[Figure: time vs. threads, broken down into Idle, Serial, GC, parallel Compute, and Mis-speculation; annotations: Measured, Indirectly, Synchronization, …, Error]
What performance did we expect?
Naïve: $r(x) = t_1 / x$
Amdahl: $r(x) = t_p / x + t_s$, where $t_1 = t_p + t_s$ and $t_s = t_{idle} + t_{gc} + t_{serial}$
Simple: $r(x) = \frac{t_p \, (i_x / i_1)}{x} + t_s$
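The three runtime models above can be evaluated directly. The sketch below is illustrative only: the class `ExpectedRuntime` and its method names are hypothetical. It also assumes that i_x is the number of iterations executed with x threads and i_1 the number executed with one thread, a reading the slide does not spell out.

```java
/** Hypothetical evaluation of the three expected-runtime models; not Galois code. */
final class ExpectedRuntime {
    private ExpectedRuntime() {}

    private static void checkThreads(int x) {
        if (x < 1) throw new IllegalArgumentException("thread count must be >= 1");
    }

    /** t_s = t_idle + t_gc + t_serial. */
    static double serialPart(double tIdle, double tGc, double tSerial) {
        return tIdle + tGc + tSerial;
    }

    /** Naive model: r(x) = t_1 / x. */
    static double naive(double t1, int x) {
        checkThreads(x);
        return t1 / x;
    }

    /** Amdahl model: r(x) = t_p / x + t_s. */
    static double amdahl(double tp, double ts, int x) {
        checkThreads(x);
        return tp / x + ts;
    }

    /**
     * Simple model: r(x) = (t_p * (i_x / i_1)) / x + t_s.
     * Assumption: ix and i1 are iteration counts at x threads and 1 thread,
     * so the ratio scales the parallel work by the extra iterations executed.
     */
    static double simple(double tp, double ts, long ix, long i1, int x) {
        checkThreads(x);
        if (i1 <= 0) throw new IllegalArgumentException("i1 must be positive");
        return tp * ((double) ix / i1) / x + ts;
    }
}
```

With t_p = 7000, t_s = 1000, and x = 4, the Amdahl model gives 2750; if the 4-thread run executes 20% more iterations than the 1-thread run, the simple model gives 3100 in the same time unit.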
Barnes-Hut
Delaunay Mesh Refinement
Preflow-push
Summary
Many profitable optimizations
– Selecting among method flags, worklists, graph variants
Open topics
– Automation
– Static, dynamic and performance analysis
– Efficient ordered algorithms