Galois Performance Mario Mendez-Lojo Donald Nguyen.

1 Galois Performance Mario Mendez-Lojo Donald Nguyen

2 Overview
Galois system is a test bed for exploring optimizations
– Safe but not fast out of the box
Important optimizations
– Select the least transactional overhead
– Select the right scheduling
– Select an appropriate data structure
Quantify optimizations on applications

3 Algorithms
Irregular algorithms, classified by
– Topology: general graph, grid, tree
– Operator: morph, local computation, reader
– Ordering: unordered, ordered
Applications
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push

4 Methodology
(Figure: time versus number of threads, broken down into Idle, Serial, GC, and Compute)
Abort ratio: aborted iterations / total iterations
GC options: UseParallelGC, UseParallelOldGC, NewRatio=1

5 Terms
Base – Default scheduling, default graph
Serial – Galois classes replaced by classes without concurrency control
Speedup – Measured against the best mean performance of a serial variant
Throughput – Number of serial iterations / time
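
The ratios defined on the Methodology and Terms slides can be written down as a minimal sketch. The class and method names below are hypothetical and are not part of Galois; the code only restates the definitions from the slides.

```java
/** Hypothetical helper restating the ratios from the slides; not a Galois class. */
final class RunMetrics {
    private RunMetrics() {}

    /** Abort ratio = aborted iterations / total iterations. */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0) {
            throw new IllegalArgumentException("total iterations must be positive");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup = time of the best serial variant / time of the parallel run. */
    static double speedup(double bestSerialMs, double parallelMs) {
        if (parallelMs <= 0) {
            throw new IllegalArgumentException("parallel time must be positive");
        }
        return bestSerialMs / parallelMs;
    }

    /** Throughput = number of serial iterations / time. */
    static double throughput(long serialIterations, double timeMs) {
        if (timeMs <= 0) {
            throw new IllegalArgumentException("time must be positive");
        }
        return serialIterations / timeMs;
    }
}
```

For example, `RunMetrics.speedup(10271, 1553)` evaluates to roughly 6.6, which matches the Barnes-Hut speedup reported later in the deck.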

6 Numbers
Runtime
– Last of 5 runs in the same VM
– Ignores time to read and construct the initial graph
Other statistics
– Last of 5 runs
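
A measurement loop of this shape could look like the sketch below. It is an illustration only: the class name, the `setup`/`workload` split, and the use of `System.nanoTime()` are assumptions, not the authors' harness. The slides do not state why the last run is used; excluding JIT warm-up is one plausible reason.

```java
/** Hypothetical timing harness: run several times in one JVM, keep the last run. */
final class LastOfRuns {
    private LastOfRuns() {}

    /**
     * Runs setup + workload `runs` times and returns the elapsed nanoseconds
     * of the final workload run. Setup (for example reading and constructing
     * the initial graph) is executed outside the timed region.
     */
    static long timeLastRun(Runnable setup, Runnable workload, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long lastElapsed = 0L;
        for (int i = 0; i < runs; i++) {
            setup.run();                      // not timed
            long start = System.nanoTime();
            workload.run();                   // timed
            lastElapsed = System.nanoTime() - start;
        }
        return lastElapsed;                   // earlier runs are discarded
    }
}
```

Calling `LastOfRuns.timeLastRun(buildGraph, runAlgorithm, 5)` with two hypothetical `Runnable`s would follow the "last of 5 runs in the same VM" rule from the slide.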

7 Test Environment
2 x Xeon X5570 (4 core, 2.93 GHz)
Java 1.6.0_0-b11
Linux 2.6.24-27 x86_64
20GB heap size

8 BARNES-HUT
(Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field)

9 Barnes-Hut
N-body algorithm
– Oct-tree acceleration structure
– Serial: tree build, center of mass, particle update
– Parallel: force computation
Structure
– Reader on tree
Variants
– Splash2, Reader Galois

10 Reader Optimization
Default:
child = octree.getNeighbor(nn, 1);
With method flag:
child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

11 ParaMeter Profile (figure)

12 Barnes-Hut Results
100,000 points, 1 time step
Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X

13 Barnes-Hut Results
100,000 points, 1 time step
Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X

14 Barnes-Hut Scalability (figure)

16 DELAUNAY MESH REFINEMENT

17 Delaunay Mesh Refinement
Refine “bad” triangles
– Maintained in worklist
Structure
– Cautious operator on graph
Variants
– Flag optimized, locallifo
base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

18 Cautious Optimization
Default:
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);
Flag optimized:
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);
No need to save undo info
Only check conflicts up to first write
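
The effect of per-call flags can be illustrated with a toy collection that merely counts how much bookkeeping each call triggers. Everything here is hypothetical: `ToyFlaggedSet` and its `Flag` enum are stand-ins invented for this sketch and do not reproduce the Galois `MethodFlag` implementation; the "conflict check" is only a counter. The sketch assumes that a call with all overheads enabled both checks for conflicts and records undo information.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy set that counts conflict checks and undo-log entries per call. */
final class ToyFlaggedSet {
    /** Hypothetical flags: ALL = check + undo log, CHECK_CONFLICT = check only, NONE = neither. */
    enum Flag { ALL, CHECK_CONFLICT, NONE }

    private final Set<String> items = new HashSet<>();
    private final List<String> undoLog = new ArrayList<>();
    private int conflictChecks = 0;

    boolean contains(String item, Flag flag) {
        if (flag != Flag.NONE) conflictChecks++;   // a read never needs undo info
        return items.contains(item);
    }

    void add(String item, Flag flag) {
        if (flag != Flag.NONE) conflictChecks++;
        boolean added = items.add(item);
        if (added && flag == Flag.ALL) undoLog.add("remove " + item); // how to revert
    }

    void remove(String item, Flag flag) {
        if (flag != Flag.NONE) conflictChecks++;
        boolean removed = items.remove(item);
        if (removed && flag == Flag.ALL) undoLog.add("add " + item);  // how to revert
    }

    int conflictChecks() { return conflictChecks; }
    int undoEntries() { return undoLog.size(); }
}
```

In this toy, a contains/remove/add sequence run with `Flag.ALL` performs three checks and logs two undo entries, while the flag-optimized sequence from the slide (check on the first read, `NONE` on the writes) performs one check and logs nothing. This mirrors the two points on the slide: a cautious operator touches its whole neighborhood before its first write, so later calls need neither undo information nor further conflict checks.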

19 LIFO Optimization
Default:
GaloisRuntime.foreach(..., Priority.defaultOrder());
Local LIFO:
GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
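
One way to picture a chunked-FIFO order refined by a local LIFO order is the single-threaded simulation below. It is a simplified reading of the rule names, not the Galois worklist implementation: the class name, the chunk size parameter, and the decision to ignore newly generated work are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy ordering: chunks are served first-in first-out, items inside a chunk last-in first-out. */
final class ChunkedFifoLocalLifo {
    private ChunkedFifoLocalLifo() {}

    static List<Integer> processingOrder(List<Integer> items, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        // Global level: a FIFO queue of chunks.
        Deque<Deque<Integer>> chunks = new ArrayDeque<>();
        Deque<Integer> current = null;
        for (Integer item : items) {
            if (current == null || current.size() == chunkSize) {
                current = new ArrayDeque<>();
                chunks.addLast(current);
            }
            current.push(item);            // local level: stack (LIFO)
        }
        // Drain: oldest chunk first, newest item within the chunk first.
        List<Integer> order = new ArrayList<>();
        while (!chunks.isEmpty()) {
            Deque<Integer> chunk = chunks.pollFirst();
            while (!chunk.isEmpty()) {
                order.add(chunk.pop());
            }
        }
        return order;
    }
}
```

With items 1 through 6 and a chunk size of 3, this toy processes them as 3, 2, 1, 6, 5, 4. The design intuition is that LIFO order within a chunk tends to revisit recently touched data, while FIFO order across chunks keeps older work from being starved.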

20 ParaMeter Profile (figure)

21 DMR Results
0.5M triangles, 0.25M bad triangles
Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X

23 PREFLOW-PUSH

24 Preflow-push
Max-flow algorithm
– Nodes push flow downhill
Structure
– Cautious, local computation
Variants
– Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)

25 Local Computation Optimization
graph =...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create()

26 ParaMeter Profile (figure)

27 Preflow-push Results
From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm
C: 11450 ms
Java: 30234 ms
Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X

28 Preflow-push Scalability (figure)

30 What performance did we expect?
(Figure: time versus number of threads, broken down into Idle, Serial, GC, parallel Compute, and Mis-speculation; annotations: Measured, Indirectly, "Synchronization, …", Error)

31 What performance did we expect?
Naïve: $r(x) = t_1 / x$
Amdahl: $r(x) = t_p / x + t_s$, where $t_1 = t_p + t_s$ and $t_s = t_{\text{idle}} + t_{\text{gc}} + t_{\text{serial}}$
Simple: $r(x) = \left(t_p \cdot (i_x / i_1)\right) / x + t_s$
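
The three models can be evaluated side by side with a small sketch like the one below. The class and parameter names are hypothetical. The slide does not define i_x; the code assumes it is the number of iterations executed with x threads (so i_x / i_1 captures extra work such as re-executed aborted iterations), and this reading is an assumption.

```java
/** Hypothetical evaluation of the naive, Amdahl, and "simple" runtime models. */
final class ExpectedRuntime {
    private ExpectedRuntime() {}

    private static void requireThreads(int x) {
        if (x < 1) throw new IllegalArgumentException("thread count must be >= 1");
    }

    /** t_s = t_idle + t_gc + t_serial. */
    static double serialPart(double tIdle, double tGc, double tSerial) {
        return tIdle + tGc + tSerial;
    }

    /** Naive: r(x) = t_1 / x. */
    static double naive(double t1, int x) {
        requireThreads(x);
        return t1 / x;
    }

    /** Amdahl: r(x) = t_p / x + t_s, with t_1 = t_p + t_s. */
    static double amdahl(double tp, double ts, int x) {
        requireThreads(x);
        return tp / x + ts;
    }

    /**
     * Simple: r(x) = (t_p * (i_x / i_1)) / x + t_s.
     * ix and i1 are ASSUMED to be iteration counts with x threads and 1 thread.
     */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        requireThreads(x);
        if (i1 <= 0) throw new IllegalArgumentException("i1 must be positive");
        return (tp * (ix / i1)) / x + ts;
    }
}
```

When i_x equals i_1 the simple model reduces to the Amdahl model, and when t_s is zero the Amdahl model reduces to the naive one, which is a quick consistency check on the three formulas.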

32 Barnes-Hut

33 Delaunay Mesh Refinement

34 Preflow-push

35 Summary
Many profitable optimizations
– Selecting among method flags, worklists, graph variants
Open topics
– Automation
– Static, dynamic and performance analysis
– Efficient ordered algorithms
