A Parallelism Profiler with What-If Analyses for OpenMP Programs
Nader Boushehrinejadmoradi, Adarsh Yoga, Santosh Nagarakatte
Rutgers University, SC '18
OpenMP
- Incremental parallelization
- Feature rich: work-sharing, tasking, SIMD, offload

#pragma omp parallel for
for (int i = 0; i < n; ++i)
  compute(i);
OpenMP Program Performance Analysis
Serial execution: $ ./prog_ser -- running time: 120s
Parallel execution (2 cores): $ ./prog_omp -- running time: 65s (1.8x speedup on a 2-core system)
Parallel execution (16 cores): $ ./prog_omp -- running time: 50s (2.4x speedup on a 16-core system)
Why is the program not performance portable?
Why is a Program Not Performance Portable?
- Lack of work
- Serialization bottlenecks (focus of our talk)
- Secondary effects
- Runtime overhead
Goal: identify regions that are responsible for serialization bottlenecks
Contributions
- A novel performance model to identify serialization bottlenecks
  - Captures the logical series-parallel relationship plus fine-grained measurements
  - Novel OpenMP series-parallel graph (OSPG)
- What-if analyses to estimate performance improvements before designing concrete optimizations
  - Surprisingly effective in identifying the bottlenecks that have to be optimized first
- Open source: https://github.com/rutgers-apl/omp-whip
Performance Model for What-If Analyses
Performance Model Overview
- Capture the logical series-parallel relation between different fragments of an OpenMP program
- The OpenMP series-parallel graph (OSPG) captures these relations
  - Schedule independent
- Fine-grained measurements
Code Fragments in OpenMP Programs

OpenMP code snippet:
...
a();
#pragma omp parallel
b();
c();

A code fragment is the longest sequence of instructions in the dynamic execution before encountering an OpenMP construct. [Figure: execution structure with fragment a, the parallel b fragments, and c]
W-Nodes in OSPG
An OSPG W-node represents a code fragment in the dynamic execution. [Figure: fragments a1, b2, b3, c4 mapped to W-nodes W1-W4]
Capturing the Series-Parallel Relation
- P-nodes capture the parallel relation
- S-nodes capture the series relation
[Figure: OSPG with S-node S2 and P-nodes P1, P2 above W-nodes W1-W4]
Capturing the Series-Parallel Relation
Determine the series-parallel relation between any pair of W-nodes with an LCA query: check the kind of the LCA's child on the path to the left W-node. If it is a P-node, the two execute in parallel; otherwise, they execute in series.
[Figure: S1 over W1, S2, W4; S2 over P1, P2; W2 under P1, W3 under P2]
- W2 vs. W3: S2 = LCA(W2, W3); its child toward W2 is P1, so W2 and W3 execute in parallel
- W2 vs. W4: S1 = LCA(W2, W4); its child toward W2 is S2, so W2 and W4 execute in series
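The LCA query above can be sketched in a few lines. This is a minimal illustration, assuming each node stores its kind, parent, and depth; the names (`Node`, `inParallel`, `childToward`) are ours, not OMP-WHIP's.

```cpp
#include <cassert>

enum class Kind { Work, Series, Parallel };
struct Node {
  Kind kind;
  Node* parent;
  int depth;
};

// Lowest common ancestor by walking parent pointers.
Node* lca(Node* a, Node* b) {
  while (a->depth > b->depth) a = a->parent;
  while (b->depth > a->depth) b = b->parent;
  while (a != b) { a = a->parent; b = b->parent; }
  return a;
}

// Child of `anc` on the path down to `n`.
Node* childToward(Node* anc, Node* n) {
  while (n->parent != anc) n = n->parent;
  return n;
}

// Two distinct W-nodes execute in parallel iff the LCA's child on the
// path to the left (logically earlier) W-node is a P-node.
bool inParallel(Node* left, Node* right) {
  Node* anc = lca(left, right);
  return childToward(anc, left)->kind == Kind::Parallel;
}
```

Rebuilding the OSPG fragment from the slide (S1 over W1, S2, W4; S2 over P1, P2; W2 under P1, W3 under P2), `inParallel(W2, W3)` is true via P1, while `inParallel(W2, W4)` is false via S2.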
Illustrative Example
Merge sort parallelized with OpenMP:

void main() {
  int* arr = init(&n);
  #pragma omp parallel
  #pragma omp single
  mergeSort(arr, 0, n);
}

void mergeSort(int* arr, int s, int e) {
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e - s) / 2;
  #pragma omp task
  mergeSort(arr, s, mid);
  mergeSort(arr, mid + 1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}
OSPG Construction

void main() {
  int* arr = init(&n);
  #pragma omp parallel
  #pragma omp single
  mergeSort(arr, 0, n);
}

[Figure: OSPG after the parallel/single construct, with nodes S0, W0, S1, P0, P1, W1]
OSPG Construction

void mergeSort(int* arr, int s, int e) {
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e - s) / 2;
  #pragma omp task
  mergeSort(arr, s, mid);
  mergeSort(arr, mid + 1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}

[Figure: OSPG grows with nodes W2, S2, W5, P2, P3, W3, W4 as the task constructs execute]
Parallelism Computation Using OSPG
Compute Parallelism
- Measure the work in each W-node
- Compute the work of each internal node (sum over its children)
[Figure: OSPG annotated with work values (leaves 6, 2, 100, 100, 2; internal nodes sum to 200, 204, and 210 at the root)]
Compute Serial Work
- Measure the work in each W-node
- Compute the work of each internal node
- Identify the serial work on the critical path
Compute Serial Work
- Measure the work in each W-node
- Compute the work and serial work (SW) of each internal node
[Figure: the same OSPG annotated with serial work: the parallel siblings contribute max(100, 100) = 100, giving SW = 104 under the parallel region and SW = 110 at the root, against total work 210]
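The bottom-up pass can be sketched with a simplified model, assuming children of an S-node run in series and children of a P-node run in parallel. (The full OSPG semantics, where a P-node's subtree runs in parallel with its right siblings, needs a slightly more involved critical-path pass; this shows the idea, not OMP-WHIP's exact algorithm.)

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Kind { Work, Series, Parallel };
struct Node {
  Kind kind;
  double work = 0;        // measured for W-nodes, computed for internal nodes
  double serialWork = 0;  // critical-path work through this subtree
  std::vector<Node*> children;
};

void computeProfile(Node* n) {
  if (n->kind == Kind::Work) {   // leaf: fine-grained measurement
    n->serialWork = n->work;
    return;
  }
  for (Node* c : n->children) {
    computeProfile(c);
    n->work += c->work;          // work always sums over children
    if (n->kind == Kind::Series)
      n->serialWork += c->serialWork;                         // series: add
    else
      n->serialWork = std::max(n->serialWork, c->serialWork); // parallel: max
  }
}
// Parallelism of a subtree = work / serialWork.
```

For a root S-node over leaves {6, P-node over {100, 100}, 4}, this yields work 210 and serial work 110, so parallelism is about 1.91, matching the slide's running example.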
Parallelism Profile
Aggregate work and serial work at OpenMP constructs: main (line 1), omp parallel (line 3), omp task (line 11), omp task (line 13)
[Figure: OSPG nodes attributed to the source lines that created them]
Parallelism Profile

Line Number      Work   Serial Work   Parallelism   Serial Work %
program:1        210    110           1.91          5.4
omp parallel:3   204    104           1.96          3.5
omp task:11      100    100           1.00          91.1
omp task:13

(Parallelism = Work / Serial Work; e.g., 210 / 110 = 1.91.)
We can estimate the increase in parallelism from hypothetically optimizing a region of the program.
Example: What-If Analyses
The developer chooses regions for what-if analyses; we estimate the improvement in parallelism by reducing the serial work of the corresponding W-nodes.

Selected region within mergeSort:

void mergeSort(int* arr, int s, int e) {
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e - s) / 2;
  #pragma omp task
  mergeSort(arr, s, mid);
  mergeSort(arr, mid + 1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}
Compute What-If Profile

Line Number      Work   Serial Work   Parallelism   Serial Work %
program:1        210    16            13.1          37.5
omp parallel:3   204    10            20.4          25
omp task:11      100    6             16.0
omp task:13
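A back-of-the-envelope version of the estimate can be written as a closed form: if a region contributing `regionSerial` to the critical path were sped up by a factor k, the critical path shrinks accordingly. This helper is our simplification for illustration; OMP-WHIP instead recomputes over the OSPG, since the critical path can shift to other regions once one region is optimized.

```cpp
#include <cassert>
#include <cmath>

// Estimated parallelism after a hypothetical k-fold speedup of a region
// that contributes `regionSerial` to the critical path.
double estimateParallelism(double totalWork, double totalSerial,
                           double regionSerial, double k) {
  double newSerial = totalSerial - regionSerial + regionSerial / k;
  return totalWork / newSerial;
}
```

With the running example (work 210, serial work 110, of which 100 comes from the serial region), a hypothetical 25x speedup of that region gives 210 / (10 + 4) = 15, in the same ballpark as the slide's what-if parallelism of 13.1.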
Prototype
OMP-WHIP, our profiler for OpenMP programs, uses OMPT event callbacks for instrumentation. Running the program produces a trace; together with the regions annotated for what-if analysis, the trace yields a parallelism profile and a what-if profile that pinpoint serialization bottlenecks.
Evaluation
- Tested 43 OpenMP applications written in the OpenMP common core
Was it effective?
- Identified bottlenecks in all applications
- Identified the regions that matter for parallelism using what-if analyses
Was it Effective?

Application   Initial speedup   Opt speedup   Change summary
AMGmk         5.4               9.1           Parallelize loop regions
QuickSilver   11.8              12.8          Change loop scheduling
Del Triang    1.1               9.2           Parallelize compute loop
Min Span      1.9               7.6           Parallelize sort using tasks
NBody                           14.8          Recursive decomposition using tasks
CHull         2.1               11.1          Parallelize loop region and add tasking
Strassen      14                15.6          Increase task count
Use Case: AMGmk (initial speedup: 5.4x; kernels: Relax, Axpy, Matvec)

Initial parallelism profile:
Line Number   Parallelism   Serial work %
Program       6.93          47.95
relax.c:91    13.63         29.17
csr.c:172     10.57         20.36
relax.c:87    11.53         1.58

What-if profile:
Line Number   Parallelism   Serial work %
Program       11.62         20.47
relax.c:91    13.11         52.32
csr.c:172     10.56         21.43
relax.c:87    11.4          2.75
Use Case: AMGmk (optimized speedup: 9.1x, up from 5.4x)

What-if profile:
Line Number   Parallelism   Serial work %
Program       11.62         20.47
relax.c:91    13.11         52.32
csr.c:172     10.56         21.43
relax.c:87    11.4          2.75

Optimized parallelism profile:
Line Number   Parallelism   Serial work %
Program       11.43         17.44
relax.c:91    13.11         51.87
csr.c:179     15.79         21.31
vect.c:383    9.14          3.52
Was it Practical to Use?
- 62% average profiling overhead compared to parallel execution
- 28% average memory overhead: only a small fraction of the OSPG is in memory at a time
- On-the-fly profiling mode to analyze long-running programs, eliminating the need for logs and offline analysis
Related Work https://www.openmp.org/resources/openmp-compilers-tools/
Conclusion and Future Work
- A novel performance model to identify serialization bottlenecks
- What-if analyses to estimate performance improvements
- A first step toward characterizing the performance of OpenMP programs
Future work: identify the right amount of parallelism; offloading support
Thank You OMP-WHIP is available online https://github.com/rutgers-apl/omp-whip