1
A Parallelism Profiler with What-If Analyses for OpenMP Programs
Nader Boushehrinejadmoradi, Adarsh Yoga, Santosh Nagarakatte (Rutgers University), SC '18
2
Incremental parallelization
OpenMP is feature rich: work-sharing, tasking, SIMD, and offload constructs. It enables incremental parallelization, often one pragma at a time:

    #pragma omp parallel for
    for(int i = 0; i < n; ++i)
        compute(i);
3
OpenMP Program Performance Analysis
Serial execution:
    $ ./prog_ser
    Running time: 120s

Parallel execution (2 cores):
    $ ./prog_omp
    Running time: 65s
    1.8x speedup on a 2-core system

Parallel execution (16 cores):
    $ ./prog_omp
    Running time: 50s
    2.4x speedup on a 16-core system
4
OpenMP Program Performance Analysis
Going from 2 cores to 16 cores improves the speedup only from 1.8x to 2.4x. Why is the program not performance portable?
5
Why is a Program Not Performance Portable?
Possible causes: lack of work, serialization bottlenecks, secondary effects, and runtime overhead. The focus of this talk is serialization bottlenecks: identifying the regions of the program that are responsible for them.
6
Contributions
- A novel performance model to identify serialization bottlenecks: a novel OpenMP series-parallel graph captures the logical series-parallel relationships, combined with fine-grained measurements.
- What-if analyses to estimate performance improvements before designing concrete optimizations; surprisingly effective in identifying the bottlenecks that have to be optimized first.
- Open source.
7
Performance Model for What-If Analyses
8
Performance Model Overview
Capture the logical series-parallel relation between the different fragments of an OpenMP program. The OpenMP Series-Parallel Graph (OSPG) captures these relations, is schedule independent, and is combined with fine-grained measurements of each fragment.
9
Code Fragments in OpenMP Programs
OpenMP code snippet:

    ...
    a();
    #pragma omp parallel
      b();
    c();

[Figure: the execution structure; fragment a, then the b fragments running on each thread of the parallel region, then fragment c]

A code fragment is the longest sequence of instructions in the dynamic execution before encountering an OpenMP construct.
10
W-Nodes in OSPG

[Figure: the execution structure (fragments a1, b2, b3, c4) and the corresponding OSPG W-nodes W1 through W4]

An OSPG W-node represents a code fragment in the dynamic execution.
11
Capturing Series-Parallel Relation
[Figure: an OSPG for the example; P-nodes P1 and P2 sit above the W-nodes of the parallel fragments, under S-node S2]

P-nodes capture the parallel relation; S-nodes capture the series relation.
12
Capturing Series-Parallel Relation
Determine the series-parallel relation between any pair of W-nodes with an LCA query: check the type of the LCA's child on the path to the left W-node. If it is a P-node, the two fragments execute in parallel; otherwise, they execute in series.

[Figure: OSPG with root S1 over W1, S2, and W4; S2 has P-nodes P1 and P2 over W2 and W3]

Example: S2 = LCA(W2, W3), and P1 = Left-Child(S2, W2, W3) is a P-node, so W2 and W3 execute in parallel.
13
Capturing Series-Parallel Relation
A second example on the same OSPG: S1 = LCA(W2, W4), and S2 = Left-Child(S1, W2, W4) is an S-node, so W2 and W4 execute in series.
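To make the query concrete, here is a minimal C++ sketch of the LCA-based check (our illustration of the idea, not OMP-WHIP's implementation), assuming each node stores its parent and depth, and that callers pass the W-node that comes first in left-to-right OSPG order as `left`:

    // Node kinds follow the slides: W-nodes are leaves; S- and P-nodes are internal.
    enum class Kind { W, S, P };

    struct Node {
      Kind kind;
      Node* parent;   // nullptr for the root
      int depth;      // root has depth 0
    };

    // Classic LCA by walking up: equalize depths, then climb together.
    Node* lca(Node* a, Node* b) {
      while (a->depth > b->depth) a = a->parent;
      while (b->depth > a->depth) b = b->parent;
      while (a != b) { a = a->parent; b = b->parent; }
      return a;
    }

    // The child of `ancestor` on the path down to W-node `w`.
    Node* childTowards(Node* ancestor, Node* w) {
      while (w->parent != ancestor) w = w->parent;
      return w;
    }

    // Two fragments may run in parallel iff the LCA's child on the path
    // to the left W-node is a P-node; otherwise they execute in series.
    bool inParallel(Node* left, Node* right) {
      return childTowards(lca(left, right), left)->kind == Kind::P;
    }

Because the OSPG is schedule independent, this query gives the same answer no matter how the runtime actually scheduled the two fragments.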
14
Illustrative Example: a merge sort program parallelized with OpenMP

    void mergeSort(int* arr, int s, int e);

    int main(){
      int n;
      int* arr = init(&n);
      #pragma omp parallel
      #pragma omp single
      mergeSort(arr, 0, n);
    }

    void mergeSort(int* arr, int s, int e){
      if (e - s <= CUT_OFF) {     // small ranges are sorted serially
        serialSort(arr, s, e);
        return;
      }
      int mid = s + (e - s) / 2;
      #pragma omp task
      mergeSort(arr, s, mid);
      mergeSort(arr, mid + 1, e);
      #pragma omp taskwait
      merge(arr, s, e);
    }
15
OSPG Construction

Executing main:

    int main(){
      int n;
      int* arr = init(&n);
      #pragma omp parallel
      #pragma omp single
      mergeSort(arr, 0, n);
    }

[Figure: the OSPG after main starts executing: root S0 with W-node W0 and S-node S1, P-nodes P0 and P1 under S1, and W-node W1]
16
OSPG Construction (continued)

Executing mergeSort:

    void mergeSort(int* arr, int s, int e){
      if (e - s <= CUT_OFF) { serialSort(arr, s, e); return; }
      int mid = s + (e - s) / 2;
      #pragma omp task
      mergeSort(arr, s, mid);
      mergeSort(arr, mid + 1, e);
      #pragma omp taskwait
      merge(arr, s, e);
    }

[Figure: the OSPG grows as mergeSort executes: W-nodes W2 and W5, S-node S2, and P-nodes P2 and P3 over W3 and W4 are added under the parallel region's P-nodes]
17
Parallelism Computation Using OSPG
18
Compute Parallelism

- Measure the work in each W-node.
- Compute the work for each internal node by summing its children.

[Figure: the OSPG annotated with work (W) values: 210 at the root, 6 for main's fragment, 204 for the parallel region, 2 for each fragment around the tasks, and 100 for each task subtree]
19
Compute Serial Work

- Measure the work in each W-node.
- Compute the work for each internal node.
- Identify the serial work on the critical path.

[Figure: the same annotated OSPG]
20
Compute Serial Work

- Measure the work in each W-node.
- Compute the work for each internal node.
- Compute the serial work (SW) for each internal node.

[Figure: the OSPG annotated with work and serial work: the root S0 has W = 210 and SW = 110; the parallel region has W = 204 and SW = 104; each task subtree has W = 100 and SW = 100]
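The slide's numbers follow from a bottom-up pass over the OSPG. Below is a minimal C++ sketch (our reconstruction from the slide's numbers, not OMP-WHIP's code), extending the earlier sketch's node type with measurement fields; the key assumption is that consecutive P-node children of a node join at the same point, so such a run contributes only the longest serial work among its members:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    enum class Kind { W, S, P };

    struct Node {
      Kind kind;
      uint64_t work = 0;        // measured directly for W-node leaves
      uint64_t serialWork = 0;  // computed: work on the critical path
      bool selected = false;    // used by the what-if sketch later
      std::vector<Node*> children;
    };

    void compute(Node* n) {
      if (n->kind == Kind::W) { n->serialWork = n->work; return; }
      uint64_t groupMax = 0;    // longest member of the current run of P-children
      for (Node* c : n->children) {
        compute(c);
        n->work += c->work;     // work always sums
        if (c->kind == Kind::P) {
          groupMax = std::max(groupMax, c->serialWork);
        } else {                // a non-P child runs after the pending tasks join
          n->serialWork += groupMax + c->serialWork;
          groupMax = 0;
        }
      }
      n->serialWork += groupMax; // flush a trailing parallel group
    }

On the example this yields 2 + 100 + 2 = 104 inside the parallel region and 6 + 104 = 110 at the root, matching the figure; parallelism is work divided by serial work, 210 / 110 ≈ 1.91.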
21
Parallelism Profile

Aggregate the parallelism at OpenMP constructs: each OSPG subtree is attributed to the construct that created it (main at line 1, the omp parallel at line 3, and the omp task constructs at lines 11 and 13).

[Figure: the annotated OSPG with subtrees labeled main L1, omp parallel L3, omp task L11, and omp task L13]
22
Parallelism Profile

    Line Number       Work   Serial Work   Parallelism   Serial Work %
    program:1          210   110           1.91           5.4
    omp parallel:3     204   104           1.96           3.5
    omp task:11        100   100           1.00          91.1
    omp task:13

Parallelism is work divided by serial work (e.g., 210 / 110 ≈ 1.91 for the whole program). The profile pinpoints the task at line 11 as the serialization bottleneck: its parallelism is 1.00 and it accounts for 91.1% of the serial work.
23
What-if analyses estimate the increase in parallelism from hypothetically optimizing a selected region of the program, before any concrete optimization is designed.
24
Example: What-If Analyses

    void mergeSort(int* arr, int s, int e){
      if (e - s <= CUT_OFF) {
        serialSort(arr, s, e);
        return;
      }
      int mid = s + (e - s) / 2;
      #pragma omp task
      mergeSort(arr, s, mid);
      mergeSort(arr, mid + 1, e);
      #pragma omp taskwait
      merge(arr, s, e);
    }

The developer chooses regions for what-if analyses (a selected region of this code is highlighted on the slide). The tool then estimates the improvement in parallelism by reducing the serial work of the corresponding W-nodes.
25
Compute What-if Profile
    Line Number       Work   Serial Work   Parallelism   Serial Work %
    program:1          210    16           13.1          37.5
    omp parallel:3     204    10           20.4          25
    omp task:11        100     6           16.0
    omp task:13

Hypothetically optimizing the selected region drops the program's serial work from 110 to 16 and raises the estimated parallelism from 1.91 to 13.1, so this region is worth optimizing first.
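The estimate can be computed with the same traversal. A sketch (again our reconstruction, reusing Kind and Node from the earlier sketch), under the assumption that the hypothetical optimization makes each selected fragment scale perfectly across the machine's cores, shrinking its critical-path contribution but not its total work:

    // Recompute serial work as if each selected W-node's work were spread
    // over `cores` cores; measured work itself is left unchanged.
    void computeWhatIf(Node* n, int cores) {
      if (n->kind == Kind::W) {
        n->serialWork = n->selected ? n->work / cores : n->work;
        return;
      }
      n->serialWork = 0;        // recompute from scratch
      uint64_t groupMax = 0;
      for (Node* c : n->children) {
        computeWhatIf(c, cores);
        if (c->kind == Kind::P) {
          groupMax = std::max(groupMax, c->serialWork);
        } else {
          n->serialWork += groupMax + c->serialWork;
          groupMax = 0;
        }
      }
      n->serialWork += groupMax;
    }

With the two task leaves selected and cores = 16, each task's serial work becomes 100 / 16 = 6, the parallel region's becomes 2 + 6 + 2 = 10, and the program's becomes 6 + 10 = 16, matching the what-if profile above (parallelism 210 / 16 ≈ 13.1).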
26
Prototype: OMP-WHIP, our profiler for OpenMP programs

OMP-WHIP uses OMPT for instrumentation. Running the program produces a trace of OMPT event callbacks; from this trace, OMP-WHIP computes the parallelism profile and, for annotated regions, the what-if profile, which together expose the serialization bottlenecks.
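For context, OMPT is the standard tools interface of OpenMP 5.0: the runtime discovers a tool at startup and delivers event callbacks to it. A minimal tool skeleton looks like the following; this is a generic sketch of the registration mechanism, not OMP-WHIP's code:

    #include <omp-tools.h>
    #include <stdio.h>

    // Called at the start of every parallel region (OMPT 5.0 signature).
    static void on_parallel_begin(ompt_data_t *encountering_task_data,
                                  const ompt_frame_t *encountering_task_frame,
                                  ompt_data_t *parallel_data,
                                  unsigned int requested_parallelism,
                                  int flags, const void *codeptr_ra) {
      // A profiler would create the corresponding OSPG nodes here.
      printf("parallel region begins, %u threads requested\n",
             requested_parallelism);
    }

    static int my_init(ompt_function_lookup_t lookup, int initial_device_num,
                       ompt_data_t *tool_data) {
      ompt_set_callback_t set_callback =
          (ompt_set_callback_t)lookup("ompt_set_callback");
      set_callback(ompt_callback_parallel_begin,
                   (ompt_callback_t)on_parallel_begin);
      return 1;  // nonzero keeps the tool active
    }

    static void my_fini(ompt_data_t *tool_data) {}

    // The OpenMP runtime calls this during startup when the tool library is
    // linked in or preloaded (e.g., named in OMP_TOOL_LIBRARIES).
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
      static ompt_start_tool_result_t result = {&my_init, &my_fini, {0}};
      return &result;
    }

A real profiler like OMP-WHIP would register additional callbacks (task creation, task scheduling, synchronization) to build its series-parallel graph and attribute measured work to W-nodes.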
27
Evaluation

Tested 43 OpenMP applications written with the OpenMP common core.

Was it effective?
- Identified bottlenecks in all applications.
- Identified the regions that matter for parallelism using what-if analyses.
28
Was it Effective?

    Application   Initial speedup   Optimized speedup   Change summary
    AMGmk          5.4               9.1                Parallelize loop regions
    QuickSilver   11.8              12.8                Change loop scheduling
    Del Triang     1.1               9.2                Parallelize compute loop
    Min Span       1.9               7.6                Parallelize sort using tasks
    NBody                           14.8                Recursive decomposition using tasks
    CHull          2.1              11.1                Parallelize loop region and add tasking
    Strassen      14                15.6                Increase task count
29
Use Case: AMGmk

Initial speedup: 5.4x. Kernels: Relax, Axpy, Matvec.

Initial parallelism profile:

    Line Number   Parallelism   Serial work %
    Program        6.93         47.95
    relax.c:91    13.63         29.17
    csr.c:172     10.57         20.36
    relax.c:87    11.53          1.58

What-if profile:

    Line Number   Parallelism   Serial work %
    Program       11.62         20.47
    relax.c:91    13.11         52.32
    csr.c:172     10.56         21.43
    relax.c:87    11.4           2.75
30
Use Case: AMGmk (continued)

Optimized speedup: 9.1x, up from the initial 5.4x.

What-if profile: as on the previous slide.

Optimized parallelism profile:

    Line Number   Parallelism   Serial work %
    Program       11.43         17.44
    relax.c:91    13.11         51.87
    csr.c:179     15.79         21.31
    vect.c:383     9.14          3.52
31
Was it Practical to Use?
- 62% average profiling overhead compared to parallel execution.
- 28% average memory overhead; only a small fraction of the OSPG is in memory at any time.
- An on-the-fly profiling mode analyzes long-running programs, eliminating the need for logs and offline analysis.
32
Related Work
33
Conclusion and Future Work
- A novel performance model to identify serialization bottlenecks.
- What-if analyses to estimate performance improvements.
- A first step toward characterizing the performance of OpenMP programs.
Future work: identifying the right amount of parallelism, and support for offloading.
34
Thank You
OMP-WHIP is available online.