A Parallelism Profiler with What-If Analyses for OpenMP Programs

A Parallelism Profiler with What-If Analyses for OpenMP Programs. Nader Boushehrinejadmoradi, Adarsh Yoga, Santosh Nagarakatte. Rutgers University. SC'18.

OpenMP is feature rich: work-sharing, tasking, SIMD, and offload. It also enables incremental parallelization: a serial loop such as for(int i=0; i<n; ++i) compute(i); is parallelized by adding #pragma omp parallel for in front of it.
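As a minimal sketch of that incremental step (compute, n, and the loop body are placeholder names for illustration, not code from the talk; build with -fopenmp):

#include <stdio.h>

void compute(int i) { printf("item %d\n", i); }  /* stands in for the real per-iteration work */

int main(void) {
    int n = 8;
    /* The only change from the serial version is the pragma on the loop. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        compute(i);
    return 0;
}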

OpenMP Program Performance Analysis. Serial execution: $ ./prog_ser, running time 120s. Parallel execution on 2 cores: $ ./prog_omp, running time 65s, a 1.8x speedup on the 2-core system. Parallel execution on 16 cores: $ ./prog_omp, running time 50s, only a 2.4x speedup on the 16-core system.
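Speedup here is the serial running time divided by the parallel running time: 120s / 65s is about 1.8x on 2 cores, but 120s / 50s = 2.4x on 16 cores, far below the ideal 16x.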

The same numbers raise the central question: why is the program not performance portable? Serial execution takes 120s; the parallel version takes 65s on 2 cores (1.8x) but still 50s on 16 cores (only 2.4x).

Why is a Program Not Performance Portable? Possible causes: lack of work, serialization bottlenecks, secondary effects, and runtime overhead. The focus of this talk is serialization bottlenecks: identifying the regions that are responsible for them.

Contributions. A novel performance model to identify serialization bottlenecks: a novel OpenMP series-parallel graph captures the logical series-parallel relationships and is combined with fine-grained measurements. What-if analyses estimate performance improvements before concrete optimizations are designed, and are surprisingly effective in identifying the bottlenecks that have to be optimized first. Open source: https://github.com/rutgers-apl/omp-whip

Performance Model for what-if analyses

Performance Model Overview. Capture the logical series-parallel relation between the different fragments of an OpenMP program. The OpenMP Series-Parallel Graph (OSPG) captures these relations, is schedule independent, and is combined with fine-grained measurements.

Code Fragments in OpenMP Programs. OpenMP code snippet: … a(); #pragma omp parallel b(); c(); In the execution structure, a() runs first, b() runs in each thread of the parallel region, and c() runs after it. A code fragment is the longest sequence of instructions in the dynamic execution before encountering an OpenMP construct.

W-Nodes in OSPG. [Figure: the fragments a1, b2, b3, and c4 from the execution structure map to W-nodes W1 through W4.] An OSPG W-node represents a code fragment in the dynamic execution.

Capturing the Series-Parallel Relation. [Figure: OSPG for the snippet, with W1 (fragment a), an inner S-node S2 whose P-node children P1 and P2 contain W2 and W3 (the two parallel executions of b), and W4 (fragment c).] P-nodes capture the parallel relation; S-nodes capture the series relation.

Capturing the Series-Parallel Relation. The series-parallel relation between any pair of W-nodes is determined with an LCA query: check the type of the LCA's child on the path to the left W-node; if it is a P-node, the two fragments execute in parallel, otherwise they execute in series. Example: S2 = LCA(W2, W3) and P1 = Left-Child(S2, W2, W3); P1 is a P-node, so W2 and W3 execute in parallel.

In the same OSPG, S1 = LCA(W2, W4) and S2 = Left-Child(S1, W2, W4); S2 is an S-node, so W2 and W4 execute in series.
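A minimal sketch of this query over an OSPG stored with parent pointers (the node structure, field names, and helper are assumptions for illustration, not OMP-WHIP's actual data structures):

#include <stdbool.h>
#include <stddef.h>

typedef enum { W_NODE, S_NODE, P_NODE } node_kind;

typedef struct ospg_node {
    node_kind kind;
    int depth;                     /* distance from the OSPG root */
    struct ospg_node *parent;
} ospg_node;

/* Child of lca on the path from n up to lca. */
static ospg_node *child_toward(ospg_node *n, const ospg_node *lca) {
    while (n->parent != lca)
        n = n->parent;
    return n;
}

/* left must be the W-node that started executing earlier than right. */
bool runs_in_parallel(ospg_node *left, ospg_node *right) {
    ospg_node *a = left, *b = right;
    while (a->depth > b->depth) a = a->parent;          /* climb to equal depth */
    while (b->depth > a->depth) b = b->parent;
    while (a != b) { a = a->parent; b = b->parent; }    /* a == b is the LCA */
    /* Parallel iff the LCA's child on the left path is a P-node. */
    return child_toward(left, a)->kind == P_NODE;
}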

Illustrative Example. A merge sort program parallelized with OpenMP tasks:

int main(void){
  int* arr = init(&n);
  #pragma omp parallel
  #pragma omp single
  mergeSort(arr, 0, n);
}

void mergeSort(int* arr, int s, int e){
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e-s)/2;
  #pragma omp task
  mergeSort(arr, s, mid);
  mergeSort(arr, mid+1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}

OSPG Construction. For the main function of the example: [Figure: partial OSPG with root S0, W-node W0 for the fragment before the parallel region, and S1 with P-nodes P0 and P1 created for the parallel region; W1 is the fragment that executes mergeSort inside the single region.]

OSPG Construction (continued). As mergeSort (shown above) executes: [Figure: the OSPG grows below the parallel region with fragments W2 and W5 and a new S-node S2, whose P-node children P2 and P3 contain W3 and W4 for the task and its continuation.]

Parallelism Computation Using OSPG

Compute Parallelism. First measure the work in each W-node, then compute the work of each internal node from its children. [Figure: in the example OSPG, W0 has 6 units of work, W1 and W4 have 2 each, and W2 and W3 have 100 each; the internal nodes aggregate to 204 under the parallel region and 210 at the root.]
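Aggregating work bottom-up is a simple tree reduction; a minimal sketch over an assumed OSPG node structure (field names are illustrative, not OMP-WHIP's):

#include <stdint.h>
#include <stddef.h>

typedef struct ospg_node {
    uint64_t work;               /* measured for W-nodes (leaves), computed for S/P-nodes */
    struct ospg_node **children;
    size_t num_children;
} ospg_node;

/* Work of an internal node is the sum of its children's work; leaves keep their measured value. */
uint64_t compute_work(ospg_node *n) {
    if (n->num_children == 0)
        return n->work;          /* W-node: fine-grained measurement */
    uint64_t total = 0;
    for (size_t i = 0; i < n->num_children; ++i)
        total += compute_work(n->children[i]);
    n->work = total;
    return total;
}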

Compute Serial Work. After measuring the work in each W-node and computing the work of each internal node, identify the serial work: the work on the critical path.

Compute Serial Work (continued). Serial work is likewise computed for each internal node. [Figure: in the example, W2 and W3 each contribute 100 units of serial work, but because they run in parallel the serial work above them is 100, not 200; adding W1 and W4 gives 104 for the parallel region, and adding W0 gives 110 at the root, against 210 units of total work.]
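In other words, the critical path of this execution is W0 plus W1 plus W4 plus one of the two parallel 100-unit fragments: 6 + 2 + 2 + 100 = 110 units of serial work, while the total work is 6 + 2 + 2 + 100 + 100 = 210 units.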

Parallelism Profile. Aggregate parallelism at OpenMP constructs. [Figure: the OSPG nodes are attributed back to source constructs: the root to main at line 1, P0 to the omp parallel at line 3, and P2 and P3 to the omp tasks at lines 11 and 13.]

Parallelism Profile.

Line Number      Work   Serial Work   Parallelism   Serial Work %
program:1        210    110           1.91          5.4
omp parallel:3   204    104           1.96          3.5
omp task:11      100    100           1.00          91.1
omp task:13      -      -             -             -
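Reading the table: parallelism is work divided by serial work, for example 210 / 110 is about 1.91 for the whole program and 100 / 100 = 1.00 for the task at line 11. The Serial Work % column is consistent with each construct's own contribution to the 110-unit critical path, roughly 100 / 110, or 91%, for the tasks at line 11.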

We can estimate the increase in parallelism by hypothetically optimizing a region of the program.

Example: What-If Analyses. The developer chooses regions for what-if analyses; here the selected region lies in the mergeSort code shown earlier. OMP-WHIP estimates the improvement in parallelism by reducing the serial work on the corresponding W-nodes.

Compute What-If Profile.

Line Number      Work   Serial Work   Parallelism   Serial Work %
program:1        210    16            13.1          37.5
omp parallel:3   204    10            20.4          25
omp task:11      100    6             16.0          -
omp task:13      -      -             -             -
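The estimate follows directly from the model: if the selected region's serial work inside each task shrinks from 100 to 6 units, the critical path drops from 6 + 2 + 2 + 100 = 110 to 6 + 2 + 2 + 6 = 16 units, so the estimated parallelism rises from 210 / 110 (about 1.9) to 210 / 16 (about 13.1).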

Prototype. OMP-WHIP, our profiler for OpenMP programs, uses OMPT for instrumentation. [Figure: workflow. Running the program with OMPT event callbacks and with regions annotated for what-if analysis produces a program trace; OMP-WHIP then generates the parallelism profile and the what-if profile, which expose serialization bottlenecks.]
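For context, a minimal OMPT tool skeleton of the kind such instrumentation builds on; this is a generic OpenMP 5.x sketch and an assumption about structure, not OMP-WHIP's actual implementation:

#include <omp-tools.h>
#include <stdio.h>

/* Example callback: record where each parallel region begins. */
static void on_parallel_begin(ompt_data_t *encountering_task_data,
                              const ompt_frame_t *encountering_task_frame,
                              ompt_data_t *parallel_data,
                              unsigned int requested_parallelism,
                              int flags, const void *codeptr_ra) {
    printf("parallel region at %p, %u threads requested\n",
           codeptr_ra, requested_parallelism);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data) {
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_parallel_begin, (ompt_callback_t)on_parallel_begin);
    return 1;   /* nonzero keeps the tool active */
}

static void tool_finalize(ompt_data_t *tool_data) { }

/* The OpenMP runtime looks for this symbol and activates the tool if present. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
    static ompt_start_tool_result_t result = { tool_initialize, tool_finalize, {0} };
    return &result;
}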

Evaluation. We tested 43 OpenMP applications written with the OpenMP common core. Was it effective? OMP-WHIP identified bottlenecks in all applications and, using what-if analyses, identified the regions that matter for parallelism.

Was it Effective?

Application   Initial speedup   Optimized speedup   Change summary
AMGmk         5.4               9.1                 Parallelize loop regions
QuickSilver   11.8              12.8                Change loop scheduling
Del Triang    1.1               9.2                 Parallelize compute loop
Min Span      1.9               7.6                 Parallelize sort using tasks
NBody         -                 14.8                Recursive decomposition using tasks
CHull         2.1               11.1                Parallelize loop region and add tasking
Strassen      14                15.6                Increase task count

Use case: AMGmk. Initial speedup: 5.4x. The benchmark has three kernels: Relax, Axpy, and Matvec.

Initial parallelism profile:
Line Number   Parallelism   Serial work %
Program       6.93          47.95
relax.c:91    13.63         29.17
csr.c:172     10.57         20.36
relax.c:87    11.53         1.58

What-if profile:
Line Number   Parallelism   Serial work %
Program       11.62         20.47
relax.c:91    13.11         52.32
csr.c:172     10.56         21.43
relax.c:87    11.4          2.75

Use case: AMGmk (continued). Optimized speedup: 9.1x, up from the initial 5.4x.

What-if profile:
Line Number   Parallelism   Serial work %
Program       11.62         20.47
relax.c:91    13.11         52.32
csr.c:172     10.56         21.43
relax.c:87    11.4          2.75

Optimized parallelism profile:
Line Number   Parallelism   Serial work %
Program       11.43         17.44
relax.c:91    13.11         51.87
csr.c:179     15.79         21.31
vect.c:383    9.14          3.52

Was it Practical to Use? 62% average profiling overhead compared to parallel execution and 28% average memory overhead; only a small fraction of the OSPG is in memory at any time. An on-the-fly profiling mode analyzes long-running programs and eliminates the need for logs and offline analysis.

Related Work https://www.openmp.org/resources/openmp-compilers-tools/

Conclusion and Future Work. A novel performance model to identify serialization bottlenecks and what-if analyses to estimate performance improvements: our first step toward characterizing the performance of OpenMP programs. Future work: identifying the right amount of parallelism and offloading support.

Thank You OMP-WHIP is available online https://github.com/rutgers-apl/omp-whip