Typical performance bottlenecks and how they can be found
Bert Wesarg, ZIH, Technische Universität Dresden
Outline

Case I:
– Finding load imbalances in OpenMP codes
Case II:
– Finding communication and computation imbalances in MPI codes
Case I: Sparse Matrix Vector Multiplication

foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in row
        y[r.x] += e.value * x[e.y]
Case I: Sparse Matrix Vector Multiplication

Naïve OpenMP algorithm:
– Distributes the rows of A evenly across the threads in the parallel region
– The distribution of the non-zero elements may influence the load balance in the parallel application

#pragma omp parallel for
foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in row
        y[r.x] += e.value * x[e.y]
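For reference, a minimal C sketch of what this naïve kernel could look like for a matrix stored in CSR format; the csr_matrix type and its field names are illustrative and not taken from the tutorial's smxv.c:

/* Illustrative sketch of the naive OpenMP SpMV for a CSR matrix;
 * type and field names are hypothetical, not from smxv.c. */
#include <stddef.h>

typedef struct {
    size_t  nrows;
    size_t *row_ptr;   /* nrows+1 offsets into col/val          */
    size_t *col;       /* column index of each non-zero element */
    double *val;       /* value of each non-zero element        */
} csr_matrix;

void y_Ax_naive(const csr_matrix *A, const double *x, double *y)
{
    /* Default (static) schedule: each thread gets an equal share of
     * rows, regardless of how many non-zeros those rows contain. */
    #pragma omp parallel for
    for (size_t r = 0; r < A->nrows; ++r) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[r]; k < A->row_ptr[r + 1]; ++k)
            sum += A->val[k] * x[A->col[k]];
        y[r] = sum;
    }
}

With the static schedule, each thread receives the same number of rows, so a few dense rows can leave one thread doing most of the floating-point work.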
Case I: Load imbalances in OpenMP codes

Measuring the static OpenMP application:

% cd ~/Bottlenecks/smxv
% make PREP=scorep
scorep gcc -fopenmp -DLITTLE_ENDIAN \
    -DFUNCTION_INC='"y_Ax-omp.inc.c"' -DFUNCTION=y_Ax_omp \
    -o smxv-omp smxv.c -lm
scorep gcc -fopenmp -DLITTLE_ENDIAN \
    -DFUNCTION_INC='"y_Ax-omp-dynamic.inc.c"' \
    -DFUNCTION=y_Ax_omp_dynamic -o smxv-omp-dynamic smxv.c -lm
% OMP_NUM_THREADS=8 scan -t ./smxv-omp yax_large.bin
Case I: Load imbalances in OpenMP codes: Profile

Two metrics which indicate load imbalances:
– Time spent in OpenMP barriers
– Computational imbalance

Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/smxv/scorep_smxv-omp_large/trace.cubex

[CUBE GUI showing the trace analysis report]
Case I: Time spent in OpenMP barriers

These threads spent up to 20% of their running time in the barrier.
Case I: Computational imbalance

The master thread does 66% of the work.
Case I: Sparse Matrix Vector Multiplication

Improved OpenMP algorithm:
– Distributes the rows of A dynamically across the threads in the parallel region

#pragma omp parallel for schedule(dynamic,1000)
foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in row
        y[r.x] += e.value * x[e.y]
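A sketch of the same kernel with the dynamic schedule, reusing the illustrative csr_matrix type from the sketch above:

/* Same kernel with dynamic scheduling: rows are handed out in chunks
 * of 1000, so threads that receive dense rows simply pick up fewer
 * chunks. The csr_matrix type is the hypothetical one sketched above. */
void y_Ax_dynamic(const csr_matrix *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 1000)
    for (size_t r = 0; r < A->nrows; ++r) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[r]; k < A->row_ptr[r + 1]; ++k)
            sum += A->val[k] * x[A->col[k]];
        y[r] = sum;
    }
}

The chunk size is a trade-off: larger chunks keep the scheduling overhead low, smaller chunks follow the non-zero distribution more closely.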
Case I: Profile Analysis

Two metrics which indicate load imbalances:
– Time spent in OpenMP barriers
– Computational imbalance

Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/smxv/scorep_smxv-omp-dynamic_large/trace.cubex

[CUBE GUI showing the trace analysis report]
Case I: Time spent in OpenMP barriers

All threads spend a similar amount of time in the barrier.
Case I: Computational imbalance

All threads do nearly equal work.
Case I: Trace Comparison

Open the prepared measurement on the LiveDVD with Vampir:

% vampir ~/Bottlenecks/smxv/scorep_smxv-omp_large/traces.otf2

[Vampir GUI showing the trace]
Case I: Time spent in OpenMP barriers

– Improved runtime
– Less time in the OpenMP barrier
Case I: Computational imbalance

– Large imbalance in the time spent in computational code
Outline

Case I:
– Finding load imbalances in OpenMP codes
Case II:
– Finding communication and computation imbalances in MPI codes
Case II: Heat Conduction Simulation
Case II: Heat Conduction Simulation

– The application uses MPI for boundary exchange and OpenMP for computation
– The simulation grid is distributed across MPI ranks
Case II: Heat Conduction Simulation

– Ranks need to exchange their boundaries before the next iteration step
Case II: Profile Analysis

MPI algorithm:

foreach step in [1:nsteps]
    exchangeBoundaries
    computeHeatConduction

Building and measuring the heat conduction application:

% cd ~/Bottlenecks/heat
% make PREP='scorep --user'
[... make output ...]
% scan -t mpirun -np 16 ./heat-MPI

Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/heat/scorep_heat-MPI_16/trace.cubex

[CUBE GUI showing the trace analysis report]
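A minimal sketch of what the blocking exchangeBoundaries step could look like for a 1-D row decomposition of the grid; the function signature, ghost-row layout, and the neighbor ranks up/down are assumptions, not taken from the heat-MPI sources:

/* Illustrative blocking boundary (halo) exchange; names are
 * hypothetical, not from heat-MPI. Rows 1..local_rows belong to this
 * rank, rows 0 and local_rows+1 are ghost rows. Neighbors may be
 * MPI_PROC_NULL at the upper and lower edges of the grid. */
#include <mpi.h>

void exchange_boundaries(double *grid, int local_rows, int cols,
                         int up, int down, MPI_Comm comm)
{
    /* Send the first interior row up, receive the lower ghost row from below. */
    MPI_Sendrecv(&grid[1 * cols],                cols, MPI_DOUBLE, up,   0,
                 &grid[(local_rows + 1) * cols], cols, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send the last interior row down, receive the upper ghost row from above. */
    MPI_Sendrecv(&grid[local_rows * cols],       cols, MPI_DOUBLE, down, 1,
                 &grid[0],                       cols, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}

Because both calls block until the matching partner has posted its operation, no computation can proceed while the boundaries are in flight.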
Case II: Hide MPI communication with computation

Step 1: Compute the heat in the area which is communicated to your neighbors.

Step 2: Start communicating the boundaries with your neighbors.

Step 3: Compute the heat in the interior area.
Case II: Profile Analysis

Improved MPI algorithm:

foreach step in [1:nsteps]
    computeHeatConductionInBoundaries
    startBoundaryExchange
    computeHeatConductionInInterior
    waitForCompletionOfBoundaryExchange

Measuring the improved heat conduction application:

% scan -t mpirun -np 16 ./heat-MPI-overlap

Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/heat/scorep_heat-MPI-overlap_16/trace.cubex

[CUBE GUI showing the trace analysis report]
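A minimal sketch of one overlapped iteration step with nonblocking MPI, following the pseudocode above; all names (heat_step_overlap, compute_row, the ghost-row layout) are hypothetical and not the tutorial's actual implementation:

/* Illustrative overlapped iteration step; names are hypothetical. */
#include <mpi.h>

/* Apply a simple 4-neighbor stencil to one row (hypothetical helper). */
static void compute_row(const double *grid, double *next, int r, int cols)
{
    for (int c = 1; c < cols - 1; ++c)
        next[r * cols + c] = 0.25 * (grid[(r - 1) * cols + c] + grid[(r + 1) * cols + c]
                                   + grid[r * cols + c - 1]   + grid[r * cols + c + 1]);
}

void heat_step_overlap(const double *grid, double *next, int local_rows, int cols,
                       int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Step 1: compute the rows that the neighbors will need. */
    compute_row(grid, next, 1, cols);
    compute_row(grid, next, local_rows, cols);

    /* Step 2: start the boundary exchange with nonblocking calls. */
    MPI_Isend(&next[1 * cols],                cols, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Isend(&next[local_rows * cols],       cols, MPI_DOUBLE, down, 1, comm, &req[1]);
    MPI_Irecv(&next[0],                       cols, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Irecv(&next[(local_rows + 1) * cols], cols, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* Step 3: compute the interior rows while the messages are in flight. */
    for (int r = 2; r < local_rows; ++r)
        compute_row(grid, next, r, cols);

    /* Wait until the new ghost rows have arrived before the next iteration. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

The interior computation now runs while the boundary messages are being transferred, so communication time is hidden as long as the interior is large enough.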
Case II: Trace Comparison

Open the prepared measurement on the LiveDVD with Vampir:

% vampir ~/Bottlenecks/heat/scorep_heat-MPI_16/traces.otf2

[Vampir GUI showing the trace]
Acknowledgments

Thanks to Dirk Schmidl, RWTH Aachen, for providing the sparse matrix vector multiplication code.