Performance Analysis and Optimization of Parallel Applications


Performance Analysis and Optimization of Parallel Applications. Parallel Random Number Generators
Emanouil Atanassov
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences
emanouil@parallel.bas.bg

OUTLINE
Goals of performance analysis and optimization of parallel programs
How to start
Understanding the hardware
Determining scalability
Analyzing and optimizing a sequential version of the program
Analyzing traces of a parallel program
Approaches for optimizing an MPI program
Parallel random number generators
Using SPRNG

Goals of performance analysis and optimization of parallel programs
Determine scalability.
Understand the limitations of the program and the issues of the current implementation.
Understand the weight of the various factors that affect performance:
  latency
  bandwidth
  CPU power
  memory
  libraries used
  compilers
Detect bottlenecks and hotspots.
Evaluate and carry out viable approaches to optimization.

How to start
Start with a reasonable implementation of the MPI program: techniques that apply to single-CPU programs should already have been applied.
After the correctness and robustness of the program have been verified, experiment with more aggressive compiler flags; look at www.spec.org for highly optimized settings (see the example flags below).
Have a good understanding of the target hardware: what kind of CPU, RAM and MPI interconnect it has. Collect information about latency, bandwidth and benchmark results for widely used applications; perform such tests yourself if necessary.
Find a good test case that is representative of, or relevant to, the future real computations.
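For illustration only, a more aggressive build might look like the line below (GCC-style flags are assumed; always check your own compiler's documentation and re-verify correctness after changing flags):
mpicc -O3 -march=native -funroll-loops myprog.c -o myprog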

Understanding the hardware
CPU: look at www.spec.org for benchmarking results; consider single-core and multi-core results for systems close to your target.
Memory: avoid swapping at all costs; determine appropriate setups.
Hyperthreading on or off: running with 8 physical cores is sometimes better than running with 16 logical cores.
Available interconnect: make sure to always use native InfiniBand instead of other options, e.g. TCP/IP.
Collect benchmarking information about the target hardware.

Understanding the hardware
Example osu_bw and osu_latency results:

Size (bytes)   Bandwidth, memory (MB/s)   Bandwidth, IB (MB/s)
1              0.8                        1.5
8              6                          12
64             53                         99
262144         5319                       1729

Size (bytes)   Latency, memory (us)       Latency, IB (us)
1              0.90                       1.8
8              1.12                       6.8
64             1.14                       7.1
262144         58.13                      659
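Numbers like these can be collected with the OSU micro-benchmarks. A minimal sketch of the runs, assuming an Open MPI style launcher and hypothetical host names node01 and node02 (adjust paths and options to your cluster):
mpirun -np 2 -host node01,node01 ./osu_latency     (both ranks on one node: shared-memory path)
mpirun -np 2 -host node01,node02 ./osu_latency     (ranks on two nodes: InfiniBand path)
mpirun -np 2 -host node01,node02 ./osu_bw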

Determining scalability
Find a test case that can be run on a single CPU core; if that is not possible, on 2, 4, etc. cores of one node.
Compare results when using 1, 2, 4, 8, 16, 32, ... cores, up to the maximum possible. A good target is usually half of the cluster; for our cluster, 256 cores.
Also try comparing 2 cores on one node vs 2 nodes using 1 core each, and 4 cores on one node vs 4 nodes using 1 core each.
When doing such tests, always take full nodes, otherwise other users can interfere with your measurements.
Evaluate the benefit of hyperthreading, e.g. using 16 logical cores vs just 8 physical cores; even a modest improvement is still worthwhile. The usual metrics for such a study are sketched below.
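A minimal sketch of a strong-scaling study (the program name ./myprog is a placeholder): run the same test case at increasing core counts,
mpirun -np 1 ./myprog
mpirun -np 2 ./myprog
mpirun -np 4 ./myprog
and so on up to the chosen maximum; record the wall-clock time T(p) of each run, then report the speedup S(p) = T(1)/T(p) and the parallel efficiency E(p) = S(p)/p. Efficiency close to 1 means the program scales well at that core count.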

Analyzing and optimizing a sequential version of the program
Where does the program spend most of its computing time? Use gprof to obtain profiling information (a basic workflow is sketched below).
If most of the time is spent in standard libraries, consider optimizing these or replacing your routines with highly optimized versions coming from, e.g., Intel or AMD.
Determine which of your functions are most critical from a performance point of view.
Vary compiler flags; look at www.spec.org for some interesting combinations. Note how compilation there is often done in two passes: profiling information from the first pass is fed to the second compilation pass (profile-guided optimization).
Think about memory access: maximize the use of the cache.
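A basic gprof workflow, with placeholder file names and GCC-style flags assumed:
gcc -O2 -pg myprog.c -o myprog          (build with profiling instrumentation)
./myprog                                (run the representative test case; writes gmon.out)
gprof ./myprog gmon.out > profile.txt   (flat profile plus call graph, sorted by time per function)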

Analyzing traces of a parallel program
Tracing means having additional function calls around the MPI routines. This information is usually collected and stored in files, which can be processed later. GUI software for viewing traces exists.
You can also define your own custom types of "events" and generate tracing information about them.
Looking at the traces of a parallel program should give you an idea of which MPI calls are the most critical and when the processes are waiting for their completion.
The main goal is to determine what part of the total CPU time is used for MPI calls and whether it is possible to overlap some communications with computations.

Analyzing traces of parallel programs
The main tool for achieving such overlap is the use of non-blocking point-to-point communications (see the sketch below). Other possibilities: use MPI_Sendrecv, or replace point-to-point communication with collective operations.
Use barrier synchronization to ensure correctness of the program.
The goal should be to understand the reason for less than 100% efficiency when looking at the scalability.
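A minimal, self-contained sketch of communication/computation overlap with non-blocking calls (the ring exchange and the dummy computation are hypothetical; only the pattern of MPI calls matters):

#include <mpi.h>
#include <stdio.h>

/* Each rank exchanges a buffer with its neighbours in a ring and overlaps
   the transfer with computation that does not depend on the received data. */
int main(int argc, char **argv) {
    int rank, size, n = 1024;
    double send_buf[1024], recv_buf[1024], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* neighbour to send to   */
    int left  = (rank - 1 + size) % size;   /* neighbour to recv from */
    for (int i = 0; i < n; i++) send_buf[i] = rank;

    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent work overlaps the transfers that are in flight. */
    for (int i = 0; i < n; i++) local += send_buf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Work that needs the received data runs only after completion. */
    for (int i = 0; i < n; i++) local += recv_buf[i];

    printf("rank %d local sum %f\n", rank, local);
    MPI_Finalize();
    return 0;
}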

Analyzing traces of parallel programs
Many tools exist for trace generation and analysis. We will study three of them: mpiP, Scalasca and MPE. Intel also provides excellent tools as part of its Cluster Studio XE.
If a deadlock occurs, you can always recompile with debugging enabled and attach gdb to the process that is stuck, using its process number.
Other ways to see what is happening (example commands below):
strace traces system calls; this program is not MPI-specific.
/usr/sbin/lsof shows information about open files.
Some problems occurring during the launch of parallel programs can be debugged with tcpdump.
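For illustration (the process ID 12345 and the host name node02 are placeholders):
gdb -p 12345                     (attach the debugger to a running, possibly stuck, process)
strace -p 12345                  (show the system calls the process is currently making)
/usr/sbin/lsof -p 12345          (list the files the process has open)
tcpdump -i any host node02       (watch traffic exchanged with a given node during job launch)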

Analyzing traces of parallel programs - mpiP
mpiP is a lightweight profiling library for MPI applications:
  collects only statistical information about MPI routines;
  captures and stores information local to each task;
  uses communication only at the end of the application to merge results from all tasks into one output file.
mpiP provides statistical information about a program's MPI calls:
  percent of a task's time attributed to MPI calls;
  where each MPI call is made within the program (callsites);
  top 20 callsites;
  callsite statistics (for all callsites).
mpipview provides a GUI.
mpicc -g master_slave.c -o master_slave.out -L/opt/exp_software/mpi/mpiP/lib -lmpiP -lbfd -lm -liberty -lunwind
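A run might then look like this (the report file name pattern is an assumption; mpiP writes a *.mpiP text report at program exit):
mpirun -np 4 ./master_slave.out
mpipview master_slave.out.4.*.mpiP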

Analyzing traces of parallel programs - Scalasca
"Scalasca (SCalable performance Analysis of LArge SCale parallel Applications) is an open-source project developed in the Jülich Supercomputing Centre (JSC) which focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications, yet presenting an advanced and user-friendly graphical interface. Scalasca can be used to help identify bottlenecks and optimization opportunities in application codes by providing a number of important features: profiling and tracing of highly parallel programs; automated trace analysis that localizes and quantifies communication and synchronization inefficiencies; flexibility (to focus only on what really matters); user friendliness; and integration with PAPI hardware counters for performance analysis."
Highly scalable!
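A typical Scalasca session looks roughly as follows (the exact options and the name of the experiment directory depend on the installed Scalasca version and on how the code was instrumented):
scalasca -analyze mpirun -np 16 ./myprog        (run the instrumented program and collect a measurement)
scalasca -examine <experiment_directory>        (open the analysis report in the GUI)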

Approaches for optimizing an MPI program
Once you understand whether your program is bandwidth-bound or latency-bound, an appropriate approach can be taken.
The amount of communication can be decreased by packing small data items together (see the sketch below).
Combining OpenMP with MPI may become necessary for large numbers of cores.
Load balancing is important: distribute work evenly between CPUs and re-balance if necessary.
Do not allow interference from other users: always take full nodes.
Test different compilers and MPI implementations.
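A minimal sketch of such packing (the three scalar quantities are hypothetical): instead of three one-element messages, the sender places the values in one buffer and pays the latency cost of a single send.

#include <mpi.h>
#include <stdio.h>

/* Aggregation sketch: rank 1 sends three scalar results to rank 0
   in a single message instead of three separate one-element messages. */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        double vals[3] = { 1.0, 2.0, 3.0 };   /* e.g. energy, mass, momentum */
        MPI_Send(vals, 3, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        double vals[3];
        MPI_Recv(vals, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f %f %f\n", vals[0], vals[1], vals[2]);
    }

    MPI_Finalize();
    return 0;
}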

Parallel pseudorandom number generators
In an MPI program the pseudorandom numbers generated by the different processes need to be uncorrelated. This is achieved by using parallel pseudorandom number generators, specifically designed for this purpose.
Development of parallel generators is notoriously tricky and error-prone, and naïve do-it-yourself approaches may produce wrong results. One should always prefer portable, well-tested and well-understood random number generators.
The general approach is that there is one global seed, but each process also uses its own global rank to produce an independent stream of pseudorandom numbers.

SPRNG: Scalable Parallel Random Number Generators Library
Web site: http://sprng.fsu.edu/
Versions that can be used: 2.0 and 4.0. Version 4.0 can be used from C++ (use mpiCC for Open MPI) or FORTRAN (use mpif77 or mpif90). Version 2.0 can be used from C.
Tips on how to install SPRNG can also be found at http://wiki.hp-see.eu/index.php/SPRNG which shows how to solve some problems that you may encounter when following the official instructions.

SPRNG
The principle of SPRNG is that on each CPU you create an object called a stream, which you then use to generate a stream of random numbers by calling sprng(stream).
In your program you must include sprng.h, define the seed and the interface type (simple or default), then initialize the state of the generator; after that you can use the above call.
A typical SPRNG 2.0 program (see, e.g., http://sprng.cs.fsu.edu/Version2.0/examples2/sprng_mpi.c) contains:
#include "sprng.h"
#define USE_MPI
stream = init_sprng(gtype, streamnum, nstreams, SEED, SPRNG_DEFAULT);
....
sprng(stream)
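Putting the above calls together, a minimal sketch of an MPI program with one SPRNG 2.0 stream per process (the generator type gtype and the seed are assumptions; the exact interface may differ between SPRNG builds, so check the documentation of your installation):

#include <stdio.h>
#include <mpi.h>

#define USE_MPI              /* must be defined before including sprng.h */
#include "sprng.h"

#define SEED 985456376

int main(int argc, char **argv) {
    int myid, nprocs, i;
    int gtype = 0;           /* generator type: assumed value, normally chosen by the user */
    int *stream;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* one independent stream per MPI process: stream number = rank */
    stream = init_sprng(gtype, myid, nprocs, SEED, SPRNG_DEFAULT);

    for (i = 0; i < 3; i++)
        printf("process %d: %f\n", myid, sprng(stream));   /* double in [0, 1) */

    free_sprng(stream);      /* release the stream */
    MPI_Finalize();
    return 0;
}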

SPRNG
A typical SPRNG 4.0 program (see, e.g., http://sprng.cs.fsu.edu/Version4.0/examples4/sprng.cpp) contains:
#include "sprng_cpp.h"
#define USE_MPI              // necessary only for the simple interface
Sprng *ptr = SelectType(gtype);
ptr->init_sprng(streamnum, nstreams, SEED, SPRNG_DEFAULT);
....
ptr->sprng()                 // usage
The FORTRAN interface is similar to the C one.

Conclusions
There are many tools to aid the analysis and optimization of MPI programs; nevertheless, experience and knowledge of the problem area are key.
SPRNG is easy to deploy and use, and is portable across many architectures. It can be used not only in MPI parallel programs, but also in multicore programming.