Performance Analysis and optimization of parallel applications Performance Analysis and optimization of parallel applications. Parallel random numbers generators Emanouil Atanassov Institute of Information and Communication Technologies Bulgarian Academy of Science emanouil@parallel.bas.bg
OUTLINE Goals of performance analysis and optimization of parallel programs How to start Understanding the hardware Determining scalability Analyzing and optimizing a sequential version of the program Analysis of traces of a parallel program Approaches for optimizing an MPI program Parallel random generators Using SPRNG
Goals of performance analysis and optimization of parallel programs Determine scalability Understand the limitations of the program and the issues of the current implementation Understand the weight of various factors that affect performance: Latency Bandwidth CPU power Memory Libraries used Compilers Detect bottlenecks and hotspots Evaluate and carry out viable approaches to optimization
How to start We should start with a reasonable implementation of the MPI program. Techniques that are applicable to single CPU programs should already have been applied. After the correctness and robustness of program has been verified, experiment with more aggressive compiler flags. Look at www.spec.org for highly optimized settings. Have good understanding of target hardware – what kind of CPU, RAM, MPI interconnection. Collect info about latency, bandwidth, benchmark results for widely used applications. Perform such tests by yourself if necessary. Find a good test case, that is representative or relevant for the future real computations.
Understanding the hardware CPU – look at www.spec.org for benchmarking results. Consider single and multicore results for systems close to your target Memory – avoid swap at all costs. Determine appropriate setups. Hyperthreading on or off – running with 8 real cores sometimes better than 16 logical cores. Available interconnect – make sure to always use native Infiniband instead of other options, e.g. TCP/IP Collect benchmarking info about target hardware.
Understanding the hardware Example osu_bw and osu_latency results: Size Bandwidth memory (MB/s) Bandwidth IB (MB/s) 1 0.8 1.5 8 6 12 64 53 99 262144 5319 1729 Size Latency memory (us) Latency IB (us) 1 0.90 1.8 8 1.12 6.8 64 1.14 7.1 262144 58.13 659.
Determining scalability Find a test case that can be run on single CPU. If not possible – on 2, 4, etc. CPUs on one node. Compare results when using 1, 2, 4, 8, 16, 32, … until maximum possible. Usually good target is half of cluster. For our cluster – 256 cores. Try also comparing 2 cores on one node vs 2 nodes using 1 core, 4 cores on one node vs 4 nodes using 1 core. When doing such tests, always take full nodes, otherwise other users can interfere with your work. Evaluate benefit of hyperthreading, e.g., using 16 logical cores vs just 8 real cores. If there is some improvement, still ok.
Analyzing and optimizing a sequential version of the program Where the program spends most of its computing? Use gprof to obtain profiling information. If most of the time is spent in standard libraries consider optimising these or replacing your routines with highly optimised versions coming from, e.g., Intel, AMD. Determine which of your functions are most critical from performance point of view. Vary compiler flags. Look at www.spec.org for some interesting combinations. Note how compilation is achieved in two passes – profiling information from the first pass is fed to the second compilation pass. Think about memory access – maximise use of cache.
Analyzing traces of a parallel program Tracing means to have additional function calls around the MPI routines. This information is usually collected and stored in files, which can be processed later. GUI software for viewing traces exist. You can also define your own custom type of “events” and generate tracing information about these. Looking at traces of a parallel program should give you an idea about which MPI calls are the most critical and when the processes are waiting for completion of these. The main goal is to determine what part of the total CPU time is used for MPI calls and if there is possibility to overlap some communications with computations.
Analyzing traces of parallel programs Main tool for achieving such overlap is the use of non-blocking point-to-point communications. Other possibilities – use MPI_Sendrecv, replace point-to-point with collective operations. Use barrier synchronization to ensure correctness of the program. Goal should be to understand the reason for less than 100% efficiency when looking at the scalability.
Analyzing traces of parallel programs Manu tools exist for trace generation and analysis. We will study tree of them – mpiP, scalasca, MPE. Intel provides also excellent tools as part of their Cluster Studio XE. If deadlock occurs, you can always recompile with debugging on and connect to the process that is stuck with gdb, using its process number. Another approach to see what is happening is: strace - tracing function calls – this program is not MPI specific. /usr/sbin/lsof – shows information about open files. Some problems happening during launch of parallel programs can be debugged with tcpdump.
Analysing traces of parallel programs - mpiP mpiP is a lightweight profiling library for MPI applications. Collects only statistical information about MPI routines Captures and stores information local to each task Uses communication only at the end of the application to merge results from all tasks into one output file. mpiP provides statistical information about a program's MPI calls: Percent of a task's time attributed to MPI calls Where each MPI call is made within the program (callsites) Top 20 callsites Callsite statistics (for all callsites) mpipview provides GUI mpicc -g master_slave.c -o master_slave.out -L/opt/exp_software/mpi/mpiP/lib -lmpiP -lbfd -lm -liberty –lunwind
Analysing traces of parallel programs - scalasca “Scalasca (SCalable performance Analysis of LArge SCale parallel Applications) is an open-source project developed in the Jülich Supercomputing Centre (JSC) which focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications, yet presenting an advanced and user-friendly graphical interface. Scalasca can be used to help identify bottlenecks and optimization opportunities in application codes by providing a number of important features: profiling and tracing of highly parallel programs; automated trace analysis that localizes and quantifies communication and synchronization inefficiencies; flexibility (to focus only on what really matters), user friendliness; and integration with PAPI hardware counters for performance analysis.” Highly scalable!
Approaches for optimizing an MPI program Once you understand if your program is bandwidth or latency dependent, appropriate approach can be taken. The amount of communication can be decreased by packing small data items. Combining OpenMP with MPI may become necessary for large number of cores. Load-balancing is important – distribute work evenly between CPUs and re-balance if necessary. Do not allow interference from other users – always take full nodes. Test different compilers and MPI implementations
Parallel pseudorandom number generators In an MPI program one needs the pseudorandom numbers generated by each process to be uncorrelated. This is a achieved by using parallel pseudorandom number generators, specifically designed for this purpose. Development of parallel generators is notoriously tricky, error-prone and naïve do-it-yourself approaches may produce wrong results. One should always prefer to use portable, well-tested and well understood random number generators. The general approach is that there is one global seed, but each process uses also its own global rank to produce independent stream of pseudorandom numbers.
SPRNG Scalable Parallel Random Number Generators Library Web site: http://sprng.fsu.edu/ Versions that can be used: 2.0, 4.0. Version 4.0 can be used from C++ (use mpiCC for openmpi) or FORTRAN (use mpif77 or mpif90). Version 2.0 can be used from C Tips for how to install SPRNG can be found also in http://wiki.hp-see.eu/index.php/SPRNG where you can see how to solve some problems that you may find following the official instructions.
SPRNG The principle of SPRNG is that at each CPU you create an object called stream, which you then use to generate a stream of random numbers by calling sprng(stream) In your program you must include sprng.h, define the seed and the interface type (simple or default), then initialize the state of the generator and then you can use the above call. A typical SPRNG 2.0 program (see, e.g.) http://sprng.cs.fsu.edu/Version2.0/examples2/sprng_mpi.c #include "sprng.h” #define USE_MPI stream=init_sprng(gtype,streamnum,nstreams,SEED, SPRNG_DEFAULT); …. sprng(stream)
SPRNG A typical SPRNG 4.0 program (see, e.g.) http://sprng.cs.fsu.edu/Version4.0/examples4/sprng.cpp contains: #include "sprng_cpp.h" #define USE_MPI // necessary only for simple interface Sprng*ptr = SelectType(gtype); ptr->init_sprng(streamnum, nstreams, SEED, SPRNG_DEFAULT); …. ptr->sprng() // usage FORTRAN similar to C.
Conclusions There are many tools to aid analysis and optimization of MPI programs. Nevertheless experience and knowledge of problem area is key SPRNG is easy to deploy and use and portable across many architectures. It can be used not only in MPI parallel programs, but also in multicore programming.