1
Integrated MPI/OpenMP Performance Analysis
Bob Kuhn (KAI Software Lab, Intel Corporation) and Hans-Christian Hoppe (Pallas GmbH)
2
Outline
- Why integrated MPI/OpenMP programming?
- A performance tool for MPI/OpenMP programming (Phase 1)
- Integrated performance analysis capability for ASCI apps (Phase 2)
3
Why Integrate MPI and OpenMP?
- Hardware trends
- Simple example – how is it done now?
- An FEA example
- ASCI examples
4
Parallel Hardware Keeps Coming
Example: recent LLNL ASCI clusters
- Parallel Capacity Resource (PCR) cluster: three clusters totaling 472 Pentium 4s, the largest with 252; theoretical peak 857 gigaFLOP/s; Linux NetworX via SGI Federal (HPCWire, 8/31/01)
- Parallel global file system cluster: 48 Pentium 4 processors; 1,024 clients/servers; delivers I/O rates of over 32 GB/s; fail-over and a global lock manager; Linux, open source; Linux NetworX via SGI Federal (HPCWire, 8/31/01)

Introduction: Why is integrated MPI/OpenMP programming important? One reason is that cluster architectures are very popular, mainly because they offer a good compromise between cost and performance. The size of each cluster node is determined by the number of processors that can be added to a backplane designed for an enterprise server, where the bulk of the market is. In addition, the cost of high-performance networks has come down quickly enough that a cluster of SMP systems can be configured for a small increment over the cost of the SMP systems themselves. With this hardware impetus, a two-level programming model is natural: a shared-memory programming model, OpenMP, is mapped to each node in the cluster, and a distributed-memory programming model, MPI, is mapped across the high-performance network.

One should also consider the significance of NUMA architectures on the market. It is certainly possible to build a shared-memory system incorporating many processors, but unless an application uses little communication between processors, the effects of Non-Uniform Memory Access (NUMA) are very much present. A straightforward way to address this in an application is to use a distributed-memory programming model, which forces one to think about the impact of NUMA. Indeed, the most popular programming model on the largest NUMA systems, such as the ASCI Blue Mountain system, is MPI. With respect to two-level programming models, NUMA systems allow the user to configure a flexible cluster of the optimum size for the OpenMP loop; for example, this size can balance task granularity against load imbalance. Alternatively, it may be appropriate to optimize the MPI level, constraining the OpenMP cluster to adapt to the MPI granularity and imbalance characteristics.
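To make the two-level model concrete, here is a minimal hybrid MPI/OpenMP sketch (not from the original slides): MPI distributes work across the cluster nodes while OpenMP parallelizes each node's share of the loop. The loop, bounds, and reduction are illustrative only; MPI calls are made only outside the parallel region.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);                      /* MPI level: one process per node (typically) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process takes a cyclic slice of the iteration space;
       OpenMP threads share that slice within the node's SMP. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nprocs)
        local += 1.0 / (1.0 + i);

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("result = %f  (%d MPI processes x %d OpenMP threads)\n",
               total, nprocs, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```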
5
Parallelism Performance Analysis
[Figure: code performance versus development effort, comparing OpenMP alone, MPI, and combined MPI/OpenMP, with annotations showing where debuggers/IDEs and the OpenMP and MPI/OpenMP performance tools apply.]

Performance analysis tools have typically been applied in the later stages of application development. The figure above illustrates, first, that the OpenMP level of parallelism is typically fast to implement, but performance becomes limited because OpenMP does not force domain decomposition. MPI development goes more slowly at first because the developer must thoroughly analyze the application and install all of the needed message passing before the application works at all; then performance analysis tools are used for tuning. OpenMP performance analysis tools can provide feedback earlier in development, but they are still tuning tools in that the application must be working correctly before it makes sense to tune it. Another mode is to help analyze problems after an application has been turned over to users.
6
Cost Effective Parallelism Long Term
A wealth of parallelism experience, from single-person codes to large team efforts
7
ASCI Ultrascale Tools Project
Pathforward project RTS – Parallel System Performance. Ten goals in three areas:
- Scalability – work with 10,000+ processors
- Integration – how about hardware monitors, object orientation, and the runtime environment?
- Ease of use – dynamic instrumentation; be prescriptive, not just data management

About the project: there are several goals.
Scalability – ASCI systems are the largest of systems, and efficient utilization is critical to delivering their potential. Applications that scale to the level needed to run on an ASCI system have never been written before. An aggressive goal of the project is to quadruple the number of processors that can be analyzed every year.
Integration – Effective performance analysis requires integrating information from several sources. This not only avoids the needless work of coordinating output from several tools; the integration also provides a platform for synthesizing an overall performance recommendation. ASCI systems are like no other systems, so there is quite a bit to learn about using them efficiently.
Ease of use – Needless to say, the number of processors used by ASCI applications presents many user-interface challenges. In addition, a key part of ease of use is adding intelligence to the instrumentation which collects the data to be analyzed about an application.
The project runs for three years; the first year has just been completed.
8
Architecture for Ultrascale Performance
[Figure: application source → Guide (source instrumentation) → object files, linked with the Guidetrace and Vampirtrace libraries into the executable; the executable writes a tracefile analyzed by Vampir (MPI analysis) and GuideView (OpenMP analysis).]

The figure above shows how the MPI and OpenMP performance tools were integrated. The flow follows four steps: instrumentation at compile time (Step 1); generating an integrated MPI/OpenMP tracefile at runtime (Step 2); post-run performance analysis for MPI with Vampir (Step 3); and OpenMP analysis with GuideView (Step 4). This design interleaves Vampirtrace and Vampir between the OpenMP components: Guide, the Guide runtime library, and GuideView. Like most MPI performance analysis tools, Vampirtrace uses the MPI library wrapper specification for instrumentation: as each MPI call is performed, an event is written to a trace file. Vampir is the post-run tracefile analysis tool. Guide is a portable OpenMP compiler which restructures source code and inserts calls to the Guide runtime library. The Guide runtime library layers on top of threads to implement OpenMP functions; it is instrumented to update timers at all the significant OpenMP events. At the end of a run, these timers and counters are written into a statistics file.
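The MPI "library wrapper" mechanism mentioned above is the standard MPI profiling interface. A minimal sketch of the idea (not Vampirtrace's actual code) follows; the log_event helper is hypothetical, standing in for whatever record the tracing library writes, and the const-qualified buffer follows the MPI-3 signature.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for writing a time-stamped trace record. */
static void log_event(const char *name, double t_start, double t_end)
{
    fprintf(stderr, "%s %.6f %.6f\n", name, t_start, t_end);
}

/* The tracing library provides its own MPI_Send, records an event,
   and forwards the call to the real implementation via PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    log_event("MPI_Send", t0, PMPI_Wtime());
    return rc;
}
```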
9
Phase One Goal – Integrated MPI/OpenMP
Phase One Goals – Integrated MPI/OpenMP:
- Tracing mode most compatible with ASCI systems
- Whole-program profiling – integrate the program profile with parallelism
- Increased scalability of performance analysis – 1000 processors

In the first phase, three tasks were implemented:
Integrated MPI/OpenMP – A key goal for the project was to implement performance analysis of hybrid MPI/OpenMP applications. This integration provides an automated channel for analyzing the OpenMP regions in any selected MPI context.
Whole-program profiling – This integrates a program profile with the MPI and OpenMP parallelism. It is critical because the subroutine structure communicates much more to the application developer than OpenMP regions do, and especially more than MPI messages.
Scalability – The third goal was to allow analysis of applications using up to 1000 processors.
10
Vampir – Integrated MPI/OpenMP
SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each.

User interface: once an integrated MPI/OpenMP tracefile has been created during the application run, it can be viewed through an integrated user interface. Vampir shows the tracefile events ordered by time in the timeline display. When an MPI process executes an OpenMP region, a wiggle, or glyph, appears at the top of that process's timeline. The user can select that OpenMP region for viewing, or select a set of MPI processes or a section of the timeline for OpenMP analysis. OpenMP analysis aggregates the OpenMP data structures from all the tracefile events in the selection; the aggregated data is then written to a file, which a GuideView server process reads.

Callouts: threaded activity during an OpenMP region; the timeline marks OpenMP regions with a glyph.
11
GuideView – Integrated MPI/OpenMP & Profile
SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads. All OpenMP regions for a process are summarized into one bar; the highlight (red arrow) shows the speedup curve for that set of threads.

GuideView displays the OpenMP regions for each MPI process as a separate set of OpenMP information. In this way, the user can use GuideView tools to sort and filter processes with OpenMP problems from among the hundreds of MPI processes that may be running. Sorting can be done on any OpenMP time measure: scheduling imbalance, lock time, time spent in locked regions, and overhead are examples. By filtering we mean selecting a subset of the MPI processes – the top or bottom n, where the user specifies n. This mechanism lets a user compose compound performance queries by sorting on one criterion, filtering the top responders, and then sorting by another. (There is a scaling issue illustrated in the displays above: in this application, the total application time includes significant MPI wait time, which appears as serial time in GuideView and tends to dwarf the OpenMP time.)

Callout: the thread view shows the balance between MPI tasks and threads.
12
GuideView – Integrated MPI/OpenMP & Profile
The profile allows comparison of MPI, OpenMP, and application activity, both inclusive and exclusive. Sorting and filtering bring large amounts of information down to a manageable level. The user can also view the subroutine profile for one MPI process, or for a selection of MPI processes, within GuideView. It can be viewed inclusively, to understand the call-tree structure, or exclusively, to see which subroutines consume the most time.
13
Guide – Compiler Workhorse
Compilation of OpenMP; automatic subroutine entry and exit instrumentation for Fortran and C/C++.
New compiler options:
- -WGtrace – link with the Vampirtrace library
- -WGprof – subroutine entry/exit profiling
- -WGprof_leafprune – minimum size of procedures to retain in the profile

Profiling is integrated with MPI and OpenMP. To do this, it was decided that the common denominator, event tracing, should be used. Thus, in Step 1, Guide was modified to call a Vampirtrace API to log subroutine entries and exits. Each entry or exit causes a trace state change which is time-stamped. This mechanism has good and bad parts. The bad part, of course, is that many (in C++, many, many) events are created, exacerbating scalability problems. The good part is that the application's structure can be precisely correlated with the parallel-processing events via the time stamps. Contrast this with a typical profiler, which samples the program counter or stack at regular intervals: at most, such profilers can provide the call-stack profile. Because message passing à la MPI is not structured, time-stamped events are the only way to integrate subroutine and message-passing information. With OpenMP, time sampling may be feasible, but if a sampling event were to occur during a critical section, the perturbation would be considerably larger than with the timer scheme Guide uses. In standalone mode, Guide now provides a timer-based subroutine profile to integrate OpenMP with program structure. To ameliorate the bad effects of event-based profiling, several heuristics have been implemented: Guide allows profiling on a source-file-by-source-file basis, and Guide inserts entry/exit events only when the size of a subroutine exceeds a user-specified size, measured in statements.
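Conceptually, the inserted instrumentation might look like the sketch below. It reuses the VT_symdef/VT_begin/VT_end calls shown later on the PAPI slide, but the header name, symbol id, and wrapper structure are assumptions rather than Guide's actual output.

```c
#include "VT.h"                 /* Vampirtrace API header; exact name assumed */

void solve(double *x, int n);   /* the user's original routine */

/* Roughly what -WGprof adds around a routine: register a symbol once,
   then bracket the routine with time-stamped entry/exit events.
   Symbol id 42 is an arbitrary illustrative choice. */
void solve_traced(double *x, int n)
{
    static int sym = -1;
    if (sym < 0) {
        sym = 42;
        VT_symdef(sym, "SOLVE", "USERSTATES");
    }
    VT_begin(sym);   /* subroutine entry event */
    solve(x, n);
    VT_end(sym);     /* subroutine exit event */
}
```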
14
Vampirtrace – Profiling
Support for pruning of short routines.

[Figure: entry/exit events for ROUTINE X, ROUTINE Y, and ROUTINE Z, with subtrees shorter than Δt pruned. All events that have not been pruned can now be written to the tracefile; ROUTINE X is marked as having its call-tree information summarized; ROUTINE Z may still be shorter than Δt, so it cannot yet be written.]

Vampirtrace measures the time difference between entry and exit. If the execution time of a subroutine is less than a user-specified limit, neither the entry nor the exit event is written. The figure also illustrates the dynamic pruning heuristic: the Guide instrumentation counts the number of subroutine entries, and when the count exceeds a user-specified limit, entry/exit logging is disabled. (In thread-safe mode the count is not precise, because each thread increments the counter without locking it and some events can be lost.)
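The pruning logic described above could be sketched as follows. The names, thresholds, and write_trace_event are all hypothetical, and a real tracer would additionally buffer nested events so that they still reach the tracefile in time order.

```c
#define PRUNE_LIMIT   1.0e-4   /* user-specified minimum routine duration (s) */
#define MAX_CALLS     10000    /* user-specified dynamic-pruning entry count  */
#define MAX_DEPTH     1024
#define MAX_ROUTINES  4096

typedef struct { int id; double t_entry; } PendingEntry;

static PendingEntry stack[MAX_DEPTH];     /* entries waiting for their exit   */
static int          depth = 0;
static long         calls[MAX_ROUTINES];  /* per-routine count; unlocked, so
                                             imprecise in thread-safe mode    */

static void write_trace_event(int id, double t0, double t1)
{
    /* stand-in for appending ENTRY/EXIT records to the trace buffer */
    (void)id; (void)t0; (void)t1;
}

void on_entry(int id, double now)
{
    if (depth < MAX_DEPTH) { stack[depth].id = id; stack[depth].t_entry = now; }
    depth++;
}

void on_exit(int id, double now)
{
    depth--;
    if (depth >= MAX_DEPTH) return;                        /* stack overflowed  */
    if (calls[id]++ > MAX_CALLS) return;                   /* dynamic pruning   */
    if (now - stack[depth].t_entry < PRUNE_LIMIT) return;  /* too short: drop both events */
    write_trace_event(id, stack[depth].t_entry, now);
}
```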
15
Scalability on Phase One
Timeline scaling to 256 tasks/nodes:
- Gathering the tasks in a node into a group
- Filtering by nodes
- Expanding each node
- Message statistics by nodes
16
Phase Two – Integrating Capabilities for ASCI Apps
Phase Two Goals:
- Deployment to other platforms – Compaq, CPlant, SGI
- Thread safety
- Scalability – grouping
- Statistical analysis
- Integrated GuideView
- Hardware performance monitors
- Dynamic control of instrumentation
- Environmental awareness
17
Thread Safety – collect data from each thread:
- Thread-safe Vampirtrace library
- Per-thread profiling data (in the previous release, only the master thread logged data)
- Improves accuracy of the data
Value to users:
- Enhances integration between MPI and OpenMP
- Enhances visibility into the functional balance between threads
18
Scalability: Grouping
Up to the end of FY00: whole system with fixed hierarchy levels (system, nodes, CPUs) and fixed grouping of processes – e.g., impossible to reflect communicators.
Need more levels:
- Threads are a fourth group
- Systems with deeper hierarchies (30T)
- Reduce the number of on-screen entities for scalability

[Figure: grouping hierarchy – whole system; Node 1 … Node n; quadboards; tasks T_1 … T_p; CPU 1 … CPU c; threads t_1 … t_c.]
19
Default Grouping
Can be changed in the configuration file:
- By nodes
- By processes
- By master threads
- All threads
20
Scalability: Grouping
- Filter-processes dialog: select-groups combo box
- Display of groups: by aggregation or by a representative
- Grouping applies to timeline bars and counter streams
21
Scalability by Grouping
Parallelism display showing all threads, and parallelism display showing only the master threads, which alternate between MPI and OpenMP parallelism.
22
Statistical Information Gathering
- Collects basic statistics at runtime
- Saves the statistics in an ASCII file
- View the statistics in your favorite spreadsheet
- Reduced overhead compared to tracing

[Figure: the parallel executable writes a small statistics file alongside the (big) tracefile; a Perl filter converts the statistics file for Excel or another spreadsheet.]
23
Statistical Information Gathering
- Can work independently of tracing
- Significantly lower overhead (memory, runtime)
- Restriction: statistics cover the whole application run
24
Statistical Information Gathering
25
GuideView Integrated Inside Vampir
Creating an extension API in Vampir:
- Insert menu items
- Include new displays
- Access trace data & statistics

[Figure: the Vampir menus and the Vampir GUI engine invoke the new GuideView control, which accesses the trace data held in memory and displays it through the Motif graphics library.]
26
New GuideView Whole Program View
Goals:
- Improve MPI/OpenMP integration
- Improve scalability
- Integrate look and feel
Works like the old GuideView! Load time – fast!
27
New GuideView Region View
Looks like the old Region view turned on its side! Scalability test: 16 MPI tasks, 16 OpenMP threads, 300 parallel regions.
28
Hardware Performance Monitors
- The user can call the HPM API in the source code
- The user can define events in a config file for Guide instrumentation
- HPM counter events are also logged from the Guidetrace and Vampirtrace libraries
- The underlying HPM library is PAPI

[Figure: application source and config file pass through Guide; the object files are linked with the Guidetrace, Vampirtrace, and PAPI libraries into the executable, which writes the tracefile analyzed by Vampir and GuideView.]
29
PAPI – Hardware Performance Monitors
    /* Vampirtrace HPM instrumentation example from the slide; includes,
       declarations, and state codes added so the fragment is self-contained. */
    #include "VT.h"     /* Vampirtrace API header; name may differ by release */

    extern void foo(void);
    extern void bar(void);

    int main(int argc, char **argv)
    {
        int set_id;
        int inner = 1, outer = 2, other = 3;      /* user-chosen state codes */

        /* Create a new event set measuring L1 & L2 data cache misses */
        set_id = VT_create_event_set("MySet");
        VT_add_event(set_id, PAPI_L1_DCM);
        VT_add_event(set_id, PAPI_L2_DCM);

        /* Define user states for the measured intervals */
        VT_symdef(outer, "OUTER", "USERSTATES");
        VT_symdef(inner, "INNER", "USERSTATES");
        VT_symdef(other, "OTHER", "USERSTATES");

        /* Activate the event set */
        VT_change_hpm(set_id);

        /* Collect the events over two user-defined intervals */
        VT_begin(outer);
        foo();
        VT_begin(inner);
        bar();
        VT_end(inner);
        VT_end(outer);
        return 0;
    }

Notes:
- Creates a new event set to measure L1 & L2 data cache misses
- PAPI standardizes counter names across platforms
- Users define their own counter sets
- The user could instrument by hand, but better: counters are instrumented automatically at OpenMP constructs and subroutines
- Counters not supported by the hardware cannot be used
- The event set is activated, then events are collected over two user-defined intervals
30
Hardware Performance Example
Callouts: MPI tasks on the timeline; or per-MPI-task activity correlated in the same window; floating-point instructions correlated, but shown in a different window.
31
Hardware Performance Can Be Rich
4 x 4 SWEEP3D run showing L1 data cache misses and cycles stalled waiting for memory accesses.
32
Hardware Performance in GuideView
The HPM data is visible in all GuideView windows: L1 data cache misses and cycles stalled due to memory accesses, shown in the per-MPI-task profile view.
33
Derived Hardware Counters
In this menu you can arithmetically combine measured counters into derived counters. The Vampir and GuideView displays then present the derived counters.
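For example (illustrative combinations built from standard PAPI presets, not necessarily the tool's built-in definitions), a derived counter might be defined as:

    IPC                = PAPI_TOT_INS / PAPI_TOT_CYC
    L1 data miss ratio = PAPI_L1_DCM  / PAPI_L1_DCA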
34
Environmental Counters
Select rusage information like HPMs.
- utime – user time used
- stime – system time used
- maxrss – max resident set size
- ixrss – shared memory size
- idrss – unshared data size
- minflt – page reclaims
- majflt – page faults
- nswap – swaps
- inblock – block input operations
- oublock – block output operations
The data appears in Vampir and GuideView like HPM data. These are time-varying OS counters; a config variable sets the sampling frequency. They are difficult to attribute precisely to source code.
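For reference, these counters correspond to fields of the POSIX struct rusage. A small standalone C sketch (independent of the tools above) that reads a few of them:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("utime  : %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("stime  : %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        printf("maxrss : %ld\n", (long)ru.ru_maxrss);
        printf("minflt : %ld  majflt : %ld\n", (long)ru.ru_minflt, (long)ru.ru_majflt);
        printf("inblock: %ld  oublock: %ld\n", (long)ru.ru_inblock, (long)ru.ru_oublock);
    }
    return 0;
}
```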
35
Environmental Awareness
Type 1: collects IBM MPI information, treated as a static (one-time) event in the tracefile. Over 50 parameters, for example:
- MP_EUIDEVICE – adapter set to be used for message passing
- MP_EUILIB – communication subsystem library implementation
- MP_INFOLEVEL – level of message reporting
- MP_BUFFER_MEM – size of unexpected-message buffers
- MP_CSS_INTERRUPT – generate interrupts for arriving packets
- MP_EAGER_LIMIT – threshold for switching to the rendezvous protocol
- MP_USE_FLOW_CONTROL – enforce flow control for outgoing messages
36
Dynamic Control of Instrumentation
- In the source, the user places VT_confsync() calls
- At runtime, TotalView is attached and a breakpoint is inserted
- From process #0, the user adjusts several instrumentation settings
- The VTconfigchanged flag is set and the breakpoint is exited
- The tracefile reflects the change after the next VT_confsync()

[Figure: application source passes through Guide; the object files are linked with the Vampirtrace library into the executable; TotalView attaches to the running executable, and the resulting tracefile is analyzed by Vampir and GuideView.]
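A sketch of where the VT_confsync() calls might go, assuming a void signature and a hypothetical per-timestep work routine:

```c
#include "VT.h"   /* Vampirtrace API header; exact name assumed */

void compute_and_communicate(int step);   /* hypothetical application work */

/* Call VT_confsync() at a natural synchronization point (e.g., once per
   timestep) so that instrumentation settings changed from TotalView on
   process #0 are picked up by all processes at the next call. */
void timestep_loop(int nsteps)
{
    for (int step = 0; step < nsteps; ++step) {
        compute_and_communicate(step);
        VT_confsync();
    }
}
```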
37
Dynamic Control of Instrumentation
Configuration keywords (Keyword – Description – Default value):
- LOGFILE-NAME – tracefile name – <argv[0]>.bvt
- LOGFILE-PREFIX – tracefile path prefix – null string
- ACTIVITY – trace activities (user defined) – * ON
- SYMBOL – trace symbols (often subroutines)
- COUNTER – trace counters
- OPENMP – trace OpenMP regions
- PCTRACE – record return address – OFF
- SUM-MPITESTS – collapse MPI probe and test routines – ON
- CLUSTER – trace cluster nodes – all enabled
- PROCESS – trace processes
- ENVIRONMENT – record environment information
- MEM-MAXBLOCKS – maximum number of memory blocks – unlimited
- MEM-OVERWRITE – overwrite in-core buffers
- PRUNE-LIMIT – execution-time threshold – no pruning
38
Structured Trace Files – Frames Manage Scalability
A frame can cover: a section of the timeline, a set of processors, messages or collectives, OpenMP regions, or instances of a subroutine.
39
Structured Trace Files Consist of Frames
Frames are defined in the source code:
- int VT_framedef(char *name, unsigned int type_mask, int *frame_handle)
- int VT_framestart(int frame_handle)
- int VT_framestop(int frame_handle)
type_mask defines the types of data collected: VT_FUNCTION, VT_REGION, VT_PAR_REGION, VT_OPENMP, VT_COUNTER, VT_MESSAGE, VT_COLL_OP, VT_COMMUNICATION, VT_ALL.
Frames defined at analysis time will also become available.
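A usage sketch built on the prototypes above; the frame name, the choice of type mask, and do_sweep are illustrative, and the exact semantics are assumed.

```c
#include "VT.h"        /* Vampirtrace API header; exact name assumed */

void do_sweep(void);   /* hypothetical application routine */

/* Define one frame covering each sweep and record OpenMP and message data
   for it; thumbnails of such frames can then be browsed in Vampir. */
void traced_sweep(void)
{
    static int frame = -1;
    if (frame < 0)
        VT_framedef("sweep", VT_OPENMP | VT_MESSAGE, &frame);
    VT_framestart(frame);
    do_sweep();
    VT_framestop(frame);
}
```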
40
Structured Trace Files Rapid Access By Frames
[Figure: rapid access by frames – (1) a structured tracefile whose index file points to individual frames; (2) Vampir thumbnail displays represent the frames; (3) selecting a thumbnail displays that frame in Vampir.]
41
Object Oriented Performance Analysis
- Uses the TAU model
- How to avoid SOOX (Scalability Object-Oriented eXplosion): instrument with an API
- C++ templates and classes make it much easier
- Can be used with or without source

[Figure: VT activities/InformerMappings (ImX, ImY, ImZ, ImQ) group Informers (I_A, I_B, I_C, I_D), which in turn map events such as MPI_Send, MPI_Recv, MPI_Finalize and functions A, Init, X, Y, and Z.]
42
Example of OO Informers
    class Matrix {
    public:
        InformerMapping im;

        Matrix(int rows, int columns) {
            if (rows * columns > 500)
                im.Rename("LargeMatrix");
            else
                im.Rename("Matrix");
        }

        void invert() {
            Informer(im, "invert", 12, 15, "Example.C");
            #pragma omp parallel
            {
                /* ... */
            }
            MPI_Send(/* ... */);
        }

        void compl() {
            Informer(im, "typeid(…)");
            /* ... */
        }
    };

    int main(int argc, char **argv) {
        Matrix A(10,10), B(512,512), C(1000,1000);   // line 1
        B.im.Rename("MediumMatrix");                 // line 2
        A.invert();                                  // line 3
        B.compl();                                   // line 4
        C.invert();                                  // line 5
    }

- Line 1 creates three Matrix instances: A (mapped to the "Matrix" bin), and B and C (mapped to the "LargeMatrix" bin)
- Line 2 remaps B to the "MediumMatrix" bin
- Line 3: A.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the Matrix bin
- Line 4: B.compl() is traced; entry and exit events are collected and associated with "Matrix:void compl(void)" in the MediumMatrix bin
- Line 5: C.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the LargeMatrix bin
43
Vampir OO Timeline Shows Informer Bins
InformerMappings: each bin is displayed as a Vampir activity; MPI is put into a separate activity with the same prefix. Renamed symbols appear as a "mangled name" of the form InformerMapping:Informer:NormalEventName.
44
Vampir OO Profile Shows Informer Bins
[Charts: "Time in Classes: Queens" and "MPI Time in Class: Queens".]
45
OO GuideView Shows Regions in Bins
Time and counter data per thread by Bin
46
Parallel Performance Engineering
ASCI Ultrascale Performance Tools: scalability, integration, ease of use.
Read about what was presented: ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz
Contact:
Thank you for your attention!

Notes – two possible audiences:
- ASCI – very important for defining phase 3
- GVG customers
Application analysis: once we get users, we will get user success stories.
Programming-principles analysis – more important for motivating users and buyers:
- Quantify MPI/OpenMP tradeoffs
- Analyze load imbalance, MPI versus OpenMP
- Look forward to motivate coming features: the MPI/OpenMP environment, a "virtual machine model"