Slide 1: Performance Profiling Overhead Compensation for MPI Programs
Sameer Shende, Allen D. Malony, Alan Morris, Felix Wolf
{malony,sameer,amorris}@cs.uoregon.edu, fwolf@cs.utk.edu
Performance Research Laboratory, Department of Computer and Information Science, University of Oregon
Innovative Computing Laboratory, University of Tennessee
EuroPVM-MPI 2005
Slide 2: Outline
- Problem description
- Overhead modeling and compensation analysis
- Motivating example
  - Master-worker test case
- Profiling and on-the-fly compensation models
- Schemes to piggyback delay in message passing
  - MPI implementation
- Demonstration of overhead compensation
  - Monte Carlo master-worker application profiling
- Conclusions
Slide 3: Empirical Parallel Performance Analysis
- Measurement-based analysis process
  - Performance observation (measurement)
  - Performance diagnosis (finding / explaining problems)
- Profiling and tracing are the two main measurement methods
- Profiling computes summary statistics
  - Trades online analysis for less information
  - Extra computation with less runtime data size
- Tracing captures time-sequenced records of events
  - More analysis opportunities, including profile analysis
  - Off-line analysis may be more complex
  - Produces a larger volume of performance information
- Tracing is considered to be of "higher cost"
Slide 4: Overhead, Intrusion, and Perturbation
- All performance measurements generate overhead
  - Overhead is the cost of performance measurement
  - Execution time, # instructions, memory references, ...
- Overhead causes (generates) performance intrusion
  - Intrusion is the dynamic performance effect of overhead
  - Execution time slowdown, increased # instructions, ...
- Intrusion causes (potentially) performance perturbation
  - Perturbation is the change in performance behavior
  - Alteration of "probable" performance behavior
  - Measurement does not change "possible" executions
- Perturbation can lead to erroneous performance results
Slide 5: Performance Analysis Conundrum
- What is the "true" parallel computation performance?
  - Some technique must be used to observe performance
  - Performance measurement causes intrusion
  - Any performance intrusion might result in perturbation
  - Performance analysis is based on performance measures
- How is the "accuracy" of performance analysis evaluated?
  - How is this done when "true" performance is unknown?
- Uncertainty applies to all experimental methods
  - "Truth" lies just beyond the reach of observation
  - Accuracy will be a relative assessment
Slide 6: Profiling Types
- Flat profiles
  - Performance distributed onto the static program structure
- Path profiles
  - Performance associated with program execution paths
- Events
  - Entry/Exit (Begin/End): change in metrics between events
  - Atomic: current value of a metric at event occurrence
- Profile analysis
  - Inclusive statistics: performance of descendant events is included
  - Exclusive statistics: performance of descendant events is not included
Slide 7: Profiling Strategy and Compensation Problem
- We advocate measured profiling as a method of choice
  - Profiling intrusion is reported as % slowdown in execution
  - Implicit assumption that overhead is equally distributed
- Profiling results may be distorted without compensation
  - Parallel profiling results may be skewed
- Is it possible to account for overhead effects?
- Is it possible to compensate for overhead?
- How are profiling analyses affected?
- What improvements in profiling accuracy result?
Slide 8: Overhead Compensation Methods
- Trace-based (Malony, Ph.D. thesis; PPoPP '91/'92)
  - Overhead compensation in event-based execution replay (post-mortem and off-line)
  - Analysis and repair of performance perturbations
    - Apply Lamport's "happened-before" relation
    - Correct "errors" while maintaining partial-order dependencies
  - Both profile and trace performance analysis possible
- Profile-based
  - Needs online compensation models
  - On-the-fly measurement and profile analysis algorithm
  - Explicit process interaction required
Slide 9: Models for Overhead Compensation in Profiling
- Overhead compensation in profiling is a harder problem
- "Serial" compensation models (Euro-Par 2004)
  - Compensate for local process overhead only
  - Do not take parallel dependencies into account
- Parallel compensation models (Euro-Par 2005)
  - Account for the interdependency of overhead effects
  - Must track and exchange "delay" information
  - Attempt to correct waiting time
  - Cannot correct execution-order perturbations, but such perturbations can be identified
- Model implementation in MPI (EuroPVM-MPI 2005)
  - On-the-fly algorithm implemented in TAU
Slide 10: Motivating Example: Parallel Master-Worker
- Figure: timelines of a master (M) and three workers (W1-W3), comparing the measured execution against the approximated execution (rational reconstruction); annotations mark waiting time and overhead on each timeline
- Workers must communicate their overhead to the master in order for the master to know when messages would have been received
- M encounters very little overhead of its own, only at the beginning and end of the execution
Slide 11: Profiling and On-the-Fly Compensation Models
- Study parallel profile measurement cases
  - Rational explanation of the effects of measurement overhead
  - Local (independent) events and measurement intrusion
  - Interprocess (dependent) events and the impact on interactions
- Reconstruct execution behavior without measurement
  - Use knowledge of events and overhead
  - Remove overhead and recompute event timings
  - Maintain interprocess dependencies
- Learn compensation algorithms from reconstructed cases
  - Compare measured vs. approximated executions
  - Study enough cases until a general solution appears
Slide 12: One-Message Scenario (Case 1)
- Figure: process P1 sends (S) to process P2, which begins the receive (Rb), waits (w), and ends the receive (Re); o1 and o2 are the accumulated measurement overheads on P1 and P2; the measured execution is compared with the approximated execution (rational reconstruction)
- Legend: S = send, Rb = receive begin, w = wait, Re = receive end
- Case 1 condition: o1 >= o2 + w
  - w' = 0
  - o2' = o2 + w
  - x1 = o1, x2 = min(o1, o2 + w) = o2 + w
  - x1 and x2 represent the "delay" of future events
- The overhead must absorb the erroneous waiting
- The sending process must tell the other process how much earlier the message would have been sent
Slide 13: One-Message Scenario (Case 2)
- Figure: the same send/receive pattern, measured vs. approximated execution; legend: S = send, Rb = receive begin, w = wait, Re = receive end
- Case 2 condition: o1 < o2 + w
  - w' = w + (o2 - o1)
  - o2' = o2 - (o1 - o2) if o1 > o2
  - x1 = o1, x2 = min(o1, o2 + w) = o1
  - x1 and x2 represent the "delay" of future events
- Waiting time may increase
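The two one-message cases can be captured in a small helper. A minimal sketch in C, assuming o1 arrives as the sender's piggybacked delay and o2 and w are the receiver's accumulated overhead and measured waiting time; the names are illustrative, not TAU's API:

/* Minimal sketch of the one-message compensation cases (illustrative names).
   o1: delay reported by the sender; o2: receiver's accumulated overhead;
   w : measured waiting time in the receive. Returns x2, the delay applied
   to all future events on the receiving process. */
double compensate_receive(double o1, double o2, double w,
                          double *w_new, double *o2_new)
{
    double x2 = (o1 < o2 + w) ? o1 : o2 + w;   /* x2 = min(o1, o2 + w) */
    if (o1 >= o2 + w) {
        /* Case 1: the waiting was an artifact of measurement overhead;
           the overhead must absorb the erroneous waiting. */
        *w_new  = 0.0;
        *o2_new = o2 + w;
    } else {
        /* Case 2: real waiting remains and may even increase. */
        *w_new  = w + (o2 - o1);
        *o2_new = o2;   /* simplification; the slide refines o2' when o1 > o2 */
    }
    return x2;
}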
Slide 14: General Algorithm
- Based on a generalization of the two-process model
- Update local overhead and delay based on measurement
- Update local overhead and delay based on messages
  - On receives, only the delay value reported by the sender is used
- Each process transmits its local delay with every send message
- Important: only the overhead value is used in profile calculations
  - Profile calculations subtract only the overhead, for both inclusive and exclusive performance
- Implement the general algorithm in parallel profiling systems (a bookkeeping sketch follows below)
  - Implemented in the TAU performance system
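A minimal sketch of the per-process bookkeeping this algorithm implies, assuming a single overhead/delay pair per process; the names are illustrative and not TAU internals:

/* Per-process state for on-the-fly compensation (illustrative sketch). */
typedef struct {
    double overhead;   /* accumulated local measurement overhead */
    double delay;      /* how much earlier future events would have occurred */
} comp_state;

static comp_state local = { 0.0, 0.0 };

/* Every instrumented event adds its measured cost to both quantities. */
void record_overhead(double cost)
{
    local.overhead += cost;
    local.delay    += cost;
}

/* On a send, the current local delay is piggybacked with the message. */
double delay_to_piggyback(void) { return local.delay; }

/* On a receive, combine the sender's reported delay with the local state,
   following the one-message cases on the previous slides. */
void on_receive(double sender_delay, double measured_wait)
{
    if (sender_delay >= local.overhead + measured_wait) {
        local.overhead += measured_wait;   /* waiting was a measurement artifact */
        local.delay     = local.overhead;
    } else {
        local.delay     = sender_delay;    /* receive end follows the sender's delay */
    }
}

/* Profile calculations subtract only the overhead (inclusive and exclusive). */
double corrected_time(double measured) { return measured - local.overhead; }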
Slide 15: How Is the Local Delay Communicated?
- The sender must send its local delay in each message
- The problem is how to do this
  - Need to avoid adding extra intrusion
  - Need to avoid further perturbing performance
- Goal
  - Provide a widely portable prototype
  - Efficiently implemented and easily applied
- The capability is not currently available in the MPI library
- Build a prototype for MPI in TAU
Slide 16: MPI Implementation – Scheme 1
- Ideally the delay would be included in each send message
  - Look for methods to "piggyback" the delay on messages
- Modify the source code of the underlying MPI implementation
  - Extend the message header in the communication substrate
  - Approach taken by Photon
- Not portable to all MPI implementations
  - Relies on a specially instrumented communication library
Slide 17: MPI Implementation – Scheme 2
- Send an additional message containing the delay information
  - Done using the portable MPI wrapper interposition library (PMPI)
  - Portable to all MPI implementations
- Performance penalty from the extra message transmission
  - Penalty not incurred in the first scheme
  - Penalty is both overhead and perturbation
  - Would require further compensation
- Delay information should be tied to the original message
  - Hard to guarantee a tight coupling
  - Could lead to other problems if not
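A minimal sketch of what such a PMPI wrapper might look like; tau_local_delay and TAU_DELAY_TAG are illustrative names (not part of TAU or MPI), and the buffer signatures follow the const-qualified MPI-3 prototypes:

#include <mpi.h>

/* Scheme 2 sketch: transmit the delay in a separate message on a reserved
   tag via the PMPI interposition interface. */
static double tau_local_delay = 0.0;
#define TAU_DELAY_TAG 32767    /* MPI guarantees tags up to at least 32767 */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    /* the extra message below is exactly the added overhead and perturbation
       the slide warns about */
    PMPI_Send(&tau_local_delay, 1, MPI_DOUBLE, dest, TAU_DELAY_TAG, comm);
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    MPI_Status st;
    double sender_delay;
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, &st);
    PMPI_Recv(&sender_delay, 1, MPI_DOUBLE, st.MPI_SOURCE,
              TAU_DELAY_TAG, comm, MPI_STATUS_IGNORE);
    /* sender_delay would be handed to the compensation model here */
    if (status != MPI_STATUS_IGNORE) *status = st;
    return rc;
}

The loose coupling the slide warns about is visible here: nothing ties the delay message to the payload beyond MPI's per-source, per-communicator ordering.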
Slide 18: MPI Implementation – Scheme 3
- Copy the contents of the original message to a new message
  - Create a new message header that includes the delay information
  - Send the new message; the receiver receives it and must copy the original message to its destination
- Portability advantage of the second scheme
  - Could be implemented using PMPI
- Also avoids transmitting an additional message
- Copying message contents is expensive
  - Must be regarded as overhead and compensated
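One way the copy-based scheme could be realized on the send side, sketched here with MPI_Pack (tau_local_delay is an illustrative name); the receiver would MPI_Unpack the payload and the delay symmetrically:

#include <mpi.h>
#include <stdlib.h>

/* Scheme 3 sketch: copy payload and delay into one new, larger message. */
static double tau_local_delay = 0.0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int sz_payload, sz_delay, pos = 0;
    MPI_Pack_size(count, type, comm, &sz_payload);
    MPI_Pack_size(1, MPI_DOUBLE, comm, &sz_delay);

    char *tmp = malloc(sz_payload + sz_delay);   /* the copy the slide calls expensive */
    MPI_Pack(buf, count, type, tmp, sz_payload + sz_delay, &pos, comm);
    MPI_Pack(&tau_local_delay, 1, MPI_DOUBLE, tmp, sz_payload + sz_delay, &pos, comm);

    int rc = PMPI_Send(tmp, pos, MPI_PACKED, dest, tag, comm);
    free(tmp);
    return rc;
}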
Slide 19: MPI Implementation – Scheme 4
- The problem with the third scheme is creating a new message
  - Needed because the delay information was put in the header
- Suppose we put the delay information in the message itself
  - The problem is that we cannot modify the original message
- Idea: create a new "structured" datatype with two members (see the sketch below)
  - A pointer to the original message buffer (the n elements of the datatype passed to the original MPI call)
  - A double-precision number containing the local delay
- The structure is committed as a new user-defined datatype
- MPI is instructed to send or receive one element of this datatype
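A minimal sketch of the send side of this scheme, assuming a PMPI wrapper and an illustrative tau_local_delay variable; the datatype is built over absolute addresses so no data is copied:

#include <mpi.h>

/* Scheme 4 sketch: piggyback the delay with a two-member derived datatype
   (the original payload plus one double), sent as one element from MPI_BOTTOM. */
static double tau_local_delay = 0.0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int          blocklens[2] = { count, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { type, MPI_DOUBLE }, piggy;

    MPI_Get_address((void *)buf, &displs[0]);       /* n elements of the original datatype */
    MPI_Get_address(&tau_local_delay, &displs[1]);  /* plus the local delay */
    MPI_Type_create_struct(2, blocklens, displs, types, &piggy);
    MPI_Type_commit(&piggy);

    int rc = PMPI_Send(MPI_BOTTOM, 1, piggy, dest, tag, comm);
    MPI_Type_free(&piggy);
    return rc;
}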
Slide 20: MPI Implementation – Scheme 4 (continued)
- Only one message is sent
  - Avoids expensive copying of data buffers
- MPI decides internally how to transmit the message
  - One option is to use vector read and write calls instead of their scalar counterparts
- The solution is portable to all MPI implementations
  - Structured datatypes are defined in the MPI standard
  - Can be implemented using PMPI
  - Efficient transmission is left to the MPI implementation
- Must wrap each MPI API call
  - Want to handle both synchronous and asynchronous calls
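The matching receive side of the sketch above builds the same two-member datatype over its own buffer and a local double, so the piggybacked delay lands exactly where the receiver wants it:

#include <mpi.h>

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    double       sender_delay = 0.0;
    int          blocklens[2] = { count, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { type, MPI_DOUBLE }, piggy;

    MPI_Get_address(buf, &displs[0]);
    MPI_Get_address(&sender_delay, &displs[1]);
    MPI_Type_create_struct(2, blocklens, displs, types, &piggy);
    MPI_Type_commit(&piggy);

    int rc = PMPI_Recv(MPI_BOTTOM, 1, piggy, src, tag, comm, status);
    MPI_Type_free(&piggy);
    /* sender_delay now holds the sender's local delay for the compensation model */
    return rc;
}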
Slide 21: Mapping MPI Calls – Send and Receive
- Synchronous (MPI_Send and MPI_Recv)
  - An automatic variable holding the delay value is allocated on the stack
- Asynchronous (MPI_Isend and MPI_Irecv)
  - Sender implementation
    - A global variable in heap memory is used to store the delay value
    - The location of this variable is used in the new datatype structure
  - Receiver implementation
    - The received piggybacked delay is copied to a heap location
    - Need a map linking the delay value to the MPI call
    - The calculations cannot be placed in MPI_Isend and MPI_Irecv; the message is only visible in the waiting or testing calls (MPI_Wait, MPI_Test, or variants)
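A minimal sketch of the receive-side bookkeeping for the nonblocking case, reusing the Scheme 4 datatype; MAX_PENDING and the flat array are simplifying assumptions, not the paper's data structure:

#include <mpi.h>
#include <stdlib.h>

#define MAX_PENDING 1024   /* assumed bound; a real implementation would use a dynamic map */

typedef struct { MPI_Request req; double *delay; } delay_entry;
static delay_entry pending[MAX_PENDING];
static int npending = 0;

int MPI_Irecv(void *buf, int count, MPI_Datatype type, int src, int tag,
              MPI_Comm comm, MPI_Request *request)
{
    double      *delay        = malloc(sizeof(double));  /* heap location outliving this call */
    int          blocklens[2] = { count, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { type, MPI_DOUBLE }, piggy;

    MPI_Get_address(buf, &displs[0]);
    MPI_Get_address(delay, &displs[1]);
    MPI_Type_create_struct(2, blocklens, displs, types, &piggy);
    MPI_Type_commit(&piggy);

    int rc = PMPI_Irecv(MPI_BOTTOM, 1, piggy, src, tag, comm, request);
    MPI_Type_free(&piggy);                      /* freeing is deferred until completion */

    pending[npending].req   = *request;         /* map: request -> where the delay will land */
    pending[npending].delay = delay;
    npending++;
    return rc;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    MPI_Request r = *request;                   /* capture before MPI sets it to MPI_REQUEST_NULL */
    int rc = PMPI_Wait(request, status);
    for (int i = 0; i < npending; i++) {
        if (pending[i].req == r) {
            /* *pending[i].delay now holds the piggybacked delay; feed it to the
               compensation model here, then release the map entry. */
            free(pending[i].delay);
            pending[i] = pending[--npending];
            break;
        }
    }
    return rc;
}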
Slide 22: Mapping MPI Calls – Collective Operations
- Asynchronous MPI calls can be perturbed by overhead
  - Different receive order than without measurement
  - Must maintain receive order to ensure determinacy
- We know more about collective operations
- Consider MPI_Gather
  - Extract all piggybacked delay values into an array
  - Compute the minimum delay over the MPI communicator; this effectively identifies the last process to arrive at the gather
  - The root's waiting time is adjusted based on this minimum delay
- Collective operations reduce to finding the minimum delay
  - Overhead is compensated accordingly
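A sketch of the gather case; for brevity the delays are collected with a second PMPI_Gather rather than the piggyback datatype, and tau_local_delay is an illustrative name:

#include <mpi.h>
#include <stdlib.h>

static double tau_local_delay = 0.0;

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)
{
    int rc = PMPI_Gather(sendbuf, sendcount, sendtype,
                         recvbuf, recvcount, recvtype, root, comm);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *delays = (rank == root) ? malloc(size * sizeof(double)) : NULL;
    PMPI_Gather(&tau_local_delay, 1, MPI_DOUBLE, delays, 1, MPI_DOUBLE, root, comm);

    if (rank == root) {
        double min_delay = delays[0];            /* minimum identifies the last arrival */
        for (int i = 1; i < size; i++)
            if (delays[i] < min_delay) min_delay = delays[i];
        /* the root's waiting time in the gather would be adjusted by min_delay here */
        free(delays);
    }
    return rc;
}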
Slide 23: Mapping MPI Calls – More Collective Operations
- MPI_Bcast reduces to a synchronous send/receive for each process in the MPI communicator
- MPI_Scatter behaves like MPI_Gather
  - Same as receiving a message from the root in all tasks
- MPI_Barrier is implemented as a combination
  - MPI_Gather of local delays to the root
  - The root finds the minimum delay and adjusts its waiting time
  - The root determines its local delay
  - MPI_Bcast communicates the root's local delay to all processes in the communicator
- Efficiencies of the underlying MPI substrate are preserved
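A sketch of the barrier mapping under the same illustrative assumptions (delays collected with plain PMPI calls; names are not TAU's):

#include <mpi.h>
#include <stdlib.h>

static double tau_local_delay = 0.0;

int MPI_Barrier(MPI_Comm comm)
{
    int rank, size, root = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rc = PMPI_Barrier(comm);

    /* gather local delays to the root */
    double *delays = (rank == root) ? malloc(size * sizeof(double)) : NULL;
    PMPI_Gather(&tau_local_delay, 1, MPI_DOUBLE, delays, 1, MPI_DOUBLE, root, comm);

    if (rank == root) {
        double min_delay = delays[0];
        for (int i = 1; i < size; i++)
            if (delays[i] < min_delay) min_delay = delays[i];
        /* root adjusts its waiting time using min_delay and updates its own delay */
        free(delays);
    }

    /* broadcast the root's local delay to every process in the communicator */
    double root_delay = tau_local_delay;
    PMPI_Bcast(&root_delay, 1, MPI_DOUBLE, root, comm);
    /* each process would combine root_delay with its own state per the model */
    return rc;
}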
Slide 24: Compensated Parallel Master-Worker Scenario
- Figure: master (M) and worker (W1-W3) timelines comparing the measured execution with the compensated execution; waiting time on the master shrinks once overhead is compensated
- Receive event ordering must be maintained in overhead compensation
Slide 25: Master-Worker Overhead Compensation in TAU
- MPI program to compute π with Monte Carlo integration
  - The master generates work (a pair of random coordinates)
  - Workers determine whether the coordinates fall above or below the π curve
  - The estimate is refined iteratively until it is within a given range
- Four execution modes, with measured times for the master and a worker:

  Execution mode                    Master     Worker
  No instrumentation                73.926     73.834
  Full, no compensation             128.179    128.173
  Full, local-only compensation     139.56     73.212
  Full, parallel compensation       74.126     73.909
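For reference, a minimal C sketch of the worker's test and the estimate, assuming the usual quarter-circle formulation of Monte Carlo π (the slide does not give the benchmark's exact details):

/* Worker-side test: does a random point in the unit square fall under the curve? */
int under_curve(double x, double y)
{
    return x * x + y * y <= 1.0;   /* inside the quarter circle of radius 1 */
}

/* pi is estimated as 4 * (hits / samples); the master keeps handing out
   coordinate pairs until the estimate converges to the requested range. */
double estimate_pi(long hits, long samples)
{
    return 4.0 * (double)hits / (double)samples;
}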
Slide 26: Monte Carlo Integration of π – Profiles
- Compare TAU profiles for the different instrumentations
  - Many application events are profiled to generate overhead
- Figure: per-process (M, W1-W3) profile bars for four cases: MPI-only instrumentation (reference execution), full instrumentation with no compensation (74% error), full instrumentation with local compensation (89% error in the master), and full instrumentation with parallel compensation (1.0-1.4% error with all events instrumented)
Slide 27: Conclusions
- Developed models for parallel overhead compensation
  - Account for interprocess dependencies
  - Identified the need to communicate "delay"
- Constructed on-the-fly algorithms based on the models
  - Support message-passing parallelism
  - Integrated in the TAU parallel profiling system
- Validated parallel overhead compensation
  - Master-worker application
- Extend the techniques to semantics-based compensation
  - Utilize knowledge of the communication operation
Slide 28: Acknowledgements
- Department of Energy (DOE) MICS office
  - "Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System"
  - "Performance Technology for Productive Parallel Computing"
- NNSA/ASC
  - University of Utah DOE ASCI Level 1 sub-contract
  - ASCI Level 3 project (LANL, LLNL, SNL)
- URLs
  - TAU: http://www.cs.uoregon.edu/research/tau
  - PDT: http://www.cs.uoregon.edu/research/pdt