Slide 1: Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications
D. Becker, M. Geimer, R. Rabenseifner, and F. Wolf
Laboratory for Parallel Programming | September 21, 2010
Slide 2: Cluster systems
Cluster systems represent the majority of today's supercomputers, owing to the availability of inexpensive commodity components. They show vast diversity in:
- Architecture
- Interconnect technology
- Software environment
Message-passing and shared-memory programming models are combined for communication and synchronization, resulting in hybrid MPI/OpenMP applications and the need for generic software tools.
Slide 3: Event tracing
Events are recorded at runtime to enable post-mortem analysis of dynamic program behavior. Each event includes at least a timestamp, a location, and an event type.
Application areas:
- Performance analysis (time-line visualization, wait-state analysis)
- Performance modeling
- Performance prediction
- Debugging
[Figure: per-process event streams for send, receive, and barrier operations are merged (optionally) and written as trace records]
Slide 4: Problem: non-synchronized clocks
Slide 5: Outline
- Motivation
- Clock synchronization
- Logical synchronization
- Algorithmic extensions
- Parallel synchronization
- Experimental evaluation
- Summary
Slide 6: Clock synchronization
- Network-based synchronization (Mills): query time from reference clocks synchronized at regular intervals
- Offset interpolation (Dunigan, Maillet, Tron, Doleschal): measure offset values and determine an interpolation function
- Error estimation (Duda, Hofman, Hilgers): determine a medial smoothing function based on send/receive differences
- Logical synchronization (Lamport, Mattern, Fidge, Rabenseifner): restore and preserve logical correctness
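Offset interpolation, as listed on the slide, maps a worker's local timestamps onto a reference clock between two offset measurements. A minimal sketch of the linear variant (the function name and two-measurement signature are illustrative, not from the talk):

```python
def corrected(t_local, t0, off0, t1, off1):
    """Map a local timestamp onto the reference clock by linearly
    interpolating between two offset measurements:
    (t0, off0) and (t1, off1), where off = reference - local."""
    slope = (off1 - off0) / (t1 - t0)  # clock drift between measurements
    return t_local + off0 + slope * (t_local - t0)
```

With measurements taken only at long intervals, drift between them still produces residual errors, which is what motivates the logical post-processing described next.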
Slide 7: Controlled logical clock
- Local correction: resolves clock condition violations
- Forward amortization: adjusts subsequent events
- Backward amortization: adjusts preceding events
[Figure: a receive event is shifted to at least µ after its matching send]
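The local-correction and forward-amortization steps above can be sketched for a single process's event list. This is a simplified illustration: the real controlled logical clock scales corrections down gradually during forward amortization, whereas this sketch only propagates the jump so that local event order is preserved; the data layout and µ value are assumptions.

```python
def clc_forward(events, mu):
    """Simplified forward pass of the controlled logical clock.

    events: list of dicts with 'ts' (local timestamp) and, for receive
    events, 'send_ts' (the already-corrected timestamp of the matching
    send). Returns the corrected timestamps.
    """
    out = []
    prev = float('-inf')
    for ev in events:
        ts = ev['ts']
        if ev.get('send_ts') is not None:
            # Local correction: a receive must happen at least mu
            # (minimum message latency) after its matching send.
            ts = max(ts, ev['send_ts'] + mu)
        # Forward amortization (simplified): never let an event
        # precede its local predecessor.
        ts = max(ts, prev)
        out.append(ts)
        prev = ts
    return out
```

Applied to a trace where a receive is stamped before its send, the receive and its followers are pushed forward until the clock condition holds.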
Slide 8: MPI semantics
[Figure: happened-before relations implied by MPI point-to-point messages and collective operations]
Slide 9: Limitations of the CLC algorithm
- Neither restores nor preserves the clock condition under OpenMP event semantics
- May introduce violations at locations that were previously intact
[Figure: a send/receive pair interacting with an omp_barrier]
Slide 10: Collective communication
- Consider OpenMP constructs as composed of multiple logical messages
- Define logical send/receive pairs for each flavor
[Figure: omp_barrier modeled as logical messages between the threads of a team]
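The decomposition into logical send/receive pairs can be sketched per construct flavor. The patterns below (fork as 1-to-n, join as n-to-1, barrier as n-to-n) are the usual interpretations of these constructs, not an exhaustive list from the talk, and the function is illustrative:

```python
def logical_pairs(construct, master, workers):
    """Decompose an OpenMP construct into logical (sender, receiver)
    pairs that the correction algorithm can treat like messages."""
    team = [master] + list(workers)
    if construct == "fork":      # master releases each worker: 1-to-n
        return [(master, w) for w in workers]
    if construct == "join":      # each worker reports to master: n-to-1
        return [(w, master) for w in workers]
    if construct == "barrier":   # every entry pairs with every exit: n-to-n
        return [(s, r) for s in team for r in team if s != r]
    raise ValueError(f"unknown construct: {construct}")
```

Replaying these pairs lets the existing message-based correction machinery cover OpenMP synchronization points as well.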
Slide 11: OpenMP semantics
[Figure: happened-before relations implied by OpenMP fork (F) and join (J) events, barriers, lock acquire/release (L/U), and tasking]
Slide 12: Happened-before relation
- An operation may have multiple logical receive and send events
- Multiple receives are used to synchronize multiple clocks
- The latest send event is the relevant send event
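The latest-send rule can be stated compactly: when an event logically receives from several senders, only the latest send constrains it, so the corrected timestamp is the maximum over all send-based bounds. A minimal sketch (names and data layout assumed):

```python
def correct_receive(recv_ts, send_ts_list, mu):
    """Corrected timestamp of an event with multiple logical receives:
    it must follow each matching send by at least mu, so only the
    latest send matters."""
    latest_send = max(send_ts_list)
    return max(recv_ts, latest_send + mu)
```

For a barrier exit, for example, `send_ts_list` would hold the corrected entry timestamps of all threads in the team.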
Slide 13: Parallelization
Correct the local traces in parallel:
- Keep the whole trace in memory
- Exploit distributed memory and processing capabilities
Replay the communication:
- Traverse the trace in parallel
- Exchange data at synchronization points
- Use an operation of the same type (MPI functions, OpenMP constructs)
Stages: linear offset interpolation, forward replay, backward replay.
Slide 14: Forward replay
[Figure: three threads traverse their traces forward in parallel and exchange timestamps at an omp_barrier]
Slide 15: Backward amortization
- Avoid new violations: do not advance a send farther than its matching receive
- Backward replay: exchange remote event data and store it temporarily
- Piece-wise correction: determine the correction and apply it independently
[Figure: send/receive pairs constraining how far preceding events may be advanced]
Slide 16: Backward replay
Data from the sender side is needed.
- Communication direction: communication proceeds in the backward direction, so the roles of sender and receiver are inverted
- Traversal direction: start at the end of the trace to avoid deadlocks
Slide 17: Piece-wise correction
[Figure: send/receive events on a time axis, with the corrections plotted as differences to LC_i^b over the amortization interval Δt]
- LC_i^b: controlled logical clock without jump discontinuities
- LC_i' - LC_i^b: controlled logical clock with jump discontinuities
- LC_i^A' - LC_i^b: linear interpolation for backward amortization
- LC_i^A - LC_i^b: piecewise linear interpolation for backward amortization
The amortization interval is bounded by min(LC_k'(corresponding receive event) - µ - LC_i^b), taken over the corrected receive events.
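The piece-wise correction above can be sketched as: spread the jump introduced at a corrected event linearly over the preceding events, but clamp every send so that it is not advanced past its matching (corrected) receive minus µ. This is a simplified single-interval illustration under assumed data structures, not the algorithm's full piecewise scheme:

```python
def backward_amortize(ts, jump_idx, delta, recv_bounds, mu):
    """Smooth a jump of `delta` applied at ts[jump_idx] over the
    preceding events by linear interpolation.

    recv_bounds[i], if present, is the corrected timestamp of the
    receive matching a send at index i; that send must not be advanced
    past recv_bounds[i] - mu (avoiding new clock condition violations).
    """
    out = list(ts)
    start = ts[0]
    span = ts[jump_idx] - start
    for i in range(jump_idx):
        frac = (ts[i] - start) / span if span else 0.0
        cand = ts[i] + frac * delta          # linear share of the jump
        bound = recv_bounds.get(i)
        if bound is not None:
            cand = min(cand, bound - mu)     # clamp sends at their receives
        out[i] = max(ts[i], cand)            # never move an event backward
    out[jump_idx] = ts[jump_idx] + delta
    return out
```

Where a send's clamp cuts into the linear ramp, the resulting correction becomes piecewise linear, which is the effect the slide's plot illustrates.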
Slide 18: Experimental evaluation
Setup: Nicole cluster (JSC@FZJ), 32 compute nodes with two quad-core Opterons running at 2.4 GHz, InfiniBand interconnect.
Applications: PEPC (4 threads per process), Jacobi solver (2 threads per process).
The evaluation focused on the frequency of clock condition violations and on the accuracy and scalability of the correction.
- A significant percentage of messages was violated (up to 5%)
- After correction, all traces were free of clock condition violations
Slide 19: Accuracy of the algorithm
Event position:
- Absolute deviations correspond to the value of the clock condition violations
- Relative deviations are negligible
Event distance:
- Larger relative deviations are possible
- The impact on analysis results is negligible
The correction changed the length of local intervals only marginally.
Slide 20: Synchronizing hybrid codes
- Only MPI semantics were violated in the original trace
- Roughly half of the corrections correspond to OpenMP semantics
- The algorithm preserved OpenMP semantics
Slide 21: Scalability
The correction scaled easily with the target application.
Slide 22: Summary
- Problem characterized: the controlled logical clock algorithm is limited
- Algorithm extended: identified happened-before relations in OpenMP semantics, covering realistic hybrid codes
- Algorithm parallelized: parallel forward and backward replay, with good accuracy and scalability
Slide 23: Outlook
- Exploit knowledge of MPI-internal messaging inside collective operations using PERUSE
- Leverage periodic offset measurements at global synchronization points: measure (indirect) offsets periodically, let CLC increase accuracy between measurements; a combined method is desirable
Slide 24: Thanks!