A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications
Vladimir Subotic, Jose Carlos Sancho, Jesus Labarta, Mateo Valero
Barcelona Supercomputing Centre / Universidad Politecnica de Catalunya
Overview
- Introduction
- Proposed framework
- Results and findings
- Conclusion and future work
Motivation
- Communication is overhead, and networks are becoming increasingly expensive.
- Rather than adding new cycles to the network, we must learn to profit more from the cycles that already exist.
- Communication-computation overlap promises:
  - execution speedup
  - relaxation of network requirements
- Modern networks are already there: current platforms provide good support for overlap.
- So, what is the overlapping potential of applications?
Overlap at the MPI level
Mechanisms of overlap:
1) chunking
2) advancing sends
3) postponing receptions
4) double buffering
Overlap of the i-th chunk is extremely dependent on the computation patterns! (A minimal code sketch of the chunking mechanism follows below.)
[Figure: timelines of processes A and B over iterations i and i+1, contrasting the non-overlapped execution (one computation burst Tp, one MPI transfer through the send and receive buffers, one burst Tc) with the overlapped, double-buffered execution, where bursts Tp1-Tp4 and Tc1-Tc4 interleave with four chunk transfers.]
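The slides contain no code; the following is a minimal sketch in MPI C of the chunking mechanism named above, under stated assumptions: the sender posts each quarter of the message with a non-blocking send as soon as it is produced, and the receiver pre-posts the receptions and consumes chunks as they arrive. compute_chunk and consume_chunk are hypothetical stand-ins for the application's real computation.

#include <mpi.h>

#define NCHUNKS 4

/* Hypothetical application hooks standing in for the real computation. */
void compute_chunk(double *chunk, int n);
void consume_chunk(double *chunk, int n);

void chunked_send(double *buf, int count, int peer)
{
    MPI_Request req[NCHUNKS];
    int chunk = count / NCHUNKS;   /* assumes count is divisible by NCHUNKS */
    for (int i = 0; i < NCHUNKS; i++) {
        compute_chunk(buf + i * chunk, chunk);        /* produce Tp_i */
        MPI_Isend(buf + i * chunk, chunk, MPI_DOUBLE, /* advance the send: */
                  peer, i, MPI_COMM_WORLD, &req[i]);  /* transfer overlaps Tp_{i+1} */
    }
    MPI_Waitall(NCHUNKS, req, MPI_STATUSES_IGNORE);
}

void chunked_recv(double *buf, int count, int peer)
{
    MPI_Request req[NCHUNKS];
    int chunk = count / NCHUNKS;
    for (int i = 0; i < NCHUNKS; i++)                 /* post receptions early */
        MPI_Irecv(buf + i * chunk, chunk, MPI_DOUBLE,
                  peer, i, MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < NCHUNKS; i++) {
        MPI_Wait(&req[i], MPI_STATUS_IGNORE);         /* chunk i has arrived */
        consume_chunk(buf + i * chunk, chunk);        /* consume Tc_i while later chunks arrive */
    }
}

This combines three of the four mechanisms: chunking, advancing sends (Isend as soon as a chunk is ready), and postponing receptions (Wait only when a chunk is actually needed).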
Related work
- Implementations:
  - software techniques (Danalis [SC'05])
  - compiler support (Das [IPDPS'08])
  - these fail to clearly determine the potential of overlap
- Quantifying the potential of overlap in applications:
  - performance modeling (Sancho [SC'06])
- Continuing Sancho's work, our goals are to:
  - design a simulation framework for studying overlap
  - automatically estimate the potential of overlap in a given code
  - provide richer results and allow more detailed analysis

References:
[SC'05] A. Danalis, K.-Y. Kim, L. Pollock, and M. Swany. Transformations to Parallel Codes for Communication-Computation Overlap. In Supercomputing (SC'05), 2005.
[IPDPS'08] D. Das, M. Gupta, R. Ravindran, W. Shivani, P. Sivakeshava, and R. Uppal. Compiler-controlled extraction of computation-communication overlap in MPI applications. In IPDPS, pages 1-8, 2008.
[SC'06] J.C. Sancho, K.J. Barker, D.J. Kerbyson, and K. Davis. Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications. In Supercomputing (SC'06), 2006.
Designed environment
- Valgrind tracer:
  - intercepts MPI routines (a sketch of one common interception approach follows below)
  - intercepts memory accesses
  - time ~ #instructions
- Dimemas simulator:
  - linear communication model
  - non-linear effects (#buses, #ports)
- Paraver visualization
[Diagram: each MPI process runs under the Valgrind tracer, which emits a trace of the real non-overlapped run and a trace of the potential overlapped run; the Dimemas simulator replays them to reconstruct the original (non-overlapped) execution and the potential (overlapped) execution.]
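The slides do not show how the tracer intercepts MPI routines. One common mechanism, assumed here purely for illustration (it is not necessarily what the Valgrind tracer does), is the PMPI profiling interface: redefine an MPI entry point, record the event, and forward to the PMPI_ name. read_instruction_counter, log_cpu_burst, and log_send_event are hypothetical tracer helpers.

#include <mpi.h>

/* Hypothetical tracer helpers (not real library calls). */
extern unsigned long read_instruction_counter(void);
extern void log_cpu_burst(unsigned long instructions);
extern void log_send_event(int dest, int bytes);

static unsigned long last_count;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    unsigned long now = read_instruction_counter();
    log_cpu_burst(now - last_count);     /* CPU burst since the previous MPI event */
    MPI_Type_size(type, &size);
    log_send_event(dest, count * size);  /* record the transfer endpoint and size */
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    last_count = read_instruction_counter();
    return rc;
}

Measuring bursts in instructions rather than wall-clock time matches the slide's "time ~ #instructions" assumption, making the trace independent of the machine the tracer runs on.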
Tracing methodology
[Figure: the same non-overlapped vs. overlapped (double-buffered) timelines of processes A and B as on the previous slide, annotated with the traces obtained from them.]
Obtained traces:

Trace of the non-overlapped run:
  process A, iter i:
    CPU burst (Tp)
    Send -> B (Size)
  process B, iter i+1:
    Receive <- A (Size)
    CPU burst (Tc)

Trace of the overlapped run:
  process A, iter i:
    CPU burst (Tp1)
    Send -> B (Size/4)
    CPU burst (Tp2)
    Send -> B (Size/4)
    CPU burst (Tp3)
    Send -> B (Size/4)
    CPU burst (Tp4)
    Send -> B (Size/4)
  process B, iter i+1:
    IReceive <- A (Size)
    Wait <- A (Size/4)
    CPU burst (Tc1)
    Wait <- A (Size/4)
    CPU burst (Tc2)
    Wait <- A (Size/4)
    CPU burst (Tc3)
    Wait <- A (Size/4)
    CPU burst (Tc4)

* Tp = Tp1+Tp2+Tp3+Tp4, Tc = Tc1+Tc2+Tc3+Tc4
Modeling ideal patterns
Same traces as on the previous slide, but with the measured chunk bursts replaced by equal quarters of the totals. The non-overlapped trace is unchanged; the idealized overlapped trace becomes:

  process A, iter i:
    CPU burst (Tp/4)
    Send -> B (Size/4)
    CPU burst (Tp/4)
    Send -> B (Size/4)
    CPU burst (Tp/4)
    Send -> B (Size/4)
    CPU burst (Tp/4)
    Send -> B (Size/4)
  process B, iter i+1:
    IReceive <- A (Size)
    Wait <- A (Size/4)
    CPU burst (Tc/4)
    Wait <- A (Size/4)
    CPU burst (Tc/4)
    Wait <- A (Size/4)
    CPU burst (Tc/4)
    Wait <- A (Size/4)
    CPU burst (Tc/4)

(A first-order timing model for this idealized pipeline follows below.)
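The slides give no closed-form model; the following is a first-order sketch, an assumption of ours that ignores per-message latency, treating the idealized chunked execution as a three-stage produce-transfer-consume pipeline. S, B, n, p, t, c are notation introduced here.

% Assumed first-order model, not a formula from the slides.
% With message size S, bandwidth B, and n equal chunks, each chunk
% costs p = T_p/n to produce, t = S/(nB) to transfer, and c = T_c/n
% to consume, so the iteration behaves like a 3-stage pipeline:
\begin{align*}
  T_{\text{non-overlapped}} &= T_p + \tfrac{S}{B} + T_c,\\
  T_{\text{overlapped}}     &\approx (p + t + c) + (n-1)\,\max(p,\, t,\, c).
\end{align*}

Under this reading, the chunked execution pays the full cost only for the first chunk; every later chunk hides behind the slowest pipeline stage, which is why uneven (non-linear) computation patterns erode the benefit.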
Potential of the framework
- Allows studying various influences on the overlap:
  - the application (parallel behavior, computation patterns)
  - the network configuration
  - the overlapping technique used
- Visualization:
  - spots hard synchronization points
  - helps to optimize the application
[Screenshot: Paraver visualization of NAS-CG with 4 MPI processes.]
Experimental setup
- Applications:
  - NAS benchmarks: BT and CG
  - Sweep3D
  - POP
  - SPECFEM3D
  - Alya
- Dimemas configuration set to model MareNostrum:
  - 64 PowerPC 970 processors at 2.3 GHz
  - Myrinet, unidirectional bandwidth 250 MB/s
- Overlapping technique: breaking each MPI message into 4 chunks
Results: ideal computation patterns
Ideal (linear) patterns: at n% of the iteration's progress,
- the process has produced n% of the message to be sent (production pattern), and
- the process has consumed n% of the received message (consumption pattern),
as written out below.
[Plots: ideal production patterns and consumption patterns.]
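The same ideal patterns, written as functions of the normalized iteration progress x (notation introduced here, not on the slides):

% Linear production/consumption patterns: at progress x, a fraction x
% of the outgoing message has been produced and a fraction x of the
% incoming message has been consumed.
\[
  \mathrm{prod}(x) = x, \qquad \mathrm{cons}(x) = x, \qquad x \in [0, 1].
\]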
Results: measured computation patterns
- Very high diversity among the measured patterns.
[Plots: Sweep3D production pattern; NAS-BT consumption pattern.]
- A deliberately rough characterization: real (measured) patterns are unfavorable for overlap.
Results: overlapping speedup
(a) Overlapping speedup: speedup of the overlapped execution over the non-overlapped execution (written as a formula below).
[Bar chart: speedup [%] per application.]
- Real patterns are unfavorable for overlap.
- Speedup is high in Sweep3D because of its wavefront (pipeline) behavior.
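The slides report speedup in percent; a natural reading of the metric (an assumption about the exact definition) is:

\[
  \text{Speedup}\,[\%] =
    \left( \frac{T_{\text{non-overlapped}}}{T_{\text{overlapped}}} - 1 \right) \times 100.
\]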
Results: relaxing network requirements
(b) Relaxation of network bandwidth: the network bandwidth [MB/s] at which the overlapped execution gives the same performance as the non-overlapped execution at 250 MB/s (the defining condition is written below).
[Bar chart: required bandwidth per application.]
- Real patterns are unfavorable for overlap.
- Relaxation is extreme in Sweep3D because of its wavefront (pipeline) behavior.
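Restating the slide's definition as a condition (the symbol B_relax is notation assumed here):

\[
  T_{\text{overlapped}}(B_{\text{relax}})
    = T_{\text{non-overlapped}}(250\ \text{MB/s}).
\]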
Results: equivalent bandwidth advancement
(c) Equivalent bandwidth advancement: the network bandwidth [MB/s] at which the non-overlapped execution gives the same performance as the overlapped execution at 250 MB/s (the defining condition is written below).
[Bar chart: equivalent bandwidth per application.]
- Real patterns are not favorable for overlap.
- In Sweep3D, the performance achieved by overlap cannot be matched by any bandwidth advancement.
- In SPECFEM3D, the overlapping speedup is small, but the equivalent bandwidth advancement is very high.
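Symmetrically to the previous metric (the symbol B_equiv is notation assumed here):

\[
  T_{\text{non-overlapped}}(B_{\text{equiv}})
    = T_{\text{overlapped}}(250\ \text{MB/s}).
\]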
Conclusion and future work
- Contributions:
  - automatic approach
  - visualization
  - detailed analysis
- Findings on overlap:
  - strongly limited by computation patterns
  - significant relaxation of network requirements
  - overlapping benefits cannot be achieved by bandwidth advancement alone
  - benefits are especially high for pipeline applications
- Future work:
  - overlap at the MPI level is not enough; more overlap requires a more dataflow-like execution
  - turn to MPI+SMPSs, using the same simulation methodology
Questions?
A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications
Vladimir Subotic, Jose Carlos Sancho, Jesus Labarta, Mateo Valero
Barcelona Supercomputing Centre / Universidad Politecnica de Catalunya