Flexible and Efficient Control of Data Transfers for Loosely Coupled Components
Joe Shang-Chieh Wu
Department of Computer Science, University of Maryland, USA
What & How
- Obtain more accurate results by coupling existing (parallel) physical simulation components
- Different time and space scales for data produced in shared or overlapped regions
- Runtime decisions for which time-stamped data objects should be exchanged
- Performance might be a concern
Roadmap
- Approximate Match [Grid 2004]
- Collective Buffering [IPDPS 2007]
- Distributed Approximate Match + Eager Transfer [under submission]
- Conclusion
Matching is OUTSIDE components
- Separate matching (coupling) information from the participating components
  - Maintainability – Components can be developed/upgraded individually
  - Flexibility – Change participants/components easily
  - Functionality – Support numerical algorithms or visualizations with variable-sized time intervals
Basic Operation
[Figure: the importer component requests an array for T = 2.5. The exporter component has exported distributed arrays for T = 1, 2, 3 and 4. The approximate match returns the matched array for T = 3 as the imported distributed array. Layers shown: Runtime-based Approximate Match Library, Distributed Array Transfer Library. Arrays are distributed among multiple processes.]
Separate codes from matching

Exporter App0:
    define region R1
    define region R4
    define region R5...
    Do t = 1, N, Step0...
        // computation jobs
        export(R1,t)
        export(R4,t)
        export(R5,t)
    EndDo

Importer App1:
    define region R2...
    Do t = 1, M, Step1
        import(R2,t)...
        // computation jobs
    EndDo

Configuration file:
    #
    App0 cluster0 /bin/App
    App1 cluster1 /bin/App
    App2 cluster2 /bin/App
    App4 cluster4 /bin/App4 4
    #
    App0.R1 App4.R0 REGL 0.05
    App0.R1 App2.R0 REG 0.1
    App0.R4 App1.R2 REGU 0.5
    #

Each connection line lists: Source, Sink, Policy, Precision.

Connection-Wise Approximate Match
For the connection App0.R4 App1.R2 REGU 0.5 and a request import(R2,t): find t’ in App0, s.t.
(a) t <= t’ <= t + 0.5
(b) minimize t’ – t
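The connection lines of the configuration file can be read mechanically. The following is a minimal sketch, assuming each connection line holds four whitespace-separated fields (source region, sink region, policy, precision) and that regions are written as App.Region. The names Connection and parse_connection are hypothetical and are not part of the library shown on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One exporter-to-importer coupling (hypothetical helper type)."""
    exporter: str
    export_region: str
    importer: str
    import_region: str
    policy: str
    precision: float


def parse_connection(line: str) -> Connection:
    """Parse a line such as 'App0.R4 App1.R2 REGU 0.5'.

    Assumes four whitespace-separated fields; the real file format may
    allow more than what the slide shows.
    """
    source, sink, policy, precision = line.split()
    exporter, export_region = source.split(".")
    importer, import_region = sink.split(".")
    return Connection(exporter, export_region, importer, import_region,
                      policy, float(precision))


if __name__ == "__main__":
    print(parse_connection("App0.R4 App1.R2 REGU 0.5"))
```

Because the coupling lives entirely in this file, changing which importer receives App0.R4 only requires editing one line, not either program.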
Dissection of Execution Time
- Execution time is composed of
  - Computation time (T_comp)
  - Buffering time (T_buf)
  - Matched data transfer time (T_tran)
- T_buf matters when exporter components (data sources) run more slowly
- T_tran matters when importer components (data sinks) run more slowly
Collective Buffering (when exporters run more slowly)
- Fastest export process sends runtime match results to slower processes in the same program
- Unnecessary memory copies can be avoided in slower processes
- Optimal State: only required exported data are buffered
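The effect on a slower export process can be illustrated with a toy model. This is only a sketch under stated assumptions, not the library's protocol: a slow process is assumed to know the match results for every timestamp up to decided_until (received from the fastest process), and required is the set of timestamps that were actually matched. The function name stamps_to_buffer and both parameters are hypothetical.

```python
from typing import Iterable, List, Set


def stamps_to_buffer(export_stamps: Iterable[float],
                     decided_until: float,
                     required: Set[float]) -> List[float]:
    """Return the timestamps a slow export process still has to copy.

    A timestamp is skipped only when its match result is already known
    (t <= decided_until) and it was not matched. Anything beyond the
    known horizon must be buffered, since it may be requested later.
    """
    buffered = []
    for t in export_stamps:
        if t <= decided_until and t not in required:
            continue  # known to be unneeded: avoid the memory copy
        buffered.append(t)
    return buffered


if __name__ == "__main__":
    stamps = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(stamps_to_buffer(stamps, 0.0, set()))       # no knowledge yet
    print(stamps_to_buffer(stamps, 3.0, {3.0}))       # partial knowledge
    print(stamps_to_buffer(stamps, 5.0, {3.0, 5.0}))  # full knowledge
```

With no knowledge every object is copied, with partial knowledge some copies are skipped, and with full knowledge only the required objects are buffered, which is the optimal state named above.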
Collective Buffering Result
[Figure: data exporting time for the slowest process. Labels: Copy All, Copy Some, Only Copy Required, Optimal State.]
Eager Transfer + Distributed Match (when importer runs more slowly)
- Bandwidth and latency both contribute to matched data transfer time
- Eager transfer, transferring predicted data in advance, solves the bandwidth issue
- Distributed approximate match, running on both exporter and importer, solves the latency issue
[Figure: comparison of Original, ET Only (eager transfer only), and ET+DM (eager transfer + distributed match).]
Conclusion
- Runtime-based approximate match is one way to couple components with different time scales
- Performance can be improved
  – When exporter runs more slowly, avoid unnecessary memory copies
  – When importer runs more slowly, transfer predicted data and meta-data in advance
The End
Questions?
On-Demand Approach
- Import component makes request
- Perform approx match on export component, and then transfer matched data
- Need data transfer time (T_3 – T_2) and 2 one-way delays (T_2 – T_1)
Eager Transfer Only
- Get permission to push predicted data
- Transfer predicted data in advance
- Import component makes request
- Perform approx match on export component
- Need 2 one-way delays (T_16 – T_15)
Eager Transfer With Distributed Match
- …
- Transfer predicted data + meta-data in advance
- The import component's request becomes a local operation
- Only the local operation time (T_26 – T_25) is needed, independent of the one-way delay
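The three approaches differ in what the importer waits for after issuing a request. The sketch below restates those costs as a small function, assuming the wait is exactly the sum of the terms named on the slides (match computation time on the exporter is ignored, and predictions are assumed to be correct). The function name import_wait and the scheme labels are hypothetical.

```python
def import_wait(scheme: str, one_way_delay: float,
                transfer_time: float, local_op_time: float) -> float:
    """Time the importer waits for matched data under each scheme.

    Simplified cost model taken from the slide text only.
    """
    if scheme == "on_demand":
        # request + reply, then the matched data still has to move
        return 2 * one_way_delay + transfer_time
    if scheme == "eager_only":
        # data was pushed in advance, but the match still runs remotely
        return 2 * one_way_delay
    if scheme == "eager_distributed":
        # data and meta-data are already local, so the match is local
        return local_op_time
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("on_demand", "eager_only", "eager_distributed"):
        print(scheme, import_wait(scheme, one_way_delay=10,
                                  transfer_time=50, local_op_time=1))
```

In this model eager transfer removes the bandwidth term and distributed match removes the latency term, matching the roles assigned to them earlier in the talk.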
All Together
Supported matching policies (x = requested timestamp, f(x) = matched timestamp, p = precision)

Policy   Match
LUB      minimum f(x) with f(x) ≥ x
GLB      maximum f(x) with f(x) ≤ x
REG      f(x) minimizes |f(x)-x| with |f(x)-x| ≤ p
REGU     f(x) minimizes f(x)-x with 0 ≤ f(x)-x ≤ p
REGL     f(x) minimizes x-f(x) with 0 ≤ x-f(x) ≤ p
FASTR    any f(x) with |f(x)-x| ≤ p
FASTU    any f(x) with 0 ≤ f(x)-x ≤ p
FASTL    any f(x) with 0 ≤ x-f(x) ≤ p
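The policy definitions translate directly into a selection over the exported timestamps. The following is a minimal sketch, assuming all exported timestamps are already known and held in a list; in the real runtime the exporter produces them incrementally, which this sketch does not model. The function name approximate_match is hypothetical, and for the FAST policies the first acceptable timestamp is returned as one valid choice of "any".

```python
from typing import Optional, Sequence


def approximate_match(policy: str, x: float, exported: Sequence[float],
                      p: float = 0.0) -> Optional[float]:
    """Return the exported timestamp matched to request x, or None."""
    if policy == "LUB":
        candidates = [t for t in exported if t >= x]
        return min(candidates) if candidates else None
    if policy == "GLB":
        candidates = [t for t in exported if t <= x]
        return max(candidates) if candidates else None

    if policy in ("REG", "FASTR"):
        candidates = [t for t in exported if abs(t - x) <= p]
    elif policy in ("REGU", "FASTU"):
        candidates = [t for t in exported if 0 <= t - x <= p]
    elif policy in ("REGL", "FASTL"):
        candidates = [t for t in exported if 0 <= x - t <= p]
    else:
        raise ValueError(f"unknown policy: {policy}")

    if not candidates:
        return None
    if policy.startswith("FAST"):
        return candidates[0]  # any acceptable timestamp will do
    # REG, REGU, REGL: closest acceptable timestamp to the request
    return min(candidates, key=lambda t: abs(t - x))


if __name__ == "__main__":
    stamps = [1.0, 2.0, 3.0, 4.0]
    print(approximate_match("REGU", 2.5, stamps, p=0.5))
    print(approximate_match("REGL", 2.5, stamps, p=0.5))
    print(approximate_match("REGU", 2.5, stamps, p=0.1))
```

With exported timestamps 1 through 4 and a request for 2.5, REGU with precision 0.5 selects 3, which is the case shown in the basic operation example.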