Slide 1: µπ - A Scalable & Transparent System for Simulating MPI Programs
Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
SimuTools, Malaga, Spain, March 17, 2010
Slide 2: Motivation & Background
Software & Hardware Lifetimes
– Lifetime of large parallel machine: 5 years
– Lifetime of useful parallel code: 20 years
– Port, analyze, optimize
– Ease of development: obviate actual scaled hardware
– Energy efficient: reduce failed runs at actual scale
Software & Hardware Design
– Co-design: e.g., 1 μs barrier cost/benefit
– Hardware: e.g., load from application
– Software: scaling, debugging, testing, customizing
Slide 3: μπ Performance Investigation System
μπ = micro parallel performance investigator
– Performance prediction for MPI, Portals, and other parallel applications
– Actual application code executed on the real hardware
– Platform is simulated at large virtual scale
– Timing customized by a user-defined machine model
Scale is the key differentiator
– Target: 1,000,000 virtual cores
– E.g., 1,000,000 virtual MPI ranks in the simulated MPI application
Based on the µsik micro simulator kernel
– Highly scalable PDES (parallel discrete event simulation) engine
Slide 4: Generalized Interface & Timing Framework
Accommodates an arbitrary level of timing detail
– Compute time: can use a full-system (instruction-level) simulation on the side, or model it with cache effects, a corrected processor speed, etc., depending on user desire and the accuracy-cost trade-off
– Communication time: can use a network simulator, queueing and congestion models, etc., depending on user desire and the accuracy-cost trade-off (an illustrative model is sketched below)
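As an illustration of the kind of user-supplied timing detail the framework can accept, the sketch below models communication time as latency plus size over bandwidth, and compute time as operation count over core speed. This is a hypothetical example only: the function names and machine parameters are assumptions for illustration, not part of the μπ interface.

    /* Illustrative, user-defined timing model (hypothetical; not the μπ API). */
    #include <stddef.h>

    /* Assumed parameters of the virtual machine being modeled. */
    static const double LINK_LATENCY_S = 2.0e-6;  /* 2 microseconds per message */
    static const double LINK_BANDWIDTH = 2.0e9;   /* 2 GB/s per link            */
    static const double CORE_FLOP_RATE = 1.0e10;  /* 10 GFLOP/s per core        */

    /* Modeled time (seconds) to transfer one message of `bytes` bytes. */
    double model_comm_time(size_t bytes)
    {
        return LINK_LATENCY_S + (double)bytes / LINK_BANDWIDTH;
    }

    /* Modeled time (seconds) to execute `flops` floating-point operations. */
    double model_compute_time(double flops)
    {
        return flops / CORE_FLOP_RATE;
    }

A user could substitute richer models (e.g., a congestion-aware network simulator) without touching the application code, which is the trade-off the framework leaves to the user.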
Slide 5: Compiling an MPI application with μπ
Modify the #include and recompile
– Change #include <mpi.h> to the corresponding μπ header
Relink to the μπ library
– Instead of -lmpi, use -lmupi
(See the build sketch below.)
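A minimal sketch of the two changes, assuming the μπ header is named <mupi.h> to match the -lmupi link flag; the header name and the cc compiler wrapper are assumptions, since the slide itself specifies only the link flag.

    /* hello.c: ordinary MPI code; only the header line differs for a μπ build. */
    /* #include <mpi.h> */      /* original MPI build                    */
    #include <mupi.h>           /* assumed μπ header name (hypothetical) */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("virtual rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

    /* Build and link against μπ instead of MPI (compiler wrapper assumed):
     *   cc hello.c -o test -lmupi     # instead of: cc hello.c -o test -lmpi
     */

The rest of the source is untouched, which is what makes the substitution transparent to the application.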
Slide 6: Executing an MPI application over μπ
Run the modified MPI application (a μπ simulation)
– mpirun -np 4 test -nvp 32
– Runs test with 32 virtual MPI ranks
– The simulation uses 4 real cores
– μπ itself uses multiple real cores to run the simulation in parallel
Slide 7: Interface Support
Existing, sufficient
– MPI_Init(), MPI_Finalize()
– MPI_Comm_rank(), MPI_Comm_size()
– MPI_Barrier()
– MPI_Send(), MPI_Recv()
– MPI_Isend(), MPI_Irecv()
– MPI_Waitall()
– MPI_Wtime()
– MPI_COMM_WORLD
Planned, optional
– Other wait variants
– Other send/recv variants
– Other collectives
– Group communication
Other, performance-oriented
– MPI_Elapse_time(dt): added for simulation speed; avoids actual computation, instead simply elapses time
(A usage sketch of MPI_Elapse_time() follows below.)
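A usage sketch for MPI_Elapse_time(), based only on the description above (it skips real computation and simply advances time). The header name <mupi.h>, the interpretation of dt as seconds, and the per-element cost constant are assumptions for illustration.

    #include <mupi.h>   /* assumed μπ header name */

    /* Hypothetical cost model for the kernel being skipped (seconds). */
    static double modeled_kernel_time(int n)
    {
        return (double)n * 1.0e-9;   /* assume roughly 1 ns of work per element */
    }

    void step(double *data, int n, int skip_real_work)
    {
        if (skip_real_work) {
            /* Do not execute the kernel; charge its modeled cost as elapsed
             * simulated time instead (dt assumed to be in seconds). */
            MPI_Elapse_time(modeled_kernel_time(n));
        } else {
            for (int i = 0; i < n; i++)      /* the real compute kernel */
                data[i] = 0.5 * data[i] + 1.0;
        }
    }

Replacing heavy kernels this way keeps the simulation fast while preserving the MPI communication behavior being studied.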
Slide 8: Performance Study
Benchmarks
– Zero lookahead
– 10 μs lookahead
Platform
– Cray XT5, 226K cores
Scaling results
– Event cost
– Synchronization overhead
– Multiplexing gain
Slide 9: Experimentation Platform: Jaguar*
* Data and images from http://nccs.gov
Slide 10: Event Cost
Slide 11: Synchronization Speed
Slide 12: Multiplexing Gain
Slide 13: μπ Summary - Quantitative
Unprecedented scalability
– 27,648,000 virtual MPI ranks on 216,000 actual cores
Optimal multiplex factor seen
– 64 virtual ranks per real rank
Low slowdown even in zero-lookahead scenarios
– Even on fast virtual networks
Slide 14: μπ Summary - Qualitative
The only available simulator for highly scaled MPI runs
– Suitable for source-available, trace-driven, or modeled applications
Configurable hardware timing
– User-specified latencies, bandwidths, arbitrary inter-network models
Executions are repeatable and deterministic
– Global time-stamped ordering
– Deterministic timing model
– Purely discrete event simulation
Most suitable for applications whose MPI communication can be trapped, instrumented, or modeled
– Trapped: on-line, live actual execution
– Instrumented: off-line trace generation, trace-driven on-line execution
– Modeled: model-driven computation and MPI communication patterns
Nearly zero perturbation with unlimited instrumentation
Slide 15: Ongoing Work
NAS benchmarks
– E.g., FFT
Actual at-scale application
– E.g., chemistry
Optimized implementation of certain MPI primitives
– E.g., MPI_Barrier(), MPI_Reduce()
Tie to other important phenomena
– E.g., energy consumption models
Slide 16: Thank you! Questions?
Discrete Computing Systems
www.ornl.gov/~2ip