PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu,


1 PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi Kyusyu University, ISIT, IST

2 Background “Peta” is tremendous! –Compared with “Giga or Tera” scale machines How are you Mr. Tera? I am fine! How about you, Mr. Peta?

3 Background
“Peta” is tremendous!
– Compared with “Giga” or “Tera” scale machines
To develop a “Peta-scale” supercomputer, you need to:
– Explore the design space of both the computation nodes and the interconnection network
– Verify the effective performance that can be achieved
So we need a performance evaluation environment for peta-scale supercomputers!

4 Our Goal!
Problem:
– Simulations are three orders of magnitude slower than real machines!
– “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e., the machines available today)!
– How can we bridge the gap?
Develop an efficient performance evaluation environment: PSI-SIM
– Separate compute-node simulation from network simulation!
– Abstract the target application program to accelerate simulation!

5 Performance-Evaluation Flow of PSI-SIM
[Flow diagram]
Step 1: Generate a skeleton code (BSIM-Parser: Parallelized Application, e.g. Peta-scale → Skeleton Code)
Step 2: Execute on an existing machine (BSIM-Logger: Skeleton Code → Comm. Profile w/o Latency)
Step 3: Simulate the interconnection network (NSIM: Comm. Profile w/o Latency + Interconnect Configuration → Comm. Profile w/ Latency)
Step 4: Visualize and analyze the results (ANA: Visualization, Hints for Optimization)
Other diagram labels: Target machine, Performance Info., DB for Processors, Interconnect Arch.

7 What is the Skeleton Code?
Original code:
    foo( ) {
        Inst. Block A
        for (i=0; i<n; i++) {
            Inst. Block B
            if (hoge) {
                Inst. Block C
            } else {
                Inst. Block D
            }
            Inst. Block E
        }
        MPI_Comm.
        Inst. Block F
        for (j=0; j<n; j++)
            for (k=0; k<n; k++)
                Func( );
    }
Skeleton code:
    foo( ) {
        BSIM_ADD_TIME(10ms)
        MPI_Comm.
        BSIM_ADD_TIME(1ms)
        BSIM_ADD_TIME(15s)
    }
– Computation blocks are replaced by “estimated” execution times!
– Other modifications (e.g. reducing the required memory size)
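A minimal sketch of how a virtual timer could stand behind BSIM_ADD_TIME. The slides do not show BSIM's implementation, so `vclock_t`, `vclock_add_us`, and `skeleton_foo` are hypothetical names, and the mapping of the three times to code blocks is inferred from their order on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process virtual clock, in microseconds. */
typedef struct {
    uint64_t now_us;
} vclock_t;

/* Advance the virtual clock instead of executing a computation block. */
static void vclock_add_us(vclock_t *clk, uint64_t us)
{
    assert(clk != NULL);
    clk->now_us += us;
}

/* Skeleton of foo() from the slide: each computation region becomes an
 * estimated time. Returns the virtual time at the end of foo(). */
static uint64_t skeleton_foo(vclock_t *clk)
{
    vclock_add_us(clk, 10u * 1000u);          /* 10 ms (assumed: blocks A-E) */
    /* The MPI communication call stays here in the real skeleton. */
    vclock_add_us(clk, 1u * 1000u);           /* 1 ms (assumed: block F) */
    vclock_add_us(clk, 15u * 1000u * 1000u);  /* 15 s (assumed: nested loops) */
    return clk->now_us;
}
```

Starting from a zeroed clock, one call to `skeleton_foo` advances the virtual time by 15.011 s without running any of the original computation.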

9 Generating Communication Profile
BSIM-Logger
– Executes the skeleton code on an existing machine
– Emulates the behavior of the target machine
– Generates a communication profile under the assumption of a ZERO-latency ideal network
Why Fast?
– Abstracted computation blocks are NOT executed (only virtual timers are updated)
– Real communications are masked, but accurate logs are still generated
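The logging idea above can be sketched as follows. This is an illustration under stated assumptions, not BSIM-Logger's actual code: the record layout, the fixed-size log, and the names `comm_event_t`, `comm_profile_t`, `profile_add_time`, and `profile_log_send` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 1024

/* One record of a hypothetical communication profile. */
typedef struct {
    int src;
    int dst;
    size_t bytes;
    uint64_t time_us;  /* virtual send time; with zero latency the
                          message also arrives at this time */
} comm_event_t;

typedef struct {
    comm_event_t events[MAX_EVENTS];
    size_t count;
    uint64_t now_us;   /* virtual clock of the logging process */
} comm_profile_t;

/* Account for a computation block that is skipped, not executed. */
static void profile_add_time(comm_profile_t *p, uint64_t us)
{
    assert(p != NULL);
    p->now_us += us;
}

/* Record a send without moving any payload ("masked" communication).
 * Returns 0 on success, -1 if the log is full. */
static int profile_log_send(comm_profile_t *p, int src, int dst, size_t bytes)
{
    assert(p != NULL);
    if (p->count >= MAX_EVENTS)
        return -1;
    comm_event_t *e = &p->events[p->count++];
    e->src = src;
    e->dst = dst;
    e->bytes = bytes;
    e->time_us = p->now_us;
    return 0;
}
```

Because only the metadata of each message is stored, the cost of logging does not depend on the message size, which is one plausible reading of why masking real communications makes the logger fast.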

10 How Fast? How Accurate?
[Charts: for ERI (Electron Repulsion Integral) and NAS Parallel FT, time for logging (s) and predicted execution time (s), each comparing the original and skeleton codes]

12 Fast, Flexible Interconnection Network Simulator
NSIM
– Takes the communication profile and a network configuration file as input
– Generates a communication profile with estimated interconnect latency
Why Fast? Why Flexible?
– Parallelized implementation
– Supports a number of parameters: topology, specifications of routers/switches, buffer size, and so on
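To make the "profile in, profile with latency out" step concrete, here is a deliberately simple latency model. It is not NSIM's model: it assumes a 2D mesh with dimension-order routing, a fixed per-hop delay, and a single link bandwidth, and it ignores contention and buffer sizes. All names (`net_config_t`, `mesh_hops`, `arrival_ns`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical network configuration for a 2D mesh. */
typedef struct {
    int width;              /* nodes per row */
    uint64_t hop_ns;        /* router + link delay per hop */
    uint64_t bytes_per_us;  /* link bandwidth (1000 bytes/us = 1 GB/s) */
} net_config_t;

/* Hop count under dimension-order routing (Manhattan distance). */
static int mesh_hops(const net_config_t *cfg, int src, int dst)
{
    assert(cfg != NULL && cfg->width > 0);
    int dx = abs(src % cfg->width - dst % cfg->width);
    int dy = abs(src / cfg->width - dst / cfg->width);
    return dx + dy;
}

/* Arrival time = zero-latency send time + estimated network latency. */
static uint64_t arrival_ns(const net_config_t *cfg, int src, int dst,
                           uint64_t bytes, uint64_t send_ns)
{
    assert(cfg != NULL && cfg->bytes_per_us > 0);
    uint64_t wire_ns = (uint64_t)mesh_hops(cfg, src, dst) * cfg->hop_ns;
    uint64_t serial_ns = bytes * 1000u / cfg->bytes_per_us;
    return send_ns + wire_ns + serial_ns;
}
```

Applying `arrival_ns` to every record of a zero-latency profile yields a profile with latency. Swapping in a different `mesh_hops` is where a topology parameter would enter in a sketch like this.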

13 Performance of BSIM + NSIM
Performance prediction for HPL execution @16nodes PC cluster
– <120s (problem size = 5,000) @8CPU
– About 9,000 MPI-Comm./s @8CPU
[Chart: Execution Time (s), Measured vs. Predicted; Error = 5.3%; not skeleton execution]
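The slide reports an error of 5.3%, and a backup slide reports an accuracy of 94.7%. A small sketch of how the two figures relate, assuming both are defined relative to the measured execution time (the slides do not state the definition); the function names are hypothetical.

```c
#include <assert.h>

/* Relative prediction error in percent, taken against the measured time.
 * Assumed definition; the slides only report the resulting number. */
static double prediction_error_pct(double measured_s, double predicted_s)
{
    assert(measured_s > 0.0);
    double diff = predicted_s - measured_s;
    if (diff < 0.0)
        diff = -diff;
    return diff / measured_s * 100.0;
}

/* Accuracy as the complement of the relative error. */
static double prediction_accuracy_pct(double measured_s, double predicted_s)
{
    return 100.0 - prediction_error_pct(measured_s, predicted_s);
}
```

For example, made-up times of 200.0 s measured and 189.4 s predicted give an error of about 5.3% and an accuracy of about 94.7%.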

15 ANA GroupWork Viewer
– Group Work: indicates load balance
– Performance Indicator: execution time after load-balance optimization
– Communication Indicator: amount of communications per second

16 Conclusions
PSI-SIM
– Performance evaluation environment for supercomputers
– BSIM + NSIM + ANA
Ongoing work: performance prediction for
– a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
– a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

17 Backup Slides

18 Peta-scale Performance Prediction
Assumptions
– HPL problem size: 3 million
– # of nodes: 4K (PSI-SIMD)
– BSIM: uses 32 CPUs (3GHz Xeon)
– NSIM: 10,000 MPI-Comm./s @8CPU
How long will it take?
– BSIM: about 300h (<2 weeks)
– NSIM: about ?? (still being estimated)

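19 Predicted Execution Time (FT)
Error: -11.6%; Error: -11.3%
Target machine?: rscc Used machine?: rscc

20 Communication Profiling Time (FT)
86% reduction; 19% reduction
Target machine?: rscc Used machine?: rscc

21 Predicted Execution Time (ERI)
Error: -0.2%; Error: 1.5%; Error: -0.6%
Target machine?: rscc Used machine?: rscc

22 Communication Profile Generation Time (ERI)
91% reduction; 96% reduction; 97% reduction
Target machine?: rscc Used machine?: rscc

23 Execution-Time Prediction Performance
Communication latency
Larger evaluation application ⇒ higher prediction accuracy
Prediction accuracy: 94.7%

24 Simulation Time (problem size fixed at 2000)
More processes in the evaluation application ⇒ higher parallel-processing efficiency
Portion due to recent results (speedup)
16 processes, 256 processes, 1,024 processes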

25 Performance of NSIM
Accuracy: 94.7%
7.92, 8.36, 8.04; 114s
Target machine?: PSI-hexa Used machine?: PSI-hexa

