PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyushu University, ISIT, IST

Background
“Peta” is tremendous!
– Compared with “Giga”- or “Tera”-scale machines
[Illustration: “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”]
To develop a “Peta-scale” supercomputer, we need to…
– Explore the design space of both the computation nodes and the interconnection network!
– Verify the effective performance that can be achieved!
So, we need a performance evaluation environment for peta-scale supercomputers!

Our Goal!
Problem…
– Simulations are three orders of magnitude slower than real machines!
– “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. the machines available today)!
– How can we bridge the gap?
Develop an efficient performance evaluation environment: PSI-SIM
– Separate compute-node simulation from network simulation!
– Abstract the target application program to accelerate simulation!

Performance-Evaluation Flow of PSI-SIM
Step 1: Generate a skeleton code
Step 2: Execute on an existing machine
Step 3: Simulate the interconnection network
Step 4: Visualize and analyze the results
[Figure: flow diagram. Labels: Parallelized Application (e.g. Peta-scale), BSIM-Parser, DB for Processors, Skeleton Code, BSIM-Logger, Comm. Profile (w/o Latency), Interconnect Arch., Interconnect Configuration, NSIM, Comm. Profile (w/ Latency), ANA, Performance Info., Visualization, Hints for Optimization, Target machine]

What is the Skeleton Code?
Original code:
    foo( ) {
        Inst. Block A
        for (i = 0; i < n; i++) {
            Inst. Block B
            if (hoge) { Inst. Block C }
            else { Inst. Block D }
            Inst. Block E
        }
        MPI_Comm.
        Inst. Block F
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                Func( );
    }
Skeleton code:
    foo( ) {
        BSIM_ADD_TIME(10ms)
        MPI_Comm.
        BSIM_ADD_TIME(1ms)
        BSIM_ADD_TIME(15s)
    }
– Computation blocks are replaced by “estimated” execution times!
– Other modifications (e.g. reducing required memory size)
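The replacement shown on this slide can be mimicked in a few lines. This is a minimal Python sketch, not BSIM's actual implementation: `VirtualClock`, `add_time`, and `skeleton_foo` are hypothetical names, and only the three durations (10 ms, 1 ms, 15 s) come from the slide. It illustrates why the skeleton is cheap to run: each computation block costs one addition to a virtual timer.

```python
class VirtualClock:
    """Per-process virtual timer (hypothetical stand-in for BSIM's timer)."""

    def __init__(self):
        self.now = 0.0  # virtual time in seconds

    def add_time(self, seconds):
        # Plays the role of BSIM_ADD_TIME: advance the clock, compute nothing.
        self.now += seconds


def skeleton_foo(clock, communicate):
    """Skeleton of foo(): estimated times replace the computation blocks."""
    clock.add_time(10e-3)  # 10 ms, the estimate placed before the communication
    communicate()          # the communication call is kept in the skeleton
    clock.add_time(1e-3)   # 1 ms, the first estimate after the communication
    clock.add_time(15.0)   # 15 s, the second estimate after the communication
    return clock.now


if __name__ == "__main__":
    total = skeleton_foo(VirtualClock(), communicate=lambda: None)
    print(f"virtual time for foo(): {total:.3f} s")
```

Passing the communication step in as a callable keeps the sketch independent of any MPI library; a real skeleton would call MPI directly.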

Generating Communication Profile
BSIM-Logger
– Executes the skeleton code on an existing machine
– Emulates the behavior of the target machine
– Generates a communication profile under the assumption of a ZERO-latency ideal network
Why Fast?
– Abstracted computation blocks are NOT executed (only the virtual timers are updated)
– Real communications are masked, but accurate logs are still generated
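To make the zero-latency assumption concrete, here is a small Python sketch of a logger that records messages against virtual clocks without moving any payload. It is a single-process model written for illustration only; `ZeroLatencyLogger`, `CommRecord`, and the record fields are hypothetical, and the slides do not describe BSIM-Logger's real log format.

```python
from dataclasses import dataclass


@dataclass
class CommRecord:
    src: int
    dst: int
    nbytes: int
    t_send: float  # sender's virtual time when the message is issued
    t_recv: float  # receiver's virtual time when the receive completes


class ZeroLatencyLogger:
    """Logs point-to-point messages under an ideal, zero-latency network."""

    def __init__(self, nranks):
        self.clock = [0.0] * nranks  # one virtual timer per rank
        self.records = []

    def add_time(self, rank, seconds):
        # Abstracted computation: only the virtual timer moves.
        self.clock[rank] += seconds

    def send_recv(self, src, dst, nbytes):
        # The payload is never transferred ("masked"); only the event is logged.
        t_send = self.clock[src]
        # With zero latency the message is available as soon as it is sent,
        # so the receiver only waits if it reached the receive too early.
        t_recv = max(self.clock[dst], t_send)
        self.clock[dst] = t_recv
        self.records.append(CommRecord(src, dst, nbytes, t_send, t_recv))


if __name__ == "__main__":
    log = ZeroLatencyLogger(nranks=2)
    log.add_time(0, 0.010)
    log.add_time(1, 0.002)
    log.send_recv(src=0, dst=1, nbytes=1024)
    print(log.records[0])
```

A profile of this kind carries no network delay at all, which is why a separate network simulation is needed afterwards.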

How Fast? How Accurate?
[Figures: for ERI (Electron Repulsion Integral) and NAS Parallel FT, two charts each: time for logging (s) and predicted execution time (s), comparing the original code with the skeleton code]

Fast, Flexible Interconnection Network Simulator
NSIM
– Takes the communication profile and a network configuration file as input
– Generates a communication profile with estimated interconnect latency
Why Fast? Why Flexible?
– Parallelized implementation
– Supports a number of parameters: topology, specification of routers/switches, buffer size, and so on
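The step from a zero-latency profile to a profile with latency can be sketched as follows. This is a deliberately simple Python model and not NSIM's algorithm: the ring topology, the configuration keys (`nnodes`, `router_delay_s`, `bandwidth_Bps`), and the record fields are all assumptions made for illustration, and the sketch ignores contention, buffering, and the knock-on delay of later events.

```python
def ring_hops(src, dst, nnodes):
    """Shortest hop count between two nodes on a bidirectional ring."""
    d = abs(src - dst) % nnodes
    return min(d, nnodes - d)


def message_latency(src, dst, nbytes, cfg):
    """Per-message latency: per-hop router delay plus serialization time."""
    hops = ring_hops(src, dst, cfg["nnodes"])
    return hops * cfg["router_delay_s"] + nbytes / cfg["bandwidth_Bps"]


def add_latency(profile, cfg):
    """Return a copy of the profile with an estimated arrival time per message."""
    out = []
    for rec in profile:
        lat = message_latency(rec["src"], rec["dst"], rec["nbytes"], cfg)
        out.append({**rec, "latency": lat, "t_arrive": rec["t_send"] + lat})
    return out


if __name__ == "__main__":
    # Hypothetical network configuration and one logged message.
    cfg = {"nnodes": 8, "router_delay_s": 1e-6, "bandwidth_Bps": 1e9}
    profile = [{"src": 0, "dst": 4, "nbytes": 1000, "t_send": 0.5}]
    for rec in add_latency(profile, cfg):
        print(rec)
```

Keeping the topology in one function (`ring_hops`) mirrors the idea of a configurable simulator: swapping in another hop-count function changes the modelled network without touching the rest.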

Performance of BSIM + NSIM
Performance prediction for HPL execution on a 16-node PC cluster
– < 120 s (problem size = 5,000) @ 8 CPUs
– About 9,000 MPI-Comm./s @ 8 CPUs
– Not skeleton execution
[Chart: Execution Time (s), Measured vs. Predicted; Error = 5.3%]
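The slide reports only the resulting error (5.3%), not the measured and predicted times or the exact definition used. Assuming the usual relative error, with accuracy taken as its complement (which matches the 94.7% quoted in the backup slides), the calculation would look like the Python sketch below. The two execution times are placeholders chosen to reproduce 5.3%, not data from the slides.

```python
def relative_error(measured, predicted):
    """Relative prediction error with respect to the measured value."""
    return abs(predicted - measured) / measured


def accuracy(measured, predicted):
    """Accuracy taken as 1 - relative error (an assumed definition)."""
    return 1.0 - relative_error(measured, predicted)


if __name__ == "__main__":
    measured_s, predicted_s = 100.0, 94.7  # hypothetical values
    print(f"error    = {relative_error(measured_s, predicted_s):.1%}")
    print(f"accuracy = {accuracy(measured_s, predicted_s):.1%}")
```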

ANA
GroupWork Viewer
– Group Work: indicates load balance
– Performance Indicator: execution time after load-balance optimization
– Communication Indicator: amount of communication per second

Conclusions
PSI-SIM
– Performance evaluation environment for supercomputers
– BSIM + NSIM + ANA
Ongoing work: performance prediction for
– a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
– a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

Backup Slides

Peta-scale Performance Prediction
Assumptions
– HPL problem size: 3 million
– Number of nodes: 4K (PSI-SIMD)
– BSIM: uses 32 CPUs (3 GHz Xeon)
– NSIM: 10,000 MPI-Comm./s @ 8 CPUs
How long will it take?
– BSIM: about 300 h (< 2 weeks)
– NSIM: about ?? (still being estimated…)
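The time budget on this slide is simple arithmetic, sketched below in Python. The 300 h figure and the 10,000 MPI-Comm./s rate come from the slide; the helper names are made up, and the message count passed to `nsim_hours` in the test of the helper is a placeholder, because the slide leaves the NSIM estimate open.

```python
def hours_to_days(hours):
    return hours / 24.0


def nsim_hours(n_messages, messages_per_second):
    """Hours NSIM would need to process a given number of MPI communications."""
    return n_messages / messages_per_second / 3600.0


if __name__ == "__main__":
    bsim_days = hours_to_days(300)
    print(f"BSIM: {bsim_days} days")  # prints "BSIM: 12.5 days"
    print("within two weeks:", bsim_days < 14)
```

The NSIM figure depends on how many MPI communications the target HPL run generates, which the slides do not give, so the sketch provides only the conversion and no estimate.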

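Predicted execution time (FT): error -11.6%, error -11.3%. Target machine?: rscc. Used machine?: rscc

Communication-profile time (FT): 86% reduction, 19% reduction. Target machine?: rscc. Used machine?: rscc

Predicted execution time (ERI): error -0.2%, error 1.5%, error -0.6%. Target machine?: rscc. Used machine?: rscc

Communication-profile generation time (ERI): 91% reduction, 96% reduction, 97% reduction. Target machine?: rscc. Used machine?: rscc

Prediction performance for execution time (communication latency). As the scale of the evaluated application increases ⇒ prediction accuracy improves. Prediction accuracy: 94.7%

Simulation time (problem size fixed at 2000). As the number of processes of the evaluated application increases ⇒ parallel processing efficiency improves. Recent results (speed-up). [Chart: time in minutes for 16 processes, 256 processes, and 1,024 processes]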

Performance of NSIM
Accuracy: 94.7%
[Chart values: 7.92, 8.36, 8.04114s]
Target machine?: PSI-hexa. Used machine?: PSI-hexa
