Published by Warren Sanders. Modified over 9 years ago.
1
PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyusyu University, ISIT, IST
2
Background
“Peta” is tremendous!
–Compared with “Giga or Tera” scale machines
“How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”
3
Background
“Peta” is tremendous!
–Compared with “Giga or Tera” scale machines
To develop a “Peta-Scale” supercomputer, we need to:
–Explore the design space of both the computation nodes and the interconnection network!
–Verify the effective performance that can be achieved!
So, we need a performance evaluation environment for peta-scale supercomputers!
4
Our Goal!
Problem:
–Simulations are three orders of magnitude slower than real machines!
–“Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. currently available machines)!
–How can we bridge the gap?
Develop an efficient performance evaluation environment: PSI-SIM
–Separate compute-node simulation from network simulation!
–Abstract the target application program to speed up simulation!
5
Performance-Evaluation Flow of PSI-SIM
[Flow diagram. Components: Parallelized Application (e.g. Peta-scale), BSIM-Parser, Skeleton Code, BSIM-Logger, DB for Processors, Comm. Profile (w/o Latency), Interconnect Configuration, NSIM, Comm. Profile (w/ Latency), ANA, Performance Info., Interconnect Arch., Visualization, Hints for Optimization, Target machine]
Step 1: Generate a skeleton code
Step 2: Execute on an existing machine
Step 3: Simulate the interconnection network
Step 4: Visualize and analyze the results
7
What is the Skeleton Code?
Original code
    foo( ) {
        Inst. Block A
        for (i=0; i<n; i++) {
            Inst. Block B
            if (hoge) {
                Inst. Block C
            } else {
                Inst. Block D
            }
            Inst. Block E
        }
        MPI_Comm.
        Inst. Block F
        for (j=0; j<n; j++)
            for (k=0; k<n; k++)
                Func( );
    }
Skeleton code
    foo( ) {
        BSIM_ADD_TIME(10ms)
        MPI_Comm.
        BSIM_ADD_TIME(1ms)
        BSIM_ADD_TIME(15s)
    }
Computation blocks are replaced by “Estimated” execution times!
Other modifications (e.g. reducing required memory size)
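The idea of replacing computation with timer updates can be sketched in C as follows. This is only an illustration under stated assumptions: `add_time`, `communicate`, and `virtual_time_s` are hypothetical names standing in for BSIM_ADD_TIME, the retained MPI call, and the virtual timer, and the assignment of each estimated time to a particular source region is a guess, since the slide does not spell it out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical virtual timer, in seconds of simulated time. */
static double virtual_time_s = 0.0;

/* Hypothetical analogue of BSIM_ADD_TIME: advance the virtual clock
 * instead of executing the computation block. */
void add_time(double seconds) {
    assert(seconds >= 0.0);
    virtual_time_s += seconds;
}

/* Placeholder for the communication call that the skeleton keeps. */
void communicate(void) {
    printf("communication issued at virtual t = %.3f s\n", virtual_time_s);
}

/* Skeleton version of foo(): computation regions become timer updates.
 * Which estimate belongs to which region is assumed here. */
double skeleton_foo(void) {
    add_time(0.010);  /* assumed: Block A and the i-loop, 10 ms */
    communicate();    /* the communication is kept */
    add_time(0.001);  /* assumed: Block F, 1 ms */
    add_time(15.0);   /* assumed: nested j/k loops calling Func(), 15 s */
    return virtual_time_s;
}
```

Calling `skeleton_foo()` returns about 15.011 s of virtual time while doing almost no real work, which is the source of the speedup claimed for skeleton execution.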
9
Generating Communication Profile
BSIM-Logger
–Executes the skeleton code on an existing machine
–Emulates the behavior of the target machine
–Generates a communication profile under the assumption of a ZERO-latency ideal network
Why Fast?
–Abstracted computation blocks are NOT executed (only the virtual timers are updated)
–Real communications are masked, but accurate logs are still generated
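A minimal sketch of the logging idea described above, assuming a much simpler design than the real BSIM-Logger: each send is recorded with its virtual timestamp instead of being transmitted, and under the zero-latency assumption the arrival time equals the send time. The `Logger` and `CommEvent` types and all function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 1024

/* One record of the (hypothetical) communication profile. */
typedef struct {
    int src, dst;     /* sender and receiver ranks */
    size_t bytes;     /* message size */
    double t_send;    /* virtual time at which the send was issued */
    double t_arrive;  /* equals t_send under the zero-latency assumption */
} CommEvent;

typedef struct {
    double clock_s;               /* virtual timer */
    CommEvent events[MAX_EVENTS]; /* the profile being generated */
    size_t count;
} Logger;

void logger_init(Logger *lg) {
    lg->clock_s = 0.0;
    lg->count = 0;
}

/* Computation is not executed; only the virtual timer moves. */
void logger_add_time(Logger *lg, double seconds) {
    assert(seconds >= 0.0);
    lg->clock_s += seconds;
}

/* The real transfer is masked: nothing is sent, the event is only logged.
 * Returns 0 on success, -1 if the profile buffer is full. */
int logger_send(Logger *lg, int src, int dst, size_t bytes) {
    if (lg->count >= MAX_EVENTS) return -1;
    CommEvent *ev = &lg->events[lg->count++];
    ev->src = src;
    ev->dst = dst;
    ev->bytes = bytes;
    ev->t_send = lg->clock_s;
    ev->t_arrive = lg->clock_s;  /* ideal network: zero latency */
    return 0;
}
```

The resulting event list plays the role of the latency-free communication profile that a network simulator can refine later.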
10
How Fast? How Accurate?
[Charts for two benchmarks, ERI (Electron Repulsion Integral) and NAS PARALLEL FT. Each compares the original code with the skeleton code on two measures: time for logging (s) and predicted execution time (s).]
12
Fast, Flexible Interconnection Network Simulator
NSIM
–Takes the communication profile and a network configuration file as input
–Generates a communication profile with estimated interconnect latency
Why Fast? Why Flexible?
–Parallelized implementation
–Supports a number of parameters: topology, router/switch specifications, buffer size, and so on
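To make the input/output relationship above concrete, the sketch below annotates a latency-free profile with estimated latencies. It is not NSIM's model: it assumes a ring topology, a fixed per-hop latency, and a simple bandwidth term, and it ignores contention, buffering, and the knock-on delay of later events. `NetConfig`, `CommEvent`, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical network configuration (stand-in for a configuration file). */
typedef struct {
    int nodes;             /* number of nodes on the ring */
    double hop_latency_s;  /* latency added per hop */
    double bandwidth_Bps;  /* link bandwidth in bytes per second */
} NetConfig;

/* One profile record; t_arrive initially equals t_send (zero latency). */
typedef struct {
    int src, dst;
    size_t bytes;
    double t_send;
    double t_arrive;
} CommEvent;

/* Shortest hop count between two nodes on a bidirectional ring. */
int ring_hops(int src, int dst, int nodes) {
    int d = abs(src - dst);
    return d < nodes - d ? d : nodes - d;
}

/* Assumed latency model: hops * per-hop latency + size / bandwidth. */
double message_latency(const NetConfig *cfg, int src, int dst, size_t bytes) {
    assert(cfg->nodes > 0 && cfg->bandwidth_Bps > 0.0);
    return ring_hops(src, dst, cfg->nodes) * cfg->hop_latency_s
         + (double)bytes / cfg->bandwidth_Bps;
}

/* Turn a latency-free profile into one with estimated arrival times. */
void apply_latency(const NetConfig *cfg, CommEvent *events, size_t n) {
    for (size_t i = 0; i < n; i++) {
        events[i].t_arrive = events[i].t_send
            + message_latency(cfg, events[i].src, events[i].dst, events[i].bytes);
    }
}
```

Changing the fields of `NetConfig` corresponds loosely to exploring the interconnect design space without rerunning the application.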
13
Performance of BSIM + NSIM
Performance prediction for HPL execution on a 16-node PC cluster
–<120s (problem size = 5,000) @8CPU
–About 9,000 MPI-Comm./s @8CPU
[Chart: Execution Time (s), Measured vs. Predicted. Error = 5.3%. Not skeleton execution.]
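The slides do not define how the error is computed. A common convention, assumed here, is the signed relative error of the predicted time against the measured time, with accuracy taken as 100% minus its magnitude; under that convention a 5.3% error corresponds to the 94.7% accuracy quoted in the backup slides. The function names are hypothetical.

```c
#include <assert.h>

/* Signed relative error in percent: positive means over-prediction.
 * Assumed definition; the slides do not state the formula. */
double error_pct(double measured_s, double predicted_s) {
    assert(measured_s > 0.0);
    return (predicted_s - measured_s) / measured_s * 100.0;
}

/* Accuracy in percent, assumed to be 100 minus the error magnitude. */
double accuracy_pct(double measured_s, double predicted_s) {
    double e = error_pct(measured_s, predicted_s);
    return 100.0 - (e < 0.0 ? -e : e);
}
```

For example, with purely illustrative times of 100 s measured and 105.3 s predicted, `error_pct` gives about 5.3 and `accuracy_pct` about 94.7.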
15
ANA GroupWork Viewer
–Group Work: indicates load balance
–Performance Indicator: execution time after load-balance optimization
–Communication Indicator: amount of communications per second
16
Conclusions
PSI-SIM
–Performance evaluation environment for supercomputers
–BSIM+NSIM+ANA
Ongoing Work: Performance Prediction for
–“Tera-Scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
–“Peta-Scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine
17
Backup Slides
18
Peta-scale Performance Prediction
Assumption
–HPL problem size: 3 million
–Number of nodes: 4K (PSI-SIMD)
–BSIM: uses 32 CPUs (3GHz Xeon)
–NSIM: 10,000 MPI-Comm./s @8CPU
How long will it take?
–BSIM: about 300h (<2 weeks)
–NSIM: about ?? (still being estimated)…
19
Predicted execution time (FT)
Error: -11.6%
Error: -11.3%
Target machine?: rscc Used machine?: rscc
20
Communication profile time (FT)
86% reduction
19% reduction
Target machine?: rscc Used machine?: rscc
21
Predicted execution time (ERI)
Error: -0.2%
Error: 1.5%
Error: -0.6%
Target machine?: rscc Used machine?: rscc
22
Communication profile generation time (ERI)
91% reduction
96% reduction
97% reduction
Target machine?: rscc Used machine?: rscc
23
Prediction performance for execution time
Communication latency
Larger evaluated application ⇒ higher prediction accuracy
Prediction accuracy: 94.7%
24
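Simulation time (problem size fixed at 2000)
More processes in the evaluated application ⇒ higher parallel processing efficiency
Portion gained from recent work (speedup)
16 processes, 256 processes, 1,024 processes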
25
Performance of NSIM
Accuracy: 94.7%
7.92, 8.36, 8.04
114s
Target machine?: PSI-hexa Used machine?: PSI-hexa