
1 Using parallel tools on the SDSC IBM DataStar: DataStar Overview, HPM, Perf, IPM, VAMPIR, TotalView

2 DataStar Overview
P655 (8-way, 16 GB) – 176 nodes
P655+ (8-way, 32 GB) – 96 nodes
P690 (32-way, 64 GB) – 2 nodes
P690 (32-way, 128 GB) – 4 nodes
P690 (32-way, 256 GB) – 2 nodes
Total – 280 nodes :: 2,432 processors

3 Batch/Interactive computing
Batch job queues:
–Job queue manager – LoadLeveler (tool from IBM)
–Job queue scheduler – Catalina (SDSC internal tool)
–Job queue monitoring – various tools (commands)
–Job accounting – job filter (SDSC internal Perl scripts)

4 DataStar Access
Three login nodes :: access modes (platforms, usage modes):
–dslogin.sdsc.edu :: Production runs (P690, 32-way, 64GB)
–dspoe.sdsc.edu :: Test/debug runs (P655, 8-way, 16GB)
–dsdirect.sdsc.edu :: Special needs (P690, 32-way, 256GB)
Note: the division into usage modes above is not strict.

6 Test/debug runs (usage from dspoe)
[dspoe.sdsc.edu :: P655, 8-way, 16GB]
Queue/Class name   Node type                   Max wall clock   Max num nodes
interactive        p655 nodes (8-CPU), 16 GB   2 hrs            3
express            p655 nodes (8-CPU), 16 GB   2 hrs            4
Access to two queues:
–P655 nodes [shared]
–P655 nodes [not shared]
Job queues have job filter + LoadLeveler only (very fast).
Special command-line submission (along with job script).

7 Production runs (usage from dslogin)
[dslogin.sdsc.edu :: P690, 32-way, 64GB]
Data transfer / source editing / compilation, etc.
Queue/Class name   Node type                           Max wall clock   Max num nodes
normal             p655 nodes (8-CPU), 16 GB & 32 GB   18 hrs           265
normal32           p690 nodes (32-CPU), 128 GB         18 hrs           5
Two queues:
–onto p655/p655+ nodes [not shared]
–onto p690 nodes [shared]
Job queues have job filter + LoadLeveler + Catalina (slow updates).

8 All special needs (usage from dsdirect)
[dsdirect.sdsc.edu :: P690, 32-way, 256GB]
All visualization needs
All post-run data analysis needs
Shared node (with 256 GB of memory)
Process accounting in place
Total (a.out) interactive usage
No job filter, no LoadLeveler, no Catalina

9 IBM Hardware Performance Monitor (hpm)

10 What is Performance?
Where is time spent, and how is it spent?
MIPS – Millions of Instructions Per Second
MFLOPS – Millions of Floating-Point Operations Per Second
Run time / CPU time

11 What is a Performance Monitor?
Provides detailed processor/system data.
Processor monitors:
–Typically a group of registers
–Special-purpose registers keep track of programmable events
–Non-intrusive counts result in "accurate" measurement of processor events
–Typical events counted: instructions, floating-point instructions, cache misses, etc.
System-level monitors:
–Can be hardware or software
–Intended to measure system activity
–Examples: a bus monitor measures memory traffic and can analyze cache-coherency issues in a multiprocessor system; a network monitor measures network traffic and can analyze web traffic internally and externally

12 Hardware Counter Motivations
To understand the execution behavior of application code.
Why not use software?
–Strength: simple, GUI interface
–Weakness: large overhead, intrusive, high-level abstraction hides detail
How about using a simulator?
–Strength: control, low-level, accurate
–Weakness: limits on code size, difficult to implement, time-consuming to run
When should we use hardware counters directly?
–When software tools and simulators are unavailable or insufficient
–Strength: non-intrusive, instruction-level analysis, moderate control, very accurate, low overhead
–Weakness: not typically reusable, requires OS kernel support

13 Ptools Project
PMAPI Project
–Common standard API for the industry
–Supported by IBM, Sun, SGI, Compaq, etc.
PAPI Project
–Standard application programming interface
–Portable, available through a module
–Can access hardware counter info (see the sketch below)
HPM Toolkit
–Easy to use
–Doesn't affect code performance
–Uses hardware counters
–Designed specifically for IBM SPs and POWER processors
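Since PAPI is the portable route to these counters, here is a minimal sketch of its classic high-level C interface (assuming the PAPI 3.x-era PAPI_start_counters/PAPI_stop_counters calls; which event names are actually available depends on the processor):

#include <stdio.h>
#include <papi.h>

int main(void) {
    /* Count total cycles and floating-point operations around a loop. */
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };
    long_long values[2];   /* PAPI's 64-bit counter type */
    double s = 0.0;
    int i;

    if (PAPI_start_counters(events, 2) != PAPI_OK) return 1;
    for (i = 1; i <= 1000000; i++)
        s += 1.0 / (double)i;          /* the code being measured */
    if (PAPI_stop_counters(values, 2) != PAPI_OK) return 1;

    printf("sum = %f, cycles = %lld, fp ops = %lld\n",
           s, (long long)values[0], (long long)values[1]);
    return 0;
}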

14 Problem Set
Should we collect all events all the time?
–Not necessary, and wasteful
What counts should be used?
–Gather only what you need: cycles, committed instructions, loads, stores, L1/L2 misses, L1/L2 stores, committed floating-point instructions, branches, branch misses, TLB misses, cache misses

15 IBM HPM Toolkit
High Performance Monitor: developed for performance measurement of applications running on IBM Power3 systems. It consists of:
–A utility (hpmcount)
–An instrumentation library (libhpm)
–A graphical user interface (hpmviz)
Requires the PMAPI kernel extensions to be loaded
Works on IBM 630 and 604e processors
Based on IBM's PMAPI, a low-level interface
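As a hedged sketch of what libhpm instrumentation looks like in C (the call names follow the HPM Toolkit documentation, but the header name and linking details are assumptions to check against the toolkit's manual):

#include <stdio.h>
#include <libhpm.h>   /* header name may differ by toolkit version */

int main(void) {
    double a[1000], b[1000], s = 0.0;
    int i;
    for (i = 0; i < 1000; i++) { a[i] = (double)i; b[i] = 2.0 * i; }

    hpmInit(0, "myprog");           /* task id 0 for a serial run */
    hpmStart(1, "dot product");     /* begin instrumented section 1 */
    for (i = 0; i < 1000; i++)
        s += a[i] * b[i];
    hpmStop(1);                     /* end instrumented section 1 */
    hpmTerminate(0);                /* flush counters, write the report */

    printf("s = %f\n", s);
    return 0;
}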

16 HPM Count
Utility for performance measurement of applications. Extra logic inserted into the processor counts specific events and is updated at every cycle. Provides a summary output at the end of the execution:
–Wall clock time
–Resource usage statistics
–Hardware performance counter information
–Derived hardware metrics
Works for serial and parallel codes; gives performance numbers for each task.
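Usage is analogous to the standard time command: the program is simply wrapped (a sketch; the serial form is the documented one, while the exact POE flags for a parallel run should follow the submission syntax shown on slide 30):
hpmcount ./a.out
poe hpmcount ./a.out -nodes 1 -tasks_per_node 4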

17 Timers
time usually reports three metrics:
User time
–The time used by your code on the CPU, also called CPU time
–Total time in user mode = cycles / processor frequency
System time
–The time your code spends running kernel code (doing I/O, writing to disk, printing to the screen, etc.)
–It is worth minimizing system time by speeding up disk I/O, doing I/O in parallel, or doing I/O in the background while the CPU computes in the foreground
Wall clock time
–Total execution time: the sum of user and system time, plus the time spent idle (waiting for resources)
–In parallel performance tuning, only wall clock time counts
–Interprocessor communication can consume a significant amount of execution time that user/system time usually do not account for, so rely on wall clock time for the total time consumed by the job
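For example, the standard time command reports exactly these three metrics (the numbers below are illustrative, and the output format varies slightly by shell):
time ./a.out
real   0m12.4s   <- wall clock time
user   0m11.9s   <- user time
sys    0m0.3s    <- system time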

18 Floating Point Measures
PM_FPU0_CMPL (FPU 0 instructions)
–The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. Each FPU can start a new instruction at every cycle. This counter shows the number of floating-point instructions executed by the first FPU.
PM_FPU1_CMPL (FPU 1 instructions)
–The number of floating-point instructions (add, multiply, subtract, divide, multiply & add) processed by the second FPU.
PM_EXEC_FMA (FMAs executed)
–The number of Floating-point Multiply & Add (FMA) instructions. An FMA computes x = s * a + b, so two floating-point operations are done within one instruction. The compiler generates this instruction as often as possible to speed up the program, but sometimes additional manual optimization is needed to replace single multiply instructions and corresponding add instructions with one FMA.
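As an illustration, the classic loop that compilers turn into FMA instructions is a DAXPY-style update; each iteration below is one multiply plus one add, i.e. one FMA doing two flops:
for (i = 0; i < n; i++)
    y[i] = s * x[i] + y[i];   /* one FMA per iteration: 2 flops in 1 instruction */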

19 Total Flop Rate
Floating-point instructions + FMA rate
–This is the most often quoted performance index, the MFlops rate.
–The peak performance of the POWER3-II processor is 1500 MFlops (375 MHz clock x 2 FPUs x 2 flops/FMA instruction).
–Many applications do not reach more than 10 percent of this peak performance.
Average number of loads per TLB miss
–This value is the ratio PM_LD_CMPL / PM_TLB_MISS. Each time a TLB miss has been processed, fast access to a new page of data is possible. Small values for this metric indicate poor data locality; redesigning the program's data structures may yield significant performance improvements.
Computation intensity
–The ratio of floating-point operations to load and store operations.
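A quick worked example of computation intensity, using the DAXPY-style loop from the previous slide: each iteration performs 2 floating-point operations (one FMA) against 2 loads (x[i], y[i]) and 1 store (y[i]), so the intensity is 2 / 3, about 0.67 flops per load/store. Higher values mean the code does more arithmetic per memory access and is less likely to be memory-bound.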

20 PERF

21 The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.

22 Trace Libraries
The IBM trace libraries are a set of libraries used for MPI performance instrumentation. They can measure the amount of time spent in each routine, which MPI functions were used, and how many bytes were sent.
To use a library:
–Compile your code with the -g flag
–Relink your object files; for example, for mpitrace: -L/usr/local/apps/mpitrace -lmpiprof
–Make sure your code exits through MPI_Finalize
It will produce mpi_profile.task_number output files.
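A hedged end-to-end example of those steps (mycode.c is a hypothetical source file; the POE flags follow the submission syntax on slide 30):
mpcc -g -c mycode.c
mpcc -o mycode mycode.o -L/usr/local/apps/mpitrace -lmpiprof
poe ./mycode -nodes 1 -tasks_per_node 4
After the run exits through MPI_Finalize, this leaves mpi_profile.0, mpi_profile.1, ... (one file per task).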

23 Perf
The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.
To use perf:
–Add /usr/local/apps/perf/perf to your path, OR alias it in your .cshrc file: alias perf '/usr/local/apps/perf/perf \!*'
–Then run it in the same directory as your output files: perf hpm_out > perf_summary

24 Example of perf_summary
Computation performance measured for all 4 cpus:
Execution wall clock time = 11.469 seconds
Total FPU arithmetic results = 5.381e+09 (31.2% of these were FMAs)
Aggregate flop rate = 0.619 Gflop/s
Average flop rate per cpu = 154.860 Mflop/s = 2.6% of 'peak'
Communication wall clock time for 4 cpus: max = 0.019 seconds, min = 0.000 seconds
Communication took 0.17% of total wall clock time.

25 IPM - Integrated Performance Monitoring

26 Integrated Performance Monitoring (IPM)
Integrated Performance Monitoring (IPM) is a tool that gives users a concise summary of the performance and communication characteristics of their codes. IPM is invoked by the user at the time a job is run. By default it provides a short, text-based summary of the code's performance, plus a more detailed Web page.
More details at: http://www.sdsc.edu/us/tools/top/ipm/
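A hedged sketch of what invoking IPM typically looks like (the library path, the -lipm link flag, and the IPM_REPORT variable are assumptions based on IPM's general documentation, not the DataStar pages; see the URL above for the site-specific recipe):
mpcc -o mycode mycode.c -L/usr/local/apps/ipm/lib -lipm
setenv IPM_REPORT full
poe ./mycode -nodes 1 -tasks_per_node 4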

27 VAMPIR – Visualization and Analysis of MPI Programs

28 VAMPIR
It is much harder to debug and tune parallel programs than sequential ones, and the reasons for performance problems, in particular, are notoriously hard to find. Suppose the performance is disappointing: initially, the programmer has no idea where, or for what, to look to identify the performance bottleneck.

29 VAMPIR converts the trace information into a variety of graphical views, e.g.:
–timeline displays showing state changes and communication,
–communication statistics indicating data volumes and transmission rates,
–and more.

30 Setting the Vampir path and variables:
setenv PAL_LICENSEFILE /usr/local/apps/vampir/etc/license.dat
set path = ($path /usr/local/apps/vampir/bin)
Compile:
mpcc -o parpi -L/usr/local/apps/vampirtrace/lib -lVT -lm -lld parpi.c
Run:
poe parpi -nodes 1 -tasks_per_node 4 -rmpool 1 -euilib us -euidevice sn_all
Calling Vampir:
vampir parpi.stf
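parpi is presumably a small parallel-pi test code; a minimal sketch of what such a program looks like in C (hypothetical, not the actual parpi.c source), useful because its compute loop and final reduction are exactly the kind of pattern Vampir's timeline and communication views display:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, i, n = 1000000;
    double h, x, local = 0.0, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Midpoint rule for the integral of 4/(1+x^2) over [0,1], strided by rank. */
    h = 1.0 / (double)n;
    for (i = rank; i < n; i += size) {
        x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi ~= %.10f\n", pi);
    MPI_Finalize();
    return 0;
}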

31 – 38 [Vampir screenshots only; these slides carry no text content]
39 TotalView

40 Discovering TotalView
The Etnus TotalView® debugger is a powerful, sophisticated, and programmable tool that allows you to debug, analyze, and tune the performance of complex serial, multiprocessor, and multithreaded programs. If you want to jump in and get started quickly, go to the website at http://www.etnus.com and select TotalView's "Getting Started" area (it's the blue oval link on the right near the bottom).
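A hedged example of starting it on DataStar-style jobs (the totalview-wrapping-poe form is the commonly documented one; the POE flags follow slide 30 and should be verified locally):
totalview ./mycode                                        (serial code)
totalview poe -a ./mycode -nodes 1 -tasks_per_node 4      (POE/MPI job)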

