1
Using parallel tools on the SDSC IBM DataStar
– DataStar Overview
– HPM
– Perf
– IPM
– VAMPIR
– TotalView
2
DataStar Overview
– P655 :: (8-way, 16 GB) 176 nodes
– P655+ :: (8-way, 32 GB) 96 nodes
– P690 :: (32-way, 64 GB) 2 nodes
– P690 :: (32-way, 128 GB) 4 nodes
– P690 :: (32-way, 256 GB) 2 nodes
Total – 280 nodes :: 2,432 processors.
3
Batch/Interactive computing
Batch job queues:
– Job queue manager – LoadLeveler (tool from IBM)
– Job queue scheduler – Catalina (SDSC internal tool)
– Job queue monitoring – various tools (commands; examples below)
– Job accounting – job filter (SDSC internal Perl scripts)
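Typical LoadLeveler commands for monitoring queues and jobs (standard LoadLeveler commands; the exact output depends on the site configuration):
llq                 # list queued and running jobs
llq -u $USER        # list only your own jobs
llclass             # show the available classes/queues and their limits
llcancel <job_id>   # cancel a queued or running job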
4
DataStar Access
Three login nodes :: access modes (platforms / usage modes):
– dslogin.sdsc.edu :: Production runs (P690, 32-way, 64GB)
– dspoe.sdsc.edu :: Test/debug runs (P655, 8-way, 16GB)
– dsdirect.sdsc.edu :: Special needs (P690, 32-way, 256GB)
Note: the division of usage modes above is not strict.
6
Test/debug runs (usage from dspoe)
[dspoe.sdsc.edu :: P655, 8-way, 16GB]
Access to two queues – p655 nodes [shared] and p655 nodes [not shared]:
– interactive: p655 nodes (8-CPU), 16 GB, max 2 hrs wall clock, max 3 nodes
– express: p655 nodes (8-CPU), 16 GB, max 2 hrs wall clock, max 4 nodes
Job queues have job filter + LoadLeveler only (very fast).
Special command line submission (along with a job script) – see the script sketch below.
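A minimal LoadLeveler job script sketch (the keywords are standard LoadLeveler; the class name, limits, and executable are illustrative – check the queue limits above):
# @ job_type = parallel
# @ class = express
# @ node = 1
# @ tasks_per_node = 8
# @ wall_clock_limit = 00:30:00
# @ output = job.$(jobid).out
# @ error = job.$(jobid).err
# @ queue
poe ./a.out
Submit it with: llsubmit job.cmd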
7
Production runs (usage from dslogin)
[dslogin.sdsc.edu :: P690, 32-way, 64GB]
Data transfer / source editing / compilation, etc.
Two queues:
– normal: p655/p655+ nodes (8-CPU), 16 GB & 32 GB, max 18 hrs wall clock, max 265 nodes [not shared]
– normal32: p690 nodes (32-CPU), 128 GB, max 18 hrs wall clock, max 5 nodes [shared]
Job queues have job filter + LoadLeveler + Catalina (slow updates).
8
All special needs (usage from dsdirect)
[dsdirect.sdsc.edu :: P690, 32-way, 256GB]
– All visualization needs
– All post-run data analysis needs
– Shared node (with 256 GB of memory)
– Process accounting in place
– Total (a.out) interactive usage
– No job filter, no LoadLeveler, no Catalina
9
IBM Hardware Performance Monitor (hpm)
10
What is Performance?
– Where is time spent, and how is it spent?
– MIPS – Millions of Instructions Per Second
– MFLOPS – Millions of Floating-Point Operations Per Second
– Run time / CPU time
11
What is a Performance Monitor?
Provides detailed processor/system data.
Processor monitors:
– Typically a group of registers
– Special-purpose registers keep track of programmable events
– Non-intrusive counts result in "accurate" measurement of processor events
– Typical events counted: instructions, floating-point instructions, cache misses, etc.
System-level monitors:
– Can be hardware or software
– Intended to measure system activity
– Examples: a bus monitor measures memory traffic and can analyze cache-coherency issues in a multiprocessor system; a network monitor measures network traffic and can analyze web traffic internally and externally
12
Hardware Counter Motivations
To understand the execution behavior of application code.
Why not use software?
– Strength: simple, GUI interface
– Weakness: large overhead, intrusive, higher level of abstraction
How about using a simulator?
– Strength: control, low-level, accurate
– Weakness: limit on size of code, difficult to implement, time-consuming to run
When should we use hardware counters directly?
– When software tools and simulators are not available or not sufficient
– Strength: non-intrusive, instruction-level analysis, moderate control, very accurate, low overhead
– Weakness: not typically reusable, requires OS kernel support
13
Ptools Project
PMAPI Project:
– Common standard API for the industry
– Supported by IBM, Sun, SGI, Compaq, etc.
PAPI Project:
– Standard application programming interface
– Portable, available through a module
– Can access hardware counter info
HPM Toolkit:
– Easy to use
– Doesn't affect code performance
– Uses hardware counters
– Designed specifically for IBM SPs and POWER processors
14
Problem Set
Should we collect all events all the time?
– Not necessary, and wasteful
What counts should be used?
– Gather only what you need: cycles, committed instructions, loads, stores, L1/L2 misses, L1/L2 stores, committed floating-point instructions, branches, branch misses, TLB misses, cache misses
15
IBM HPM Toolkit
High Performance Monitor
Developed for performance measurement of applications running on IBM Power3 systems. It consists of:
– A utility (hpmcount)
– An instrumentation library (libhpm)
– A graphical user interface (hpmviz)
Requires the PMAPI kernel extensions to be loaded.
Works on IBM 630 and 604e processors.
Based on IBM's PMAPI, a low-level interface.
16
HPM Count (hpmcount)
Utility for performance measurement of an application.
Extra logic inserted in the processor counts specific events; the counters are updated at every cycle.
Provides a summary output at the end of the execution:
– Wall clock time
– Resource usage statistics
– Hardware performance counter information
– Derived hardware metrics
Works for serial and parallel codes, and gives performance numbers for each task.
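Typical invocations (a sketch; the node and task counts are illustrative):
Serial:   hpmcount ./a.out
Parallel: poe hpmcount ./a.out -nodes 1 -tasks_per_node 8
The parallel form runs hpmcount once per MPI task, so each task gets its own counter summary.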
17
Timers
Time usually reports three metrics:
User time
– The time used by your code on the CPU, also called CPU time
– Total time in user mode = cycles / processor frequency
System time
– The time used by your code running kernel code (doing I/O, writing to disk, printing to the screen, etc.)
– It is worth minimizing the system time by speeding up disk I/O, doing I/O in parallel, or doing I/O in the background while your CPU computes in the foreground
Wall clock time
– Total execution time: user time plus system time plus the time spent idle (waiting for resources)
– In parallel performance tuning, only wall clock time counts
– Interprocessor communication can consume a significant amount of execution time (user/system time usually don't account for it), so rely on wall clock time for all of the time consumed by the job
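For example, running a program under the time command separates the three (a sketch; the output format varies by shell, and the numbers are illustrative):
$ time ./a.out
real    0m12.5s
user    0m11.0s
sys     0m0.3s
Here roughly 1.2 seconds (real minus user minus sys) were spent waiting rather than computing.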
18
Floating Point Measures
PM_FPU0_CMPL (FPU 0 instructions)
– The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. Each FPU can start a new instruction at every cycle. This counter shows the number of floating point instructions that have been executed by the first FPU.
PM_FPU1_CMPL (FPU 1 instructions)
– This counter shows the number of floating point instructions (add, multiply, subtract, divide, multiply & add) that have been processed by the second FPU.
PM_EXEC_FMA (FMAs executed)
– This is the number of Floating point Multiply & Add (FMA) instructions. An FMA performs a computation of the following type: x = s * a + b, so two floating point operations are done within one instruction. The compiler generates this instruction as often as possible to speed up the program, but sometimes additional manual optimization is necessary to replace single multiply instructions and the corresponding add instructions with one FMA.
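As an illustration (a hypothetical loop; whether FMAs are actually generated depends on the compiler and its options), a DAXPY-style update is a natural FMA candidate:
/* Each iteration is a multiply plus a dependent add; the compiler can
   fuse the pair into a single FMA instruction, counted by PM_EXEC_FMA. */
for (i = 0; i < n; i++)
    y[i] = s * x[i] + y[i];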
19
Total Flop Rate
Floating point instruction + FMA rate
– This is the most often quoted performance index, the MFlops rate.
– The peak performance of the POWER3-II processor is 1500 MFlops (375 MHz clock x 2 FPUs x 2 Flops/FMA instruction).
– Many applications do not reach more than 10 percent of this peak performance.
Average number of loads per TLB miss
– This value is the ratio PM_LD_CMPL / PM_TLB_MISS. Each time a TLB miss has been processed, fast access to a new page of data is possible. Small values for this metric indicate that the program has poor data locality; a redesign of the program's data structures may result in significant performance improvements.
Computation intensity
– Computation intensity is the ratio of floating point operations to load and store operations.
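A worked example with illustrative counter values: a run that executes 6.0e9 floating point operations while issuing 3.0e9 loads and 1.0e9 stores has a computation intensity of 6.0e9 / (3.0e9 + 1.0e9) = 1.5. Values well below 1 suggest the run is dominated by memory traffic rather than arithmetic.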
20
PERF
21
The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.
22
Trace Libraries
The IBM trace libraries are a set of libraries used for MPI performance instrumentation. They can measure the amount of time spent in each routine, which function was used, and how many bytes were sent.
To use a library:
– Compile your code with the -g flag
– Relink your object files; for example, for mpitrace: -L/usr/local/apps/mpitrace -lmpiprof
– Make sure your code exits through MPI_Finalize
It will produce mpi_profile.task_number output files (a full compile/run example follows below).
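Putting the pieces together (a sketch; the program name is illustrative, the flag and library path come from the list above):
Compile/link: mpcc -g -o myprog myprog.c -L/usr/local/apps/mpitrace -lmpiprof
Run:          poe ./myprog -nodes 1 -tasks_per_node 4 -rmpool 1
After the run, look for mpi_profile.0, mpi_profile.1, ... in the working directory.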
23
Perf
The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.
To use perf:
– Add /usr/local/apps/perf/perf to your path, OR
– Alias it in your .cshrc file: alias perf '/usr/local/apps/perf/perf \!*'
Then run it in the same directory as your output files: perf hpm_out > perf_summary
24
Example of perf_summary
Computation performance measured for all 4 cpus:
  Execution wall clock time = 11.469 seconds
  Total FPU arithmetic results = 5.381e+09 (31.2% of these were FMAs)
  Aggregate flop rate = 0.619 Gflop/s
  Average flop rate per cpu = 154.860 Mflop/s = 2.6% of 'peak'
Communication wall clock time for 4 cpus:
  max = 0.019 seconds
  min = 0.000 seconds
Communication took 0.17% of total wall clock time.
25
IPM - Integrated Performance Monitoring
26
Integrated Performance Monitoring (IPM)
Integrated Performance Monitoring (IPM) is a tool that allows users to obtain a concise summary of the performance and communication characteristics of their codes. IPM is invoked by the user at the time a job is run. By default a short, text-based summary of the code's performance is produced, along with a more detailed Web page.
More details at: http://www.sdsc.edu/us/tools/top/ipm/
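A hypothetical usage sketch (the library path below is illustrative and the exact procedure on DataStar may differ – see the SDSC IPM page above): relink your MPI program against the IPM library, for example
mpcc -o myprog myprog.o -L/usr/local/apps/ipm/lib -lipm
then submit the job as usual with poe. Setting IPM_REPORT=full in the job environment requests the most detailed text summary, written when the code reaches MPI_Finalize.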
27
VAMPIR – Visualization and Analysis of MPI Programs
28
VAMPIR
It is much harder to debug and tune parallel programs than sequential ones. The reasons for performance problems, in particular, are notoriously hard to find. Suppose the performance is disappointing: initially, the programmer has no idea where to look, or what to look for, to identify the performance bottleneck.
29
VAMPIR converts the trace information into a variety of graphical views, e.g.:
– timeline displays showing state changes and communication,
– communication statistics indicating data volumes and transmission rates,
– and more.
30
Setting the Vampir path and variables:
setenv PAL_LICENSEFILE /usr/local/apps/vampir/etc/license.dat
set path = ($path /usr/local/apps/vampir/bin)
Compile:
mpcc -o parpi -L/usr/local/apps/vampirtrace/lib -lVT -lm -lld parpi.c
Run:
poe parpi -nodes 1 -tasks_per_node 4 -rmpool 1 -euilib us -euidevice sn_all
Calling Vampir:
vampir parpi.stf
39
TotalView
40
Discovering TotalView
The Etnus TotalView® debugger is a powerful, sophisticated, and programmable tool that allows you to debug, analyze, and tune the performance of complex serial, multiprocessor, and multithreaded programs. If you want to jump in and get started quickly, go to the website at http://www.etnus.com and select TotalView's "Getting Started" area. (It's the blue oval link on the right near the bottom.)
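A common way to launch TotalView on a POE job (a sketch; the executable name and task counts are illustrative, and the exact procedure on DataStar may differ):
totalview poe -a ./parpi -nodes 1 -tasks_per_node 4 -rmpool 1
This starts the debugger on poe itself; TotalView then attaches to the MPI tasks as they are created.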