
1 Performance Analysis with Parallel Performance Wizard
   Prashanth Prakash, Research Assistant
   Dr. Vikas Aggarwal, Research Scientist
   Vrishali Hajare, Research Assistant
   Professor Alan D. George, Principal Investigator
   HCS Research Laboratory, University of Florida

2 Outline
   - Introduction talk (~20 minutes)
   - Hands-on
     - PPW basics
     - Performance data collection
     - Performance analysis
     - Automatic analysis
   Feel free to ask questions during the talk or hands-on.

3 Parallel Performance Analysis
   The need for performance analysis
   - High-performance computing has performance as an explicit, fundamental goal
   - "I just got my parallel program working, and... my program does NOT yield the expected performance. Why is this? How do I fix my program?"
   The challenge of performance analysis
   - Understanding the performance of sequential applications can already be challenging
   - The complexity of parallel computing makes it even harder to understand program performance without performance analysis tools

4 Performance Analysis Approaches
   Three general approaches to performance analysis
   - Analytical modeling
     - Mostly predictive methods; can also be used in conjunction with experimental performance measurement
     - Pros: easy to use, fast, can be performed without running the program
     - Cons: usually not very accurate
   - Simulation
     - Pros: allows performance estimation of a program on various system architectures
     - Cons: slow; not generally applicable for regular UPC/SHMEM users
   - Experimental performance measurement
     - The strategy used by most modern performance analysis tools; uses actual event measurements to perform analysis
     - Pros: most accurate
     - Cons: can be time-consuming (iterative tuning process)

5 Role of a Performance Analysis Tool
   [Diagram: Original Application -> Runtime Performance Data Gathering -> Data Processing and Analysis -> Data and Result Presentation -> Optimized Application]

6 Performance Analysis Stages
   - Instrumentation: insertion of code to facilitate performance measurement
   - Measurement: collection of performance data at runtime
   - Analysis: examination and processing of performance data to find and potentially resolve bottlenecks
   - Presentation: display of analyzed data to the tool user
   - Optimization: modifying the application to remove performance problems

7 Instrumentation Techniques
   Runtime/compiler instrumentation
   - Provides the most detailed information about the user's program
   - Requires vendor cooperation (modifications to the compiler/runtime)
   Source instrumentation
   - Directly modifies the user's source code
   - Can provide much information, but may interfere with compiler optimizations
   Interposition ("wrapper libraries") (see the sketch after this slide)
   - No recompilation needed, only relinking
   - Only gets information about library calls; can be difficult to get source-level information
   - Relies on alternate function entry points or dynamic linker hacks
   Binary instrumentation
   - Most of the benefits of source instrumentation without the need for recompilation
   - Can be difficult to get source-level information
   - Highly platform-specific; existing toolkits lack support for some platforms (e.g., Cray)
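
As a rough illustration of the interposition approach (this is not PPW's actual wrapper layer), the sketch below intercepts a SHMEM call, times it, and forwards it to the real implementation. It would be built as a shared library and preloaded (e.g., via LD_PRELOAD); everything except the standard shmem_getmem signature is illustrative.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

/* Wrapper for the standard SHMEM routine
 *   void shmem_getmem(void *dst, const void *src, size_t nbytes, int pe);
 * The dynamic linker resolves application calls to this definition first. */
void shmem_getmem(void *dst, const void *src, size_t nbytes, int pe)
{
    /* Locate the "real" routine further down the link chain (done once). */
    static void (*real_getmem)(void *, const void *, size_t, int);
    if (!real_getmem)
        real_getmem = (void (*)(void *, const void *, size_t, int))
                      dlsym(RTLD_NEXT, "shmem_getmem");

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    real_getmem(dst, src, nbytes, pe);      /* forward the original call */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    fprintf(stderr, "shmem_getmem: %zu bytes took %.1f us\n", nbytes, us);
}
```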

8 Measurement Techniques
   Profiling
   - Records statistical information about execution time and/or hardware counter values (PAPI)
   - Relates the information to basic blocks (functions, upc_forall loops) in the source code
   - Important concept: inclusive vs. exclusive time (total vs. self), illustrated after this slide
   Tracing
   - Records a full log of which events happen at runtime and how long they take
   - Gives very complete information about what happened at runtime
   - Requires much more storage space than profiling!
   Sampling
   - A special low-overhead mode of profiling that attributes performance information via indirect measurement (samples)
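
To make the inclusive/exclusive distinction concrete, here is a tiny made-up C example (the function names and timings are hypothetical, not taken from the tutorial):

```c
/* Toy example illustrating inclusive (total) vs. exclusive (self) time. */
#include <unistd.h>

static void exchange(void)     { usleep(300000); }   /* 0.3 s of "communication" */

static void solver_step(void)
{
    usleep(200000);            /* 0.2 s of local work: exclusive (self) time */
    exchange();                /* time spent here counts only toward
                                  solver_step's inclusive (total) time */
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        solver_step();
    /* A profiler would report solver_step with roughly 2 s exclusive (self)
     * time and 5 s inclusive (total) time; exchange gets about 3 s of each. */
    return 0;
}
```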

9 Parallel Performance Wizard (PPW)
   A performance analysis tool developed in the HCS Lab here at UF
   - Designed for partitioned global-address-space (PGAS) programming models (UPC and SHMEM)
   - Also supports MPI; support for other models is in the works
   Features
   - Uses the experimental measurement approach
   - Provides profiling and tracing support
   - Has numerous visualizations and advanced automated analysis
   Overarching design goals
   - Be user-friendly
   - Enhance productivity
   - Aim for portability

10 PPW Hands-on…

11 Hands-on
   Boot the liveDVD in a VM or directly on hardware
   Initial setup
   - Export the PATH variable to include recent releases of PPW and UPC:
     export PATH=/usr/local/packages/ppw-2.6.2/bin/:/usr/local/packages/bupc-2.12.1/bin/:$PATH
   - All applications we use today are in this directory:
     cd /home/livetau/workshop-point/UPC_apps
   - You can download these slides from the links below (the following slides have the necessary commands and will come in handy):
     http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.ppt
     http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.pdf

12 Programming in UPC (bupc)
   Compiling a UPC program
   - upcc hello.c -o hello
   Execution
   - upcrun -n 4 hello
   (A minimal hello.c is sketched after this slide.)
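
The hello.c used above is not shown in the slides; a minimal UPC version might look like this:

```c
/* hello.c - a plausible minimal UPC program (the tutorial's actual hello.c
 * is not shown). Compile with: upcc hello.c -o hello
 * Run with:                    upcrun -n 4 hello */
#include <upc_relaxed.h>
#include <stdio.h>

int main(void)
{
    printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
    upc_barrier;                 /* wait for all threads before exiting */
    return 0;
}
```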

13 Using PPW in a Nutshell
   Recompile the application (Instrumentation)
   - Use ppwupcc instead of upcc
   - ppwshmemcc (for SHMEM) and ppwmpicc (for MPI)
   Run the application (Measurement)
   - ppwrun
   View performance data (Analysis + Presentation)
   - ppw file.par
   Change the code (Optimization), recompile, repeat

14 PPW (for UPC) in a Nutshell
   Recompile the application (Instrumentation)
   - ppwupcc CAMEL_upc.c -o camel
   Run the application (Measurement)
   - ppwrun --output=file.par upcrun -n 4 camel abcd1234
   View performance data (Analysis + Presentation)
   - ppw file.par
   Change the code (Optimization), recompile, repeat
   Note: PPW should be configured with --with-upc, and Berkeley UPC should be compiled with --with-multiconf=+opt_inst

15 PPW Useful Options
   Tracking user function entry and exit
   - Pass --inst-functions to ppwupcc
   Communication matrix
   - Pass --comm-stats to ppwrun
   - Just open the .par file with ppw to find all the data: ppw file.par
   Source archive (.sar file)
   - Required during execution
   - Retain the .sar file in the same directory as the executable

16 NPB 2.4
   Compiling
   - cd NPB2.4/FT
   - make CLASS=X NP=N, where X can be S, A, B, or C (preferably use S or A)
   Execution is the same as before
   The UPC version of NPB 2.4 is developed and maintained by George Washington University (upc.gwu.edu)

17 Tracing
   Compilation is the same as before, using ppwupcc
   Pass the --trace option to request tracing
   - ppwrun --trace --output=a.par upcrun -n 4 ft.A.4
   Convert to SLOG-2 using ppw (File -> Export) or par2slog2
   Use Jumpshot to view the trace
   - jumpshot ft.slog2

18 Export: Convert to Other Popular Formats
   The .par file can be exported to several popular performance data formats, including
   - TAU profile
   - CUBE profile
   - OTF trace file (Vampir)
   - SLOG-2 (Jumpshot)

19 Case Study: Analyzing FT of NPB 2.4
   NPB 2.4 FT benchmark (CLASS=A, NP=4) executed on an InfiniBand cluster with 1 thread per node
   You can download the .par and .slog2 files at
   - http://hcs.ufl.edu/~prakash/pgas/ftA4.par
   - http://hcs.ufl.edu/~prakash/pgas/ftA4.slog2

20 Case Study: FT
   Identify the bottleneck
   - Sort by total time and look for bottlenecks: upc_memget at ft.c:1950
   - This cannot be confirmed by looking at the profile alone, so take a look at the trace
   Observe the trace output and the behavior of the code section from ft.c:1943 to ft.c:1953
   - Serialization of the upc_memget calls, which is unnecessary in this case (a sketch of the pattern follows this slide)
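
The actual FT source is not reproduced here, but the pattern behind the bottleneck is roughly the following (the data layout and names are invented for illustration): each blocking upc_memget must complete before the next one is issued, so the transfers from the different threads run one after another instead of overlapping.

```c
#include <upc_relaxed.h>

#define CHUNK 4096

/* Invented layout: one block of CHUNK doubles with affinity to each thread. */
shared [CHUNK] double src[THREADS][CHUNK];

void gather_all(double *dst /* CHUNK * THREADS doubles, private to this thread */)
{
    for (int t = 0; t < THREADS; t++) {
        /* Blocking get: this call returns only after the transfer completes,
         * so the THREADS transfers are serialized. */
        upc_memget(&dst[t * CHUNK], src[t], CHUNK * sizeof(double));
    }
}
```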

21 Case Study: FT
   How to fix it?
   - Use bupc_memget_async, the Berkeley UPC extension for asynchronous memget (see the sketch after this slide)
     http://upc.lbl.gov/publications/upc_memcpy.pdf
   Did it improve performance?
   - Download the .par file generated after the changes to ft.c:
     http://hcs.ufl.edu/~prakash/pgas/ftA4_m.par
   - Observe the changes in the profile data
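
A hedged sketch of the fix, using the same invented data layout as the previous sketch (this is not the actual change made to ft.c, and the header name is an assumption that may vary by Berkeley UPC release): issue all gets first so the transfers can overlap, then synchronize.

```c
#include <upc_relaxed.h>
#include <bupc_extensions.h>   /* assumed header for bupc_handle_t,
                                  bupc_memget_async, bupc_waitsync */

#define CHUNK 4096
shared [CHUNK] double src[THREADS][CHUNK];

void gather_all_async(double *dst /* CHUNK * THREADS doubles, private */)
{
    bupc_handle_t h[THREADS];

    /* Issue all gets up front so the transfers can proceed concurrently... */
    for (int t = 0; t < THREADS; t++)
        h[t] = bupc_memget_async(&dst[t * CHUNK], src[t], CHUNK * sizeof(double));

    /* ...then wait for each transfer to complete. */
    for (int t = 0; t < THREADS; t++)
        bupc_waitsync(h[t]);
}
```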

22 Automatic Analysis
   Why do we need automatic analysis?
   - The increasing size of performance data sets makes it hard to identify and resolve bottlenecks manually
   What will automatic analyses do?
   - Automatically detect, diagnose, and possibly resolve bottlenecks

23 Automatic Analysis
   Application analyses
   - Deal with a single run and include:
     - Bottleneck detection
     - Cause analysis
     - High-level analysis
   Experiment set analyses
   - Compare the performance of related runs:
     - Scalability analysis
     - Revision analysis

24 Application Analysis
   Bottleneck detection
   - Examines profile data and identifies the bottleneck profiling entries
   - Uses a baseline-comparison and deviation-evaluation method
   Cause analysis
   - Identifies the reason for bottlenecks; requires trace data to complete the analysis
   High-level analysis
   - Mainly used to detect bottleneck nodes that, when optimized, could improve the application's performance for a single experiment

25 Application Analysis
   Analysis -> Run Application Analysis

26 Experiment Set Analyses
   Scalability analysis
   - Plots the scaling factor (relative speedup) values against the ideal scaling value (one possible formula is sketched after this slide)
   - A scaling factor of 1.00 implies perfect scalability
   - Analysis -> Run Scalability Analysis
   Revision analysis
   - Compares and evaluates different versions of the same application
   - Profile Charts -> Total Times by Function
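
The slide does not give PPW's exact formula, so as an assumption: with a baseline run on p_0 threads taking time T(p_0), the relative speedup and a scaling factor that equals 1.00 for perfect scaling can be written as

```latex
S(p) = \frac{T(p_0)}{T(p)} \quad\text{(relative speedup)}, \qquad
\text{scaling factor} = \frac{S(p)}{p / p_0} = \frac{p_0\,T(p_0)}{p\,T(p)}
```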

27 For More Information on PPW
   Visit the PPW website: http://ppw.hcs.ufl.edu
   The website has
   - An overview of the tool
   - Links to a detailed online/printable user manual
   - Downloadable source code for the entire tool
   - Workstation GUI installers (Windows installer, Linux packages)
   - Publications covering PPW and related research

