CS 594:SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE ANALYSIS TOOLS: PART III Gabriel Marin Includes slides from John Mellor-Crummey.

Slides:



Advertisements
Similar presentations
Performance Analysis and Optimization through Run-time Simulation and Statistics Philip J. Mucci University Of Tennessee
Advertisements

Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,
1 Optimizing compilers Managing Cache Bercovici Sivan.
ECE 454 Computer Systems Programming Compiler and Optimization (I) Ding Yuan ECE Dept., University of Toronto
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Profiling your application with Intel VTune at NERSC
Intel® performance analyze tools Nikita Panov Idrisov Renat.
CS 594:SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE ANALYSIS TOOLS: PART I Gabriel Marin
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.
Data Analytics and Dynamic Languages Lee E. Edlefsen, Ph.D. VP of Engineering 1.
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
5 th Biennial Ptolemy Miniconference Berkeley, CA, May 9, 2003 MESCAL Application Modeling and Mapping: Warpath Andrew Mihal and the MESCAL team UC Berkeley.
Instrumentation and Measurement CSci 599 Class Presentation Shreyans Mehta.
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
ORIGINAL AUTHOR JAMES REINDERS, INTEL PRESENTED BY ADITYA AMBARDEKAR Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors.
Precision Going back to constant prop, in what cases would we lose precision?
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL Emmanuel OSERET Performance Analysis Team, University.
Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,
Tool Integration and Autotuning for SUPER Performance Optimization Allen D. Malony, Nick ChaimovUniversity of Oregon Mary HallUniversity of Utah Jeff HollingsworthUniversity.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Waleed Alkohlani 1, Jeanine Cook 2, Nafiul Siddique 1 1 New Mexico Sate University 2 Sandia National Laboratories Insight into Application Performance.
Multi-core Programming VTune Analyzer Basics. 2 Basics of VTune™ Performance Analyzer Topics What is the VTune™ Performance Analyzer? Performance tuning.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Software Performance Analysis Using CodeAnalyst for Windows Sherry Hurwitz SW Applications Manager SRD Advanced Micro Devices Lei.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Performance of mathematical software Agner Fog Technical University of Denmark
Super computers Parallel Processing By Lecturer: Aisha Dawood.
1 Optimizing compiler tools and building blocks project Alexander Drozdov, PhD Sergey Novikov, PhD.
Distributed Information Systems. Motivation ● To understand the problems that Web services try to solve it is helpful to understand how distributed information.
HPCToolkit Evaluation Report Hans Sherburne, Adam Leko UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red:
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
1 SciDAC High-End Computer System Performance: Science and Engineering Jack Dongarra Innovative Computing Laboratory University of Tennesseehttp://
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
Software Performance Monitoring Daniele Francesco Kruse July 2010.
Full and Para Virtualization
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010.
Single Node Optimization Computational Astrophysics.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
Memory-Aware Compilation Philip Sweany 10/20/2011.
Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.
Tuning Threaded Code with Intel® Parallel Amplifier.
CS 352H: Computer Systems Architecture
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
5.2 Eleven Advanced Optimizations of Cache Performance
Workshop in Nihzny Novgorod State University Activity Report
/ Computer Architecture and Design
Logistic Regression and Perceptron Prediction of Instruction Branches
Many-core Software Development Platforms
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Levels of Parallelism within a Single Processor
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
* From AMD 1996 Publication #18522 Revision E
/ Computer Architecture and Design
Multi-Core Programming Assignment
Levels of Parallelism within a Single Processor
Presentation transcript:

CS 594:SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE ANALYSIS TOOLS: PART III Gabriel Marin Includes slides from John Mellor-Crummey

OUTLINE  Part III HPCToolkit: Low overhead, full code profiling using hardware counters sampling MIAMI: Performance diagnosis based on machine-independent application modeling 2

3 CHALLENGES FOR COMPUTATIONAL SCIENTISTS Execution environments and applications are rapidly evolving Architecture rapidly changing multicore microprocessor designs, increasing scale of parallel systems, growing use of accelerators Applications adding additional scientific capabilities to existing applications, MPI everywhere to threaded implementations Steep increase in application development effort to attain performance, evolvability, and portability Application developers need to Assess weaknesses in algorithms and their implementations overhaul algorithms & data structures as needed Adapt to changes in emerging architectures Improve scalability of executions within and across nodes

4 PERFORMANCE ANALYSIS CHALLENGES Complex architectures are hard to use efficiently Multi-level parallelism: multi-core, ILP, SIMD instructions Multi-level memory hierarchy Result: gap between typical and peak performance is huge Complex applications present challenges For measurement and analysis For understanding and tuning Performance tools can play an important role as a guide

HPCToolkit DESIGN PRINCIPLES Employ binary-level measurement and analysis observe fully optimized, dynamically linked executions support multi-lingual codes with external binary-only libraries Use sampling-based measurement (avoid instrumentation) controllable overhead minimize systematic error and avoid blind spots enable data collection for large-scale parallelism Collect and correlate multiple derived performance metrics diagnosis typically requires more than one species of metric Associate metrics with both static and dynamic context loop nests, procedures, inlined code, calling context Support top-down performance analysis natural approach that minimizes burden on developers 5

6 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW

7 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW For dynamically-linked executables on stock Linux compile and link as you usually do: nothing special needed For statically-linked executables (e.g. for Blue Gene, Cray) add monitoring by using hpclink as prefix to your link line uses “ linker wrapping ” to catch “ control ” operations process and thread creation, finalization, signals,...

8 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW Measure execution unobtrusively launch optimized application binaries dynamically-linked applications: launch with hpcrun to measure statically-linked applications: measurement library added at link time control with environment variable settings collect statistical call path profiles of events of interest

9 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW Analyze binary with hpcstruct : recover program structure analyze machine code, line map, debugging information extract loop nesting & identify inlined procedures map transformed loops and procedures to source

10 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW Combine multiple profiles multiple threads; multiple processes; multiple executions Correlate metrics to static & dynamic program structure

11 app. source optimized binary optimized binary compile & link call stack profile profile execution [hpcrun] profile execution [hpcrun] binary analysis [hpcstruct] binary analysis [hpcstruct] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] interpret profile correlate w/ source [hpcprof/hpcprof-mpi] database presentation [hpcviewer/ hpctraceviewer] presentation [hpcviewer/ hpctraceviewer] program structure HPCToolkit WORKFLOW Presentation explore performance data from multiple perspectives rank order by metrics to focus on what ’ s important compute derived metrics to help gain insight e.g. scalability losses, waste, CPI, bandwidth graph thread-level metrics for contexts explore evolution of behavior over time

12 ANALYZING RESULTS WITH hpcviewer costs for inlined procedures loops function calls in full context source pane navigation panemetric pane view control metric display

13 PRINCIPAL VIEWS Calling context tree view - “ top-down ” (down the call chain) associate metrics with each dynamic calling context high-level, hierarchical view of distribution of costs example: quantify initialization, solve, post-processing Caller ’ s view - “ bottom-up ” (up the call chain) apportion a procedure ’ s metrics to its dynamic calling contexts understand costs of a procedure called in many places example: see where PGAS put traffic is originating Flat view - ignores the calling context of each sample point aggregate all metrics for a procedure, from any context attribute costs to loop nests and lines within a procedure example: assess the overall memory hierarchy performance within a critical procedure

14 HPCToolkit DOCUMENTATION Comprehensive user manual: Quick start guide essential overview that almost fits on one page Using HPCToolkit with statically linked programs a guide for using hpctoolkit on BG/P and Cray XT The hpcviewer user interface Effective strategies for analyzing program performance with HPCToolkit analyzing scalability, waste, multicore performance... HPCToolkit and MPI HPCToolkit Troubleshooting why don ’ t I have any source code in the viewer? Installation guide

15 USING HPCToolkit Add hpctoolkit’s bin directory to your path Download, build and usage instructions at Installed on ICL machines in “/iclscratch1/homes/hpctoolkit” Perhaps adjust your compiler flags for your application sadly, most compilers throw away the line map unless -g is on the command line. add -g flag after any optimization flags if using anything but the Cray compilers/ Cray compilers provide attribution to source without -g. Decide what hardware counters to monitor dynamically-linked executables (e.g., Linux) use hpcrun -L to learn about counters available for profiling use papi_avail you can sample any event listed as “profilable”

16 USING HPCToolkit Profile execution: hpcrun –e [-e …] [command-arguments] Produces one.hpcrun results file per thread Recover program structure hpcstruct Produces one.hpcstruct file containing the loop structure of the binary Interpret profile / correlate measurements with source code hpcprof [–S ] [-M thread] [–o ] Creates performance database Use hpcviewer to visualize the performance database Download hpcviewer for your platform from

17 HANDS-ON DEMO Recall the matrix-multiply example compiled with two different compilers from Part I of the class Performance questions What is causing performance to vary with matrix size? What factors are limiting performance for each binary? The more efficient version runs at < 50% of peak FLOPS void compute(int reps) { register int i, j, k, r; for (r=0 ; r<reps ; ++r) { for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { for (k = 0; k < N; k++) { C(i,j) += A(i,k) * B(k,j); }

HANDS-ON DEMO: MAT-MUL PERFORMANCE 18 Why the gap? Why the change? Why the difference?

19 HANDS-ON DEMO: USING HPCToolkit Recall performance inefficiencies from Part I Some native performance events for AMD K10 CPU_CLK_UNHALTED – CPU clock cycles / CPU time RETIRED_INSTRUCTIONS – # instructions retired RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS - # mispredicted branches DATA_CACHE_ACCESSES – # accesses to L1 DATA_CACHE_MISSES – L1 D-cache misses DATA_CACHE_REFILLS:ALL – L1 cache refills (L1 misses) DATA_CACHE_REFILLS_FROM_SYSTEM:ALL – L1 refills from system (L3+memory) L1_DTLB_MISS_AND_L2_DTLB_HIT:ALL – L1 DTLB misses that hit in L2 DTLB L1_DTLB_AND_L2_DTLB_MISS:ALL – L2 DTLB misses DATA_PREFETCHES:ATTEMPTED – prefetches initiated by the DC prefetcher REQUESTS_TO_L2:DATA – requests to L2 from the L1 data cache (includes L1 misses and DC prefetches) REQUESTS_TO_L2:HW_PREFETCH_FROM_DC – requests to L2 from the DC prefetcher L2_CACHE_MISS:DATA – L2 data cache misses

20 HANDS-ON DEMO: USING HPCToolkit INSTRUCTION_CACHE_FETCHES – accesses to L1 I-cache INSTRUCTION_CACHE_MISSES – L1 I-cache misses INSTRUCTION_CACHE_REFILLS_FROM_L2 – L1 I-cache refills from L2 INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM – L1 I-cache refills from system L1_ITLB_MISS_AND_L2_ITLB_HIT – L1 ITLB misses that hit in L2 ITLB L1_ITLB_MISS_AND_L2_ITLB_MISS:ALL – L2 ITLB misses INSTRUCTION_FETCH_STALL – CPU cycles when instruction fetch stalled DECODER_EMPTY – CPU cycles when decoder is idle DISPATCH_STALLS – CPU cycles when dispatched was stalled DISPATCH_STALL_FOR_REORDER_BUFFER_FULL – dispatch stalled due to full ROB DISPATCH_STALL_FOR_RESERVATION_STATION_FULL – dispatch stalled due to full reservation station DISPATCH_STALL_FOR_FPU_FULL DISPATCH_STALL_FOR_LS_FULL – dispatch store due to LS buffer full MEMORY_CONTROLLER_REQUESTS:READ_REQUESTS – read memory requests MEMORY_CONTROLLER_REQUESTS:WRITE_REQUESTS – write memory requests MEMORY_CONTROLLER_REQUESTS:PREFETCH_REQUESTS – memory prefetch requests L3_CACHE_MISSES:ANY_READ – data reads that miss in L3

PERFORMANCE ANALYSIS CHALLENGES Current tools measure performance effects How much time is spent and how many cache misses are in a loop / routine Pinpoint hotspots Do not tell you if what you see is good or bad User must determine what factors are limiting performance 21

MIAMI OVERVIEW Performance modeling tool MIAMI: Machine Independent Application Models for performance Insight Automatically extract application features Works on fully-optimized binaries No performance effects are measured directly Separately model target architecture Done manually once per machine Compute application performance from first order principles 22

WHAT IT SOLVES Identifies performance limiting factors Enables “what if” analysis Reveals performance improvement potential Useful for prioritizing work and for understanding if “fixing” is worth the effort 23

MIAMI DIAGRAM 24 x86 object code CFGs, edge counts PIN MIAMI code IR instr /μop / registers XED Machine model (MDL) Loop nesting structure Dependence graph at loop level Dependence graph customized for machine instruction latencies, idiom replacement Memory reuse distance analysis PIN Cache miss predictions data reuse insight Performance predictions, performance limiters, potential for performance improvement map metrics to source code and data structures Binutils SymtabAPI CSV files / XML performance database hpcviewer Streaming concurrency sim. PIN Prefetching effectiveness

MIAMI DIAGRAM 25 x86 object code CFGs, edge counts PIN MIAMI code IR instr /μop / registers XED Machine model (MDL) Loop nesting structure Dependence graph at loop level Dependence graph customized for machine instruction latencies, idiom replacement Memory reuse distance analysis PIN Cache miss predictions data reuse insight Performance predictions, performance limiters, potential for performance improvement map metrics to source code and data structures Binutils SymtabAPI CSV files / XML performance database hpcviewer Streaming concurrency sim. PIN Prefetching effectiveness Diagnose utilization of CPU cores Model CPU back-end Identify instruction schedule inefficiencies Understand potential for improvement Diagnose utilization of CPU cores Model CPU back-end Identify instruction schedule inefficiencies Understand potential for improvement Diagnose cache reuse Understand data reuse at each memory level Identify memory access patterns with poor locality Understand what code and data layout transformations are needed Diagnose cache reuse Understand data reuse at each memory level Identify memory access patterns with poor locality Understand what code and data layout transformations are needed Diagnose stream prefetching perf. Understand data streaming behavior and number of concurrent streams Identify memory access patterns unfriendly to the hardware prefetchers Diagnose stream prefetching perf. Understand data streaming behavior and number of concurrent streams Identify memory access patterns unfriendly to the hardware prefetchers

MACHINE DESCRIPTION LANGUAGE (MDL) Enumerate back-end CPU resources Baseline performance limited by the back-end Describe instruction execution templates & resource usage Scheduling constraints between resources Idiom replacement Account for differences in ISAs, micro-architecture features / optimizations Memory hierarchy characteristics Other machine features 26 Construct a model of the target architecture

UNDERSTAND CPU CORES UTILIZATION Recover application CFG and understand execution frequency of paths in CFG Decode native x86 instructions to MIAMI IR Map application micro-ops to target machine resources Identify the factors limiting schedule length Application: insufficient ILP, instruction mix, SIMD Architecture: resource contention, retirement rate Idealize the limiting constraints to understand the maximum potential for improvement 27

28 MATRIX MULTIPLY HANDS-ON DEMO

INSIGHT FROM MIAMI Understand losses due to insufficient ILP Utilization of various machine resources Instruction mix Understand if vector instructions are used Contention on machine resources Few options from an application perspective, must change instruction mix Contention on load/store unit -> improve register reuse 29

SUMMARY Performance tools help us understand application performance HPCToolkit: low overhead, full-code profiler Uses hardware counter sampling through PAPI Maps performance data to functions, loops, calling contexts Intuitive viewer Enables top-down analysis Custom derived metrics enable quick performance analysis at loop level MIAMI: performance diagnosis based on performance modeling Uses profiling and static analysis of full application binaries Models CPU back-end to understand the main performance inefficiencies Data reuse and data streaming analysis reveal opportunities for optimization It is a research tool, not publicly available yet 30