1
Performance Analysis and Optimization through Run-time Simulation and Statistics Philip J. Mucci University Of Tennessee mucci@cs.utk.edu http://www.cs.utk.edu/~mucci
2
Motivation Tuning real DOD and DOE applications! –Performance on most codes is low. –Poor overall efficiency due to poor single-node performance. –Codes show good scalability because of the above and because of faster interconnects. –The tuning expertise is not there, nor should it need to be.
3
Description To use data available at run-time to improve compilation and optimization technology. Empirically determine how well the code maps to the underlying architecture. Bottlenecks can be identified and possibly corrected by an explicit set of rules and transformations.
4
Information not being used Hardware statistics gathered through simulation or monitoring can identify the problem. (sample listing) –Cache and branching behavior –Cycle/Load/Store/FLOP counts –Bottleneck determination –Reference pattern –Dynamic memory placement
5
Problem Areas Efficient use of the memory hierarchy Register re-use Aliasing Inlining Demotion Algorithms (iterative vs. direct)
6
Solutions Understanding (tutorials, reference material) Tools Preprocessors Compilers Manpower
7
Increasing Cache Performance How do we improve the use of the memory hierarchy? For computer scientists, it's not that hard; we need the right tools. How much can we automate? Through available tools and source analysis we can usually narrow the problem down to the function level.
8
Cache Simulation Instrumentation of routines Run of the executable Analysis and correlation with source code! Old idea, new implementation.
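As a rough illustration of what the source-level instrumentation step might look like (the hook name, its signature, and the file/line arguments are assumptions for this sketch, not the tool's actual interface), each memory reference in the original code can be rewritten to report its address and source location to the simulator before the access happens:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical simulator hook (name and signature are illustrative only):
 * records one memory reference with the source location it came from,
 * so cache misses can later be mapped back to the source line. */
static unsigned long refs = 0;
static void sim_reference(const void *addr, size_t size, const char *file, int line)
{
    (void)addr; (void)size; (void)file; (void)line;
    refs++;                         /* a real simulator would model the cache here */
}

/* Original loop:   for (i = 0; i < n; i++) s += a[i];
 * Instrumented form the source-level preprocessor might emit (sketch): */
static double sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        sim_reference(&a[i], sizeof a[i], "sum.c", 5);   /* the load of a[i] */
        s += a[i];
    }
    return s;
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g, references seen = %lu\n", sum(a, 8), refs);
    return 0;
}
```

Because the call sites carry file and line information, the report generator can attribute every hit and miss back to the statement that caused it.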
9
Cache Simulation Hardware independence Information on: –Locality –Placement –Reference pattern and Reuse –Line usage
10
Locality Spatial and Temporal –misses/memory reference –misses/re-use Conflict vs. Capacity
11
Placement Padding can be very important. Not always possible to do during the static analysis phase. The reference pattern can affect padding.
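As a concrete illustration of the kind of padding meant here (the array size and pad amount are made-up values; the right pad depends on the cache geometry the simulator reports), padding the leading dimension of a power-of-two-sized array keeps column walks from landing in the same cache sets:

```c
#include <stdio.h>

/* Sketch: padding the leading dimension of a 2-D array.
 * With N a power of two, walking down a column touches addresses exactly
 * N*sizeof(double) apart, which on many caches all map to the same set;
 * one extra column of padding breaks that pattern. */
#define N   1024
#define PAD 1                          /* one extra column of padding */

static double x[N][N + PAD];           /* declared N x (N+PAD), used as N x N */

static double column_sum(int j)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += x[i][j];                  /* stride is now (N+PAD) doubles, not N */
    return s;
}

int main(void)
{
    printf("column 0 sum = %g\n", column_sum(0));
    return 0;
}
```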
12
Reference Pattern Again, not always possible to determine during static analysis. Even harder to analyze when dealing with pseudo-optimized code. Examples: stencils, sparse solvers, etc.
13
Reuse Blocking is critical to applications where there is re-use. We need to identify re-use potential to spot the areas where blocking and register allocation efforts should be focused.
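A minimal sketch of the blocking transformation this analysis is meant to drive, using matrix multiply as the classic case (the tile size B is a placeholder; the right value depends on the cache being simulated):

```c
/* Blocked C += A*B for n x n row-major matrices. Blocking keeps a B x B
 * tile of each matrix resident in cache so every element loaded is re-used
 * many times before it is evicted. */
#define B 64                                            /* illustrative tile size */

void matmul_blocked(int n, const double *A, const double *Bm, double *C)
{
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        double aik = A[i * n + k];      /* re-used across the j loop */
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i * n + j] += aik * Bm[k * n + j];
                    }
}
```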
14
Source Code Mapping Most cache tools are hard to use and hard to relate to the source code. This tool simulates the cache(s) on each memory reference and thus can easily correlate the data with the source. Instrumentation is at the source level, not the object code.
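The per-reference simulation itself can be quite small. Here is a minimal sketch of a direct-mapped cache model; the geometry (256 lines of 64 bytes) and the bookkeeping are simplified assumptions for illustration, not the tool's actual implementation:

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal direct-mapped cache model: 256 lines of 64 bytes (assumed geometry). */
#define LINE_BITS 6
#define NUM_LINES 256

static struct { bool valid; uintptr_t tag; } cache[NUM_LINES];

/* Returns true on a hit, false on a miss; installs the line on a miss. */
static bool cache_access(const void *addr)
{
    uintptr_t line = (uintptr_t)addr >> LINE_BITS;
    unsigned  set  = (unsigned)(line % NUM_LINES);
    uintptr_t tag  = line / NUM_LINES;

    if (cache[set].valid && cache[set].tag == tag)
        return true;                 /* hit */
    cache[set].valid = true;         /* miss: evict whatever was there */
    cache[set].tag   = tag;
    return false;
}
```

Calling this from the instrumentation hook on every reference, and tagging each call with its source location, is what lets the statistics be reported per statement and per reference.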
15
Statistics Global, per file, per statement, per reference References, misses, cold misses, re-used references Conflict/Re-use matrix –M(A,B) = x means some element of A ejected some element of B from the cache x times, counted only when that element of A had been in the cache before.
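A sketch of how the simulator might maintain that matrix (the array-numbering scheme and the function name are assumptions for illustration): on each eviction, the entry for the (incoming array, victim array) pair is incremented only if the incoming element had previously been resident, matching the definition above.

```c
#define NUM_ARRAYS 16                         /* illustrative bound on tracked arrays */

/* conflict[A][B] counts how often a reference to array A ejected a line of
 * array B, restricted to the case described on the slide. */
static unsigned long conflict[NUM_ARRAYS][NUM_ARRAYS];

/* Called on a miss when a reference to array 'a' ejects a line of array 'b'.
 * Per the definition above, the count only includes cases where the incoming
 * element of 'a' had itself been in the cache before (a conflict, not a cold miss). */
static void record_eviction(int a, int b, int a_was_cached_before)
{
    if (a_was_cached_before)
        conflict[a][b]++;
}
```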
16
Development status GUI for selective instrumentation Real parsers (F90, C, C++) Better report generation
17
Implementation Simulator written in C Instrumentation in Perl GUI in Java Report generator in Perl
18
Relevance Why shouldn't this technology be part of a feedback loop? –Compile with instrumentation –Run –Recompile with information from the run –Watch for input sensitivity issues.
19
Integration Identifying and correcting poor cache behavior can be made explicit and part of a compiler. (Ideally a source-to-source transformer or preprocessor) Simulator can stand alone for detailed analysis and optimization by CS folks. Our knowledge and expertise made available through the tools.
20
Hardware Counters Virtually every processor available has hardware counters. The interfaces and documentation are poor or non-existent. Hardware differs greatly, as do the semantics. Useful for measurement, analysis, optimization, modeling and benchmarking.
21
Performance Data Standard Standardize an API to obtain hardware performance counters. Standardize the definitions of what those counters mean. The API is lightweight and portable.
22
Performance Data Standard Target platforms –R10K, R12K –P2SC, PowerPC 604e, Power 3 –Sun Ultra 2/3 –Intel PII, Katmai, Merced –Alpha 21164, 21264
23
Performance Data Standard Motivation –Portable performance tools –Optimization through feedback –Developers wanting simple and accurate timing and statistics –Modeling, evaluation
24
Performance Data Standard Small number of useful measurement points –Timing: cycles, microseconds –I/D cache misses, invalidations –Branch mispredictions –Load, store, FLOP, instruction counts –I/D TLB misses
25
Performance Data Standard API Efficient counter multiplexing Thread safety Functions for –start, stop, reset, get, accumulate, query, control Use the best available vendor supported interface or API Possible pairing with DAIS, Dyninst for naming
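A rough sketch of what calling such an API could look like from user code. The specification had not yet appeared at the time of this talk, so every identifier below (function names, event names) is a placeholder invented for illustration, not the standard's actual interface:

```c
#include <stdio.h>

/* Hypothetical counter API in the spirit of the proposal: thin start/stop
 * calls layered over whatever vendor interface the platform offers.
 * All names here are invented; the real spec was still being drafted. */
enum pds_event { PDS_CYCLES, PDS_FLOPS, PDS_L1_DCACHE_MISSES };

static int pds_start(const enum pds_event *events, int n)
{
    (void)events; (void)n;              /* real code would program the counters */
    return 0;
}

static int pds_stop(long long *values, int n)
{
    for (int i = 0; i < n; i++)         /* real code would read the hardware */
        values[i] = 0;
    return 0;
}

static void kernel(void)                /* stand-in for the region of interest */
{
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += i * 0.5;
}

int main(void)
{
    enum pds_event ev[3] = { PDS_CYCLES, PDS_FLOPS, PDS_L1_DCACHE_MISSES };
    long long counts[3];

    pds_start(ev, 3);
    kernel();
    pds_stop(counts, 3);

    printf("cycles=%lld flops=%lld L1 D-misses=%lld\n",
           counts[0], counts[1], counts[2]);
    return 0;
}
```

The point of the standard is that this user code stays the same across platforms; only the implementation behind the calls changes to match each vendor's counter interface.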
26
Development status Research on the various machines' available hardware and interfaces. Compilation of findings, web page and mailing list. API specification to appear mid-August for discussion. Vendors are lurking. http://www.cs.utk.edu/~mucci/pdsa
27
Deliverables API for O2K, T3E, SP Portable prof implementation
28
People Shirley Browne (UT) Jeff Brown (LANL) Jeff Durachta (IBM, LANL) Christopher Kerr (IBM, LANL) George Ho (UT) Kevin London (UT) Philip Mucci (UT, Sandia)
29
Rice/UTK Collaboration [diagram: Rice, UT, DOE, DOD; low-level support, optimization technology, tools, apps]
30
Deliverables F90, C++ preprocessor with feedback Cache tool Analysis and optimization of poor codes Performance API