Presentation on theme: "Performance Engineering Research Institute (DOE SciDAC)"— Presentation transcript:

1 Performance Engineering Research Institute (DOE SciDAC)
Katherine Yelick, LBNL and UC Berkeley

2 Performance Engineering Enabling Petascale Science
Performance engineering is getting harder. Systems are more complicated: O(100K) processing nodes, multi-core with SIMD extensions. Applications are more complicated: multi-disciplinary and multi-scale. PERI approach: modeling (performance prediction); application engagement (assist in performance engineering); automatic performance tuning (tools to improve performance). (Slide images: IBM BlueGene at LLNL, Cray XT3 at ORNL, POP model of El Niño, Beam3D accelerator modeling.)

3 Engaging SciDAC Software Developers
Application Engagement: work with DOE computational scientists, ensure successful performance porting of scientific software, and focus PERI research on real problems. Application Liaisons: build long-term personal relationships. Tiger Teams: focus on DOE's highest priorities (SciDAC-2 applications, LCF Pioneering applications, INCITE applications). Optimizing arithmetic kernels; maximizing scientific throughput.

4 Automatic Performance Tuning of Scientific Code
Long-term goals of PERI: automate the process of tuning; improve performance portability; address the performance-expert shortage by replacing human time with computer time; build on 40 years of human experience and recent success with auto-tuned libraries (Atlas, FFTW, OSKI). (Framework diagram: source code plus guidance (measurements, models, hardware information, sample input, annotations, assertions) flows through triage, analysis, domain-specific code generation, external software, transformations, code selection, and application assembly; training runs and production execution feed runtime performance data, runtime adaptation, and a persistent database. Caption: PERI automatic tuning framework.) A generic empirical-search sketch follows below.
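To make the "replace human time with computer time" idea concrete, here is a hedged, minimal sketch in C of an empirical-search driver: it times each candidate code variant on sample input and keeps the fastest. The variant functions and kernel signature are hypothetical illustrations, not PERI APIs.

/* Generic empirical auto-tuning sketch (illustrative only; the candidate
 * kernels and their signature are hypothetical, not PERI APIs). */
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(int n, const double *x, double *y);

/* Two hypothetical candidate implementations of the same kernel. */
static void variant_baseline(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
}
static void variant_unrolled(int n, const double *x, double *y) {
    int i = 0;
    for (; i + 1 < n; i += 2) { y[i] = 2.0 * x[i]; y[i+1] = 2.0 * x[i+1]; }
    for (; i < n; i++) y[i] = 2.0 * x[i];
}

static double time_kernel(kernel_fn k, int n, const double *x, double *y) {
    clock_t t0 = clock();
    k(n, x, y);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    enum { N = 1 << 20 };
    static double x[N], y[N];
    kernel_fn variants[] = { variant_baseline, variant_unrolled };
    kernel_fn best = variants[0];
    double best_t = time_kernel(best, N, x, y);

    for (int v = 1; v < 2; v++) {           /* time each candidate on sample input */
        double t = time_kernel(variants[v], N, x, y);
        if (t < best_t) { best_t = t; best = variants[v]; }
    }
    /* In a real framework the winner would be recorded in a persistent database. */
    printf("best variant: %d (%.6f s)\n", best == variants[0] ? 0 : 1, best_t);
    return 0;
}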

5 Participating Institutions
Lead PI: Bob Lucas Institutions: Argonne National Laboratory Lawrence Berkeley National Laboratory Lawrence Livermore National Laboratory Oak Ridge National Laboratory Rice University University of California at San Diego University of Maryland University of North Carolina University of Southern California University of Tennessee

6 Major Tuning Activities in PERI
Triage: discover tuning targets Identifying bottlenecks (HPC Toolkit) Use hardware events (PAPI) Library-based tuning Dense linear algebra (Atlas) Sparse linear algebra (OSKI) Stencil operations Application-based tuning Parameterized applications (Active Harmony) Automatic source-based tuning (Rose and CG)

7 Triage Tools: HPC Toolkit
Goal: discover tuning opportunities (Rice). Features of HPCToolkit: ease of use (no manual code instrumentation; handles large multi-lingual codes); detailed measurements of both communication and computation at many granularities (node, core, procedure, loop, and statement); identifies inefficiencies in code, both parallel (load imbalance, communication overhead, etc.) and computational (pipeline stalls, memory bottlenecks, etc.). (Speaker notes: implications of random replacement? Concern: 32 threads put high demand on the memory subsystem; crossbar to a shared, banked L2; four on-chip memory controllers with 20 GB/s memory bandwidth; data sharing in commercial server codes hits the L2 instead of incurring slow SMP coherence misses.)

8

9 On-line Hardware Monitoring: PAPI
Goal: a machine-independent Performance API (UTK). Multi-substrate support, recently added to PAPI, enables simultaneous monitoring of on-processor counters, off-processor counters (e.g., network counters), temperature sensors, and heterogeneous multi-core hybrid systems. Online monitoring will help enable runtime tuning. A minimal counter-reading sketch follows below.
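For readers unfamiliar with PAPI, here is a minimal sketch of reading hardware counters around a region of interest using PAPI's standard low-level C event-set interface; it assumes the PAPI_TOT_CYC and PAPI_L2_TCM preset events are available on the platform.

/* Minimal PAPI counter read around a region of interest (sketch; assumes
 * the PAPI_TOT_CYC and PAPI_L2_TCM preset events exist on this platform). */
#include <stdio.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;                            /* library/version mismatch */
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);  /* total cycles */
    PAPI_add_event(eventset, PAPI_L2_TCM);   /* L2 total cache misses */

    PAPI_start(eventset);
    /* ... region of interest: the loop or kernel being triaged ... */
    PAPI_stop(eventset, counts);

    printf("cycles = %lld, L2 misses = %lld\n", counts[0], counts[1]);
    return 0;
}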

10 Major Tuning Activities in PERI
Triage: discover tuning targets Identifying bottlenecks (HPC Toolkit) Use hardware events (PAPI) Library-based tuning Dense linear algebra (Atlas) Sparse linear algebra (OSKI) Stencil operations Application-based tuning Parameterized applications (Active Harmony) Automatic source-based tuning (Rose and CG)

11 Dense Linear Algebra: Atlas
Goal: auto-tuning for dense linear algebra (UTK). Atlas features and plans: performance portability across processors; support for massively multi-threaded and multi-core architectures, which requires asynchrony (e.g., lookahead), modern vectorization (SIMD extensions), hiding of memory latency, and overlap of communication with computation; hand techniques being automated; better search algorithms and parallel search. A sketch of the kind of blocked loop nest such a search tunes follows below.
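This is not ATLAS code itself, just a hedged illustration of the kind of tiled matrix-multiply loop nest whose blocking factor (NB here) an ATLAS-style search explores empirically, alongside unrolling and other parameters.

/* Tiled matrix multiply, C += A*B (row-major n x n), with a tunable block
 * size NB. An ATLAS-style auto-tuner times many (NB, unrolling, ...) choices
 * and keeps the fastest; this loop nest is only illustrative. */
void dgemm_blocked(int n, int NB, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
      for (int kk = 0; kk < n; kk += NB)
        for (int jj = 0; jj < n; jj += NB)
          for (int i = ii; i < ii + NB && i < n; i++)
            for (int k = kk; k < kk + NB && k < n; k++) {
              double aik = A[i*n + k];              /* reused across the j loop */
              for (int j = jj; j < jj + NB && j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
            }
}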

12 Sparse Linear Algebra OSKI: Optimized Sparse Kernel Interface (Berkeley). Extra work can improve performance, and decisions cannot be made offline: the matrix structure is needed. Example: pad 3x3 blocks with zeros; "fill ratio" = 1.5; Pentium III speedup: 1.5x. (Speaker notes: the main point is that there can be a considerable pay-off for a judicious choice of "fill" (r x c), but allowing for fill makes the implementation space even more complicated. For this matrix on a Pentium III, we observed a 1.5x speedup even after filling in an additional 50% explicit zeros. Two effects: (1) filling in zeros but eliminating integer indices (overhead); (2) the quality of the r x c code produced by the compiler may be much better for particular r's and c's. In this example the overall data structure size stays the same, but the 3x3 code is 2x faster than 1x1 code for a dense matrix stored in sparse format.) Joint work with Bebop group. A register-blocked SpMV sketch follows below.
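A hedged sketch of the register-blocked kernel the slide alludes to: a 3x3 block-CSR y += A*x loop, where some stored block entries may be explicit zeros (the "fill"; fill ratio = stored entries / true nonzeros). The struct and field names are illustrative, not OSKI's internal layout.

/* 3x3 BCSR sparse matrix-vector multiply, y += A*x (illustrative layout). */
typedef struct {
    int     nbrows;     /* number of block rows (each covers 3 scalar rows) */
    int    *brow_ptr;   /* size nbrows+1: block extents per block row */
    int    *bcol_idx;   /* block column index of each stored block */
    double *vals;       /* 9 entries per block, row-major; may contain fill zeros */
} bcsr33_t;

void bcsr33_spmv(const bcsr33_t *A, const double *x, double *y) {
    for (int I = 0; I < A->nbrows; I++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;        /* accumulators kept in registers */
        for (int b = A->brow_ptr[I]; b < A->brow_ptr[I+1]; b++) {
            const double *v  = &A->vals[9*b];
            const double *xb = &x[3 * A->bcol_idx[b]];
            y0 += v[0]*xb[0] + v[1]*xb[1] + v[2]*xb[2];
            y1 += v[3]*xb[0] + v[4]*xb[1] + v[5]*xb[2];
            y2 += v[6]*xb[0] + v[7]*xb[1] + v[8]*xb[2];
        }
        y[3*I]   += y0;
        y[3*I+1] += y1;
        y[3*I+2] += y2;
    }
}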

13 Optimizations Available in OSKI
Optimizations for SpMV: register blocking (RB), up to 4x over CSR; variable block splitting, 2.1x over CSR and 1.8x over RB; diagonals, 2x over CSR; reordering to create dense structure plus splitting, 2x over CSR; symmetry, 2.8x over CSR and 2.6x over RB; cache blocking, 3x over CSR; multiple vectors (SpMM), 7x over CSR. Sparse triangular solve: hybrid sparse/dense data structure, 1.8x over CSR. Higher-level kernels (focus for new work): AAT*x and ATA*x, 4x over CSR and 1.8x over RB; A*x, 2x over CSR and 1.5x over RB. New: vector and multicore support, better code generation. See bebop.cs.berkeley.edu for research results. (Speaker notes: the items shown in bold on the original slide are optimizations for which we have an automated heuristic for selecting the transformation.) Joint work with Bebop group; see R. Vuduc's PhD thesis.

14 OSKI-PETSc Proof-of-Concept Results
Recent work by Rich Vuduc: integration of OSKI into PETSc. Example matrix: accelerator cavity design, N ~ 1 M, ~40 M non-zeros (SLAC matrix), with 2x2 dense block substructure. Uses register blocking and symmetry; improves performance of the local computation. Preliminary speedup on an 8-node Xeon cluster: 1.6x. (Speaker notes: proof-of-concept experiment to show the OSKI-PETSc implementation "works", i.e., doesn't change the scaling behavior you'd expect from your current PETSc code.) Joint work with Bebop group; see R. Vuduc's PhD thesis.

15 Stencil Computations Stencils have simple inner loops
Typically ~1 FLOP per load, so they run at a small fraction of peak (<15%). Strategies to minimize cache misses: cache blocking within one sweep; time skewing (and blocking) to merge across iterations; cache-oblivious recursive blocking across iterations. Observations: iteration merging only works in some algorithms; reducing misses does not always minimize time; prefetch is as important as caches (unit-stride runs); 1D, 2D, and 3D behave differently in practice. A blocked-sweep sketch follows below. Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
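Here is a hedged sketch of cache blocking within a single sweep for a 3D 7-point stencil. The tile sizes TJ and TK are tunable parameters, and (consistent with the later cache-blocking summary) the unit-stride i dimension is left unblocked; constants and the grid layout are illustrative assumptions.

/* One cache-blocked sweep of a 7-point stencil on an n^3 grid
 * (ghost layer assumed at indices 0 and n-1). TJ/TK are tunable tiles. */
#define IDX(i,j,k,n) ((size_t)(k)*(size_t)(n)*(size_t)(n) + (size_t)(j)*(size_t)(n) + (size_t)(i))

void stencil_sweep_blocked(int n, int TJ, int TK,
                           const double *in, double *out,
                           double c0, double c1) {
    for (int kk = 1; kk < n-1; kk += TK)
      for (int jj = 1; jj < n-1; jj += TJ)
        for (int k = kk; k < kk + TK && k < n-1; k++)
          for (int j = jj; j < jj + TJ && j < n-1; j++)
            for (int i = 1; i < n-1; i++)              /* unit-stride, unblocked */
              out[IDX(i,j,k,n)] = c0 * in[IDX(i,j,k,n)]
                + c1 * (in[IDX(i-1,j,k,n)] + in[IDX(i+1,j,k,n)]
                      + in[IDX(i,j-1,k,n)] + in[IDX(i,j+1,k,n)]
                      + in[IDX(i,j,k-1,n)] + in[IDX(i,j,k+1,n)]);
}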

16 Cache-Optimized Stencils

17 Tuning for the Cell Architecture
Cell will be used in the PS3, so it is a high-volume part. Current system problems: off-chip bandwidth and power; the double-precision floating-point interface is ~14x slower than single precision, a problem for computationally intensive kernels (BLAS3). Consider a variation, called Cell+, that fixes this. Memory system: software-controlled memory (like explicit out-of-core) improves bandwidth and power usage but increases programming complexity; a double-buffering sketch follows below. Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil
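To illustrate the programming-complexity point, here is a hedged, generic sketch of the double-buffering pattern that software-controlled memories (such as Cell's local store) require. The dma_get/dma_wait calls are hypothetical stand-ins for the platform's asynchronous DMA primitives, not a real API.

/* Double-buffered streaming through a software-controlled local memory.
 * dma_get()/dma_wait() are hypothetical placeholders for the platform's
 * asynchronous DMA primitives; CHUNK is sized to fit the local store. */
enum { CHUNK = 2048 };

extern void dma_get(double *local_dst, const double *remote_src,
                    int nelems, int tag);               /* hypothetical */
extern void dma_wait(int tag);                           /* hypothetical */

double process_stream(const double *remote, long n) {
    static double buf[2][CHUNK];
    double sum = 0.0;
    int cur = 0;

    dma_get(buf[cur], remote, CHUNK, cur);               /* prime first buffer */
    for (long off = 0; off < n; off += CHUNK) {
        int nxt = 1 - cur;
        if (off + CHUNK < n)                             /* prefetch next chunk */
            dma_get(buf[nxt], remote + off + CHUNK, CHUNK, nxt);
        dma_wait(cur);                                   /* wait for current chunk */
        for (int i = 0; i < CHUNK && off + i < n; i++)   /* compute on local copy */
            sum += buf[cur][i] * buf[cur][i];
        cur = nxt;                                       /* swap buffers */
    }
    return sum;
}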

18 Scientific Kernels on Cell (double precision)
Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

19 Major Tuning Activities in PERI
Triage: discover tuning targets Identifying bottlenecks (HPC Toolkit) Use hardware events (PAPI) Library-based tuning Dense linear algebra (Atlas) Sparse linear algebra (OSKI) Stencil operations Application-based tuning Parameterized applications (Active Harmony) Automatic source-based tuning (Rose and CG)

20 User-Assisted Runtime Performance Optimization
Active Harmony: runtime optimization (UMD). Automatic library selection (code): monitor library performance and switch libraries if necessary. Automatic performance tuning (parameters): monitor system performance and adjust runtime parameters. Results: cluster-based web service, up to 16% improvement; POP, up to 17% improvement; GS2, up to 3.4x faster. New: improved search algorithms; tuning of component-based software (ANL). (Speaker notes: to solve those problems, we developed the Active Harmony system. Harmony is a real-time performance optimization system. It helps the application select the appropriate underlying programming library, that is, it selects the "right code" to use. To do so, it monitors the performance of all underlying programming libraries and, during execution, switches among them if necessary. After selecting the appropriate library, it also tunes that library together with all other parameters: the system monitors whole-system performance and adjusts the tunable parameters at runtime.) A generic runtime-tuning sketch follows below.
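This is not the Active Harmony API, just a hedged sketch of the general pattern it automates: expose a tunable runtime parameter, measure each timestep, and let a simple search adjust the parameter between steps. run_timestep is a hypothetical application hook.

/* Generic runtime parameter-tuning loop (illustrative; not Active Harmony's API).
 * A tunable block size is nudged between timesteps based on measured time. */
extern double run_timestep(int block_size);   /* hypothetical application step;
                                                 returns elapsed seconds */

void tuned_run(int nsteps) {
    int    block = 32, step_dir = +1;
    double prev_t = 1e30;

    for (int s = 0; s < nsteps; s++) {
        double t = run_timestep(block);
        if (t > prev_t) step_dir = -step_dir;  /* got slower: reverse direction */
        prev_t = t;
        block += 8 * step_dir;                 /* simple hill-climbing move */
        if (block < 8)   block = 8;
        if (block > 512) block = 512;
    }
}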

21 Active Harmony Example Parallel Ocean Program (POP)
Parameterized over block dimension. Problem size: 3600x2400 on 480 processors (NERSC IBM SP, Seaborg). Up to 15% improvement in execution time.

22 Kernel Extraction (Code Isolator)
(Diagram: the original program's code fragment to be executed is outlined into a call, OutlineFunc(<InputParameters>), with the body moved into void OutlineFunc(<InputParameters>){...}. The isolated program wraps the isolated code with StoreInitialDataValues / <InputParameters> = SetInitialDataValues and CaptureMachineState / SetMachineState. Steps are labeled MANUAL, AUTOMATED IN SUIF, and AUTOMATE IN OPEN64. Published at LCPC '04.) A sketch of the outlining idea follows below.
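A hedged sketch of what a code-isolator transformation produces: the loop of interest is outlined into a function, and its inputs are captured once so the kernel can be replayed and tuned outside the full application. All names here are illustrative, not the tool's actual output.

/* Illustration of kernel outlining for isolation (names are illustrative).
 * Original:                            After outlining:
 *   for (i = 0; i < n; i++)              OutlinedKernel(n, a, b, c);
 *     a[i] = b[i] * c[i];
 * The isolator also records n, b, c (and relevant machine state) so the
 * kernel can be rerun in a standalone "isolated program". */
#include <stdio.h>

void OutlinedKernel(int n, double *a, const double *b, const double *c) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Capture the kernel's inputs to a file so the isolated program can restore them. */
void StoreInitialDataValues(const char *path, int n,
                            const double *b, const double *c) {
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(&n, sizeof n, 1, f);
    fwrite(b, sizeof *b, (size_t)n, f);
    fwrite(c, sizeof *c, (size_t)n, f);
    fclose(f);
}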

23 ROSE Project Software analysis and optimization for scientific applications Tool for building source-to-source translators Support for C and C++ F90 in development Loop optimizations Performance analysis Software engineering Lab, academic, and industry use Domain-specific analysis and optimizations Development of new optimization approaches Optimization of object-oriented abstractions

24 Source Based Optimizations in Rose
Robustness (handles real lab applications): Kull (ASC), ALE3D (ASC), Ares (ASC, in progress), hyper, IRS (ASC benchmark), Khola (ASC benchmark), CHOMBO (LBL AMR framework), ROSE (compiles itself, in progress); ten separate million-line applications (current work). Custom application analysis: analysis to find and remove static class initialization; custom loop classification for Kull loop structures; analysis for combined procedure inlining, code motion, and loop fusion for Kull loop structures (the optimization was done by hand in one loop by Brian Miller; automating the search for other opportunities in Kull is current work). Optimization: demonstrated data-structure splitting for Kull (2x improvement on a Kull benchmark; currently implemented by hand in Kull).

25 Empirical Optimization (loop fusion w/ Ken Kennedy + students)
Empirical evaluation of hundreds of loop-fusion options for the hyperbolic PPM scheme (~50 loops; PPM by Woodward/Colella). Uses the ROSE loop optimizer with parameters to control loop fusion. Static evaluation: faster performance, slower to generate. Dynamic evaluation: slower performance, correlated to cleanly generated code, 10x faster to evaluate the search space. Un-fused loops: for(i = 0; i < size; i++) a[i] = c[i] + d; for(i = 0; i < size; i++) b[i] = c[i] - d; Fused loops: for(i = 0; i < size; i++) { a[i] = c[i] + d; b[i] = c[i] - d; }

26 Matrix Multiply: Comparison of ECO, ATLAS, vendor BLAS and compiler
(Chart: matrix-multiply performance on SGI R10K, comparing ECO, ATLAS, vendor BLAS, and the native compiler.)

27 Summary Many solved and open problems in automatic tuning
Berkeley-specific activities: OSKI, where extra floating-point work can save time; stencil tuning, where prefetch matters; new architectures (vectors, Cell); integration with PETSc for clusters. PERI: basic auto-tuning framework; library- and application-level tuning, online and offline; source transformations and domain-specific generators; many forms of "guidance" to control optimizations; performance modeling and application engagement too; opportunities to collaborate. (Speaker notes to self: reorder relative to 3Ps slide; change to portability point for the 3rd one; take out cache thrashing notes, or combine with the 2nd part of performance; add URLs.)

28 PERI Automatic Tuning Tools
(Framework diagram, as on slide 4: source code plus guidance (models, hardware information, annotations, assertions) flows through triage, analysis, domain-specific code generation, external software, transformations, code selection, and application assembly; training runs and production execution feed runtime performance data, runtime adaptation, and a persistent database.)

29 Runtime Tuning with Components
Tuning of component-based software (Norris & Hovland, ANL) Initial implementation of intra-component performance analysis for CQoS (FY08, Q1) Intra-component analysis for generating performance models of single components (FY08, Q4) Define specification for CQoS support for component SciDAC apps (FY09, Q1)

30 Source-Based Empirical Optimization
Source-based optimization (Quinlan/LLNL, Hall/ISI). Combine model-guided and empirical optimization: compiler models prune unprofitable solutions; empirical data provides an accurate measure of optimization impact. Supporting framework: kernel extraction tools (code isolator); prototypes for C/C++ and F90; an experience base to maintain previous results (later years). More talks on these projects later today.

31 FY07 Plan for Source Tuning (USC)
1. From proposal: “Develop optimizations for imperfectly nested loops” STATUS: New transformation framework underway, uses Omega 2. Nearer term milestone for out-year deliverable Frontend to kernel extraction tool in Open64 PLAN: Instrument original application code to collect loop bounds, control flow and input data 3. New! Targeting locality + multimedia extension architectures (AltiVec and SSE3) STATUS: Preliminary MM results on AltiVec, working on SSE3 4. Need help for out-year milestone! Apply to “selected loops in SciDAC applications” Plan for identifying these?

32 Source-Based Tuning (LLNL)
Develop optimizations that span multiple kernels (FY08 Q2). Initial integration of analysis and transformation engine (FY08 Q3). Deliver kernel extraction tool for C/C++ to PERC-3 portal (FY09 Q4).

33 Tuning (UNC) Develop triage tools for power usage and optimization (FY07 Q4). Develop tools to characterize energy and running time (FY07 Q4). Initial integration of power consumption analysis (FY08 Q3). Develop adaptive algorithms for optimizing energy and running time (FY08 Q4).

34 Pop Quiz What are: HPCToolkit, ROSE, BeBOP, Active Harmony, PAPI, Atlas, ECO,
OSKI? Should we have an index for the PERI portal? A one-sentence description of each tool and its relationship to PERI (if any)? Is Google good enough?

35 Challenges Technical challenges Multicore, etc.:
This is under control (modulo the inability to control SMPs); we would do well to target key application kernels. Scaling, communication, load imbalance: less experience here, but some results for communication tuning; load imbalance is likely to be an app-level problem. Management challenges: the core tuning community is as described; minor: Mary and Dan need to work closely; lots of "outer circle" tuning activities; relationship to modeling: identify specific opportunities.

36 PERI Tuning Motivation:
Hand-tuning is too time-consuming and is not robust, especially as we move towards petascale: topology may matter, multi-core memory systems are complicated, and memory and network latency are not getting better. Solution: automatic performance tuning. Use tools to identify tuning opportunities; build apps to be auto-tunable via parameters plus a tool; use auto-tuned libraries in applications; tune full applications using source-to-source transforms.

37 How OSKI Tunes (Overview)
Library Install-Time (offline) / Application Run-Time. (Speaker notes for the animation: this cartoon illustrates at a high level when "tuning" occurs. The diagram is split into what happens when you download and install OSKI ("Library Install-Time", left side) and when you call OSKI at run-time ("Application Run-Time", right side). Diagram key: ovals are actions taken by the library; cylinders are data stored by or with the library; solid arrows are control flow; dashed arrows are data flow. At library-installation time: 1. "Build": pre-compile source code; possible code variants are stored in dynamic libraries ("Code Variants"). 2. "Benchmark": the installation process measures the speed of the possible code variants and stores the results ("Benchmark Data"). The entire build process uses standard, portable GNU configure. At run-time, from within the user's application: 1. "Matrix from user": the user passes a pre-assembled matrix in a standard format like compressed sparse row (CSR) or column (CSC). 2. "Evaluate models": the library contains a list of "heuristic models"; each model is a procedure that analyzes the matrix, workload, and benchmarking data and chooses the data structure and code it thinks are best for that matrix and workload. A model is typically specialized to predict tuning parameters for a particular kernel and class of data structures (e.g., predict the block size for register-blocked matvec), but higher-level models (meta-models) that combine several heuristics or predict over several possible data structures and kernels are also possible. In the initial implementation, "Evaluate Models" does the following: based on the workload, decide on an allowable amount of time for tuning (a "tuning budget"); while there is time left for tuning, select and evaluate a model to get the best predicted performance and corresponding tuning parameters.) Joint work with Bebop group, see R. Vuduc PhD thesis. A sketch of this model-evaluation loop follows below.
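The following is a hedged, pseudocode-style C rendering of the "Evaluate Models" step described in the notes above: spend a workload-derived tuning budget evaluating heuristic models and keep the best prediction. The types and function names are illustrative, not OSKI internals.

/* Sketch of the run-time "Evaluate Models" loop described above
 * (illustrative only; these are not OSKI's internal types or functions). */
typedef struct { double predicted_mflops; int r, c; } tuning_choice_t;

typedef tuning_choice_t (*heuristic_model_t)(const void *matrix,
                                              const void *workload,
                                              const void *benchmark_data);

tuning_choice_t evaluate_models(heuristic_model_t *models, int nmodels,
                                const void *matrix, const void *workload,
                                const void *bench, double tuning_budget_sec,
                                double (*elapsed_sec)(void)) {
    tuning_choice_t best = { 0.0, 1, 1 };    /* default: unblocked CSR (1x1) */
    double t_start = elapsed_sec();
    for (int m = 0; m < nmodels; m++) {
        if (elapsed_sec() - t_start > tuning_budget_sec)
            break;                           /* out of tuning budget */
        tuning_choice_t c = models[m](matrix, workload, bench);
        if (c.predicted_mflops > best.predicted_mflops)
            best = c;                        /* keep best predicted choice */
    }
    return best;                             /* used to select data structure & code */
}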

38 How OSKI Tunes (Overview)
Library Install-Time (offline): 1. Build for Target Arch.; 2. Benchmark, producing generated code variants and benchmark data. Application Run-Time. (Speaker notes repeat those of slide 37.) Joint work with Bebop group, see R. Vuduc PhD thesis

39 How OSKI Tunes (Overview)
Library Install-Time (offline): 1. Build for Target Arch.; 2. Benchmark, producing generated code variants and benchmark data. Application Run-Time: the workload from program monitoring, history, and the matrix feed 1. Evaluate Models (heuristic models). (Speaker notes repeat those of slide 37.) Joint work with Bebop group, see R. Vuduc PhD thesis

40 How OSKI Tunes (Overview)
Library Install-Time (offline): 1. Build for Target Arch.; 2. Benchmark, producing generated code variants and benchmark data. Application Run-Time: the workload from program monitoring, history, and the matrix feed 1. Evaluate Models (heuristic models), then 2. Select Data Struct. & Code; to the user: a matrix handle for kernel calls. Extensibility: advanced users may write and dynamically add "Code variants" and "Heuristic models" to the system. (Speaker notes repeat those of slide 37.) Joint work with Bebop group, see R. Vuduc PhD thesis

41 OSKI-PETSc Performance: Accel. Cavity
(Chart annotations: p=1: 234 Mflop/s; p=1: 315 Mflop/s; OSKI-PETSc p=1: 480 Mflop/s; OSKI-PETSc p=8: 6.2.)

42 Stanza Triad
Even smaller benchmark for prefetching, derived from STREAM Triad. The stanza length L is the length of a unit-stride run: while i < array_length, for each L-element stanza do A[i] = scalar * X[i] + Y[i], then skip k elements. (Diagram: 1) do L triads, 2) skip k elements, 3) do L triads; stanzas shown as unit-stride runs separated by gaps.) Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams. A C sketch follows below.
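A hedged C sketch of the benchmark's inner loop as described on the slide: unit-stride stanzas of the STREAM triad separated by skips of k elements.

/* Stanza Triad micro-benchmark sketch: do L-element unit-stride "stanzas"
 * of the STREAM triad, skipping k elements between stanzas. */
void stanza_triad(double *A, const double *X, const double *Y,
                  long array_length, long L, long k, double scalar) {
    long i = 0;
    while (i < array_length) {
        long end = i + L;
        if (end > array_length) end = array_length;
        for (long j = i; j < end; j++)      /* one unit-stride stanza */
            A[j] = scalar * X[j] + Y[j];
        i = end + k;                        /* skip k elements */
    }
}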

43 Stanza Triad Results Without prefetching:
performance would be independent of stanza length (a flat line at STREAM peak); our results show that performance depends on stanza length. Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

44 Cost Model for Stanza Triad
The first cache line in every L-element stanza is not prefetched: assign it cost C_non-prefetched, taken from Stanza Triad with L = cache line size. The rest of the cache lines are prefetched: assign them cost C_prefetched, taken from Stanza Triad with large L. Total cost: Cost = #non-prefetched * C_non-prefetched + #prefetched * C_prefetched. (Speaker note: maybe need a diagram here?) Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams. A small cost-model function follows below.
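A minimal sketch of the cost formula above as a C function, assuming one non-prefetched cache line per stanza and prefetched lines for the remainder.

/* Stanza Triad cost model sketch: the first cache line of each L-element
 * stanza is assumed not prefetched; the remaining lines are prefetched. */
double stanza_cost(long num_stanzas, long lines_per_stanza,
                   double c_nonprefetched, double c_prefetched) {
    long nonpref = num_stanzas;                           /* 1 per stanza */
    long pref    = num_stanzas * (lines_per_stanza - 1);  /* the rest */
    return nonpref * c_nonprefetched + pref * c_prefetched;
}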

45 Stanza Triad Model Works well, except Itanium2
Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

46 Stanza Triad Memory Model 2
Instead of a 2-point piecewise function, use 3 points; this models all three architectures. Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

47 Stencil Cost Model for Cache Blocking
Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

48 Stencil Probe Cost Model
Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

49 Stencil Probe Cost Model
Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

50 Stencil Cache Blocking Summary
Speedups come only with large grid sizes and an unblocked unit-stride dimension. Currently applying this to cross-iteration optimizations. Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

