Performance Tuning

for more information ... http://www.tops-scidac.org
TOPS is providing applications with highly efficient implementations of common sparse matrix computational kernels, automatically tuned for the user's kernel, matrix, and machine.

Trends and the Need for Automatically Tuned Sparse Kernels

Less than 10% of peak: typical untuned sparse matrix-vector multiply (SpMV) performance is below 10% of peak on modern cache-based superscalar machines, even for the best reference implementation in compressed sparse row (CSR) format. With careful tuning, 2x speedups and 30% of peak or more are possible.

The optimal choice of tuning parameters can be surprising: (Left) A matrix that naturally contains 8x8 dense blocks. (Right) On an Itanium 2, the optimal block size of 4x2 achieves 1.1 Gflop/s (31% of peak) and is over 4x faster than the conventional unblocked (1x1) implementation.

Extra work can improve performance: filling in explicit zeros (shown as x), followed by 3x3 blocking, increases the number of flops by 1.5x for this matrix, yet SpMV still runs 1.5x faster than the unblocked code on a Pentium III because the raw speed in Mflop/s increases by 2.25x.

Search-based Methodology for Automatic Performance Tuning

Approach to automatic tuning:
- Identify and generate a space of implementations.
- Search this space using empirical models and experiments.

Example: choosing an r x c block size (a sketch of this selection heuristic follows the Pointers section below):
- Off-line benchmark [machine]: measure Mflops(r,c) for a dense matrix stored in sparse blocked format.
- Run-time search [matrix]: estimate Fill(r,c) for all r, c.
- Heuristic model [combine]: choose r and c to maximize Estimated Mflops = Mflops(r,c) / Fill(r,c).
This yields performance within 10% of the best r, c. (Figure: a dense substructure containing 90% of the non-zeros.)

Performance Optimizations for SpMV
- Register blocking (RB): up to 4x speedup over CSR (a sketch of a register-blocked SpMV kernel appears at the end of this transcript).
- Variable block splitting: 2.1x over CSR, 1.8x over RB.
- Diagonal segmenting: 2x over CSR.
- Reordering to create dense structure, plus splitting: 2x over CSR.
- Symmetry: 2.8x over CSR, 2.6x over RB.
- Cache blocking: 2.2x over CSR.
- Multiple vectors: 7x over CSR.
- And combinations ...

Sparse triangular solve
- Hybrid sparse/dense data structure: 1.8x over CSR.

Higher-level kernels
- AA^T*x and A^TA*x: 4x over CSR, 1.8x over RB.
- A^2*x: 2x over CSR, 1.5x over RB.
- Matrix triple products, ...

Complex combinations of dense substructures arise in practice. We are developing tunable data structures and implementations, along with automated techniques for selecting tuning parameters.

Off-line benchmarking characterizes the machine: for r x c register blocking, performance as a function of r and c varies across platforms. (Left) Ultra 3, 1.8 Gflop/s peak. (Right) Itanium 2, 3.6 Gflop/s peak. (Figure color scales: roughly 50-90 Mflop/s on the Ultra 3 and 190-1190 Mflop/s on the Itanium 2.)

Current and Future Work
- Public software release.
- Low-level "Sparse BLAS" primitives.
- Integration with PETSc.
- Integration with DOE applications: SLAC collaboration; geophysical simulation based on Block Lanczos (A^TA*X; LBL).
- New sparse benchmarking effort, with the University of Tennessee.
- Multithreaded and MPI versions: sparse kernels; automatic tuning of MPI collective operations.

Pointers
- Berkeley Benchmarking and Optimization (BeBOP): bebop.cs.berkeley.edu
- Self-Adapting Numerical Software (SANS) effort: icl.cs.utk.edu/sans
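The block-size selection heuristic described above reduces to a few lines of code once the off-line profile and the fill estimates are in hand. The following is a minimal sketch, not the TOPS implementation: it assumes the profile Mflops(r,c) and the estimated fill ratios have already been computed and stored in dense arrays, and the names choose_block_size, R_MAX, and C_MAX are illustrative.

/* Minimal sketch (illustrative, not the TOPS code) of the heuristic
 * block-size selection.  Assumed inputs:
 *   mflops[r-1][c-1] : off-line benchmark rate for an r x c blocked
 *                      dense matrix stored in sparse format (machine-only)
 *   fill[r-1][c-1]   : run-time estimate of the fill ratio for the user's
 *                      matrix (stored entries after blocking / true
 *                      non-zeros, always >= 1.0)
 * Output: the (r, c) maximizing Estimated Mflops = Mflops(r,c) / Fill(r,c). */
#define R_MAX 8
#define C_MAX 8

void choose_block_size(const double mflops[R_MAX][C_MAX],
                       const double fill[R_MAX][C_MAX],
                       int *r_best, int *c_best)
{
    double best_est = 0.0;
    *r_best = 1;
    *c_best = 1;
    for (int r = 1; r <= R_MAX; ++r) {
        for (int c = 1; c <= C_MAX; ++c) {
            /* Estimated Mflops = Mflops(r,c) / Fill(r,c) */
            double est = mflops[r - 1][c - 1] / fill[r - 1][c - 1];
            if (est > best_est) {
                best_est = est;
                *r_best  = r;
                *c_best  = c;
            }
        }
    }
}

Because the fill ratio is at least 1, dividing by it penalizes block sizes that would store many explicit zeros, which is why the combined estimate tracks the true best (r, c) to within about 10%.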
Impact on Applications and Evaluation of Architectures

Potential improvements to Tau3P/T3P/Omega3P, the SciDAC accelerator cavity design applications of Ko et al. at the Stanford Linear Accelerator Center (SLAC): (Left) Reordering matrix rows and columns, based on approximately solving a Traveling Salesman Problem (TSP), improves locality by creating dense block structure (before: green + red; after: green + blue). (Right) Combining TSP reordering, symmetric storage, and register-level blocking leads to uniprocessor speedups of 1.5-3.3x over a naturally ordered, non-symmetric blocked implementation.

Evaluating SpMV performance across architectures: using a combination of analytical modeling of performance bounds and benchmarking tools being developed by SciDAC-PERC, we are studying the impact of architecture on sparse kernel performance.
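To make the register-level blocking used above concrete, here is a minimal sketch of SpMV for 2x2 block compressed sparse row (BCSR) storage. It illustrates the general technique rather than the TOPS or SLAC code: the function and array names are assumptions, and a tuned library would generate one such routine for each candidate r x c block size.

#include <stddef.h>

/* y <- y + A*x for a matrix stored in 2x2 BCSR format: mb block rows,
 * with brow_ptr/bcol_idx indexing 2x2 blocks the way CSR's row_ptr/col_idx
 * index individual entries, and val holding each block contiguously in
 * row-major order.  Keeping the two destination entries in scalar
 * temporaries (registers) across the inner loop is the essence of
 * register blocking. */
void bcsr_2x2_spmv(size_t mb,
                   const size_t *brow_ptr,
                   const size_t *bcol_idx,
                   const double *val,
                   const double *x,
                   double *y)
{
    for (size_t ib = 0; ib < mb; ++ib) {
        double y0 = y[2 * ib];
        double y1 = y[2 * ib + 1];
        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; ++k) {
            const double *b  = &val[4 * k];          /* the 2x2 block     */
            const double *xc = &x[2 * bcol_idx[k]];  /* matching x pair   */
            y0 += b[0] * xc[0] + b[1] * xc[1];
            y1 += b[2] * xc[0] + b[3] * xc[1];
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}

Because the block size is fixed at compile time, the partial sums stay in registers and each stored block needs only one column index instead of four, which is where the speedups over unblocked CSR come from.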

