Performance Tuning

Presentation transcript:

TOPS is providing applications with highly efficient implementations of common sparse matrix computational kernels, automatically tuned for a user's kernel, matrix, and machine.

Trends and the Need for Automatically Tuned Sparse Kernels

Less than 10% of peak: Typical untuned sparse matrix-vector multiply (SpMV) performance, even for the best reference implementation in compressed sparse row (CSR) format, is below 10% of peak on modern cache-based superscalar machines. With careful tuning, 2x speedups and 30% of peak or more are possible.

The optimal choice of tuning parameters can be surprising: (Left) A matrix that naturally contains 8x8 dense blocks. (Right) On an Itanium 2, the optimal block size of 4x2 achieves 1.1 Gflop/s (31% of peak) and is over 4x faster than the conventional unblocked (1x1) implementation.

Extra work can improve performance: Filling in explicit zeros (shown as x in the figure) followed by 3x3 blocking increases the number of flops by 1.5x for this matrix, yet SpMV still runs about 1.5x faster than the unblocked version on a Pentium III, because the raw speed in Mflop/s increases by 2.25x (2.25 / 1.5 = 1.5).

Search-based Methodology for Automatic Performance Tuning

Approach to automatic tuning:
- Identify and generate a space of implementations.
- Search this space using empirical models and experiments.

Example: choosing an r x c block size.
- Off-line benchmark [machine]: measure Mflops(r,c) for a dense matrix stored in sparse r x c blocked format.
- Run-time search [matrix]: estimate Fill(r,c), the ratio of stored entries (including explicit zeros) to true non-zeros, for all r, c.
- Heuristic model [combine]: choose r, c to maximize Estimated Mflops = Mflops(r,c) / Fill(r,c).
This heuristic yields performance within 10% of the best r, c.

Performance Optimizations for SpMV
- Register blocking (RB): up to 4x speedups over CSR
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonal segmenting: 2x over CSR
- Reordering to create dense structure + splitting: 2x over CSR
- Symmetry: 2.8x over CSR, 2.6x over RB
- Cache blocking: 2.2x over CSR
- Multiple vectors: 7x over CSR
- And combinations…

Sparse triangular solve
- Hybrid sparse/dense data structure: 1.8x over CSR

Higher-level kernels
- A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
- A^2*x: 2x over CSR, 1.5x over RB
- Matrix triple products, …

Complex combinations of dense substructures arise in practice [figure: an example matrix in which a dense substructure holds 90% of the non-zeros]. We are developing tunable data structures and implementations, together with automated techniques for selecting the tuning parameters.

Off-line benchmarking characterizes the machine: For r x c register blocking, performance as a function of r and c varies across platforms. (Left) Ultra 3, 1.8 Gflop/s peak (register profile spanning roughly 50–90 Mflop/s). (Right) Itanium 2, 3.6 Gflop/s peak (roughly 190–1190 Mflop/s).
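For concreteness, the untuned CSR baseline ("best reference") referred to above is essentially the following loop. This is a minimal illustrative sketch under our own naming, not the BeBOP code.

```c
/* Reference SpMV, y += A*x, with A in compressed sparse row (CSR) format.
 *   m        : number of rows
 *   ptr[0..m]: row pointers into ind[] and val[]
 *   ind[k]   : column index of the k-th stored non-zero
 *   val[k]   : value of the k-th stored non-zero
 */
void spmv_csr(int m, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];   /* indirect, irregular access to x */
        y[i] = yi;
    }
}
```

Each matrix entry is used exactly once, so the kernel is memory-bound, and the indirect access to x defeats many compiler optimizations, which is a large part of why untuned performance sits so far below peak.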
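Register blocking, the first optimization in the list above, stores the matrix as small dense r x c blocks (BCSR) so that each block multiply can be fully unrolled and entries of x and y reused from registers. A sketch for a fixed 2x2 block size, again with illustrative names rather than the actual tuned code:

```c
/* Register-blocked SpMV, y += A*x, for a fixed 2x2 block size (BCSR).
 *   mb          : number of block rows (the matrix has 2*mb scalar rows)
 *   bptr[0..mb] : block-row pointers into bind[] and bval[]
 *   bind[k]     : starting column of the k-th 2x2 block
 *   bval[4k..]  : values of the k-th block, stored row-major
 */
void spmv_bcsr_2x2(int mb, const int *bptr, const int *bind,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];     /* keep y in registers   */
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b = &bval[4 * k];
            int j = bind[k];
            double x0 = x[j], x1 = x[j + 1];         /* reuse x across 2 rows */
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}
```

Blocks that are not fully dense in the original matrix are padded with explicit zeros; that extra work is exactly what Fill(r,c) measures in the heuristic. An auto-tuner generates one such kernel per candidate r x c and selects among them.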
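The block-size heuristic from the search-based methodology can be sketched as below. Here mflops_dense[][] stands for the off-line benchmark of a dense matrix in sparse r x c format and estimate_fill() for the run-time fill estimator; both are hypothetical placeholders, not actual BeBOP/OSKI interfaces.

```c
#define RMAX 8
#define CMAX 8

/* Off-line benchmark [machine]: Mflop/s of blocked SpMV on a dense matrix
 * stored in sparse r x c format, measured once per machine. */
extern double mflops_dense[RMAX + 1][CMAX + 1];

/* Run-time estimate [matrix]: Fill(r,c) = stored entries (including explicit
 * zeros) divided by true non-zeros, estimated from a sample of the pattern. */
extern double estimate_fill(int r, int c);

/* Choose r, c maximizing Mflops(r,c) / Fill(r,c), per the heuristic model. */
void choose_block_size(int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = 1;
    *c_best = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            double est = mflops_dense[r][c] / estimate_fill(r, c);
            if (est > best) {
                best = est;
                *r_best = r;
                *c_best = c;
            }
        }
    }
}
```

Splitting the work this way keeps run-time tuning cheap: the expensive benchmarking is done once per machine, and only the inexpensive fill estimate depends on the user's matrix.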
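The symmetry optimization stores only the diagonal and upper triangle and lets each stored off-diagonal entry update two result entries, roughly halving the memory traffic for the matrix. A scalar (unblocked) sketch of the idea, under the same assumptions as the CSR example; the tuned versions combine it with register blocking:

```c
/* SpMV, y += A*x, for a symmetric A of which only the diagonal and upper
 * triangle are stored in CSR form (ind[k] >= i within row i). */
void spmv_sym_upper(int m, const int *ptr, const int *ind,
                    const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            int j = ind[k];
            double a = val[k];
            yi += a * x[j];            /* contribution of A(i,j) to y[i] */
            if (j != i)
                y[j] += a * x[i];      /* mirrored contribution of A(j,i) */
        }
        y[i] = yi;
    }
}
```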
Impact on Applications and Evaluation of Architectures

Potential improvements to Tau3P/T3P/Omega3P, the SciDAC accelerator cavity design applications of Ko et al. at the Stanford Linear Accelerator Center (SLAC): (Left) Reordering the matrix rows and columns, based on approximately solving a Traveling Salesman Problem (TSP), improves locality by creating dense block structure [figure: matrix structure before (green + red) and after (green + blue) reordering]. (Right) Combining TSP reordering, symmetric storage, and register-level blocking leads to uniprocessor speedups of 1.5–3.3x over a naturally ordered, non-symmetric blocked implementation.

Evaluating SpMV performance across architectures: Using a combination of analytical modeling of performance bounds and benchmarking tools being developed by SciDAC-PERC, we are studying the impact of architecture on sparse kernel performance.

Current and Future Work
- Public software release
  - Low-level "Sparse BLAS" primitives
  - Integration with PETSc
- Integration with DOE applications
  - SLAC collaboration
  - Geophysical simulation based on Block Lanczos (A^T*A*X; LBL)
- New sparse benchmarking effort, with the University of Tennessee
- Multithreaded and MPI versions
  - Sparse kernels
  - Automatic tuning of MPI collective operations

Pointers
- Berkeley Benchmarking and OPtimization (BeBOP): bebop.cs.berkeley.edu
- Self-Adapting Numerical Software (SANS) Effort: icl.cs.utk.edu/sans

For more information: http://www.tops-scidac.org