Autotuning sparse matrix kernels Richard Vuduc Center for Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory April 2, 2007.

Slides:

Advertisements

Similar presentations

Statistical Modeling of Feedback Data in an Automatic Tuning System Richard Vuduc, James Demmel (U.C. Berkeley, EECS) Jeff.

Advertisements

The view from space Last weekend in Los Angeles, a few miles from my apartment…

MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.

Priority Research Direction (I/O Models, Abstractions and Software) Key challenges What will you do to address the challenges? – Develop newer I/O models.

Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.

Introductory Courses in High Performance Computing at Illinois David Padua.

Fall 2011SYSC 5704: Elements of Computer Systems 1 SYSC 5704 Elements of Computer Systems Optimization to take advantage of hardware.

OpenFOAM on a GPU-based Heterogeneous Cluster

High Performance Computing The GotoBLAS Library. HPC: numerical libraries  Many numerically intensive applications make use of specialty libraries to.

POSKI: A Library to Parallelize OSKI Ankit Jain Berkeley Benchmarking and OPtimization (BeBOP) Project bebop.cs.berkeley.edu EECS Department, University.

Benchmarking Sparse Matrix-Vector Multiply In 5 Minutes Hormozd Gahvari, Mark Hoemmen, James Demmel, and Kathy Yelick January 21, 2007 Hormozd Gahvari,

Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.

Automatic Performance Tuning of Sparse Matrix Kernels Observations and Experience Performance tuning is tedious and time- consuming work. Richard Vuduc.

Avoiding Communication in Sparse Iterative Solvers Erin Carson Nick Knight CS294, Fall 2011.

Optimization of Sparse Matrix Kernels for Data Mining Eun-Jin Im and Katherine Yelick U.C.Berkeley.

Java for High Performance Computing Jordi Garcia Almiñana 14 de Octubre de 1998 de la era post-internet.

03/09/2007CS267 Lecture 161 CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers Kathy Yelick

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Automatic Performance Tuning of Sparse Matrix Kernels Berkeley Benchmarking and OPtimization (BeBOP) Project James.

Automatic Performance Tuning Sparse Matrix Kernels James Demmel

Carnegie Mellon Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic Lawrence J. Chang Inpyo Hong Yevgen Voronenko Markus Püschel Department.

Automatic Performance Tuning of Sparse Matrix Kernels: Recent Progress Jim Demmel, Kathy Yelick Berkeley Benchmarking and OPtimization (BeBOP) Project.

Applying Data Copy To Improve Memory Performance of General Array Computations Qing Yi University of Texas at San Antonio.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.

Tool Integration and Autotuning for SUPER Performance Optimization Allen D. Malony, Nick ChaimovUniversity of Oregon Mary HallUniversity of Utah Jeff HollingsworthUniversity.

Minimizing Communication in Numerical Linear Algebra Sparse-Matrix-Vector-Multiplication (SpMV) Jim Demmel EECS & Math Departments,

Michelle Mills Strout OpenAnalysis: Representation- Independent Program Analysis CCA Meeting January 17, 2008.

Overview of the Course Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University.

Compiler BE Panel IDC HPC User Forum April 2009 Don Kretsch Director, Sun Developer Tools Sun Microsystems.

PMaC Performance Modeling and Characterization Efficient HPC Data Motion via Scratchpad Memory Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua.

CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.

November 13, 2006 Performance Engineering Research Institute 1 Scientific Discovery through Advanced Computation Performance Engineering.

UPC Applications Parry Husbands. Roadmap Benchmark small applications and kernels —SPMV (for iterative linear/eigen solvers) —Multigrid Develop sense.

Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.

Supercomputing Cross-Platform Performance Prediction Using Partial Execution Leo T. Yang Xiaosong Ma* Frank Mueller Department of Computer Science.

PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.

System Software for Parallel Computing. Two System Software Components Hard to do the innovation Replacement for Tradition Optimizing Compilers Replacement.

Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences

HPC User Forum Back End Compiler Panel SiCortex Perspective Kevin Harris Compiler Manager April 2009.

Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures Ankit Jain, Vasily Volkov CS252 Final Presentation 5/9/2007

Manno, , © by Supercomputing Systems 1 1 COSMO - Dynamical Core Rewrite Approach, Rewrite and Status Tobias Gysi POMPA Workshop, Manno,

Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.

IMP: Indirect Memory Prefetcher

1 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004.

Monte Carlo Linear Algebra Techniques and Their Parallelization Ashok Srinivasan Computer Science Florida State University

Empirical Optimization. Context: HPC software Traditional approach  Hand-optimized code: (e.g.) BLAS  Problem: tedious to write by hand Alternatives:

C OMPUTATIONAL R ESEARCH D IVISION 1 Defining Software Requirements for Scientific Computing Phillip Colella Applied Numerical Algorithms Group Lawrence.

Parallel Programming & Cluster Computing Linear Algebra Henry Neeman, University of Oklahoma Paul Gray, University of Northern Iowa SC08 Education Program’s.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.

Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.

Michael J. Voss and Rudolf Eigenmann PPoPP, ‘01 (Presented by Kanad Sinha)

Tools and Libraries for Manycore Computing Kathy Yelick U.C. Berkeley and LBNL.

Defining the Competencies for Leadership- Class Computing Education and Training Steven I. Gordon and Judith D. Gardiner August 3, 2010.

Monte Carlo Linear Algebra Techniques and Their Parallelization Ashok Srinivasan Computer Science Florida State University

A few words on locality and arrays

Optimizing the Performance of Sparse Matrix-Vector Multiplication

University of California, Berkeley

Code Optimization.

Richard Dorrance Literature Review: 1/11/13

Sparse Matrix-Vector Multiplication (Sparsity, Bebop)

Performance Engineering Research Institute (DOE SciDAC)

for more information ... Performance Tuning

VSIPL++: Parallel Performance HPEC 2004

Rohan Yadav and Charles Yuan (rohany) (chenhuiy)

Presentation transcript:

Autotuning sparse matrix kernels Richard Vuduc Center for Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory April 2, 2007

View from space Riddle: Who am I? I am a Big Repository Of useful And useless Facts alike.

View from space Why “autotune” sparse kernels?  Sparse kernels abound, but usually perform poorly  Physical & economic models, PageRank, …  Sparse matrix-vector multiply (SpMV) <= 10% peak, decreasing  Performance challenges  Indirect, irregular memory access  Low computational intensity vs. dense linear algebra  Depends on matrix (run-time) and machine  Claim: Need automatic performance tuning  2  speedup from tuning, will increase  Manual tuning is difficult, getting harder  Tune target app, input, machine using automated experiments

View from space This talk: Key ideas and conclusions  OSKI tunes sparse kernels automatically  Off-line benchmarking + empirical run-time models  Exploit structure aggressively  SpMV: 4  speedups, 31% peak  Higher-level kernels: A T A·x, A k ·x, …  Application impact  Autotuning compiler? (very early-stage work)  Different talks: Other work in high-perf. computing (HPC)  Analytical and statistical performance modeling  Compiler-based domain-specific optimization  Parallel debugging using static and dynamic analysis

A case for sparse kernel autotuning Outline  A case for sparse kernel autotuning  Trends?  Why tune?  OSKI  Autotuning compiler  Summary and future directions

A case for sparse kernel autotuning SpMV crash course: Compressed Sparse Row (CSR) storage  Matrix-vector multiply: y = A*x  for all A(i, j): y(i) = y(i) + A(i, j) * x(j) Dominant cost: Compress? Irregular, indirect: x[ind[…]] “Regularize?”

A case for sparse kernel autotuning Experiment: How hard is SpMV tuning?  Exploit 8  8 blocks  Compress data  Unroll block multiplies  Regularize accesses  As r  c , speed 

A case for sparse kernel autotuning Speedups on Itanium 2: The need for search Reference Mflop/s (7.6%) Mflop/s (31.1%) Best: 4  2

A case for sparse kernel autotuning SpMV Performance—raefsky3

A case for sparse kernel autotuning SpMV Performance—raefsky3

A case for sparse kernel autotuning Better, worse, or about the same?

A case for sparse kernel autotuning Better, worse, or about the same? Itanium 2, 900 MHz  1.3 GHz * Reference improves * * Best possible worsens slightly *

A case for sparse kernel autotuning Better, worse, or about the same? Power4  Power5 * Reference worsens! * * Relative importance of tuning increases *

A case for sparse kernel autotuning Better, worse, or about the same? Pentium M  Core 2 Duo (1-core) * Reference & best improve; relative speedup improves (~1.4 to 1.6  ) * * Best decreases from 11% to 9.6% of peak *

A case for sparse kernel autotuning More complex structures in practice  Example: 3  3 blocking  Logical grid of 3  3 cells

A case for sparse kernel autotuning Extra work can improve efficiency!  Example: 3  3 blocking  Logical grid of 3  3 cells  Fill-in explicit zeros  Unroll 3x3 block multiplies  “Fill ratio” = 1.5  On Pentium III: 1.5   i.e., 2/3 time

A case for sparse kernel autotuning Trends: My predictions from 2003  Need for “autotuning” will increase over time  So kindly approve my dissertation topic  Example: SpMV, 1987 to present  Untuned: 10% of peak or less, decreasing  Tuned: 2  speedup, increasing over time  Tuning is getting harder (qualitative)  More complex machines & workloads  Parallelism

A case for sparse kernel autotuning Trends in uniprocessor SpMV performance (Mflop/s), pre-2004

A case for sparse kernel autotuning Trends in uniprocessor SpMV performance (fraction of peak)

A case for sparse kernel autotuning Summary: A case for sparse kernel autotuning  “Trends” for SpMV  10% of peak, decreasing  2x speedup, increasing  No guarantees with new generation systems  Manual tuning is “hard”  Sparse differs from dense  Not LINPACK!  Data structure overheads  Indirect, irregular memory access  Run-time dependence (i.e., matrix)

Outline  A case for sparse kernel autotuning  OSKI  Tuning is hard—how to automate?  Autotuning compiler  Summary and future directions

The view from space Basic approach to autotuning  High-level idea: Given kernel, machine…  Identify “space” of candidate implementations  Generate implementations in the space  Choose (search for) fastest by modeling and/or experiments  Example: Dense matrix multiply  Space = {Impl(R, C)}, where R x C is cache block size  Measure time to run Impl(R, C) for all R, C  Perform this expensive search once per machine  Successes: ATLAS/PHiPAC, FFTW, SPIRAL, …  SpMV: Depends on run-time input (matrix)!

OSKI OSKI: Optimized Sparse Kernel Interface  Autotuned kernels for user’s matrix & machine  a la PHiPAC/ATLAS, FFTW, SPIRAL, …  BLAS-style interface: mat-vec (SpMV), tri. solve (TrSV), …  Hides complexity of run-time tuning  Includes fast locality-aware kernels: A T A·x, A k ·x, …  Fast in practice  Standard SpMV < 10% peak, vs. up to 31% with OSKI  Up to 4  faster SpMV, 1.8  triangular solve, 4x A T A·x, …  For “advanced” users & solver library writers  OSKI-PETSc; Trilinos (Heroux)  Adopted by ClearShape, Inc. for shipping product (2  speedup)

OSKI How OSKI tunes (Overview) Library Install-Time (offline) Application Run-Time Benchmark data 1. Build for Target Arch. 2. Benchmark Generated code variants Heuristic models 1. Evaluate Models Workload from program monitoring History Matrix 2. Select Data Struct. & Code To user: Matrix handle for kernel calls

OSKI Heuristic model example: Select block size  Idea: Hybrid off-line / run-time model  Characterize machine with off-line benchmark  Precompute Mflops(r, c) using dense matrix for all r, c  Once per machine  Estimate matrix properties at run-time  Sample A to estimate Fill(r, c)  Run-time “search”  Select r, c to maximize Mflops(r, c) / Fill(r, c)  Run-time costs ~ 40 SpMVs  80%+ = time to convert to new r  c format

OSKI Accuracy of the Tuning Heuristics (1/4) NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”) DGEMV

OSKI Accuracy of the Tuning Heuristics (2/4) DGEMV NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)

OSKI Tunable optimization techniques  Optimizations for SpMV  Register blocking (RB): up to 4  over CSR  Variable block splitting: 2.1  over CSR, 1.8  over RB  Diagonals: 2  over CSR  Reordering to create dense structure + splitting: 2  over CSR  Symmetry: 2.8  over CSR, 2.6  over RB  Cache blocking: 3  over CSR  Multiple vectors (SpMM): 7  over CSR  And combinations…  Sparse triangular solve  Hybrid sparse/dense data structure: 1.8  over CSR  Higher-level kernels  AA T ·x or A T A·x: 4  over CSR, 1.8  over RB  A 2 ·x: 2  over CSR, 1.5  over RB

The view from space Structural splitting for complex patterns  Idea: Split A = A 1 + A 2 + …, and tune A i independently  Sample to detect “canonical” structures  Saves time and/or storage (avoid fill)

The view from space Example: Variable Block Row (Matrix #12) 2.1  over CSR 1.8  over RB

The view from space Example: Row-segmented diagonals 2  over CSR

The view from space Dense sub-triangles for triangular solve Dense trailing triangle: dim=2268, 20% of total nz Can be as high as 90+%!  Solve Tx = b for x, T triangular  Raefsky4 (structural problem) + SuperLU + colmmd  N=19779, nnz=12.6 M

The view from space  Idea: Interleave multiplication by A, A T  Combine with register optimizations: a i = r  c block row Cache optimizations for AA T ·x dot product “axpy”

OSKI OSKI tunes for workloads  Bi-conjugate gradients - equal mix of A·x and A T ·y  3  1: A·x, A T ·y = 1053, 343 Mflop/s  517 Mflop/s  3  3: A·x, A T ·y = 806, 826 Mflop/s  816 Mflop/s  Higher-level operation - (A·x, A T ·y) kernel  3  1: 757 Mflop/s  3  3: 1400 Mflop/s  Workload tuning  Evaluate weighted sums of empirical models  Dynamic programming to evaluate alternatives

The view from space Matrix powers kernel  Idea: Serial sparse tiling (Strout, et al.)  Extend for arbitrary A, combine other techniques (e.g., blocking)  Compute dependences, layer new data structure on (B)CSR  Matrix powers (A k ·x) with data structure transformations  A 2 ·x: up to 2  faster  New latency-tolerant solvers? (Hoemmen’s thesis at Berkeley)

The view from space Example: A 2 ·x  Let A be 5  5 tridiagonal  Consider y = A 2 ·x  t = A·x, y = A·t  Nodes: vector elements  Edges: matrix elements a ij y1y1 y2y2 y3y3 y4y4 y5y5 t1t1 t2t2 t3t3 t4t4 t5t5 x1x1 x2x2 x3x3 x4x4 x5x5 a 11 a 12

The view from space Example: A 2 ·x  Let A be 5  5 tridiagonal  Consider y = A 2 ·x  t=A·x, y=A·t  Nodes: vector elements  Edges: matrix elements a ij  Orange = everything needed to compute y 1  Reuse a 11, a 12 y1y1 y2y2 y3y3 y4y4 y5y5 t1t1 t2t2 t3t3 t4t4 t5t5 x1x1 x2x2 x3x3 x4x4 x5x5 a 11 a 12

The view from space Example: A 2 ·x  Let A be 5  5 tridiagonal  Consider y = A 2 ·x  t=A·x, y=A·t  Nodes: vector elements  Edges: matrix elements a ij  Orange = everything needed to compute y 1  Reuse a 11, a 12  Grey = y 2, y 3  Reuse a 23, a 33, a 43 y1y1 y2y2 y3y3 y4y4 y5y5 t1t1 t2t2 t3t3 t4t4 t5t5 x1x1 x2x2 x3x3 x4x4 x5x5

OSKI Examples of OSKI’s early impact  Integrating into major linear solver libraries  PETSc  Trilinos – R&D100 (Heroux)  Early adopter: ClearShape, Inc.  Core product: lithography process simulator  2  speedup on full simulation after using OSKI  Proof-of-concept: SLAC T3P accelerator design app  SpMV dominates execution time  Symmetry, 2  2 block structure  2  speedups

OSKI OSKI-PETSc Performance: Accel. Cavity

Outline  A case for sparse kernel autotuning  OSKI  Autotuning compiler  Sparse kernels aren’t my bottleneck…  Summary and future directions

An autotuning compiler Strengths and limits of the library approach  Strengths  Isolates optimization in the library for portable performance  Exploits domain-specific information aggressively  Handles run-time tuning naturally  Limitations  “Generation Me”: What about my application?  Run-time tuning: run-time overheads  Limited context for optimization (w/o delayed evaluation)  Limited extensibility (fixed interfaces)

An autotuning compiler A framework for performance tuning Source: SciDAC Performance Engineering Research Institute (PERI)

An autotuning compiler OSKI’s place in the tuning framework

An autotuning compiler An empirical tuning framework using ROSE gprof, HPCtoolkit Open SpeedShop POET Search engine Empirical Tuning Framework using ROSE

An autotuning compiler What is ROSE?  Research: Optimize use of high-level abstractions  Lead: Dan Quinlan at LLNL  Target DOE apps  Study application-specific analysis and optimization  Extend traditional compiler optimizations to abstraction use  Performance portability via empirical tuning  Infrastructure: Tool for building source-to-source tools  Full compiler: basic analysis, loop optimizer, OpenMP  Support for C, C++; Fortran 90 in progress  Targets “non-compiler audience”  Open-source

An autotuning compiler A compiler-based autotuning framework  Guiding philosophy  Leverage external stand-alone components  Provide open components and tools for community  User or “system” profiles to collect data and/or analyses  In ROSE  Profile-based analysis to find targets  Extract (“outline”) targets  Make “benchmark” using checkpointing  Generate parameterized target (not yet automated)  Independent search engine performs search

An autotuning compiler Case study: Loop optimizations for SMG2000  SMG2000, implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)  Residual computation has an SpMV bottleneck  Loop below looks simple but non-trivial to extract for (si = 0; si < NS; ++si) for (k = 0; k < NZ; ++k) for (j = 0; j < NY; ++j) for (i = 0; i < NX; ++i) r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]] * x[i + j*JX + k*KX + Sx[si]]

An autotuning compiler Before transformation for (si = 0; si < NS; si++) /* Loop1 */ for (kk = 0; kk < NZ; kk++) { /* Loop2 */ for (jj = 0; jj < NY; jj++) { /* Loop3 */ for (ii = 0; ii < NX; ii++) { /* Loop4 */ r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]] * x[ii + jj*JA + kk*KA + SA[si]]; } /* Loop4 */ } /* Loop3 */ } /* Loop2 */ } /* Loop1 */

An autotuning compiler After transformation, including interchange, unrolling, and prefetching for (kk = 0; kk < NZ; kk++) { /* Loop2 */ for (jj = 0; jj < NY; jj++) { /* Loop3 */ for (si = 0; si < NS; si++) { /* Loop1 */ double* rp = r + kk*Kr + jj*Jr; const double* Ap = A + kk*KA + jj*JA + SA[si]; const double* xp = x + kk*Kx + jj*Jx + Sx[si]; for (ii = 0; ii <= NX-3; ii += 3) { /* core Loop4 */ _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA); _mm_prefetch (xp + PFD_X, _MM_HINT_NTA); rp[0] -= Ap[0] * xp[0]; rp[1] -= Ap[1] * xp[1]; rp[2] -= Ap[2] * xp[2]; rp += 3; Ap += 3; xp += 3; } /* core Loop4 */ for ( ; ii < NX; ii++) { /* fringe Loop4 */ rp[0] -= Ap[0] * xp[0]; rp++; Ap++; xp++; } /* fringe Loop4 */ } /* Loop1 */ } /* Loop3 */ } /* Loop2 */

An autotuning compiler Loop optimizations for SMG2000  2x speedup on kernel from specialization, loop interchange, unrolling, prefetching  But only 1.25  overall—multiple bottlenecks  Lesson: Need complex sequences of transformations  Use profiling to guide  Inspect run-time data for specialization  Transformations are automatable

An autotuning compiler Generating parameterized representations (w/ Q. Yi at UTSA)  Generate parameterized representation of target  POET: Embedded scripting language for expressing parameterized code variations [see IPDPS/POHLL’07]  ROSE’s loop optimizer will generate POET for each target  Hand-coded POET for SMG2000  Interchange  Machine-specific: Unrolling, prefetching  Source-specific: register & restrict keywords, C pointer idiom  Related: New parameterization for loop fusion [w/ Zhao, Kennedy : Rice, Yi : UTSA]

Outline  A case for sparse kernel autotuning  OSKI  Autotuning compiler  Summary and future directions

View from space (revisited) General theme: Aggressively exploit structure  Application- and architecture-specific optimization  E.g., Sparse matrix patterns  Robust performance in spite of architecture-specific pecularities  Augment static models with benchmarking and search  Short-term OSKI extensions  Integrate into large-scale apps  Accelerator design, plasma physics (DOE)  Geophysical simulation based on Block Lanczos ( A T A*X ; LBL)  Evaluate emerging architectures (e.g., vector micros, CELL)  Other kernels: Matrix triple products  Parallelism (OSKI-PETSc)

View from space (revisited) Current and future research directions  Autotuning: Robust portable performance?  End-to-end autotuning compiler framework  New domains: PageRank, cryptokernels  Tuning for novel architectures (e.g., multicore)  Tools for generating domain-specific libraries  Performance modeling: Limits and insight?  Kernel- and machine-specific analytical and statistical models  Hybrid symbolic/empirical modeling  Implications for applications and architectures?  Debugging massively parallel applications  JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]  Static/dynamic analyses for debugging MPI

End