The view from space Last weekend in Los Angeles, a few miles from my apartment…

The view from space Estimating fill accurately and efficiently. Idea: sample the matrix; the fraction of the matrix to sample is s ∈ [0,1], so the cost is ~O(s · nnz). Control the run-time cost by controlling s, and choose s automatically by monitoring the variance (statistical confidence intervals). Cost of tuning: the lower bound is the conversion itself, costing 5 to 40 unblocked SpMVs; the heuristic costs 1 to 11 SpMVs. A sketch of such a sampler follows.
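
A minimal sketch of such a sampler (a hypothetical helper, not OSKI's actual routine), assuming CSR storage: scan a random fraction s of the block rows for an r x c blocking and count the blocks that would be instantiated; fringe rows are ignored for brevity.

#include <stdlib.h>

/* Estimated fill ratio = (stored entries under r x c blocking) / (true nnz),
   computed from a sampled fraction s of the block rows. */
double estimate_fill (int m, int n, const int *ptr, const int *ind,
                      int r, int c, double s)
{
    long blocks = 0, nnz_seen = 0;
    char *seen = calloc ((n + c - 1) / c, 1);   /* marks touched block columns */
    for (int I = 0; I < m / r; ++I) {
        if ((double) rand () / RAND_MAX >= s) continue;  /* sample w.p. s */
        int nbc = 0;                            /* distinct block cols in block row I */
        for (int i = I * r; i < (I + 1) * r; ++i)
            for (int k = ptr[i]; k < ptr[i + 1]; ++k) {
                int J = ind[k] / c;
                if (!seen[J]) { seen[J] = 1; ++nbc; }
                ++nnz_seen;
            }
        blocks += nbc;
        for (int i = I * r; i < (I + 1) * r; ++i)   /* reset marks */
            for (int k = ptr[i]; k < ptr[i + 1]; ++k)
                seen[ind[k] / c] = 0;
    }
    free (seen);
    return nnz_seen ? (double) blocks * r * c / nnz_seen : 1.0;
}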

The view from space Empirical model evaluation. Tuning loop: compute a tuning time budget based on the workload; while time remains and no tuning has been chosen, try a heuristic. Heuristic for blocked SpMV: choose the r x c that minimizes the estimated run time (a sketch of this loop follows). Tuning for workloads: weighted sums of empirical models, with dynamic programming to choose among alternatives. Example: combined y = A^T·A·x vs. separate (w = A·x, y = A^T·w).
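
A hedged sketch of that tuning loop, under the assumption that a per-machine profile time_per_nz[r][c] (seconds per stored nonzero for r x c blocked SpMV) comes from an off-line benchmark; estimate_fill is the sampler sketched above, and all names are illustrative.

#include <time.h>

extern double estimate_fill (int m, int n, const int *ptr, const int *ind,
                             int r, int c, double s);
extern double time_per_nz[9][9];   /* assumed off-line benchmark data */

/* Pick the r x c minimizing predicted time = fill * nnz * time_per_nz[r][c],
   stopping once the tuning time budget is exhausted. */
void choose_blocking (int m, int n, const int *ptr, const int *ind,
                      long nnz, double budget_seconds,
                      int *best_r, int *best_c)
{
    clock_t t0 = clock ();
    double best = 1e300;
    *best_r = *best_c = 1;
    for (int r = 1; r <= 8; ++r)
        for (int c = 1; c <= 8; ++c) {
            if ((double) (clock () - t0) / CLOCKS_PER_SEC > budget_seconds)
                return;                     /* budget spent: keep best so far */
            double est = estimate_fill (m, n, ptr, ind, r, c, 0.02)
                       * nnz * time_per_nz[r][c];
            if (est < best) { best = est; *best_r = r; *best_c = c; }
        }
}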

The view from space The cost of tuning. Non-trivial run-time cost: up to ~40 mat-vecs, dominated by conversion time (~80%). Design point: the user calls the tune routine explicitly, which exposes the cost. Tuning time is limited using an estimated workload, provided by the user or inferred by the library. The user may save tuning results, stored in a human-readable format, to apply on future runs with a similar matrix.

The view from space Related Work. Code generation: generative and generic programming; sparse compilers; domain-specific generators. Empirical search-based tuning: kernel-centric (linear algebra, signal processing, sorting, MPI, …); compiler-centric (profiling + FDO, iterative compilation, superoptimizers, autotuning compilers, continuous program optimization). Tuning-free cache-oblivious algorithms.

The view from space Bug hunting in MPI programs. Motivation: MPI is a large, complex API. Bug-pattern detectors check basic API usage, adapting existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC'05. Tasks requiring deeper program analysis: properly matched sends/receives, barriers, and collectives; buffer errors, e.g., overruns or reads before a non-blocking op completes; temporal usage properties. See the error survey by DeSouza, Kuhn, & de Supinski '05. Extend existing analyses by Shires, et al., PDPTA'99, and Strout, et al., ICPP'06.

The view from space Outline Motivation OSKI: An autotuned sparse kernel library Application-specific optimization in the wild Toward end-to-end application autotuning Summary and future work

The view from space Tour of application-specific optimizations. Five case studies with common characteristics: complex code; heavy use of abstraction; use of generated code (e.g., SWIG C++/Python bindings); benefit from extensive code and data restructuring; multiple bottlenecks.

The view from space [1] Loop transformations for SMG2000. SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark). Its residual computation has an SpMV bottleneck. The loop below looks simple but is non-trivial to extract:

for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                            * x[i + j*JX + k*KX + Sx[si]];

The view from space [1] Before transformation:

for (si = 0; si < NS; si++) {         /* Loop1 */
  for (kk = 0; kk < NZ; kk++) {       /* Loop2 */
    for (jj = 0; jj < NY; jj++) {     /* Loop3 */
      for (ii = 0; ii < NX; ii++) {   /* Loop4 */
        r[ii + jj*Jr + kk*Kr] -=
            A[ii + jj*JA + kk*KA + SA[si]]
          * x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop4 */
    } /* Loop3 */
  } /* Loop2 */
} /* Loop1 */

The view from space [1] After transformation, including interchange, unrolling, and prefetching:

for (kk = 0; kk < NZ; kk++) {       /* Loop2 */
  for (jj = 0; jj < NY; jj++) {     /* Loop3 */
    for (si = 0; si < NS; si++) {   /* Loop1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core Loop4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop4 */
      for ( ; ii < NX; ii++) {              /* fringe Loop4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop4 */
    } /* Loop1 */
  } /* Loop3 */
} /* Loop2 */

The view from space [1] Loop transformations for SMG2000. 2x speedup on the kernel from specialization, loop interchange, unrolling, and prefetching, but only 1.25x overall: multiple bottlenecks. Lesson: complex sequences of transformations are needed; use profiling to guide them and inspect run-time data for specialization. The transformations are automatable. Research topic: automated specialization of hypre?

The view from space [1] SMG2000 demo

The view from space [2] Slicing and dicing 3P. Accelerator design code from SLAC. calcBasis() is very expensive, and scaling problems appear as |Eigensystem| grows. In principle, loop interchange or precomputation via slicing is possible:

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    b = calcBasis (elem)
    f = calcField (b, mode)

The view from space [2] Slicing and dicing 3P. Challenges in practice: the loop nest is 500+ LOC, with 150+ LOC to reach calcBasis(); calcBasis() sits in a 6-deep call chain, a 4-deep loop nest, and 2 conditionals; there is file I/O; and changes must be unobtrusive.

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    // {  …
    b = calcBasis (elem)
    // }
    f = calcField (b, mode)
writeDataToFiles (…);

The view from space [2] 3P: Impact and lessons. 4-5x speedup for the post-processing step; 1.5x overall. The changes were checked in. Lesson: clean source-level transformations are needed; automating them requires robust program analysis and developer guidance. Research: an annotation framework for developers [w/ Quinlan, Schordan, Yi: POHLL'06].

The view from space [3] Structure splitting. Convert an array of structs into a struct of arrays to improve spatial locality through increased stride-1 accesses, making the code hardware-prefetch and vector/SIMD-unit friendly.

Before:
struct Type {
  double p;
  double x, y, z;
  double E;
  int k;
} X[N], Y[N];

for (i = 0; i < N; i++)
  Y[i].E += Y[X[i].k].p;

After:
double Xp[N];
double Xx[N], Xy[N], Xz[N];
double XE[N];
int Xk[N];
/* … same for Y … */

for (i = 0; i < N; i++)
  YE[i] += Yp[Xk[i]];

The view from space [3] Structure splitting: Impact and challenges. 2x speedup on a KULL benchmark (suggested by Brian Miller). Implementation challenges: it potentially affects the entire code; it can be applied only locally, at a cost (extra storage, overhead of copying); and it is tedious to do by hand. Lesson: extensive data restructuring may be necessary. Research: when and how best to split?

The view from space [4] Finding a loop-fusion needle in a haystack. An interprocedural loop-fusion finder [w/ B. White, Cornell U.]; a known example had a 2x speedup on a benchmark (Miller). Built an abstraction-aware analyzer using ROSE: a first pass associates loop signatures with each function; a second pass propagates signatures through call chains.

for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
  for (Corner::iterator c = (*z).corners().begin (); …)
    for (int s = 0; s < (*c).sides().size(); s++)
      …

The view from space [4] Finding a loop-fusion needle in a haystack. Found 6 examples of 3- and 4-deep nested loops. It is an analysis-only tool: it finds candidates, though it does not verify or transform them. Lesson: classical optimizations remain relevant to abstraction use. Research: recognizing and optimizing abstractions [White's thesis, on-going]; extending traditional optimizations to abstraction use.

The view from space [5] Aggregating messages (on-going). Idea: merge sends (suggested by Miller). Implementing a fully automated translator to find and transform. Research: when and how best to aggregate?

Before:
DataType A;
// … operations on A …
A.allToAll();
// …
DataType B;
// … operations on B …
B.allToAll();

After:
DataType A;
// … operations on A …
// …
DataType B;
// … operations on B …
bulkAllToAll(A, B);

The view from space Summary of application-specific optimizations. Like the library-based approach, these exploit knowledge for big gains: guidance from the developer and use of run-time information. They would benefit from automated transformation tools, since real code is hard to process and changes may become part of a software re-engineering effort. Robust analysis and transformation infrastructure is needed; a range of tools is possible, for analysis and/or transformation. No silver bullets or magic compilers.

The view from space Outline Motivation OSKI: An autotuned sparse kernel library Real world optimization Toward end-to-end application autotuning Summary and future work

The view from space A framework for performance tuning Source: SciDAC Performance Engineering Research Institute (PERI)

The view from space OSKI's place in the tuning framework

The view from space Creating structure: Traveling Salesman-based reordering. Application: Stanford accelerator design (Omega3P). Idea: reorder by approximately solving a TSP [Pinar & Heath '97]: nodes = columns of A; weight(u, v) = number of nonzeros u and v have in common; a tour = an ordering of columns; choose the maximum-weight tour. Also: symmetric storage, register blocking (manually selected optimizations). Problem: the high cost of computing an approximate TSP solution in practice.

The view from space 100x100 Submatrix Along Diagonal

The view from space Microscopic effect of combined RCM+TSP reordering (before: green + red; after: green + blue)

The view from space

Interfaces to performance tools. Mark up the AST with data and analysis results to identify optimizable target(s): gprof; HPCToolkit [Mellor-Crummey, Rice]; VizzAnalyzer / Vizz3D [Panas, LLNL]; in progress: Open|SpeedShop [Schulz, LLNL]. Needed: analysis to identify targets.

The view from space Outlining. Outline the target into a dynamically loadable library routine. Extends initial implementations by Liao [U. Houston] and Jula [TAMU]. Handles many details of C and C++: wraps up variables, inserts declarations, and generates the call; produces suitable interfaces for dynamic loading; handles non-local control flow.

void OUT_38725__ (double* r, int JR, int KR, const double* A, …)
{
  int si, j, k, i;
  for (si = 0; si < NS; si++)
    …
      r[i + j*JR + k*KR] -= A[i + …

The view from space Making a benchmark. Make a benchmark by inserting checkpoint-library calls, to measure application behavior in context: use ckpt (user-level) [Zandy, U. Wisc.]; insert timing code (cycle counter); arbitrary code may be inserted to distinguish calling contexts. Reasonably fast in practice: checkpoint read/write bandwidth is ~500 MB/s on my Pentium-M, and for SMG2000 a problem consuming a ~500 MB footprint takes ~30 s to run. Needed: the best procedure for accurate and fair comparisons. Do restarts resume in comparable states?

The view from space Example of benchmark (pseudo)code:

static int num_calls = 0;   /* no. of invocations of outlined code */
if (!num_calls) {
  ckpt ();                  /* Checkpoint/resume */
  OUT_38725__ = dlsym (…);  /* Load an implementation */
  startTimer ();
}
OUT_38725__ (…);            /* outlined call-site */
if (++num_calls == CALL_LIMIT) {   /* Measured CALL_LIMIT calls */
  stopTimer ();
  outputTime ();
  exit (0);
}

The view from space SMG2000 kernel POET instantiation:

for (kk = 0; kk < NZ; kk++) {       /* L4 */
  for (jj = 0; jj < NY; jj++) {     /* L3 */
    for (si = 0; si < NS; si++) {   /* L1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core L2 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core L2 */
      for ( ; ii < NX; ii++) {              /* fringe L2 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe L2 */
    } /* L1 */
  } /* L3 */
} /* L4 */

The view from space Search. We are search-engine agnostic: many hybrid modeling/search techniques are possible.

The view from space Summary of the autotuning compiler approach. The end-to-end framework leverages existing work: ROSE provides a heavy-duty (robust) source-level infrastructure, and we assemble stand-alone components. Current and future work: assembling a more complete end-to-end example; interfaces between components; extending the basic ROSE infrastructure, particularly program analysis.

The view from space Compiler-based testing tools: instrumentation and dynamic analysis to measure coverage [IBM]; measurement-unit validation via Osprey [Jiang and Su, UC Davis]; numerical interval/bounds analysis [Sun]; an interface to the MOPS model-checker [Collingbourne, Imperial College]; interactive program visualization via VizzAnalyzer [Panas, LLNL].

The view from space SpMV trends, using pre-2007 data

The view from space SpMV trends, pre-2007: Fraction of peak

The view from space Motivation: The Difficulty of Tuning SpMV.

// y <-- y + A*x
for all A(i,j):
  y(i) += A(i,j) * x(j)

The view from space Motivation: The Difficulty of Tuning SpMV.

// y <-- y + A*x
for all A(i,j):
  y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = ptr[i] to ptr[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t

The view from space Motivation: The Difficulty of Tuning SpMV.

// y <-- y + A*x
for all A(i,j):
  y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = ptr[i] to ptr[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t

Exploit 8x8 dense blocks.
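
A minimal sketch of what exploiting such blocks looks like, using fixed 2x2 register blocking (BCSR) rather than 8x8, with illustrative names rather than OSKI's generated code: each block's entries stay in registers, and only one column index is stored per block.

/* y += A*x for BCSR with 2x2 blocks. bptr/bind index block rows/columns;
   val stores each 2x2 block contiguously, row-major. mb = no. of block rows. */
void bcsr_2x2_spmv (int mb, const int *bptr, const int *bind,
                    const double *val, const double *x, double *y)
{
    for (int I = 0; I < mb; ++I) {          /* block row I covers rows 2I, 2I+1 */
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (int k = bptr[I]; k < bptr[I + 1]; ++k) {
            const double *b  = val + 4*k;   /* the 2x2 block */
            const double *xp = x + 2*bind[k];
            y0 += b[0]*xp[0] + b[1]*xp[1];
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I] = y0; y[2*I + 1] = y1;
    }
}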

The view from space Speedups on Itanium 2: The Need for Search. Reference: … Mflop/s (7.6% of peak); best: … Mflop/s (31.1% of peak).

The view from space Speedups on Itanium 2: The Need for Search. Reference: … Mflop/s (7.6% of peak); best (4x2 blocking): … Mflop/s (31.1% of peak).

The view from space SpMV Performance: raefsky3 (figure)

The view from space SpMV Performance: raefsky3 (figure)

The view from space Better, worse, or about the same? Pentium 4 (1.5 GHz) vs. Xeon (3.2 GHz).

The view from space Better, worse, or about the same? Pentium 4 (1.5 GHz) vs. Xeon (3.2 GHz). * Faster, but the relative improvement increases (20% to ~50%).

Problem-Specific Performance Tuning

The view from space Problem-Specific Optimization Techniques

Optimizations for SpMV:
  Register blocking (RB): up to 4x over CSR
  Variable block splitting: 2.1x over CSR, 1.8x over RB
  Diagonals: 2x over CSR
  Reordering to create dense structure + splitting: 2x over CSR
  Symmetry: 2.8x over CSR, 2.6x over RB
  Cache blocking: 3x over CSR
  Multiple vectors (SpMM): 7x over CSR
  And combinations…
Sparse triangular solve:
  Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
  AA^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  A^k·x: 2x over CSR, 1.5x over RB


The view from space BCSR Captures Regularly Aligned Blocks. n = …, nnz = 1.5 M. Source: a NASA structural analysis problem with 8x8 dense substructure; blocking reduces storage.

The view from space Problem: Forced Alignment BCSR(2x2) Stored / true nz = 1.24

The view from space Problem: Forced Alignment BCSR(2x2) Stored / true nz = 1.24 BCSR(3x3) Stored / true nz = 1.46

The view from space Problem: Forced Alignment Implies UBCSR. BCSR(2x2): stored / true nz = 1.24. BCSR(3x3): stored / true nz = 1.46, and it forces i mod 3 = j mod 3 = 0. Unaligned BCSR (UBCSR) format: store row indices.
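
A sketch of the resulting UBCSR kernel under an assumed (illustrative) layout: it is the 2x2 BCSR loop with one extra array, brow[I], giving each block row's starting row, so blocks need not start at rows or columns that are multiples of the block size.

/* y += A*x for unaligned BCSR (UBCSR) with 2x2 blocks. */
void ubcsr_2x2_spmv (int nbrows, const int *brow, const int *bptr,
                     const int *bind, const double *val,
                     const double *x, double *y)
{
    for (int I = 0; I < nbrows; ++I) {
        int i = brow[I];                     /* explicit, unaligned starting row */
        double y0 = y[i], y1 = y[i + 1];
        for (int k = bptr[I]; k < bptr[I + 1]; ++k) {
            const double *b  = val + 4*k;
            const double *xp = x + bind[k];  /* column start is unaligned too */
            y0 += b[0]*xp[0] + b[1]*xp[1];
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[i] = y0; y[i + 1] = y1;
    }
}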

The view from space The Speedup Gap: BCSR vs. CSR (figure: speedup BCSR/CSR per machine, showing the gap)

The view from space Approach: Splitting + Relaxed Block Alignment. Goal: close the gap between FEM classes. Our approach: capture the actual structure more precisely. Split A = A_1 + A_2 + … + A_s and store each A_i in unaligned BCSR (UBCSR) format, relaxing both row and column alignment; Buttari, et al. (2005) show improvements from relaxed column alignment. Results: 2.1x over no blocking, 1.8x over blocking; when not faster than BCSR, it may still reduce storage.

The view from space Variable Block Row (VBR) Analysis Partition by grouping consecutive rows/columns having same pattern

The view from space From VBR, Identify Multiple Natural Block Sizes

The view from space VBR with Fill. Can also pad by matching rows/columns with nearly similar patterns. Define VBR(θ) = VBR in which consecutive rows are grouped when their similarity is at least θ, with 0 ≤ θ ≤ 1.
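
A sketch of the VBR(θ) grouping rule, assuming CSR input with sorted column indices; the similarity metric here (|intersection| / |union| of the two rows' column patterns) is one plausible choice, since the slide leaves the metric abstract.

/* Jaccard-style similarity of the column patterns of rows i and j. */
double row_similarity (const int *ptr, const int *ind, int i, int j)
{
    int a = ptr[i], ae = ptr[i + 1], b = ptr[j], be = ptr[j + 1], common = 0;
    while (a < ae && b < be) {
        if      (ind[a] == ind[b]) { ++common; ++a; ++b; }
        else if (ind[a] <  ind[b]) ++a;
        else                       ++b;
    }
    int uni = (ptr[i+1] - ptr[i]) + (ptr[j+1] - ptr[j]) - common;
    return uni ? (double) common / uni : 1.0;
}

/* group[i] = id of the block row containing row i: start a new group
   whenever similarity with the previous row drops below theta. */
void vbr_group_rows (int m, const int *ptr, const int *ind,
                     double theta, int *group)
{
    int g = 0;
    group[0] = 0;
    for (int i = 1; i < m; ++i)
        group[i] = (row_similarity (ptr, ind, i - 1, i) >= theta) ? g : ++g;
}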

The view from space VBR with Fill Fill of 1%

The view from space A Complex Tuning Problem. Many parameters need tuning: the fill threshold, 0.5 ≤ θ ≤ 1; the number of splittings, 2 ≤ s ≤ 4; and the ordering of block sizes r_i x c_i, with r_s x c_s = 1x1. See the paper in HPCC 2005 for proof-of-concept experiments based on a semi-exhaustive search (sketched below); a heuristic is in progress (uses the Buttari, et al. (2005) work).
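
A hedged sketch of such a semi-exhaustive search, fixing s = 2 splittings for brevity; convert_and_time() is a hypothetical stand-in for "convert the matrix with these parameters, run an SpMV, and report its time".

extern double convert_and_time (double theta, int r1, int c1, int r2, int c2);

typedef struct { double theta; int r1, c1, r2, c2; } params_t;

/* Enumerate fill thresholds theta in {0.5, ..., 1.0} and a handful of
   leading block sizes; the final splitting is always 1x1 (CSR). */
params_t search_ubcsr_params (void)
{
    static const int rc[][2] = { {6,6}, {3,3}, {2,2}, {2,1}, {1,2} };
    params_t best = { 0 };
    double best_t = 1e300;
    for (int ti = 5; ti <= 10; ++ti) {
        double theta = ti / 10.0;
        for (unsigned i = 0; i < sizeof rc / sizeof rc[0]; ++i) {
            double t = convert_and_time (theta, rc[i][0], rc[i][1], 1, 1);
            if (t < best_t) {
                best_t = t;
                best = (params_t){ theta, rc[i][0], rc[i][1], 1, 1 };
            }
        }
    }
    return best;
}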

The view from space FEM 2 Matrices

Matrix        Application         Dimension  # non-zeros  Dominant blocks
10-ct20stif   Engine block        52k        2.7M         6x6 (39%), 3x3 (15%)
12-raefsky4   Buckling            20k        1.3M         3x3 (96%)
13-ex11       Fluid flow          16k        1.1M         1x1 (38%), 3x3 (23%)
15-Vavasis3   2D PDE              41k        1.7M         2x1 (81%), 2x2 (19%)
17-rim        Fluid flow          23k        1.0M         1x1 (75%), 3x1 (12%)
A-bmw7st_1    Car chassis         141k       7.3M         6x6 (82%)
B-cop20k_m    Accel. cavity       121k       4.8M         2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%)
C-pwtk        Wind tunnel         218k       11.6M        6x6 (94%)
D-rma10       Charleston Harbor   47k        2.4M         2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%)
E-s3dkqm4     Cylindrical shell   90k        4.8M         6x6 (99%)

The view from space Power 4 Performance

The view from space Storage Savings

The view from space Traveling Salesman Problem-based Reordering. Application: Stanford accelerator design problem (Omega3P). Reorder by approximately solving a TSP [Pinar & Heath '97]: nodes = columns of A; weight(u, v) = number of nonzeros u and v have in common; a tour = an ordering of columns; choose the maximum-weight tour. Also: symmetric storage, register blocking (manually selected optimizations). Problem: the high cost of computing an approximate solution to the TSP (a greedy sketch follows).
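
A sketch of a cheap greedy (nearest-neighbor) stand-in for the maximum-weight tour, to make the cost concern concrete: with weight(u, v) counting the nonzero rows columns u and v share, the greedy pass alone needs O(n^2) weight evaluations. Names are illustrative, not the [Pinar & Heath '97] implementation.

#include <stdlib.h>

/* perm[] receives a column ordering: repeatedly append the unvisited
   column with the largest weight relative to the last column chosen. */
void greedy_tour (int n, int (*weight) (int u, int v), int *perm)
{
    char *used = calloc (n, 1);
    perm[0] = 0; used[0] = 1;
    for (int i = 1; i < n; ++i) {
        int best = -1, best_w = -1;
        for (int v = 0; v < n; ++v) {
            if (used[v]) continue;
            int w = weight (perm[i - 1], v);
            if (w > best_w) { best = v; best_w = w; }
        }
        perm[i] = best; used[best] = 1;
    }
    free (used);
}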

The view from space 100x100 Submatrix Along Diagonal

The view from space Microscopic effect of combined RCM+TSP reordering (before: green + red; after: green + blue)

The view from space

Inter-Iteration Sparse Tiling (1/3). Idea: Strout, et al., ICCS 2001. Let A be 5x5 tridiagonal; consider y = A^2·x, computed as t = A·x, y = A·t. Nodes: vector elements (x_1…x_5, t_1…t_5, y_1…y_5); edges: matrix elements a_ij.

The view from space Inter-Iteration Sparse Tiling (2/3). Idea: Strout, et al., ICCS 2001. Let A be 5x5 tridiagonal; consider y = A^2·x, computed as t = A·x, y = A·t. Nodes: vector elements; edges: matrix elements a_ij. Orange = everything needed to compute y_1; reuse a_11, a_12.

The view from space Inter-Iteration Sparse Tiling (3/3). Idea: Strout, et al., ICCS 2001. Let A be 5x5 tridiagonal; consider y = A^2·x, computed as t = A·x, y = A·t. Nodes: vector elements; edges: matrix elements a_ij. Orange = everything needed to compute y_1 (reuse a_11, a_12); grey = y_2, y_3 (reuse a_23, a_33, a_43). A sketch of the fused computation follows.
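
A minimal sketch of the fused computation this tiling enables, specialized to an n x n tridiagonal A (n >= 2) with sub-, main, and super-diagonals dl, d, du; once t[i] is produced, y[i-1] is finalized immediately, while t[i-1] and t[i] are still in cache. Illustrative only, not the general algorithm of Strout, et al.

/* y = A*(A*x), with t = A*x fused into the same sweep. */
void fused_a2x_tridiag (int n, const double *dl, const double *d,
                        const double *du, const double *x,
                        double *t, double *y)
{
    for (int i = 0; i < n; ++i) {
        t[i] = d[i] * x[i];
        if (i > 0)     t[i] += dl[i] * x[i - 1];
        if (i < n - 1) t[i] += du[i] * x[i + 1];
        if (i >= 1) {                    /* t[i-1] and t[i] are now final */
            y[i - 1] = d[i - 1] * t[i - 1] + du[i - 1] * t[i];
            if (i >= 2) y[i - 1] += dl[i - 1] * t[i - 2];
        }
    }
    y[n - 1] = dl[n - 1] * t[n - 2] + d[n - 1] * t[n - 1];   /* last row */
}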

The view from space Serial Sparse Tiling Performance (Itanium 2)

OSKI Software Architecture and API

The view from space Interface supports legacy app migration.

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
  my_matmult (ptr, ind, val, alpha, x, beta, y);

r = ddot (x, y); /* Some dense BLAS op on vectors */

The view from space Interface supports legacy app migration.

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
  my_matmult (ptr, ind, val, alpha, x, beta, y);

r = ddot (x, y);

The view from space Interface supports legacy app migration.

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
  my_matmult (ptr, ind, val, alpha, x, beta, y);

r = ddot (x, y);

The view from space Interface supports legacy app migration.

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);

/* Compute y = beta*y + alpha*A*x, 500 times */
for (i = 0; i < 500; i++)
  oski_MatMult (A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); /* Step 3 */

r = ddot (x, y);

The view from space Quick-and-dirty Parallelism: OSKI-PETSc. Extend PETSc's distributed-memory SpMV (MATMPIAIJ): each process stores its diagonal (all-local) and off-diagonal submatrices. OSKI-PETSc adds OSKI wrappers, and each submatrix is tuned independently.
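
A sketch of the per-process multiply this structure implies, using OSKI's oski_MatMult on each tuned submatrix; the MPI gather of ghost entries is elided, and all names besides oski_MatMult are illustrative, not PETSc's internals.

/* One MPI rank's share: y_local = A_diag * x_local + A_offd * x_ghost. */
void mpiaij_spmv (oski_matrix_t A_diag, oski_matrix_t A_offd,
                  oski_vecview_t x_local, oski_vecview_t x_ghost,
                  oski_vecview_t y_local)
{
    /* all-local diagonal block: y_local = 1.0*A_diag*x_local + 0.0*y_local */
    oski_MatMult (A_diag, OP_NORMAL, 1.0, x_local, 0.0, y_local);
    /* ... gather ghost entries of x from neighboring ranks here ... */
    /* off-diagonal block accumulates: y_local += A_offd * x_ghost */
    oski_MatMult (A_offd, OP_NORMAL, 1.0, x_ghost, 1.0, y_local);
}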

The view from space OSKI-PETSc Proof-of-Concept Results. Matrix 1: accelerator cavity design (SLAC): N ~ 1M, ~40M non-zeros; 2x2 dense block substructure; symmetric. Matrix 2: linear programming (Italian Railways): short-and-fat, 4k x 1M, ~11M non-zeros; highly unstructured; big speedup from cache blocking, for which there is no native PETSc format. Evaluation machine: a Xeon cluster, peak 4.8 Gflop/s per node.

The view from space Accelerator cavity matrix from SLAC's T3P code

The view from space Additional Features: OSKI-Lua. An embedded scripting language for selecting customized, complex transformations, with a mechanism to save/restore transformations.

# In file my_xform.txt:
# Compute A_fast = P*A*P^T using Pinar's reordering algorithm
A_fast, P = reorder_TSP(InputMat);
# Split A_fast = A1 + A2, where A1 is in 2x2 block format, A2 in CSR
A1, A2 = A_fast.extract_blocks(2, 2);
return transpose(P)*(A1+A2)*P;

/* In my_app.c */
fp = fopen ("my_xform.txt", "rt");
fgets (buffer, BUFSIZE, fp);
oski_ApplyMatTransform (A_tunable, buffer);
oski_MatMult (A_tunable, …);

Current Work and Future Directions

The view from space Current and Future Work on OSKI. OSKI is at bebop.cs.berkeley.edu/oski; a pre-alpha version of OSKI-PETSc is available, and a beta for Kokkos (Trilinos). Future work: evaluation on full solves/apps (a Bay Area lithography shop saw a 2x speedup in the full solve); code generators; studying the use of higher-level OSKI kernels; ports to additional architectures (e.g., vector machines, SMPs); additional heuristics [Buttari, et al. (2005)]. Many BeBOP projects are on-going: an SpMV benchmark for HPC Challenge [Gahvari & Hoemmen]; evaluation of Cell [Williams]; higher-level kernels and solvers [Hoemmen, Nishtala]; tuning collective communications [Nishtala]; cache-oblivious stencils [Kamil].

The view from space ROSE: A Compiler-Based Approach to Tuning General Applications. ROSE is a tool for building customized source-to-source tools (Quinlan, et al.), with full support for C and C++ (Fortran 90 in development); it targets users with little or no compiler background. Focus on performance optimization for scientific computing: domain-specific analysis and optimizations; object-oriented abstraction recognition; rich loop-transformation support; annotation-language support. Additional infrastructure supports software assurance, testing, and debugging. Toward an end-to-end empirical tuning compiler: combines profiling, checkpointing, analysis, parameterized code generation, and search. Joint work with Qing Yi (University of Texas at San Antonio); sponsored by the U.S. Department of Energy.

The view from space ROSE Architecture (diagram): front-end (EDG-based) → mid-end → back-end → transformed application source, with an application/library interface. The mid-end hosts source and AST fragments, AST annotations, and tools: abstraction recognition, abstraction-aware analysis, abstraction elimination, extended traditional optimizations, and source+AST transformations.