The Future of Numerical Linear Algebra: Automatic Performance Tuning of Sparse Matrix Codes; The Next LAPACK and ScaLAPACK

Slides: www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt



Presentation transcript:

The Future of Numerical Linear Algebra: Automatic Performance Tuning of Sparse Matrix Codes; The Next LAPACK and ScaLAPACK. James Demmel, UC Berkeley

Outline Automatic Performance Tuning of Sparse Matrix Kernels Updating LAPACK and ScaLAPACK


Berkeley Benchmarking and OPtimization (BeBOP) Prof. Katherine Yelick; Rich Vuduc (many results in this talk are from Vuduc's PhD thesis); Rajesh Nishtala, Mark Hoemmen, Hormozd Gahvari, Eun-Jin Im, and many other earlier contributors. Other papers at bebop.cs.berkeley.edu

Motivation for Automatic Performance Tuning Writing high performance software is hard –Make programming easier while getting high speed Ideal: program in your favorite high level language (Matlab, PETSc…) and get a high fraction of peak performance Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… –Best choice can depend on knowing a lot of applied mathematics and computer science How much of this can we teach?

Motivation for Automatic Performance Tuning Writing high performance software is hard –Make programming easier while getting high speed Ideal: program in your favorite high level language (Matlab, PETSc…) and get a high fraction of peak performance Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… –Best choice can depend on knowing a lot of applied mathematics and computer science How much of this can we teach? How much of this can we automate?

Examples of Automatic Performance Tuning Dense BLAS –Sequential –PHiPAC (UCB), then ATLAS (UTK) –Now in Matlab, many other releases –math-atlas.sourceforge.net/ Fast Fourier Transform (FFT) & variations –FFTW (MIT) –Sequential and Parallel –1999 Wilkinson Software Prize – Digital Signal Processing –SPIRAL: (CMU) MPI Collectives (UCB, UTK) More projects, conferences, government reports, …

Tuning Dense BLAS —PHiPAC

Tuning Dense BLAS – ATLAS Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK)

How tuning works, so far What do dense BLAS, FFTs, signal processing, MPI reductions have in common? –Can do the tuning off-line: once per architecture, algorithm –Can take as much time as necessary (hours, a week…) –At run-time, algorithm choice may depend on only a few parameters (matrix dimension, size of FFT, etc.)

Register Tile Size Selection in Dense Matrix Multiply [diagram: C (m x n) += A (m x k) * B (k x n), computed from register tiles of size m0 x n0 (C), m0 x k0 (A), and k0 x n0 (B)]

Tuning Register Tile Sizes (Dense Matrix Multiply) 333 MHz Sun Ultra 2i. 2-D slice of 3-D space; implementations color-coded by performance in Mflop/s. 16 registers, but 2-by-3 tile size fastest. Needle in a haystack
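To make the register-tiling idea concrete, here is a minimal C sketch of a tiled matrix multiply with an M0 x N0 block of C held in registers (the 2-by-3 shape found fastest above). This is an illustration of the technique, not the PHiPAC/ATLAS generated code; the tile sizes, loop order, and function name are assumptions.

    /* Minimal register-tiling sketch for C = C + A*B (row-major).
       Tile sizes are illustrative; autotuners search over them per machine.
       Assumes m, n, k are multiples of M0, N0, K0 (cleanup code omitted). */
    #define M0 2
    #define N0 3
    #define K0 8

    static void matmul_tiled(int m, int n, int k,
                             const double *A, const double *B, double *C)
    {
        for (int i = 0; i < m; i += M0)
            for (int j = 0; j < n; j += N0) {
                /* Keep the M0 x N0 block of C in registers. */
                double c[M0][N0];
                for (int ii = 0; ii < M0; ii++)
                    for (int jj = 0; jj < N0; jj++)
                        c[ii][jj] = C[(i+ii)*n + (j+jj)];
                for (int p = 0; p < k; p += K0)
                    for (int pp = 0; pp < K0; pp++)
                        for (int ii = 0; ii < M0; ii++) {
                            double a = A[(i+ii)*k + (p+pp)];
                            for (int jj = 0; jj < N0; jj++)
                                c[ii][jj] += a * B[(p+pp)*n + (j+jj)];
                        }
                for (int ii = 0; ii < M0; ii++)
                    for (int jj = 0; jj < N0; jj++)
                        C[(i+ii)*n + (j+jj)] = c[ii][jj];
            }
    }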

90% of implementations perform at < 33% of peak; 99% perform at < 66% of peak; only 0.4% perform at >= 80% of peak

Limits of off-line tuning Algorithm and its implementation may strongly depend on data known only at run-time Ex: a sparse matrix's nonzero pattern determines both the best data structure and the best implementation of sparse matrix-vector multiplication (SpMV) Can't afford to generate and test thousands of algorithms and implementations at run-time! The BeBOP project addresses sparse tuning

A Sparse Matrix You Use Every Day

Motivation for Automatic Performance Tuning of SpMV SpMV widely used in practice –Kernel of iterative solvers for linear systems, eigenvalue problems, and singular value problems Historical trends –Sparse matrix-vector multiply (SpMV): 10% of peak or less –2x faster than CSR with "hand-tuning" –Tuning becoming more difficult over time

SpMV Historical Trends: Fraction of Peak

Approach to Automatic Performance Tuning of SpMV Our approach: empirical modeling and search –Off-line: measure performance of variety of data structures and SpMV algorithms –On-line: sample matrix, use performance model to predict which data structure/algorithm is best Results –Up to 4x speedups and 31% of peak for SpMV Using register blocking –Many other optimization techniques for SpMV

SpMV with Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) = y(i) + A(i,j)*x(j)
  for each row i
    for k = ptr[i] to ptr[i+1]-1 do
      y[i] = y[i] + val[k]*x[ind[k]]
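For reference, a runnable C version of the CSR kernel above (zero-based indices; the array names follow the pseudocode, the function name is illustrative):

    /* y = y + A*x with A in Compressed Sparse Row (CSR) format.
       ptr has n+1 entries; row i occupies val[ptr[i] .. ptr[i+1]-1]. */
    void spmv_csr(int n, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }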

Example 1: The Difficulty of Tuning n = nnz = 1.5 M kernel: SpMV Source: NASA structural analysis problem

Example 1: The Difficulty of Tuning n = nnz = 1.5 M kernel: SpMV Source: NASA structural analysis problem 8x8 dense substructure

Taking advantage of block structure in SpMV Bottleneck is time to get matrix from memory –Only 2 flops for each nonzero in matrix –Goal: decrease size of data structure Don’t store each nonzero with index, instead store each nonzero r-by-c block with one index –Storage drops by up to 2x (if rc >> 1, all 32-bit quantities) –Time to fetch matrix from memory decreases Change both data structure and algorithm –Need to pick r and c –Need to change algorithm accordingly In example, is r=c=8 best choice? –Minimizes storage, so looks like a good idea…
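A minimal sketch of the resulting register-blocked (BCSR) kernel, hard-coded here to 2x2 blocks; SPARSITY/OSKI generate a family of such unrolled kernels for many (r,c) choices, and the array names here are illustrative:

    /* y = y + A*x with A in r-by-c Block CSR (BCSR), here r = c = 2.
       b_ptr/b_ind index block rows/columns; b_val stores 2x2 blocks row-major.
       One index per block instead of one per nonzero shrinks the data structure. */
    void spmv_bcsr_2x2(int n_block_rows, const int *b_ptr, const int *b_ind,
                       const double *b_val, const double *x, double *y)
    {
        for (int I = 0; I < n_block_rows; I++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = b_ptr[I]; k < b_ptr[I+1]; k++) {
                const double *b  = b_val + 4*k;      /* the k-th 2x2 block */
                const double *xx = x + 2*b_ind[k];   /* matching slice of x */
                y0 += b[0]*xx[0] + b[1]*xx[1];
                y1 += b[2]*xx[0] + b[3]*xx[1];
            }
            y[2*I]   += y0;
            y[2*I+1] += y1;
        }
    }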

Speedups on Itanium 2: The Need for Search [register-profile plot, Mflop/s; Reference vs. Best block size 4x2]

SpMV Performance (Matrix #2): Generation 1 [register-profile plots, Mflop/s] Power3 - 13%, Power4 - 14%, Itanium 2, Itanium 1 - 7%

Register Profile: Itanium 2 [color-coded register profile, Mflop/s; up to 1190 Mflop/s]

Register Profiles: IBM and Intel IA-64 [register-profile plots, Mflop/s] Power3 - 17%, Power4 - 16%, Itanium 2, Itanium 1 - 8%

Register Profiles: Sun and Intel x86 [register-profile plots, Mflop/s] Ultra 2i - 11%, Ultra 3 - 5%, Pentium III-M - 15%, Pentium III - 21%

Example 2: The Difficulty of Tuning n = nnz = 1.5 M kernel: SpMV Source: NASA structural analysis problem

Zoom in to top corner More complicated nonzero structure in general

3x3 blocks look natural, but… More complicated nonzero structure in general Example: 3x3 blocking –Logical grid of 3x3 cells

3x3 blocks look natural, but… More complicated nonzero structure in general Example: 3x3 blocking –Logical grid of 3x3 cells But would lead to lots of “fill-in”: 1.5x

Extra Work Can Improve Efficiency! More complicated nonzero structure in general Example: 3x3 blocking –Logical grid of 3x3 cells –Fill in explicit zeros –Unroll 3x3 block multiplies –“Fill ratio” = 1.5 On Pentium III: 1.5x speedup! –Actual Mflop/s rate is 1.5 x 1.5 = 2.25x higher (1.5x faster despite doing 1.5x more flops)

Automatic Register Block Size Selection Selecting the r x c block size –Off-line benchmark: precompute Mflops(r,c) using a dense matrix A for each r x c, once per machine/architecture –Run-time “search”: sample A to estimate Fill(r,c) for each r x c –Run-time heuristic model: choose r, c to minimize (estimated time) ~ Fill(r,c) / Mflops(r,c)
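A sketch of the selection heuristic just described, assuming a precomputed off-line table mflops[r][c] and a run-time fill estimator (see the next slide); all names are illustrative:

    /* Pick the register block size (r, c) minimizing estimated time,
       time(r,c) ~ Fill(r,c) / Mflops_dense(r,c).
       mflops[][] comes from the off-line dense benchmark; fill() estimates
       the fill ratio of the actual matrix A at run time. */
    #define RMAX 8
    #define CMAX 8

    void choose_block_size(double mflops[RMAX+1][CMAX+1],
                           double (*fill)(int r, int c, const void *A),
                           const void *A, int *r_best, int *c_best)
    {
        double t_best = 1e300;
        *r_best = *c_best = 1;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double t = fill(r, c, A) / mflops[r][c];
                if (t < t_best) { t_best = t; *r_best = r; *c_best = c; }
            }
    }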

Accurate and Efficient Adaptive Fill Estimation Idea: sample the matrix –Fraction of matrix to sample: s in [0,1] –Cost ~ O(s * nnz) –Control cost by controlling s Search at run-time: the constant matters! Control s automatically by computing statistical confidence intervals –Idea: monitor variance Cost of tuning –Heuristic: costs 1 to 11 unblocked SpMVs –Converting the matrix costs 5 to 40 unblocked SpMVs Tuning is a good idea when doing lots of SpMVs
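A simplified sketch of fill estimation by sampling a fraction s of the block rows; the production version (e.g., in OSKI) also monitors the variance of the estimate as described above. Array and function names are illustrative.

    #include <stdlib.h>

    /* Estimate Fill(r,c) = (entries stored after r-by-c blocking, explicit
       zeros included) / (true nonzeros), scanning only a random fraction s
       of the block rows. CSR inputs as in spmv_csr above; 'mark' is caller
       scratch of length ceil(ncols/c). */
    double estimate_fill(int nrows, int ncols, const int *ptr, const int *ind,
                         int r, int c, double s, int *mark)
    {
        int nbcols = (ncols + c - 1) / c;
        long blocks = 0, nnz_seen = 0;
        for (int I = 0; I < (nrows + r - 1) / r; I++) {
            if ((double)rand() / RAND_MAX > s) continue;   /* sample block rows */
            for (int j = 0; j < nbcols; j++) mark[j] = 0;
            for (int i = I*r; i < I*r + r && i < nrows; i++)
                for (int k = ptr[i]; k < ptr[i+1]; k++) {
                    int J = ind[k] / c;
                    if (!mark[J]) { mark[J] = 1; blocks++; }
                    nnz_seen++;
                }
        }
        return nnz_seen ? (double)(blocks * r * c) / nnz_seen : 1.0;
    }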

Test Matrix Collection Many on-line sources (see Vuduc’s thesis) Matrix 1: dense (in sparse format) Matrices 2-9: FEM with one block size r x c –N from 14K to 62K, NNZ from 1M to 3M –Fluid flow, structural mechanics, materials … Matrices 10-17: FEM with multiple block sizes –N from 17K to 52K, NNZ from 0.5M to 2.7M –Fluid flow, buckling, … Matrices 18-37: “Other” –N from 5K to 75K, NNZ from 33K to 0.5M –Power grid, chem eng, finance, semiconductors, … Matrices 40-44: Linear Programming –(N,M) from (3K,13K) to (15K,77K), NNZ from 50K to 2M

Accuracy of the Tuning Heuristics (1/4) NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”) See p. 375 of Vuduc’s thesis for matrices

Accuracy of the Tuning Heuristics (2/4)

DGEMV

Evaluating algorithms and machines for SpMV Some speedups look good, but could we do better? Questions –What is the best speedup possible? Independent of instruction scheduling, selection Can SpMV be further improved or not? –What machines are “good” for SpMV? –How can architectures be changed to improve SpMV?

Upper Bounds on Performance for register blocked SpMV P = (flops) / (time) –Flops = 2 * nnz(A) … don’t count extra work on zeros Lower bound on time: two main assumptions –1. Count memory ops only (streaming) –2. Count only compulsory, capacity misses: ignore conflicts Account for line sizes Account for matrix size and nnz Charge minimum access “latency” α_i at the L_i cache and α_mem at memory –e.g., Saavedra-Barrera and PMaC MAPS benchmarks
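In the slide's notation, one way to write this bound as a formula (a sketch of the model in Vuduc's thesis, under the two assumptions above; the symbols H_i and M_kappa are introduced here for illustration):

    T_{\mathrm{lower}} \;=\; \sum_{i=1}^{\kappa} H_i \,\alpha_i \;+\; M_{\kappa}\,\alpha_{\mathrm{mem}},
    \qquad
    P_{\mathrm{upper}} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\mathrm{lower}}}

where H_i is the number of loads satisfied at cache level L_i, M_kappa is the number of misses out of the last (kappa-th) level, and alpha_i, alpha_mem are the minimum access latencies measured by the benchmarks above.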

Example: L2 Misses on Itanium 2 Misses measured using PAPI [Browne ’00]

Example: Bounds on Itanium 2

Fraction of Upper Bound Across Platforms

Summary of Other Performance Optimizations Optimizations for SpMV –Register blocking (RB): up to 4x over CSR –Variable block splitting: 2.1x over CSR, 1.8x over RB –Diagonals: 2x over CSR –Reordering to create dense structure + splitting: 2x over CSR –Symmetry: 2.8x over CSR, 2.6x over RB –Cache blocking: 2.8x over CSR –Multiple vectors (SpMM): 7x over CSR –And combinations… Sparse triangular solve –Hybrid sparse/dense data structure: 1.8x over CSR Higher-level kernels –AA^T*x, A^TA*x: 4x over CSR, 1.8x over RB –A^k*x: 2x over CSR, 1.5x over RB

Example: Sparse Triangular Factor Raefsky4 (structural problem) + SuperLU + colmmd N=19779, nnz=12.6 M Dense trailing triangle: dim=2268, 20% of total nz Can be as high as 90+%! 1.8x over CSR
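A sketch of how a hybrid sparse/dense triangular solve can exploit such a dense trailing triangle; the partitioning and storage layout here are assumptions for illustration, not the actual tuned code:

    /* Hybrid sparse/dense lower-triangular solve L*x = b: the last nd
       rows/columns of L form a dense trailing triangle L22 (row-major),
       while the leading block L11 and the off-diagonal block L21 (rows
       ns..n-1 restricted to columns < ns) share one CSR structure
       ptr/ind/val, with each L11 row's diagonal entry stored last. */
    void hybrid_lower_trsv(int n, int nd, const int *ptr, const int *ind,
                           const double *val, const double *L22,
                           const double *b, double *x)
    {
        int ns = n - nd;
        /* 1. Sparse forward solve with L11. */
        for (int i = 0; i < ns; i++) {
            double s = b[i];
            for (int k = ptr[i]; k < ptr[i+1] - 1; k++)
                s -= val[k] * x[ind[k]];
            x[i] = s / val[ptr[i+1] - 1];        /* diagonal stored last */
        }
        /* 2. Right-hand-side update with L21: b2 <- b2 - L21*x1. */
        for (int i = ns; i < n; i++) {
            double s = b[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                s -= val[k] * x[ind[k]];
            x[i] = s;
        }
        /* 3. Dense forward solve with the trailing triangle L22
              (a tuned BLAS TRSV would be called here in practice). */
        for (int i = 0; i < nd; i++) {
            double s = x[ns + i];
            for (int j = 0; j < i; j++)
                s -= L22[i*nd + j] * x[ns + j];
            x[ns + i] = s / L22[i*nd + i];
        }
    }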

Cache Optimizations for AA^T*x Cache-level: interleave multiplication by A and A^T Register-level: a_i^T taken to be an r x c block row, or diagonal; row dot product + “axpy” Algorithmic-level transformations for A^2*x, A^3*x, …
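A sketch of the dot-product/axpy interleaving for the closely related kernel A^TAx with A stored in CSR (the same idea applies to AA^Tx by treating the rows of A^T, i.e. the columns of A, this way); illustrative only, without the register-level blocking mentioned above:

    /* z = A^T * (A * x) with A (m-by-n) in CSR: each sparse row a_i^T is
       read from memory once and used for both a dot product and an "axpy",
       instead of making two full passes over A. */
    void ata_x_csr(int m, int n, const int *ptr, const int *ind,
                   const double *val, const double *x, double *z)
    {
        for (int j = 0; j < n; j++) z[j] = 0.0;
        for (int i = 0; i < m; i++) {
            double d = 0.0;                      /* d = a_i^T * x   (dot)  */
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                d += val[k] * x[ind[k]];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                z[ind[k]] += d * val[k];         /* z = z + d * a_i (axpy) */
        }
    }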

Impact on Applications: Omega3P Application: accelerator cavity design [Ko] Relevant optimization techniques –Symmetric storage –Register blocking –Reordering rows and columns to improve blocking Reverse Cuthill-McKee ordering to reduce bandwidth Traveling Salesman Problem-based ordering to create blocks –[Pinar & Heath ’97] –Make columns adjacent if they have many common nonzero rows 2.1x speedup on Power 4

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

“Microscopic” Effect of RCM Reordering Before: Green + Red After: Green + Blue

“Microscopic” Effect of Combined RCM+TSP Reordering Before: Green + Red After: Green + Blue

(Omega3P)

Optimized Sparse Kernel Interface (OSKI) See: bebop.cs.berkeley.edu Provides sparse kernels automatically tuned for the user’s matrix & machine –BLAS-style functionality: SpMV (Ax & A^Ty), TrSV –Hides complexity of run-time tuning –Includes A^TAx, A^kx For “advanced” users & solver library writers –Available as a stand-alone library: alpha release at bebop.cs.berkeley.edu/oski, release v1.0 “soon” –Will be available as a PETSc extension
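A sketch of a typical OSKI call sequence, written from the style of the OSKI user's guide; exact function names, signatures, and constants should be checked against the release at bebop.cs.berkeley.edu/oski:

    #include <oski/oski.h>

    /* Wrap an existing CSR matrix, give OSKI a workload hint, tune, and
       multiply. Illustrative sketch only. */
    void oski_example(int n, int *ptr, int *ind, double *val,
                      double *x, double *y)
    {
        oski_Init();
        oski_matrix_t A = oski_CreateMatCSR(ptr, ind, val, n, n,
                                            SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
        oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, n, STRIDE_UNIT);

        /* Expected workload: ~500 SpMVs; then let OSKI tune the data structure. */
        oski_SetHintMatMult(A, OP_NORMAL, 1.0, SYMBOLIC_VEC, 1.0, SYMBOLIC_VEC, 500);
        oski_TuneMat(A);

        /* y <- 1.0*A*x + 1.0*y, using whatever tuned kernel was selected. */
        oski_MatMult(A, OP_NORMAL, 1.0, xv, 1.0, yv);

        oski_DestroyVecView(xv);
        oski_DestroyVecView(yv);
        oski_DestroyMat(A);
        oski_Close();
    }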

How OSKI Tunes [workflow diagram] Library install-time (offline): 1. Build for target architecture; 2. Benchmark (producing benchmark data and generated code variants). Application run-time: 1. Evaluate heuristic models, using the benchmark data, the user’s matrix, workload from program monitoring, and history; 2. Select data structure & code. To user: a matrix handle for kernel calls. Extensibility: advanced users may write & dynamically add “code variants” and “heuristic models” to the system.

What about the Google Matrix? Google approach –Approx. once a month: rank all pages using connectivity structure Find dominant eigenvector of a matrix –At query-time: return list of pages ordered by rank Matrix: A = αG + (1-α)(1/n)uu^T –Markov model: surfer follows a link with probability α, jumps to a random page with probability 1-α –G is the n x n connectivity matrix [n ~ billions]: g_ij is nonzero if page i links to page j –Steady-state probability x_i of landing on page i is the solution to x = Ax Approximate x by the power method: x = A^k x_0 –In practice, k ~ 25
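A sketch of one step of this power iteration in C, applying A = αG + (1-α)(1/n)uu^T without forming it (the rank-one part only needs the sum of the entries of x); the damping value and all names are illustrative, and real PageRank codes also handle dangling pages and normalization:

    /* x_next = A*x, with G in CSR (spmv_csr-style arrays) and alpha the
       damping parameter (commonly ~0.85). */
    void pagerank_step(int n, const int *ptr, const int *ind, const double *gval,
                       double alpha, const double *x, double *x_next)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i];
        double teleport = (1.0 - alpha) * s / n;     /* (1-alpha)(1/n)u u^T x */
        for (int i = 0; i < n; i++) {
            double gi = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)  /* (G*x)_i */
                gi += gval[k] * x[ind[k]];
            x_next[i] = alpha * gi + teleport;
        }
    }
    /* In practice, repeat for k ~ 25 iterations: x = A^k x_0. */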

Awards Best Paper, Intern. Conf. Parallel Processing, 2004 –“Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 –Best Student Presentation too, to Richard Vuduc –“Automatic performance tuning and analysis of sparse triangular solve” Finalist, Best Student Paper, Supercomputing 2002 –To Richard Vuduc –“Performance Optimization and Bounds for Sparse Matrix-vector Multiply” Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 –To Richard Vuduc –“Statistical Modeling of Feedback Data in an Automatic Tuning System”

Outline Automatic Performance Tuning of Sparse Matrix Kernels Updating LAPACK and ScaLAPACK

LAPACK and ScaLAPACK Widely used dense and banded linear algebra libraries –In Matlab (thanks to tuning…), NAG, PETSc,… –Used in vendor libraries from Cray, Fujitsu, HP, IBM, Intel, NEC, SGI –over 49M web hits at New NSF grant for new, improved releases –Joint with Jack Dongarra, many others –Community effort (academic and industry) –

Goals (highlights) Putting more of LAPACK into ScaLAPACK –Lots of routines not yet parallelized New functionality –Ex: Updating/downdating of factorizations Improving ease of use –Life after F77? Automatic Performance Tuning –Over 1300 calls to ILAENV() to get tuning parameters New Algorithms –Some faster, some more accurate, some new

Faster: eig() and svd() Nonsymmetric eigenproblem –Incorporate SIAM Prize-winning work of Byers / Braman / Mathias on faster HQR –Up to 10x faster for large enough problems Symmetric eigenproblem and SVD –Reduce from dense to narrow band: incorporate work of Bischof/Lang, Howell/Fulton; move work from BLAS2 to BLAS3 –Narrow band (tri/bidiagonal) problem: incorporate MRRR algorithm of Parlett/Dhillon; Voemel, Marques, Willems

MRRR Algorithm for eig(tridiagonal) and svd(bidiagonal) “Multiple Relatively Robust Representations” 1999 Householder Award honorable mention for Dhillon O(nk) flops to find k eigenvalues/vectors of an n x n tridiagonal matrix (similar for SVD) –Minimum possible! Naturally parallelizable Accurate –Small residuals: ||T x_i - λ_i x_i|| = O(n ε) –Orthogonal eigenvectors: |x_i^T x_j| = O(n ε) Hence nickname: “Holy Grail”

Benchmark Details AMD 1.2 GHz Athlon, 2GB mem, Redhat + Intel compiler Compute all eigenvalues and eigenvectors of a symmetric tridiagonal matrix T Codes compared: –qr: QR iteration from LAPACK: dsteqr –dc: Cuppen’s Divide&Conquer from LAPACK: dstedc –gr: New implementation of MRRR algorithm (“Grail”) –ogr: MRRR from LAPACK: dstegr (“old Grail”)

Timing of Eigensolvers (only matrices where time > 0.1 sec)

Accuracy Results (old vs new Grail)

Accuracy Results (Grail vs QR vs DC)

More Accurate: Solve Ax=b [plot: error vs. n for conventional Gaussian elimination vs. Gaussian elimination with iterative refinement]

More Accurate: Solve Ax=b Old idea: use Newton’s method on f(x) = Ax - b –On a linear system? –Roundoff in Ax - b makes it interesting (“nonlinear”) –Iterative refinement: Snyder, Wilkinson, Moler, Skeel, …
  Repeat
    r = Ax - b … computed with extra precision
    Solve Ad = r … using the LU factorization of A
    Update x = x - d
  Until “accurate enough” or no progress
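A sketch of this refinement loop in C, using LAPACK's dgetrf/dgetrs for the factor and solve steps and long double to accumulate the residual in extra precision; the new LAPACK routines use the extended-precision BLAS instead, so this is an illustration of the idea rather than the actual code:

    #include <stdlib.h>
    #include <math.h>

    /* Standard LAPACK Fortran interfaces. */
    extern void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);
    extern void dgetrs_(char *trans, int *n, int *nrhs, double *a, int *lda,
                        int *ipiv, double *b, int *ldb, int *info);

    /* Solve A*x = b (A column-major, n-by-n) with iterative refinement. */
    void solve_with_refinement(int n, const double *A, const double *b,
                               double *x, int max_iter)
    {
        double *LU = malloc((size_t)n * n * sizeof *LU);
        double *d  = malloc((size_t)n * sizeof *d);
        int *ipiv  = malloc((size_t)n * sizeof *ipiv);
        int info, one = 1;

        for (int i = 0; i < n * n; i++) LU[i] = A[i];
        dgetrf_(&n, &n, LU, &n, ipiv, &info);                /* O(n^3), done once */

        for (int i = 0; i < n; i++) x[i] = b[i];
        dgetrs_("N", &n, &one, LU, &n, ipiv, x, &n, &info);  /* initial solve */

        for (int it = 0; it < max_iter; it++) {              /* each step O(n^2) */
            for (int i = 0; i < n; i++) {                    /* r = A*x - b, extra precision */
                long double r = -(long double)b[i];
                for (int j = 0; j < n; j++)
                    r += (long double)A[i + (size_t)j * n] * (long double)x[j];
                d[i] = (double)r;
            }
            dgetrs_("N", &n, &one, LU, &n, ipiv, d, &n, &info);  /* solve A*d = r */
            double dmax = 0.0;
            for (int i = 0; i < n; i++) { x[i] -= d[i]; dmax = fmax(dmax, fabs(d[i])); }
            if (dmax == 0.0) break;       /* crude "accurate enough / no progress" test */
        }
        free(LU); free(d); free(ipiv);
    }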

What’s new? Need extra precision (beyond double) –Part of the new BLAS standard –Cost = O(n^2) extra per right-hand side, vs O(n^3) for the factorization Get tiny componentwise bounds too –Error in x_i small compared to |x_i|, not just max_j |x_j| “Guarantees” based on condition number estimates –No bad bounds in 6.2M tests, unlike the old method –Different condition number for componentwise bounds Extends to least squares Kahan, Hida, Riedy, X. Li, many undergrads LAPACK Working Note #165

Can condition estimators lie? Yes, unless they cost as much as matrix multiply –Demmel/Diament/Malajovich (FCM 2001) But what if matrix multiply costs O(n^2)? –Cohn/Umans (FOCS 2003)

New algorithm for roots(p) To find roots of polynomial p –roots(p) does eig(C(p)) –Costs O(n^3); stable, reliable O(n^2) alternatives –Newton, Laguerre, bisection, … –Stable? Reliable? New: exploit “semiseparable” structure of the companion matrix C(p) = [ -p_1 -p_2 … -p_d ; 1 0 … 0 ; 0 1 … 0 ; … ; 0 … 1 0 ] –Low rank of any submatrix of the upper triangle of C(p) is preserved under QR iteration –Complexity drops from O(n^3) to O(n^2), same stability Ming Gu, Jiang Zhu, Jianlin Xia, David Bindel

Conclusions Lots of opportunities for faster and more accurate solutions to classical problems Tuning algorithms for problems and architectures –Automated search to deal with complexity –Surprising optima, “needle in a haystack” Exploiting mathematical structure to find faster algorithms

Thanks to NSF and DOE for support These slides are available at www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt