The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev Jim Demmel UC Berkeley 21 June 2006 PARA 06.

Outline Motivation for new Sca/LAPACK Challenges (or research opportunities…) Goals of new Sca/LAPACK Highlights of progress

Motivation LAPACK and ScaLAPACK are widely used –Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … –>60M web hits at Netlib (incl. CLAPACK, LAPACK95)

Impact (with NERSC, LBNL) Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). ScaLAPACK

Motivation LAPACK and ScaLAPACK are widely used –Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … –>60M web hits at Netlib (incl. CLAPACK, LAPACK95) Many ways to improve them, based on –Own algorithmic research –Enthusiastic participation of research community –User/vendor survey –Opportunities and demands of new architectures, programming languages New releases planned (NSF support)

Participants UC Berkeley: –Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… U Tennessee, Knoxville –Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak Other Academic Institutions –UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara –TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb Research Institutions –CERFACS, LBL Industrial Partners –Cray, HP, Intel, MathWorks, NAG, SGI

Challenges For all large scale computing, not just linear algebra! Example …

Parallelism in the Top500

Challenges For all large scale computing, not just linear algebra! Example … your laptop –256 Threads/multicore chip by 2010

Challenges For all large scale computing, not just linear algebra! Example … your laptop –256 Threads/multicore chip by 2010 Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency –Floating point time << 1/Network BW << Network Latency

Challenges For all large scale computing, not just linear algebra! Example … your laptop –256 Threads/multicore chip by 2010 Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency –Floating point time << 1/Network BW << Network Latency Heterogeneity (performance and semantics) Asynchrony Unreliability

What do users want? High performance, ease of use, … Survey results at www.netlib.org/lapack-dev –Small but interesting sample –What matrix sizes do you care about? 1000s: 34% 10,000s: 26% 100,000s or 1Ms: 26% –How many processors, on distributed memory? >10: 34%, >100: 31%, >1000: 19% –Do you use more than double precision? Sometimes or frequently: 16% –Would Automatic Memory Allocation help? Very useful: 72%, Not useful: 14%

Goals of next Sca/LAPACK 1. Better algorithms –Faster, more accurate 2. Expand contents –More functions, more parallel implementations 3. Automate performance tuning 4. Improve ease of use 5. Better software engineering 6. Increased community involvement

Goal 1: Better Algorithms Faster –But provide “usual” accuracy, stability –… Or accurate for an important subclass More accurate –But provide “usual” speed –… Or at any cost

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS 19) Up to 10x faster HQR: –Byers / Mathias / Braman Extensions to QZ: –Kågström / Kressner / Adlerborn (MS 19) Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang Novel Data Layouts: –Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30)

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS 19) –Faster and more accurate than previous algorithms –SIAM SIAG/LA Prize in 2006 –New sequential, first parallel versions out in 2006

Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)

Flop Count Ratios of Eigensolvers (2.2 GHz Opteron + ACML)

Run Time Ratios of Eigensolvers (2.2 GHz Opteron + ACML)

MFlop Rates of Eigensolvers (2.2 GHz Opteron + ACML)

Parallel Runtimes of Eigensolvers (2.4 GHz Xeon Cluster + Ethernet)

Accuracy of Eigensolvers max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Qᵀ – I|| / (n ε)
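
These two metrics are easy to check directly; below is a minimal NumPy/SciPy sketch (not part of the original slides) that evaluates them for a symmetric tridiagonal eigensolver, with scipy.linalg.eigh_tridiagonal standing in for the LAPACK drivers being compared and the Frobenius norm used for the orthogonality term.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def eigensolver_accuracy(d, e):
    """Residual and orthogonality metrics for a symmetric tridiagonal eigensolver.

    d: diagonal of T (length n), e: off-diagonal (length n-1).
    Returns (max_i ||T q_i - lam_i q_i|| / (n eps), ||Q Q^T - I|| / (n eps)).
    """
    n = len(d)
    eps = np.finfo(float).eps
    lam, Q = eigh_tridiagonal(d, e)          # T Q = Q diag(lam)
    T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
    residual = np.linalg.norm(T @ Q - Q * lam, axis=0).max() / (n * eps)
    orthogonality = np.linalg.norm(Q @ Q.T - np.eye(n)) / (n * eps)
    return residual, orthogonality

# Example: a random tridiagonal test matrix
rng = np.random.default_rng(0)
print(eigensolver_accuracy(rng.standard_normal(200), rng.standard_normal(199)))
```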

Accuracy of Eigensolvers: Old vs New Grail max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Qᵀ – I|| / (n ε)

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS 19) –Faster and more accurate than previous algorithms –SIAM SIAG/LA Prize in 2006 –New sequential, first parallel versions out in 2006 –Both DC and MR are important

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS19) Up to 10x faster HQR: –Byers / Mathias / Braman –SIAM SIAG/LA Prize in 2003 –Sequential version out in 2006 –More on performance later

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS19) Up to 10x faster HQR: –Byers / Mathias / Braman Extensions to QZ: –Kågström / Kressner / Adlerborn (MS 19) –LAPACK Working Note (LAWN) #173 –On 26 real test matrices, speedups up to 14.7x, 4.4x average

Comparison of ScaLAPACK QR and new parallel multishift QZ Execution times in secs for 4096 x 4096 random problems Ax = sx and Ax = sBx, using processor grids including 1-16 processors. Note: work(QZ) > 2 * work(QR) but Time(// QZ) << Time (//QR)!! Times include cost for computing eigenvalues and transformation matrices. Adlerborn-Kågström-Kressner, SIAM PP’2006, also MS19

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS19) Up to 10x faster HQR: –Byers / Mathias / Braman Extensions to QZ: –Kågström / Kressner / Adlerborn (MS 19) Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang –Full nonsymmetric eigenproblem: n=1500: 3.43x faster HQR: 5x faster, Reduction: 14% faster –Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster –Sequential versions out in 2006

Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems (MS 19) Up to 10x faster HQR: –Byers / Mathias / Braman Extensions to QZ: –Kågström / Kressner / Adlerborn (MS19) Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang Novel Data Layouts –Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30) –SIAM Review Article 2004

Novel Data Layouts and Algorithms (MS 15/23/30 – Novel Data Formats) Still merges multiple elimination steps into a few BLAS 3 operations; Rectangular Packed Format: good speedups for packed storage of symmetric matrices

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate Arbitrary precision versions of everything –Using your favorite multiple precision package

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –Kahan, Riedy, Hida, Li –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals –Extra-precise BLAS needed (LAWN#165)
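
As an illustration of the idea (a sketch only; the LAPACK code uses the extra-precise XBLAS of LAWN #165 rather than the np.longdouble stand-in used here): factor once in working precision, then repeat cheap O(n²) refinement sweeps with residuals accumulated in higher precision.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine_solve(A, b, iters=5):
    """Solve Ax = b with iterative refinement using extra-precise residuals.

    Sketch only: np.longdouble plays the role of the extra-precise BLAS
    used in the LAPACK implementation."""
    lu, piv = lu_factor(A)                      # O(n^3) factorization, working precision
    x = lu_solve((lu, piv), b)
    A_hi = A.astype(np.longdouble)
    b_hi = b.astype(np.longdouble)
    for _ in range(iters):                      # each sweep costs only O(n^2)
        r = b_hi - A_hi @ x                     # residual in extra precision
        dx = lu_solve((lu, piv), np.asarray(r, dtype=np.float64))
        x = x + dx
        if np.linalg.norm(dx) <= np.finfo(np.float64).eps * np.linalg.norm(x):
            break
    return x
```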

More Accurate: Solve Ax=b – Conventional Gaussian Elimination vs. with extra-precise iterative refinement (error vs. condition number, out to condition numbers near 1/(√n ε))

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals –Extra-precise BLAS needed (LAWN#165) –“Guarantees” based on condition number estimates Condition estimate < 1/(√n ε) ⇒ reliable answer and tiny error bounds No bad bounds in 6.2M tests Can condition estimators lie?

Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization –Demmel/Diament/Malajovich (FCM 2001) But what if matrix multiply costs O(n²)? – More later

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals –Extra-precise BLAS needed (LAWN#165) –“Guarantees” based on condition number estimates –Get tiny componentwise bounds too Each x_i accurate Slightly different condition number –Extends to Least Squares (Li) –Release in 2006

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –Promise the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate –Drmac / Veselic (MS 3) –LAWNS # 169, 170 –Can be arbitrarily more accurate on tiny singular values –Yet faster than QR iteration, within 2x of DC

Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –Promise the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate Arbitrary precision versions of everything –Using your favorite multiple precision package –Quad, Quad-double, ARPREC, MPFR, … –Using Fortran 95 modules

Iterative Refinement: for speed (MS 3) What if double precision much slower than single? –Cell processor in Playstation: single-precision GFlops far exceed its 25 GFlops in double –Pentium SSE2: single twice as fast as double Given Ax=b in double precision –Factor in single, do refinement in double –If κ(A) < 1/ε_single, runs at speed of single 1.9x speedup on Intel-based laptop Applies to many algorithms, if difference large
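
A sketch of this speed-oriented variant in NumPy terms (an illustration, not the Sca/LAPACK code): factor in single precision so the O(n³) work runs at single-precision speed, then refine in double.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=10):
    """Factor in float32, refine in float64 (sketch of the single/double trick)."""
    lu32, piv = lu_factor(A.astype(np.float32))          # cheap single-precision factorization
    x = lu_solve((lu32, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                     # residual in double
        dx = lu_solve((lu32, piv), r.astype(np.float32))  # correction via single-precision factors
        x = x + dx.astype(np.float64)
        if np.linalg.norm(dx) <= np.finfo(np.float64).eps * np.linalg.norm(x):
            break
    return x                                              # converges if kappa(A) << 1/eps_single
```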

Exploiting GPUs Numerous emerging co-processors –Cell, SSE, Grape, GPU, “physics coprocessor,” … When can we exploit them? –Little help if memory is bottleneck –Various attempts to use GPUs for dense linear algebra Bisection on GPUs for symmetric tridiagonal eigenproblem –Evaluate Count(x) = #(evals < x) for many x –Very little memory traffic –Speedups up to 100x (Volkov)
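
The Count(x) kernel is the standard Sturm/inertia count obtained from the LDLᵀ pivots of T − xI; a scalar Python sketch is below (the GPU code evaluates it for many shifts x in parallel, which is what yields the reported speedups).

```python
import numpy as np

def count_less_than(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal T (diag d, offdiag e)
    that are < x, via the LDL^T / Sturm-count recurrence.
    Touches only d and e, hence very little memory traffic."""
    count = 0
    q = 1.0  # previous pivot; the first iteration then yields d[0] - x
    for i in range(len(d)):
        off = e[i - 1] ** 2 / q if i > 0 else 0.0
        q = d[i] - x - off                   # next pivot of LDL^T of (T - x I)
        if q == 0.0:
            q = -np.finfo(float).tiny        # safeguard against an exactly zero pivot
        if q < 0.0:
            count += 1
    return count
```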

Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible

Missing Drivers in Sca/LAPACK (LAPACK vs. ScaLAPACK)
– Linear Equations (LU, Cholesky, LDLᵀ): LAPACK xGESV, xPOSV, xSYSV; ScaLAPACK PxGESV, PxPOSV, missing
– Least Squares (QR, QR+pivot, SVD/QR, SVD/D&C, SVD/MRRR): LAPACK xGELS, xGELSY, xGELSS, xGELSD, missing; ScaLAPACK PxGELS, missing, missing (ok?), missing
– Generalized LS (LS + equality constr., Generalized LM, above + iterative ref.): LAPACK xGGLSE, xGGGLM, missing

More missing drivers (LAPACK vs. ScaLAPACK)
– Symmetric EVD (QR / Bisection+Invit, D&C, MRRR): LAPACK xSYEV / X, xSYEVD, xSYEVR; ScaLAPACK PxSYEV / X, PxSYEVD, missing
– Nonsymmetric EVD (Schur form, vectors too): LAPACK xGEES / X, xGEEV / X; ScaLAPACK missing (driver), missing
– SVD (QR, D&C, MRRR, Jacobi): LAPACK xGESVD, xGESDD, missing; ScaLAPACK PxGESVD, missing (ok?), missing
– Generalized Symmetric EVD (QR / Bisection+Invit, D&C, MRRR): LAPACK xSYGV / X, xSYGVD, missing; ScaLAPACK PxSYGV / X, missing (ok?), missing
– Generalized Nonsymmetric EVD (Schur form, vectors too): LAPACK xGGES / X, xGGEV / X; ScaLAPACK missing
– Generalized SVD (Kogbetliantz, MRRR): LAPACK xGGSVD, missing; ScaLAPACK missing (ok), missing

Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang, Drmac (MS 3) –More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat (MS19) –Structured eigenproblems Selected matrix polynomials: –Mehrmann, Higham, Tisseur O(n²) version of roots(p) –Gu, Chandrasekaran, Bindel et al

New algorithm for roots(p) To find roots of polynomial p –Roots(p) does eig(C(p)) –Costs O(n³), stable, reliable O(n²) Alternatives –Newton, Jenkins-Traub, Laguerre, … –Stable? Reliable? New: Exploit “semiseparable” structure of C(p) –Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration –Complexity drops from O(n³) to O(n²), stable in practice Related work: Van Barel (MS3), Gemignani, Bini, Pan, et al Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel C(p) = the companion matrix with first row (–p_1, –p_2, …, –p_d), ones on the subdiagonal, and zeros elsewhere
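
For reference, a NumPy sketch of the O(n³) baseline the slide starts from, roots(p) = eig(C(p)); the new method keeps the same companion-matrix eigenproblem but exploits the semiseparable structure during QR iteration to reach O(n²).

```python
import numpy as np

def roots_via_companion(p):
    """Roots of x^d + p[0] x^(d-1) + ... + p[d-1] as eigenvalues of the
    companion matrix C(p): first row (-p_1, ..., -p_d), ones on the subdiagonal.
    Dense eig costs O(d^3); the structured method above reduces this to O(d^2)."""
    p = np.asarray(p, dtype=float)
    d = len(p)
    C = np.zeros((d, d))
    C[0, :] = -p                       # first row carries the negated coefficients
    C[1:, :-1] = np.eye(d - 1)         # subdiagonal of ones
    return np.linalg.eigvals(C)

# Example: (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6 has roots 1, 2, 3
print(np.sort(roots_via_companion([-6.0, 11.0, -6.0])))
```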

Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang, Drmac (MS 3) –More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat (MS19) –Structured eigenproblems Selected matrix polynomials: –Mehrmann, Higham, Tisseur O(n²) version of roots(p) –Gu, Chandrasekaran, Bindel et al How should we prioritize missing functions?

Goal 3 – Automate Performance Tuning Widely used in performance tuning of Kernels –ATLAS (PhiPAC) – BLAS –FFTW – Fast Fourier Transform –Spiral – signal processing –OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski Integrated into PETSc

Optimizing blocksizes for mat-mul Finding a Needle in a Haystack – So Automate

Goal 3 – Automate Performance Tuning Widely used in performance tuning of Kernels 1300 calls to ILAENV() to get block sizes, etc. –Never been systematically tuned Extend automatic tuning techniques of ATLAS, etc. to these other parameters –Automation important as architectures evolve Convert ScaLAPACK data layouts on the fly –Important for ease-of-use too

ScaLAPACK Data Layouts 1D Block Cyclic 1D Cyclic 2D Block Cyclic
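
All three layouts are instances of the block-cyclic map; below is a small sketch (assuming a row-major process grid, zero-based indices, and no source-process offset) of which process owns global entry (i, j) under a 2D block-cyclic distribution with mb x nb blocks on a Pr x Pc grid.

```python
def owner_2d_block_cyclic(i, j, mb, nb, Pr, Pc):
    """Process coordinates (pr, pc) owning global entry (i, j) of a matrix
    distributed 2D block-cyclically with mb x nb blocks over a Pr x Pc grid.
    1D block cyclic is the special case Pr = 1 (or Pc = 1); 1D cyclic additionally
    uses block size 1 in the distributed dimension."""
    pr = (i // mb) % Pr
    pc = (j // nb) % Pc
    return pr, pc

# Example: entry (5, 7) with 2x2 blocks on a 2x3 process grid
print(owner_2d_block_cyclic(5, 7, mb=2, nb=2, Pr=2, Pc=3))
```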

Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster w/ Myrinet interconnect, 2 GB memory. Speedups for using 2D processor grid range from 2x to 8x. Cost of redistributing from 1D to best 2D layout: 1% – 10%

Fast Matrix Multiplication (1) (Cohn, Kleinberg, Szegedy, Umans) Can think of fast convolution of polynomials p, q as –Map p (q) into group algebra Σ_i p_i z^i ∈ C[G] of cyclic group G = { z^i } –Multiply elements of C[G] (use divide&conquer = FFT) –Extract coefficients For matrix multiply, need non-abelian group satisfying triple product property –There are subsets X, Y, Z of G where xyz = 1 with x ∈ X, y ∈ Y, z ∈ Z ⇒ x = y = z = 1 –Map matrix A into group algebra via Σ_xy A_xy x⁻¹y, B into Σ_y'z B_y'z y'⁻¹z –Since x⁻¹y y'⁻¹z = x⁻¹z iff y = y' we get Σ_y A_xy B_yz = (AB)_xz Search for fast algorithms reduced to search for groups with certain properties –Fastest algorithm so far is O(n^2.38), same as Coppersmith/Winograd
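
The cyclic-group case in the first bullets is ordinary circular convolution computed through the FFT; a small NumPy sketch of that step follows (the non-abelian construction for matrix multiply replaces this FFT with a block-diagonalization of the group algebra).

```python
import numpy as np

def cyclic_convolve(p, q):
    """Multiply two polynomials modulo z^n - 1 (i.e. in C[G] for the cyclic
    group G of order n): map into the Fourier basis, multiply pointwise,
    and map back -- the 'embed, transform, multiply, extract' pattern."""
    return np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(q)))

# Example: (1 + z) * (1 + z) = 1 + 2z + z^2 in C[Z_4]
print(np.round(cyclic_convolve([1, 1, 0, 0], [1, 1, 0, 0]), 12))
```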

Fast Matrix Multiplication (2) (Cohn, Kleinberg, Szegedy, Umans) 1. Embed A, B in group algebra (exact) 2. Perform FFT (roundoff) 3. Reorganize results into new matrices (exact) 4. Multiply new matrices recursively (roundoff) 5. Reorganize results into new matrices (exact) 6. Perform IFFT (roundoff) 7. Extract C = AB from group algebra (exact)

Fast Matrix Multiplication (3) (D., Dumitriu, Holtz) Thm 1: Any algorithm of this class for C = AB is “numerically stable”: ||C_computed – C|| ≤ c n^d ε ||A|| ||B|| + O(ε²) –c and d are “modest” constants –Like Strassen Let ω be the exponent of matrix multiplication, i.e. no algorithm is faster than O(n^ω). Thm 2: For all η > 0 there exists an algorithm with complexity O(n^(ω+η)) that is numerically stable in the sense of Thm 1.

Conclusions Lots to do in Dense Linear Algebra –New numerical algorithms –Continuing architectural challenges Parallelism, performance tuning –Ease of use, software engineering Grant support, but success depends on contributions from community

Extra Slides

Goal 4: Improved Ease of Use Which do you prefer? CALL PDGESV( N,NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO) A \ B CALL PDGESVX( FACT, TRANS, N,NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO)

Goal 4: Improved Ease of Use Easy interfaces vs access to details –Some users want access to all details, because Peak performance matters Control over memory allocation –Other users want “simpler” interface Automatic allocation of workspace No universal agreement across systems on “easiest interface” Leave decision to higher level packages Keep expert driver / simple driver / computational routines Add wrappers for other languages –Fortran95, Java, Matlab, Python, even C –Automatic allocation of workspace Add wrappers to convert to “best” parallel layout
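
As a toy illustration of the simple-driver-wrapping idea at the Python level (a sketch, not the project's actual wrappers), pivot arrays, workspace and INFO handling can be hidden behind a one-call interface to the LAPACK driver exposed by SciPy:

```python
import numpy as np
from scipy.linalg import lapack

def solve(A, b):
    """Ease-of-use wrapper: the caller writes solve(A, b) instead of managing
    LU factors, pivot arrays and INFO codes from the underlying LAPACK driver."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(b, dtype=np.float64).reshape(len(A), -1)   # accept vectors or matrices
    lu, piv, x, info = lapack.dgesv(A, B)                     # LAPACK xGESV simple driver
    if info < 0:
        raise ValueError(f"illegal argument {-info} in dgesv")
    if info > 0:
        raise np.linalg.LinAlgError("matrix is singular to working precision")
    return x.reshape(np.shape(b))

# Example: close to the MATLAB-style A \ b the slide contrasts with PDGESV
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(solve(A, b))
```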

Goal 5: Better SW Engineering: What could go into Sca/LAPACK? For all linear algebra problems For all matrix structures For all data types For all programming interfaces Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc) For all architectures and networks Need to prioritize, automate!

Goal 5: Better SW Engineering How to map multiple SW layers to emerging HW layers? How much better are asynchronous algorithms? Are emerging PGAS languages better? Statistical modeling to limit performance tuning costs, improve use of shared clusters Only some things understood well enough for automation now –Telescoping languages, Bernoulli, Rose, FLAME, … Research Plan: explore above design space Development Plan to deliver code (some aspects) –Maintain core in F95 subset –Friendly wrappers for other programming environments –Use variety of source control, maintenance, development tools

Goal 6: Involve the Community To help identify priorities –More interesting tasks than we are funded to do –See www.netlib.org/lapack-dev for list To help identify promising algorithms –What have we missed? To help do the work –Bug reports, provide fixes –Again, more tasks than we are funded to do –Already happening: thank you!

CPU Trends Relative processing power will continue to double every 18 months 256 logical processors per chip in late 2010

Challenges For all large scale computing, not just linear algebra! Example … your laptop Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency

Commodity Processor Trends (annual increase / typical value in 2006 / predicted value in 2010 / typical value in 2020):
– Single-chip floating-point performance: 59% / 4 GFLOP/s / 32 GFLOP/s / 3300 GFLOP/s
– Memory bus bandwidth: 23% / 1 GWord/s = 0.25 word/flop / 3.5 GWord/s = 0.11 word/flop / 27 GWord/s
– Memory latency: (5.5%) / 70 ns = 280 FP ops = 70 loads / 50 ns = 1600 FP ops = 170 loads / 28 ns = 94,000 FP ops = 780 loads
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC.
Will our algorithms run at a high fraction of peak?

Parallel Processor Trends (annual increase / typical value in 2004 / predicted value in 2010 / typical value in 2020):
– # Processors: 20% / 4,000 / 12,000 / …
– Network bandwidth: 26% / 65 MWord/s = 0.03 word/flop / 260 MWord/s / …
– Network latency: (15%) / 5 μs = 20K FP ops / 2 μs = 64K FP ops / …
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC.
Will our algorithms scale up to more processors?

When is High Accuracy LA Possible? (1) (D., Dumitriu, Holtz) Model: fl(a op b) = (a op b)(1 + δ), |δ| ≤ ε –Not bit model, since ε small but arbitrary Goal: NASC for ∃ accurate LA algorithm Subgoal: NASC for ∃ accurate algorithm to evaluate p(x)

  { +, -,  }, exact comparisons, branches Basic Allowable Sets (BAS): –Z i = {x i = 0}, S ik = {x i + x k = 0}, D ik = {x i – x k = 0} V(p) allowable if V(p) =   BAS Thm 1: V(p) unallowable  p cannot be evaluated accurately Thm 2: D  (V(p) -  {A: allowable A  V(p)})    p cannot be evaluated accurately on domain D Thm 3: p(x) integer coeffs, x complex  V(p) allowable iff p can be evaluated accurately Classical Arithmetic (2) finite

“Black Box” Arithmetic (3) Ops = any set of polynomials q_i(x) –Ex: FMA q(x) = x_1 · x_2 + x_3 V(p) allowable if V(p) = a finite union of intersections of irreducible parts of the V(q_i) Thm 1: V(p) unallowable ⇒ p cannot be evaluated accurately Thm 2: D ∩ (V(p) – ∪{A: allowable, A ⊆ V(p)}) ≠ ∅ ⇒ p cannot be evaluated accurately on domain D Cor: No accurate LA alg exists for Toeplitz matrices using any finite set of arithmetic operations –Proof: Det(Toeplitz) contains irreducible components of any degree

Open Questions and Future Work (4) Complete decision procedure for ∃ accurate algorithms, in particular for real p, arbitrary “black box” operations, domains D Apply to more structured matrix classes Incorporate division, rational functions Incorporate perturbation theory –Conj: Accurate eval possible iff condition number has certain simple singularities Extend to interval arithmetic Math ArXiv math.NA/

Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time >.1 sec)

Timing of Eigensolvers (only matrices where time >.1 sec)

Accuracy Results (old vs new MRRR) max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Qᵀ – I|| / (n ε)

Timing of Eigensolvers (1.9 GHz IBM Power 5 + ESSL, only matrices where time >.01 sec, n>200)

Timing of Eigensolvers (1.9 GHz IBM Power 5 + ESSL, only matrices with clusters)

Timing of Eigensolvers (1.9 GHz IBM Power 5 + ESSL, only matrices without clusters)

Accuracy Results on Power 5 max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Qᵀ – I|| / (n ε)

Accuracy Results on Power 5, Old vs New Grail max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Qᵀ – I|| / (n ε)

Timing of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML)

Timing Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML)

Timing Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML) Matrices with tight clusters

Timing Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML) Matrices without tight clusters

Timing Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML) Random Matrices

Performance Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML) (1,2,1) Matrices

Performance Ratios of Eigensolvers (Cray XD1: 2.2 GHz Opteron + ACML) Practical Matrices

New GSVD Algorithm (Bai et al, UC Davis; PSVD, CSD on the way) Given m x n A and p x n B, factor A = U Σ_a X and B = V Σ_b X

Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang, Drmac (MS 3)

Plans for Summer 06 Release Byers HQR (Byers, Smith) MRRR (Voemel, Marques, Parlett) Hessenberg Reduction (Kressner) XBLAS (Li, Hida, Riedy) Iterative Refinement –Hida, Riedy, Li, Demmel –Dongarra, Langou RFP for packed Cholesky –Langou, Gustavson Bug fixes…

Plans for Summer 07 Generalized nonsymmetric eigenproblem –Reduction to condensed form –QZ –Reordering evals –Balancing –Everything inside xGGEV(X) –Reuse test/timing code? Sylvester –New functions => new test/timing