1
The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev Jim Demmel UC Berkeley 23 Feb 2007
2
Outline Motivation for new Sca/LAPACK Challenges (or research opportunities…) Goals of new Sca/LAPACK Highlights of progress –With some excursions …
3
Motivation LAPACK and ScaLAPACK are widely used –Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … –>68M web hits @ Netlib (incl. CLAPACK, LAPACK95) 35K hits/day Many ways to improve them, based on –Own algorithmic research –Enthusiastic participation of research community –User/vendor survey –Opportunities and demands of new architectures, programming languages New releases planned (NSF support)
4
Participants UC Berkeley: –Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, undergrads… U Tennessee, Knoxville –Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak Other Academic Institutions –UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara –TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb Research Institutions –CERFACS, LBL Industrial Partners –Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI
5
Challenges: Increasing Parallelism (figures: concurrency growth in the Top500; parallelism in your laptop — Intel just announced an 80-core, 1 Teraflop chip)
6
Challenges Increasing parallelism Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency Improving 59%/year vs 23%/year vs 5.5%/year –Floating point time << 1/Network BW << Network Latency Improving 59%/year vs 26%/year vs 15%/year Heterogeneity (performance and semantics) Asynchrony Unreliability
7
What do users want? High performance, ease of use, … Survey results at www.netlib.org/lapack-dev –Small but interesting sample –What matrix sizes do you care about? 1000s: 34% 10,000s: 26% 100,000s or 1Ms: 26% –How many processors, on distributed memory? >10: 34%, >100: 31%, >1000: 19% –Do you use more than double precision? Sometimes or frequently: 16% –Would Automatic Memory Allocation help? Very useful: 72%, Not useful: 14%
8
Goals of next Sca/LAPACK 1.Better algorithms –Faster, more accurate 2.Expand contents –More functions, more parallel implementations 3.Automate performance tuning 4.Improve ease of use 5.Better software engineering 6.Increased community involvement
9
Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible (possible class projects)
10
Missing Routines in Sca/LAPACK (LAPACK routine → ScaLAPACK routine)
Linear Equations
–LU: xGESV → PxGESV
–LU + iterative refine: xGESVX → missing
–Cholesky: xPOSV → PxPOSV
–LDL^T: xSYSV → missing
Least Squares (LS)
–QR: xGELS → PxGELS
–QR + pivot: xGELSY → missing
–SVD/QR: xGELSS → missing
–SVD/D&C: xGELSD → missing (intent?)
–SVD/MRRR: missing → missing
–QR + iterative refine.: missing → missing
Generalized LS
–LS + equality constr.: xGGLSE → missing
–Generalized LM: xGGGLM → missing
–Above + Iterative ref.: missing → missing
11
More missing routines (LAPACK routine → ScaLAPACK routine)
Symmetric EVD
–QR / Bisection+Invit: xSYEV / X → PxSYEV / X
–D&C: xSYEVD → PxSYEVD
–MRRR: xSYEVR → missing
Nonsymmetric EVD
–Schur form: xGEES / X → missing (driver)
–Vectors too: xGEEV / X → missing
SVD
–QR: xGESVD → PxGESVD
–D&C: xGESDD → missing (intent?)
–MRRR: missing → missing
–Jacobi: missing → missing
Generalized Symmetric EVD
–QR / Bisection+Invit: xSYGV / X → PxSYGV / X
–D&C: xSYGVD → missing (intent?)
–MRRR: missing → missing
Generalized Nonsymmetric EVD
–Schur form: xGGES / X → missing
–Vectors too: xGGEV / X → missing
Generalized SVD
–Kogbetliantz: xGGSVD → missing (intent)
–MRRR: missing → missing
12
Goal 1: Better Algorithms Faster –But provide “usual” accuracy, stability –… Or accurate for an important subclass More accurate –But provide “usual” speed –… Or at any cost
13
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang Extensions to QZ: –Kågström / Kressner Recursive blocked layouts for packed formats: –Gustavson / Kågström / Elmroth / Jonsson
14
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems –Faster and more accurate than previous algorithms –SIAM SIAG/LA Prize in 2006 –New sequential, first parallel versions out in 2006
15
Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)
19
Flop Count Ratios of Eigensolvers (2.2 GHz Opteron + ACML)
20
Run Time Ratios of Eigensolvers (2.2 GHz Opteron + ACML)
21
MFlop Rates of Eigensolvers (2.2 GHz Opteron + ACML)
22
Exploiting GPUs Numerous emerging co-processors –Cell, SSE, Grape, GPU, “physics coprocessor,” … When can we exploit them? –Little help if memory is bottleneck –Various attempts to use GPUs for dense linear algebra Bisection on GPUs for symmetric tridiagonal eigenproblem –Evaluate Count(x) = #(eigenvalues < x) for many x –Very little memory traffic –Speedups up to 100x (Volkov) 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4 Overall eigenvalue solver 6.8x faster
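A minimal NumPy sketch of the Count(x) function referred to above (names are illustrative, not LAPACK's): the number of negative pivots in the LDL^T factorization of T − x·I equals the number of eigenvalues of T below x. On a GPU, many shifts x are evaluated independently, which is why the kernel needs so little memory traffic.

```python
import numpy as np

def count_eigs_below(d, e, x):
    """Count(x) = #(eigenvalues of T < x) for the symmetric tridiagonal T
    with diagonal d and off-diagonal e, via the Sturm-count recurrence:
    count the negative pivots of the LDL^T factorization of T - x*I."""
    count = 0
    pivot = d[0] - x
    if pivot < 0.0:
        count += 1
    for i in range(1, len(d)):
        if pivot == 0.0:
            pivot = np.finfo(float).tiny   # crude guard; real codes use a pivmin threshold
        pivot = (d[i] - x) - e[i - 1] ** 2 / pivot
        if pivot < 0.0:
            count += 1
    return count

# T = tridiag(-1, 2, -1), 4x4: eigenvalues ~ 0.38, 1.38, 2.62, 3.62
d = np.array([2.0, 2.0, 2.0, 2.0])
e = np.array([-1.0, -1.0, -1.0])
print(count_eigs_below(d, e, 1.0))   # 1
print(count_eigs_below(d, e, 3.0))   # 3
```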
23
Parallel Runtimes of Eigensolvers (2.4 GHz Xeon Cluster + Ethernet)
24
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate Arbitrary precision versions of everything –Using your favorite multiple precision package
25
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals Newton’s method applied to solving f(x) = A·x − b = 0 –Extra-precise BLAS needed (LAWN#165)
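Spelled out (a restatement of the Newton's-method bullet above, not new material): each refinement step reuses the existing factorization to solve for a correction, and only the residual is computed in extra precision.

```latex
f(x) = Ax - b,\qquad
x_{k+1} = x_k - \big[f'(x_k)\big]^{-1} f(x_k) = x_k + A^{-1}(b - A x_k):
\quad
r_k = b - A x_k \ \text{(extra precision)},\quad
A\,d_k = r_k \ \text{(reuse the LU factors)},\quad
x_{k+1} = x_k + d_k .
```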
26
More Accurate: Solve Ax=b (figure: error of conventional Gaussian elimination vs. extra-precise iterative refinement; annotation n^(1/2))
27
Iterative Refinement: for speed What if double precision much slower than single? –Cell processor in Playstation 3 256 GFlops single, 25 GFlops double –Pentium SSE2: single twice as fast as double Given Ax=b in double precision –Factor in single, do refinement in double –If κ(A) < 1/ε_single, runs at speed of single 1.9x speedup on Intel-based laptop Applies to many algorithms, if difference large
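A minimal NumPy/SciPy sketch of this scheme (assumed names; not LAPACK's actual mixed-precision driver): do the O(n³) factorization in single precision, then apply cheap O(n²) refinement steps with double-precision residuals.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_single_refine_double(A, b, max_iter=10):
    """Factor A once in single precision, then refine the solution with
    double-precision residuals (a sketch of the slide's idea)."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    lu, piv = lu_factor(A.astype(np.float32))        # O(n^3) work in fast single

    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                 # residual in double precision
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
        if np.linalg.norm(d) <= np.finfo(np.float64).eps * np.linalg.norm(x):
            break                                     # correction negligible: converged
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = solve_single_refine_double(A, b)
# Normwise backward error should reach double-precision level for well-conditioned A
print(np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x)))
```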
28
Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang –More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat –Structured eigenproblems O(n 2 ) version of roots(p) –Gu, Chandrasekaran, Bindel et al Selected matrix polynomials: –Mehrmann
29
New algorithm for roots(p) To find roots of polynomial p –Roots(p) does eig(C(p)) –Costs O(n³), stable, reliable O(n²) Alternatives –Newton, Jenkins-Traub, Laguerre, … –Stable? Reliable? New: Exploit “semiseparable” structure of C(p) –Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration –Complexity drops from O(n³) to O(n²), stable in practice Related work: Gemignani, Bini, Pan, et al Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel C(p) = [ −p_1 −p_2 … −p_d ; 1 0 … 0 ; 0 1 … 0 ; … ; 0 … 1 0 ]
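For reference, a short sketch of the O(n³) baseline the slide starts from: roots computed as eigenvalues of the companion matrix C(p) in the form shown above (first row holds the negated coefficients, subdiagonal identity below). The O(n²) semiseparable QR variant is not reproduced here; names are illustrative.

```python
import numpy as np

def roots_via_companion(p):
    """Roots of x^d + p[0]*x^(d-1) + ... + p[d-1], as eigenvalues of the
    companion matrix C(p). This is the O(n^3) baseline, not the O(n^2)
    semiseparable algorithm described on the slide."""
    p = np.asarray(p, dtype=float)
    d = len(p)
    C = np.zeros((d, d))
    C[0, :] = -p                   # first row: negated coefficients
    C[1:, :-1] = np.eye(d - 1)     # subdiagonal identity
    return np.linalg.eigvals(C)

# (x - 1)(x - 2)(x + 3) = x^3 - 7x + 6  ->  p = [0, -7, 6]
print(np.sort(roots_via_companion([0.0, -7.0, 6.0])))   # [-3.  1.  2.]
```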
30
Goal 3 – Automate Performance Tuning Widely used in performance tuning of Kernels –ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas –FFTW – Fast Fourier Transform – www.fftw.org –Spiral – signal processing - www.spiral.net –OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski
31
Optimizing blocksizes for mat-mul Finding a Needle in a Haystack – So Automate
32
Goal 3 – Automate Performance Tuning Widely used in performance tuning of Kernels 1300 calls to ILAENV() to get block sizes, etc. –Never been systematically tuned Extend automatic tuning techniques of ATLAS, etc. to these other parameters –Automation important as architectures evolve Convert ScaLAPACK data layouts on the fly –Important for ease-of-use too
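A toy illustration of the empirical search this slide advocates (assumed names; nothing here is LAPACK's tuning machinery): time a candidate kernel over a grid of block sizes on the target machine and keep the fastest, with the block size playing the role of an ILAENV-style parameter.

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    """Deliberately naive blocked matrix multiply; nb is the parameter
    being tuned."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_block_size(n=256, candidates=(16, 32, 64, 128, 256)):
    """Empirically pick the fastest block size on this machine."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    best = None
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[1]:
            best = (nb, elapsed)
    return best

print(tune_block_size())   # e.g. (64, 0.02) -- machine dependent
```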
33
ScaLAPACK Data Layouts 1D Block Cyclic 1D Cyclic 2D Block Cyclic
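A small sketch of the 2D block-cyclic map behind these layouts (illustrative names, zero source offsets; ScaLAPACK provides its own index-to-process utilities, not reproduced here): which process row and column own global entry (i, j) for block size nb on a Pr × Pc grid.

```python
def owner_2d_block_cyclic(i, j, nb, p_rows, p_cols):
    """Process (row, col) owning global entry (i, j) of a matrix stored in
    a 2D block-cyclic layout with nb x nb blocks on a p_rows x p_cols
    process grid (0-based indices)."""
    return ((i // nb) % p_rows, (j // nb) % p_cols)

# Special cases of the same formula:
#   1D block cyclic over columns: p_rows = 1
#   1D cyclic over columns:       p_rows = 1 and nb = 1
nb, grid = 2, (2, 3)
for i in range(4):
    print([owner_2d_block_cyclic(i, j, nb, *grid) for j in range(6)])
```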
34
Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory. Speedups for using a 2D processor grid range from 2x to 8x. Cost of redistributing from 1D to best 2D layout: 1% - 10%
35
Conclusions Lots to do in Dense Linear Algebra –New numerical algorithms –Continuing architectural challenges Parallelism, performance tuning –Ease of use, software engineering Grant support, but success depends on contributions from community www.netlib.org/lapack-dev www.cs.berkeley.edu/~demmel
36
Extra Slides
37
Fast Matrix Multiplication (1) (Cohn, Kleinberg, Szegedy, Umans) Can think of fast convolution of polynomials p, q as –Map p (q) into the group algebra element Σ_i p_i z^i ∈ C[G] of the cyclic group G = { z^i } –Multiply elements of C[G] (use divide&conquer = FFT) –Extract coefficients For matrix multiply, need a non-abelian group satisfying the triple product property –There are subsets X, Y, Z of G where xyz = 1 with x ∈ X, y ∈ Y, z ∈ Z implies x = y = z = 1 –Map matrix A into the group algebra via Σ_{x,y} A_xy x⁻¹y, and B into Σ_{y′,z} B_y′z y′⁻¹z –Since x⁻¹y · y′⁻¹z = x⁻¹z iff y = y′, we get Σ_y A_xy B_yz = (AB)_xz Search for fast algorithms reduced to search for groups with certain properties –Fastest algorithm so far is O(n^2.38), same as Coppersmith/Winograd
38
Fast Matrix Multiplication (2) (Cohn, Kleinberg, Szegedy, Umans) 1.Embed A, B in group algebra (exact) 2.Perform FFT (roundoff) 3.Reorganize results into new matrices (exact) 4.Multiply new matrices recursively (roundoff) 5.Reorganize results into new matrices (exact) 6.Perform IFFT (roundoff) 7.Extract C = AB from group algebra (exact)
39
Fast Matrix Multiplication (3) (Demmel, Dumitriu, Holtz, Kleinberg) Thm 1: Any algorithm of this class for C = AB is “numerically stable” –‖C_comp − C‖ ≤ c · n^d · ε · ‖A‖ · ‖B‖ + O(ε²) –c and d are “modest” constants –Like Strassen Let ω be the exponent of matrix multiplication, i.e. no algorithm is faster than O(n^ω). Thm 2: For all η > 0 there exists an algorithm with complexity O(n^(ω+η)) that is numerically stable in the sense of Thm 1.
40
Commodity Processor Trends (annual increase | typical value in 2006 | predicted value in 2010 | typical value in 2020)
–Single-chip floating-point performance: 59% | 4 GFLOP/s | 32 GFLOP/s | 3300 GFLOP/s
–Memory bus bandwidth: 23% | 1 GWord/s = 0.25 word/flop | 3.5 GWord/s = 0.11 word/flop | 27 GWord/s = 0.008 word/flop
–Memory latency: (5.5%) | 70 ns = 280 FP ops = 70 loads | 50 ns = 1600 FP ops = 170 loads | 28 ns = 94,000 FP ops = 780 loads
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6. Will our algorithms run at a high fraction of peak?
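The derived entries in the table follow directly from the raw numbers; for the 2006 column, for example:

```latex
\frac{1\ \text{GWord/s}}{4\ \text{GFLOP/s}} = 0.25\ \tfrac{\text{word}}{\text{flop}},\qquad
70\ \text{ns}\times 4\ \text{GFLOP/s} = 280\ \text{FP ops},\qquad
70\ \text{ns}\times 1\ \text{GWord/s} = 70\ \text{loads}.
```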
41
Challenges For all large scale computing, not just linear algebra! Example … your laptop Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency –Floating point time << 1/Network BW << Network Latency
42
Parallel Processor Trends (annual increase | typical value in 2004 | predicted value in 2010 | typical value in 2020)
–# Processors: 20% | 4,000 | 12,000 | 3300 GFLOP/s
–Network bandwidth: 26% | 65 MWord/s = 0.03 word/flop | 260 MWord/s = 0.008 word/flop | 27 GWord/s = 0.008 word/flop
–Network latency: (15%) | 5 µs = 20K FP ops | 2 µs = 64K FP ops | 28 ns = 94,000 FP ops = 780 loads
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6. Will our algorithms scale up to more processors?
43
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems –Faster and more accurate than previous algorithms –New sequential, first parallel versions out in 2006
44
Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time >.1 sec)
47
Timing of Eigensolvers (only matrices where time >.1 sec)
48
Accuracy Results (old vs new MRRR): max_i ‖Tq_i − λ_i q_i‖ / (n ε ‖T‖) and ‖QQ^T − I‖ / (n ε)
49
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Extensions to QZ: –Kågström / Kressner Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang –Full nonsymmetric eigenproblem: n=1500: 3.43x faster HQR: 5x faster, Reduction: 14% faster –Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster –Sequential versions out in 2006
50
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang Extensions to QZ: –Kågström / Kressner –LAPACK Working Note (LAWN) #173 –On 26 real test matrices, speedups up to 11.9x, 4.4x average
51
Goal 4: Improved Ease of Use Which do you prefer?

CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

A \ B

CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )
52
Goal 4: Improved Ease of Use Easy interfaces vs access to details –Some users want access to all details, because Peak performance matters Control over memory allocation –Other users want “simpler” interface Automatic allocation of workspace No universal agreement across systems on “easiest interface” Leave decision to higher level packages Keep expert driver / simple driver / computational routines Add wrappers for other languages –Fortran95, Java, Matlab, Python, even C –Automatic allocation of workspace Add wrappers to convert to “best” parallel layout
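As a concrete illustration of the "simpler interface" end of this trade-off: in an existing high-level wrapper (NumPy/SciPy calling LAPACK underneath) the workspace, pivot array, and descriptors disappear. This is only an analogy for the proposed wrappers, not the Sca/LAPACK wrapper itself.

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)

# Workspace allocation, pivot storage, and (for a parallel wrapper) layout
# conversion are handled internally; the user writes the analogue of A \ b.
x = solve(A, b)
print(np.linalg.norm(A @ x - b))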
53
Goal 5: Better SW Engineering: What could go into Sca/LAPACK? For all linear algebra problems For all matrix structures For all data types For all programming interfaces Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc) For all architectures and networks Need to prioritize, automate!
54
Goal 5: Better SW Engineering How to map multiple SW layers to emerging HW layers? How much better are asynchronous algorithms? Are emerging PGAS languages better? Statistical modeling to limit performance tuning costs, improve use of shared clusters Only some things understood well enough for automation now –Telescoping languages, Bernoulli, Rose, FLAME, … Research Plan: explore above design space Development Plan to deliver code (some aspects) –Maintain core in F95 subset –Friendly wrappers for other programming environments –Use variety of source control, maintenance, development tools
55
Goal 6: Involve the Community To help identify priorities –More interesting tasks than we are funded to do –See www.netlib.org/lapack-dev for list To help identify promising algorithms –What have we missed? To help do the work –Bug reports, provide fixes –Again, more tasks than we are funded to do –Already happening: thank you!
56
Accuracy of Eigensolvers: max_i ‖Tq_i − λ_i q_i‖ / (n ε ‖T‖) and ‖QQ^T − I‖ / (n ε)
57
Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang
58
New GSVD Algorithm Bai et al, UC Davis; PSVD, CSD on the way Given m × n A and p × n B, factor A = U Σ_a X and B = V Σ_b X
59
Motivation LAPACK and ScaLAPACK are widely used –Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … –>63M web hits @ Netlib (incl. CLAPACK, LAPACK95) 35K hits/day
60
Impact (with NERSC, LBNL) Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). ScaLAPACK
61
Challenges For all large scale computing, not just linear algebra! Example …
62
Challenges For all large scale computing, not just linear algebra! Example … your laptop
63
CPU Trends Relative processing power will continue to double every 18 months 256 logical processors per chip in late 2010
64
Challenges For all large scale computing, not just linear algebra! Example … your laptop Exponentially growing gaps between –Floating point time << 1/Memory BW << Memory Latency Improving 59%/year vs 23%/year vs 5.5%/year
65
Accuracy of Eigensolvers: Old vs New MRRR: max_i ‖Tq_i − λ_i q_i‖ / (n ε ‖T‖) and ‖QQ^T − I‖ / (n ε)
66
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems –Faster and more accurate than previous algorithms –New sequential, first parallel versions out in 2006 –Numerical evidence shows DC faster if it “deflates” often, which is hard to predict in advance. So having both algorithms is important. –SVD still an open problem
67
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman –SIAM SIAG/LA Prize in 2003 –Sequential version out in 2006 –More on performance later
68
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang –Full nonsymmetric eigenproblem: n=1500: 3.43x faster HQR: 5x faster, Reduction: 14% faster
69
Dense nonsymmetric eigenvalue problem (figure panels: no vectors / all vectors). ARCH: Intel Pentium 4 (3.4 GHz), F77: GNU Fortran (GCC) 3.4.4, BLAS: libgoto_prescott32p-r1.00.so (one thread). Source: Julien Langou
70
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang –Full nonsymmetric eigenproblem: n=1500: 3.43x faster HQR: 5x faster, Reduction: 14% faster –Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster –Sequential versions out in 2006
71
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang Extensions to QZ: –Kågström / Kressner –LAPACK Working Note (LAWN) #173 –On 26 real test matrices, speedups up to 11.9x, 4.4x average
72
Comparison of ScaLAPACK QR and new parallel multishift QZ Execution times in secs for 4096 x 4096 random problems Ax = sx and Ax = sBx, using processor grids including 1-16 processors. Note: work(QZ) > 2 * work(QR) but Time(// QZ) << Time (//QR)!! Times include cost for computing eigenvalues and transformation matrices. Adlerborn-Kågström-Kressner, SIAM PP’2006
73
Goal 1a – Faster Algorithms (Highlights) MRRR algorithm for symmetric eigenproblem / SVD: –Parlett / Dhillon / Voemel / Marques / Willems Up to 10x faster HQR: –Byers / Mathias / Braman Faster Hessenberg, tridiagonal, bidiagonal reductions: –van de Geijn/Quintana, Howell / Fulton, Bischof / Lang Extensions to QZ: –Kågström / Kressner Recursive blocked layouts for packed formats: –Gustavson / Kågström / Elmroth / Jonsson –SIAM Review Article 2004
74
Recursive Layouts and Algorithms Still merges multiple elimination steps into a few BLAS 3 operations Best speedups for packed storage of symmetric matrices
75
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals –Extra-precise BLAS needed (LAWN#165) –“Guarantees” based on condition number estimates Condition estimate < 1/(n^(1/2) · ε) ⟹ reliable answer and tiny error bounds No bad bounds in 6.2M tests Can condition estimators lie?
76
Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization –Demmel/Diament/Malajovich (FCM 2001) But what if matrix multiply costs O(n²)? – More later
77
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –“Promise” the right answer for O(n²) additional cost –Iterative refinement with extra-precise residuals –Extra-precise BLAS needed (LAWN#165) –“Guarantees” based on condition number estimates –Get tiny componentwise bounds too Each x_i accurate Slightly different condition number –Extends to Least Squares –Release in 2006
78
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –Promise the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate –LAWNS # 169, 170 –Can be arbitrarily more accurate on tiny singular values –Yet faster than QR iteration!
79
Goal 1b – More Accurate Algorithms Iterative refinement for Ax=b, least squares –Promise the right answer for O(n²) additional cost Jacobi-based SVD –Faster than QR, can be arbitrarily more accurate Arbitrary precision versions of everything –Using your favorite multiple precision package –Quad, Quad-double, ARPREC, MPFR, … –Using Fortran 95 modules
80
Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible New functions (highlights) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang –More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat –Structured eigenproblems O(n 2 ) version of roots(p) –Gu, Chandrasekaran, Bindel et al Selected matrix polynomials: –Mehrmann How should we prioritize missing functions?
81
The Difficulty of Tuning SpMV: Sparse Matrix Vector Multiply // y <-- y + A*x for all A(i,j): y(i) += A(i,j) * x(j)
82
The Difficulty of Tuning SpMV // y <-- y + A*x for all A(i,j): y(i) += A(i,j) * x(j)
// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = row[i] to row[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t
Exploit 8x8 dense blocks
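The CSR loop above, written out as runnable code with explicit arrays (array names follow the slide: row[], J[], A[]); register blocking would replace the scalar inner loop with small dense block multiplies.

```python
import numpy as np

def spmv_csr(row, J, A, x, y):
    """y <- y + A*x for a matrix in compressed sparse row form:
    row[i]..row[i+1]-1 index the nonzeros of row i, J holds their column
    indices and A their values (as in the slide's pseudocode)."""
    for i in range(len(row) - 1):
        t = 0.0
        for k in range(row[i], row[i + 1]):
            t += A[k] * x[J[k]]
        y[i] += t
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [3, 0, 5]]
row = np.array([0, 2, 3, 5])
J   = np.array([0, 2, 1, 0, 2])
A   = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x   = np.array([1.0, 1.0, 1.0])
print(spmv_csr(row, J, A, x, np.zeros(3)))   # [5. 2. 8.]
```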
83
Speedups on Itanium 2: The Need for Search (figure: reference implementation in Mflop/s at 7.6% of peak vs. best blocked kernel at 31.1% of peak)
84
Speedups on Itanium 2: The Need for Search (figure: reference in Mflop/s at 7.6% of peak; best blocking 4x2 at 31.1% of peak)
85
SpMV Performance—raefsky3
87
More Surprises tuning SpMV More complex example Example: 3x3 blocking –Logical grid of 3x3 cells
88
Extra Work Can Improve Efficiency More complex example Example: 3x3 blocking –Logical grid of 3x3 cells –Pad with zeros –“Fill ratio” = 1.5 On Pentium III: 1.5x speedup! (2/3 time)
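A small sketch of how the fill ratio mentioned above can be measured for a candidate r×c block size (this padding cost is the quantity a register-blocking heuristic such as OSKI's weighs against the speed of the blocked kernel); the function name and toy pattern are illustrative.

```python
def fill_ratio(coords, r, c):
    """Stored entries after padding to r x c register blocks, divided by the
    true number of nonzeros; coords is an iterable of (i, j) positions.
    Values > 1 mean explicit zeros (extra flops) are introduced."""
    coords = list(coords)
    blocks = {(i // r, j // c) for (i, j) in coords}
    return len(blocks) * r * c / len(coords)

# Toy pattern: two nearly-full 3x3 blocks with a few entries missing.
coords = [(i, j) for i in range(3) for j in range(3) if (i, j) != (2, 2)]
coords += [(i + 3, j + 3) for i in range(3) for j in range(3) if i != j]
print(fill_ratio(coords, 1, 1))   # 1.0  (plain CSR: no padding)
print(fill_ratio(coords, 3, 3))   # ~1.29 (padding cost of 3x3 blocking)
```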
89
Tuning for Workloads BiCG - equal mix of A*x and A^T*y –3x1: Ax, A^T y = 1053, 343 Mflop/s –3x3: Ax, A^T y = 806, 826 Mflop/s Higher-level operation - (Ax, A^T y) kernel –3x1: 757 Mflop/s –3x3: 1400 Mflop/s
90
Optimizations Available in OSKI: Optimized Sparse Kernel Interface
Optimizations for SpMV
–Register blocking (RB): up to 4x over CSR
–Variable block splitting: 2.1x over CSR, 1.8x over RB
–Diagonals: 2x over CSR
–Reordering to create dense structure + splitting: 2x over CSR
–Symmetry: 2.8x over CSR, 2.6x over RB
–Cache blocking: 3x over CSR
–Multiple vectors (SpMM): 7x over CSR
–And combinations…
Sparse triangular solve
–Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels
–A·A^T*x, A^T·A*x: 4x over CSR, 1.8x over RB
–A *x: 2x over CSR, 1.5x over RB
Available stand alone or integrated into PETSc