The Future of LAPACK and ScaLAPACK (www.netlib.org/lapack-dev) Jim Demmel, UC Berkeley, 28 Sept 2005

Outline
Motivation
Participants
Goals:
 1. Better numerics (faster and more accurate algorithms)
 2. Expand contents (more functions, more parallel implementations)
 3. Automate performance tuning
 4. Improve ease of use
 5. Better maintenance and support
 6. Increase community involvement (this means you!)
Questions for the audience
Selected Highlights
Concluding poem

Motivation LAPACK and ScaLAPACK are widely used –Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … –Over 52M web hits at Netlib (incl. CLAPACK, LAPACK95) Many ways to improve them, based on –Own algorithmic research –Enthusiastic participation of research community –On-going user/vendor survey (url below) –Opportunities and demands of new architectures, programming languages New releases planned (NSF support) Your feedback desired

Participants UC Berkeley: –Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… U Tennessee, Knoxville –Jack Dongarra, Victor Eijkhout, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov Other Academic Institutions –UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara –TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb Research Institutions –CERFACS, LBL Industrial Partners –Cray, HP, Intel, MathWorks, NAG, SGI

Goal 1 – Better Numerics Fastest algorithm providing “standard” backward stability –MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems –Up to 10x faster HQR: Byers / Mathias / Braman –Extensions to QZ: Kågström / Kressner –Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang, Howell / Fulton –Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson

Goal 1 – Better Numerics Fastest algorithm providing “standard” backward stability –MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems –Up to 10x faster HQR: Byers / Mathias / Braman –Extensions to QZ: Kågström / Kressner –Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang, Howell / Fulton –Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson New: Most accurate algorithm providing “standard” speed –Iterative refinement for Ax=b, least squares Assume availability of Extra Precise BLAS (Li/Hida/…) –Jacobi SVD: Drmač/Veselić

Goal 1 – Better Numerics Fastest algorithm providing “standard” backward stability –MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems –Up to 10x faster HQR: Byers / Mathias / Braman –Extensions to QZ: Kågström / Kressner –Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang, Howell / Fulton –Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson New: Most accurate algorithm providing “standard” speed –Iterative refinement for Ax=b, least squares Assume availability of Extra Precise BLAS (Li/Hida/…) –Jacobi SVD: Drmač/Veselić Condition estimates for (almost) everything (ongoing)

Goal 1 – Better Numerics Fastest algorithm providing “standard” backward stability –MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems –Up to 10x faster HQR: Byers / Mathias / Braman –Extensions to QZ: Kågström / Kressner –Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang, Howell / Fulton –Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson New: Most accurate algorithm providing “standard” speed –Iterative refinement for Ax=b, least squares Assume availability of Extra Precise BLAS (Li/Hida/…) –Jacobi SVD: Drmač/Veselić Condition estimates for (almost) everything (ongoing) What is not fast or accurate enough?

What goes into Sca/LAPACK? For all linear algebra problems For all matrix structures For all data types For all programming interfaces Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc) For all architectures and networks Need to automate and prioritize!

Goal 2 – Expanded Content Make content of ScaLAPACK mirror LAPACK as much as possible –Full automation would be nice, but tools (telescoping languages, Bernoulli, Rose, FLAME, …) are not yet robust or general enough New functions (examples) –Updating / downdating of factorizations: Stewart, Langou –More generalized SVDs: Bai, Wang –More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat –Structured eigenproblems: O(n^2) version of roots(p) (Gu, Chandrasekaran, Zhu et al); selected matrix polynomials (Mehrmann) How should we prioritize missing functions?

Goal 3 – Automate Performance Tuning Not just BLAS: 1300 calls to ILAENV() to get block sizes, etc. –Never been systematically tuned Extend automatic tuning techniques of ATLAS, etc. to these other parameters –Automation important as architectures evolve Convert ScaLAPACK data layouts on the fly How important is peak performance?

Goal 4: Improved Ease of Use. Which do you prefer?

  CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

  A \ B

  CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )
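For comparison only (this is not part of the Sca/LAPACK plan), a high-level wrapper such as SciPy already gives the “A \ b” experience on top of LAPACK's xGESV; a minimal sketch, assuming SciPy's solve and low-level lapack.dgesv wrappers:

# Illustrative only: a one-line dense solve (like Matlab's A \ b) versus the
# lower-level LAPACK-style call, both ultimately driven by xGESV.
import numpy as np
from scipy.linalg import solve, lapack

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)

x = solve(A, b)                          # simple driver: hides LU, pivots, workspace

lu, piv, x2, info = lapack.dgesv(A, b)   # expert-style call exposes factors and info
assert info == 0 and np.allclose(x, x2)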

Goal 4: Improved Ease of Use, Software Engineering (1) Development versus Research –Development: practical approach to produce useful code –Research: If we could start over and do it right, how would we? Life after F77? –Fortran95, C, C++, Java, Matlab, Python, … Easy interfaces vs access to details –No universal agreement across systems on “easiest interface” –Leave decision to higher level packages –Keep expert driver / simple driver / computational routines Conclusion: Subset of F95 for core + wrappers for drivers –What subset? Recursion, for new data structures Modules, to produce multiple precision versions Environmental enquiries, to replace xLAMCH –Wrappers for Fortran95, Java, Matlab, Python, … even for CLAPACK Automatic memory allocation of workspace

Goal 4: Improved Ease of Use, Software Engineering (2) Why not full F95 for core? –Would make interfacing to other languages and packages harder –Some users want control over memory allocation Why not C for core? –High cost/benefit ratio for full rewrite –Performance Automation would be nice –Use Babel/SIDL to produce native looking interfaces when possible

Precisions beyond double (1)
Range of designs possible:
–Just run in quad (or some other fixed precision)
–Support codes like:
   #bits = 32
   Repeat
      #bits = 2 * #bits
      Solve(A, b, x, error_bound, #bits)
   Until error_bound < tol
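A minimal sketch of this doubling loop, assuming the third-party mpmath package for arbitrary-precision arithmetic; the Solve step and the error bound below are illustrative placeholders, not proposed LAPACK interfaces:

# Sketch: double the working precision until a crude residual-based error
# bound drops below the requested tolerance.
from mpmath import mp, matrix, lu_solve

def solve_adaptive(A_rows, b_vals, tol='1e-30', max_bits=2048):
    tol = mp.mpf(tol)
    bits = 32
    x = None
    while bits < max_bits:
        bits *= 2                     # Repeat: #bits = 2 * #bits
        mp.prec = bits                # set working precision (binary digits)
        A = matrix(A_rows)
        b = matrix(b_vals)
        x = lu_solve(A, b)            # Solve(A, b, x, error_bound, #bits)
        r = A * x - b
        # crude error bound: infinity norm of the scaled residual
        rmax = max(abs(r[i]) for i in range(r.rows))
        bmax = max(abs(b[i]) for i in range(b.rows))
        if rmax / bmax < tol:         # Until error_bound < tol
            return x, bits
    return x, bits

# Example: a mildly ill-conditioned 2x2 system
x, bits_used = solve_adaptive([[1, 1], [1, 1.000000001]], [2, 3])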

Precisions beyond double (2) Easiest approach – fixed precision –Use F95 modules to produce any precision on request –Could use QD, ARPREC, GMP, … –Keep current memory allocation (twiddle GMP…) Next easier approach – maximum precision –Build maximum allowable precision on request –Pass in precision parameter, up to this amount More flexible (and difficult) approach –Choose any precision at run time –Dynamically allocate all variables Most aggressive approach –New algorithms that minimize work to get desired prec. What do users want? –Compatibility with symbolic manipulation systems?

Goal 4: Improved Ease of Use, Software Engineering (3) Research Issues –May or may not impact development How to map multiple software layers to emerging architectural layers? Are emerging HPCS languages better? How much can we automate? Do we keep having to write Gaussian elimination over and over again? Statistical modeling to limit performance tuning costs, improve use of shared clusters

Goal 5: Better Maintenance and Support Website for user feedback and requests New developer and discussion forums URL: –Includes NSF proposal Version control and bug tracking system Automatic Build and Test environment Wide variety of supported platforms Cooperation with vendors Anything else desired?

Goal 6: Involve the Community To help identify priorities –More interesting tasks than we are funded to do –See for list To help identify promising algorithms –What have we missed? To help do the work –Bug reports, provide fixes –Again, more tasks than we are funded to do –Already happening: thank you! –We retain final decisions on content Anything else?

Some Highlights Putting more of LAPACK into ScaLAPACK ScaLAPACK performance on 1D vs 2D grids MRRR “Holy Grail” algorithm for symmetric EVD Iterative Refinement for Ax=b O(n 2 ) polynomial root finder Generalized SVD

Missing Drivers in Sca/LAPACK (LAPACK vs. ScaLAPACK)

Linear Equations (LU, Cholesky, LDL^T):
  LAPACK:    xGESV, xPOSV, xSYSV
  ScaLAPACK: PxGESV, PxPOSV, missing

Least Squares (QR, QR+pivot, SVD/QR, SVD/D&C, SVD/MRRR, QR + iterative refine.):
  LAPACK:    xGELS, xGELSY, xGELSS, xGELSD, missing
  ScaLAPACK: PxGELS, missing driver, missing (intent), missing

Generalized LS (LS + equality constr., Generalized LM, Above + iterative ref.):
  LAPACK:    xGGLSE, xGGGLM, missing

More missing drivers (LAPACK vs. ScaLAPACK)

Symmetric EVD (QR / Bisection+Invit, D&C, MRRR):
  LAPACK:    xSYEV / X, xSYEVD, xSYEVR
  ScaLAPACK: PxSYEV / X, missing (intent), missing

Nonsymmetric EVD (Schur form, vectors too):
  LAPACK:    xGEES / X, xGEEV / X
  ScaLAPACK: missing driver

SVD (QR, D&C, MRRR, Jacobi):
  LAPACK:    xGESVD, xGESDD, missing
  ScaLAPACK: PxGESVD, missing (intent), missing

Generalized Symmetric EVD (QR / Bisection+Invit, D&C, MRRR):
  LAPACK:    xSYGV / X, xSYGVD, missing
  ScaLAPACK: PxSYGV / X, missing (intent), missing

Generalized Nonsymmetric EVD (Schur form, vectors too):
  LAPACK:    xGGES / X, xGGEV / X
  ScaLAPACK: missing

Generalized SVD (Kogbetliantz, MRRR):
  LAPACK:    xGGSVD, missing
  ScaLAPACK: missing (intent), missing

Missing matrix types in ScaLAPACK: Symmetric, Hermitian, triangular – Band, Packed; Positive Definite – Packed; Orthogonal, Unitary – Packed

Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect, 2GB Memory. Speedups from using a 2D processor grid range from 2x to 8x.

Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB Memory Cost of redistributing matrix to optimal layout is small

MRRR Algorithm for eig(tridiagonal) and svd(bidiagonal) “Multiple Relatively Robust Representations” 1999 Householder Award honorable mention for Dhillon O(nk) flops to find k eigenvalues/vectors of an n×n tridiagonal matrix (similar for SVD) –Minimum possible! Naturally parallelizable Accurate –Small residuals: ||T x_i – λ_i x_i|| = O(n ε) –Orthogonal eigenvectors: |x_i^T x_j| = O(n ε) Hence nickname: “Holy Grail” 2 versions –LAPACK 3.0: large error on “hard” cases –Next release: fixed! How should we tradeoff speed and accuracy?
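These two accuracy criteria can be checked directly; the sketch below uses SciPy's LAPACK-based tridiagonal eigensolver as a stand-in (illustrative only, not the Sca/LAPACK MRRR code):

# Check residuals and orthogonality for a random symmetric tridiagonal matrix.
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 500
rng = np.random.default_rng(1)
d = rng.standard_normal(n)            # diagonal entries
e = rng.standard_normal(n - 1)        # off-diagonal entries

w, Q = eigh_tridiagonal(d, e)         # eigenvalues w, eigenvectors Q

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
eps = np.finfo(float).eps

residual = np.max(np.linalg.norm(T @ Q - Q * w, axis=0)) / (n * eps)
orthogonality = np.linalg.norm(Q.T @ Q - np.eye(n)) / (n * eps)
print(residual, orthogonality)        # both should be modest for a good solver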

Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time >.1 sec)

Timing of Eigensolvers (only matrices where time >.1 sec)

Accuracy Results (old vs new Grail): max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Q^T – I|| / (n ε)

Accuracy Results (Grail vs QR vs DC): max_i ||T q_i – λ_i q_i|| / (n ε) and ||Q Q^T – I|| / (n ε)

More Accurate: Solve Ax=b. Plot comparing conventional Gaussian elimination with extra precise iterative refinement (errors on the order of n^(1/2) ε vs. ε).

What’s new? Need extra precision (beyond double) –Part of new BLAS standard –Cost = O(n^2) extra per right-hand-side, vs O(n^3) to factor Get tiny componentwise bounds too –Error in x_i small compared to |x_i|, not just max_j |x_j| “Guarantees” based on condition number estimates –No bad bounds in 6.2M tests –Different condition number for componentwise bounds –Traditional iterative refinement can “fail” –Only get “matrix close to singular” message when answer wrong? Extends to least squares Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan, … LAPACK Working Note #165

Can condition estimators lie? Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization –Demmel/Diament/Malajovich (FCM2001) But what if matrix multiply costs O(n 2 )? – Cohn/Umans/Kleinberg (FOCS 2003/5)

New algorithm for roots(p)
To find roots of polynomial p:
–roots(p) does eig(C(p)): costs O(n^3), stable, reliable
–O(n^2) alternatives (Newton, Jenkins-Traub, Laguerre, …): stable? reliable?
New: exploit “semiseparable” structure of C(p)
–Low rank of any submatrix of the upper triangle of C(p) is preserved under QR iteration
–Complexity drops from O(n^3) to O(n^2)
Related work: Gemignani, Bini, Pan, et al.
Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel

C(p) = [ -p_1  -p_2   …   -p_d
           1     0    …     0
           …     …    …     …
           0     …     1    0  ]
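For illustration, here is a sketch of the classical O(n^3) companion-matrix approach that the new method accelerates (a hypothetical helper, not the O(n^2) semiseparable code):

# Build the companion matrix C(p) of a monic polynomial and take its
# eigenvalues; this is the O(n^3) baseline discussed above.
import numpy as np

def roots_via_companion(p):
    """Roots of x^d + p[0]*x^(d-1) + ... + p[d-1], via eig of the companion matrix."""
    p = np.asarray(p, dtype=float)
    d = len(p)
    C = np.zeros((d, d))
    C[0, :] = -p                       # first row: -p_1 ... -p_d
    C[1:, :-1] = np.eye(d - 1)         # ones on the subdiagonal
    return np.linalg.eigvals(C)

# Example: x^3 - 6x^2 + 11x - 6 has roots 1, 2, 3
print(np.sort(roots_via_companion([-6.0, 11.0, -6.0])))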

Properties of new roots(p) First (?) algorithm that –Is O(n^2) –Is backward stable, in the matrix sense –Is backward stable, in the sense that the computed roots are the exact roots of a slightly perturbed input polynomial Depends on balancing = scaling roots by a constant α –Still need to automate choice of α Byers, Mathias, Braman; Tisseur, Higham, Mackey, …

New GSVD Algorithm: Timing Comparisons. 1.0 GHz Itanium-2 (2 GB RAM); Intel's Math Kernel Library (incl. BLAS, LAPACK). Bai et al, UC Davis. PSVD, CSD on the way.

Conclusions Lots to do in Dense Linear Algebra –New numerical algorithms –Continuing architectural challenges Parallelism, performance tuning Grant support, but success depends on contributions from community

Extra Slides

Downa Dating Should some equations be forgot when overdetermīned? Should some equations be forgot using hyperbolic sines? With hyperbolic sines, my dear with hyperbolic sines, We’ll hope to get some boundedness with hyperbolic sines. With apologies to Robert Burns & Nick Higham

New GSVD Algorithm (XGGQSV): U^T A Z = Σ_a (0 R), V^T B Z = Σ_b (0 R); A: M×N, B: P×N. Modified Van Loan's method: (1) Pre-processing: reveal rank of (A^T; B^T)^T. (2) Split QR: reduce two upper triangular matrices into one. (3) CSD: Cosine-Sine Decomposition. (4) Post-processing: assemble the resulting matrices. Workspace = O(max(M, N, P)) + 5L^2, where L = rank(B). Outperforms current XGGSVD in LAPACK.

Profile of SGGQSV 1.0GHz Itanium–2 (2GB RAM); Intel's Math Kernel Library (incl. BLAS, LAPACK)

Related New Routines CSD (XORCSD) –U^T Q_1 Z = Σ_a, V^T Q_2 Z = Σ_b –Based on Van Loan's method (1985) –Dominated by the cost of the SVD –Workspace = O(max(P+M, N)) PSVD (XGGPSV) –U^T A B V = Σ –Based on work by Golub, Solna, Van Dooren (2000) –Workspace = O(max(M, P, N))
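As an illustration of the CSD building block, the sketch below assumes SciPy's cossin wrapper around LAPACK's CSD routines (not the proposed XORCSD interface):

# Cosine-Sine Decomposition of a partitioned orthogonal matrix: Q = u @ cs @ vdh.
import numpy as np
from scipy.linalg import cossin, qr

n, p, q = 6, 3, 3
rng = np.random.default_rng(2)
Q, _ = qr(rng.standard_normal((n, n)))   # a random orthogonal matrix

u, cs, vdh = cossin(Q, p=p, q=q)         # CSD of the 2x2 block partition
assert np.allclose(Q, u @ cs @ vdh)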

Benchmark Details AMD 1.2 GHz Athlon, 2GB mem, Redhat + Intel compiler Compute all eigenvalues and eigenvectors of a symmetric tridiagonal matrix T Codes compared: –qr: QR iteration from LAPACK: dsteqr –dc: Cuppen’s Divide&Conquer from LAPACK: dstedc –gr: New implementation of MRRR algorithm (“Grail”) –ogr: MRRR from LAPACK 3.0: dstegr (“old Grail”)

Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time > 0.1 sec)

Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time >.1 sec)

More Accurate: Solve Ax=b
Old idea: use Newton's method on f(x) = Ax - b
–On a linear system?
–Roundoff in Ax - b makes it interesting (“nonlinear”)
–Iterative refinement: Snyder, Wilkinson, Moler, Skeel, …
   Repeat
      r = A x - b      … compute with extra precision
      Solve A d = r    … using LU factorization of A
      Update x = x - d
   Until “accurate enough” or no progress
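A minimal sketch of this refinement loop, assuming SciPy's LU wrappers and NumPy's longdouble for the extra-precision residual (on platforms where longdouble is wider than double); illustrative only, not the LAPACK refinement code:

# Factor A once in double precision, then repeatedly form the residual in
# extended precision and solve for a correction using the stored LU factors.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, b, iters=5):
    lu, piv = lu_factor(A)                         # one O(n^3) factorization
    x = lu_solve((lu, piv), b)                     # initial solve
    A_hi = A.astype(np.longdouble)                 # extra-precision copies
    b_hi = b.astype(np.longdouble)
    for _ in range(iters):
        r = A_hi @ x.astype(np.longdouble) - b_hi  # r = A x - b, extra precision
        d = lu_solve((lu, piv), r.astype(np.float64))  # Solve A d = r, O(n^2)
        x = x - d                                  # Update x = x - d
        if np.linalg.norm(d) <= np.finfo(float).eps * np.linalg.norm(x):
            break                                  # "accurate enough" / no progress
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = refine(A, b)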

Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB Memory Speedups from 1.33x to 11.5x

Times obtained on: 60 processors, Dual Opteron 1.4GHz, 64-bit machine 2GB Memory, Myrinet Interconnect

Times obtained on: 60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB Memory Speedups from 1.8x to 7.5x

Times obtained on: 60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB Memory

Times obtained on: 60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB Memory Speedups from 2.2x to 6x

Times obtained on: 60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect 2GB Memory

Times obtained on: 60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network Block size: 64 Speedups from 2.2x to 2.8x

Times obtained on: 60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network Block size: 64

The Future of LAPACK and ScaLAPACK Jim Demmel UC Berkeley Householder XVI

OO-like interfaces (VE) Aim: native looking interfaces in modern languages (C++, F90, Java, Python…) Overloading to have identical interfaces for all data types (precision/storage) Overloading to omit certain parameters (stride, transpose,…) Expose data only when necessary (automatic allocation of temporaries, pivoting permutation,….) Mechanism: Use Babel/SIDL to interface to existing F77 code base

Example of HLL interface (VE)

// C++ example
Lapack::PDTridiagonalMatrix A = Lapack::PDTridiagonalMatrix::_create();
x = (double*) malloc(SIZE*sizeof(double));
y = (double*) malloc(SIZE*sizeof(double));
A.MatVec(x,y);   // or A.MatVec(MatTranspose,x,y);
A.Factor();
A.Solve(y);

! F90 example
! Create matrix from user array
allocate(data_array(20,25))
! Fill in the array…
call CreateWithArray(A, data_array)
! Retrieve the array of a (factored) matrix
call GetArray(A, mat)
mat_array => mat%d_data
print *, 'pivots', mat_array(0,0), mat_array(1,1)

Lots of details omitted here!