Automatic Performance Tuning of Numerical Kernels
BeBOP: Berkeley Benchmarking and OPtimization
James Demmel, EECS and Math, UC Berkeley
Katherine Yelick, EECS, UC Berkeley
Support from NSF, DOE SciDAC, Intel
Performance Tuning Participants
Other Faculty: Michael Jordan, William Kahan, Zhaojun Bai (UCD)
Researchers: Mark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)
PhD Students: Rich Vuduc, Yozo Hida, Geoff Pike
Undergrads: Brian Gaeke, Jen Hsu, Shoaib Kamil, Suh Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon Thompson
Outline
Motivation, History, Related Work
Tuning Dense Matrix Operations
Tuning Sparse Matrix Operations
Results on Sun Ultra 1/170
Recent results on P4
Recent results on Itanium
Future Work
Motivation, History, Related Work
Performance Tuning
Motivation: performance of many applications is dominated by a few kernels
Example: CAD → nonlinear ODEs → nonlinear equations → linear equations → matrix multiply (matrix-by-matrix or matrix-by-vector, dense or sparse)
Example: information retrieval by LSI → compress term-document matrix → … → sparse matrix-vector multiply
Many other examples (not all linear algebra)
Conventional Performance Tuning
Motivation: performance of many applications dominated by a few kernels
Vendor or user hand-tunes kernels
Drawbacks:
  Very time-consuming and tedious work
  Even with intimate knowledge of architecture and compiler, performance is hard to predict
  Growing list of kernels to tune
    Example: new BLAS (Basic Linear Algebra Subroutines) standard
  Must be redone for every architecture and compiler
  Compiler technology often lags architecture
  Not just a compiler problem:
    Best algorithm may depend on the input, so some tuning must happen at run time
    Not all algorithms are semantically or mathematically equivalent
Automatic Performance Tuning
Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them (see the search sketch after this slide)
What is a space of algorithms? Depending on kernel and input, it may vary:
  instruction mix and order
  memory access patterns
  data structures
  mathematical formulation
When do we search?
  Once per kernel and architecture
  At compile time
  At run time
  All of the above
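To make the "search by running" step concrete, here is a minimal C sketch (illustrative only, not actual PHIPAC or ATLAS code; the matrix size N and the two loop orders are arbitrary choices) that generates two variants of the same kernel, times each, and keeps the faster one:

  /* Two variants of C += A*B that differ only in loop order; time both
     and select the faster one, as an automatic tuner would. */
  #include <stdio.h>
  #include <time.h>

  enum { N = 400 };
  static double A[N][N], B[N][N], C[N][N];

  static void mm_ijk(void) {                  /* variant 1: i-j-k order */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  static void mm_ikj(void) {                  /* variant 2: i-k-j order */
      for (int i = 0; i < N; i++)
          for (int k = 0; k < N; k++)
              for (int j = 0; j < N; j++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  int main(void) {
      void (*variant[2])(void) = { mm_ijk, mm_ikj };
      const char *name[2] = { "ijk", "ikj" };
      int best = 0;
      double best_t = 1e30;
      for (int v = 0; v < 2; v++) {
          clock_t t0 = clock();
          variant[v]();
          double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
          printf("%s: %.3f s\n", name[v], t);
          if (t < best_t) { best_t = t; best = v; }
      }
      printf("selected variant: %s\n", name[best]);
      return 0;
  }

In a real tuner the variant list comes from a code generator and the search may cover hundreds of candidates, but the select-by-timing loop is essentially this.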
Some Automatic Tuning Projects
PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)
ATLAS (Dongarra, Whaley); incorporated in Matlab
XBLAS (Demmel, X. Li)
Sparsity (Yelick, Im)
FFTs and Signal Processing:
  FFTW; won the 1999 Wilkinson Prize for Numerical Software
  SPIRAL; extensions to other transforms, DSPs
  UHFFT; extensions to higher dimensions, parallelism
Special session at ICCS 2001, organized by Yelick and Demmel; proceedings available
Pointers to other automatic tuning projects available
An Example: Dense Matrix Multiply
Tuning Techniques in PHIPAC
Multi-level tiling (cache, registers)
Copy optimization
Removing false dependencies
Exploiting multiple registers
Minimizing pointer updates
Loop unrolling
Software pipelining
Exposing independent operations
Untiled ("Naïve") n x n Matrix Multiply
Assume two levels of memory: fast and slow

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      for k = 1 to n
         {read B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Untiled ("Naïve") n x n Matrix Multiply - Analysis
Assume two levels of memory: fast and slow

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      for k = 1 to n
         {read B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Count the number of slow memory references:
  m =  n^3    to read each column of B n times
    +  n^2    to read each row of A once for each i
    + 2n^2    to read and write each element of C once
    =  n^3 + 3n^2
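For reference, the same untiled algorithm written as plain C (a sketch, assuming row-major n x n arrays; the {read/write} annotations above become implicit cache traffic):

  /* Untiled C = C + A*B for row-major n x n matrices.  The inner loop
     walks down a column of B, so every column of B is re-read from
     slow memory for each row i. */
  void matmul_naive(int n, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double cij = C[i*n + j];
              for (int k = 0; k < n; k++)
                  cij += A[i*n + k] * B[k*n + j];
              C[i*n + j] = cij;
          }
  }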
Tiled (Blocked) Matrix Multiply
Consider A, B, C to be N by N matrices of b by b subblocks, where b = n/N is called the block size

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
Tiled (Blocked) Matrix Multiply - Analysis
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:
  m = N*n^2   to read each block of B N^3 times (N^3 * (n/N) * (n/N))
    + N*n^2   to read each block of A N^3 times
    + 2n^2    to read and write each block of C once
    = (2N + 2) * n^2  ≈  (2/b) * n^3
This is about b = n/N times fewer slow memory references than the untiled algorithm
Assumes all three b x b blocks from A, B, C fit in fast memory: 3b^2 <= M = fast memory size
Decrease in slow memory references is limited to a factor of O(sqrt(M))
Theorem (Hong & Kung, 1981): "any" reorganization of this algorithm (using only associativity) has at least O(n^3/sqrt(M)) slow memory references
Apply tiling recursively for multiple levels of the memory hierarchy
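A direct C rendering of the blocked algorithm (a sketch; for brevity it assumes the block size b divides n evenly):

  /* Tiled C = C + A*B for row-major n x n matrices with block size b.
     The outer loops over (ii, jj, kk) correspond to the i, j, k loops
     over N x N blocks above; the inner three loops do the b x b block
     multiply while the blocks stay in fast memory. */
  void matmul_tiled(int n, int b, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += b)
          for (int jj = 0; jj < n; jj += b)
              for (int kk = 0; kk < n; kk += b)
                  for (int i = ii; i < ii + b; i++)
                      for (int j = jj; j < jj + b; j++) {
                          double cij = C[i*n + j];
                          for (int k = kk; k < kk + b; k++)
                              cij += A[i*n + k] * B[k*n + j];
                          C[i*n + j] = cij;
                      }
  }

Choosing b so that 3b^2 <= M (three blocks resident in fast memory) gives the (2/b)*n^3 traffic estimate above.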
PHIPAC: Copy Optimization
Copy input operands:
  Reduces cache conflicts
  Constant array offsets for fixed-size blocks
  Exposes page-level locality
[Figure: original matrix, column major (numbers are addresses), reorganized into 2x2 blocks]
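A minimal sketch of the idea (a hypothetical helper, not PHIPAC's generated code): copy one b x b block of a column-major matrix into a small contiguous buffer, so the inner kernel sees unit-stride accesses and constant offsets.

  /* Copy the b x b block with top-left corner (i0, j0) out of a
     column-major matrix A (leading dimension lda) into a contiguous
     buffer.  The compute kernel then addresses buf[] with constant
     offsets, avoiding cache conflicts between widely separated
     columns of A. */
  void copy_block(int b, int i0, int j0, int lda, const double *A, double *buf)
  {
      for (int j = 0; j < b; j++)
          for (int i = 0; i < b; i++)
              buf[i + j*b] = A[(i0 + i) + (j0 + j)*lda];
  }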
PHIPAC: Removing False Dependencies
Using local variables, reorder operations to remove false dependencies

Before:
  a[i]   = b[i]   + c;
  a[i+1] = b[i+1] * d;

After:
  float f1 = b[i];
  float f2 = b[i+1];
  a[i]   = f1 + c;
  a[i+1] = f2 * d;

The compiler must assume a false read-after-write hazard between a[i] and b[i+1], since a and b might alias. With some compilers, you can say explicitly (via a flag or pragma) that a and b are not aliased.
PHIPAC: Exploiting Multiple Registers
Reduce demands on memory bandwidth by pre-loading into local variables

Before:
  while( … ) {
      *res++ = filter[0]*signal[0]
             + filter[1]*signal[1]
             + filter[2]*signal[2];
      signal++;
  }

After:
  float f0 = filter[0];
  float f1 = filter[1];
  float f2 = filter[2];
  while( … ) {
      *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
      signal++;
  }

also: register float f0 = …;
PHIPAC: Minimizing Pointer Updates
Replace pointer updates for strided memory addressing with constant array offsets

Before:
  f0 = *r8; r8 += 4;
  f1 = *r8; r8 += 4;
  f2 = *r8; r8 += 4;

After:
  f0 = r8[0];
  f1 = r8[4];
  f2 = r8[8];
  r8 += 12;
PHIPAC: Loop Unrolling
Expose instruction-level parallelism

  float f0 = filter[0], f1 = filter[1], f2 = filter[2];
  float s0 = signal[0], s1 = signal[1], s2 = signal[2];
  *res++ = f0*s0 + f1*s1 + f2*s2;
  do {
      signal += 3;
      s0 = signal[0]; res[0] = f0*s1 + f1*s2 + f2*s0;
      s1 = signal[1]; res[1] = f0*s2 + f1*s0 + f2*s1;
      s2 = signal[2]; res[2] = f0*s0 + f1*s1 + f2*s2;
      res += 3;
  } while( … );
PHIPAC: Software Pipelining of Loops
Mix instructions from different iterations
  Keeps functional units "busy"
  Reduces stalls from dependent instructions
[Diagram: unpipelined schedule runs Load(i), Mult(i), Add(i) back to back for each iteration; the pipelined schedule overlaps iterations, e.g. issuing Load(i+2), Mult(i+1), and Add(i) together, with Load(i+3), Load(i+4), Mult(i+3), … following.]
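In source form the transformation looks roughly like the following hand-written C sketch (an illustration of the idea; compilers and code generators apply it at the instruction level): the loads for iteration i+1 are issued while the multiply-add for iteration i is still in flight.

  /* Software-pipelined dot product: overlap the loads for the next
     iteration with the arithmetic for the current one. */
  double dot_pipelined(int n, const double *a, const double *b)
  {
      if (n == 0) return 0.0;
      double sum = 0.0;
      double ai = a[0], bi = b[0];          /* prologue: first loads    */
      for (int i = 0; i < n - 1; i++) {
          double a_next = a[i + 1];         /* loads for iteration i+1  */
          double b_next = b[i + 1];
          sum += ai * bi;                   /* ...overlap this multiply-add */
          ai = a_next;
          bi = b_next;
      }
      sum += ai * bi;                       /* epilogue: last iteration */
      return sum;
  }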
PHIPAC: Exposing Independent Operations
Hide instruction latency
  Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion
  Balance the instruction mix (what functional units are available?)

  f1 = f5 * f9;
  f2 = f6 + f10;
  f3 = f7 * f11;
  f4 = f8 + f12;
Tuning pays off – PHIPAC (Bilmes, Asanovic, Demmel)
Tuning pays off – ATLAS (Dongarra, Whaley)
Extends the applicability of PHIPAC
Incorporated in Matlab (with the rest of LAPACK)
Search for optimal register tile sizes on Sun Ultra: the 2-by-3 tile size is fastest, despite more registers being available
Search for optimal L0 block size in dense matmul: 4% of versions exceed 60% of peak on a Pentium II-300
High precision dense mat-vec multiply (XBLAS)
High Precision Algorithms (XBLAS)
Double-double: a high-precision word represented as a pair of doubles
Many variations on these algorithms; we currently use Bailey's

Exploiting Extra-wide Registers
Suppose s(1), …, s(n) have f-bit fractions and SUM has an F > f bit fraction
Consider the following algorithm for S = sum_{i=1..n} s(i):
  Sort so that |s(1)| >= |s(2)| >= … >= |s(n)|
  SUM = 0; for i = 1 to n: SUM = SUM + s(i); end for; sum = SUM
Theorem (D., Hida): suppose F < 2f (less than double precision)
  If n <= 2^(F-f) + 1, then the error is <= 1.5 ulps
  If n  = 2^(F-f) + 2, then the error is <= 2^(2f-F) ulps (can be >= 1)
  If n >= 2^(F-f) + 3, then the error can be arbitrary (S != 0 but sum = 0)
Examples:
  s(i) double (f = 53), SUM double extended (F = 64): accurate if n <= 2^(64-53) + 1 = 2049
  Dot product of single-precision x(i) and y(i): s(i) = x(i)*y(i) (f = 2*24 = 48), SUM double extended (F = 64): accurate if n <= 2^(64-48) + 1 = 65537
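A minimal C sketch of the extra-wide-accumulator idea (it assumes long double is the x86 80-bit extended format with a 64-bit fraction, which the C standard does not guarantee): accumulate double-precision terms in the wider register and round once at the end.

  /* Sum n double-precision values (f = 53) in an extended-precision
     accumulator (F = 64 for x86 long double).  By the theorem above,
     the result is accurate to about 1.5 ulps whenever
     n <= 2^(F-f) + 1 = 2049 (with suitably ordered summands). */
  double sum_extended(int n, const double *s)
  {
      long double acc = 0.0L;
      for (int i = 0; i < n; i++)
          acc += (long double)s[i];
      return (double)acc;    /* one final rounding back to double */
  }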
Tuning pays off – FFTW (Frigo, Johnson) Tuning ideas applied to signal processing (DFT) Also incorporated in Matlab
Tuning Sparse Matrix Computations
Matrix #2 – raefsky3 (FEM/Fluids)
Matrix #22 (cont'd) – bayer02 (chemical process simulation)
Matrix #40 – gupta1 (linear programming)
Matrix #40 (cont’d) – gupta1 (linear programming)
Tuning Sparse Matrix-Vector Multiply
Sparsity: optimizes y = A*x for a particular sparse A (Im and Yelick)
Algorithm space:
  Different code organization, instruction mixes
  Different register blockings (change the data structure and fill of A)
  Different cache blockings
  Different numbers of columns of x
  Different matrix orderings
Software and papers available
How Sparsity Tunes y = A*x
Register blocking:
  Store the matrix as dense r x c blocks
  Precompute the performance in Mflops of dense A*x for various register block sizes r x c
  Given A, sample it to estimate the fill if A is blocked for varying r x c
  Choose r x c to minimize the estimated running time, proportional to Fill/Mflops
  Store explicit zeros in the dense r x c blocks; unroll
  (See the register-blocked kernel sketch after this slide.)
Cache blocking:
  Useful when the source vector x is enormous
  Store the matrix as sparse 2^k x 2^l blocks
  Search over 2^k x 2^l cache blocks to find the fastest
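To make register blocking concrete, here is a sketch of a matrix-vector multiply for a matrix stored in 2x2 block compressed sparse row (BCSR) format, the kind of fully unrolled inner kernel produced for a chosen r x c = 2x2 (an illustration with assumed array names, not Sparsity's generated code):

  /* y = y + A*x for A stored as 2x2 blocks (BCSR):
     brow_ptr[i] .. brow_ptr[i+1]-1  index the blocks in block row i,
     bcol_idx[b]                     is the block column of block b,
     val[4*b .. 4*b+3]               holds block b in row-major order
                                     (explicit zero fill included).   */
  void spmv_bcsr_2x2(int n_brows, const int *brow_ptr, const int *bcol_idx,
                     const double *val, const double *x, double *y)
  {
      for (int i = 0; i < n_brows; i++) {
          double y0 = y[2*i], y1 = y[2*i + 1];     /* kept in registers */
          for (int b = brow_ptr[i]; b < brow_ptr[i + 1]; b++) {
              const double *v = &val[4*b];
              double x0 = x[2*bcol_idx[b]];
              double x1 = x[2*bcol_idx[b] + 1];
              y0 += v[0]*x0 + v[1]*x1;             /* unrolled 2x2 block */
              y1 += v[2]*x0 + v[3]*x1;
          }
          y[2*i]     = y0;
          y[2*i + 1] = y1;
      }
  }

Each loaded source element of x is reused across both rows of a block and the two destination values stay in registers for a whole block row; the cost is the explicit zeros (fill) stored to complete partial blocks.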
Register-Blocked Performance of SPMV on Dense Matrices (up to 12x12)
[Figure: best register-blocked vs. unblocked Mflops on four platforms]
  333 MHz Sun Ultra IIi: 70 vs. 35 Mflops
  800 MHz Pentium III: 175 vs. 105 Mflops
  1.5 GHz Pentium 4: 425 vs. 310 Mflops
  800 MHz Itanium: 250 vs. 110 Mflops
Which Other Sparse Operations Can We Tune?
General matrix-vector multiply A*x
  Possibly many vectors x
Symmetric matrix-vector multiply A*x
Solve a triangular system of equations T^(-1)*x
y = A^T*A*x
  Kernel of information retrieval via LSI (SVD)
  Same number of memory references as A*x: y = sum_i (A(i,:))^T * (A(i,:) * x)  (see the sketch after this slide)
Future work:
  A^2*x, A^k*x
    Kernel of information retrieval used by Google
    Includes Jacobi, SOR, …
    Changes the calling algorithm
  A^T*M*A
    Matrix triple product
    Used in multigrid solvers
What does SciDAC need?
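A sketch of the y = A^T*(A*x) kernel for A in compressed sparse row (CSR) format, computed row by row as in the formula above so that each row of A is brought in from main memory only once (the second pass over the row hits in cache). This is illustrative code with assumed array names, not the tuned Sparsity kernel.

  /* y = A^T * (A * x), where A is m x n in CSR format
     (row_ptr, col_idx, val); x and y have length n. */
  void ata_times_x(int m, int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
  {
      for (int j = 0; j < n; j++)
          y[j] = 0.0;
      for (int i = 0; i < m; i++) {
          double t = 0.0;
          for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
              t += val[k] * x[col_idx[k]];          /* t = A(i,:) * x        */
          for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
              y[col_idx[k]] += val[k] * t;          /* y += t * (A(i,:))^T   */
      }
  }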
Test Matrices
General sparse matrices:
  Up to n = 76K, nnz = 3.21M
  From many application areas
  1 – dense; 2 to 17 – FEM; 18 to 39 – assorted; 41 to 44 – linear programming; 45 – LSI
Symmetric matrices:
  Subset of the general matrices
  1 – dense; 2 to 8 – FEM; 9 to 13 – assorted
Lower triangular matrices:
  Obtained by running SuperLU on a subset of the general sparse matrices
  1 – dense; 2 to 13 – FEM
Details on the test matrices at the end of the talk
Results on Sun Ultra 1/170
Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 1 RHS
Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 9 RHS
Speed up from Cache Blocking on LSI matrix on Sun Ultra
Recent Results on P4 using icc and gcc
Speedup of SPMV from Sparsity on P4/icc-5.0.1
Single vector speedups on P4 by matrix type – best r x c
Performance of SPMV from Sparsity on P4/icc-5.0.1
Speed up from Cache Blocking on LSI matrix on P4
Fill for SPMV from Sparsity on P4/icc-5.0.1
Multiple vector speedups on P4
Multiple vector speedups on P4 – by matrix type
Multiple Vector Performance on P4
Symmetric Sparse Matrix-Vector Multiply on P4 (vs naïve full = 1)
Sparse Triangular Solve (Matlab’s colmmd ordering) on P4
A T *A on P4 (Accesses A only once)
Preliminary Results on Itanium using ecc
Speedup of SPMV from Sparsity on Itanium/ecc-5.0.1
Single vector speedups on Itanium by matrix type
Raw Performance of SPMV from Sparsity on Itanium
Fill for SPMV from Sparsity on Itanium
Improvements to Register Block Size Selection
The initial heuristic for choosing the best r x c block was biased toward the diagonal of the performance plot
This didn't matter on the Sun, but it does on the P4 and Itanium, since performance there is so "non-diagonally dominant"
  Matrix 8: chose 2x2 (164 Mflops); better: 3x1 (196 Mflops)
  Matrix 9: chose 2x2 (164 Mflops); better: 3x1 (213 Mflops)
Multiple vector speedups on Itanium
Multiple vector speedups on Itanium – by matrix type
Multiple Vector Performance on Itanium
Speed up from Cache Blocking on LSI matrix on Itanium
Future Work
Exploit new architectures
  Itanium:
    128 (82-bit) floating point registers
    fused multiply-add instruction
    predicated instructions
    rotating registers for software pipelining
    prefetch instructions
    three levels of cache
  McKinley
Tune the current and a wider set of kernels
Improve heuristics, e.g. choice of r x c
Incorporate into applications: PETSc, library, …
Further automate performance tuning
  Generation of algorithm space generators
Impact on architecture, via changed benchmarks?
Background on Test Matrices
Sparse Matrix Benchmark Suite (1/3)
#  | Matrix Name | Problem Domain                  | Dimension | No. Non-zeros
1  | dense       | Dense matrix                    | 1,…       | … M
2  | raefsky3    | Fluid structure interaction     | 21,…      | … M
3  | inaccura    | Accuracy problem                | 16,…      | … M
4  | bcsstk35*   | Stiff matrix automobile frame   | 30,…      | … M
5  | venkat01    | Flow simulation                 | 62,…      | … M
6  | crystk02*   | FEM crystal free-vibration      | 13,…      | … k
7  | crystk03*   | FEM crystal free-vibration      | 24,…      | … M
8  | nasasrb*    | Shuttle rocket booster          | 54,…      | … M
9  | 3dtube*     | 3-D pressure tube               | 45,…      | … M
10 | ct20stif*   | CT20 engine block               | 52,…      | … M
11 | bai         | Airfoil eigenvalue calculation  | 23,…      | … k
12 | raefsky4    | Buckling problem                | 19,…      | … M
13 | ex11        | 3-D steady flow problem         | 16,…      | … M
14 | rdist1      | Chemical process simulation     | 4,…       | … k
15 | vavasis3    | 2-D PDE problem                 | 41,…      | … M
Note: * indicates a symmetric matrix.
Sparse Matrix Benchmark Suite (2/3)
#  | Matrix Name | Problem Domain                     | Dimension | No. Non-zeros
16 | orani678    | Economic modeling                  | 2,…       | … k
17 | rim         | FEM fluid mechanics problem        | 22,…      | … M
18 | memplus     | Circuit simulation                 | 17,…      | … k
19 | gemat11     | Power flow                         | 4,…       | … k
20 | lhr10       | Chemical process simulation        | 10,…      | … k
21 | goodwin*    | Fluid mechanics problem            | 7,…       | … k
22 | bayer02     | Chemical process simulation        | 13,…      | … k
23 | bayer10     | Chemical process simulation        | 13,…      | … k
24 | coater2     | Simulation of coating flows        | 9,…       | … k
25 | finan512*   | Financial portfolio optimization   | 74,…      | … k
26 | onetone2    | Harmonic balance method            | 36,…      | … k
27 | pwt*        | Structural engineering             | 36,…      | … k
28 | vibrobox*   | Vibroacoustics                     | 12,…      | … k
29 | wang4       | Semiconductor device simulation    | 26,…      | … k
30 | lnsp3937    | Fluid flow modeling                | 3,…       | … k
Sparse Matrix Benchmark Suite (3/3)
#  | Matrix Name | Problem Domain                     | Dimension        | No. Non-zeros
31 | lns3937     | Fluid flow modeling                | 3,…              | … k
32 | sherman5    | Oil reservoir modeling             | 3,…              | … k
33 | sherman3    | Oil reservoir modeling             | 5,…              | … k
34 | orsreg1     | Oil reservoir modeling             | 2,…              | … k
35 | saylr4      | Oil reservoir modeling             | 3,…              | … k
36 | shyy161     | Viscous flow calculation           | 76,…             | … k
37 | wang3       | Semiconductor device simulation    | 26,…             | … k
38 | mcfe        | Astrophysics                       | …                | … k
39 | jpwh991     | Circuit physics problem            | 991              | 6,027
40 | gupta1*     | Linear programming                 | 31,…             | … M
41 | lpcreb      | Linear programming                 | 9,648 x 77,…     | … k
42 | lpcred      | Linear programming                 | 8,926 x 73,…     | … k
43 | lpfit2p     | Linear programming                 | 3,000 x 13,…     | … k
44 | lpnug20     | Linear programming                 | 15,240 x 72,…    | … k
45 | lsi         | Latent semantic indexing           | 10 k x 255 k     | 3.7 M
Matrix #2 – raefsky3 (FEM/Fluids)
Matrix #2 (cont’d) – raefsky3 (FEM/Fluids)
Matrix #22 - bayer02 (chemical process simulation)
Matrix #22 (cont'd) – bayer02 (chemical process simulation)
Matrix #27 - pwt (structural engineering)
Matrix #27 (cont’d)- pwt (structural engineering)
Matrix #29 – wang4 (semiconductor device simulation)
Matrix #29 (cont'd) – wang4 (semiconductor device sim.)
Matrix #40 – gupta1 (linear programming)
Matrix #40 (cont’d) – gupta1 (linear programming)
Symmetric Matrix Benchmark Suite
#  | Matrix Name | Problem Domain                     | Dimension | No. Non-zeros
1  | dense       | Dense matrix                       | 1,…       | … M
2  | bcsstk35    | Stiff matrix automobile frame      | 30,…      | … M
3  | crystk02    | FEM crystal free vibration         | 13,…      | … k
4  | crystk03    | FEM crystal free vibration         | 24,…      | … M
5  | nasasrb     | Shuttle rocket booster             | 54,…      | … M
6  | 3dtube      | 3-D pressure tube                  | 45,…      | … M
7  | ct20stif    | CT20 engine block                  | 52,…      | … M
8  | gearbox     | Aircraft flap actuator             | 153,…     | … M
9  | cfd2        | Pressure matrix                    | 123,…     | … M
10 | finan512    | Financial portfolio optimization   | 74,…      | … k
11 | pwt         | Structural engineering             | 36,…      | … k
12 | vibrobox    | Vibroacoustic problem              | 12,…      | … k
13 | gupta1      | Linear programming                 | 31,…      | … M
Lower Triangular Matrix Benchmark Suite
#  | Matrix Name | Problem Domain                          | Dimension | No. Non-zeros
1  | dense       | Dense matrix                            | 1,…       | … k
2  | ex11        | 3-D fluid flow                          | 16,…      | … M
3  | goodwin     | Fluid mechanics, FEM                    | 7,…       | … k
4  | lhr10       | Chemical process simulation             | 10,…      | … k
5  | memplus     | Memory circuit simulation               | 17,…      | … M
6  | orani678    | Finance                                 | 2,…       | … k
7  | raefsky4    | Structural modeling                     | 19,…      | … M
8  | wang4       | Semiconductor device simulation, FEM    | 26,…      | … M
Lower triangular factor: Matrix #2 – ex11
Lower triangular factor: Matrix #3 - goodwin
Lower triangular factor: Matrix #4 – lhr10