Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures. Ananth Grama and Ahmed Sameh, Department of Computer Science, Purdue University. Linear Solvers Grant Kickoff Meeting, 9/26/06.

Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
Project Overview
– Identify sweet spots in the algorithm-architecture-programming model space for efficient sparse linear system solvers.
– Design a new class of highly scalable sparse solvers suited to petascale HPC systems.
– Develop analytical frameworks for performance prediction and projection.
Methodology
– Design generalized sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
– Evaluate architectural features and their impact on scalable solver performance.
– Evaluate performance and productivity aspects of programming models, namely PGAS languages (CAF, UPC) and MPI.
Challenges and Impact
– Generalizing the space of parallel sparse linear system solvers.
– Analysis and implementation on parallel platforms.
– Performance projection to the petascale.
– Guidance for architecture and programming model design / performance envelope.
– Benchmarks and libraries for HPCS.
Milestones / Schedule
– Final deliverable: comprehensive evaluation of the scaling properties of existing and novel sparse solvers.
– Six-month target: comparative performance of solvers on multicore SMPs and clusters.
– Twelve-month target: evaluation of these solvers on the Cray X1, BlueGene, and JS20/21, for CAF/UPC/MPI implementations.

Introduction A critical aspect of high productivity is identifying points/regions in the algorithm / architecture / programming-model space that are amenable to efficient implementation on petascale systems. This project aims to identify such points for commonly used sparse linear system solvers, and to develop more robust novel solvers. These novel solvers emphasize reducing memory and remote accesses at the expense of (possibly) higher FLOP counts, yielding much better actual performance.

Project Rationale
– Sparse system solvers govern the overall performance of many CSE applications on HPC systems.
– The design of HPC architectures and programming models should be influenced by their suitability for such solvers and related kernels.
– The extreme need for concurrency on novel architectural models requires a fundamental re-examination of conventional sparse solvers.

Typical Computational Kernels for PDEs [diagram: a computational loop over time steps t_k – Integration → Newton Iteration → Linear system solvers]
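To make the nesting concrete, here is a minimal Python sketch (not part of the original slides): one implicit Euler step solved by Newton's method, where every Newton update requires a sparse linear solve. The callback names `residual` and `jacobian` and the heat-equation test case are hypothetical illustrations.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def implicit_euler_step(u, dt, residual, jacobian, tol=1e-10, max_newton=20):
    # Advance u by one step of u' = f(u): solve g(v) = v - u - dt*f(v) = 0.
    v = u.copy()
    for _ in range(max_newton):              # Newton iteration
        g = residual(v, u, dt)
        if np.linalg.norm(g) < tol:
            break
        J = jacobian(v, dt)                  # sparse Jacobian at the current iterate
        dv = spla.spsolve(J.tocsc(), -g)     # the inner linear-system solve
        v += dv
    return v

# Hypothetical example: linear heat equation u' = L u, so J = I - dt*L.
n = 100
L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) * (n + 1) ** 2
f = lambda v: L @ v
residual = lambda v, u, dt: v - u - dt * f(v)
jacobian = lambda v, dt: sp.eye(n) - dt * L
u1 = implicit_euler_step(np.ones(n), 1e-4, residual, jacobian)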

Fluid Structure Interaction

NESSIE – Nanoelectronics Simulation Environment: coupled transport/electrostatics; multi-scale, multi-physics, multi-method.
– Numerical parallel algorithms: linear solver (SPIKE), eigenpair solver (TraceMin), preconditioning strategies, …
– Mathematical methodologies: finite element method, mode decomposition, multi-scale and non-linear numerical schemes, …
– Applications shown: 3D CNTs, 3D molecular devices, 3D Si nanowires, 3D III/V devices, 2D MOSFETs. (E. Polizzi)

Simulation, Model Reduction and Real-Time Control of Structures

Fluid-Solid Interaction

Project Goals
– Develop generalizations of direct and iterative solvers, e.g., the SPIKE polyalgorithm.
– Implement these generalizations on various architectures (multicore, multicore SMPs, multicore SMP aggregates) and programming models (PGAS languages, messaging APIs).
– Analytically quantify performance and project to petascale platforms.
– Compare relative performance, identify architecture / programming-model features, and guide algorithm / architecture / programming-model co-design.

Background Personnel:
– Ahmed Sameh, Samuel Conte Professor of Computer Science, has worked on the development of parallel numerical algorithms for four decades.
– Ananth Grama, Professor and University Scholar, has worked on both numerical aspects of sparse solvers and analytical frameworks for parallel systems.
– (To be named – Postdoctoral Researcher)* will be primarily responsible for implementation and benchmarking.
* We have identified three candidates for this position and will shortly be hiring one of them.

Background… Technical
– We have built extensive infrastructure for parallel sparse solvers, including the SPIKE parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners.
– We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 Blade servers, BlueGene/L, and the Cray X1.

Background… Technical…
– We have initiated installation of Co-Array Fortran and Unified Parallel C on our machines and have begun porting our toolkits to these PGAS languages.
– We have extensive experience in analyzing the performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.

Technical Highlights The SPIKE Toolkit

SPIKE: Introduction
– Engineering problems usually produce large sparse linear systems.
– A banded structure (or banded with low-rank perturbations) is often obtained after reordering, e.g., RCM reordering.
– SPIKE partitions the banded matrix into block tridiagonal form.
– Each partition is associated with one CPU or one node, enabling multilevel parallelism.

SPIKE: Introduction…

SPIKE decomposition [figure]: for A of size n × n with coupling blocks B_j, C_j of size m × m, m << n, the system AX = F is transformed into SX = diag(A_1^{-1}, …, A_p^{-1}) F, where S carries the spikes V_j, W_j. The reduced system has size (p−1) × 2m; its solution is used to retrieve the full solution.
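Written out in standard SPIKE notation, the relations behind this figure are (a restatement of the slide, not new material):

\[
A = D\,S, \qquad D = \operatorname{diag}(A_1,\dots,A_p), \qquad S = D^{-1}A,
\]
\[
V_j = A_j^{-1}\begin{pmatrix} 0 \\ B_j \end{pmatrix}, \qquad
W_j = A_j^{-1}\begin{pmatrix} C_j \\ 0 \end{pmatrix},
\]

so that AX = F becomes SX = diag(A_1^{-1}, …, A_p^{-1}) F, and only the m × m top and bottom tips of the spikes V_j, W_j enter the reduced system of order 2m(p−1).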

SPIKE: A Hybrid Algorithm
– The spikes can be computed: explicitly (fully or partially), on the fly, or approximately.
– The diagonal blocks can be solved: directly (dense LU, Cholesky, or their sparse counterparts) or iteratively (with a preconditioning strategy).
– The reduced system can be solved: directly (recursive SPIKE), iteratively (with a preconditioning scheme), or approximately (truncated SPIKE).
– The choice depends on the properties of the matrix and/or the platform architecture.

The SPIKE algorithm – hierarchy of computational modules (systems dense within the band):
Level 3: SPIKE
Level 2: LAPACK blocked algorithms
Level 1: Primitives for banded matrices (our own)
Level 0: BLAS3 (matrix-matrix primitives)

SPIKE versions – algorithm (E = explicit, R = recursive, T = truncated, F = on the fly) versus factorization:
P (LU with pivoting):
– E: explicit generation of spikes; reduced system solved iteratively with a preconditioner.
– R: explicit generation of spikes; reduced system solved directly using recursive SPIKE.
– F: implicit generation of the reduced system, which is solved on the fly using an iterative method.
L (LU without pivoting):
– E: explicit generation of spikes; reduced system solved iteratively with a preconditioner.
– R: explicit generation of spikes; reduced system solved directly using recursive SPIKE.
– T: truncated generation of spike tips (V_b exact, W_t approximate); reduced system solved directly.
– F: implicit generation of the reduced system, which is solved on the fly using an iterative method.
U (LU and UL without pivoting):
– T: truncated generation of spike tips (V_b, W_t exact); reduced system solved directly.
– F: implicit generation of the reduced system, which is solved on the fly using an iterative method with preconditioning.
A (alternating LU / UL):
– E: explicit generation of spikes using the new partitioning; reduced system solved iteratively with a preconditioner.
– T: truncated generation of spikes using the new partitioning; reduced system solved directly.

SPIKE Hybrids – each variant is named by three characters:
1. SPIKE version: R = recursive, E = explicit, F = on-the-fly, T = truncated.
2. Factorization: without pivoting – L = LU, U = LU & UL, A = alternating LU & UL; with pivoting – P = LU.
3. Solution improvement: 0 = direct solver only, 2 = iterative refinement, 3 = outer BiCGStab iterations.
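As a small illustration (not part of the SPIKE toolkit itself), the three-character names used on later slides, such as RL0, RL3, and TL3, can be decoded mechanically from the scheme above; this Python helper is purely hypothetical:

SPIKE_VERSION = {"R": "recursive", "E": "explicit", "F": "on-the-fly", "T": "truncated"}
FACTORIZATION = {"L": "LU without pivoting", "U": "LU and UL without pivoting",
                 "A": "alternating LU / UL", "P": "LU with pivoting"}
IMPROVEMENT = {"0": "direct solver only", "2": "iterative refinement",
               "3": "outer BiCGStab iterations"}

def decode_spike_name(name):
    # Map a name such as "RL3" to (version, factorization, solution improvement).
    return SPIKE_VERSION[name[0]], FACTORIZATION[name[1]], IMPROVEMENT[name[2]]

print(decode_spike_name("RL3"))
# ('recursive', 'LU without pivoting', 'outer BiCGStab iterations')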

SPIKE “on-the-fly”
– Does not require generating the spikes explicitly; ideally suited for banded systems with large sparse bands.
– The reduced system is solved iteratively, with or without a preconditioning strategy.
[Figure: block structure of the reduced system S omitted.]
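The Python sketch below illustrates the matrix-free principle behind the on-the-fly variant using SciPy stand-ins, not the actual SPIKE reduced system: the operator is never assembled; only its action (a banded solve applied after a cheap stencil product) is exposed to a Krylov solver. All names and sizes are assumptions for illustration.

import numpy as np
from scipy.linalg import solve_banded
from scipy.sparse.linalg import LinearOperator, gmres

n, m = 2000, 2                               # system size, half-bandwidth (assumed)
rng = np.random.default_rng(0)

# A diagonally dominant banded matrix stored in LAPACK band format.
ab = rng.random((2 * m + 1, n))
ab[m, :] += 2.0 * (2 * m + 1)                # strengthen the main diagonal

def apply_operator(x):
    # Action of M^{-1} A on x, where neither factor is formed explicitly:
    # A is a cheap 3-point stencil, M is the banded matrix above.
    y = 3.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return solve_banded((m, m), ab, y)       # banded solve performed "on the fly"

op = LinearOperator((n, n), matvec=apply_operator)
b = rng.random(n)
x, info = gmres(op, b)
print("converged" if info == 0 else f"gmres info = {info}")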

Numerical Experiments (systems dense within the band)
– Computing platforms: a 4-node Linux Xeon cluster and a 512-processor IBM-SP.
– Performance comparison with LAPACK and ScaLAPACK.

SPIKE: Scalability [table: speed improvement of SPIKE (RL0) over ScaLAPACK on the IBM-SP versus number of processors, for N = 0.5M, 1M, 2M, and 4M; b = 401; one RHS].

SPIKE partitioning – 4-processor example [figure].
– Partitioning 1: four partitions A_1..A_4 with coupling blocks B_1..B_3 and C_2..C_4; one partition per processor; factorizations (without pivoting): LU, LU and UL, UL.
– Partitioning 2: three partitions A_1..A_3 with coupling blocks B_1, B_2 and C_2, C_3; processors 1, {2,3}, 4; factorizations (without pivoting): LU, UL (p = 2) and LU (p = 3), UL.

SPIKE: Small number of processors
– ScaLAPACK needs at least 4-8 processors to perform as well as LAPACK.
– SPIKE benefits from the LU-UL strategy and realizes a speed improvement over LAPACK on 2 or more processors.
– Platform: 4-node Intel Xeon Linux cluster with InfiniBand interconnect; two 3.2 GHz processors per node; 4 GB of memory per node; 1 MB cache; 64-bit arithmetic; Intel Fortran, Intel MPI, and Intel MKL libraries for LAPACK and ScaLAPACK.
– Test system: n = 960,000, b = 201, diagonally dominant.

General sparse systems [figure: sparsity pattern after RCM reordering]
– Science and engineering problems often produce large sparse linear systems.
– A banded structure (or banded with low-rank perturbations) is often obtained after reordering.
– While most ILU preconditioners depend on reordering strategies that minimize fill-in in the factorization stage, we propose to:
  – extract a narrow banded matrix, via a reordering-and-dropping strategy, to be used as a preconditioner (see the sketch below), and
  – make use of an improved SPIKE “on-the-fly” scheme that is ideally suited for banded systems that are sparse within the band.
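A minimal Python sketch of the reordering-and-dropping step, with SciPy's RCM routine standing in for the reordering strategies named above; the bandwidth k and the synthetic test matrix are assumptions.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def extract_banded_preconditioner(A, k):
    # Return (perm, M): M is the band |i - j| <= k of the RCM-reordered A.
    perm = reverse_cuthill_mckee(A.tocsr(), symmetric_mode=False)
    A_rcm = A.tocsr()[perm, :][:, perm]
    coo = A_rcm.tocoo()
    keep = np.abs(coo.row - coo.col) <= k    # drop everything outside the band
    M = sp.csr_matrix((coo.data[keep], (coo.row[keep], coo.col[keep])),
                      shape=A_rcm.shape)
    return perm, M

# Tiny synthetic example.
A = sp.random(500, 500, density=0.01, format="csr", random_state=1) + sp.eye(500) * 10.0
perm, M = extract_banded_preconditioner(A, k=15)
print(f"kept {M.nnz} of {A.nnz} nonzeros in the banded preconditioner")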

Sparse parallel direct solvers on banded systems [table: reordering, factorization, and solve times for PARDISO (1-2 CPUs on 1 node), SuperLU (1-8 CPUs on 1-4 nodes), and MUMPS (1-8 CPUs on 1-4 nodes); entries marked * suffered memory swap due to too much fill-in; timing values garbled in the transcript]. Test matrix: N = 432,000, b = 177, nnz = 7,955,116, fill-in of the band: 10.4%.

Multilevel Parallelism: SPIKE calling MKL-PARDISO for banded systems that are sparse within the band [figure: SPIKE distributes the partitions A_1..A_4, with coupling blocks B_j, C_j, across nodes 1-4; PARDISO handles each partition within a node]. This SPIKE hybrid scheme exhibits better performance than other parallel direct sparse solvers used alone.

SPIKE-MKL “on-the-fly” for systems that are sparse within the band
– N = 432,000, b = 177, nnz = 7,955,116, sparsity of the band: 10.4%. MUMPS: time (2 nodes) = … s; time (4 nodes) = 39.6 s. For narrow banded systems, SPIKE will consider the matrix dense within the band; reordering schemes for minimizing the bandwidth can be used if necessary.
– N = 471,800, b = 1455, nnz = 9,499,744, sparsity of the band: 1.4%. Good scalability using the “on-the-fly” SPIKE scheme.

A Preconditioning Approach (see the sketch below)
1. Reorder the matrix to bring most of the elements within a band, via HSL's MC64 (to maximize the sum or product of the diagonal elements) followed by RCM (reverse Cuthill-McKee) or MD (minimum degree).
2. Extract a band from the reordered matrix to use as a preconditioner.
3. Use an iterative method (e.g., BiCGStab) to solve the linear system (outer solver).
4. Use SPIKE to solve the systems involving the preconditioner (inner solver).
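A sequential Python sketch of steps 2-4: factor the extracted band once and apply it as the preconditioner inside an outer BiCGStab iteration. In the actual approach the inner solves are performed in parallel by SPIKE; here SciPy's splu stands in for the inner solver, and the test matrix is synthetic.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def banded_bicgstab(A, b, k):
    # Solve A x = b with BiCGStab, preconditioned by the band |i - j| <= k of A.
    coo = A.tocoo()
    keep = np.abs(coo.row - coo.col) <= k
    M_band = sp.csc_matrix((coo.data[keep], (coo.row[keep], coo.col[keep])),
                           shape=A.shape)
    lu = spla.splu(M_band)                        # inner solver: factor the band once
    M = spla.LinearOperator(A.shape, matvec=lu.solve)
    return spla.bicgstab(A, b, M=M)               # outer solver

# Tiny synthetic test.
n = 1000
A = sp.random(n, n, density=0.005, format="csr", random_state=2) + sp.eye(n) * 5.0
b = np.ones(n)
x, info = banded_bicgstab(A, b, k=20)
print("converged" if info == 0 else f"bicgstab info = {info}",
      "residual:", np.linalg.norm(A @ x - b))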

Matrix DW8192 (square dielectric waveguide): order N = 8192, NNZ = 41,746, condest(A) = 1.39e+07, sparsity = …%, (A+A') indefinite.

GMRES + ILU preconditioner vs. BiCGStab with banded preconditioner
– GMRES without preconditioning: > 5000 iterations.
– GMRES + ILU (no fill-in): fails.
– GMRES + ILU (drop tol. 1e-3): fails.
– GMRES + ILU (drop tol. 1e-4): 16 iterations, ||r|| ~ …; NNZ(L) = 612,889, NNZ(U) = 414,661.
– SPIKE-BiCGStab with a banded preconditioner of bandwidth 70: 16 iterations, ||r|| ~ …; NNZ = 573,440.

Comparisons on a 4-node Xeon Linux cluster (2 CPUs/node), matrix DW8192:
– SuperLU: 1 node … s; 2 nodes … s; 4 nodes … s (memory limitation).
– MUMPS: 1 node … s; 2 nodes … s; 4 nodes … s.
– SPIKE RL3 (bandwidth = 129): 1 node … s; 2 nodes … s; 4 nodes … s; BiCGStab required only 6 iterations; norm of relative residual ~ ….
– Note: the sparse matvec kernel needs fine-tuning.

Matrix BMW7st_1 (stiffness matrix): order N = 141,347, NNZ = 7,318,399; shown after RCM reordering.

Comparisons on a 4-node Xeon Linux cluster (2 CPUs/node), matrix BMW7st_1 (diagonally dominant):
– SuperLU: 1 node … s; 2 nodes … s; 4 nodes … s.
– MUMPS: 1 node … s; 2 nodes … s; 4 nodes … s.
– SPIKE TL3 (bandwidth = 101): 1 node … s; 2 nodes … s; 4 nodes … s; BiCGStab required only 5 iterations; norm of relative residual ~ ….
– Note: the sparse matvec kernel needs fine-tuning.

Reservoir Modeling

BiCGStab + ILU(0) (no fill-in) preconditioner; NNZ = 50,720.

Matrix after RCM reordering

BiCGStab + banded preconditioner: bandwidth = 61; NNZ = 12,464.

Driven Cavity Example
– Square domain: −1 ≤ x, y ≤ 1.
– Boundary conditions: u_x = u_y = 0 on x = ±1 and y = −1; u_x = 1, u_y = 0 on y = 1.
– Picard iterations; Q2/Q1 elements.
– Linearized equations (Oseen problem) lead to a saddle-point system with blocks A, G, E, and a zero block; A_s = (A + A^T)/2.
– Viscosity: 0.1, 0.02, 0.01, 0.002.
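A hedged reconstruction of the linearized system referred to on this slide, written in the usual saddle-point form; the block names follow the slide, but the exact arrangement shown here is an assumption:

\[
\begin{pmatrix} A & G \\ E & 0 \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}
=
\begin{pmatrix} f \\ g \end{pmatrix},
\qquad
A_s = \tfrac{1}{2}\,(A + A^{T}).
\]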

Linear system for 128 x 128 grid

After spectral reordering

MA48 as a preconditioner for GMRES on a uniprocessor [table: drop tolerance, number of iterations, factorization time, time for GMRES iterations, nnz(L+U), and final residual for three drop tolerances; most numerical values lost in the transcript].
– For a drop tolerance of 0.0001, the total time is ~117 sec.
– No convergence for a drop tolerance of 10^-2.

SPIKE-PARDISO for solving the systems involving the banded preconditioner, on 4 nodes (8 CPUs) of an Intel Xeon cluster:
– bandwidth of the extracted preconditioner = 1401
– total time = 16 sec.
– number of GMRES iterations = 131
– 2-norm of residual = …
– Speed improvement over the sequential procedure with drop tolerance 0.0001: 117/16 ~ 7.3.

Technical Highlights… Analysis of Scaling Properties –In early work, we developed the Isoefficiency metric for scalability. –With the likely scenario of utilizing up to 100K processing cores, this work becomes critical. –Isoefficiency quantifies the performance of a parallel system (a parallel program and the underlying architecture) as the number of processors is increased.

Technical Highlights… Isoefficiency Analysis –The efficiency of any parallel program for a given problem instance goes down with increasing number of processors. –For a family of parallel programs (formally referred to as scalable programs), increasing the problem size results in an increase in efficiency.

Technical Highlights… Isoefficiency is the rate at which problem size must be increased w.r.t. number of processors, to maintain constant efficiency. This rate is critical, since it is ultimately limited by total memory size. Isoefficiency is a key indicator of a program’s ability to scale to very large machine configurations. Isoefficiency analysis will be used extensively for performance projection and scaling properties.
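For reference, the isoefficiency relation in its usual form, with W the problem size (serial work), p the number of processors, T_o(W, p) the total parallel overhead, and E the parallel efficiency:

\[
E = \frac{T_1}{p\,T_p} = \frac{1}{1 + T_o(W,p)/W},
\qquad
W = \frac{E}{1-E}\,T_o(W,p) = K\,T_o(W,p).
\]

The isoefficiency function is the asymptotic rate at which W must grow with p for this relation to hold at a fixed efficiency E.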

Architecture We target the following currently available architectures:
– IBM JS20/21 and BlueGene/L platforms
– Cray X1/XT3
– AMD Opteron multicore SMPs and SMP clusters
– Intel Xeon multicore SMPs and SMP clusters
These platforms represent a wide range of currently available architectural extremes. (Intel, AMD, and IBM JS21 platforms are available locally at Purdue; we intend to get access to the BlueGene at LLNL and the Cray XT3 at ORNL.)

Implementation Current implementations are MPI based. The SPIKE toolkit will be ported to:
– POSIX threads and OpenMP
– UPC and CAF
– Titanium and X10 (if releases are available)
These implementations will be comprehensively benchmarked across platforms.

Benchmarks/Metrics
– We aim to formally specify a number of benchmark problems (sparse systems arising in nanoelectronics, structural mechanics, and fluid-structure interaction).
– We will abstract architecture characteristics: processor speed, memory bandwidth, link bandwidth, and bisection bandwidth (some of these abstractions can be derived from the HPCC benchmarks).
– We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and projected performance on petascale systems.

Benchmarks/Metrics Other popular benchmarks such as HPCC do not address sparse linear solvers in meaningful ways. HPCC comprises seven benchmark tests: HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and communication bandwidth and latency. These tests only peripherally address the underlying performance issues in sparse system solvers.

Progress/Accomplishments Implementation of the parallel Spike polyalgorithm toolkit. Incorporation of a number of parallel direct and preconditioned iterative solvers (e.g. SuperLU, MUMPS, and Pardiso) into the Spike toolkit. Evaluation of Spike on the IBM/SP, SGI Altix, Intel Xeon clusters, and Intel multicore platforms.

Milestones
– Final deliverable: comprehensive evaluation of the scaling properties of existing (and new) solvers.
– Six-month target: comparative performance of solvers on multicore SMPs and clusters.
– Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on the Cray X1, BlueGene, and JS20/21.

Financials The total cost of this project is approximately $150K for its one-year duration. The budget primarily covers a postdoctoral researcher's salary and benefits and limited summer time for the PIs. Together, these three project personnel are responsible for accomplishing project milestones and reporting.

Concluding Remarks This project takes a comprehensive view of parallel sparse linear system solvers and their suitability for petascale HPC systems. Its results directly influence ongoing and future development of HPC systems. A number of major challenges are likely to emerge, both as a result of this project, and from impending architectural innovations.

Concluding Remarks… Architectural innovations on the horizon –Scalable multicore platforms: 64 to 128 cores on the horizon –Heterogeneous multicore: It is likely that cores will be heterogeneous – some with floating point units, others with vector units, yet others with programmable hardware (indeed such chips are commonly used in cell phones) –Significantly higher pressure on the memory subsystem

Concluding Remarks… Challenges for programming models and runtime systems: –Affinity scheduling is important for performance – need to specify tasks that must be co-scheduled (suitable programming abstractions needed). –Programming constructs for utilizing heterogeneity.

Concluding Remarks… Challenges for algorithm and application development – FLOPS are cheap, memory references are expensive – explore new families of algorithms that optimize for (minimize) the latter – Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency) – Many of the optimizations are likely to be beyond the technical reach of applications programmers – need for scalable library support –Increased emphasis on scalability analysis