ACTS Tools and Case Studies of their Use Osni Marques and Tony Drummond Lawrence Berkeley National Laboratory NERSC User Group Meeting June 24-25, 2004

06/24/2004 NERSC User Group Meeting 2

What is the DOE ACTS Collection?
Advanced CompuTational Software Collection: tools for developing parallel applications. ACTS started as an "umbrella" project.

Goals
 Extended support for experimental software
 Make ACTS tools available on DOE computers
 Provide technical support
 Maintain the ACTS information center
 Coordinate efforts with other supercomputing centers
 Enable large scale scientific applications
 Educate and train

Levels of support
 Basic: help with installation, basic knowledge of the tools, compilation of user's reports
 Intermediate: basic level, plus a higher level of support to users of the tool
 High: intermediate level, plus tool expertise and conducting tutorials

06/24/2004 NERSC User Group Meeting 3

Motivation 1: Challenges in the Development of Scientific Codes

Productivity
 Time to the first solution (prototype)
 Time to solution (production)
 Other requirements

Complexity
 Increasingly sophisticated models
 Model coupling
 Interdisciplinarity

Performance
 Increasingly complex algorithms
 Increasingly complex architectures
 Increasingly demanding applications

Common difficulties
 Libraries written in different languages
 Discussions about standardizing interfaces are often sidetracked into implementation issues
 Difficulties managing multiple libraries developed by third parties
 Need to use more than one language in one application
 The code is long-lived and different pieces evolve at different rates
 Swapping competing implementations of the same idea and testing without modifying the code
 Need to compose an application with others that were not originally designed to be combined

06/24/2004 NERSC User Group Meeting 4

Motivation 2: Addressing the Performance Gap through Software

Peak performance is skyrocketing:
 In the 1990s, peak performance increased 100x; in the 2000s, it will increase 1000x

But:
 Efficiency for many science applications declined from 40-50% on the vector supercomputers of the 1990s to as little as 5-10% on the parallel supercomputers of today

Need research on:
 Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
 More efficient programming models for massively parallel supercomputers

(Chart: peak vs. real performance in Teraflops from 1996 onward, showing the widening performance gap.)

06/24/2004 NERSC User Group Meeting 5

ACTS Tools Functionalities

Numerical
 Aztec: algorithms for the iterative solution of large sparse linear systems.
 Hypre: algorithms for the iterative solution of large sparse linear systems, intuitive grid-centric interfaces, and dynamic configuration of parameters.
 PETSc: tools for the solution of PDEs that require solving large-scale, sparse linear and nonlinear systems of equations.
 OPT++: object-oriented nonlinear optimization package.
 SUNDIALS: solvers for systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
 ScaLAPACK: library of high-performance dense linear algebra routines for distributed-memory message-passing machines.
 SuperLU: general-purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
 TAO: large-scale optimization software, including nonlinear least squares, unconstrained minimization, bound-constrained optimization, and general nonlinear optimization.

Code Development
 Global Arrays: library for writing parallel programs that use large arrays distributed across processing nodes and that offers a shared-memory view of distributed arrays.
 Overture: object-oriented tools for solving computational fluid dynamics and combustion problems in complex geometries.

Code Execution
 CUMULVS: framework that enables programmers to incorporate fault tolerance, interactive visualization, and computational steering into existing parallel programs.
 Globus: services for the creation of computational Grids and tools with which applications can be developed to access the Grid.
 PAWS: framework for coupling parallel applications within a component-like model.
 SILOON: tools and run-time support for building easy-to-use external interfaces to existing numerical codes.
 TAU: set of tools for analyzing the performance of C, C++, Fortran, and Java programs.

Library Development
 ATLAS and PHiPAC: tools for the automatic generation of optimized numerical software for modern computer architectures and compilers.
 PETE: extensible implementation of the expression-template technique (a C++ technique for passing expressions as function arguments).

06/24/2004 NERSC User Group Meeting 6

Use of ACTS Tools
 3D overlapping grid for a submarine produced with Overture's module ogen.
 Model of a "hard" sphere included in a "soft" material, 26 million d.o.f.: unstructured meshes in solid mechanics using Prometheus and PETSc (Adams and Demmel).
 Multiphase flow using PETSc: 4 million cell blocks, 32 million DOF, over 10.6 Gflops on an IBM SP (128 nodes); the entire simulation runs in less than 30 minutes (Pope, Gropp, Morgan, Sepehrnoori, Smith and Wheeler).
 3D incompressible Euler: tetrahedral grid, up to 11 million unknowns, based on a legacy NASA code, FUN3D (W. K. Anderson); fully implicit steady-state, parallelized with PETSc (courtesy of Kaushik and Keyes).
 Electronic structure optimization performed with TAO, (UO2)3(CO3)6 (courtesy of deJong).
 Molecular dynamics and thermal flow simulation using codes based on Global Arrays. GA has been employed in large simulation codes such as NWChem, GAMESS-UK, Columbus, Molpro, Molcas, MWPhys/Grid, etc.

06/24/2004 NERSC User Group Meeting 7

Use of ACTS Tools
 Induced current (white arrows) and charge density (colored plane and gray surface) in crystallized glycine due to an external field (Louie, Yoon, Pfrommer and Canning); eigenvalue problems solved with ScaLAPACK.
 OPT++ is used in protein energy minimization problems (shown here is protein T162 from CASP5; courtesy of Meza, Oliva et al.).
 Omega3P is a parallel distributed-memory code for the modeling and analysis of accelerator cavities, which requires the solution of generalized eigenvalue problems. A parallel exact shift-invert eigensolver based on PARPACK and SuperLU has allowed the solution of a problem of order 7.5 million with 304 million nonzeros; finding 10 eigenvalues requires about 2.5 hours on 24 processors of an IBM SP.
 Two ScaLAPACK routines, PZGETRF and PZGETRS, are used for the solution of linear systems in the spectral-algorithm-based AORSA code (Batchelor et al.), which studies electromagnetic wave-plasma interactions. The code reaches 68% of peak performance on 1936 processors of an IBM SP.

06/24/2004 NERSC User Group Meeting 8

PETSc: Portable, Extensible Toolkit for Scientific Computation

Layered architecture (top to bottom):
 PDE application codes
 Higher-level solvers: ODE integrators; nonlinear solvers and unconstrained minimization; linear solvers (preconditioners + Krylov methods); grid management; visualization interface
 Object-oriented matrices, vectors, and indices
 Computation and communication kernels: MPI, MPI-IO, BLAS, LAPACK (with a profiling interface)

 Vectors: fundamental objects for storing field solutions, right-hand sides, etc.
 Matrices: fundamental objects for storing linear operators (e.g., Jacobians)

06/24/2004 NERSC User Group Meeting 9

PETSc: Numerical Components
 Matrices: compressed sparse row (AIJ), blocked compressed sparse row (BAIJ), block diagonal (BDIAG), dense, matrix-free, others
 Vectors and distributed arrays
 Index sets: indices, block indices, stride, others
 Krylov subspace methods: GMRES, CG, CGS, Bi-CG-STAB, TFQMR, Richardson, Chebychev, others
 Preconditioners: additive Schwarz, block Jacobi, ILU, ICC, LU (sequential only), others
 Nonlinear solvers: Newton-based methods (line search, trust region), others
 Time steppers: Euler, backward Euler, pseudo time stepping, others

06/24/2004 NERSC User Group Meeting 10

TAO: Toolkit for Advanced Optimization

06/24/2004 NERSC User Group Meeting 11

OPT++

General problem form: minimize an objective function f(x), subject to equality constraints h(x) = 0 and inequality constraints g(x) >= 0.

Assumptions:
 Objective function is smooth (twice continuously differentiable)
 Constraints are linearly independent
 Expensive objective functions

Four major classes of problems available:
 NLF0(ndim, fcn, init_fcn, constraint): basic nonlinear function, no derivative information available
 NLF1(ndim, fcn, init_fcn, constraint): nonlinear function, first derivative information available
 FDNLF1(ndim, fcn, init_fcn, constraint): nonlinear function, first derivative information approximated
 NLF2(ndim, fcn, init_fcn, constraint): nonlinear function, first and second derivative information available

06/24/2004 NERSC User Group Meeting 12

Hypre

Conceptual linear system interfaces matched to data layouts:
 Data layouts: structured, composite, block-structured, unstructured, CSR
 Linear solvers suited to each layout: GMG, ... (structured); FAC, ... (composite); Hybrid, ...; AMGe, ...; ILU, ... (CSR)

06/24/2004 NERSC User Group Meeting 13

SUNDIALS: SUite of Nonlinear and DIfferential/ALgebraic Solvers

Solvers:
 CVODE (ODE integrator): x' = f(t,x), x(t0) = x0
 IDA (DAE integrator): F(t,x,x') = 0, x(t0) = x0
 KINSOL (nonlinear solver): F(x) = 0

Shared modules: vector kernels; dense and band linear solvers; preconditioned GMRES linear solver; general preconditioner modules. The user supplies a main routine, a problem-defining function, and (optionally) a preconditioner function.

Sensitivity analysis (forward and adjoint) also available.

ScaLAPACK
 Software hierarchy
 Data distribution
 Simple examples
 Application

06/24/2004 NERSC User Group Meeting 15

ScaLAPACK: Software Hierarchy
 ScaLAPACK (global): linear systems, least squares, singular value decomposition, eigenvalues; clarity, modularity, performance, and portability
 PBLAS: parallel BLAS
 BLACS: communication routines targeting linear algebra operations
 LAPACK and BLAS (local, platform-specific): ATLAS can be used here for automatic tuning
 MPI/PVM/...: communication layer (message passing)

06/24/2004 NERSC User Group Meeting 16

ScaLAPACK: 2D Block-Cyclic Distribution

(Figure: a 5x5 matrix partitioned in 2x2 blocks, shown from the point of view of a 2x2 process grid.)

06/24/2004 NERSC User Group Meeting 17

2D Block-Cyclic Distribution

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF ( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         A(1) = 1.1; A(2) = -2.1; A(3) = -5.1
         A(1+LDA) = 1.2; A(2+LDA) = 2.2; A(3+LDA) = -5.2
         A(1+2*LDA) = 1.5; A(2+3*LDA) = 2.5; A(3+4*LDA) = -5.5
      ELSE IF ( MYROW.EQ.0 .AND. MYCOL.EQ.1 ) THEN
         A(1) = 1.3; A(2) = 2.3; A(3) = -5.3
         A(1+LDA) = 1.4; A(2+LDA) = 2.4; A(3+LDA) = -5.4
      ELSE IF ( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         A(1) = -3.1; A(2) = -4.1
         A(1+LDA) = -3.2; A(2+LDA) = -4.2
         A(1+2*LDA) = 3.5; A(2+3*LDA) = 4.5
      ELSE IF ( MYROW.EQ.1 .AND. MYCOL.EQ.1 ) THEN
         A(1) = 3.3; A(2) = -4.3
         A(1+LDA) = 3.4; A(2+LDA) = 4.4
      END IF
      ...
      CALL PDGESVD( JOBU, JOBVT, M, N, A, IA, JA, DESCA, S, U, IU, JU,
     $              DESCU, VT, IVT, JVT, DESCVT, WORK, LWORK, INFO )

Notes: LDA is the leading dimension of the local array; DESCA is the array descriptor for A (it contains information about A).

Online tutorial:

      PROGRAM PSGESVDRIVER
*
*     Example program solving Ax = b via ScaLAPACK routine PSGESV
*
*     .. Parameters ..
*     ..
*     .. Local Scalars ..
*     ..
*     .. Local Arrays ..
*     ..
*     .. Executable Statements ..
*
*     INITIALIZE THE PROCESS GRID
*
      CALL SL_INIT( ICTXT, NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     If I'm not in the process grid, go to the end of the program
*
      IF( MYROW.EQ.-1 )
     $   GO TO 10
*
*     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
*     Initialize the array descriptors for the matrices A and B
*
      CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
     $               INFO )
      CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
     $               MXLLDB, INFO )
*
*     Generate matrices A and B and distribute to the process grid
*
      CALL MATINIT( A, DESCA, B, DESCB )
*
*     Make a copy of A and B for checking purposes
*
      CALL PSLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
      CALL PSLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
*
*     CALL THE SCALAPACK ROUTINE
*     Solve the linear system A * X = B
*
      CALL PSGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
     $             INFO )
      ...

/**********************************************************************/
/* This program illustrates the use of the ScaLAPACK routines PDPTTRF */
/* and PDPTTRS to factor and solve a symmetric positive definite      */
/* tridiagonal system of linear equations, i.e., T*x = b, with        */
/* different data in two distinct contexts.                           */
/**********************************************************************/

/* a bunch of things omitted for the sake of space */

main()
{
   /* Start BLACS */
   Cblacs_pinfo( &mype, &npe );
   Cblacs_get( 0, 0, &context );
   Cblacs_gridinit( &context, "R", 1, npe );
   /* Processes 0 and 2 contain d(1:4) and e(1:4) */
   /* Processes 1 and 3 contain d(5:8) and e(5:8) */
   if ( mype == 0 || mype == 2 ){
      d[0]=1.8180; d[1]=1.6602; d[2]=1.3420; d[3]=1.2897;
      e[0]=0.8385; e[1]=0.5681; e[2]=0.3704; e[3]=0.7027;
   } else if ( mype == 1 || mype == 3 ){
      d[0]=1.3412; d[1]=1.5341; d[2]=1.7271; d[3]=1.3093;
      e[0]=0.5466; e[1]=0.4449; e[2]=0.6946; e[3]=0.0000;
   }
   if ( mype == 0 || mype == 1 ) {
      /* New context for processes 0 and 1 */
      map[0]=0; map[1]=1;
      Cblacs_get( context, 10, &context_1 );
      Cblacs_gridmap( &context_1, map, 1, 1, 2 );
      /* Right-hand side is set to b = [ ] */
      if ( mype == 0 ) {
         b[0]=1.0; b[1]=2.0; b[2]=3.0; b[3]=4.0;
      } else if ( mype == 1 ) {
         b[0]=5.0; b[1]=6.0; b[2]=7.0; b[3]=8.0;
      }
      /* Array descriptor for A (D and E) */
      desca[0]=501; desca[1]=context_1; desca[2]=n; desca[3]=nb;
      desca[4]=0; desca[5]=lda; desca[6]=0;
      /* Array descriptor for B */
      descb[0]=502; descb[1]=context_1; descb[2]=n; descb[3]=nb;
      descb[4]=0; descb[5]=ldb; descb[6]=0;
      /* Factorization */
      pdpttrf( &n, d, e, &ja, desca, af, &laf, work, &lwork, &info );
      /* Solution */
      pdpttrs( &n, &nrhs, d, e, &ja, desca, b, &ib, descb, af, &laf,
               work, &lwork, &info );
      printf( "MYPE=%i: x[:] = %7.4f %7.4f %7.4f %7.4f\n",
              mype, b[0], b[1], b[2], b[3]);
   } else {
      /* New context for processes 2 and 3 */
      map[0]=2; map[1]=3;
      Cblacs_get( context, 10, &context_2 );
      Cblacs_gridmap( &context_2, map, 1, 1, 2 );
      /* Right-hand side is set to b = [ ] */
      if ( mype == 2 ) {
         b[0]=8.0; b[1]=7.0; b[2]=6.0; b[3]=5.0;
      } else if ( mype == 3 ) {
         b[0]=4.0; b[1]=3.0; b[2]=2.0; b[3]=1.0;
      }
      /* Array descriptor for A (D and E) */
      desca[0]=501; desca[1]=context_2; desca[2]=n; desca[3]=nb;
      desca[4]=0; desca[5]=lda; desca[6]=0;
      /* Array descriptor for B */
      descb[0]=502; descb[1]=context_2; descb[2]=n; descb[3]=nb;
      descb[4]=0; descb[5]=ldb; descb[6]=0;
      /* Factorization */
      pdpttrf( &n, d, e, &ja, desca, af, &laf, work, &lwork, &info );
      /* Solution */
      pdpttrs( &n, &nrhs, d, e, &ja, desca, b, &ib, descb, af, &laf,
               work, &lwork, &info );
      printf( "MYPE=%i: x[:] = %7.4f %7.4f %7.4f %7.4f\n",
              mype, b[0], b[1], b[2], b[3]);
   }
   Cblacs_gridexit( context );
   Cblacs_exit( 0 );
}

Using Matlab notation: T = diag(D) + diag(E,-1) + diag(E,1), where
D = [ ]
E = [ ]
Then, solving T*x = b:
if b = [ ] then x = [ ]
if b = [ ] then x = [ ]

06/24/2004 NERSC User Group Meeting 21

Application: Cosmic Microwave Background (CMB) Analysis

The statistics of the tiny variations in the CMB (the faint echo of the Big Bang) allow the determination of the fundamental parameters of cosmology to the percent level or better.

MADCAP (Microwave Anisotropy Dataset Computational Analysis Package)
 Makes maps from observations of the CMB and then calculates their angular power spectra. (See
 Calculations are dominated by the solution of linear systems of the form M = A^-1 B for dense n x n matrices A and B, scaling as O(n^3) in flops. MADCAP uses ScaLAPACK for those calculations.

On the NERSC Cray T3E (original code):
 Cholesky factorization and triangular solve
 Typically reached 70-80% of peak performance
 Solution of systems with n ~ 10^4 using tens of processors
 The results demonstrated that the Universe is spatially flat, comprising 70% dark energy, 25% dark matter, and only 5% ordinary matter

On the NERSC IBM SP:
 Porting was trivial, but tests showed only 20-30% of peak performance
 Code rewritten to use triangular matrix inversion and triangular matrix multiplication (a one-day effort)
 Performance increased to 50-60% of peak
 Solution of previously intractable systems with n ~ 10^5 using hundreds of processors

The international BOOMERanG collaboration announced the results of the most detailed measurement of the cosmic microwave background radiation (CMB), which strongly indicated that the universe is flat (Apr. 27, 2000).

SuperLU
 Functionalities
 Simple examples
 Application

06/24/2004 NERSC User Group Meeting 23

SuperLU
 Solves general sparse linear systems A x = b.
 Algorithm: Gaussian elimination (factorization A = LU), followed by lower/upper triangular solutions.
 Efficient and portable implementation for high-performance architectures; flexible interface.

Three versions:
 SuperLU: serial; C; real/complex, single/double
 SuperLU_MT: SMP; C and Pthreads (or pragmas); real, double
 SuperLU_DIST: distributed; C and MPI; real/complex, double

06/24/2004 NERSC User Group Meeting 24

SuperLU: Example pddrive.c (1/2)

#include "superlu_ddefs.h"

main(int argc, char *argv[])
{
    superlu_options_t options;
    SuperLUStat_t stat;
    SuperMatrix A;
    ScalePermstruct_t ScalePermstruct;
    LUstruct_t LUstruct;
    SOLVEstruct_t SOLVEstruct;
    gridinfo_t grid;
    · · ·
    /* Initialize MPI environment */
    MPI_Init( &argc, &argv );
    · · ·
    /* Initialize the SuperLU process grid */
    nprow = npcol = 2;
    superlu_gridinit(MPI_COMM_WORLD, nprow, npcol, &grid);
    /* Read matrix A from file, distribute it, and set up
       the right-hand side */
    dcreate_matrix(&A, nrhs, &b, &ldb, &xtrue, &ldx, fp, &grid);
    /* Set the options for the solver. Defaults are:
       options.Fact = DOFACT;
       options.Equil = YES;
       options.ColPerm = MMD_AT_PLUS_A;
       options.RowPerm = LargeDiag;
       options.ReplaceTinyPivot = YES;
       options.Trans = NOTRANS;
       options.IterRefine = DOUBLE;
       options.SolveInitialized = NO;
       options.RefineInitialized = NO;
       options.PrintStat = YES; */
    set_default_options_dist(&options);

06/24/2004 NERSC User Group Meeting 25

SuperLU: Example pddrive.c (2/2)

    /* Initialize ScalePermstruct and LUstruct. */
    ScalePermstructInit(m, n, &ScalePermstruct);
    LUstructInit(m, n, &LUstruct);
    /* Initialize the statistics variables. */
    PStatInit(&stat);
    /* Call the linear equation solver. */
    pdgssvx(&options, &A, &ScalePermstruct, b, ldb, nrhs, &grid,
            &LUstruct, &SOLVEstruct, berr, &stat, &info);
    /* Print the statistics. */
    PStatPrint(&options, &stat, &grid);
    /* Deallocate storage */
    PStatFree(&stat);
    Destroy_LU(n, &grid, &LUstruct);
    LUstructFree(&LUstruct);
    /* Release the SuperLU process grid */
    superlu_gridexit(&grid);
    /* Terminate the MPI execution environment */
    MPI_Finalize();
}

06/24/2004 NERSC User Group Meeting 26

Application: Accelerator Cavity Design (1/2)
 Calculate cavity mode frequencies and field vectors
 Solve the Maxwell equations for the electromagnetic field
 Omega3P simulation code developed at SLAC
 Finite element methods lead to a large sparse generalized eigensystem K x = λ M x
 Real symmetric for lossless cavities; complex symmetric when cavities are lossy
 Seek interior eigenvalues (tightly clustered) that are relatively small in magnitude

(Figure: Omega3P model of a 47-cell section of the 206-cell Next Linear Collider accelerator structure.)

06/24/2004 NERSC User Group Meeting 27

Application: Accelerator Cavity Design (2/2)
 Speed up Lanczos convergence by shift-invert: seek the largest, well-separated eigenvalues μ of the transformed system (K - σM)^-1 M x = μ x, with μ = 1/(λ - σ)
 The filtering algorithm [Y. Sun]: inexact shift-invert Lanczos + JOCC
 Exact shift-invert Lanczos (ESIL): PARPACK + SuperLU_DIST for the shifted linear system (no pivoting, no iterative refinement)
 dds47 (linear elements), total eigensolver time: N = 1.3x10^6, NNZ = 20x10^6
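The shift-invert spectral transformation used here follows in one line from the generalized eigenproblem K x = λ M x (standard shift-invert algebra, written out for completeness):

```latex
Kx = \lambda Mx
\;\Longrightarrow\; (K - \sigma M)\,x = (\lambda - \sigma)\,Mx
\;\Longrightarrow\; (K - \sigma M)^{-1} M x = \frac{1}{\lambda - \sigma}\,x = \mu\, x .
```

Eigenvalues λ closest to the shift σ become the largest and best-separated values of μ, which is why Lanczos applied to the transformed operator converges quickly to the tightly clustered interior modes, at the cost of a sparse factorization of K - σM per shift.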

TAU
 Functionalities
 Simple examples
 Application

06/24/2004 NERSC User Group Meeting 29

TAU: Functionalities
 Profiling of Java, C++, C, and Fortran codes
 Detailed information (much more than prof/gprof)
 Profiles for each unique template instantiation
 Time spent exclusively and inclusively in each function
 Start/stop timers
 Profiling data maintained for each thread, context, and node
 Parallel I/O
 Statistics on the number of calls for each profiled function
 Profiling groups for organizing and controlling instrumentation
 Support for CPU hardware counters (PAPI)
 Graphical display of parallel profiling results (built-in viewers, interface to Vampir)

06/24/2004 NERSC User Group Meeting 30

TAU: Example 1 (1/4)

(Screenshot: index of the TAU configuration options currently installed, with the option used in the example highlighted.)

06/24/2004 NERSC User Group Meeting 31

TAU: Example 1 (2/4)

psgesvdriver.int.f90:

      PROGRAM PSGESVDRIVER
!
!     Example program solving Ax = b via ScaLAPACK routine PSGESV
!
!     .. Parameters ..
!**** a bunch of things omitted for the sake of space ****
!     .. Executable Statements ..
!
!     INITIALIZE THE PROCESS GRID
!
      integer profiler(2)
      save profiler
      call TAU_PROFILE_INIT()
      call TAU_PROFILE_TIMER(profiler, 'PSGESVDRIVER')
      call TAU_PROFILE_START(profiler)
      CALL SL_INIT( ICTXT, NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
!**** a bunch of things omitted for the sake of space ****
      CALL PSGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, &
                   INFO )
!**** a bunch of things omitted for the sake of space ****
      call TAU_PROFILE_STOP(profiler)
      STOP
      END

NB: the ScaLAPACK routines themselves have not been instrumented and are therefore not shown in the charts.

06/24/2004 NERSC User Group Meeting 32

TAU: Example 1 (3/4)

(Screenshots of the resulting profile data.)

06/24/2004 NERSC User Group Meeting 33

TAU: Example 1 (4/4)

06/24/2004 NERSC User Group Meeting 34

TAU: Example 2 (1/2)

Compilation rule in the Makefile:

include $(TAUROOTDIR)/rs6000/lib/Makefile.tau-mpi-papi-pdt-profile-trace

06/24/2004 NERSC User Group Meeting 35

TAU: Example 2 (2/2)

PAPI provides access to hardware performance counters (see for details, and contact for the corresponding TAU events). In this example we are just measuring FLOPS.

(Screenshot: PARAPROF display.)

06/24/2004 NERSC User Group Meeting 36

Case Study: Electronic Structure Calculations Code …

Additional Information

 Tool descriptions, installation details, examples, etc.
 Agenda, accomplishments, conferences, releases, etc.
 Goals and other relevant information
 Points of contact
 Search engine

High-performance tools:
 portable library calls
 robust algorithms
 help with code optimization

For scientific computing centers (like PSC, NERSC), the tools:
 Reduce users' code development time, which adds up to more production runs and faster, more effective scientific research results
 Improve overall system utilization
 Facilitate the accumulation and distribution of high performance computing expertise
 Provide better scientific parameters for procurement and characterization of specific user needs

See also:

References
 An Expanded Framework for the Advanced Computational Testing and Simulation Toolkit.
 The Advanced Computational Testing and Simulation (ACTS) Toolkit. Technical Report LBNL
 A First Prototype of PyACTS. Technical Report LBNL
 ACTS - A Collection of High Performing Tools for Scientific Computing. Technical Report LBNL
 The ACTS Collection: Robust and High-Performance Tools for Scientific Computing. Guidelines for Tool Inclusion and Retirement. Technical Report LBNL/PUB
 An Infrastructure for the Creation of High End Scientific and Engineering Software Tools and Applications. Technical Report LBNL/PUB
 To appear: two journals featuring ACTS tools.
 How Can ACTS Work for You?
 Solving Problems in Science and Engineering.
 Robust and High Performance Tools for Scientific Computing.
 The ACTS Collection: Robust and High Performance Libraries for Computational Sciences. SIAM PP04.
 Enabling Technologies for High End Computer Simulations.

Tutorials and Workshops: FY , FY , FY , FY
Progress Reports

06/24/2004 NERSC User Group Meeting 40

Matrix of Applications and Tools... More Applications … Enabling sciences and discoveries… with high performance and scalability...