Capabilities: Serial (thread-safe), shared-memory (SuperLU_MT, OpenMP or Pthreads), distributed-memory (SuperLU_DIST, hybrid MPI + OpenMP + CUDA).

Presentation transcript:

Slide 1: SuperLU – supernodal sparse LU direct solver

Capabilities:
- Serial (thread-safe), shared-memory (SuperLU_MT, OpenMP or Pthreads), distributed-memory (SuperLU_DIST, hybrid MPI + OpenMP + CUDA). All are implemented in C and have a Fortran interface.
- Sparse LU decomposition, triangular solution with multiple right-hand sides.
- Incomplete LU (ILU) preconditioner in serial SuperLU.
- Sparsity-preserving ordering:
  - Minimum degree ordering applied to A^T A or A^T + A [MMD, Liu '85]
  - Nested dissection ordering applied to A^T A or A^T + A [(Par)METIS, (PT-)Scotch]
- User-controllable pivoting: partial pivoting, threshold pivoting, static pivoting.
- Condition number estimation. Iterative refinement. Componentwise error bounds.

Download:
Further information:
Contact: Sherry Li
Developers: Sherry Li, Jim Demmel, John Gilbert, Laura Grigori, Piyush Sao, Meiyue Shao, Ichitaro Yamazaki
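The capabilities listed on this slide can be tried from Python, since SciPy's `scipy.sparse.linalg.splu` wraps the serial SuperLU library. The following is a minimal sketch, not taken from the slides: the 4x4 matrix and right-hand sides are made-up illustrative values, and the choice of `MMD_AT_PLUS_A` with a pivot threshold of 1.0 is just one of the possible settings.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Small nonsymmetric sparse system (illustrative values, not from the slides).
A = csc_matrix(np.array([
    [4.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 0.0],
    [0.0, 1.0, 0.0, 6.0],
]))

# permc_spec selects the sparsity-preserving column ordering;
# "MMD_AT_PLUS_A" is minimum degree applied to the structure of A^T + A.
# diag_pivot_thresh in [0, 1] is the threshold-pivoting parameter
# (1.0 corresponds to classical partial pivoting).
lu = splu(A, permc_spec="MMD_AT_PLUS_A", diag_pivot_thresh=1.0)

# Triangular solution with multiple right-hand sides: one column per RHS.
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0],
              [1.0, 1.0]])
X = lu.solve(B)

# Largest entry of the residual A X - B; should be near machine precision.
residual = np.abs(A @ X - B).max()
print(residual)
```

Passing `permc_spec="MMD_ATA"` instead would apply minimum degree to A^T A, the other variant named on the slide. SciPy also exposes the ILU preconditioner as `scipy.sparse.linalg.spilu`.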

Slide 2: SuperLU_DIST: Recent advances

- Increased scalability via new DAG-based scheduling algorithms that shorten the critical path. Idle time (MPI_Wait) was significantly reduced (2.6x faster using 1000s of cores).
- Architecture-aware: exploit heterogeneous nodes. Offload fine-grained Schur-complement updates to GPU or MIC accelerators.
  - Programming: MPI + OpenMP + CUDA
  - Pipeline execution of CPU and GPU tasks
  - 3x faster on multi-GPU or multi-Xeon Phi clusters, 2-5x reduction in memory usage.

References:
- “A distributed CPU-GPU sparse direct solver”, P. Sao, R. Vuduc and X.S. Li, Euro-Par 2014, LNCS Vol Porto, Portugal, August
- “A Sparse Direct Solver for Distributed Memory Xeon Phi-accelerated Systems”, P. Sao, X. Liu, R. Vuduc, and X.S. Li, IPDPS 2015, May 25-29

[Figure: Pipeline execution: CPU & Accelerator (labels: CPU copy, Accelerator copy)]
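To make the term "critical path" concrete: in a task DAG, the critical path is the longest dependency chain, and its length is a lower bound on parallel run time no matter how many cores are available. The sketch below is a generic illustration and not the SuperLU_DIST scheduler; the task names, durations, and dependencies are hypothetical.

```python
def critical_path_length(durations, deps):
    """Length of the longest weighted path through a task DAG.

    durations: task name -> cost of the task
    deps:      task name -> iterable of tasks that must finish first
    """
    finish = {}  # memoized earliest finish time of each task

    def finish_time(task):
        if task not in finish:
            # A task can start once all of its prerequisites have finished.
            start = max((finish_time(p) for p in deps.get(task, ())),
                        default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(t) for t in durations)


# Hypothetical tasks: one panel factorization, two independent updates,
# and a second factorization that needs both updates.
durations = {"factor_1": 2.0, "update_a": 3.0, "update_b": 1.0, "factor_2": 4.0}
deps = {
    "update_a": ["factor_1"],
    "update_b": ["factor_1"],
    "factor_2": ["update_a", "update_b"],
}

total_work = sum(durations.values())               # 10.0
critical = critical_path_length(durations, deps)   # 9.0
print(total_work, critical)
```

Here the chain factor_1, update_a, factor_2 costs 9.0 out of 10.0 units of total work, so extra cores can help very little. A scheduler that restructures or reorders tasks so that this chain gets shorter reduces the time processes spend waiting, which is the effect the slide reports as reduced MPI_Wait time.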

Slide 3: SuperLU usage and impact

- Over 26,000 downloads in FY
- SuperLU is mentioned in 5% of the NERSC projects (weighted by allocation size)
- Used in many high-end simulation codes:
  - ASCEM/Amanzi: Advanced Simulation Capability for Environmental Management, DOE
  - Denovo: radiation transport simulations for nuclear reactors, DOE
  - DGDFT: Discontinuous Galerkin Method for Density Functional Theory, DOE
  - FEAP: finite element analysis, UC Berkeley
  - H2plus: water simulation code, DOE
  - HiFi: multi-fluid modeling for plasma applications, U. Washington
  - M3D-C1: plasma fusion energy, DOE
  - NekTar: high-order spectral-element Navier-Stokes solver, NCAR
  - NIMROD: plasma fusion energy, DOE
  - Omega3P: accelerator cavity design, DOE
  - OpenSees: earthquake engineering, Pacific Earthquake Engineering Research Center
  - PMAMR: CCSE code for carbon sequestration, DOE
  - PHOENIX: stellar and planetary atmosphere code
  - QUEST: Quantum electron simulation toolbox, UC Davis
  - VORPAL: plasma physics simulation code, Tech-X
- Adopted in many commercial mathematical libraries and simulation software, including AMD (circuit simulation), Boeing (aircraft design), Chevron, ExxonMobil (geology), Cray's LibSci, FEMLAB, HP's MathLib, IMSL, NAG, OptimaNumerics, Python (SciPy), Walt Disney Feature Animation.