Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France

Motivation
- solution of large-scale scientific and engineering problems, possibly with hundreds of millions of DOFs
- linear and non-linear problems
- non-overlapping (FETI) domain decomposition methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems

PETSc (Portable, Extensible Toolkit for Scientific computation)
- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
- coded primarily in C, with good Fortran support; can also be called from C++ and Python codes
- current version is 3.2; petsc-dev (the development branch) is an intensively evolving code, with mailing lists open to anybody
- www.mcs.anl.gov/petsc
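To make the toolkit concrete, here is a minimal sketch of a PETSc solve with CG and no preconditioner, the same solver configuration as in the benchmark later in the talk. It is illustrative only and not from the original slides: the 1-D Laplacian is a stand-in SPD system, and the calls follow the current PETSc API rather than the 3.2 API mentioned above (a couple of signatures differ, as noted in the comments).

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PC       pc;
  PetscInt i, n = 100, rstart, rend, its;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Assemble a 1-D Laplacian as a stand-in SPD system */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);    /* called MatGetVecs in the 3.2 era */
  VecSet(b, 1.0);

  /* CG without preconditioning; relative tolerance 1e-5 gives the
     stopping criterion ||r_k|| / ||r_0|| < 1e-5 used in the benchmark */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);  /* PETSc 3.2 took an extra flag here */
  KSPSetType(ksp, KSPCG);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCNONE);
  KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSetFromOptions(ksp);      /* everything overridable from the command line */
  KSPSolve(ksp, b, x);

  KSPGetIterationNumber(ksp, &its);
  PetscPrintf(PETSC_COMM_WORLD, "converged in %d iterations\n", (int)its);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}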

PETSc components (slide figure: component hierarchy, sequential / parallel)

Trilinos
- developed by Sandia National Laboratories
- collection of relatively independent packages
- toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- mainly in C++ (Fortran and Python bindings)
- current version 10.10
- trilinos.sandia.gov

Trilinos components

Both PETSc and Trilinos…

Potential users:
- PETSc: "essential object orientation"; for programmers used to functional programming but seeking more modular code; recommended for C and Fortran users
- Trilinos: "pure object orientation"; for programmers who are not scared of OOP, appreciate good SW design and have some experience with C++; even better extensibility, reusability and SW project management

Both libraries:
- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense LA
- have their own implementations of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open source

Problem of elastostatics

Let me introduce a simple model problem. Omega is an isotropic elastic body, such as a steel traverse. One side, the Dirichlet boundary Gamma_U, is fixed to the wall; the others are loaded by a prescribed surface traction f.

TFETI decomposition
- to apply TFETI domain decomposition, we tear the original body from the Dirichlet boundary
- we decompose it into a system of elastic bodies of nearly the same size
- new artificial boundaries arise between them
- continuity across the artificial boundaries is enforced by gluing conditions

Primal discretized formulation The FEM discretization with a suitable numbering of nodes results in the QP problem:
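The equation on this slide was an image in the transcript; in the usual TFETI notation (cf. the Kozubek et al. reference below), the problem it refers to is

  \min_{u} \tfrac{1}{2} u^{\top} K u - f^{\top} u
  \quad \text{s.t.} \quad B u = c,

where K = \mathrm{diag}(K_1, \dots, K_s) is the block-diagonal stiffness matrix of the (floating) subdomains, f is the load vector, and the signed Boolean matrix B enforces both the gluing conditions and the Dirichlet conditions with prescribed values c.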

Dual discretized formulation (homogenized) QP problem again, but with lower dimension and simpler constraints
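The dual problem was likewise only pictured; eliminating u by means of the pseudoinverse K^+ and a basis R of the null space of K yields the standard form

  \min_{\lambda} \tfrac{1}{2} \lambda^{\top} F \lambda - \lambda^{\top} d
  \quad \text{s.t.} \quad G \lambda = e,
  \qquad
  F = B K^{+} B^{\top}, \quad d = B K^{+} f - c, \quad G = R^{\top} B^{\top}, \quad e = R^{\top} f.

The equality constraint G\lambda = e is indeed lower-dimensional and simpler than the primal constraint Bu = c, which is what the slide alludes to.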

Primal data distribution, F action
- K is block diagonal and very sparse, so the matrix distribution is straightforward and given by the decomposition
- the action of F is embarrassingly parallel (a sketch follows below)
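As an illustration of how an unassembled F can be realized, here is a hedged sketch of its action as a PETSc shell matrix. The names (FetiCtx, CreateF) and the per-rank simplification are hypothetical, not the authors' code: in practice the distribution of B and of the factored, regularized K involves communication across subdomain interfaces.

#include <petscksp.h>

/* Hypothetical per-subdomain context for the matrix-free F action */
typedef struct {
  Mat B;      /* "jump" operator: gluing + Dirichlet rows            */
  KSP Kplus;  /* LU solve of the regularized K, acting as K^+        */
  Vec u, v;   /* primal work vectors                                 */
} FetiCtx;

static PetscErrorCode FApply(Mat F, Vec lambda, Vec mu)
{
  FetiCtx *ctx;
  MatShellGetContext(F, &ctx);
  MatMultTranspose(ctx->B, lambda, ctx->u); /* u  = B^T lambda       */
  KSPSolve(ctx->Kplus, ctx->u, ctx->v);     /* v  = K^+ u, LU solve  */
  MatMult(ctx->B, ctx->v, mu);              /* mu = B v              */
  return 0;
}

/* m/M: local/global dual dimension; F itself is never assembled */
static PetscErrorCode CreateF(MPI_Comm comm, PetscInt m, PetscInt M,
                              FetiCtx *ctx, Mat *F)
{
  MatCreateShell(comm, m, m, M, M, ctx, F);
  MatShellSetOperation(*F, MATOP_MULT, (void (*)(void))FApply);
  return 0;
}

The shell matrix returned by CreateF can then be handed directly to the CG solver (KSPSetOperators), since CG only needs matrix-vector products.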

Coarse projector action
…can easily take 85 % of computation time if not properly parallelized! (the projector formula is reconstructed below)
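The projector itself was only shown graphically on the slide; in FETI methods (cf. the PPAM 2011 reference on coarse space projectors) it is the orthogonal projector onto the null space of G,

  P = I - G^{\top} (G G^{\top})^{-1} G,

so every application of P requires a solve with the coarse problem matrix G G^{\top}. It is this coarse solve, coupling all subdomains, that can dominate the runtime when parallelized naively.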

G preprocessing and action (slide figure)

Coarse problem preprocessing and action
(slide figure: numbered parallelization variants of the coarse problem)
Currently used variant: B2 (PPAM 2011)

Coarse problem

HECToR phase 3 (XE6)
- the UK's largest, fastest and most powerful supercomputer, supplied by Cray Inc. and operated by EPCC
- uses the latest AMD "Bulldozer" multicore processor architecture
- 704 compute blades, each with 4 compute nodes, giving a total of 2816 compute nodes
- each node has two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, 90 112 cores in total
- each 16-core processor shares 16 GB of memory, about 90 TB in total
- theoretical peak performance over 800 Tflops
- www.hector.ac.uk

Benchmark
- K^+ implemented as a direct solve (LU) of the regularized K
- built-in CG routines used (PETSc KSP, Trilinos Belos)
- E = 1e6, ν = 0.3, g = 9.81 m·s⁻²
- computed @ HECToR

Results
stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning

# subds = # cores            1         4         16        64         256        1024
Primal dim.                  31 752    127 008   508 032   2 032 128  8 128 512  32 514 048
Dual dim.                    252       1 512     7 056     30 240     124 992    508 032
Solution time [s], Trilinos  1.39      3.01      4.80      6.25       10.31      28.05
Solution time [s], PETSc     1.14      2.66      4.16      4.74       4.92       5.84
# iterations, Trilinos       34        63        96        105        102        …
# iterations, PETSc          33        68        94        …          …          …
1 iter. time [s], Trilinos   4.48e-2   4.76e-2   5.00e-2   5.95e-2    9.81e-2    2.75e-1
1 iter. time [s], PETSc      3.46e-2   3.92e-2   4.42e-2   4.52e-2    4.69e-2    5.73e-2

Application to image registration

Image registration is a crucial step of image processing whenever information from two or more images has to be compared or integrated. Given a reference image R and a template image T, the goal is to find an optimal transformation such that T becomes, in a certain sense, similar to R. The images usually show the same scene, but taken at different times, from different viewpoints or by different sensors. Registration is used in weather forecasting, GIS, medicine, cartography and computer vision.

Application to image registration

In medicine:
- monitoring of tumour growth
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT) and positron emission tomography (PET)

Elastic registration

The task is to minimize the distance between the template T and the reference R under a transformation of the form φ(x) := x − u(x), where u is the sought displacement field.
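The slides leave the functional implicit; a standard elastic registration formulation (in the spirit of the Zitová & Flusser survey cited below, and stated here as an assumption rather than the authors' exact model) minimizes the sum of an image distance and the linear elastic potential of the displacement u:

  \min_{u} \; \tfrac{1}{2} \int_{\Omega} \bigl( T(x - u(x)) - R(x) \bigr)^{2} \, dx
  \; + \; \alpha \int_{\Omega} \frac{\mu}{4} \sum_{j,k=1}^{2} \bigl( \partial_{x_j} u_k + \partial_{x_k} u_j \bigr)^{2}
  + \frac{\lambda}{2} (\nabla \cdot u)^{2} \, dx,

where λ, μ are the Lamé parameters (not to be confused with the dual variable above) and α weights the regularizer. Its Euler–Lagrange equations are the Navier–Lamé equations of linear elasticity, which is precisely why the TFETI elasticity machinery from the first part of the talk applies.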

Elastic registration
Parallelization using the TFETI method

Results
stopping criterion: ||r_k|| / ||r_0|| < 1e-5

# of subdomains       1        4        16
Primal variables      20 402   81 608   326 432
Dual variables        903      2 641    8 254
Solution time [s]     41       34.54    57.44
# of iterations       2467     990      665
Time/iteration [s]    0.01     0.03     0.08

Solution

Conclusion and future work
- consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- extend the image registration to 3D data

References
KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977–1000.

Thank you for your attention!