Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
Motivation
- solution of large-scale scientific and engineering problems, possibly with hundreds of millions of DOFs
- both linear and non-linear problems
- non-overlapping FETI methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems
PETSc (Portable, Extensible Toolkit for Scientific Computation)
- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
- coded primarily in C, with good Fortran support; can also be called from C++ and Python codes
- current version is 3.2, www.mcs.anl.gov/petsc
- petsc-dev (development branch) is evolving intensively
- code and mailing lists open to anybody
PETSc components (sequential / parallel)
Trilinos
- developed by Sandia National Laboratories
- collection of relatively independent packages
- toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- mainly in C++ (Fortran and Python bindings)
- current version is 10.10, trilinos.sandia.gov
Trilinos components
Both PETSc and Trilinos…
- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense LA
- have their own implementation of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open-source
Potential users
- PETSc: "essential object orientation"; for programmers used to procedural programming but seeking more modular code; recommended for C and Fortran users
- Trilinos: "pure object orientation"; for programmers who are not scared of OOP, appreciate good SW design and have some experience with C++; even better extensibility and reusability, SW project management
Problem of elastostatics
Let me introduce a simple model problem. Ω is an isotropic elastic body, such as a steel traverse. One side, Γ_U, is fixed to the wall and forms the Dirichlet boundary. The other sides are loaded by a prescribed surface traction f.
TFETI decomposition
To apply the TFETI domain decomposition, we
- tear the original body from the Dirichlet boundary,
- decompose it into a system of elastic bodies of nearly the same size.
New artificial boundaries arise between the bodies; continuity across the artificial boundaries is enforced by gluing conditions.
Primal discretized formulation The FEM discretization with a suitable numbering of nodes results in the QP problem:
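For orientation, a sketch of such a QP problem in the notation common in the Total FETI literature. The symbols are assumptions, not taken from the slide: K is the block-diagonal, symmetric positive semidefinite stiffness matrix of the torn subdomains, f the load vector, and B the constraint matrix collecting both the gluing and the Dirichlet conditions with right-hand side c.

```latex
\min_{u} \; \tfrac{1}{2}\, u^{T} K u - u^{T} f
\quad \text{subject to} \quad B u = c,
\qquad K = \operatorname{diag}(K_{1}, \dots, K_{s})
```

Because the Dirichlet conditions are moved into B, every subdomain block K_i is singular with the same, a priori known kernel (the rigid body modes).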
Dual discretized formulation (homogenized) QP problem again, but with lower dimension and simpler constraints
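A sketch of how the dual problem and its homogenization are usually written in the Total FETI literature. All symbols are assumptions rather than content of the slide: K, B, f, c are the stiffness matrix, constraint matrix, load vector and constraint right-hand side of the primal problem, K^+ is a generalized inverse of K, the columns of R span the kernel of K, and G is assumed to have full row rank.

```latex
\begin{aligned}
F &= B K^{+} B^{T}, \qquad G = R^{T} B^{T}, \qquad d = B K^{+} f - c, \qquad e = R^{T} f,\\[2pt]
&\min_{\lambda} \; \tfrac{1}{2}\, \lambda^{T} F \lambda - \lambda^{T} d
  \quad \text{subject to} \quad G \lambda = e,\\[2pt]
&\lambda = \mu + \tilde{\lambda}, \qquad \tilde{\lambda} = G^{T} (G G^{T})^{-1} e,
  \qquad P = I - G^{T} (G G^{T})^{-1} G,\\[2pt]
&\min_{\mu} \; \tfrac{1}{2}\, \mu^{T} P F P \mu - \mu^{T} P \,(d - F \tilde{\lambda})
  \quad \text{subject to} \quad G \mu = 0 .
\end{aligned}
```

The dual dimension equals the number of constraints (rows of B), which is much smaller than the primal dimension, and the only remaining constraint is the homogeneous equality G μ = 0, handled by the projector P.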
Primal data distribution, F action
- very sparse
- straightforward matrix distribution, given by the decomposition
- block diagonal
- embarrassingly parallel
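A minimal sketch of why the F action parallelizes so easily, assuming F = B K^+ B^T as in the Total FETI literature. Everything here is a hypothetical toy setup: two 1D subdomains with one element each, explicitly stored pseudoinverse blocks, and plain Python lists instead of the distributed PETSc or Trilinos objects used in the actual codes.

```python
# Sketch of the F action y = B K^+ B^T lam on a toy problem (all data hypothetical).
# K^+ is block diagonal: one block per subdomain, so step 2 needs no communication.

def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

# Pseudoinverse of the 1D element stiffness [[1, -1], [-1, 1]] (one element per subdomain).
K_PLUS_BLOCK = [[0.25, -0.25], [-0.25, 0.25]]
K_PLUS_BLOCKS = [K_PLUS_BLOCK, K_PLUS_BLOCK]  # two subdomains
BLOCK_SIZES = [2, 2]

# Constraint matrix B: row 0 enforces the Dirichlet condition u1 = 0,
# row 1 glues the two subdomains together (u2 - u3 = 0).
B = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, -1.0, 0.0]]

def apply_F(lam):
    # 1) scatter the multipliers to the primal space: x = B^T lam
    x = matvec(transpose(B), lam)
    # 2) apply K^+ subdomain by subdomain (each block could run on its own core)
    y, offset = [], 0
    for block, n in zip(K_PLUS_BLOCKS, BLOCK_SIZES):
        y.extend(matvec(block, x[offset:offset + n]))
        offset += n
    # 3) gather back to the dual space: B y
    return matvec(B, y)

result = apply_F([1.0, 2.0])
print(result)
```

Only steps 1 and 3 touch data shared between neighbouring subdomains; the subdomain solves in step 2 are independent of each other.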
Coarse projector action
- can easily take 85 % of computation time if not properly parallelized!
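A minimal sketch of the projector action, assuming the usual Total FETI form P = I - G^T (G G^T)^{-1} G. The matrix G below is a hypothetical 2 x 3 example and the small dense solver is a stand-in for the factorization of the coarse problem matrix G G^T; neither is taken from the authors' codes.

```python
# Sketch of the coarse projector action P lam = lam - G^T (G G^T)^{-1} G lam.
# G is a hypothetical example; G G^T is the (small, dense) coarse problem matrix.

def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

G = [[1.0, 1.0, 0.0],
     [0.0, -1.0, 1.0]]
Gt = transpose(G)
# Coarse problem matrix: entry (i, j) is the dot product of rows i and j of G.
GGt = [[sum(a * b for a, b in zip(ri, rj)) for rj in G] for ri in G]

def apply_P(lam):
    g = matvec(G, lam)              # restrict to the coarse space
    alpha = solve_dense(GGt, g)     # coarse problem solve
    correction = matvec(Gt, alpha)  # prolong back to the dual space
    return [l - c for l, c in zip(lam, correction)]

projected = apply_P([1.0, 2.0, 3.0])
residual = matvec(G, projected)  # vanishes up to rounding: P maps into Ker G
print(projected, residual)
```

The sketch re-solves the coarse system on every call. In a real implementation G G^T would be factorized once in a preprocessing step, and the way G, its action and the coarse solve are distributed across cores determines the cost.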
G preprocessing and action
Coarse problem preprocessing and action
- currently used variant: B2 (PPAM 2011)
Coarse problem
HECToR phase 3 (XE6)
- the UK's largest, fastest and most powerful supercomputer
- supplied by Cray Inc., operated by EPCC
- uses the latest AMD "Bulldozer" multicore processor architecture
- 704 compute blades, each with 4 compute nodes, giving a total of 2816 compute nodes
- each node has two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, 90 112 cores in total
- each 16-core processor shares 16 GB of memory, about 90 TB in total
- theoretical peak performance over 800 Tflops
- www.hector.ac.uk
Benchmark
- K+ implemented as a direct solve (LU) of the regularized K
- built-in CG routine used (PETSc.KSP, Trilinos.Belos)
- E = 1e6, ν = 0.3, g = 9.81 m s^-2
- computed @ HECToR
Results
# subdomains = # cores               1         4        16        64       256      1024
Primal dimension                 31752    127008    508032   2032128   8128512  32514048
Dual dimension                     252      1512      7056     30240    124992    508032
Solution time [s], Trilinos       1.39      3.01      4.80      6.25     10.31     28.05
Solution time [s], PETSc          1.14      2.66      4.16      4.74      4.92      5.84
# iterations, Trilinos              34        63        96       105       105       102
# iterations, PETSc                 33        68        94       105       105       102
1 iter. time [s], Trilinos     4.48e-2   4.76e-2   5.00e-2   5.95e-2   9.81e-2   2.75e-1
1 iter. time [s], PETSc        3.46e-2   3.92e-2   4.42e-2   4.52e-2   4.69e-2   5.73e-2
- stopping criterion: ||r_k|| / ||r_0|| < 1e-5
- without preconditioning
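To make the stopping criterion concrete, here is a minimal, unpreconditioned conjugate gradient sketch in plain Python that stops when ||r_k|| / ||r_0|| < 1e-5. It is an illustration only: the benchmark used the built-in CG routines of PETSc (KSP) and Trilinos (Belos), and the small matrix A and right-hand side b below are hypothetical.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cg(A, b, rtol=1e-5, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite A, zero initial guess."""
    x = [0.0] * len(b)
    r = list(b)            # r_0 = b - A x_0 with x_0 = 0
    p = list(r)
    r0_norm = norm(r)
    if r0_norm == 0.0:
        return x, 0
    rr = dot(r, r)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if norm(r) / r0_norm < rtol:   # relative residual stopping criterion
            return x, k
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

# Hypothetical small SPD test system.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations = cg(A, b)
print(x, iterations)
```

In the TFETI setting the operator applied in each iteration would be the projected dual operator rather than an explicit matrix, so one iteration costs one F action plus the projector applications.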
Application to image registration
Image registration is a crucial step of image processing whenever information from two or more images needs to be compared or integrated. These images usually show the same scene, but taken at different times, from different viewpoints or by different sensors. Given two such images, a reference R and a template T, the goal is to find an optimal transformation such that T becomes, in a certain sense, similar to R.
- usage in weather forecasting, GIS, medicine, cartography, computer vision
- process of integrating information from two (or more) different images
- images from different sensors, different angles and/or times
Application to image registration
In medicine:
- monitoring of tumour growth
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
Elastic registration
The task is to minimize the distance between two images: the template T, transformed by φ(x) := x − u(x), and the reference R.
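A minimal sketch of such a distance, under stated assumptions: the slide does not name the distance measure, so the sum of squared differences (SSD) is used here as a common choice, images are small lists of lists, and T is sampled at φ(x) = x − u(x) by bilinear interpolation with clamping at the image border. The function names and the test images are hypothetical.

```python
def bilinear(img, y, x):
    """Sample img at real-valued (y, x), clamping coordinates to the image."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ty, tx = y - y0, x - x0
    top = (1 - tx) * img[y0][x0] + tx * img[y0][x1]
    bottom = (1 - tx) * img[y1][x0] + tx * img[y1][x1]
    return (1 - ty) * top + ty * bottom

def ssd_distance(T, R, u):
    """0.5 * sum over pixels of (T(x - u(x)) - R(x))^2; u maps (i, j) to (u_y, u_x)."""
    total = 0.0
    for i, row in enumerate(R):
        for j, r_val in enumerate(row):
            uy, ux = u(i, j)
            total += (bilinear(T, i - uy, j - ux) - r_val) ** 2
    return 0.5 * total

# Reference with a bright pixel in column 1; template with the same pixel in column 2.
R = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
T = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

before = ssd_distance(T, R, lambda i, j: (0.0, 0.0))   # no displacement
after = ssd_distance(T, R, lambda i, j: (0.0, -1.0))   # constant shift aligning T with R
print(before, after)
```

In elastic registration the displacement u is not prescribed as in this toy example but computed, with the image distance driving the deformation of an elastic body; that is where the elasticity solver enters.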
Elastic registration Parallelization using TFETI method
Results
# of subdomains             1        4       16
Primal variables        20402    81608   326432
Dual variables            903     2641     8254
Solution time [s]          41    34.54    57.44
# of iterations          2467      990      665
Time/iteration [s]       0.01     0.03     0.08
- stopping criterion: ||r_k|| / ||r_0|| < 1e-5
Solution
Conclusion and future work
- to consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- to further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- to extend image registration to 3D data
References
- KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
- HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
- ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.
Thank you for your attention!