1 Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France

2 Motivation
- solution of large-scale scientific and engineering problems, possibly with hundreds of millions of DOFs
- linear and non-linear problems
- non-overlapping FETI methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems

3 PETSc (Portable, Extensible Toolkit for Scientific Computation)
- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
- coded primarily in C, but with good Fortran support; can also be called from C++ and Python codes
- current version is 3.2; petsc-dev (the development branch) is evolving intensively
- code and mailing lists open to anybody

4 PETSc components (sequential / parallel)

5 Trilinos
- developed by Sandia National Laboratories
- collection of relatively independent packages: toolkits for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- mainly in C++ (Fortran and Python bindings)
- current version: see trilinos.sandia.gov

6 Trilinos components

7 Both PETSc and Trilinos…
- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense linear algebra
- have their own implementations of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open source

Potential users:
- PETSc, "essential object orientation": for programmers used to functional programming but seeking more modular code; recommended for C and Fortran users
- Trilinos, "pure object orientation": for programmers who are not scared of OOP, appreciate good SW design and have some experience with C++; offers even better extensibility, reusability and SW project management

8 Problem of elastostatics
Let me introduce a simple model problem. Omega is a isotropic elastic body such as steel traverse. One side is fixed to the wall – Gamma_U – Dirichlet boundary. Others are loaded by prescribed surface traction. f

9 TFETI decomposition
- to apply TFETI domain decomposition, we tear the original body off the Dirichlet boundary
- we decompose it into a system of elastic bodies of nearly the same size; new artificial boundaries arise between them
- continuity across the artificial boundaries is enforced by gluing conditions

10 Primal discretized formulation
The FEM discretization with a suitable numbering of nodes results in the QP problem:
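The QP itself was shown as an image in the original slide. A standard statement of the TFETI primal problem (our reconstruction, using the usual notation rather than the slide's own) is

\[ \min_{u} \; \tfrac{1}{2}\, u^\top K u - f^\top u \quad \text{subject to} \quad B u = c, \]

where K = diag(K_1, …, K_s) is the block-diagonal stiffness matrix with (possibly singular) subdomain blocks, f is the load vector, and B enforces both the gluing conditions on the artificial interfaces and the Dirichlet conditions (typically c = o).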

11 Dual discretized formulation (homogenized)
Again a QP problem, but with a lower dimension and simpler constraints.
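The dual problem was likewise an image; in the usual TFETI notation (our reconstruction) it reads

\[ \min_{\lambda} \; \tfrac{1}{2}\, \lambda^\top F \lambda - \lambda^\top d \quad \text{subject to} \quad G\lambda = e, \]
\[ F = B K^{+} B^\top, \qquad G = R^\top B^\top, \qquad d = B K^{+} f - c, \qquad e = R^\top f, \]

where K^{+} denotes a generalized inverse of K and the columns of R span Ker K (the rigid body modes, known a priori in TFETI).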

12 Primal data distribution, F action
- matrix distribution is straightforward, given by the decomposition; the constraint matrix is very sparse
- K is block diagonal → the F action is embarrassingly parallel
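As an illustration of why this step is embarrassingly parallel, the action λ ↦ Fλ = B K⁺ Bᵀ λ can be implemented matrix-free, e.g. as a PETSc shell matrix. This is a minimal sketch only; the names FCtx, FMult and the factored-K solver ksp_K are ours, not taken from the slides:

#include <petscksp.h>

/* Hypothetical context for the shell matrix F = B K^+ B^T. */
typedef struct {
  Mat B;      /* gluing/jump matrix                        */
  KSP ksp_K;  /* applies K^+, e.g. LU of the regularized K */
  Vec u, v;   /* work vectors in the primal space          */
} FCtx;

/* y = F x = B K^+ B^T x.  K is block diagonal, so the KSPSolve
   decomposes into independent per-subdomain solves.           */
static PetscErrorCode FMult(Mat F, Vec x, Vec y)
{
  FCtx           *ctx;
  PetscErrorCode  ierr;

  PetscFunctionBegin;
  ierr = MatShellGetContext(F, (void **)&ctx); CHKERRQ(ierr);
  ierr = MatMultTranspose(ctx->B, x, ctx->u);  CHKERRQ(ierr); /* u = B^T x */
  ierr = KSPSolve(ctx->ksp_K, ctx->u, ctx->v); CHKERRQ(ierr); /* v = K^+ u */
  ierr = MatMult(ctx->B, ctx->v, y);           CHKERRQ(ierr); /* y = B v   */
  PetscFunctionReturn(0);
}

The shell matrix is registered with MatCreateShell() and MatShellSetOperation(F, MATOP_MULT, (void (*)(void))FMult), so the CG solver never needs F in assembled form.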

13 Coarse projector action
The projector application can easily take 85 % of the computation time if not properly parallelized!
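The formulas on this slide were graphical; in the standard TFETI notation, the orthogonal projector onto Ker G applied in every CG iteration is

\[ P = I - Q, \qquad Q = G^\top \left( G G^\top \right)^{-1} G, \]

so each application of P requires a solve with the coarse problem matrix GGᵀ, hence the warning above.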

14 G preprocessing and action
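The slide content was graphical; from the dual formulation above, G is (our reconstruction)

\[ G = R^\top B^\top, \]

assembled from the a priori known rigid body modes of the subdomains, so no expensive computation of Ker K is needed, which is an advantage of TFETI over classical FETI.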

15 Coarse problem preprocessing and action
(figure: preprocessing steps 1–3 and the coarse problem action)
Currently used variant: B2 (PPAM 2011)
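The numbered preprocessing steps were graphical; a typical realization (our hedged sketch; the B2 variant itself is detailed in the PPAM 2011 reference) is

\[ \text{preprocessing:} \quad G G^\top = L L^\top \ \text{(parallel Cholesky factorization)}, \]
\[ \text{action:} \quad Q x = G^\top \left( L^{-\top} \left( L^{-1} (G x) \right) \right), \]

i.e. one multiplication by G, two triangular solves, and one multiplication by Gᵀ per projector application.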

16 Coarse problem

17 HECToR phase 3 (XE6)
- the UK's largest, fastest and most powerful supercomputer
- supplied by Cray Inc., operated by EPCC
- uses the latest AMD "Bulldozer" multicore processor architecture
- 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
- each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, a total of 90,112 cores
- each 16-core processor shares 16 GB of memory, 90 TB in total
- theoretical peak performance over 800 TFLOPS

18 Benchmark
- K⁺ implemented as a direct solve (LU) of the regularized K
- built-in CG routine used (PETSc KSP, Trilinos Belos)
- E = 1e6, ν = 0.3, g = 9.81 m s⁻²
- computed on HECToR
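On the PETSc side, a solver setup matching the stated criteria (built-in CG, relative residual below 1e-5, no preconditioner) might look as follows; a minimal sketch only, assuming the shell matrix F from the earlier sketch and vectors d, lambda already exist:

KSP            ksp;
PC             pc;
PetscErrorCode ierr;

ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
/* 4-argument form used by PETSc 3.2; newer versions drop the flag */
ierr = KSPSetOperators(ksp, F, F, DIFFERENT_NONZERO_PATTERN); CHKERRQ(ierr);
ierr = KSPSetType(ksp, KSPCG); CHKERRQ(ierr);          /* built-in CG        */
ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
ierr = PCSetType(pc, PCNONE); CHKERRQ(ierr);           /* no preconditioning */
/* stopping criterion ||r_k|| / ||r_0|| < 1e-5 */
ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT); CHKERRQ(ierr);
ierr = KSPSolve(ksp, d, lambda); CHKERRQ(ierr);        /* the production code applies
                                                          CG to the projected system  */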

19 Results

# subdomains = # cores          1        4        16       64       256      1024
Primal dimension                31752    …        …        …        …        …
Dual dimension                  252      1512     7056     30240    124992   508032
Solution time [s]  (Trilinos)   1.39     3.01     4.80     6.25     10.31    28.05
Solution time [s]  (PETSc)      1.14     2.66     4.16     4.74     4.92     5.84
# iterations       (Trilinos)   34       63       96       105      105      102
# iterations       (PETSc)      33       68       94       105      105      102
Time/iteration [s] (Trilinos)   4.48e-2  4.76e-2  5.00e-2  5.95e-2  9.81e-2  2.75e-1
Time/iteration [s] (PETSc)      3.46e-2  3.92e-2  4.42e-2  4.52e-2  4.69e-2  5.73e-2

stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning

20 Application to image registration
Image registration is a crucial step of image processing whenever there is a need to compare or integrate information from two or more images. Given a reference image R and a template image T, the goal is to find an optimal transformation such that T becomes, in a certain sense, similar to R. The images usually show the same scene, but taken at different times, from different viewpoints or by different sensors.
- usage in weather forecasting, GIS, medicine, cartography, computer vision
- process of integrating information from two (or more) different images
- images from different sensors, different angles and/or times

21 Application to image registration
In medicine:
- monitoring of tumour growth
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

22 Elastic registration
The task is to minimize the distance between the two images under the transformation φ(x) := x − u(x) (T is the template, R the reference).
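The objective shown on the slide was an image; the usual elastic registration functional (our reconstruction of the standard formulation) combines an image distance term with the linear elastic potential of the displacement u:

\[ \min_{u} \; \tfrac{1}{2} \int_{\Omega} \big( T(x - u(x)) - R(x) \big)^{2} \,\mathrm{d}x \;+\; \int_{\Omega} \frac{\mu}{4} \sum_{j,k} \big( \partial_{x_j} u_k + \partial_{x_k} u_j \big)^{2} + \frac{\lambda}{2} \left( \operatorname{div} u \right)^{2} \mathrm{d}x, \]

with Lamé parameters μ and λ. The associated Euler-Lagrange equations are the Navier-Lamé equations of linear elasticity, which is exactly where the TFETI elasticity solver of the previous slides enters.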

23 Elastic registration
Parallelization using the TFETI method.

24 Results

# of subdomains        1       4       16
Primal variables       20402   81608   326432
Dual variables         903     2641    8254
Solution time [s]      41      34.54   57.44
# of iterations        2467    990     665
Time/iteration [s]     0.01    0.03    0.08

stopping criterion: ||r_k|| / ||r_0|| < 1e-5

25 Solution

26 Conclusion and future work
- consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- extend the image registration to 3D data

27 References
KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.

28 Thank you for your attention!

