Shengxin Zhu The University of Oxford

1 What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?
Shengxin Zhu, The University of Oxford
Prof. Xingping Liu and Prof. Tongxiang Gu, National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics
SNSCC'12, shengxin.zhu@maths.ox.ac.uk

2 Outline
Brief introduction to heterogeneous supercomputers
Computational kernels of Krylov methods
Influence of communication
Case study: GPBiCG(m,l)
Challenging problems
Conclusion

3 Introduction to heterogeneous supercomputers
Dawning 5000A. Nodes: Bandwidth: Memory:
Dawning 5000 Top500 ranking history: 11/2008 11th; 06/2009 15th; 11/2009 19th; 06/2010 24th; 11/2010 35th; 06/2011 40th; 11/2011 58th.
Top500, Nov 2011: 1st K computer (JP); 2nd NUDT (CN); 3rd Cray (US); 4th Dawning (CN).

4 Computational kernels of Krylov methods
Vector update: parallel in nature.
Matrix-vector product: computation intensive; multi-core technology (CUDA/OpenMP).
Inner product: communication intensive (CPU/MPI).
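The division of labor among the three kernels can be sketched in plain Python (an illustration only; the function names, CSR layout, and data below are mine, not from the slides):

```python
# Sketch of the three Krylov kernels (illustrative, pure Python).
# In a distributed solver, the vector update needs no communication,
# the CSR mat-vec needs only neighbor (halo) exchange, and the inner
# product needs a global reduction across all processes.

def axpy(alpha, x, y):
    """Vector update y <- alpha*x + y: embarrassingly parallel."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def csr_matvec(data, indices, indptr, x):
    """Sparse mat-vec in CSR format: compute intensive, data mostly local."""
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

def dot(x, y):
    """Inner product: the one kernel that forces a global reduction."""
    return sum(xi * yi for xi, yi in zip(x, y))

# A = [[1, 2], [0, 3]] stored in CSR
data, indices, indptr = [1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3]
print(csr_matvec(data, indices, indptr, [1.0, 1.0]))  # [3.0, 3.0]
```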

5 Influence of communication: first glance
Computation is cheap; communication is expensive.
Based on Aztec, by Prof. Tuminaro et al. at Sandia.
S. Zhu, MSc thesis, CAEP, 2010.

6 The real reason communication is time-consuming
Small workshops: focused, less preparation time.
Conferences: diverse, more preparation time.

7 Strategies for minimizing communication
Replace dot products with other operations (semi-Chebyshev): workshops only, no conferences if possible. Inner-product-free methods: Gu, Liu, and Mo (2002).
Reorganize the algorithm so as to reduce the number of conferences and let each conference accept more talks: residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
Overlap communication with computation.
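The third strategy, overlapping communication with computation, can be simulated in plain Python by launching a stand-in for a non-blocking reduction on a worker thread. Everything here, including the simulated latency and all names, is my own illustration; a real code would use something like MPI_Iallreduce:

```python
# Overlapping communication with computation (illustrative sketch).
# A worker thread plays the role of a non-blocking global reduction;
# the main thread does useful local work while the "network" is busy.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_allreduce(local_partials):
    """Stand-in for a global reduction with network latency."""
    time.sleep(0.01)  # simulated communication delay
    return sum(local_partials)

def local_work(x):
    """Computation that does not depend on the reduction result."""
    return [2.0 * xi for xi in x]

local_partials = [1.0, 2.0, 3.0, 4.0]  # one partial dot per "process"
with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_allreduce, local_partials)  # start communication
    y = local_work([1.0, 1.0])                         # overlap computation
    global_dot = fut.result()                          # wait for the result

print(global_dot, y)  # 10.0 [2.0, 2.0]
```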

8 A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)
GPBiCG(1,0) = BiCGSTAB; GPBiCG(0,1) = GPBiCG; GPBiCG(1,1) = BiCGSTAB2.
Could be used to design a breakdown-free BiCGSTAB method.

9 GPBiCG(m,l) (S. Fujino, 2002)

10 GPBiCG(m,l) (S. Fujino, 2002), continued

11 Algorithm design of the PGPBiCG(m,l) method

12 PGPBiCG(m,l) method (reducing the number of global communications)
Algorithm reconstruction: three global communications become one!
The global synchronization points are merged by the reconstruction.
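The transcript does not include the PGPBiCG(m,l) recurrences themselves, but the general idea of merging several global communications can be sketched as packing local partial sums into a single reduction (a made-up miniature, not the actual algorithm):

```python
# Fusing three dot products into one global reduction (illustrative).
# Naively, each dot product triggers its own allreduce, i.e. a global
# synchronization; packing the local partial sums into one vector
# turns three synchronizations into one.

def local_partial_dots(xs_local, ys_local):
    """Each process packs its partial sums for several dot products."""
    return [sum(a * b for a, b in zip(x, y))
            for x, y in zip(xs_local, ys_local)]

def allreduce_sum(partials_per_proc):
    """Stand-in for a single MPI_Allreduce over the packed vector."""
    return [sum(col) for col in zip(*partials_per_proc)]

# Two "processes", each owning half of three vector pairs
proc0 = local_partial_dots([[1.0], [2.0], [3.0]], [[1.0], [1.0], [1.0]])
proc1 = local_partial_dots([[4.0], [5.0], [6.0]], [[1.0], [1.0], [1.0]])
dots = allreduce_sum([proc0, proc1])  # one reduction yields all three dots
print(dots)  # [5.0, 7.0, 9.0]
```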

13 Performance
Based on Aztec, by Prof. R. S. Tuminaro et al. at Sandia.

14 Convergence analysis
Residual replacement strategies.
Backward stability analysis.

15 Challenging problem: accurate computation of the dot product
"Why Mindless?" by Kahan.
Accurately computing the inner product: Ogita and Rump et al., Accurate sum and dot product, SIAM J. Sci. Comput., cited 188 times. (but) ... the PLASMA team.
Backward stability analysis of residual replacement methods: Carson and Demmel, A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods, April.
Reliable dot-product computation algorithms.

16 Conclusion: avoiding communication, reliable computation
Inner-product computation is very likely the most challenging kernel for heterogeneous high-performance computing (HHPC), while the matrix-vector product is important for both.
Software abstraction and thread programming are helpful; together with redesigned algorithms they will do better.
Math/algorithms, CS/performance, applications interface: Aztec; pOSKI (Parallel Optimized Sparse Kernel Interface library); Hypre, PETSc, Trilinos. pOSKI v1.0, May 2, 2012.

17 Thanks!

18 Initial study on communication complexity
More than ten thousand processors are connected by the network.
Global communication becomes an increasingly serious bottleneck.
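One way to see why global communication dominates at this scale (my own back-of-the-envelope, not from the slide): a tree-based or recursive-doubling allreduce needs ceil(log2 p) latency-bound rounds, so every inner product pays at least that many network latencies no matter how short the vectors are.

```python
# Latency-bound rounds of a tree/recursive-doubling allreduce.
# Each round costs the network latency alpha, so an inner product
# costs at least alpha * ceil(log2(p)) regardless of vector length.
import math

def allreduce_rounds(p):
    """Minimum number of latency-bound rounds for p processes."""
    return math.ceil(math.log2(p))

for p in (64, 1024, 16384):
    print(p, allreduce_rounds(p))  # 6, 10, and 14 rounds respectively
```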

19 Methods in the literature
Based on the former two strategies:
de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995).
Bücker and Sauren: parallel QMR method (1997).
Yang and Brent: improved CGS, BiCG, and BiCGSTAB methods.
Gu, Liu et al.: ICR, IBiCR, IBiCGSTAB(2), and PQMRCGSTAB methods.
Demmel et al.: CA-KSMs.
Gu, Liu, and Mo: MSD-CG, the multiple search direction conjugate gradient method (2004), which replaces the inner-product computations by solving small linear systems and eliminates global inner products completely. The idea was generalized to MPCG by Greif and Bridson (2006).

20 Comparison of the computational counts of the two algorithms

21 Comparison of the computational counts of the two algorithms, continued

22 Mathematical model of the time consumption

23 Scalability analysis

24 The optimal number of processors, P_opt
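The slide's own formula is not in the transcript. Under a common simplified model, assumed here purely for illustration, the run time is T(p) = a/p + b*p (computation shrinks with p, global communication grows with p), which calculus minimizes at p_opt = sqrt(a/b):

```python
# A simplified time model for choosing P_opt (my own illustration;
# the slide's actual model is not in the transcript). Assume
#     T(p) = a / p  +  b * p
# so dT/dp = -a/p**2 + b = 0 gives p_opt = sqrt(a / b).
import math

def p_opt(a, b):
    """Processor count minimizing T(p) = a/p + b*p."""
    return math.sqrt(a / b)

a, b = 1.0e6, 1.0          # hypothetical work and per-process overhead
T = lambda p: a / p + b * p
popt = p_opt(a, b)         # 1000.0 under these made-up constants
print(popt, T(popt), T(popt / 2), T(popt * 2))
```

Halving or doubling the processor count away from p_opt raises the modeled time, which is the trade-off the slide's P_opt captures.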

25 Convergence analysis

26 Numerical experiments: timing and improvements

27 Numerical experiments: speedup

28 Conclusions
The PGPBiCG(m,l) method is more scalable and parallel for solving large sparse unsymmetric linear systems on distributed parallel architectures.
Performance and isoefficiency analysis and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
The parallel communication performance can be improved by a factor larger than 3.
The PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
Further performance improvements: overlapping computation with communication; numerical stability.

