1 High-Performance Eigensolver for Real Symmetric Matrices: Parallel Implementations and Applications in Electronic Structure Calculation Yihua Bai Department of Mathematics and Computer Science Indiana State University
2 Contents Current status of real symmetric eigensolvers Motivation BD&C algorithm – a high performance approximate eigensolver Parallel implementations of BD&C algorithm Applications in electronic structure calculation and numerical results Summary and Future Work
3 Current Status of Dense Symmetric Eigensolvers PDSYEVD PDSYEVX PDSYEVR
4 Classical Three Steps to Decompose A = XΛX^T 1. Reduction to symmetric tridiagonal form: A = HTH^T 2. Eigen-decomposition of the tridiagonal matrix: T = VΛV^T (Cuppen's divide-and-conquer; bisection and inverse iteration; Multiple Relatively Robust Representations (MRRR)) 3. Back-transformation of the eigenvectors: X = HV
5 Bottleneck of Classical Approaches Reduction time is the bottleneck PDSYEVD PDSYEVR Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS , University of Tennessee August 2006
6 Limitation of Classical Approaches They compute the eigensolution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation Questions: Can accuracy be traded for efficiency? How?
7 Motivation A high performance approximate eigensolver for electronic structure calculation
8 Schrödinger’s Equation: An Intrinsic Eigenvalue Problem
9 Computation of Electronic Structure Solve Schrödinger’s Equation efficiently Different approximation methods Hartree-Fock approximation density functional theory configuration interaction …, etc. Self-Consistent Field method Solve generalized non-linear real symmetric eigenvalue problem iteratively A standard linear eigenvalue problem solved in each iteration. Typically the most time consuming part of electronic structure calculation Low accuracy suffices in earlier iterations Matrices from application problems may have locality properties
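The point that low accuracy suffices in earlier SCF iterations can be illustrated with a tolerance schedule. This is only a sketch: the function name `tolerance_schedule`, the geometric tightening rule, and the default values `loose_tol` and `tight_tol` are assumptions made for illustration and do not come from the slides.

```python
def tolerance_schedule(n_iters, loose_tol=1e-2, tight_tol=1e-10):
    """Return one eigensolver tolerance per SCF iteration.

    Hypothetical rule: start at loose_tol and tighten geometrically so that
    the last iteration uses tight_tol.
    """
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    if n_iters == 1:
        return [tight_tol]
    # Constant ratio between consecutive tolerances.
    ratio = (tight_tol / loose_tol) ** (1.0 / (n_iters - 1))
    return [loose_tol * ratio ** k for k in range(n_iters)]


if __name__ == "__main__":
    for it, tol in enumerate(tolerance_schedule(5)):
        print(f"SCF iteration {it}: eigensolver tolerance {tol:.1e}")
```

An approximate eigensolver would then be called with the tolerance for the current iteration, so the cheap low-accuracy solves happen early and only the last iterations pay for high accuracy.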
10 Problem Definition Given a real symmetric matrix A and accuracy tolerance , want to compute where and contain the approximate eigenvectors and eigenvalues, respectively, and satisfy
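One way the problem above might be written in symbols is sketched below. The names (tolerance \tau, approximate eigenvector matrix \hat{X} with columns \hat{x}_i, approximate eigenvalue matrix \hat{\Lambda} with diagonal entries \hat{\lambda}_i), the choice of the 2-norm, and the exact form of the two conditions are assumptions for illustration, not the slide's own formulas.

```latex
A \approx \hat{X}\hat{\Lambda}\hat{X}^{T},
\qquad
\max_{i}\frac{\lVert A\hat{x}_i - \hat{\lambda}_i\hat{x}_i\rVert_2}{\lVert A\rVert_2} = O(\tau),
\qquad
\max_{i}\lVert(\hat{X}^{T}\hat{X} - I)\,e_i\rVert_2 = O(\varepsilon_{\mathrm{mach}}\,n)
```

The first condition bounds the scaled residual by the user tolerance; the second asks that the computed eigenvectors stay numerically orthogonal regardless of the tolerance.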
11 Block Algorithms for Approximate Eigensolver 1) Block-tridiagonal divide-and-conquer (BD&C) – The centerpiece 2) Block tridiagonalization (BT) – Block tridiagonalization of sparse and “effectively” sparse matrices 3) Orthogonal reduction of full matrix to block- tridiagonal form (OBR) – Orthogonal transformations to produce block-tridiagonal matrix
12 1) BD&C Algorithm * Decompose: block tridiagonal matrix where numerically orthogonal eigenvector matrix accuracy tolerance number of blocks diagonal matrix of eigenvalues * W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65 – 85.
13 Three Steps of BD&C 1. Subdivision 2. Solve Sub-problem 3. Synthesis – the most time consuming step with decompose: where:,, decompose, then multiply V_i and Z Complexity: a function of deflation, rank, and size
14 2) Block Tridiagonalization (BT)* An approximation to the original full matrix May require eigenvectors from previous iteration * Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352. Complexity:
15 3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR) * A full matrix that cannot be sparsified A sequence of Householder transformations Complexity:
16 Complexity of Major Components Algorithm | Computational Complexity BD&C BT OBR message passing latency time to transfer one floating point number time for one floating point operation ranks for off-diagonal blocks
17 Parallel Implementations Parallel block divide-and-conquer (PBD&C) * Preprocessing Parallel block tridiagonalization (PBT) Parallel orthogonal block-tridiagonal reduction (POBR) ** * Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS , University of Tennessee, December Submitted to ACM TOMS ** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS , University of Tennessee, June Submitted to ACM TOMS
18 Implementations of PBD&C Mixed data/task parallel implementation versus complete data parallel implementation
19 Mixed Parallel Implementation Data distribution and redistribution Merging sequence and workload balance Mixed parallelism – data/task Deflation
20 Matrix Distribution – Mixed Data/Task Parallelism Divide processors into groups of sub-grids Assign each sub-grid to a sub-problem Block-tridiagonal matrix with q diagonal blocks
21 Matrix Distribution – Example Each diagonal block assigned a sub-grid 2D block cyclic distribution on each sub-grid
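The 2D block cyclic distribution used on each sub-grid can be sketched as an index mapping. The function name `block_cyclic_owner` and the conventions (0-based indices, square nb-by-nb blocks, the first block owned by process (0, 0)) are assumptions for this illustration; the deck does not state which conventions the implementation uses.

```python
def block_cyclic_owner(i, j, nb, prow, pcol):
    """Map global entry (i, j) to (process row, process col, local row, local col).

    Assumes 0-based indices, nb-by-nb blocks, a prow-by-pcol process grid,
    and that the first block lives on process (0, 0).
    """
    def one_dim(g, nprocs):
        block = g // nb                           # global block index
        proc = block % nprocs                     # blocks are dealt out cyclically
        local = (block // nprocs) * nb + g % nb   # position inside the owner's storage
        return proc, local

    pr, li = one_dim(i, prow)
    pc, lj = one_dim(j, pcol)
    return pr, pc, li, lj


if __name__ == "__main__":
    # Owner of each entry of a 6-by-6 matrix on a 2-by-2 grid with 2-by-2 blocks.
    for i in range(6):
        print(" ".join("P%d%d" % block_cyclic_owner(i, j, 2, 2, 2)[:2] for j in range(6)))
```

Because the row and column mappings are independent, the same one-dimensional rule is applied twice, once per grid dimension.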
22 Data Redistribution Distribute from a 2 × 2 grid to a 3 × 3 grid Redistribute data from one sub-grid to another one (subdivision step)
23 Data Redistribution (cont'd) Distribute from a 2 × 2 grid and a 2 × 4 grid to a 3 × 4 grid Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step)
24 Merging Sequence [Figure: merge tree with Level 0 through Level 4, marking the final merging operation, idle time, and the labels h_left and h_right] The final merging operation accounts for up to 75% of the total computational cost. Consider both low computational complexity and workload balance when choosing the final merge.
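The trade-off between low complexity and workload balance for the final merge can be illustrated with a simple selection rule. This is a sketch under stated assumptions, not the rule used in PDSBTDC: the function name `choose_final_merge`, the `max_imbalance` threshold, and the idea of preferring the lowest-rank off-diagonal block among sufficiently balanced splits are all hypothetical.

```python
def choose_final_merge(block_sizes, ranks, max_imbalance=0.25):
    """Pick the off-diagonal block index k where the final merge splits the matrix.

    Splitting at k puts diagonal blocks 0..k on the left and k+1..q-1 on the
    right. Among splits whose relative size imbalance is at most max_imbalance,
    the lowest off-diagonal rank wins; ties go to the more balanced split.
    """
    q = len(block_sizes)
    if q < 2 or len(ranks) != q - 1:
        raise ValueError("need q >= 2 diagonal blocks and q - 1 off-diagonal ranks")
    n = sum(block_sizes)
    candidates = []
    left = 0
    for k in range(q - 1):
        left += block_sizes[k]
        imbalance = abs(2 * left - n) / n   # 0 means the two halves have equal size
        candidates.append((imbalance, ranks[k], k))
    allowed = [c for c in candidates if c[0] <= max_imbalance]
    if not allowed:
        allowed = [min(candidates)]         # fall back to the most balanced split
    return min(allowed, key=lambda c: (c[1], c[0]))[2]


if __name__ == "__main__":
    sizes = [10] * 8
    ranks = [3, 3, 1, 3, 2, 3, 3]
    print("final merge at off-diagonal block", choose_final_merge(sizes, ranks))
```

Applying the same rule recursively to each half would yield a full merging sequence; whether that matches the tree in the figure is not something the slides confirm.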
25 Problems Subgrid construction Example: subgrid 1: 2 × 2, subgrid 2: 5 × 5, supergrid: 1 × 29? Many communicator handles Can use up to 2k handles, where k = max(number of diagonal blocks, total number of processors) Portability across different MPI implementations Example: minor code modifications are needed when using mpimx (Myrinet MPI)
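The subgrid construction problem comes from simple arithmetic: merging a 2 × 2 sub-grid (4 processors) with a 5 × 5 sub-grid (25 processors) gives 29 processors, and 29 is prime. The sketch below picks the most nearly square grid for a given processor count; the function name `most_square_grid` and the "closest to square" criterion are assumptions for illustration.

```python
import math


def most_square_grid(p):
    """Return (rows, cols) with rows * cols == p and rows <= cols, as square as possible."""
    if p < 1:
        raise ValueError("p must be positive")
    rows = math.isqrt(p)
    while p % rows:        # walk down to the largest divisor not exceeding sqrt(p)
        rows -= 1
    return rows, p // rows


if __name__ == "__main__":
    for p in (4, 25, 4 + 25, 4 + 8):
        print(p, "processors ->", most_square_grid(p))
```

For 4 + 8 = 12 processors the result is a 3 × 4 grid, matching the redistribution example earlier in the deck, while 29 processors can only form a degenerate 1 × 29 grid.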
26 Complete Data Parallel Implementation Block-tridiagonal matrix with q diagonal blocks Assign all processors to each block in the block-tridiagonal matrix Assume a 2 × 2 processor grid, assigned to B_1, B_2, …, B_q and C_1, C_2, …, C_{q-1}.
27 Advantages and Disadvantages Advantages One communicator One processor grid Portability to different MPI platforms Disadvantages Not all processors involved in some steps SVD of off-diagonal blocks Decomposition of diagonal blocks Merge smaller sub-problems Still need data redistribution for each merging operation
28 Numerical Results Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD Matrices with different eigenvalue distributions and different sizes Banded application matrix Complete data parallel BD&C subroutine PDSBTDCD vs. Mixed data/task parallel BD&C subroutine PDSBTDC
29 Machine Specifications IBM p690 System in ORNL
30 PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions Arithmetically distributed eigenvalues Geometrically distributed eigenvalues tolerance = 10^-6, b = 20
31 Accuracy of PDSBTDC Residual: Departure from orthogonality:
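The two accuracy measures named above can be computed for any set of eigenpairs. The sketch below is a pure-Python illustration for small matrices; the function name `accuracy_measures`, the Frobenius-norm scaling of the residual, and the column-wise maxima are assumptions, since the exact definitions used for PDSBTDC are not given here.

```python
import math


def accuracy_measures(A, X, lam):
    """Return (scaled residual, departure from orthogonality) for eigenpairs of A.

    A is a symmetric matrix as a list of rows, the columns of X are the
    computed eigenvectors, and lam holds the computed eigenvalues.
    """
    n = len(A)
    # Frobenius norm of A, used here as an assumed scaling for the residual.
    norm_a = math.sqrt(sum(a * a for row in A for a in row))
    worst_res = 0.0
    worst_orth = 0.0
    for k in range(n):
        # Residual vector A x_k - lam_k x_k.
        r = [sum(A[i][j] * X[j][k] for j in range(n)) - lam[k] * X[i][k]
             for i in range(n)]
        worst_res = max(worst_res, math.sqrt(sum(v * v for v in r)))
        # Column k of X^T X - I.
        c = [sum(X[i][m] * X[i][k] for i in range(n)) - (1.0 if m == k else 0.0)
             for m in range(n)]
        worst_orth = max(worst_orth, math.sqrt(sum(v * v for v in c)))
    return worst_res / norm_a, worst_orth


if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    A = [[2.0, 1.0], [1.0, 2.0]]
    X = [[s, s], [-s, s]]          # exact eigenvectors of A
    print(accuracy_measures(A, X, [1.0, 3.0]))
    print(accuracy_measures(A, X, [1.1, 3.0]))   # perturbed eigenvalue
```

With an approximate solver the residual is expected to track the requested tolerance, while the orthogonality measure should stay near machine precision.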
32 PDSBTDC on Application Matrix PDSBTDC with different tolerances Polyalanine matrix, n = 5027, b = 79
33 Performance Test on UT SInRG AMD Opteron Processor 240 Cluster Number of nodes: 64 Memory per node: 2 GB Processors per node: 2 CPU frequency: 1.4 GHz L2 cache: 1 MB TLB size: 1024 4K pages Interconnect: Myrinet 2000 Similar performance and scales a little better
34 PDSBTDC vs. PDSBTDCD Performance Block-tridiagonal matrix with arithmetically distributed eigenvalues, Matrix size = 12000, block size = 20, tolerance = The data parallel implementation scales worse in the SVD of off-diagonal blocks and in solving sub-problems.
35 Application in Electronic Structure Calculation Trans-Polyacetylene Simple chemical structure Semiconducting conjugated polymer Light emitting devices, flexible Fast nonlinear optical response Strong nonlinear susceptibility
36 Matrix Generated from trans-PA Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.
37 Two Steps to Compute Approximate Eigen-Solution Construct block-tridiagonal matrix from the original dense matrix H M = H + E, where M is block tridiagonal Algorithm: PBT Compute eigensolutions to reduced accuracy User defined accuracy, typically Algorithm: PBD&C
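The relation M = H + E can be made concrete with a deliberately simplified stand-in for block tridiagonalization: keep the entries of H inside a given block-tridiagonal pattern and measure what was dropped. This is not the PBT algorithm itself; the function name `block_tridiagonal_part`, the fixed `block_sizes` argument, and the use of the Frobenius norm for E are assumptions for illustration.

```python
import math


def block_tridiagonal_part(H, block_sizes):
    """Return (M, err) where M keeps only the block-tridiagonal part of H.

    M = H + E with E holding the negated dropped entries; err is the
    Frobenius norm of E. H is a symmetric matrix given as a list of rows.
    """
    n = len(H)
    if sum(block_sizes) != n:
        raise ValueError("block sizes must sum to the matrix dimension")
    # block_of[i] is the index of the diagonal block containing row/column i.
    block_of = []
    for b, size in enumerate(block_sizes):
        block_of.extend([b] * size)
    M = [[H[i][j] if abs(block_of[i] - block_of[j]) <= 1 else 0.0
          for j in range(n)] for i in range(n)]
    err = math.sqrt(sum((M[i][j] - H[i][j]) ** 2
                        for i in range(n) for j in range(n)))
    return M, err


if __name__ == "__main__":
    H = [[4.0, 1.0, 0.001],
         [1.0, 4.0, 1.0],
         [0.001, 1.0, 4.0]]
    M, err = block_tridiagonal_part(H, [1, 1, 1])
    print(M, err)
```

If err is small relative to the requested eigensolver tolerance, the eigenpairs of M can serve as approximate eigenpairs of H; choosing the block structure so that this holds is the job of the actual BT step.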
38 Compare Execution Time with ScaLAPACK PDSYEVD Trans-(CH) n=16000, = With lower accuracy (i.e., ), the savings in execution time are an order of magnitude.
39 Relative Execution Time with Fixed n^2/p With a fixed per-processor problem size, the relative execution time for an O(n^3) algorithm should follow the reference line. The curve for our new parallel algorithm shows a computational complexity between O(n^2) and O(n^3).
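The shape of the reference line follows from a short calculation, sketched here under the assumption of ideal parallel efficiency (communication cost ignored). The symbols T_k, c, and n_0 are introduced only for this illustration: T_k is the parallel time of an O(n^k) algorithm, c is the fixed per-processor problem size, and n_0 is the smallest matrix size in the plot.

```latex
T_k(n,p) \propto \frac{n^{k}}{p},
\qquad
\frac{n^{2}}{p} = c
\;\Longrightarrow\;
T_k(n) \propto c\,n^{k-2},
\qquad
\frac{T_k(n)}{T_k(n_0)} = \left(\frac{n}{n_0}\right)^{k-2}
```

For k = 3 the relative time grows linearly in n, and for k = 2 it stays constant, so a measured curve lying between a flat line and the linear reference corresponds to a complexity between O(n^2) and O(n^3).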
40 Conclusion and Future Work
41 Conclusion PBD&C: very efficient on block tridiagonal matrices with Low ranks for off-diagonal blocks High ratio of deflation Comparison of PDSBTDC and PDSBTDCD PDSBTDCD performs better with smaller number of processors in use PDSBTDC scales better as the number of processors in use increases PBD&C combined with PBT Efficient on application matrices with specific locality property
42 Future Work A Parallel Adaptive Eigensolver Alternative method for computation of eigenvectors Approximation in sparse eigensolver Incorporate PBD&C and PBT into SCF for trans-PA Fine tuning of PDSBTDCD
43 End of Presentation Thank you!
44 Acknowledgement Dr. R. P. Muller Sandia National Laboratories Dr. G. Zhang Indiana State University
45 Task Flowchart Major efficiency improvements from Reduced accuracy in early iterations of SCF Reducing the reduction bottleneck Eigenvectors may be required if efforts are made to improve efficiency
46 Complexity of Major Components Sequential Parallel BD&C BT OBR message passing latency time to transfer one floating point number time for one floating point operation n_b: block size for parallel 2D matrix distribution