1 High-Performance Eigensolver for Real Symmetric Matrices: Parallel Implementations and Applications in Electronic Structure Calculation
Yihua Bai
Department of Mathematics and Computer Science, Indiana State University

2 Contents
- Current status of real symmetric eigensolvers
- Motivation
- BD&C algorithm: a high-performance approximate eigensolver
- Parallel implementations of the BD&C algorithm
- Applications in electronic structure calculation and numerical results
- Summary and future work

3 Current Status of Dense Symmetric Eigensolvers
ScaLAPACK provides three parallel routines:
- PDSYEVD (divide-and-conquer)
- PDSYEVX (bisection and inverse iteration)
- PDSYEVR (Multiple Relatively Robust Representations, MRRR)
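The serial LAPACK counterparts of these routines can be selected through SciPy's `eigh` driver argument, which gives a convenient single-node way to compare the three underlying algorithms. A minimal sketch (assumes SciPy >= 1.5; the parallel ScaLAPACK routines themselves are not exposed here):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
A = (A + A.T) / 2                        # symmetrize the test matrix

w_evd, X_evd = eigh(A, driver="evd")     # divide-and-conquer (cf. PDSYEVD)
w_evx, X_evx = eigh(A, driver="evx")     # bisection + inverse iteration (cf. PDSYEVX)
w_evr, X_evr = eigh(A, driver="evr")     # MRRR (cf. PDSYEVR)

print(np.max(np.abs(w_evd - w_evr)))     # all three agree to machine precision
```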

4 Classical Three Steps to Decompose A = XΛX^T
1. Reduction to symmetric tridiagonal form: A = HTH^T
2. Eigen-decomposition of the tridiagonal matrix: T = VΛV^T
   - Cuppen's divide-and-conquer
   - Bisection and inverse iteration
   - Multiple Relatively Robust Representations (MRRR)
3. Back-transformation of the eigenvectors: X = HV
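A minimal serial sketch of the three steps, assuming SciPy. Note that `scipy.linalg.hessenberg` uses the general (DGEHRD) reduction, which for symmetric input yields a tridiagonal T only up to rounding; a production eigensolver would use the symmetric kernel DSYTRD instead:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 300))
A = (A + A.T) / 2

T, H = hessenberg(A, calc_q=True)        # step 1: A = H T H^T (T tridiagonal)
d, e = np.diag(T), np.diag(T, k=1)       # main and first off-diagonal of T
lam, V = eigh_tridiagonal(d, e)          # step 2: T = V Lam V^T
X = H @ V                                # step 3: back-transformation X = H V

print(np.linalg.norm(A @ X - X * lam))   # small residual: A = X Lam X^T
```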

5 Bottleneck of Classical Approaches
Reduction to tridiagonal form dominates the execution time (timing charts for PDSYEVD and PDSYEVR not preserved).
Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS, University of Tennessee, August 2006.

6 Limitation of Classical Approaches
They compute the eigen-solution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation.
Questions: Can we trade accuracy for efficiency? How?

7 Motivation: a high-performance approximate eigensolver for electronic structure calculation

8 Schrödinger's Equation: An Intrinsic Eigenvalue Problem
In time-independent form, Hψ = Eψ: the Hamiltonian operator H plays the role of the matrix, the wave functions ψ are the eigenvectors, and the energies E are the eigenvalues.

9 Computation of Electronic Structure
Solve Schrödinger's equation efficiently.
- Different approximation methods: Hartree-Fock approximation, density functional theory, configuration interaction, etc.
- Self-Consistent Field (SCF) method: solve a generalized non-linear real symmetric eigenvalue problem iteratively.
  - A standard linear eigenvalue problem is solved in each iteration, typically the most time-consuming part of electronic structure calculation (see the sketch below).
  - Low accuracy suffices in earlier iterations.
  - Matrices from application problems may have locality properties.
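The sketch below shows schematically where the eigensolver sits inside SCF and how its tolerance can be loosened in early iterations. `build_fock` and `density_from` are hypothetical placeholders, and `approx_eigh` merely stands in for the PBT + PBD&C pipeline described later:

```python
import numpy as np

def approx_eigh(F, tol):
    # Stand-in for an approximate eigensolver; a real implementation
    # (e.g., the talk's PBT + PBD&C) would exploit `tol` for speed.
    return np.linalg.eigh(F)

def scf_loop(build_fock, density_from, n, max_iter=50, conv=1e-8):
    P = np.zeros((n, n))                      # initial density guess
    for it in range(max_iter):
        F = build_fock(P)                     # assemble Fock/Hamiltonian matrix
        tau = max(conv, 10.0 ** -(3 + it))    # loose tolerance early, tighter later
        lam, C = approx_eigh(F, tol=tau)      # the eigensolve dominates each iteration
        P_new = density_from(lam, C)          # new density from occupied eigenvectors
        if np.linalg.norm(P_new - P) < conv:  # self-consistency reached
            return lam, C
        P = P_new
    raise RuntimeError("SCF did not converge")
```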

10 Problem Definition
Given a real symmetric matrix A and an accuracy tolerance τ, compute A ≈ ẐΛ̂Ẑ^T, where Ẑ and Λ̂ contain the approximate eigenvectors and eigenvalues, respectively, satisfy a residual bound of order τ‖A‖, and Ẑ is numerically orthogonal.
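A small acceptance test in the spirit of this definition (a sketch; the exact norm and constant in the slide's bound were not preserved, so the Frobenius norm and a plain τ threshold are assumed here):

```python
import numpy as np

def accept(A, Z, lam, tau):
    """Return (residual_ok, orthogonality) for an approximate eigendecomposition."""
    residual = np.linalg.norm(A @ Z - Z * lam) / np.linalg.norm(A)  # ||AZ - Z Lam|| / ||A||
    orthogonality = np.linalg.norm(Z.T @ Z - np.eye(Z.shape[1]))    # departure from I
    return residual <= tau, orthogonality
```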

11 Block Algorithms for Approximate Eigensolver
1) Block-tridiagonal divide-and-conquer (BD&C): the centerpiece
2) Block tridiagonalization (BT): block tridiagonalization of sparse and "effectively" sparse matrices
3) Orthogonal reduction of a full matrix to block-tridiagonal form (OBR): orthogonal transformations that produce a block-tridiagonal matrix

12 1) BD&C Algorithm *
Decompose a block tridiagonal matrix M with q diagonal blocks, M ≈ ẐΛ̂Ẑ^T, where Ẑ is the numerically orthogonal eigenvector matrix, Λ̂ the diagonal matrix of eigenvalues, and τ the accuracy tolerance.
* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65–85.

13 Three Steps of BD&C
1. Subdivision: approximate the off-diagonal blocks by low-rank factorizations.
2. Solve sub-problems: decompose each modified diagonal block.
3. Synthesis: the most time-consuming step. Merge the sub-problem solutions through low-rank modifications; decompose the modified diagonal matrix, then multiply the sub-problem eigenvectors V_i by the update eigenvectors Z.
Complexity: a function of deflation, off-diagonal ranks, and matrix size.
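The data flow of one synthesis step can be illustrated in the scalar (rank-one) case, where the merged matrix is diag(T1, T2) + β·v·vᵀ. A minimal sketch; production codes solve the update via the secular equation with deflation rather than a dense eigendecomposition, and BD&C uses one rank-one update per singular value kept in the off-diagonal block:

```python
import numpy as np

def merge(T1, T2, beta):
    """Merge two solved sub-problems of M = diag(T1, T2) + beta * v v^T."""
    n1, n2 = T1.shape[0], T2.shape[0]
    d1, Q1 = np.linalg.eigh(T1)                 # sub-problem eigensolutions
    d2, Q2 = np.linalg.eigh(T2)
    Q = np.zeros((n1 + n2, n1 + n2))
    Q[:n1, :n1], Q[n1:, n1:] = Q1, Q2
    v = np.zeros(n1 + n2)
    v[n1 - 1] = v[n1] = 1.0                     # coupling between the two halves
    z = Q.T @ v
    D = np.diag(np.concatenate([d1, d2]))
    lam, Z = np.linalg.eigh(D + beta * np.outer(z, z))  # decompose the rank-one update
    return lam, Q @ Z                           # then multiply: X = Q Z

rng = np.random.default_rng(3)
T1 = rng.standard_normal((4, 4)); T1 = (T1 + T1.T) / 2
T2 = rng.standard_normal((3, 3)); T2 = (T2 + T2.T) / 2
lam, X = merge(T1, T2, beta=0.5)
# X diag(lam) X^T reproduces diag(T1, T2) + 0.5 v v^T exactly
```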

14 2) Block Tridiagonalization (BT) *
- Constructs a block-tridiagonal approximation to the original full matrix
- May require eigenvectors from the previous iteration
* Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of "Effectively" Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326–352.
Complexity: (formula not preserved)

15 3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR)
- For a full matrix that cannot be sparsified
- Applies a sequence of Householder transformations
Complexity: (formula not preserved)

16 Complexity of Major Components
(table not preserved: computational complexities for BD&C, BT, and OBR)
Legend: message-passing latency; time to transfer one floating-point number; time for one floating-point operation; ranks of the off-diagonal blocks.

17 Parallel Implementations
- Parallel block divide-and-conquer (PBD&C) *
- Preprocessing:
  - Parallel block tridiagonalization (PBT)
  - Parallel orthogonal block-tridiagonal reduction (POBR) **
* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS, University of Tennessee, December. Submitted to ACM TOMS.
** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS, University of Tennessee, June. Submitted to ACM TOMS.

18 Implementations of PBD&C Mixed data/task parallel implementation versus complete data parallel implementation

19 Mixed Parallel Implementation
- Data distribution and redistribution
- Merging sequence and workload balance
- Mixed parallelism: data/task
- Deflation

20 Matrix Distribution – Mixed Data/Task Parallelism
- Divide the processors into groups of sub-grids
- Assign each sub-grid to a sub-problem
(figure: block-tridiagonal matrix with q diagonal blocks)

21 Matrix Distribution – Example
- Each diagonal block is assigned a sub-grid
- 2D block-cyclic distribution on each sub-grid
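For reference, a sketch of the standard ScaLAPACK-style 2D block-cyclic owner map used within each sub-grid (square blocks of size nb on a pr × pc process grid):

```python
# Which process (row, col) of a pr x pc grid owns global matrix entry (i, j)
# under a 2D block-cyclic distribution with square blocks of size nb.
def owner(i, j, nb, pr, pc):
    return (i // nb) % pr, (j // nb) % pc

print(owner(5, 0, nb=2, pr=2, pc=2))   # -> (0, 0)
```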

22 Data Redistribution
Redistribute data from one sub-grid to another (subdivision step), e.g., from a 2×2 grid to a 3×3 grid.

23 Data Redistribution (cont'd)
Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step), e.g., from a 2×2 and a 2×4 grid to a 3×4 grid.

24 Merging Sequence
(figure: merge tree with levels 0–4, sub-tree heights h_left and h_right, and idle time before the final merging operation)
The final merging operation accounts for up to 75% of the total computational cost, so the merging sequence must weigh low computational complexity and workload balance at the same time for the final merge.

25 Problems
- Sub-grid construction. Example: sub-grid 1 is 2×2 and sub-grid 2 is 5×5 — should the super-grid be 1×29?
- Many communicator handles: up to 2k handles may be used, where k = max(number of diagonal blocks, total number of processors)
- Portability across MPI implementations. Example: minor code modifications are needed when using mpimx (Myrinet MPI)

26 Complete Data Parallel Implementation
Assign all processors to each block of the block-tridiagonal matrix with q diagonal blocks. For example, a single 2×2 processor grid is assigned in turn to B_1, B_2, …, B_q and C_1, C_2, …, C_{q-1}.

27 Advantages and Disadvantages
Advantages:
- One communicator
- One processor grid
- Portability across MPI platforms
Disadvantages:
- Not all processors are involved in some steps: the SVD of off-diagonal blocks, the decomposition of diagonal blocks, and merges of smaller sub-problems
- Data redistribution is still needed for each merging operation

28 Numerical Results
- Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD
  - Matrices with different eigenvalue distributions and different sizes
  - Banded application matrix
- Complete data parallel BD&C subroutine PDSBTDCD vs. mixed data/task parallel BD&C subroutine PDSBTDC

29 Machine Specifications
IBM p690 system at ORNL (specification table not preserved)

30 PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions
(plots: arithmetically distributed and geometrically distributed eigenvalues)
τ = 10⁻⁶, b = 20

31 Accuracy of PDSBTDC
Residual: ‖AẐ − ẐΛ̂‖ relative to ‖A‖ (plot not preserved)
Departure from orthogonality: ‖Ẑ^TẐ − I‖ (plot not preserved)

32 PDSBTDC on an Application Matrix
PDSBTDC with different tolerances; polyalanine matrix, n = 5027, b = 79.

33 Performance Test on UT SInRG AMD Opteron Processor 240 Cluster
Number of nodes: 64
Memory per node: 2 GB
Processors per node: 2
CPU frequency: 1.4 GHz
L2 cache: 1 MB
TLB size: 1024 4K pages
Interconnect: Myrinet 2000
Similar performance, and scales a little better.

34 PDSBTDC vs. PDSBTDCD Performance
Block-tridiagonal matrix with arithmetically distributed eigenvalues; matrix size = 12000, block size = 20 (tolerance value not preserved).
The data parallel implementation scales down in the SVD of the off-diagonal blocks and in solving sub-problems.

35 Application in Electronic Structure Calculation: Trans-Polyacetylene
- Simple chemical structure
- Semiconducting conjugated polymer
- Light-emitting devices, flexible
- Fast nonlinear optical response
- Strong nonlinear susceptibility

36 Matrix Generated from trans-PA
Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.

37 Two Steps to Compute an Approximate Eigen-Solution
1. Construct a block-tridiagonal matrix from the original dense matrix H: M = H + E, where M is block tridiagonal. Algorithm: PBT.
2. Compute the eigen-solution to reduced accuracy (user-defined tolerance). Algorithm: PBD&C.
(a serial sketch of this pipeline appears below)
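A serial sketch of the two steps, with the real BT algorithm replaced by a fixed block partition; choosing the partition so that ‖E‖ stays below the tolerance is exactly what BT automates. The eigenvalue error is bounded by ‖E‖ (Weyl's theorem):

```python
import numpy as np

def block_tridiagonalize(H, block_sizes):
    """Keep only the block tridiagonal part of H: M = H + E."""
    M = np.zeros_like(H)
    starts = np.concatenate([[0], np.cumsum(block_sizes)])
    for k in range(len(block_sizes)):
        lo, hi = starts[k], starts[k + 1]
        up = starts[min(k + 2, len(starts) - 1)]
        M[lo:hi, lo:up] = H[lo:hi, lo:up]   # diagonal + right off-diagonal block
        M[lo:up, lo:hi] = H[lo:up, lo:hi]   # and the symmetric counterpart
    return M

rng = np.random.default_rng(2)
H = rng.standard_normal((200, 200)); H = (H + H.T) / 2
# impose decay away from the diagonal ("effectively sparse" test matrix)
H *= np.exp(-np.abs(np.subtract.outer(np.arange(200), np.arange(200))) / 10.0)

M = block_tridiagonalize(H, [50, 50, 50, 50])      # step 1: M = H + E
lam_H = np.linalg.eigvalsh(H)
lam_M = np.linalg.eigvalsh(M)                      # step 2: solve M instead of H
print(np.max(np.abs(lam_H - lam_M)))               # eigenvalue error ~ ||H - M||
```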

38 Execution Time Compared with ScaLAPACK PDSYEVD
Trans-(CH)_x matrix, n = 16000 (tolerance value not preserved). With lower accuracy, the savings in execution time reach an order of magnitude.

39 Relative Execution Time with Fixed n²/p
With fixed per-processor problem size, the relative execution time of an O(n³) algorithm should follow the reference line. The curve for our new parallel algorithm shows a computational complexity between O(n²) and O(n³).

40 Conclusion and Future Work

41 Conclusion
- PBD&C is very efficient on block-tridiagonal matrices with low ranks for the off-diagonal blocks and a high ratio of deflation.
- Comparison of PDSBTDC and PDSBTDCD: PDSBTDCD performs better with smaller numbers of processors; PDSBTDC scales better as the number of processors increases.
- PBD&C combined with PBT is efficient on application matrices with the relevant locality property.

42 Future Work
- A parallel adaptive eigensolver
- Alternative methods for the computation of eigenvectors
- Approximation in sparse eigensolvers
- Incorporate PBD&C and PBT into SCF for trans-PA
- Fine-tuning of PDSBTDCD

43 End of Presentation Thank you!

44 Acknowledgement
Dr. R. P. Muller, Sandia National Laboratories
Dr. G. Zhang, Indiana State University

45 Task Flowchart
(flowchart not preserved)
Major efficiency improvements come from reduced accuracy in the early iterations of SCF and from reducing the reduction bottleneck. Eigenvectors may be required if efforts are made to improve efficiency.

46 Complexity of Major Components
(table not preserved: sequential and parallel complexities for BD&C, BT, and OBR)
Legend: message-passing latency; time to transfer one floating-point number; time for one floating-point operation; n_b, block size for the parallel 2D matrix distribution.