Performance of BLAS-3 Based Tridiagonalization Algorithms on Modern SMP Machines Yusaku Yamamoto Dept. of Computational Science & Engineering Nagoya University

Outline of the talk
1. Introduction
2. Tridiagonalization by the Householder Method
3. BLAS-3 based tridiagonalization algorithms
4. Performance evaluation
5. Conclusion

1. Introduction
The problem treated in this study
– The standard eigenvalue problem Ax = λx, where A is a real symmetric dense matrix.
Objective
– To develop a parallel eigensolver that can solve very large eigenproblems efficiently on PCs or workstations.
– Efficient use of the cache memory is the key to high performance, so algorithms with high locality of data reference are necessary.
Applications
– Molecular orbital methods: computing the electronic structure of large protein molecules requires solving dense eigenproblems of order more than 100,000.
– Principal component analysis (in data mining).

Flow chart of the eigenvalue/eigenvector computation of a symmetric matrix
1. Tridiagonalization (Householder method): reduce the real symmetric matrix A to a tridiagonal matrix T by an orthogonal similarity transformation Q^t A Q = T (Q: orthogonal). This is the step we focus on here.
2. Computation of the eigenvalues: solve |T – λ_i I| = 0 for the λ_i (QR method, divide & conquer method, bisection & inverse iteration, MR³ algorithm).
3. Computation of the eigenvectors: solve T y_i = λ_i y_i for the y_i.
4. Back transformation (the inverse transformation): x_i = Q y_i, which yields the eigenpairs {λ_i}, {x_i} of A.

Computational work and data locality of each part
– Tridiagonalization (Householder method): (4/3)N³ computational work, low data locality.
– Computation of the eigenvalues of T: O(N²) computational work, high data locality.
An algorithm with higher data locality is needed in the tridiagonalization part.

Objective of this study
Evaluate the performance of the following three algorithms, all variants of the Householder method optimized to enhance cache utilization:
(1) Dongarra’s algorithm
(2) Bischof’s algorithm
(3) Wu’s algorithm

2. Tridiagonalization by the Householder method
The basic idea
– Reduction by the Householder transformation H = I – αuu^t.
– H is a symmetric orthogonal matrix, and it eliminates all but the first element of a given vector.
– At the k-th step, H is multiplied from the left and from the right: this eliminates the entries of the pivot column below the subdiagonal (and, by symmetry, of the pivot row) and modifies the trailing submatrix.
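To make the elimination property concrete, here is a minimal NumPy check (values and variable names are ours, not from the talk; we use the numerically stable sign choice u_1 = d_1 + sgn(d_1)σ, which only flips the sign of H d relative to the formula on the next slide):

```python
import numpy as np

# Minimal sketch: build H = I - alpha*u*u^t and check that it eliminates
# all but the first element of a vector d (arbitrary example values).
d = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
sigma = np.linalg.norm(d)
u = d.copy()
u[0] += np.sign(d[0]) * sigma          # stable sign choice (see note above)
alpha = 2.0 / (u @ u)
H = np.eye(len(d)) - alpha * np.outer(u, u)
print(np.round(H @ d, 12))             # ~ (-sigma, 0, 0, 0, 0)
print(np.allclose(H @ H, np.eye(5)))   # H symmetric orthogonal => H*H = I: True
```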

Algorithm of the Householder method
[Step 1] Repeat [Step 2]–[Step 8] for k = 1 to N–2.
[Step 2] σ^(k) = sqrt(d^(k)t d^(k))
[Step 3] Compute the reflector: u^(k) = (d^(k)_1 – sgn(d^(k)_1)σ^(k), d^(k)_2, …, d^(k)_{N–k})
[Step 4] α^(k) = 2 / ∥u^(k)∥²
[Step 5] p^(k) = α^(k) C^(k) u^(k) (matrix-vector multiplication)
[Step 6] β^(k) = α^(k) u^(k)t p^(k) / 2
[Step 7] q^(k) = p^(k) – β^(k) u^(k)
[Step 8] C^(k) = C^(k) – u^(k) q^(k)t – q^(k) u^(k)t (rank-2 update)
Here d^(k) is the pivot column below the diagonal at step k, and C^(k) is the trailing (N–k)×(N–k) submatrix that remains to be reduced.
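The steps above translate almost line for line into NumPy. Below is a teaching sketch under our own naming; it discards the reflectors (which a real solver must keep for the back transformation) and uses the stable sign choice u_1 = d_1 + sgn(d_1)σ:

```python
import numpy as np

def householder_tridiagonalize(A):
    """Unblocked Householder tridiagonalization following Steps 1-8 above.
    Returns the tridiagonal matrix T similar to the symmetric matrix A."""
    C = np.array(A, dtype=float)
    N = C.shape[0]
    for k in range(N - 2):
        d = C[k+1:, k].copy()                 # pivot column below the diagonal
        sigma = np.linalg.norm(d)             # Step 2
        if sigma == 0.0:
            continue                          # column already eliminated
        s = 1.0 if d[0] >= 0 else -1.0
        u = d; u[0] += s * sigma              # Step 3 (stable sign choice)
        alpha = 2.0 / (u @ u)                 # Step 4
        p = alpha * (C[k+1:, k+1:] @ u)       # Step 5: matrix-vector multiply
        beta = alpha * (u @ p) / 2.0          # Step 6
        q = p - beta * u                      # Step 7
        C[k+1:, k+1:] -= np.outer(u, q) + np.outer(q, u)  # Step 8: rank-2 update
        C[k+1, k] = C[k, k+1] = -s * sigma    # surviving subdiagonal entry
        C[k+2:, k] = 0.0
        C[k, k+2:] = 0.0
    return C
```

Since T is similar to A, np.linalg.eigvalsh(householder_tridiagonalize(A)) should agree with np.linalg.eigvalsh(A) for any symmetric A, which makes a convenient correctness check.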

Computational work and data locality
Computational work
– Total computational work: (4/3)N³, split evenly between the matrix-vector multiplications ((2/3)N³) and the rank-2 updates ((2/3)N³).
Data locality
– Data locality is low, since most of the work is done using level-2 BLAS.
– The method cannot attain high performance on processors with hierarchical memory or on SMP machines, due to poor cache utilization.
– Algorithms that can use higher-level BLAS are necessary.
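The level-2 vs. level-3 gap is easy to observe directly. The sketch below times a flop-equivalent number of matrix-vector products against a single matrix-matrix product (sizes are arbitrary, and the exact ratio depends on the machine and the BLAS behind NumPy):

```python
import time
import numpy as np

N = 2000
A = np.random.rand(N, N)
X = np.random.rand(N, N)
x = X[:, 0].copy()

t0 = time.perf_counter()
for _ in range(N // 100):            # 1/100 of the flops of the GEMM below
    y = A @ x                        # level-2: A is re-read from memory each time
t_mv = (time.perf_counter() - t0) * 100   # scale to the same 2*N^3 flop count

t0 = time.perf_counter()
Y = A @ X                            # level-3: one GEMM, cache-blocked by the BLAS
t_mm = time.perf_counter() - t0

print(f"matvec-equivalent: {t_mv:.2f} s   GEMM: {t_mm:.2f} s")
```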

3. BLAS-3 based tridiagonalization algorithms
Dongarra’s algorithm (Dongarra et al., 1992)
– Aggregate the rank-2 updates of M consecutive steps and perform them at once as a rank-2M update:
  C^(K·M) = C^((K–1)·M) – U^(K·M) Q^(K·M)t – Q^(K·M) U^(K·M)t (rank-2M update, level-3 BLAS)
– Cache utilization is improved, since the rank-2 updates are turned into a level-3 BLAS operation.
– Implemented in many software packages, including LAPACK and ScaLAPACK.

Dongarra’s algorithm
[Step 1] Repeat [Step 2]–[Step 4] for K = 1 to N/M.
[Step 2] U^((K–1)·M) = φ, Q^((K–1)·M) = φ (0-by-0 matrices)
[Step 3] Repeat [Step 3-1]–[Step 3-8] for k = (K–1)·M+1 to K·M:
[Step 3-1] Partial Householder transformation: d^(k) := d^(k) – U^(k–1) (Q^(k–1)t)_{k–(K–1)·M} – Q^(k–1) (U^(k–1)t)_{k–(K–1)·M}
[Step 3-2] σ^(k) = sqrt(d^(k)t d^(k))
[Step 3-3] Compute the reflector: u^(k) = (d^(k)_1 – sgn(d^(k)_1)σ^(k), d^(k)_2, …, d^(k)_{N–k})
[Step 3-4] α^(k) = 2 / ∥u^(k)∥²
[Step 3-5] p^(k) = α^(k) (C^(k) – U^(k–1) Q^(k–1)t – Q^(k–1) U^(k–1)t) u^(k)
[Step 3-6] β^(k) = α^(k) u^(k)t p^(k) / 2
[Step 3-7] q^(k) = p^(k) – β^(k) u^(k)
[Step 3-8] U^(k) = [U^(k–1) | u^(k)], Q^(k) = [Q^(k–1) | q^(k)]
[Step 4] Rank-2M update of the trailing matrix (level-3 BLAS): C^(K·M) = C^((K–1)·M) – U^(K·M) Q^(K·M)t – Q^(K·M) U^(K·M)t
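The same scheme in NumPy (a sketch under our own naming and zero-padding convention; LAPACK's dsytrd/dlatrd pair is the production implementation). The delayed update vectors are accumulated in U and Q, each panel column is brought up to date just before it is reduced, and the part of the matrix beyond the panel is updated by a single GEMM:

```python
import numpy as np

def dongarra_tridiagonalize(A, M=8):
    """Blocked tridiagonalization: the rank-2 updates of M consecutive
    Householder steps are aggregated into one rank-2M update (a GEMM).
    Reflectors are zero-padded to the panel height and then discarded."""
    C = np.array(A, dtype=float)
    N = C.shape[0]
    for k0 in range(0, N - 2, M):
        m = min(M, N - 2 - k0)
        nt = N - k0 - 1                          # panel height (rows k0+1:)
        U = np.zeros((nt, m)); Q = np.zeros((nt, m))
        for j in range(m):
            k = k0 + j
            col = C[k:, k].copy()                # Step 3-1: apply the pending
            if j > 0:                            # rank-2j update to column k
                col -= U[j-1:, :j] @ Q[j-1, :j] + Q[j-1:, :j] @ U[j-1, :j]
            C[k, k] = col[0]                     # finalized diagonal entry
            d = col[1:]
            sigma = np.linalg.norm(d)            # Step 3-2
            if sigma == 0.0:
                C[k+1:, k] = 0.0; C[k, k+1:] = 0.0
                continue
            s = 1.0 if d[0] >= 0 else -1.0
            u = d.copy(); u[0] += s * sigma      # Step 3-3 (stable sign choice)
            alpha = 2.0 / (u @ u)                # Step 3-4
            p = alpha * (C[k+1:, k+1:] @ u       # Step 3-5: stale trailing block,
                 - U[j:, :j] @ (Q[j:, :j].T @ u) # corrected on the fly by the
                 - Q[j:, :j] @ (U[j:, :j].T @ u))  # delayed updates
            beta = alpha * (u @ p) / 2.0         # Step 3-6
            q = p - beta * u                     # Step 3-7
            U[j:, j] = u; Q[j:, j] = q           # Step 3-8
            C[k+1, k] = C[k, k+1] = -s * sigma
            C[k+2:, k] = 0.0; C[k, k+2:] = 0.0
        # Step 4: rank-2m update of everything beyond the panel (level-3 BLAS)
        C[k0+m:, k0+m:] -= U[m-1:] @ Q[m-1:].T + Q[m-1:] @ U[m-1:].T
    return C
```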

Characteristics of Dongarra’s algorithm
Data locality
– Half of the total work can be done using level-3 BLAS.
– The other half (the matrix-vector multiplications) still has to be done with level-2 BLAS.
Because of poor cache utilization in the level-2 part, the algorithm usually attains only 10–25% of peak performance on modern microprocessors. Algorithms fully based on level-3 BLAS are therefore desirable.

Bischof’s algorithm
The basic idea (Bischof et al., 1993, 1994): two-step tridiagonalization
– First, transform A (of order N) to a band matrix B of half bandwidth L: (4/3)N³ work.
– Next, transform B to a tridiagonal matrix T by Murata’s algorithm: O(N²L) work.
Advantage of the two-step tridiagonalization
– Reduction to a band matrix can be done using only level-3 BLAS.
– The computational work needed to tridiagonalize the band matrix is O(N²L), which is much smaller than the work for the first step.

Reduction to a band matrix
Reduction by the block Householder transformation H = I – UαU^t
– H is an orthogonal matrix.
– H transforms the topmost block of a block vector to a triangular matrix and clears all the other elements.
– At the K-th step, H is multiplied from the left and from the right: this eliminates the blocks of the pivot block column below the band (and, by symmetry, of the pivot block row) and modifies the trailing submatrix.

Reduction to a band matrix (cont’d)
[Step 1] Repeat [Step 2]–[Step 6] for K = 1 to N/L – 1.
[Step 2] Compute the block Householder transformation I – U^(K) α^(K) U^(K)t that transforms the first block of D^(K) to a triangular matrix and clears all the other elements.
[Step 3] P^(K) = C^(K) U^(K) α^(K) (matrix × block-vector multiplication)
[Step 4] β^(K) = α^(K)t U^(K)t P^(K) / 2
[Step 5] Q^(K) = P^(K) – U^(K) β^(K)
[Step 6] C^(K) = C^(K) – U^(K) Q^(K)t – Q^(K) U^(K)t (rank-2L update)
All of these steps can be done with level-3 BLAS.
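A compact NumPy sketch of this loop (names are ours). For brevity the block Householder transformation is obtained from a QR factorization and applied as an explicit orthogonal factor rather than kept in the I – UαU^t form; either way, the two-sided update of the trailing submatrix is pure GEMM:

```python
import numpy as np

def reduce_to_band(A, L):
    """First stage of Bischof's algorithm: reduce the symmetric matrix A
    to a band matrix of half bandwidth L by block orthogonal transforms."""
    C = np.array(A, dtype=float)
    N = C.shape[0]
    for k0 in range(0, N - L, L):
        lo = k0 + L                                # first row below the band
        D = C[lo:, k0:lo]                          # block column to be eliminated
        Qf, R = np.linalg.qr(D, mode='complete')   # Qf.T @ D = R (triangular top)
        C[lo:, k0:lo] = R                          # topmost block triangular,
        C[k0:lo, lo:] = R.T                        # everything below it zero
        C[lo:, lo:] = Qf.T @ C[lo:, lo:] @ Qf      # trailing update: pure GEMM
    return C
```

After the loop, C is symmetric with half bandwidth L; Murata's algorithm (not shown here) then reduces the band matrix to tridiagonal form in O(N²L) work.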

Characteristics of Bischof’s algorithm
Data locality
– Almost all of the computation can be done with level-3 BLAS.
– Performance can be maximized by tuning the half bandwidth L according to the cache size and the problem size.
Choice of the half bandwidth L
– A larger value of L provides more opportunity for cache utilization.
– However, the computational work for Murata’s algorithm (tridiagonalization of the band matrix) increases proportionally with L.
– The optimal value of L is determined by the trade-off between these two effects.

Wu’s algorithm
The idea (Wu et al., 1996)
– Aggregate the rank-2L updates of M consecutive block steps and perform them at once as a rank-2LM update.
– Cache utilization can be further improved without increasing the computational work for Murata’s algorithm.
– However, the computational work for the reduction to a band matrix increases (slightly) with M.
– The optimal values of L and M are determined by the trade-off between cache utilization and the increase in computational work.
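The identity behind this aggregation is easy to check with random factors (a toy verification only; Wu's full algorithm must in addition correct the block panels computed between the delayed updates, just as Dongarra's algorithm does at the vector level):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, M = 12, 2, 3
C = rng.standard_normal((n, n)); C = C + C.T    # random symmetric matrix

Us = [rng.standard_normal((n, L)) for _ in range(M)]
Qs = [rng.standard_normal((n, L)) for _ in range(M)]

C1 = C.copy()
for U, Q in zip(Us, Qs):                # M separate rank-2L updates
    C1 -= U @ Q.T + Q @ U.T

Ub, Qb = np.hstack(Us), np.hstack(Qs)   # concatenated n-by-LM factors
C2 = C - (Ub @ Qb.T + Qb @ Ub.T)        # one rank-2LM update: a single large GEMM
print(np.allclose(C1, C2))              # True
```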

4. Performance Evaluation
Target problems
– Tridiagonalization of Frank matrices of order 480 and larger.
– The performance of Dongarra’s, Bischof’s and Wu’s algorithms is compared.
Computational environments
– Xeon (2.0GHz), Alpha 21264A (750MHz), UltraSPARC III (750MHz)
– Opteron (1.8GHz) SMP: 1PE to 4PE
– Power5 (1.9GHz) SMP: 1PE to 16PE
Details of the experiments
– The MFLOPS value of each algorithm was computed by dividing (4/3)N³ by the time needed to tridiagonalize the input matrix.
– L and M are chosen to maximize the performance in each case.
– ATLAS 3.6.0, GotoBLAS or ESSL is used as the BLAS library.
– Parallelization is done only within the BLAS.
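The measurement methodology can be sketched as follows (helper names are ours; for the test matrix we assume the standard symmetric Frank matrix, a_ij = N – max(i, j) with 0-based indices):

```python
import time
import numpy as np

def frank_matrix(N):
    """Symmetric Frank test matrix: a_ij = N - max(i, j), 0-based indices."""
    i = np.arange(N)
    return (N - np.maximum.outer(i, i)).astype(float)

def tridiag_mflops(tridiag_fn, N):
    """MFLOPS as defined in the slides: (4/3)N^3 divided by the wall-clock
    time that tridiag_fn takes to tridiagonalize an order-N Frank matrix."""
    A = frank_matrix(N)
    t0 = time.perf_counter()
    tridiag_fn(A)
    elapsed = time.perf_counter() - t0
    return (4.0 / 3.0) * N**3 / elapsed / 1e6
```

For example, tridiag_mflops(householder_tridiagonalize, 480) measures the unblocked sketch from Section 2 at the smallest matrix order used in the talk.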

Performance on the Xeon (2.0GHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size for the three algorithms; tuned parameters L=24, M=4; L=48; M=32]
Wu’s algorithm attains nearly twice the performance of Dongarra’s algorithm.

Performance on the Alpha 21264A (750MHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size for the three algorithms; tuned parameters L=24, M=4; L=48; M=32]
Wu’s algorithm is twice as fast as Dongarra’s algorithm and attains more than 50% of the peak performance at the largest matrix sizes.

Performance on the UltraSPARC III (750MHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size for the three algorithms; tuned parameters L=24, M=2; L=48; M=32]
Wu’s algorithm attains more than twice the performance of Dongarra’s algorithm.

Performance on the Opteron (1.8GHz) (with GotoBLAS)
[Plot: performance (MFLOPS) vs. matrix size for the three algorithms; tuned parameters L=48, M=2; L=48; M=32]
Wu’s algorithm is more than twice as fast as Dongarra’s algorithm and attains 76% of the peak performance at the largest matrix sizes.

Performance on the Power5 (1.9GHz) (with ESSL)
[Plot: performance (GFLOPS) vs. matrix size for the three algorithms; tuned parameters L=48, M=2; L=96; M=64]
Wu’s algorithm is more than twice as fast as Dongarra’s algorithm and attains 63% of the peak performance at the largest matrix sizes.

Parallel performance on the Opteron (1.8GHz) SMP (Wu’s algorithm)
[Plot: performance (GFLOPS) vs. number of processors, for the reduction to a band matrix only; tridiagonalization of the band matrix is not included.]
Wu’s algorithm attains a 3.5-fold speedup on 4 PEs at the largest matrix size.

Parallel performance on the Power5 (1.9GHz) SMP (Wu’s algorithm)
[Plot: performance (GFLOPS) vs. number of processors, for the reduction to a band matrix only (tridiagonalization of the band matrix is not included); curve parameters L=48, L’=2; L=24, L’=4; L=24, L’=2; L=12, L’=2]
Wu’s algorithm attains a 10-fold speedup on 16 PEs when N = 7680.

5. Conclusion
Summary of this study
– Among the three algorithms for tridiagonalization, Wu’s algorithm is the fastest on modern microprocessors; on average, it is two times faster than Dongarra’s algorithm.
– For matrices of order larger than 3840, Wu’s algorithm attains more than 50% of the peak performance on the Alpha 21264A, Opteron and Power5 processors.
– For matrices of order larger than 7680, reduction to a band matrix by Wu’s algorithm attains a 10-fold speedup on the Power5 SMP with 16 processors.
– Numerical experiments show that the errors in the eigenvalues computed by Bischof’s or Wu’s algorithm are comparable with, or only a few times larger than, those for Dongarra’s algorithm.

Future work
– Evaluation of accuracy using various test matrices
– Performance evaluation of the total eigensolver, including the eigenvector computation part
– Evaluation of the parallel performance of Murata’s algorithm
– Automatic optimization of the parameters L and M