Performance of BLAS-3 Based Tridiagonalization Algorithms on Modern SMP Machines
Yusaku Yamamoto
Dept. of Computational Science & Engineering, Nagoya University
Outline of the talk
1. Introduction
2. Tridiagonalization by the Householder method
3. BLAS-3 based tridiagonalization algorithms
4. Performance evaluation
5. Conclusion
1. Introduction
The problem treated in this study
–Standard eigenvalue problem Ax = λx, where A is a real symmetric dense matrix.
Objective
–To develop a parallel eigensolver that can solve huge-scale eigenproblems efficiently on PCs or workstations.
–Efficient use of the cache memory is the key to high performance, so algorithms with high locality of data reference are necessary.
Applications
–Molecular orbital methods: solution of dense eigenproblems of order more than 100,000 is required to compute the electronic structure of large protein molecules.
–Principal component analysis
Flow chart of the eigenvalue/eigenvector computation of a symmetric matrix
[Flow chart: real symmetric A → tridiagonal T → eigenvalues of T → eigenpairs of A: {λ_i}, {x_i}]
–Tridiagonalization: Q^t A Q = T (Q: orthogonal), by the Householder method. We focus on this step here.
–Computation of eigenvalues: solve |T – λ_i I| = 0 for λ_i, by the QR method, the divide & conquer method, bisection & inverse iteration, or the MR^3 algorithm.
–Computation of eigenvectors: solve T y_i = λ_i y_i for y_i.
–Back transformation (inverse transformation): x_i = Q y_i gives the eigenpairs of A.
Computational work and data locality of each part

  Part                                           Computational work   Data locality
  Tridiagonalization (Householder method)        (4/3)N^3             Low
  Computation of eigenvalues of T (or A): {λ_i}  O(N^2)               High

An algorithm with higher data locality is needed in the tridiagonalization part.
Objective of this study
Evaluate the performance of the following three algorithms, which are variants of the Householder method optimized to enhance cache utilization:
(1) Dongarra's algorithm
(2) Bischof's algorithm
(3) Wu's algorithm
2. Tridiagonalization by the Householder method
The basic idea
–Reduction by the Householder transformation H = I – αuu^t (α = 2/∥u∥^2):
 H is a symmetric orthogonal matrix.
 H eliminates all but the first element of a vector.
[Figure: computation at the k-th step – multiplying H from the left eliminates the elements below the first subdiagonal in column k; multiplying H from the right modifies the corresponding elements of the trailing rows and columns.]
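As a quick check of this construction, the following NumPy snippet (a minimal sketch with made-up example data; NumPy is used for all code sketches in this document) verifies that H is symmetric and orthogonal and that H eliminates all but the first element of d:

```python
import numpy as np

d = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # example vector, d[0] > 0
sigma = np.sqrt(d @ d)                     # sigma = ||d||
u = d.copy()
u[0] -= np.sign(d[0]) * sigma              # reflector vector u
alpha = 2.0 / (u @ u)
H = np.eye(len(d)) - alpha * np.outer(u, u)

print(np.allclose(H, H.T))                 # True: H is symmetric
print(np.allclose(H @ H, np.eye(len(d))))  # True: H is orthogonal
print(H @ d)                               # ~ [sigma, 0, 0, 0, 0]
```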
Algorithm of the Householder method
[Step 1] Repeat [Step 2] – [Step 8] for k = 1 to N – 2.
[Step 2] σ^(k) = sqrt(d^(k)t d^(k))
[Step 3] Compute the reflector vector: u^(k) = (d^(k)_1 – sgn(d^(k)_1) σ^(k), d^(k)_2, …, d^(k)_{N–k})
[Step 4] α^(k) = 2 / ∥u^(k)∥^2
[Step 5] p^(k) = α^(k) C^(k) u^(k)  (matrix-vector multiplication)
[Step 6] β^(k) = α^(k) u^(k)t p^(k) / 2
[Step 7] q^(k) = p^(k) – β^(k) u^(k)
[Step 8] C^(k) = C^(k) – u^(k) q^(k)t – q^(k) u^(k)t  (rank-2 update)
Here d^(k) is the pivot column below the diagonal at step k and C^(k) is the trailing submatrix that remains to be reduced.
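Collecting the eight steps, here is a minimal NumPy sketch of the method in the slide's notation (illustrative only, not the speaker's implementation). [Step 5] is a level-2 matrix-vector product and [Step 8] a level-2 rank-2 update, which is exactly the locality problem analyzed on the next slide:

```python
import numpy as np

def householder_tridiagonalize(A):
    # BLAS-2 Householder tridiagonalization of a real symmetric matrix
    # (0-based indexing; slide notation d, sigma, u, alpha, p, beta, q).
    C = np.array(A, dtype=float, copy=True)
    N = C.shape[0]
    for k in range(N - 2):
        d = C[k + 1:, k].copy()              # pivot column below the diagonal
        sigma = np.sqrt(d @ d)               # [Step 2]
        sgn = 1.0 if d[0] >= 0 else -1.0
        u = d.copy()
        u[0] -= sgn * sigma                  # [Step 3] reflector vector
        nu = u @ u
        if nu == 0.0:                        # column already in tridiagonal form
            continue
        alpha = 2.0 / nu                     # [Step 4]
        p = alpha * (C[k + 1:, k + 1:] @ u)  # [Step 5] level-2 gemv
        beta = alpha * (u @ p) / 2.0         # [Step 6]
        q = p - beta * u                     # [Step 7]
        C[k + 1:, k + 1:] -= np.outer(u, q) + np.outer(q, u)  # [Step 8] rank-2
        C[k + 1, k] = C[k, k + 1] = sgn * sigma               # eliminated column
        C[k + 2:, k] = C[k, k + 2:] = 0.0
    return C
```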
Computational work and data locality
Computational work
–Total computational work: (4/3)N^3
 Matrix-vector multiplication: (2/3)N^3
 Rank-2 update: (2/3)N^3
Data locality
–Data locality is low, since most of the work is done using level-2 BLAS.
–The method cannot attain high performance on processors with hierarchical memory or on SMP machines, due to poor cache utilization.
–Algorithms that can use higher-level BLAS are necessary.
3. BLAS-3 based tridiagonalization algorithms
Dongarra's algorithm (Dongarra et al., 1992)
–Aggregate the rank-2 updates for M consecutive steps and perform them at once as a rank-2M update:
 C^(K·M) = C^((K–1)·M) – U^(K·M) Q^(K·M)t – Q^(K·M) U^(K·M)t  (rank-2M update, level-3 BLAS)
–Cache utilization is improved, since the rank-2 updates are turned into a level-3 BLAS operation.
–Implemented in many software packages, including LAPACK and ScaLAPACK.
Dongarra's algorithm
[Step 1] Repeat [Step 2] – [Step 4] for K = 1 to N/M.
[Step 2] U^((K–1)M) = φ, Q^((K–1)M) = φ  (0-by-0 matrices)
[Step 3] Repeat [Step 3-1] – [Step 3-8] for k = (K–1)M + 1 to KM.
[Step 3-1] Partial Householder transformation: d^(k) := d^(k) – U^(k–1) (Q^(k–1)t)_{k–(K–1)M} – Q^(k–1) (U^(k–1)t)_{k–(K–1)M}
[Step 3-2] σ^(k) = sqrt(d^(k)t d^(k))
[Step 3-3] Compute the reflector vector: u^(k) = (d^(k)_1 – sgn(d^(k)_1) σ^(k), d^(k)_2, …, d^(k)_{N–k})
[Step 3-4] α^(k) = 2 / ∥u^(k)∥^2
[Step 3-5] p^(k) = α^(k) (C^(k) – U^(k–1) Q^(k–1)t – Q^(k–1) U^(k–1)t) u^(k)
[Step 3-6] β^(k) = α^(k) u^(k)t p^(k) / 2
[Step 3-7] q^(k) = p^(k) – β^(k) u^(k)
[Step 3-8] U^(k) = [U^(k–1) | u^(k)], Q^(k) = [Q^(k–1) | q^(k)]
[Step 4] Rank-2M update of the matrix (level-3 BLAS): C^(KM) = C^((K–1)M) – U^(KM) Q^(KM)t – Q^(KM) U^(KM)t
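The following NumPy sketch is a hedged, unoptimized rendering of these steps with 0-based indexing (not LAPACK's tuned dsytrd). U and Q collect the reflector pairs, each padded to the block's row range; the trailing matrix is touched only by the matrix-vector product of [Step 3-5] and the final rank-2M update:

```python
import numpy as np

def dongarra_tridiagonalize(A, M=32):
    C = np.array(A, dtype=float, copy=True)
    N = C.shape[0]
    for k0 in range(0, N - 2, M):
        m = min(M, N - 2 - k0)
        n = N - k0 - 1                              # rows k0+1 .. N-1
        U = np.zeros((n, 0)); Q = np.zeros((n, 0))  # [Step 2]
        s = np.zeros(m)                             # new subdiagonal entries
        for j in range(m):                          # [Step 3]
            kk = k0 + j
            d = C[kk + 1:, kk].copy()
            if j > 0:                               # [Step 3-1] bring this column up to date
                d -= U[j:] @ Q[j - 1] + Q[j:] @ U[j - 1]
            sigma = np.sqrt(d @ d)                  # [Step 3-2]
            sgn = 1.0 if d[0] >= 0 else -1.0
            u = d.copy(); u[0] -= sgn * sigma       # [Step 3-3]
            s[j] = sgn * sigma
            nu = u @ u
            if nu == 0.0:                           # column already reduced
                u = np.zeros(n - j); q = np.zeros(n - j)
            else:
                alpha = 2.0 / nu                    # [Step 3-4]
                # [Step 3-5] gemv against the stale C, corrected by pending low-rank terms
                p = alpha * (C[kk + 1:, kk + 1:] @ u
                             - U[j:] @ (Q[j:].T @ u) - Q[j:] @ (U[j:].T @ u))
                beta = alpha * (u @ p) / 2.0        # [Step 3-6]
                q = p - beta * u                    # [Step 3-7]
            # [Step 3-8] append, zero-padded to the block's row range
            U = np.column_stack([U, np.concatenate([np.zeros(j), u])])
            Q = np.column_stack([Q, np.concatenate([np.zeros(j), q])])
        # [Step 4] rank-2M update of the trailing matrix: level-3 BLAS (gemm)
        C[k0 + 1:, k0 + 1:] -= U @ Q.T + Q @ U.T
        for j in range(m):                          # write the eliminated columns
            kk = k0 + j
            C[kk + 1, kk] = C[kk, kk + 1] = s[j]
            C[kk + 2:, kk] = C[kk, kk + 2:] = 0.0
    return C
```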
Characteristics of Dongarra's algorithm
Data locality
–Half of the total work can be done using level-3 BLAS.
–The other half (the matrix-vector multiplication) still has to be done with level-2 BLAS.
–As a result, the algorithm usually attains only 10–25% of the peak performance on modern microprocessors, due to poor cache utilization in the level-2 part.
Algorithms fully based on level-3 BLAS are therefore desirable.
Bischof's algorithm
The basic idea (Bischof et al., 1993, 1994)
–First, transform A to a band matrix B (of half bandwidth L).
–Next, transform B to a tridiagonal matrix T (Murata's algorithm).
Advantage of the two-step tridiagonalization
–Reduction to a band matrix can be done using only level-3 BLAS.
–The computational work needed to tridiagonalize the band matrix is O(N^2 L), which is much smaller than the work for the first stage.
[Figure: A (order N) → B (half bandwidth L) → T; the first step costs (4/3)N^3 and the second (Murata's algorithm) O(N^2 L).]
Reduction to a band matrix
Reduction by the block Householder transformation H = I – UαU^t:
–H is an orthogonal matrix.
–H transforms the topmost block of a block vector to a triangular matrix and clears all the other elements.
[Figure: computation at the K-th step – multiplying H from the left eliminates the blocks below the band in the current block column; multiplying H from the right modifies the corresponding trailing blocks.]
Reduction to a band matrix (cont'd)
[Step 1] Repeat [Step 2] – [Step 6] for K = 1 to N/L – 1.
[Step 2] Compute the block Householder transformation I – U^(K) α^(K) U^(K)t that transforms the first block of D^(K) to a triangular matrix and clears all the other elements.
[Step 3] P^(K) = C^(K) U^(K) α^(K)  (matrix × block-vector multiplication)
[Step 4] β^(K) = α^(K)t U^(K)t P^(K) / 2
[Step 5] Q^(K) = P^(K) – U^(K) β^(K)
[Step 6] C^(K) = C^(K) – U^(K) Q^(K)t – Q^(K) U^(K)t  (rank-2L update)
All of these steps can be done with level-3 BLAS.
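Below is a minimal NumPy sketch of this stage (an illustration of the idea, not the speaker's code; it assumes N is a multiple of L). The helper block_reflector, a name introduced here, accumulates the L elementary reflectors of a panel into the compact-WY form I – UTU^t, which plays the role of I – U^(K) α^(K) U^(K)t in [Step 2]:

```python
import numpy as np

def block_reflector(D):
    # Accumulate L elementary Householder reflectors in compact-WY form
    # (as LAPACK's dlarft does): H = H_1 ... H_L = I - U T U^t is
    # orthogonal, and R = H^t D is triangular in its top L-by-L block
    # and zero below, i.e. the elimination required by [Step 2].
    n, L = D.shape
    R = np.array(D, dtype=float, copy=True)
    U = np.zeros((n, L)); T = np.zeros((L, L))
    for i in range(L):
        d = R[i:, i]
        sigma = np.sqrt(d @ d)
        sgn = 1.0 if d[0] >= 0 else -1.0
        u = np.zeros(n); u[i:] = d; u[i] -= sgn * sigma
        nu = u @ u
        tau = 0.0 if nu == 0.0 else 2.0 / nu   # tau = 0: nothing to eliminate
        R -= tau * np.outer(u, u @ R)          # apply H_i from the left
        T[:i, i] = -tau * (T[:i, :i] @ (U[:, :i].T @ u))
        T[i, i] = tau
        U[:, i] = u
    return U, T, R

def bischof_band_reduce(A, L=48):
    # Reduce symmetric A to half bandwidth L using block (level-3) operations only.
    C = np.array(A, dtype=float, copy=True)
    N = C.shape[0]
    for K in range(N // L - 1):
        r = (K + 1) * L                              # first row below the band
        U, T, R = block_reflector(C[r:, r - L:r])    # [Step 2]
        C[r:, r - L:r] = R; C[r - L:r, r:] = R.T     # write the eliminated panel
        P = C[r:, r:] @ U @ T                        # [Step 3] gemm
        beta = T.T @ (U.T @ P) / 2.0                 # [Step 4]
        Q = P - U @ beta                             # [Step 5]
        C[r:, r:] -= U @ Q.T + Q @ U.T               # [Step 6] rank-2L update
    return C
```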
Characteristics of Bischof's algorithm
Data locality
–Almost all of the computation can be done with level-3 BLAS.
–Performance can be maximized by optimizing the half bandwidth L according to the cache size and the problem size.
Choice of the half bandwidth L
–A larger value of L provides more opportunity for cache utilization.
–However, the computational work for Murata's algorithm (tridiagonalization of the band matrix) increases proportionally with L.
–The optimal value of L is determined by the trade-off between these two effects.
Wu's algorithm
The idea (Wu et al., 1996)
–Aggregate the rank-2L updates for M consecutive steps of the band reduction and perform them at once as a rank-2LM update (see the sketch below).
–Cache utilization can be further improved without increasing the computational work for Murata's algorithm.
–However, the computational work for the reduction to a band matrix increases (slightly) with M.
–The optimal values of L and M are determined by the trade-off between cache utilization and the increase in computational work.
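The fused update itself can be sketched as follows. This is schematic only: it shows just the final rank-2LM application and assumes the M block-reflector pairs were already formed with Dongarra-style corrections for the deferred updates and zero-padded to a common row range (function and argument names are made up for illustration):

```python
import numpy as np

def fused_rank_2LM_update(C, U_blocks, Q_blocks):
    # U_blocks and Q_blocks hold the M pairs (U^(K), Q^(K)), each of
    # width L and zero-padded to the same number of rows n. Instead of
    # M separate rank-2L updates, one rank-2LM update is performed:
    # two large gemms instead of 2M smaller ones.
    U = np.column_stack(U_blocks)   # n x LM
    Q = np.column_stack(Q_blocks)   # n x LM
    C -= U @ Q.T + Q @ U.T
    return C
```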
4. Performance evaluation
Target problems
–Tridiagonalization of Frank matrices of order 480 to 11520.
–The performance of Dongarra's, Bischof's and Wu's algorithms is compared.
Computational environments
–Xeon (2.0 GHz)
–Alpha 21264A (750 MHz)
–UltraSPARC III (750 MHz)
–Opteron (1.8 GHz) SMP, 1 to 4 PEs
–Power5 (1.9 GHz) SMP, 1 to 16 PEs
Details of the experiments
–The MFLOPS value of each algorithm was computed by dividing (4/3)N^3 by the time needed to tridiagonalize the input matrix (see the snippet below).
–L and M are chosen to maximize the performance in each case.
–ATLAS 3.6.0, GotoBLAS or ESSL is used as the BLAS library.
–Parallelization is done only within the BLAS.
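For reference, this metric is just the nominal operation count divided by the wall-clock time; a trivial helper (names made up here) would be:

```python
def tridiag_mflops(N, seconds):
    # MFLOPS = nominal flop count (4/3)N^3 of tridiagonalization,
    # divided by the measured wall-clock time in seconds.
    return (4.0 / 3.0) * N ** 3 / seconds / 1.0e6

# e.g. tridiag_mflops(3840, 60.0) for a hypothetical 60-second run
```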
Performance on the Xeon (2.0 GHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size for the three algorithms; parameter annotations: L=24, M=4 and L=48, M=32.]
Wu's algorithm attains nearly twice the performance of Dongarra's algorithm.
Performance on the Alpha 21264A (750 MHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size; parameter annotations: L=24, M=4 and L=48, M=32.]
Wu's algorithm is twice as fast as Dongarra's algorithm and attains more than 50% of the peak performance when N = 3840.
Performance on the UltraSPARC III (750 MHz) (with ATLAS 3.6.0)
[Plot: performance (MFLOPS) vs. matrix size; parameter annotations: L=24, M=2 and L=48, M=32.]
Wu's algorithm attains more than twice the performance of Dongarra's algorithm.
Performance on the Opteron (1.8 GHz) (with GotoBLAS)
[Plot: performance (MFLOPS) vs. matrix size; parameter annotations: L=48, M=2 and L=48, M=32; the 76%-of-peak level is marked.]
Wu's algorithm is more than twice as fast as Dongarra's algorithm and attains 76% of the peak performance when N = 11520.
Performance on the Power5 (1.9 GHz) (with ESSL)
[Plot: performance (GFLOPS) vs. matrix size; parameter annotations: L=48, M=2 and L=96, M=64; the 63%-of-peak level is marked.]
Wu's algorithm is more than twice as fast as Dongarra's algorithm and attains 63% of the peak performance when N = 11520.
Parallel performance on the Opteron (1.8 GHz) SMP (Wu's algorithm)
[Plot: performance (GFLOPS) vs. number of processors, for the reduction to a band matrix only (tridiagonalization of the band matrix not included).]
Wu's algorithm attains a 3.5-times speedup on 4 PEs when N = 11520.
Parallel performance on the Power5 (1.9 GHz) SMP (Wu's algorithm)
[Plot: performance (GFLOPS) vs. number of processors, for the reduction to a band matrix only (tridiagonalization of the band matrix not included); parameter annotations: L=48, L'=2; L=24, L'=4; L=24, L'=2; L=12, L'=2.]
Wu's algorithm attains a 10-times speedup on 16 PEs when N = 7680.
5. Conclusion
Summary of this study
–Among the three algorithms for tridiagonalization, Wu's algorithm is the fastest on modern microprocessors.
–On average, it is two times faster than Dongarra's algorithm.
–For matrices of order larger than 3840, Wu's algorithm attains more than 50% of the peak performance on the Alpha 21264A, Opteron and Power5 processors.
–For matrices of order larger than 7680, the reduction to a band matrix with Wu's algorithm attains a 10-times speedup on a Power5 SMP with 16 processors.
–Numerical experiments show that the errors in the eigenvalues computed with Bischof's or Wu's algorithm are comparable with, or only a few times larger than, those for Dongarra's algorithm.
Future work
–Evaluation of accuracy using various test matrices
–Performance evaluation of the total eigensolver, including the eigenvector computation part
–Evaluation of the parallel performance of Murata's algorithm
–Automatic optimization of the parameters L and M