Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD
September 7, 2007, PaCT-2007, Pereslavl-Zalessky
Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama, Masami Takata, Kinji Kimura, Masashi Iwasaki and Yoshimasa Nakamura
Outline
– Introduction
– The CSX600 floating-point accelerator
– Optimization of the rectangular SVD algorithm for the CSX600
– Performance evaluation
– Conclusion
Introduction
Singular value decomposition of rectangular matrices
Applications:
– Image processing
– Electronic structure calculation (Filter Diagonalization Method)
– Information retrieval (Latent Semantic Indexing)
– Statistical computations (PCA, ICA and least squares)
The decomposition: A = U Σ V^T, where A is an m×n dense matrix, U is an m×n matrix with orthonormal columns, Σ is an n×n diagonal matrix, V is an n×n orthogonal matrix, and m >> n. Example: m = 10^5, n = 5000.
Floating-point accelerators
– ClearSpeed CSX600: 1+96 processor cores, 48 GFLOPS (double precision)
– Cell: 1+8 processor cores, 256 GFLOPS (single precision)
– GRAPE-DR: 512 processor cores, 512 GFLOPS (single precision), 256 GFLOPS (double precision)
These chips reach very high GFLOPS values thanks to their large number of cores, but sustained performance is limited by relatively low memory bandwidth.
Use of the level-3 BLAS (matrix multiplication)
For the matrix multiplication C := C + AB, the amount of data is O(N^2) while the computational work is O(N^3), so the data traffic is only O(1/N) of the work; by using the cache memory effectively, the effect of low memory bandwidth can be mitigated. For the matrix-vector multiplication y := y + Ax, by contrast, both the amount of data and the computational work are O(N^2). We can therefore exploit the potential performance of the CSX600 by reorganizing the algorithm to use matrix multiplications efficiently.
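The contrast can be made concrete with a single BLAS-3 call. The sketch below is ours, not from the talk: a minimal C wrapper (hypothetical name `update`) around the standard CBLAS DGEMM interface, computing the accumulating product C := C + AB that the CSXL library offloads to the board.

```c
/* Minimal sketch (ours): C := C + A*B via BLAS-3 DGEMM.
 * A is m-by-k, B is k-by-n, C is m-by-n, all column-major. */
#include <cblas.h>

void update(int m, int n, int k,
            const double *a, const double *b, double *c)
{
    /* alpha = 1, beta = 1 gives the accumulating form C := C + A*B.
     * O(mnk) flops over O(mk + kn + mn) data, so the data traffic
     * per flop shrinks as the matrix sizes grow. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, m, b, k, 1.0, c, m);
}
```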
Objective of this study
– Accelerate the SVD of rectangular matrices using the CSX600 processor. To exploit the potential of the CSX600, we reorganize the existing algorithm so that matrix multiplications can be used efficiently.
– Evaluate the performance and clarify the technical problems that must be solved to improve it further.
The CSX600 floating-point accelerator
Architecture and performance of the CSX600
The CSX600 chip:
– One main processor and 96 floating-point processors (64-bit, 2 flops/cycle, each with a 128-byte register file and 6 KB SRAM)
– Operates at 250 MHz
– Peak performance: 48 GFLOPS
ClearSpeed Advance board:
– Two CSX600 processors and 1 GB DRAM
– Connected to a host PC via the PCI-X bus
– Peak performance: 96 GFLOPS
Software environments for the CSX600
– Software Development Kit: compiler (parallel programming with the Cn language), debugger, and simulator
– CSXL library: Basic Linear Algebra Subprograms (BLAS) for the ClearSpeed Advance board. The library transfers the input data from the main memory to the board, performs the computation, and returns the result to the main memory. Sustained performance: 50 GFLOPS with DGEMM (dense matrix-matrix multiplication). We use this library in this study.
– CSFFT library
Performance of the CSXL DGEMM (C := C + A×B, with A: m×k, B: k×n)
[Plots: performance (MFLOPS) as a function of n for m = k = 450 and 1000, and as a function of m = n for k = 450 and 1000, with sizes up to 6000.]
At least two of the three size parameters (m, n and k) must be large to obtain considerable performance.
Optimization of the rectangular SVD algorithm for the CSX600
Algorithm for rectangular SVD (A: m×n, Q: m×n, R and B: n×n):
1. QR decomposition: A = QR
2. Bidiagonalization: R = U1 B V1^T
3. SVD of the bidiagonal matrix: B = U2 Σ V2^T
4. Inverse transformation: R = U' Σ V^T, where V = V1 V2 and U' = U1 U2
5. Multiplication by Q: A = U Σ V^T, where U = QU'
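For orientation, the same five steps can be written down with standard LAPACK building blocks. The following is a host-only baseline sketch under our assumptions (LAPACKE C interface, no error handling or workspace tuning; the function name `tall_svd` is hypothetical). It is not the authors' optimized code, which differs exactly where the next slides say it does: steps 1 and 5 are reorganized for CSXL matrix multiplication, and step 3 uses the Integrable SVD instead of a LAPACK solver.

```c
/* Host-only reference sketch (ours) of the five steps above.
 * Column-major. On exit: a holds U (m-by-n), s the singular
 * values, vt holds V^T (n-by-n). */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

int tall_svd(int m, int n, double *a, double *s, double *vt)
{
    double *tau  = malloc(n * sizeof *tau);
    double *r    = calloc((size_t)n * n, sizeof *r);
    double *e    = malloc((n - 1) * sizeof *e);
    double *tauq = malloc(n * sizeof *tauq);
    double *taup = malloc(n * sizeof *taup);
    double *u1   = malloc((size_t)n * n * sizeof *u1);
    double *c    = calloc((size_t)m * n, sizeof *c);
    int i, j, info;

    /* Step 1: QR decomposition A = QR. */
    info = LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, a, m, tau);
    if (info) return info;
    for (j = 0; j < n; j++)          /* copy out the triangle R */
        for (i = 0; i <= j; i++)
            r[i + j * n] = a[i + j * m];

    /* Step 2: bidiagonalization R = U1 B V1^T. */
    info = LAPACKE_dgebrd(LAPACK_COL_MAJOR, n, n, r, n, s, e, tauq, taup);
    if (info) return info;

    /* Steps 3 + 4: form U1 and V1^T explicitly, then let dbdsqr
     * compute B = U2 S V2^T while accumulating U1*U2 into u1 and
     * V2^T*V1^T = (V1 V2)^T into vt (the inverse transformation). */
    memcpy(u1, r, (size_t)n * n * sizeof *u1);
    info = LAPACKE_dorgbr(LAPACK_COL_MAJOR, 'Q', n, n, n, u1, n, tauq);
    if (info) return info;
    memcpy(vt, r, (size_t)n * n * sizeof *vt);
    info = LAPACKE_dorgbr(LAPACK_COL_MAJOR, 'P', n, n, n, vt, n, taup);
    if (info) return info;
    info = LAPACKE_dbdsqr(LAPACK_COL_MAJOR, 'U', n, n, n, 0,
                          s, e, vt, n, u1, n, NULL, 1); /* c unused */
    if (info) return info;

    /* Step 5: multiplication by Q, U = Q * (U1 U2): stack U1*U2 on
     * top of zeros and apply the Householder factors stored in a. */
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            c[i + j * m] = u1[i + j * n];
    info = LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'N', m, n, n,
                          a, m, tau, c, m);
    if (info) return info;
    memcpy(a, c, (size_t)m * n * sizeof *a);

    free(tau); free(r); free(e); free(tauq); free(taup);
    free(u1); free(c);
    return 0;
}
```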
Computational work of each part, when m >> n (e.g., m = 100000, n = 5000):
– QR decomposition A = QR: 2mn^2
– Bidiagonalization R = U1 B V1^T: (8/3)n^3
– SVD of the bidiagonal matrix B = U2 Σ V2^T: O(n^2) to O(n^3)
– Inverse transformation V = V1 V2: 2n^3; U' = U1 U2: 4n^3
– Multiplication by Q, U = QU': 4mn^2
The QR decomposition and the multiplication by Q, the only terms proportional to m, account for most of the computational work.
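To see why, plug the example sizes into the listed costs (the arithmetic below is ours, not from the slide):

```latex
\begin{align*}
2mn^2 &= 2\cdot 10^{5}\cdot(5\times 10^{3})^{2} = 5.0\times 10^{12}, &
4mn^2 &= 1.0\times 10^{13},\\
\tfrac{8}{3}n^3 &\approx 3.3\times 10^{11}, &
2n^3 + 4n^3 &= 7.5\times 10^{11}.
\end{align*}
```

The two m-dependent terms thus contribute about 1.5×10^13 of the roughly 1.6×10^13 total flops, i.e., more than 90%.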
Optimization of each part
Parts accelerated with the CSX600 (reorganize the algorithms to use matrix multiplications, then accelerate the multiplications with the CSXL BLAS):
– QR decomposition: A = QR
– Multiplication by Q: U = QU'
Parts executed on the host only:
– Bidiagonalization: R = U1 B V1^T (LAPACK DGEBRD)
– SVD of the bidiagonal matrix: B = U2 Σ V2^T (Integrable SVD)
– Inverse transformation: R = U' Σ V^T, V = V1 V2, U' = U1 U2 (LAPACK DORMBR)
QR decomposition of A: upper triangularization by Householder transformations
A → A^(1) → A^(2) → ... → A^(n) = R, where H_n ... H_2 H_1 A = A^(n), so that A = H_1 H_2 ... H_n A^(n) = QR.
Here H_1 A = (I - t_1 y_1 y_1^T) A = A^(1); each such transformation is a level-2 BLAS operation, so the CSXL cannot be used.
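The level-2 character is visible when one transformation is written out: applying I - t_1 y_1 y_1^T to A is a matrix-vector product followed by a rank-1 update. The sketch below is ours (hypothetical wrapper name, standard CBLAS calls, column-major layout):

```c
/* One Householder update A := (I - tau*y*y^T) * A using level-2
 * BLAS. A is m-by-n with leading dimension lda, y has length m,
 * w is scratch of length n. Sketch for illustration (ours). */
#include <cblas.h>

void apply_householder(int m, int n, double tau, const double *y,
                       double *a, int lda, double *w)
{
    /* w := A^T y  (DGEMV: streams all of A once, O(mn) data for
     * O(mn) flops, hence memory-bandwidth bound). */
    cblas_dgemv(CblasColMajor, CblasTrans, m, n, 1.0, a, lda,
                y, 1, 0.0, w, 1);
    /* A := A - tau * y * w^T  (DGER: rank-1 update, streams A again). */
    cblas_dger(CblasColMajor, m, n, -tau, y, 1, w, 1, a, lda);
}
```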
Blocking technique: aggregating the Householder transformations
H_n ... H_2 H_1 = (I - t_n y_n y_n^T) ... (I - t_2 y_2 y_2^T)(I - t_1 y_1 y_1^T) = I - Y_n T_n Y_n^T,
where Y_n = [y_1 | y_2 | ... | y_n] is an m×n matrix and T_n is an n×n lower triangular matrix. Multiple Householder transformations can thus be aggregated and carried out by matrix multiplications, which can be accelerated with the CSXL.
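Concretely, applying the aggregated transformation I - Y T Y^T to an m×n matrix A takes three level-3 BLAS calls, so nearly all flops land in DGEMM, which is what the CSXL accelerates. The sketch below is our illustration (hypothetical wrapper name; T taken as lower triangular, following the slide):

```c
/* Blocked update A := (I - Y*T*Y^T) * A using level-3 BLAS.
 * Y is m-by-L, T is L-by-L lower triangular, A is m-by-n,
 * W is L-by-n scratch; column-major. Sketch (ours). */
#include <cblas.h>

void apply_block_reflector(int m, int n, int L, const double *y,
                           const double *t, double *a, double *w)
{
    /* W := Y^T A      (DGEMM, (L x m) times (m x n)) */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                L, n, m, 1.0, y, m, a, m, 0.0, w, L);
    /* W := T W        (DTRMM with the triangular factor T) */
    cblas_dtrmm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, L, n, 1.0, t, L, w, L);
    /* A := A - Y W    (DGEMM, the large (m x L) times (L x n) product) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, L, -1.0, y, m, w, L, 1.0, a, m);
}
```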
Blocking strategies for QR decomposition: comparison of three strategies (L: blocking size, 1 <= L <= n/2)
– Non-blocked: level-2 work 2mn^2; no level-3 work.
– 1. Block QR: level-2 work 2mnL; level-3 work 2mn(n - L); size of matrix multiplication: L (QR decomposition) and L (multiplication by Q).
– 2. Recursive QR: all level-3 work, 3mn^2; size of matrix multiplication: n/2 (QR decomposition) and n (multiplication by Q).
– 3. Extended recursive QR: level-3 work 2mn^2 + mnL; size of matrix multiplication: L.
Block QR requires the smallest amount of work, but some of the work is done with the level-2 BLAS and the matrix multiplications are rather small. Recursive QR requires the largest amount of work, but all of it is in the level-3 BLAS and the matrix multiplications are large.
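For a numerical feel, take the largest experimental case below, m = 100000 and n = 4000, with L = 450 (our choice of L, matching the DGEMM measurements earlier; the arithmetic is ours, not from the slide):

```latex
\begin{align*}
\text{Block QR:} &\quad 2mnL = 3.6\times 10^{11}\ \text{(level-2)}
  \;+\; 2mn(n-L) \approx 2.8\times 10^{12}\ \text{(level-3)},\\
\text{Recursive QR:} &\quad 3mn^2 = 4.8\times 10^{12}\ \text{(all level-3)},\\
\text{Extended recursive QR:} &\quad 2mn^2 + mnL \approx 3.4\times 10^{12}\ \text{(all level-3)}.
\end{align*}
```

Recursive QR performs the most flops but runs them all as large matrix multiplications, which is why it can still come out ahead on the accelerator.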
Performance evaluation
Numerical experiments
Computational environment: Xeon 3.2 GHz with 8 GB memory, Intel Fortran with -O3 and the Intel Math Kernel Library, and a ClearSpeed Advance board.
Problem: SVD of an m×n matrix whose elements are random numbers in [-0.5, 0.5], with 10000 <= m <= 100000 and 1000 <= n <= 4000.
Experiments:
– Performance comparison of the three QR decomposition algorithms on the ClearSpeed board
– Speedup of the whole SVD with the ClearSpeed board
– Evaluation of accuracy
Performance of three QR decomposition algorithms (m = 100000, n = 4000)
[Plot: computational time (sec) of Block QR, Recursive QR and Extended recursive QR.]
Performance of three QR decomposition algorithms (m = 10000, n = 4000)
[Plot: computational time (sec) of Block QR, Recursive QR and Extended recursive QR.]
Speedup of the whole SVD with the CSX600
[Bar charts: computational time (sec) of LAPACK DGESDD and our code, each with and without the board. Annotated speedups: ×1.2 (DGESDD) and ×1.8 (our code) for m = 10000, n = 1000 (m:n = 10:1); ×1.3 (DGESDD) and ×3.1 (our code) for m = 100000, n = 4000 (m:n = 25:1); our code with the board is about ×4 faster than DGESDD.]
Speedup effect as a function of matrix size (our code with recursive QR)
[Plot: speedup as a function of m and n, where Speedup = (time with the PC only) / (time with the PC + CSX600).]
Evaluation of accuracy
[Plots: residual ||UΣV^T - A||_F and orthogonality of the left singular vectors ||U^T U - I||_F, for n = 1000, 2000, 3000 and varying m.]
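Both metrics are straightforward to compute; as one example, here is our sketch (hypothetical function name, standard CBLAS DSYRK) of the orthogonality check ||U^T U - I||_F:

```c
/* Sketch (ours, not from the talk) of the orthogonality residual
 * ||U^T U - I||_F for an m-by-n U, column-major. */
#include <math.h>
#include <stdlib.h>
#include <cblas.h>

double orthogonality_residual(int m, int n, const double *u)
{
    double *g = calloc((size_t)n * n, sizeof *g);
    double s = 0.0;
    int i, j;

    /* G := U^T U via a symmetric rank-k update; DSYRK writes only
     * the upper triangle of G. */
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                n, m, 1.0, u, m, 0.0, g, n);

    /* Accumulate ||G - I||_F^2 from the upper triangle, counting
     * each off-diagonal entry twice by symmetry. */
    for (j = 0; j < n; j++) {
        for (i = 0; i < j; i++)
            s += 2.0 * g[i + j * n] * g[i + j * n];
        double d = g[j + j * n] - 1.0;
        s += d * d;
    }
    free(g);
    return sqrt(s);
}
```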
Conclusion
Summary and future work
Summary:
– We showed how to accelerate the rectangular SVD algorithm with the CSX600 floating-point accelerator. By modifying the algorithm to use large matrix multiplications, we obtained up to a 4× speedup over the LAPACK code on a 3.2 GHz Xeon.
Future work:
– Further improve the performance by optimizing the bidiagonalization and inverse transformation parts.
– Evaluate the performance on other accelerators such as the GRAPE-DR.
– Apply the approach to other matrix computations.