Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines
Yusaku Yamamoto, Dept. of Computational Science & Engineering, Nagoya University

Outline of the talk
1. Introduction
2. High performance tridiagonalization algorithms
3. Performance comparison of two tridiagonalization algorithms
4. Optimal choice of algorithm and its parameters
5. Conclusion

Introduction
Target problem
– Standard eigenvalue problem Ax = λx, where A is an n by n real symmetric matrix. We assume that A is dense.
Applications
– Molecular orbital methods: solving dense eigenproblems of order more than 100,000 is required to compute the electronic structure of large protein molecules.
– Principal component analysis
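As a concrete illustration of the target problem (not part of the original slides), the following minimal Python sketch builds a random dense real symmetric test matrix and solves Ax = λx with SciPy's LAPACK-based symmetric eigensolver; the size and test data are arbitrary choices.

# Minimal illustration of the target problem A x = lambda x for a dense,
# real symmetric matrix, using SciPy's LAPACK-based symmetric eigensolver.
# The size and the random test matrix are arbitrary choices for this sketch.
import numpy as np
from scipy.linalg import eigh

n = 1000
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0                      # dense real symmetric matrix

w, V = eigh(A)                           # eigenvalues (ascending) and eigenvectors
print(np.linalg.norm(A @ V[:, 0] - w[0] * V[:, 0]))   # residual of the first pair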

Target machines
Symmetric Multi-Processors (SMPs)
Background
– PC servers with 4 to 8 processors are quite common now.
– The number of CPU cores integrated into one chip is rapidly increasing, e.g. the Cell processor (1+8 cores) and the dual-core Xeon.

Flow chart of a typical eigensolver for dense symmetric matrices
Real symmetric matrix A
– Tridiagonalization (Householder method): Q^t A Q = T, with Q orthogonal
Tridiagonal matrix T
– Eigensolution of the tridiagonal matrix (QR method, D & C method, bisection & inverse iteration (IIM), or the MR^3 algorithm): T y_i = λ_i y_i
Eigenvalues of T: {λ_i}, eigenvectors of T: {y_i}
– Back transformation: x_i = Q y_i
Eigenvectors of A: {x_i}
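The three phases of this flow can be sketched with standard SciPy building blocks. This is an illustration only, not the solver discussed in the talk: scipy.linalg.hessenberg stands in for Householder tridiagonalization (the Hessenberg form of a symmetric matrix is tridiagonal), and eigh_tridiagonal plays the role of the tridiagonal eigensolver.

# Sketch of the three-phase flow: tridiagonalize, solve the tridiagonal
# problem, then back-transform the eigenvectors.
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
B = rng.standard_normal((500, 500))
A = (B + B.T) / 2.0

# 1. Tridiagonalization: Q^t A Q = T (Householder reflections inside LAPACK)
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)        # main diagonal and subdiagonal of T

# 2. Eigensolution of the tridiagonal matrix: T y_i = lambda_i y_i
lam, Y = eigh_tridiagonal(d, e)

# 3. Back transformation: x_i = Q y_i
X = Q @ Y
print(np.linalg.norm(A @ X - X * lam))   # small residual expected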

Objective of this study
– Study the performance of tridiagonalization and back transformation algorithms on various SMP machines. The optimal choice of algorithm and its parameters differs greatly depending on the computational environment and the problem size.
– The data obtained will be useful in designing an automatically tuned eigensolver for SMP machines.

Algorithms for tridiagonalization and back transformation
Tridiagonalization by the Householder method
– Reduction by the Householder transformation H = I − α u u^t, where H is a symmetric orthogonal matrix that eliminates all but the first element of a vector.
[Figure: computation at the k-th step; H is multiplied from the left and from the right to eliminate the elements below the subdiagonal of the k-th column.]
Back transformation
– Apply the Householder transformations in reverse order.
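For reference, a straightforward unblocked NumPy sketch of this reduction is shown below. It follows the textbook Householder tridiagonalization, with the matrix-vector product and rank-2 update written out explicitly; it illustrates the structure of the method rather than reproducing a production routine such as LAPACK's dsytrd.

import numpy as np

def householder_tridiagonalize(A):
    """Unblocked Householder tridiagonalization (level-2 BLAS style).
    Returns a tridiagonal matrix with the same eigenvalues as the
    symmetric input A. Illustrative sketch only."""
    A = np.array(A, dtype=float, copy=True)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k+1:, k].copy()
        alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        v = x.copy()
        v[0] -= alpha                       # Householder vector: H x = alpha * e1
        vnorm2 = v @ v
        if vnorm2 == 0.0:
            continue
        beta = 2.0 / vnorm2
        A22 = A[k+1:, k+1:]
        # Matrix-vector product and symmetric rank-2 update (the level-2 kernels)
        p = beta * (A22 @ v)
        w = p - (beta * (v @ p) / 2.0) * v
        A22 -= np.outer(v, w) + np.outer(w, v)
        A[k+1, k] = A[k, k+1] = alpha
        A[k+2:, k] = 0.0
        A[k, k+2:] = 0.0
    return A

Eigenvalues are preserved: np.linalg.eigvalsh(householder_tridiagonalize(A)) agrees with np.linalg.eigvalsh(A) up to rounding error.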

Problems with the conventional Householder method
Characteristics of the algorithm
– Almost all the computational work is done in matrix-vector multiplication and rank-2 updates, both of which are level-2 BLAS.
  Total computational work: (4/3)N^3
  Matrix-vector multiplication: (2/3)N^3
  Rank-2 update: (2/3)N^3
Problems
– Poor data reuse due to the level-2 BLAS: the effect of cache misses and memory access contention among the processors is large.
– Cannot attain high performance on modern SMP machines.

High performance tridiagonalization algorithms (I)
Dongarra's algorithm (Dongarra et al., 1992)
– Defer the application of the Householder transformations until several (M) transformations are ready.
– The rank-2 update can then be replaced by a rank-2M update, enabling more efficient use of the cache memory:
  A^(KM) = A^((K−1)M) − U^(KM) (Q^(KM))^t − Q^(KM) (U^(KM))^t   (rank-2M update, using level-3 BLAS)
– Adopted by LAPACK, ScaLAPACK and other libraries.
Back transformation
– Can be blocked in the same manner to use the level-3 BLAS.
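The aggregated update above maps directly onto two matrix-matrix products. A hedged sketch follows; U and Q stand for the panels accumulated over the M deferred steps, and the names are chosen here purely for illustration.

import numpy as np

def rank_2m_update(A22, U, Q):
    """A22 <- A22 - U Q^t - Q U^t : the rank-2M update of the trailing block,
    done as two level-3 BLAS (GEMM) calls instead of M separate rank-2 updates."""
    A22 -= U @ Q.T
    A22 -= Q @ U.T
    return A22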

Properties of Dongarra's algorithm
Computational work
– Tridiagonalization: (4/3)n^3
– Back transformation: 2n^2 m (m: number of eigenvectors)
Parameters to be determined
– M_T (block size for tridiagonalization)
– M_B (block size for back transformation)
Level-3 aspects
– In the tridiagonalization phase, only half of the total work can be done with the level-3 BLAS.
– The other half is level-2 BLAS, thereby limiting the performance on SMP machines.

High performance tridiagonalization algorithms (II)
Two-step tridiagonalization (Bischof et al., 1993, 1994)
– First, transform A into a band matrix B (of half bandwidth L). This can be done almost entirely with the level-3 BLAS and costs (4/3)n^3 operations.
– Next, transform B into a tridiagonal matrix T using Murata's algorithm. The work is O(n^2 L) and is much smaller than that of the first step.
[Figure: A -> B (transformation into a band matrix, (4/3)n^3) -> T (Murata's algorithm, O(n^2 L)); H is multiplied from the left and from the right to eliminate the elements below the band.]
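A simplified NumPy sketch of the first step (reduction to half bandwidth L) is given below. It uses an explicit QR factorization of each panel in place of the compact WY representation, so it illustrates the idea rather than Bischof's actual implementation; the key point is that the trailing two-sided update consists of matrix-matrix products.

import numpy as np

def reduce_to_band(A, L):
    """Reduce a symmetric matrix to half bandwidth L using block (QR-based)
    orthogonal transformations; the trailing update is GEMM-rich.
    Simplified sketch: explicit Q from numpy's QR instead of compact WY."""
    A = np.array(A, dtype=float, copy=True)
    n = A.shape[0]
    for k in range(0, n - L, L):
        rows = slice(k + L, n)                      # rows below the band
        panel = A[rows, k:k + L]
        Q, _ = np.linalg.qr(panel, mode="complete")
        A[rows, k:k + L] = Q.T @ panel              # only an upper triangle survives
        A[k:k + L, rows] = A[rows, k:k + L].T       # keep symmetry
        A[rows, rows] = Q.T @ A[rows, rows] @ Q     # level-3 trailing update
    return A

# Eigenvalues are preserved: np.linalg.eigvalsh(reduce_to_band(A, 32))
# agrees with np.linalg.eigvalsh(A) up to rounding error.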

High performance tridiagonalization algorithms (II)
Wu's improvement (Wu et al., 1996)
– Defer the application of the block Householder transformations until several (M_T) transformations are ready.
– The rank-2L update can then be replaced by a rank-2LM_T update, enabling more efficient use of the cache memory.
Back transformation
– A two-step back transformation is necessary:
  1st step: eigenvectors of T -> eigenvectors of B
  2nd step: eigenvectors of B -> eigenvectors of A
– Both steps can be reorganized to use the level-3 BLAS.
– In the 2nd step, several (M_B) block Householder transformations can be aggregated to enhance cache utilization.

Properties of Bischof's algorithm
Computational work
– Reduction to a band matrix: (4/3)n^3
– Murata's algorithm: 6n^2 L
– Back transformation, 1st step: 2n^2 m; 2nd step: 2n^2 m
Parameters to be determined
– L (half bandwidth of the intermediate band matrix)
– M_T (block size for tridiagonalization)
– M_B (block size for back transformation)
Level-3 aspects
– All the computations that require O(n^3) work can be done with the level-3 BLAS.
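The work estimates above and on the previous slides can be turned into a small comparison helper. This is a back-of-the-envelope calculation using only the operation counts quoted here, not a performance model.

def dongarra_flops(n, m):
    """Dongarra: (4/3)n^3 tridiagonalization + 2 n^2 m back transformation."""
    return (4.0 / 3.0) * n**3 + 2.0 * n**2 * m

def bischof_flops(n, m, L):
    """Bischof: (4/3)n^3 band reduction + 6 n^2 L Murata + 2 * (2 n^2 m) back transform."""
    return (4.0 / 3.0) * n**3 + 6.0 * n**2 * L + 4.0 * n**2 * m

# Example: n = 6000, all eigenvectors (m = n), half bandwidth L = 48.
# Bischof does more arithmetic, but nearly all of it in level-3 BLAS.
print(dongarra_flops(6000, 6000) / 1e12, bischof_flops(6000, 6000, 48) / 1e12)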

Performance comparison of two tridiagonalization algorithms
We compare the performance of Dongarra's and Bischof's algorithms under the following conditions:
– Computational environments: Xeon (2.8GHz), Opteron (1.8GHz), Fujitsu PrimePower HPC2500, IBM pSeries (Power5, 1.9GHz)
– Number of processors: 1 to 32 (depending on the target machine)
– Problem size: n = 1500, 3000, 6000, 12000 and 24000
– Eigenpairs computed: all eigenvalues & eigenvectors, or eigenvalues only
– Parameters in the algorithms: nearly optimal values chosen for each environment and matrix size

Performance on the Xeon (2.8GHz) processor
[Plots: execution time (sec.) vs. number of processors for n = 1500, 3000 and 6000; one panel for computing all the eigenvalues and eigenvectors, one for computing the eigenvalues only.]
Bischof's algorithm is preferable when only the eigenvalues are needed.

Performance on the Opteron (1.8GHz) processor
[Plots: execution time (sec.) vs. number of processors for n = 1500, 3000 and 6000; one panel for computing all the eigenvalues and eigenvectors, one for computing the eigenvalues only.]
Bischof's algorithm is preferable in the 4-processor case even when all the eigenvalues and eigenvectors are needed.

Performance on the Fujitsu PrimePower HPC2500
[Plots: execution time (sec.) vs. number of processors for n = 6000, 12000 and 24000; one panel for computing all the eigenvalues and eigenvectors, one for computing the eigenvalues only.]
Bischof's algorithm is preferable when the number of processors is greater than 2, even for computing all the eigenvectors.

Performance on the IBM pSeries (Power5, 1.9GHz)
[Plots: execution time (sec.) vs. number of processors for n = 6000, 12000 and 24000; one panel for computing all the eigenvalues and eigenvectors, one for computing the eigenvalues only.]
Bischof's algorithm is preferable when only the eigenvalues are needed.

Optimal algorithm as a function of the number of processors and necessary eigenvectors
[Charts for the Xeon (2.8GHz) and Opteron (1.8GHz): regions where Bischof's or Dongarra's algorithm is optimal, plotted against the number of processors and the fraction of eigenvectors needed, for n = 1500, 3000 and 6000.]

Optimal algorithm as a function of the number of processors and necessary eigenvectors (Cont'd)
[Charts for the PrimePower HPC2500 and pSeries: regions where Bischof's or Dongarra's algorithm is optimal, plotted against the number of processors and the fraction of eigenvectors needed, for n = 6000, 12000 and 24000.]

Optimal choice of algorithm and its parameters
To construct an automatically tuned eigensolver, we need to select the optimal algorithm and its parameters according to the computational environment and the problem size. To do this efficiently, we need an accurate and inexpensive performance model that predicts the performance of an algorithm given the computational environment, the problem size and the parameter values.
A promising way to achieve this goal seems to be the hierarchical modeling approach (Cuenca et al., 2004), sketched in the code below:
– Model the execution time of the BLAS primitives accurately based on measured data.
– Predict the execution time of the entire algorithm by summing up the execution times of the BLAS primitives.
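A minimal Python sketch of that two-level idea follows. The linear GEMM cost model, the timing helper and the simplified call trace of the band reduction are illustrative assumptions, not the model used in the talk.

import time
import numpy as np

def time_gemm(m, n, k):
    """Measure one DGEMM call computing an (m x k) by (k x n) product."""
    A, B = np.random.rand(m, k), np.random.rand(k, n)
    t0 = time.perf_counter()
    A @ B
    return time.perf_counter() - t0

def fit_gemm_model(sizes=(200, 400, 800)):
    """Fit t(m, n, k) ~ c0 + c1 * m * n * k to a few measured square GEMMs."""
    flops = np.array([s**3 for s in sizes], dtype=float)
    times = np.array([time_gemm(s, s, s) for s in sizes])
    c1, c0 = np.polyfit(flops, times, 1)
    return lambda m, n, k: c0 + c1 * m * n * k

def predict_band_reduction(n, L, MT, gemm_model):
    """Sum the model's predictions over a simplified GEMM call trace of the
    blocked band reduction (two GEMMs per aggregated trailing-block update)."""
    total, block = 0.0, L * MT
    for k in range(0, n - L, block):
        trailing = n - k - L
        total += 2.0 * gemm_model(trailing, trailing, block)
    return total

# Usage: model = fit_gemm_model(); t_pred = predict_band_reduction(1920, 48, 4, model)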

Performance prediction of Bischof's algorithm
[Plots: predicted and measured execution time of the reduction to a band matrix as functions of L and M_T (Opteron, n = 1920).]
– The model reproduces the variation of the execution time as a function of L and M_T fairly well.
– Prediction errors: less than 10%
– Prediction time: 0.5 s (for all the values of L and M_T)

Automatic optimization of parameters L and M_T
[Plot: performance (MFLOPS) vs. matrix size n for the optimized values of (L, M_T).]
– The values of (L, M_T) determined from the model were actually optimal in almost all cases.
– Parameter optimization in the other phases and selection of the optimal algorithm are in principle possible using the same methodology.
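Given such a model, the selection step itself can be a plain grid search over candidate parameter values. A sketch follows; the candidate grids are assumptions chosen for illustration, not the values used in the experiments.

def choose_parameters(n, predict, L_candidates=(16, 32, 48, 64, 96),
                      MT_candidates=(1, 2, 4, 8)):
    """Return (predicted time, L, M_T) minimizing the predicted execution time."""
    best = None
    for L in L_candidates:
        for MT in MT_candidates:
            t = predict(n, L, MT)
            if best is None or t < best[0]:
                best = (t, L, MT)
    return best

# Usage with the modeling sketch above:
# model = fit_gemm_model()
# t_pred, L_opt, MT_opt = choose_parameters(
#     1920, lambda n, L, MT: predict_band_reduction(n, L, MT, model))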

Conclusion
– We studied the performance of Dongarra's and Bischof's tridiagonalization algorithms on various SMP machines for various problem sizes.
– The results show that the optimal algorithm strongly depends on the target machine, the number of processors and the number of eigenvectors needed. On the Opteron and the HPC2500, Bischof's algorithm is always preferable when the number of processors exceeds 2.
– Using the hierarchical modeling approach, it seems possible to predict the optimal algorithm and its parameters prior to execution. This opens the way to an automatically tuned eigensolver.