Eigenvalues and eigenvectors  A x = λ x
 Quantum mechanics (Schrödinger equation)
 Quantum chemistry
 Principal component analysis (in data mining)
 Vibration analysis (of mechanical structures)
 Image processing, compression, face recognition
 Eigenvalues of graphs, e.g., in Google's PageRank...
 The goal: to solve it fast [acceleration analogy – 64 mph vs. the speed of sound!]

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical Report, 03/2012.
J. Dongarra, A. Haidar, T. Schulthess, R. Solca, and S. Tomov, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory-aware tasks," ICL Technical Report, 03/2012.

The need for eigensolvers
A model leading to a self-consistent iteration: a computation with a need for high-performance linear algebra (e.g., diagonalization and orthogonalization)

The need for eigensolvers
 Schrödinger equation: Hψ = Eψ
 Choose a basis set of wave functions
 Two cases:
—Orthonormal basis: H x = E x; in general it needs a big basis set
—Non-orthonormal basis: H x = E S x

Hermitian Generalized Eigenproblem
Solve A x = λ B x
1) Compute the Cholesky factorization B = L L^H
2) Transform the problem to a standard eigenvalue problem: Ã = L^-1 A L^-H
3) Solve the Hermitian standard eigenvalue problem Ã y = λ y
—Tridiagonalize Ã (50% of its flops are in the Level 2 BLAS SYMV)
—Solve the tridiagonal eigenproblem
—Transform the eigenvectors of the tridiagonal matrix to eigenvectors of Ã
4) Transform back the eigenvectors: x = L^-H y
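The four steps above can be sketched numerically. This is a minimal pure-Python illustration for a real symmetric 2x2 pair (so B = L L^T, the closed-form 2x2 eigensolve replaces the tridiagonal machinery, and all matrix values are made up for the example, not taken from the slides):

```python
import math

# Illustrative 2x2 symmetric pair (values chosen only for this sketch).
A = [[4.0, 1.0], [1.0, 3.0]]
B = [[2.0, 0.3], [0.3, 1.0]]          # symmetric positive definite

# Step 1: Cholesky factorization B = L L^T.
l11 = math.sqrt(B[0][0])
l21 = B[1][0] / l11
l22 = math.sqrt(B[1][1] - l21 * l21)
Linv = [[1.0 / l11, 0.0], [-l21 / (l11 * l22), 1.0 / l22]]  # L^-1

# Step 2: standard-form matrix At = L^-1 A L^-T.
C = [[sum(Linv[i][k] * A[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]               # C = L^-1 A
At = [[sum(C[i][k] * Linv[j][k] for k in range(2)) for j in range(2)]
      for i in range(2)]              # At = C L^-T

# Step 3: solve At y = lam y (closed form; a 2x2 matrix is already
# tridiagonal, so the reduction step is a no-op at this size).
tr = At[0][0] + At[1][1]
det = At[0][0] * At[1][1] - At[0][1] * At[1][0]
disc = math.sqrt(tr * tr / 4.0 - det)
lam = tr / 2.0 - disc                 # smaller eigenvalue
y = [At[0][1], lam - At[0][0]]        # its eigenvector (At[0][1] != 0 here)

# Step 4: back-transform the eigenvector: x = L^-T y.
x = [Linv[0][0] * y[0] + Linv[1][0] * y[1], Linv[1][1] * y[1]]
```

The pair (lam, x) then satisfies the original generalized problem A x = lam B x, which is the whole point of the reduction.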

Fast BLAS development
[Figure: Performance of MAGMA DSYMV vs. CUBLAS; Keeneland system, one node: 3 NVIDIA M2090 GPUs (1.55 GHz, 5.4 GB) and 2 x 6 Intel cores (2.8 GHz, 23 GB)]

Parallel SYMV on multiple GPUs
 Multi-GPU algorithms were developed
—1-D block-cyclic distribution
—Every GPU
 has a copy of x
 computes y_i = α A_i x, where A_i is the local matrix for GPU i
 reuses the single-GPU kernels
—The final result is computed on the CPU
[Diagram: 1-D block-cyclic distribution of block rows: GPU 0, GPU 1, GPU 2, GPU 0, ...]
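The distribution above can be simulated on the CPU. In this sketch the block size, GPU count, and matrix values are illustrative choices; the real code launches one MAGMA SYMV kernel per GPU over its local block rows and assembles the partial vectors on the host:

```python
import random

# CPU simulation of the 1-D block-cyclic multi-GPU SYMV described above.
n, block, num_gpus, alpha = 8, 2, 3, 1.0   # hypothetical sizes
random.seed(0)
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = random.random()      # symmetric matrix
x = [random.random() for _ in range(n)]          # every "GPU" holds a copy of x

# Block row b goes to GPU (b / block) mod num_gpus; each GPU computes
# its contribution y_i = alpha * A_i x from its local block rows only.
partial = [[0.0] * n for _ in range(num_gpus)]
for b in range(0, n, block):
    g = (b // block) % num_gpus
    for i in range(b, min(b + block, n)):
        partial[g][i] = alpha * sum(A[i][j] * x[j] for j in range(n))

# Final result assembled on the CPU from the per-GPU partial vectors.
y = [sum(partial[g][i] for g in range(num_gpus)) for i in range(n)]
```

Because the block rows are disjoint, the host-side reduction is just a sum of non-overlapping partial vectors; the cyclic assignment keeps the per-GPU work balanced for large n.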

Parallel SYMV on multiple GPUs
[Figure: Performance of MAGMA DSYMV on multiple M2090 GPUs, Keeneland system (one node)]

Hybrid Algorithms
Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems
 Hybridization
– Trailing-matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations)
– Panels (Level 2 BLAS) are hybrid: operations with memory footprint restricted to the panel are done on the CPU
– The time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU

Hybrid Two-Sided Factorizations

From fast BLAS to fast tridiagonalization
 50% of the flops are in SYMV
 Memory bound, i.e., does not scale well on multicore CPUs
 Use the GPU's high memory bandwidth and optimized SYMV
 8x speedup over 12 Intel cores (2.8 GHz)
[Figure: Performance of MAGMA DSYTRD on multiple M2090 GPUs, Keeneland system (one node)]
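To make the role of SYMV concrete, here is an unblocked Householder tridiagonalization in plain Python. It is a sketch of what DSYTRD computes, not MAGMA's hybrid implementation: in every step the product p = beta * A22 v against the whole trailing submatrix is the Level 2 BLAS SYMV that dominates the flop count, while the remaining work is short-vector Level 1 arithmetic plus a rank-2 update.

```python
import math

def tridiagonalize(A):
    """Reduce a symmetric matrix (list of lists, modified in place) to
    tridiagonal form by Householder similarity transforms.  The dominant
    kernel in each step is the symmetric matrix-vector product
    p = beta * A22 @ v -- the SYMV the slides refer to."""
    n = len(A)
    for k in range(n - 2):
        col = [A[i][k] for i in range(k + 1, n)]
        norm = math.sqrt(sum(t * t for t in col))
        if norm == 0.0:
            continue                        # column already reduced
        alpha = -math.copysign(norm, col[0])
        v = col[:]
        v[0] -= alpha                       # Householder vector
        beta = 2.0 / sum(t * t for t in v)
        # SYMV with the trailing submatrix A22 = A[k+1:, k+1:]:
        p = [beta * sum(A[i][j] * v[j - k - 1] for j in range(k + 1, n))
             for i in range(k + 1, n)]
        s = 0.5 * beta * sum(pi * vi for pi, vi in zip(p, v))
        w = [pi - s * vi for pi, vi in zip(p, v)]
        # Symmetric rank-2 update: A22 -= v w^T + w v^T.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= (v[i - k - 1] * w[j - k - 1]
                            + w[i - k - 1] * v[j - k - 1])
        A[k + 1][k] = A[k][k + 1] = alpha   # new off-diagonal entry
        for i in range(k + 2, n):
            A[i][k] = A[k][i] = 0.0         # annihilated by the reflector
```

Since every step is an orthogonal similarity, the result has the same eigenvalues as the input; the SYMV touches O(n^2) data per step while doing only O(n^2) flops, which is exactly why the reduction is memory-bound.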

Can we accelerate it 4x more?
A two-stage approach
 Increases the computational intensity by introducing two stages:
—1st stage: reduce the matrix to band form [Level 3 BLAS; implemented very efficiently on the GPU using "look-ahead"]
—2nd stage: reduce the band matrix to tridiagonal form [memory bound, but we developed a very efficient "bulge-chasing" algorithm with memory-aware tasks for multicore to increase the computational intensity]
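The second stage is built from plane (Givens) rotations. The sketch below conveys only the mathematical idea: it annihilates every entry below the subdiagonal with a Givens similarity rotation, in plain column order. The production bulge-chasing kernel instead orders such rotations along the band, so the temporary fill-in ("bulge") each rotation creates is immediately chased off within cache-resident tiles; that scheduling is what this sketch omits.

```python
import math

def givens_to_tridiagonal(A):
    """Reduce a symmetric matrix (list of lists, in place) to tridiagonal
    form with Givens similarity rotations.  Entry A[i][j] below the
    subdiagonal is zeroed by a rotation in the (j+1, i) plane."""
    n = len(A)
    for j in range(n - 2):
        for i in range(j + 2, n):
            if A[i][j] == 0.0:
                continue                    # nothing to annihilate
            r = math.hypot(A[j + 1][j], A[i][j])
            c, s = A[j + 1][j] / r, A[i][j] / r
            # Rotate rows j+1 and i (zeros A[i][j]).
            for col in range(n):
                a, b = A[j + 1][col], A[i][col]
                A[j + 1][col], A[i][col] = c * a + s * b, -s * a + c * b
            # Rotate columns j+1 and i (restores symmetry: similarity).
            for row in range(n):
                a, b = A[row][j + 1], A[row][i]
                A[row][j + 1], A[row][i] = c * a + s * b, -s * a + c * b
```

Applied to a band matrix from stage 1, each rotation only touches two rows and two columns, so the arithmetic per rotation is O(n) and the working set is small; the art is in the rotation ordering, not the rotations themselves.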

Schematic profiling of the eigensolver

An additional 4x speedup!
 12x speedup over 12 Intel cores (2.8 GHz)
[Figure: Keeneland system, one node: 3 NVIDIA M2090 GPUs (1.55 GHz, 5.4 GB) and 2 x 6 Intel cores (2.8 GHz, 23 GB)]

Conclusions
 Breakthrough eigensolver using GPUs
 A number of fundamental numerical algorithms for GPUs (BLAS- and LAPACK-type)
 Released in MAGMA 1.2
 Enormous impact in technical computing and applications
 12x speedup with a Fermi GPU vs. a state-of-the-art multicore system (12 Intel cores, 2.8 GHz)
—From the speed of a car to the speed of sound!

Collaborators / Support
 MAGMA [Matrix Algebra on GPU and Multicore Architectures] team
 PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team
 Collaborating partners
—University of Tennessee, Knoxville
—University of California, Berkeley
—University of Colorado, Denver
—INRIA, France
—KAUST, Saudi Arabia