Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD
September 7, 2007, PaCT-2007, Pereslavl-Zalessky
Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama, Masami Takata, Kinji Kimura, Masashi Iwasaki and Yoshimasa Nakamura

1 Outline
– Introduction
– The CSX600 floating-point accelerator
– Optimization of the rectangular SVD algorithm for the CSX600
– Performance evaluation
– Conclusion

2 Introduction

3 Singular value decomposition of rectangular matrices
Applications:
– Image processing
– Electronic structure calculation (Filter Diagonalization Method)
– Information retrieval (Latent Semantic Indexing)
– Statistical computations (PCA, ICA and least squares)
Example: A = U Σ V^T, where A is m×n dense, U is m×n orthogonal, Σ is n×n diagonal, V is n×n orthogonal, and m >> n.

4 Floating-point accelerators
– ClearSpeed CSX600: 1+96 processor cores, 48 GFLOPS (double precision)
– Cell: 1+8 processor cores, 256 GFLOPS (single precision)
– GRAPE-DR: 512 processor cores, 512 GFLOPS (single precision), 256 GFLOPS (double precision)
Very high GFLOPS values thanks to a large number of cores, but performance is limited by relatively low memory bandwidth.

5 Use of the Level-3 BLAS (matrix multiplication)
Matrix multiplication C := C + AB:
– Amount of data: O(N^2); computational work: O(N^3), so the amount of data is only O(1/N) of the computational work.
– By using the cache memory effectively, the effect of low memory bandwidth can be mitigated.
For matrix-vector multiplication (y := y + Ax), both the amount of data and the computational work are O(N^2).
We can exploit the potential performance of the CSX600 by reorganizing the algorithm to use matrix multiplications efficiently.
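To make the data-reuse argument concrete, here is a minimal Python sketch (illustrative only; the sizes and variable names are chosen for this example) that performs C := C + AB through the double-precision GEMM routine and compares the flop-per-byte ratios of matrix multiplication and matrix-vector multiplication.

```python
import numpy as np
from scipy.linalg.blas import dgemm

N = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
C = np.zeros((N, N))

# Level-3 BLAS: C := C + A*B  (2*N^3 flops on 3*N^2 double-precision entries)
C = dgemm(alpha=1.0, a=A, b=B, beta=1.0, c=C)

gemm_flops = 2.0 * N**3
gemm_bytes = 3.0 * N**2 * 8          # A, B and C in double precision
gemv_flops = 2.0 * N**2              # y := y + A*x
gemv_bytes = (N**2 + 2 * N) * 8      # A, x and y

print("GEMM flops per byte:", gemm_flops / gemm_bytes)   # grows like N/12
print("GEMV flops per byte:", gemv_flops / gemv_bytes)   # stays near 1/4
```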

6 Objective of this study
– Accelerate the SVD of rectangular matrices using the CSX600 processor.
– To exploit the potential of the CSX600, reorganize the existing algorithm so that matrix multiplications can be used efficiently.
– Evaluate the performance and clarify the technical problems for further improving the performance.

7 The CSX600 floating-point accelerator

8 Architecture and performance of the CSX600
The CSX600 chip:
– One main processor
– 96 floating-point processors (64-bit, 2 flops/cycle, 128-byte register files, 6 KB SRAM each)
– Operates at 250 MHz
– Peak performance: 48 GFLOPS
ClearSpeed Advance board:
– Two CSX600 processors
– 1 GB DRAM
– Connected to a host PC via the PCI-X bus
– Peak performance: 96 GFLOPS

9 Software environments for the CSX600
Software Development Kit:
– Compiler: parallel programming with the Cn language
– Debugger
– Simulator
CSXL library (used in this study):
– Basic Linear Algebra Subprograms (BLAS) for the ClearSpeed Advance board
– The library transfers the input data from the main memory to the board, performs the computation and returns the result to the main memory.
– Sustained performance: 50 GFLOPS with DGEMM (dense matrix-matrix multiplication)
CSFFT library

10 Performance of the CSXL DGEMM
[Figure: performance (MFLOPS) of the CSXL DGEMM for C := C + A×B (A: m×k, B: k×n, C: m×n); one panel fixes m = k = 6000 and varies n, the other fixes k = 6000 and varies m = n.]
At least two of the three size parameters (m, n and k) must be large to obtain considerable performance.
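The shape dependence can be illustrated with a small timing sketch. Note that this is only an illustration run against the host's BLAS through NumPy, not against the CSXL, and the shapes are chosen arbitrarily.

```python
import time
import numpy as np

def gemm_gflops(m, n, k, repeats=3):
    """Time C := C + A*B and return the best sustained GFLOPS."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, k))
    B = rng.standard_normal((k, n))
    C = np.zeros((m, n))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        C += A @ B                     # the product is dispatched to BLAS DGEMM
        best = min(best, time.perf_counter() - t0)
    return 2.0 * m * n * k / best / 1e9

# One large dimension vs. two vs. three large dimensions
for shape in [(4000, 64, 64), (4000, 4000, 64), (4000, 4000, 4000)]:
    print(shape, "%.1f GFLOPS" % gemm_gflops(*shape))
```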

11 Optimization of the rectangular SVD algorithm for the CSX600

12 Algorithm for rectangular SVD (A: m×n, Q: m×n, R: n×n, B: n×n)
1. QR decomposition: A = QR
2. Bidiagonalization: R = U1 B V1^T
3. SVD of the bidiagonal matrix: B = U2 Σ V2^T
4. Inverse transformation: R = U' Σ V^T, where V = V1 V2 and U' = U1 U2
5. Multiplication by Q: A = U Σ V^T, where U = Q U'
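For reference, a minimal NumPy sketch of this pipeline is shown below (illustrative only; the actual implementation uses Fortran, LAPACK and the CSXL BLAS). Steps 2-4 are bundled into a single library SVD of R here, whereas the slides treat the bidiagonalization, the Integrable SVD of the bidiagonal matrix and the inverse transformation separately.

```python
import numpy as np

def rectangular_svd(A):
    """Thin SVD of a tall m-by-n matrix (m >> n) via an initial QR step."""
    # Step 1: QR decomposition, A = Q R with Q m-by-n and R n-by-n
    Q, R = np.linalg.qr(A, mode="reduced")
    # Steps 2-4 (bidiagonalization, SVD of the bidiagonal matrix and the
    # inverse transformation) are bundled into one library call on R here.
    Uprime, sigma, Vt = np.linalg.svd(R)
    # Step 5: multiplication by Q, U = Q U'
    U = Q @ Uprime
    return U, sigma, Vt

# Quick check on a random tall matrix
rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, size=(2000, 200))
U, sigma, Vt = rectangular_svd(A)
print(np.linalg.norm(U * sigma @ Vt - A))   # residual, ~1e-13
```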

13 Computational work of each part (when m >> n, e.g., m = 100000, n = 5000)
– QR decomposition A = QR: 2mn^2
– Bidiagonalization R = U1 B V1^T: (8/3)n^3
– SVD of the bidiagonal matrix B = U2 Σ V2^T: O(n^2) to O(n^3)
– Inverse transformation (V = V1 V2, U' = U1 U2): 2n^3 + 4n^3
– Multiplication by Q, U = Q U': 4mn^2
The QR decomposition (2mn^2) and the multiplication by Q (4mn^2) account for most of the computational work.
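A short calculation with the example sizes above shows how strongly the two m-dependent steps dominate (illustrative only; the SVD of the bidiagonal matrix is neglected as at most O(n^3)).

```python
m, n = 100_000, 5_000

qr_work      = 2 * m * n**2          # QR decomposition
bidiag_work  = (8 / 3) * n**3        # bidiagonalization of R
inverse_work = 2 * n**3 + 4 * n**3   # inverse transformation (V and U')
mult_q_work  = 4 * m * n**2          # multiplication by Q

total = qr_work + bidiag_work + inverse_work + mult_q_work
print("m-dependent share: %.1f%%" % (100 * (qr_work + mult_q_work) / total))
# -> roughly 93% of the flops are in the QR decomposition and multiplication by Q
```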

14 Optimization of each part
Parts accelerated with the CSX600 (reorganize the algorithms to use matrix multiplications and accelerate them with the CSXL BLAS):
– QR decomposition: A = QR
– Multiplication by Q: A = U Σ V^T, U = Q U'
Parts executed on the host only:
– Bidiagonalization: R = U1 B V1^T (LAPACK DGEBRD)
– SVD of the bidiagonal matrix: B = U2 Σ V2^T (Integrable SVD)
– Inverse transformation: R = U' Σ V^T, V = V1 V2, U' = U1 U2 (LAPACK DORMBR)

15 QR decomposition of A
Upper triangularization by Householder transformations:
Hn ··· H2 H1 A = A(n) = R, so A = H1 H2 ··· Hn A(n) = QR,
where, for example, H1 A = (I − t1 y1 y1^T) A = A(1).
Each transformation is a rank-1 update (level-2 BLAS), so the CSXL cannot be used.
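The sketch below is a minimal NumPy version of this unblocked Householder QR (illustrative only, assuming a tall full-rank input); each reflector is applied to the trailing columns as a rank-1 update, which is exactly the level-2 BLAS work that the CSXL cannot accelerate.

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR of a tall full-rank matrix.
    Returns the n-by-n factor R, the Householder vectors Y and scalars t."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Y = np.zeros((m, n))
    t = np.zeros(n)
    for k in range(n):
        x = A[k:, k].copy()
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        t[k] = 2.0 / (v @ v)
        Y[k:, k] = v
        # Rank-1 update of the trailing columns (level-2 BLAS work)
        A[k:, k:] -= t[k] * np.outer(v, v @ A[k:, k:])
    return np.triu(A[:n, :]), Y, t

# Small test: rebuild Q from the reflectors and check A = Q R
rng = np.random.default_rng(1)
A = rng.uniform(-0.5, 0.5, size=(500, 50))
R, Y, t = householder_qr(A)
Q = np.eye(A.shape[0])
for k in reversed(range(A.shape[1])):      # Q = H1 H2 ... Hn
    v = Y[:, k]
    Q -= t[k] * np.outer(v, v @ Q)
print(np.linalg.norm(Q[:, :A.shape[1]] @ R - A))   # ~1e-14
```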

16 Aggregating the Householder transformations (blocking technique)
Hn ··· H2 H1 = (I − tn yn yn^T) ··· (I − t2 y2 y2^T)(I − t1 y1 y1^T) = I − Yn Tn Yn^T,
where Yn = [y1 | y2 | ··· | yn] (an m×n matrix) and Tn is an n×n lower triangular matrix.
Multiple Householder transformations can thus be aggregated and carried out by matrix multiplications, which can be accelerated with the CSXL.
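A minimal sketch of this aggregation (the compact WY representation) is given below; it is illustrative only, and the reflectors are generated at random here, whereas in the QR algorithm they would come from a panel factorization such as the householder_qr sketch above. It builds the lower triangular Tn from the yk and tk and applies the aggregated transformation to a block with three matrix multiplications.

```python
import numpy as np

def build_compact_wy(Y, t):
    """Lower triangular T such that Hn ... H2 H1 = I - Y T Y^T,
    where Hk = I - t_k y_k y_k^T and Y = [y1 | ... | yn]."""
    n = Y.shape[1]
    T = np.zeros((n, n))
    T[0, 0] = t[0]
    for j in range(1, n):
        # Multiplying by the next reflector on the left adds one row:
        # T[j, :j] = -t_j * (y_j^T Y_prev) T_prev
        T[j, :j] = -t[j] * (Y[:, j] @ Y[:, :j]) @ T[:j, :j]
        T[j, j] = t[j]
    return T

rng = np.random.default_rng(2)
m, nb = 300, 32
Y = rng.standard_normal((m, nb))
t = 2.0 / np.einsum("ij,ij->j", Y, Y)       # t_k = 2 / (y_k^T y_k)
T = build_compact_wy(Y, t)

# Applying Hnb ... H1 to a block C becomes three matrix multiplications:
C = rng.standard_normal((m, 100))
C_blocked = C - Y @ (T @ (Y.T @ C))

# Reference: apply the reflectors one at a time (rank-1, level-2 updates)
C_ref = C.copy()
for k in range(nb):
    v = Y[:, k]
    C_ref -= t[k] * np.outer(v, v @ C_ref)

print(np.linalg.norm(C_blocked - C_ref))    # at round-off level
```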

17 Blocking strategies for QR decomposition
Comparison of three blocking strategies (L: blocking size, 1 ≤ L ≤ n/2):
– 0. Non-blocked: level-2 work 2mn^2; no level-3 work.
– 1. Block QR: level-2 work 2mnL; level-3 work 2mn(n − L); matrix multiplication size L (QR decomposition), L (multiplication by Q).
– 2. Recursive QR: no level-2 work; level-3 work 3mn^2; matrix multiplication size n/2 (QR decomposition), n (multiplication by Q).
– 3. Extended recursive QR: no level-2 work; level-3 work 2mn^2 + mnL; matrix multiplication size L (QR decomposition), L (multiplication by Q).
Block QR requires the smallest amount of work, but some of the work is done with the level-2 BLAS and the matrix multiplications are rather small. Recursive QR requires the largest amount of work, but all of it is in the level-3 BLAS and the matrix multiplications are large.

18 Performance evaluation

19 Numerical experiments
Computational environments:
– Xeon 3.2 GHz, 8 GB memory
– Intel Fortran -O3 + Intel Math Kernel Library
– ClearSpeed Advance board
Problem:
– SVD of an m-by-n matrix whose elements are random numbers in [-0.5, 0.5], with 1000 ≤ n ≤ 4000
Experiments:
– Performance comparison of the three QR decomposition algorithms on the ClearSpeed board
– Speedup effect of the whole SVD with the ClearSpeed board
– Evaluation of accuracy

20 Performance of three QR decomposition algorithms (m = n = 4000)
[Figure: computational time (sec) of block QR, recursive QR and extended recursive QR.]

21 Performance of three QR decomposition algorithms (m = 10000, n = 4000)
[Figure: computational time (sec) of block QR, recursive QR and extended recursive QR.]

22 Speedup of the whole SVD with the CSX600
[Figure: computational time (sec) of LAPACK DGESDD and our code, each with and without the ClearSpeed board, for m = 10000, n = 1000 (m:n = 10:1) and m = 100000, n = 4000 (m:n = 25:1); the annotated speedups are x1.2, x1.8, x1.3, x3.1 and x4.]

23 Speedup effect as a function of matrix size
Speedup = (time with the PC only) / (time with the PC + CSX600), for our code with recursive QR.
[Figure: speedup as a function of m and n.]

24 Evaluation of accuracy
[Figures: residual ||U Σ V^T − A||_F and orthogonality of the left singular vectors ||U^T U − I||_F, for various m and n.]
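Both measures are straightforward to compute; the sketch below (illustrative only) evaluates them for an SVD returned by a library routine.

```python
import numpy as np

def svd_accuracy(A, U, sigma, Vt):
    """Frobenius-norm residual ||U*diag(sigma)*Vt - A||_F and
    left-singular-vector orthogonality ||U^T U - I||_F."""
    residual = np.linalg.norm(U * sigma @ Vt - A)
    orthogonality = np.linalg.norm(U.T @ U - np.eye(U.shape[1]))
    return residual, orthogonality

rng = np.random.default_rng(3)
A = rng.uniform(-0.5, 0.5, size=(3000, 300))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(svd_accuracy(A, U, sigma, Vt))   # both should be at round-off level
```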

25 Conclusion

26 Summary and future work
Summary:
– We showed how to accelerate the rectangular SVD algorithm with the CSX600 floating-point accelerator.
– By modifying the algorithm to use large matrix multiplications, we obtained up to 4 times speedup over the LAPACK code on the 3.2 GHz Xeon.
Future work:
– Further improve the performance by optimizing the bidiagonalization and inverse transformation parts.
– Performance evaluation on other accelerators such as the GRAPE-DR.
– Application to other matrix computations.