Takagi Factorization on a GPU using CUDA
Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall
School of Computing, University of Utah

Presentation transcript:

What is Takagi Factorization?
- A special form of the singular value decomposition (SVD), applicable to complex symmetric matrices.
- Mathematically: A = U s U^T, where U is the matrix of Takagi (left singular) vectors, s is a non-negative diagonal matrix holding the singular values, and U^T is the transpose of U.

Applications
- Grunsky inequalities
- Computation of the near-best uniform polynomial or rational approximation of a high-degree polynomial on a disk
- Complex independent component analysis problems
- Complex coordinate rotation method
- Optical potential method
- Nuclear magnetic resonance
- Diagonalization of mass matrices of Majorana fermions

How is it different from SVD?
- The SVD is defined as A = U s V*; the left and right singular vector matrices are different.
- Computing the Takagi vectors is not trivial.
- Takagi factorization takes advantage of the symmetry: lower computational complexity and lower storage requirements. (The relationship between the two factorizations is illustrated in the first sketch after the transcript.)

Jacobi Method
- For a given index pair (p, q), form the 2x2 symmetric matrix M = [[a_pp, a_pq], [a_pq, a_qq]] from rows and columns p and q.
- Diagonalize M with a rotation J = [[cos θ, sin θ], [-sin θ, cos θ]] chosen so that J^T M J is diagonal, and apply the transformation to rows and columns p and q of the full matrix.
- Apply such a unitary transform to all possible pairs, C(N,2) = N(N-1)/2 combinations; this finishes one sweep.
- Keep performing sweeps until the matrix converges. Convergence measure: the sum of the absolute values of the off-diagonal elements; convergence is quadratic in nature.
- Complexity: N^3. (A small serial sketch of the sweep follows the transcript.)

Other Algorithms
- Divide-and-conquer algorithm and the twisted factorization method.
- Both involve tridiagonalization followed by diagonalization; complexity: N^2.
- Problems: they are not as parallelizable as Jacobi, and the Jacobi method is more stable, for example on ill-conditioned matrices.

Related Work
- Fast SVD on the 7900 GTX (2005), a generation before CUDA-enabled GPUs.
- More recently, SVD has been implemented on the GPU using CUDA, but only for real matrices: bidiagonalization by Householder transforms, diagonalization by the implicitly shifted QR algorithm, complexity m*n^2, and high storage requirements.
- Matlab and serial C implementations are available.

Parallel Jacobi Algorithm
- N/2 unitary transforms can be applied in parallel; their order does not alter the final result, but parallel threads cannot share a common index.
- This is done (N-1) times for one sweep.
- All row updates need to finish before the column updates, to prevent race conditions: a thread's column entries are part of other threads' rows.
- Computation is now done on the full symmetric matrix.
- A mechanism is needed to generate unique N/2 pairs for each of the (N-1) rounds: cyclic chess-tournament scheduling (see the pairing and two-phase sweep sketches after the transcript).
- [Figure: assignment of threads Tid=0..4 to index pairs across iterations of the schedule.]

GPU Execution
- Copy the complex matrix A from host to GPU.
- Kernel: initialize auxiliary data structures, the real diagonal matrix D and the complex Takagi vector matrix U.
- Kernel: calculate the threshold; if convergence has been reached, exit the sweep loop.
- Kernel: diagonalize and apply the row updates, followed by host synchronization.
- Kernel: apply the column and unitary (Takagi vector) matrix updates, followed by host synchronization.
- Repeat the two update kernels (N-1) times per sweep.
- Kernel: make the diagonal elements non-negative and sort them.
- Copy U and D back from the GPU.

Results
- The Pthreads implementation reaches a speedup of 1.64x over the serial CPU version.
- An NVIDIA GTX 260 reaches a speedup of 11.69x.
- An NVIDIA GTX 480 (Fermi) reaches a speedup of 30.80x.
- [Figure: speedup of the GPUs and Pthreads for different matrix sizes.]

Conclusion
- The Jacobi method is the most stable and accurate parallel algorithm for this problem.
- Good speedups were achieved over the serial and Pthreads implementations.
- The NVIDIA Fermi architecture gave much better speedups: non-coalesced reads are cached, and it offers much higher FLOP performance.
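
Sketch 1: Takagi factorization versus the ordinary SVD. This is a minimal NumPy illustration added after the transcript, not the poster's GPU code. It builds a random complex symmetric matrix, computes an ordinary SVD, and folds the diagonal phase difference between the left and right singular vectors back into U to recover A = U s U^T; it assumes the singular values are distinct.

```python
import numpy as np

# Minimal sketch (assumed setup, not the poster's code): recover a Takagi
# factorization A = U @ diag(s) @ U.T from an ordinary SVD, assuming the
# singular values of A are distinct.
rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = B + B.T                      # complex symmetric: A == A.T (not Hermitian)

W, s, Vh = np.linalg.svd(A)      # ordinary SVD: A = W @ diag(s) @ Vh
V = Vh.conj().T

# Because A is symmetric, W and conj(V) agree up to a diagonal phase matrix
# P = W^H @ conj(V); folding sqrt(P) into W yields the Takagi vectors U.
P = np.diag(W.conj().T @ V.conj())
U = W @ np.diag(np.sqrt(P))

print(np.allclose(A, U @ np.diag(s) @ U.T))   # True: one vector matrix suffices
print(np.allclose(A, W @ np.diag(s) @ Vh))    # True: standard SVD, W != conj(V)
```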
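
Sketch 2: the serial Jacobi sweep. This sketch uses a real symmetric matrix for brevity; the poster's method applies an analogous complex unitary 2x2 transform to complex symmetric matrices. It shows the 2x2 rotation for a (p, q) pair, the row and column updates, and the off-diagonal convergence measure.

```python
import numpy as np

def jacobi_sweep_diagonalize(A, tol=1e-12, max_sweeps=50):
    """Cyclic Jacobi diagonalization of a real symmetric matrix.

    Simplified stand-in for the complex-symmetric transform used on the GPU:
    each (p, q) pair gets a rotation J = [[c, s], [-s, c]] chosen so that the
    transformed (p, q) element vanishes, applied to rows and columns p and q.
    """
    A = A.copy().astype(float)
    n = A.shape[0]
    V = np.eye(n)                          # accumulated rotations
    for _ in range(max_sweeps):
        off = np.sum(np.abs(A)) - np.sum(np.abs(np.diag(A)))
        if off < tol:                      # convergence: off-diagonal mass
            break
        for p in range(n - 1):
            for q in range(p + 1, n):      # all N(N-1)/2 pairs = one sweep
                if A[p, q] == 0.0:
                    continue
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.array([[c, s], [-s, c]])
                idx = [p, q]
                A[idx, :] = J.T @ A[idx, :]     # row update
                A[:, idx] = A[:, idx] @ J       # column update
                V[:, idx] = V[:, idx] @ J       # accumulate vectors
    return np.diag(A), V

M = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 1.0]])
vals, vecs = jacobi_sweep_diagonalize(M)
print(np.allclose(np.sort(vals), np.linalg.eigvalsh(M)))   # True
```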
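
Sketch 3: cyclic chess-tournament pairing. The unique N/2 pairs per round can be generated with the classic round-robin rotation: one index stays fixed while the remaining N-1 indices rotate, giving N/2 disjoint pairs in each of the N-1 rounds. The function below is illustrative only.

```python
def chess_tournament_rounds(n):
    """Yield N-1 rounds, each containing N/2 disjoint (p, q) pairs.

    Round-robin scheduling: index 0 stays fixed while the other n-1 indices
    rotate, so every pair appears exactly once per sweep and no two pairs in
    a round share an index (no race between parallel threads).
    """
    assert n % 2 == 0, "n must be even"
    others = list(range(1, n))
    for _ in range(n - 1):
        line = [0] + others
        yield [(min(line[i], line[n - 1 - i]), max(line[i], line[n - 1 - i]))
               for i in range(n // 2)]
        others = others[-1:] + others[:-1]   # rotate by one position

rounds = list(chess_tournament_rounds(6))
assert len(rounds) == 5                                  # N-1 rounds
assert len({p for r in rounds for p in r}) == 6 * 5 // 2  # every pair once
print(rounds[0])   # [(0, 5), (1, 4), (2, 3)]
```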
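
Sketch 4: the two-phase update order of the GPU execution flow. The function below (a hypothetical name, real symmetric simplification again) mirrors the two update kernels: within each round, the rotations and row updates for all N/2 independent pairs are applied first, and only then the column and vector-matrix updates, the point at which the GPU version needs host synchronization. It consumes the rounds produced by chess_tournament_rounds from the previous sketch.

```python
import numpy as np

def parallel_order_sweep(A, V, rounds):
    """One Jacobi sweep in the parallel (round-based) order.

    Phase 1 computes each pair's rotation and applies the row updates;
    phase 2 applies the column and vector updates. Because no two pairs in
    a round share an index, phase 1 never disturbs the 2x2 blocks the other
    pairs read, mirroring the row-update/column-update kernel split.
    """
    for pairs in rounds:                      # N-1 rounds per sweep
        Js = {}
        for (p, q) in pairs:                  # phase 1: rotations + rows
            theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(theta), np.sin(theta)
            Js[(p, q)] = J = np.array([[c, s], [-s, c]])
            A[[p, q], :] = J.T @ A[[p, q], :]
        for (p, q) in pairs:                  # phase 2: columns + vectors
            J = Js[(p, q)]
            A[:, [p, q]] = A[:, [p, q]] @ J
            V[:, [p, q]] = V[:, [p, q]] @ J
    return A, V
```

Driving this over repeated sweeps with the rounds from the pairing sketch, and stopping once the sum of absolute off-diagonal values drops below a threshold, reproduces the serial Jacobi result; on the GPU each (p, q) pair of a round is handled by one thread, which is why the two phases are separated by host synchronization.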