Acceleration of software package "R" using GPUs. Sachinthaka Abeywardana.

Presentation transcript:


CSIRO. Introduction to Graphics Processing Units (GPUs)

Introduction to GPUs (contd.)

Introduction to R and BLAS. R: statistical package; graphics. BLAS (Basic Linear Algebra Subprograms): vector-vector addition/multiplication etc.; vector-matrix addition/multiplication etc.; matrix-matrix addition/multiplication etc. LAPACK (Linear Algebra PACKage).
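The three groups of BLAS operations listed on this slide correspond to what BLAS calls levels 1, 2 and 3. As a rough illustration only, the following C sketch shows plain-loop versions of two level 1 operations and one level 2 operation on column-major data. The function names `ref_axpy`, `ref_dot` and `ref_gemv` are hypothetical and are not the real BLAS entry points; a level 3 (matrix-matrix) routine follows the same pattern with one more loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routines, not the real BLAS symbols. */

/* Level 1 (vector-vector): y <- alpha * x + y. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 1 (vector-vector): returns the dot product of x and y. */
double ref_dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Level 2 (matrix-vector): y <- A * x, where A is m x n and stored
 * column by column, so element (i, j) lives at a[i + j * m]. */
void ref_gemv(size_t m, size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i + j * m] * x[j];
}
```

Column-major storage is used because that is the layout of both Fortran arrays and R matrices, so no transposition is needed when R hands a matrix to a BLAS routine.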

What has been done in this project. Aim: replace Rblas.dll with a faster BLAS library. [Diagram labels: R, LAPACK, BLAS, New BLAS, Replace]

How the new Rblas.dll was created. [Diagram labels: Rblas.dll, CUBLAS library, C program wrapper, FORTRAN call, Initialise call]
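One plausible reading of this diagram is that R issues a Fortran-style BLAS call, a C wrapper receives it, makes sure the GPU library has been initialised, and forwards the work to CUBLAS. The sketch below illustrates only the wrapper shape, under stated assumptions: the names `gemm_backend_fn`, `gemm_reference` and `init_backend` are hypothetical, and a plain CPU loop stands in for the CUBLAS call so that the example compiles without a GPU. It is not the code used in the project. The exported name `dgemm_` follows the common convention of a lowercase Fortran routine name with a trailing underscore, with every argument passed by pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend type: DGEMM arguments passed by value. */
typedef void (*gemm_backend_fn)(char ta, char tb, int m, int n, int k,
                                double alpha, const double *a, int lda,
                                const double *b, int ldb,
                                double beta, double *c, int ldc);

static int is_notrans(char t) { return t == 'N' || t == 'n'; }

/* CPU stand-in for the GPU library call (assumption: a real wrapper
 * would copy the matrices to the device and call CUBLAS here).
 * Computes C <- alpha * op(A) * op(B) + beta * C, column-major. */
static void gemm_reference(char ta, char tb, int m, int n, int k,
                           double alpha, const double *a, int lda,
                           const double *b, int ldb,
                           double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++) {
                double aip = is_notrans(ta) ? a[i + p * lda] : a[p + i * lda];
                double bpj = is_notrans(tb) ? b[p + j * ldb] : b[j + p * ldb];
                sum += aip * bpj;
            }
            /* BLAS convention: C is not read when beta is zero. */
            c[i + j * ldc] = alpha * sum
                           + (beta == 0.0 ? 0.0 : beta * c[i + j * ldc]);
        }
    }
}

static gemm_backend_fn active_backend = gemm_reference;
static int backend_ready = 0;

/* Placeholder for the one-off "initialise call" in the diagram. */
static void init_backend(void) { backend_ready = 1; }

/* Fortran-callable entry point: same name and argument order as the
 * reference BLAS DGEMM, so existing callers need no changes. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    assert(*m >= 0 && *n >= 0 && *k >= 0);
    if (!backend_ready)
        init_backend();
    active_backend(*transa, *transb, *m, *n, *k,
                   *alpha, a, *lda, b, *ldb, *beta, c, *ldc);
}
```

Keeping the exported symbol and argument list identical to the reference BLAS is what would let a rebuilt library be dropped in place of the original without changing R itself.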

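Results for 1000 x 1000 matrices. Columns: CPU average (s); GPU average (s), single precision; GPU average (s), double precision. Rows: 3.2 * A %*% B * A (3.2 times the matrix product A x B, multiplied element-wise by A); A %*% B (matrix A x matrix B); t(A) %*% B (transpose of matrix A x matrix B); solve(A) (invert matrix A). [Timing values not preserved in the transcript]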

Improvements. Columns: single precision (%); double precision (%). Rows: 3.2 * A %*% B * A; A %*% B; t(A) %*% B; solve(A). [Percentage values not preserved in the transcript]

Who to blame? A. Simply random? B. Me??? C. Stupid computer? D. Memory allocation.

NVIDIA GPU Architecture

NVIDIA GPU Architecture (contd.)

NVIDIA GPU Architecture (contd.)

Comparison with ATLAS Rblas. Improvement on multiplication, A %*% B: 319%. Improvement on inverting a matrix, solve(A): 281% (source: ). Limitations of ATLAS: the latest version is for Pentium 4 only.

Limitations of this project. Specific card. Cost: GeForce GTX 280, $582 (source: ). Precision? RMS of e-06 for inverting a 1024 x 1024 matrix on the single-precision cards. IEEE 754 deviations.

Where can I get this from?

Where to from now? Implementation of more BLAS functions. Getting rid of overhead. Adjusting LAPACK. Double-precision to single-precision and single-to-double conversion. Parallel extensions (CPU).
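R stores numeric data in double precision, so a wrapper that targets single-precision GPU routines would need to narrow its inputs and widen its results. The C sketch below shows that conversion step in its simplest form; the names `to_single` and `to_double` are hypothetical, and the slide does not describe how the project intended to implement it.

```c
#include <assert.h>
#include <stddef.h>

/* Narrow n doubles to floats. Values are rounded to the nearest
 * representable float, so precision beyond about 7 decimal digits
 * is lost at this step. */
void to_single(const double *src, float *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)src[i];
}

/* Widen n floats to doubles. This step is exact: every float value
 * is representable as a double. */
void to_double(const float *src, double *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = (double)src[i];
}
```

Each conversion is an extra pass over the data, which is one reason it would sit alongside "getting rid of overhead" as a design concern.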

Thank you: Luke Domanski, Dadong Wang, Pascal Valotton, Glenn Stone, Robert Dunne, CMIS/CSIRO staff.
