Robert Liao Tracy Wang CS252 Spring 2007

Overview
- Traditional GPU Architecture
- The NVIDIA G80 Processor
- CUDA (Compute Unified Device Architecture)
- LAPACK
- Performance and Issues

A Quick Note on Naming
"G80" is the codename for the GPU found in the following graphics cards:
- NVIDIA GeForce 8 Series graphics cards
- NVIDIA Quadro FX 4600
- NVIDIA Quadro FX 5600

Traditional GPUs
(Figure from Intel Corporation)

Traditional GPUs
GPUs talk polygons.
(Pipeline diagram: From CPU → Vertex Processor → Pixel/Fragment Creation → Process Fragments → Merge Output → Display)

Traditional GPUs
OpenGL and DirectX abstract this pipeline away.
(Same pipeline diagram: From CPU → Vertex Processor → Pixel/Fragment Creation → Process Fragments → Merge Output → Display)

The NVIDIA G80 Architecture
Reconfigurable processor pipeline.
(Figure from NVIDIA)

G80 History and Specifications
- Project started in Summer 2002
- 128 compute cores at 1.35 GHz in the GeForce 8800
- Hundreds of GFLOPS of peak floating-point throughput
- Stream processor architecture: one computing unit streams into another computing unit

The CUDA Interface to the G80
- Compute Unified Device Architecture
- A C interface for performing operations on the NVIDIA processor
- Provides traditional C memory semantics within the context of a GPU

Working with CUDA
- A custom compiler is provided to compile C code that the GPU can understand.
- The API functions provide a whole host of ways to interface with the GPU.
- CUDA libraries are provided for common tasks.
- The CUDA runtime helps with memory management.
- No DirectX or OpenGL knowledge needed!
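As an illustration of this workflow (our own sketch, not from the original slides; the kernel and its name are made up), here is a minimal program the custom compiler can build:

```cuda
#include <stdio.h>

// Trivial kernel: each of the 8 threads writes its own index.
// __global__ marks code that is compiled for and runs on the GPU.
__global__ void fill(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main(void) {
    int host[8];
    int *dev;
    cudaMalloc((void **)&dev, sizeof(host));   // allocate on the GPU
    fill<<<1, 8>>>(dev);                       // launch 1 block of 8 threads
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    for (int i = 0; i < 8; i++)
        printf("%d ", host[i]);
    printf("\n");
    return 0;
}
```

Built with something like `nvcc example.cu`; note that no DirectX or OpenGL call appears anywhere.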

Working with CUDA
Running C on the CPU: malloc, free, CPU code.
Running C on the GPU: cudaMalloc, cudaFree, GPU code.
Pointers on one side stay on one side. This will create issues for existing applications.
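A sketch of how that boundary is typically crossed (a hypothetical helper, not the authors' code):

```cuda
// Allocate a device-side copy of a host array. The returned pointer
// lives in the GPU world: it must never be dereferenced on the CPU,
// and must be released with cudaFree, not free().
float *to_device(const float *host_buf, size_t n) {
    float *dev_buf = NULL;
    cudaMalloc((void **)&dev_buf, n * sizeof(float));
    cudaMemcpy(dev_buf, host_buf, n * sizeof(float),
               cudaMemcpyHostToDevice);
    return dev_buf;
}
```

This is why existing applications cannot simply swap malloc for cudaMalloc: every pointer now carries an implicit "which side" tag that C's type system does not check.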

LAPACK
- Linear Algebra PACKage
- Implemented in Fortran 77
- Interfaces with BLAS (Basic Linear Algebra Subprograms)
- Professor James Demmel involved in the project

CLAPACK
An f2c'ed version of LAPACK. Very ugly!

    s_rsle(&io___8);
    do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
    e_rsle();
    if (nm < 1) {
        s_wsfe(&io___10);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;
        fatal = TRUE_;
    } else if (nm > 12) {
        s_wsfe(&io___11);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;

CUBLAS
- NVIDIA's CUDA-based implementation of BLAS
- Many functions are similar, but argument signatures are slightly different
- Adds some other functions as well: cublasAlloc, cublasFree
- CUBLAS lives in the GPU world
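A hedged sketch of the era's CUBLAS calling pattern (the wrapper name and dimensions are illustrative; matrices are column-major, as in Fortran BLAS; error checking omitted):

```cuda
#include "cublas.h"

// Multiply two n x n host matrices on the GPU: c = a * b.
void gpu_sgemm(int n, const float *a, const float *b, float *c) {
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);   // GPU-side malloc
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), a, n, dA, n); // host -> GPU
    cublasSetMatrix(n, n, sizeof(float), b, n, dB, n);
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cublasGetMatrix(n, n, sizeof(float), dC, n, c, n); // GPU -> host
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```

Note how the signature mirrors BLAS sgemm but takes device pointers for A, B, and C: the copies in and out are the caller's problem.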

CLAPACK and CUBLAS
- Putting them together is not as easy as just linking CLAPACK to CUBLAS.
- Matrices and data structures must be moved into GPU memory space.
- CLAPACK executes on the CPU; CUBLAS executes on the GPU.
- (Diagram: CLAPACK function → memory copy CPU→GPU → CUBLAS → memory copy GPU→CPU)

CLAPACK Concentration: General Solve (sgesv)
- Computes the solution to a linear system of equations A × X = B.
- To solve, A is factored into three matrices: P, L, and U.
  - P = permutation matrix
  - L = lower triangular
  - U = upper triangular
- Currently, our results cover the triangular factoring step.
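Written out, the factorization behind sgesv is:

```latex
% LU factorization with partial pivoting:
A = PLU
% so A X = B becomes P L U X = B, solved in three cheap steps:
L Y = P^{T} B   % forward substitution (L is lower triangular)
U X = Y         % back substitution   (U is upper triangular)
```

The expensive step is computing L and U; the permutation and the two triangular solves are comparatively cheap.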

Performance Results

Performance Issues
Much copying must be done from CPU to GPU and from GPU to CPU to communicate results.
Why not convert all pointers into GPU pointers? That would require CLAPACK itself to run in GPU memory. Could be someone's research paper…

Other Issues
Floating point behaves differently. Section 5.2 of the CUDA Programming Guide discusses deviations from IEEE-754:
- No support for denormalized numbers
- Underflowed numbers are flushed to zero
We noticed some results appearing as nonzero values instead of 0, for example.

Current State
Investigating some interesting memory issues on the GPU side: allocations mysteriously fail.

Conclusions To Date
Small data sets are better left on the CPU.
GPU calculations may not be appropriate for scientific computing, depending on needs.

Future Directions
- Moving all of LAPACK onto the GPU
- Resolving the copying issue. Perhaps resolved by unifying the CPU and GPU?
Want to give it a try?
- The Quadro FX 5600 is hard to find on the market (MSRP $2,999).
- GeForce 8 Series cards have the G80 processor: GeForce 8500GT ($99.99), GeForce 8800GTX ($939.99).