GPU Programming using BU Shared Computing Cluster

Presentation transcript:

GPU Programming using BU Shared Computing Cluster
Scientific Computing and Visualization, Boston University

GPU – graphics processing unit

Originally designed as a graphics processor. Nvidia's GeForce 256 (1999) was the first GPU: a single-chip processor for mathematically intensive tasks such as transforms of vertices and polygons, lighting, polygon clipping, texture mapping, and polygon rendering.

The GPU is a specialized and highly parallel microprocessor designed to offload the CPU and accelerate 2D and 3D rendering. Up until 1999 the term "GPU" didn't actually exist; NVIDIA coined it during its launch of the GeForce 256. The 3dfx Voodoo (1996) is considered one of the first true 3D game cards.

Modern GPUs are present in:
- Embedded systems
- Personal computers
- Game consoles
- Mobile phones
- Workstations

GPUs can be found in a wide range of systems, from desktops and laptops to mobile phones and supercomputers.

Traditional GPU workflow

The original GPUs were modeled after the concept of a fixed graphics pipeline, which simply transforms coordinates from 3D space (specified by the programmer) into 2D pixel space on the screen. Early GPUs implemented only the rendering stage in hardware, requiring the CPU to generate triangles for the GPU to operate on. As GPU technology progressed, more and more stages of the pipeline were implemented in hardware on the GPU, freeing up more CPU cycles.

Pipeline stages: vertex processing -> triangles, lines, points -> shading, texturing -> blending, Z-buffering.

GPGPU

In 1999-2000, computer scientists from various fields started using GPUs to accelerate a range of scientific applications. GPU programming required the use of graphics APIs such as OpenGL and Cg. In 2002, James Fung (University of Toronto) developed OpenVIDIA. NVIDIA invested heavily in the GPGPU movement and offered a number of options and libraries for a seamless experience for C, C++ and Fortran programmers.

The next step in the evolution of GPU hardware was the introduction of the programmable pipeline. In 2001, NVIDIA released the GeForce 3, which gave programmers the ability to program parts of the previously non-programmable pipeline. Instead of sending all the graphics description data to the GPU and having it simply flow through the fixed pipeline, the programmer could now send this data along with vertex "programs" (called shaders) that operate on the data while it is in the pipeline [1]. These shader programs were small "kernels", written in assembly-like shader languages. For the first time, there was a limited amount of programmability in the vertex processing stage of the pipeline.

GPGPU timeline

- In November 2006, NVIDIA launched CUDA, an API that allows algorithms to be coded for execution on GeForce GPUs using the C programming language.
- The Khronos Group defined OpenCL in 2008, supported on AMD, NVIDIA, and ARM platforms.
- In 2012, NVIDIA presented and demonstrated OpenACC, a set of directives that greatly simplify parallel programming of heterogeneous systems.

GPU hardware evolution has thus gone from an extremely specific, single-core, fixed-function hardware pipeline built just for graphics rendering, to a set of highly parallel and programmable cores for more general-purpose computation.

CPU vs GPU

Modern CPUs typically have 2 to 12 cores and are highly optimized for serial processing. GPUs consist of hundreds or even thousands of smaller, efficient cores designed for parallel performance.

SCC CPU vs SCC GPU

As an example, consider the CPUs and GPUs used on our Linux cluster.

Intel Xeon X5650:
- Clock speed: 2.66 GHz
- 4 instructions per cycle
- 6 cores
- 2.66 x 4 x 6 = 63.84 gigaflops double precision

NVIDIA Tesla M2070:
- Core clock: 1.15 GHz
- Single instruction per cycle
- 448 CUDA cores
- 1.15 x 1 x 448 = 515 gigaflops double precision

There is also a significant difference in memory size and bandwidth that should be taken into account when deciding where to perform the calculations.

Intel Xeon X5650:
- Memory size: 288 GB
- Bandwidth: 32 GB/sec

NVIDIA Tesla M2070:
- Memory size: 3 GB total
- Bandwidth: 150 GB/sec

GPU Computing Growth

                        2008      2013      Growth
CUDA-capable GPUs       100M      430M      x 4.3
CUDA downloads          150K      1.6M      x 10.67
Supercomputers          1         50        x 50
Academic papers         4,000     37,000    x 9.25

Looking back only five years and comparing the number of CUDA-capable GPUs in use or CUDA downloads, we can see a significant rise in general-purpose computing on GPUs. Future GPU generations will look more and more like wide-vector general-purpose CPUs, and eventually the two will be seamlessly combined as one.

GPU Acceleration of Applications

- GPU-accelerated libraries: seamless linking to GPU-enabled libraries (cuFFT, cuBLAS, Thrust, NPP, IMSL, CULA, cuRAND, etc.)
- OpenACC directives: simple directives for easy GPU acceleration of new and existing applications (PGI Accelerator)
- Programming languages: the most powerful and flexible way to design GPU-accelerated applications (C/C++, Fortran, Python, Java, etc.)

GPU Accelerated Libraries: Thrust

Thrust is a powerful library of parallel algorithms and data structures. It provides a flexible, high-level interface for GPU programming: for example, the thrust::sort algorithm delivers 5x to 100x faster sorting performance than STL and TBB. A minimal sketch is shown below.
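As an illustration, here is a minimal sketch of the Thrust interface; the element count and random data are arbitrary choices for the example:

    // Minimal Thrust sketch: fill a vector on the host, sort it on the GPU.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        thrust::host_vector<int> h_vec(1 << 20);   // 1M elements on the host
        for (size_t i = 0; i < h_vec.size(); ++i)
            h_vec[i] = rand();

        thrust::device_vector<int> d_vec = h_vec;  // copy data to the GPU
        thrust::sort(d_vec.begin(), d_vec.end());  // parallel sort on the GPU

        // bring the sorted data back to the host
        thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
        return 0;
    }

Note how the host/device transfer is expressed as a simple assignment; the STL-like iterator interface is what makes Thrust a high-level library compared with raw CUDA.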

GPU Accelerated Libraries: cuBLAS

cuBLAS is a GPU-accelerated version of the complete standard BLAS library:
- 6x to 17x faster performance than the latest MKL BLAS
- complete support for all 152 standard BLAS routines
- single, double, complex, and double-complex data types
- Fortran binding
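A minimal sketch of one cuBLAS call, the SAXPY routine (y = a*x + y); the allocate/copy boilerplate around it is standard CUDA, the sizes and values are illustrative, and error checking is omitted for brevity:

    // Minimal cuBLAS sketch: single-precision a*x + y on the GPU.
    // Compile and link with: nvcc saxpy_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f);

        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }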

GPU Accelerated Libraries (continued)

- cuSPARSE: sparse matrix routines
- NPP: NVIDIA Performance Primitives for image and signal processing
- cuFFT: Fast Fourier Transforms
- cuRAND: random number generation
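As one more sketch, a 1D complex-to-complex forward FFT with cuFFT; the transform length is an arbitrary choice for the example:

    // Minimal cuFFT sketch: in-place 1D complex-to-complex forward FFT.
    // Compile and link with: nvcc fft_demo.cu -lcufft
    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;                   // transform length (illustrative)
        cufftComplex* data;
        cudaMalloc(&data, n * sizeof(cufftComplex));
        // ... fill `data` with input samples, e.g. cudaMemcpy from the host ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);  // one 1D C2C transform
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }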

OpenACC Directives

The other way to add GPU acceleration to an application is through a set of OpenACC directives: simple compiler directives that work on multicore CPUs and many-core GPUs, with future integration into OpenMP planned. The serial code runs on the CPU, while the compiler generates GPU code for the loop nest inside the directive region:

    Program myscience
      ... serial code ...              ! runs on the CPU
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...        ! offloaded to the GPU
        enddo
      enddo
    !$acc end kernels
    End Program myscience
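To make this concrete, here is a hedged C sketch of the same idea (the function name, sizes, and data clauses are illustrative choices); with an OpenACC compiler, e.g. pgcc -acc, the annotated loop is offloaded to the GPU:

    /* Minimal OpenACC sketch in C: the pragma asks the compiler
       to generate GPU code for the loop. */
    #include <stdlib.h>

    void saxpy(int n, float a, float* restrict x, float* restrict y) {
        #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        int n = 1 << 20;
        float* x = malloc(n * sizeof(float));
        float* y = malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(n, 2.0f, x, y);   /* y becomes 2*x + y, computed on the GPU */

        free(x);
        free(y);
        return 0;
    }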

CUDA

CUDA is a programming-language extension to C/C++ and Fortran, designed for efficient general-purpose computation on the GPU. The slide's sketch, with the elided allocation and launch configuration filled in:

    // Each thread computes one element of the product z = x .* y.
    __global__ void kernel(float* x, float* y, float* z, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) z[idx] = x[idx] * y[idx];
    }

    int main() {
        int n = 1 << 20;                    // problem size (illustrative)
        float *x, *y, *z;
        cudaMalloc(&x, n * sizeof(float));  // allocate device memory
        cudaMalloc(&y, n * sizeof(float));
        cudaMalloc(&z, n * sizeof(float));
        // ... cudaMemcpy input data from the host into x and y ...

        int block_size = 256;
        int num_blocks = (n + block_size - 1) / block_size;
        kernel<<<num_blocks, block_size>>>(x, y, z, n);  // launch on the GPU

        // ... cudaMemcpy the result z back to the host ...
        cudaFree(x); cudaFree(y); cudaFree(z);
        return 0;
    }
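Assuming the source is saved under a hypothetical name such as kernel.cu, it would be compiled with NVIDIA's nvcc compiler (nvcc kernel.cu -o kernel).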

MATLAB with GPU acceleration

Use GPUs with MATLAB through the Parallel Computing Toolbox:
- GPU-enabled MATLAB functions such as fft, filter, and several linear algebra operations
- GPU-enabled functions in toolboxes: Communications System Toolbox, Neural Network Toolbox, Phased Array System Toolbox, and Signal Processing Toolbox
- CUDA kernel integration in MATLAB applications, using only a single line of MATLAB code

    % On the CPU:
    A = rand(2^16,1);
    B = fft(A);

    % On the GPU: moving the input with gpuArray is the only change;
    % results can be brought back to the host with gather(B).
    A = gpuArray(rand(2^16,1));
    B = fft(A);

Will Execution on a GPU Accelerate My Application?

A GPU can accelerate an application if it fits both of the following criteria:
- Computationally intensive: the time spent on computation significantly exceeds the time spent transferring data to and from GPU memory.
- Massively parallel: the computations can be broken down into hundreds or thousands of independent units of work.