Graphics Processing Units
References:
Computer Architecture, 5th edition, Hennessy and Patterson, 2012
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
http://www.moderngpu.com/intro/performance.html
http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU CPU: small fraction of chip used for arithmetic http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
CPU vs GPU GPU: large fraction of chip used for arithmetic http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed -GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
CPU vs GPU
Intel Haswell: 170 GFlops on quad-core at 3.4GHz
AMD Radeon R9 290: 4800 GFlops at 0.95GHz
Nvidia GTX 970: 5000 GFlops at 1.05GHz
GPGPU
General-Purpose GPU programming
Massively parallel
Scientific computing, brain simulations, etc.
In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including 2nd and 6th places
OpenCL vs CUDA
Both for GPGPU, with similar performance
OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?
CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: Thread, Block, Grid
Thread
Executes an instance of a kernel (program)
Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
Private memory used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6
Block
Set of concurrently executing threads
Threads cooperate via barrier synchronization and shared memory (fast but small)
Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
Grid
Array of thread blocks running the same kernel
Blocks read and write global memory (slow – hundreds of cycles)
Synchronize between dependent kernel calls
Nvidia Fermi Whitepaper pg 6
Hardware Mapping
GPU executes 1+ kernel (program) grids
Streaming Multiprocessor (SM) executes 1+ thread blocks
CUDA core executes thread
Fermi Architecture
Debuted in 2010; 3 billion transistors
512 CUDA cores: 32 CUDA cores per SM, 16 SMs per GPU
Each CUDA core executes 1 FP or integer instruction per cycle
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs
Each SM has a thread scheduler (in hardware) with fast context switch
Nvidia Fermi Whitepaper pg 7
CUDA core
Pipelined integer and FP units
FP unit: IEEE 754-2008, fused multiply-add
Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units to calculate source/destination addresses
4 Special Function Units: sine, cosine, reciprocal, sqrt
Nvidia Fermi Whitepaper pg 8
Warps
32 threads from a block are bundled into warps which execute the same instruction each cycle
This becomes the minimum size of SIMD data
Warps are implicitly synchronized
If threads branch in different directions, they step through both paths using predicated instructions
Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
Maxwell Architecture (2014)
16 streaming multiprocessors * 128 cores/SM
Programming CUDA
C code:
daxpy(n,2.0,x,y); // invoke
void daxpy(int n, double a, double *x, double *y)
{
  for(int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}
Programming CUDA
CUDA code:
__host__
int nblocks=(n+511)/512; // grid size
daxpy<<<nblocks,512>>>(n,2.0,x,y); // 512 threads/block
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i=blockIdx.x*blockDim.x + threadIdx.x;
  if(i<n) y[i] = a*x[i] + y[i];
}
n=8192, 512 threads/block: the grid holds 16 blocks, each block holds 16 warps of 32 threads
(figure: grid → block0 → warp0 → Y[0]=A*X[0]+Y[0], ...)
Moving data between host and GPU
int main() {
  double *x, *y, a, *dx, *dy;
  x = (double *)malloc(sizeof(double)*n);
  y = (double *)malloc(sizeof(double)*n);
  // initialize x and y…
  cudaMalloc((void **)&dx, n*sizeof(double));
  cudaMalloc((void **)&dy, n*sizeof(double));
  cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
  …
  daxpy<<<nblocks,512>>>(n,2.0,dx,dy); // kernel gets device pointers
  cudaThreadSynchronize();
  cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy);
  free(x); free(y);
}