Brief of GPU & CUDA
Chun-Yuan Lin

What is a GPU? Graphics Processing Units (GPUs).

The Challenge
Render infinitely complex scenes, at extremely high resolution, in 1/60th of a second. Luxo Jr. took 2-3 hours per frame to render on a Cray-1 supercomputer; today we can easily render it in 1/30th of a second, over 300,000x faster. Still not even close to where we need to be... but look how far we've come!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

PC/DirectX Shader Model Timeline
DirectX 5 (Riva 128); DirectX 6, multitexturing (Riva TNT); DirectX 7, T&L and TextureStageState (GeForce 256); DirectX 8, SM 1.x (GeForce 3, Cg); DirectX 9, SM 2.0 (GeForce FX); DirectX 9.0c, SM 3.0 (GeForce 6). Representative games along the way: Quake 3, Giants, Halo, Far Cry, UE3, Half-Life.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Why a Massively Parallel Processor: A Quiet Revolution and Potential Build-up
Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU). Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s. Until last year, the GPU was programmed only through a graphics API. There is a GPU in every PC and workstation: massive volume and potential impact.
[Figure: GFLOPS over time, GPUs vs. CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

G80 Characteristics
367 GFLOPS peak performance (25-50 times that of contemporary high-end microprocessors); 265 GFLOPS sustained for applications such as VMD. Massively parallel: 128 cores, 90 W. Massively threaded: sustains thousands of threads per application, with large speedups over high-end microprocessors on scientific and media applications such as medical imaging and molecular dynamics.
"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers." - John Stone, VMD group, Physics, UIUC
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Objective
To understand the major factors that dictate performance when using the GPU as a compute accelerator for the CPU: the feeds and speeds of the traditional CPU world, and the feeds and speeds when employing a GPU. To form a solid knowledge base for performance programming on modern GPUs: knowing yesterday, today, and tomorrow. The PC world is becoming flatter, and outsourcing of computation is becoming easier...
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Future Apps Reflect a Concurrent World
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These "super-apps" represent and model a physical, concurrent world. Various granularities of parallelism exist, but the programming model must not hinder parallel implementation, and data delivery needs careful management.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Stretching from Both Ends for the Meat
New GPUs cover the massively parallel parts of applications better than CPUs do. Attempts to grow current CPU architectures "out", or domain-specific architectures "in", have not succeeded. Using a strong combination of the two on real applications is a compelling idea: CUDA.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Bandwidth – Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance. This is especially true for massively parallel systems processing massive amounts of data. Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases, but ultimately performance falls back to what the "speeds and feeds" dictate.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Classic PC Architecture
The northbridge connects the three components that must communicate at high speed: CPU, DRAM, and video. Video also needs first-class access to DRAM; earlier NVIDIA cards connected over AGP, with transfers of up to 2 GB/s. The southbridge serves as a concentrator for slower I/O devices.
[Figure: CPU connected to the core-logic chipset (northbridge and southbridge)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

PCI Bus Specification
Connected to the southbridge. Originally 33 MHz, 32 bits wide, 132 MB/s peak transfer rate; more recently 66 MHz, 64 bits, 512 MB/s peak. Upstream bandwidth remained slow for devices (256 MB/s peak). PCI is a shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

An Example of the Physical Reality Behind CUDA
A CPU (the host) paired with a GPU with local DRAM (the device). The northbridge handles the "primary" PCIe connection to the video/GPU and to DRAM. PCIe x16 bandwidth is 8 GB/s total (4 GB/s in each direction).
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
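To make this link concrete, here is a minimal sketch (my own illustration, not from the slides) that times a host-to-device copy with CUDA events and reports the achieved PCIe bandwidth; the 256 MB buffer size and names like h_buf/d_buf are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 * 1024 * 1024;        // 256 MB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);        // pinned host memory for full PCIe speed
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
    printf("Host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}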

Parallel Computing on a GPU
The NVIDIA GPU computing architecture, accessed via a separate hardware interface, is found in laptops, desktops, workstations, and servers (G80 to G200). 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications, programmable in C with the CUDA tools. The multithreaded SPMD model exploits both application data parallelism and thread parallelism. Products: Tesla C870, Tesla D870, Tesla S870; Tesla C1060 (1 TFLOPS), Tesla S1070.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Tesla S1070
NVIDIA Tesla S1070: a 4-teraflop 1U system.

What is GPGPU?
General-Purpose computation using a GPU in applications other than 3D graphics: the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, and low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
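As a concrete, illustrative example of this fine-grain data parallelism (my own sketch, not from the slides), here is a minimal SAXPY kernel in CUDA C; the names and launch configuration are assumptions.

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one array element per thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover all n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);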

DirectX 5 / OpenGL 1.0 and Before
A hardwired pipeline: inputs are DIFFUSE, FOG, and TEXTURE; operations are SELECT, MUL, ADD, and BLEND. The result is blended with fog as RESULT = (1.0 - FOG) * COLOR + FOG * FOGCOLOR. Example hardware: RIVA 128, Voodoo 1, Reality Engine, Infinite Reality. No "ops", "stages", programs, or recirculation.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
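For clarity, the fog blend above can be written out as a plain per-channel function (an illustrative sketch; real hardware did this per fragment, hardwired):

#include <stdio.h>

float fog_blend(float color, float fog, float fog_color) {
    return (1.0f - fog) * color + fog * fog_color;   /* linear interpolation toward fog */
}

int main(void) {
    /* A white fragment halfway into grey fog: */
    printf("%.2f\n", fog_blend(1.0f, 0.5f, 0.5f));   /* prints 0.75 */
    return 0;
}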

The 3D Graphics Pipeline
Host side: application, scene management. GPU side: geometry, rasterization, pixel processing, ROP/FBI/display, backed by frame-buffer memory.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

The GeForce Graphics Pipeline
Host -> vertex control (with vertex cache) -> VS/T&L -> triangle setup -> raster -> shader (with texture cache) -> ROP -> FBI -> frame-buffer memory.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Feeding the GPU
The GPU accepts a sequence of commands and data: vertex positions, colors, and other shader parameters; texture-map images; and commands like "draw triangles with the following vertices until you get a command to stop drawing triangles". The application pushes data using Direct3D or OpenGL; the GPU can pull commands and data from system memory or from its local memory.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

CUDA
"Compute Unified Device Architecture": a general-purpose programming model in which the GPU is a dedicated, super-threaded, massively data-parallel co-processor. It comes with a targeted software stack (compute-oriented drivers, language, and tools) and a driver for loading computation programs onto the GPU: a standalone driver optimized for computation, an interface designed for compute (a graphics-free API), guaranteed maximum download and readback speeds, and explicit GPU memory management.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
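A minimal sketch of the explicit memory management CUDA exposes (illustrative names, not from the slides): allocate on the device, download, compute, read back, free.

#include <cuda_runtime.h>

void process_on_gpu(const float *h_in, float *h_out, int n) {
    float *d_buf = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    cudaMalloc((void **)&d_buf, bytes);                       /* device allocation */
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);   /* explicit download */
    /* ... launch kernels that operate on d_buf here ... */
    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);  /* explicit readback */
    cudaFree(d_buf);
}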

CUDA Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device that is a coprocessor to the CPU (the host), has its own DRAM (device memory), and runs many threads in parallel. Data-parallel portions of an application execute on the device as kernels, which run in parallel across many threads. GPU threads differ from CPU threads: they are extremely lightweight, with very little creation overhead, and the GPU needs thousands of threads for full efficiency, where a multi-core CPU needs only a few.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
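The host/device split looks roughly like the following sketch (sizes and names are illustrative assumptions): the host allocates device memory, launches a kernel across hundreds of thousands of lightweight threads, and waits for the device.

#include <cuda_runtime.h>

__global__ void add_one(float *d_data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one lightweight thread per element
    d_data[i] += 1.0f;
}

int main(void) {
    const int n = 1024 * 256;                       // 262,144 elements
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    add_one<<<1024, 256>>>(d_data);                 // ~260k threads, created almost for free
    cudaDeviceSynchronize();                        // host waits for the device kernel

    cudaFree(d_data);
    return 0;
}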

Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks, and all threads in a grid share a data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution and by efficiently sharing data through low-latency shared memory; two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of 2D blocks, e.g. Grid 1 with blocks (0,0) through (2,1), each block a 2D array of threads (0,0) through (4,2). Courtesy: NVIDIA]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
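A sketch of block-level cooperation (tile size and names are illustrative assumptions): threads within one block stage data in low-latency shared memory, synchronize, then read each other's values; blocks remain independent.

#define TILE 16

__global__ void reverse_each_tile(float *d_data) {
    __shared__ float tile[TILE];                    // visible to this block only
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = d_data[i];
    __syncthreads();                                // every thread in the block waits here
    d_data[i] = tile[TILE - 1 - threadIdx.x];       // safe: the whole tile is loaded
}

// Launched as a 1D grid of 1D blocks, e.g.:
// reverse_each_tile<<<numTiles, TILE>>>(d_data);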

CUDA Device Memory Space Overview
Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read per-grid constant memory; and read per-grid texture memory. The host can read/write global, constant, and texture memory.
[Figure: device grid of blocks, each block with its own shared memory and per-thread registers and local memory; global, constant, and texture memory shared across the grid and accessible from the host]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
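A toy kernel touching each of the spaces listed above (a sketch under the assumption of 128-thread blocks; all names are illustrative):

__constant__ float c_scale;                    // per-grid constant memory (read-only in kernels)

__global__ void touch_spaces(float *g_out) {   // g_out: per-grid global memory
    __shared__ float s_buf[128];               // per-block shared memory
    float r_val = c_scale * (float)threadIdx.x;  // r_val lives in per-thread registers

    s_buf[threadIdx.x] = r_val;
    __syncthreads();
    g_out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x];
}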

Global, Constant, and Texture Memories (Long-Latency Accesses)
Global memory is the main means of communicating read/write data between host and device, and its contents are visible to all threads. Texture and constant memories are initialized by the host, and their contents are likewise visible to all threads. Courtesy: NVIDIA
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
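Host-side initialization of constant memory is typically done with cudaMemcpyToSymbol, as in this minimal sketch (the filter array and names are illustrative assumptions):

#include <cuda_runtime.h>

__constant__ float c_filter[9];          // visible to all threads, read-only in kernels

void upload_filter(const float h_filter[9]) {
    /* Host writes constant memory before launching kernels that read it. */
    cudaMemcpyToSymbol(c_filter, h_filter, 9 * sizeof(float));
}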

What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than to data caching and flow control.
[Figure: CPU die dominated by control logic and cache alongside its DRAM, versus GPU die dominated by ALUs alongside its DRAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign

Resources
CUDA Zone; CUDA course materials.