GPU Programming
Shirley Moore (svmoore@utep.edu)
CPS 5401, Fall 2013
http://svmoore.pbworks.com/
October 24, 2013

Learning Objectives
After tonight's class, you should be able to:
- Describe GPU architecture
- Describe the characteristics of applications that run well on GPUs
- Compare and contrast different GPU programming models
- Describe how to build and run a GPU program using CUDA or OpenACC (a minimal CUDA example follows this list)
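
As a preview of that last objective, here is a minimal sketch of a complete CUDA program; the file name, kernel name, and array sizes are illustrative, not from the course materials.

```cuda
// vecadd.cu: a minimal CUDA program that adds two vectors on the GPU.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host (CPU) arrays.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (GPU) arrays and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    vecAdd<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Such a program is built and run with the NVIDIA compiler driver, e.g. `nvcc -o vecadd vecadd.cu && ./vecadd`.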

Textbooks
- D. Kirk and W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Elsevier, ISBN-13: 978-0-12-381472-2
- NVIDIA, NVIDIA CUDA Programming Guide (reference)
- T. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2005
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

Why Massively Parallel GPU Processors?
A quiet revolution and potential build-up:
- Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
- Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Until recently, programmed only through a graphics API
- A GPU in every PC and workstation: massive volume and potential impact
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
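
To see why both numbers matter, consider a memory-bound kernel such as single-precision SAXPY (y = a*x + y): it performs 2 floating-point operations per element but moves 12 bytes (read x, read y, write y), so at 86.4 GB/s it can sustain at most about (86.4 / 12) * 2 ≈ 14.4 GFLOPS, far below the 367 GFLOPS peak. Only kernels that reuse data on chip can approach peak arithmetic throughput.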

CPUs and GPUs have fundamentally different design philosophies.
[Figure: side-by-side chip diagrams. The CPU devotes most of its die area to control logic and cache, with a few large ALUs; the GPU devotes most of its area to many small ALUs, with minimal control and cache. Each chip connects to its own DRAM.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

Architecture of a CUDA-capable GPU
[Figure: block diagram. The host feeds an input assembler and a thread execution manager, which distribute work across an array of streaming multiprocessors; each has a parallel data cache (shared memory) plus texture and load/store paths to off-chip global memory.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
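
The "parallel data cache" in this diagram is what CUDA exposes to the programmer as per-block shared memory. The sketch below (the kernel name is made up for illustration) shows the common pattern of staging global-memory data through it; it assumes 256-thread blocks with a power-of-two size.

```cuda
// Sketch: staging data through the per-block "parallel data cache"
// (shared memory). Each 256-thread block sums its slice of x.
__global__ void blockSum(const float *x, float *blockTotals, int n) {
    __shared__ float tile[256];        // allocated in the parallel data cache
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread copies one element from slow global memory into shared memory.
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                   // wait until the whole tile is loaded

    // Tree reduction within the block, entirely in fast on-chip memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockTotals[blockIdx.x] = tile[0];  // one result per block
}
```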

GT200 Characteristics
- 1 TFLOPS peak performance (25-50 times that of contemporary high-end microprocessors)
- 265 GFLOPS sustained for applications such as VMD
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains thousands of threads per application
- 30-100x speedup over high-end microprocessors on scientific and media applications such as medical imaging and molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers." - John Stone, VMD group, Physics, UIUC
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

Future Apps Reflect a Concurrent World
- Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
- These "super-apps" represent and model the physical, concurrent world
- Various granularities of parallelism exist, but:
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management (see the sketch after this slide)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
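
The last point is concrete in CUDA: host-device data movement is explicit, and a standard idiom for managing it is to split the data into chunks and use streams to overlap transfers with computation. The sketch below uses assumed names and sizes; the `process` kernel is hypothetical.

```cuda
// Sketch: overlapping data delivery with computation using two CUDA streams.
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, chunks = 8, chunkN = n / chunks;
    const size_t chunkBytes = chunkN * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    // Pipeline: while one chunk computes, the next chunk's copy is in flight.
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t s = streams[c % 2];
        float *hp = h + c * chunkN, *dp = d + c * chunkN;
        cudaMemcpyAsync(dp, hp, chunkBytes, cudaMemcpyHostToDevice, s);
        process<<<chunkN / 256, 256, 0, s>>>(dp, chunkN);
        cudaMemcpyAsync(hp, dp, chunkBytes, cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();           // wait for every chunk to drain

    cudaStreamDestroy(streams[0]); cudaStreamDestroy(streams[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```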

Stretching Traditional Architectures
- Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing
- The game is to grow mainstream architectures "out" or domain-specific architectures "in"; CUDA is the latter
- Either way, the memory wall looms: we cannot transparently extend the current model to the future application space, and the meaning of the memory wall changes as architectures target the super-application space (lessons learned from Itanium)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

Samples of Previous Projects

Application | Description | Source (lines) | Kernel (lines) | % time in kernel
H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 |
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,104 | 281 |
PNS | Petri net simulation of a distributed system | 322 | 160 |
SAXPY | Single-precision SAXPY, used in Linpack's Gaussian elimination routine | 952 | 31 |
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-difference time-domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing matrix Q, a scanner's configuration, in MRI reconstruction | 490 | 33 |

Notes: LBM is a fluid simulation; about 20 versions were tried for each app. A sketch of the SAXPY kernel follows the table.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
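
For a sense of scale, the 31-line SAXPY entry above is essentially one small kernel computing y = a*x + y in single precision. Its CUDA core looks roughly like this (a sketch, not the course's actual code):

```cuda
// Sketch of single-precision SAXPY on the GPU: y = a*x + y.
// One thread per element; the grid is sized to cover all n elements.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Typical launch, with 256-thread blocks:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```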

Speedup of Applications
GeForce 8800 GTX vs. 2.2 GHz Opteron 248:
- 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
- All of the kernels have large numbers of simultaneously executing threads
- Most of the 10x kernels saturate memory bandwidth
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign