Programming Massively Parallel Processors
Lecture Slides for Chapter 1: Introduction
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE 408, University of Illinois, Urbana-Champaign

Course Goals
- Learn how to program massively parallel processors and achieve:
  - high performance
  - functionality and maintainability
  - scalability across future generations
- Acquire the technical knowledge required to achieve the above goals:
  - principles and patterns of parallel programming
  - processor architecture features and constraints
  - programming API, tools, and techniques (a minimal kernel sketch follows below)
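As a first taste of the programming style these goals imply, here is a minimal sketch (ours, not from the course materials) of a CUDA kernel in the SPMD style the course teaches: every thread runs the same function and uses its block and thread indices to pick its data element.

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one output element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

// A launch covering n elements with 256-thread blocks would look like:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);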

People (replace with your own)
- Professors:
  - Wen-mei Hwu: 215 CSL, w-hwu@uiuc.edu, 244-8270; use "ECE498AL" to start your e-mail subject line; office hours 2-3:30pm Wednesdays, or after class
  - David Kirk: Chief Scientist, NVIDIA, and Professor of ECE
- Teaching Assistant: ece498alTA@gmail.com
  - John Stratton (stratton@uiuc.edu); office hours TBA

Web Resources
- Web site: http://courses.ece.uiuc.edu/ece498/al
  - Handouts and lecture slides/recordings
  - Textbook, documentation, software resources
  - Note: while we'll make an effort to post announcements on the web, we can't guarantee it, and we won't make allowances for people who miss things in class
- Web board
  - Channel for electronic announcements
  - Forum for Q&A: the TAs and professors read the board, and your classmates often have answers

Bonus Days
- Each of you gets five bonus days
  - A bonus day is a no-questions-asked one-day extension that can be used on most assignments
- You can't turn in multiple versions of a team assignment on different days; all of you must combine individual bonus days into one team bonus day
- You can use multiple bonus days on the same assignment
- Weekends/holidays don't count toward the number of days of extension (Friday to Monday is a one-day extension)
- Intended to cover illnesses, interview visits, just needing more time, etc.

Using Bonus Days
- The web page has a bonus day form: print it out, sign it, and attach it to the work you're turning in
  - Everyone who's using a bonus day on a team assignment needs to sign the form
- The penalty for being late beyond bonus days is 10% of the possible points per day, again counting only weekdays (Spring/Fall break counts as weekdays)
- Things you can't use bonus days on: final project design documents, final project presentations, final project demo, exam

Academic Honesty
- You are allowed and encouraged to discuss assignments with other students in the class; getting verbal advice/help from people who've already taken the course is also fine
- Any reference to assignments from previous terms or web postings is unacceptable
- Any copying of non-trivial code is unacceptable
  - Non-trivial = more than a line or so
  - This includes reading someone else's code and then going off to write your own

Academic Honesty (cont.)
- Giving/receiving help on an exam is unacceptable
- Penalties for academic dishonesty:
  - Zero on the assignment for the first offense
  - Automatic failure of the course for repeat offenses

Team Projects
- Work can be divided among team members in any way that works for you
- However, each team member will demo the final checkpoint of each MP individually and will get a separate demo grade
  - The demo will include questions on the entire design
  - Rationale: if you don't know enough about the whole design to answer questions on it, you aren't involved enough in the MP

Lab Equipment
- Your own PCs running G80 emulators
  - Better debugging environment
  - Sufficient for the first couple of weeks
- NVIDIA G80/G280 boards
- QP/AC x86/GPU cluster accounts
  - Much, much faster, but with less debugging support

UIUC/NCSA AC Cluster
- 32 nodes: each a 4-GPU (GTX280, Tesla), 1-FPGA Opteron node at NCSA
  - GPUs donated by NVIDIA; FPGA donated by Xilinx
- Coulomb summation: 1.78 TFLOPS/node, a 271x speedup vs. an Intel QX6700 CPU core with SSE
- See also the UIUC/NCSA QP Cluster: http://www.ncsa.uiuc.edu/Projects/GPUcluster/
- A partnership between NCSA and academic departments

Text/Notes
- Textbook: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Elsevier, ISBN-13: 978-0-12-381472-2
- NVIDIA, NVIDIA CUDA Programming Guide (reference)
- T. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2005 (recommended)
- Lecture notes and recordings will be posted on the class web site

Why Massively Parallel Processors?
- A quiet revolution and potential build-up
  - Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
  - Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  - Until last year, programmable only through a graphics API
- A GPU in every PC and workstation: massive volume and potential impact

CPUs and GPUs have fundamentally different design philosophies
[Figure: side-by-side block diagrams contrasting a CPU (large control logic and cache, few ALUs, DRAM) with a GPU (many ALUs, DRAM)]

Architecture of a CUDA-capable GPU
[Figure: block diagram showing the host, input assembler, thread execution manager, processor arrays with parallel data caches, texture units, load/store paths, and global memory]
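To connect the diagram to code, here is a host-side sketch (assumed names, not from the slides) of the data path it shows: input data is staged into global memory, the thread execution manager dispatches the kernel's threads, and results return over the load/store path.

#include <cuda_runtime.h>

void runOnGpu(const float *h_in, float *h_out, int n)
{
    float *d_in = NULL, *d_out = NULL;     // pointers into GPU global memory
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_in, bytes);     // allocate in global memory
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device

    // myKernel is a placeholder for any __global__ function:
    // myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_in);
    cudaFree(d_out);
}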

GT200 Characteristics
- 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPS sustained for applications such as VMD
- Massively parallel: 128 cores, 90W
- Massively threaded: sustains thousands of threads per application
- 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
- "I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers." - John Stone, VMD group, Physics, UIUC

Future Apps Reflect a Concurrent World
- Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications"
  - Molecular dynamics simulation; video and audio coding and manipulation; 3D imaging and visualization; consumer game physics; virtual reality products
  - These "super-apps" represent and model a physical, concurrent world
- Various granularities of parallelism exist, but...
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management
(Speaker note: Do not go over all of the different applications; let them read them instead.)

Stretching Traditional Architectures
- Traditional parallel architectures cover some super-applications
  - DSP, GPU, network apps, scientific computing
- The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  - CUDA is the latter
(Speaker note: This leads to the memory wall: why we can't transparently extend the current model to the future application space, and how the meaning of the memory wall changes as we transition to architectures that target the super-application space. Lessons learned from Itanium.)

Samples of Previous Projects

Application | Description                                                                  | Source lines | Kernel lines | % time
H.264       | SPEC '06 version, change in guess vector                                     | 34,811       | 194          | 35%
LBM         | SPEC '06 version, change to single precision and print fewer reports         | 1,481        | 285          | >99%
RC5-72      | Distributed.net RC5-72 challenge client code                                 | 1,979        | 218          |
FEM         | Finite element modeling, simulation of 3D graded materials                   | 1,874        | 146          | 99%
RPES        | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion      | 1,104        | 281          |
PNS         | Petri Net simulation of a distributed system                                 | 322          | 160          |
SAXPY       | Single-precision saxpy, used in Linpack's Gaussian elimination routine       | 952          | 31           |
TRACF       | Two Point Angular Correlation Function                                       | 536          | 98           | 96%
FDTD        | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation| 1,365        | 93           | 16%
MRI-Q       | Computing a matrix Q, a scanner's configuration in MRI reconstruction        | 490          | 33           |

(Speaker notes: LBM is a fluid simulation. About 20 versions were tried for each app.)
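For concreteness, the SAXPY entry in the table (y = a*x + y in single precision) fits in a handful of lines on the GPU. The sketch below is a generic CUDA version, not the course's actual 31-line kernel; one multiply-add per element means its speed is set by memory bandwidth rather than arithmetic.

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // single-precision a*x + y
}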

Speedup of Applications
- GeForce 8800 GTX vs. 2.2 GHz Opteron 248
- A 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
- "Need for Speed" seminar series organized by Patel and Hwu this semester
(Speaker notes: All kernels have large numbers of simultaneously executing threads (ask me if you want to see details). Most of the 10x kernels saturate memory bandwidth, consistent with the roughly 10x bandwidth ratio (86.4 GB/s vs. 8.4 GB/s) noted earlier.)