© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign ECE 498AL1 Programming Massively Parallel Processors.

Slides:

Advertisements

Similar presentations

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,

Advertisements

Lecture 1 ECE 412: Microprocessor Laboratory Lecture 1: Course Introduction.

Intro to GPU’s for Parallel Computing. Goals for Rest of Course Learn how to program massively parallel processors and achieve – high performance – functionality.

GPU System Architecture Alan Gray EPCC The University of Edinburgh.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 14: Basic Parallel Programming Concepts.

GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.

Team Members: Tyler Drake Robert Wrisley Kyle Von Koepping Justin Walsh Faculty Advisors: Computer Science – Prof. Sanjay Rajopadhye Electrical & Computer.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.

CS6963 Parallel Programming for Graphics Processing Units (GPUs) Lecture 1: Introduction L1: Introduction 1.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors.

Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 April 4, 2013 © Barry Wilkinson CUDAIntro.ppt.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012 Emergence of GPU systems and clusters for general purpose High Performance Computing.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.

(1) ECE 8823: GPU Architectures Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute of Technology NVIDIA Keplar.

Computer Graphics Graphics Hardware

UIUC CSL Global Technology Forum © NVIDIA Corporation 2007 Computing in Crisis: Challenges and Opportunities David B. Kirk.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 1 Programming Massively Parallel Processors Lecture Slides for Chapter 1: Introduction.

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 3, 2011outline.1 ITCS 6010/8010 Topics in Computer Science: GPU Programming for High Performance.

GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Spring 2012.

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.

Emergence of GPU systems and clusters for general purpose high performance computing ITCS 4145/5145 April 3, 2012 © Barry Wilkinson.

Course Information Sarah Diesburg Operating Systems COP 4610.

Principles of Computer Science I Honors Section Note Set 1 CSE 1341 – H 1.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CMPS 5433 Dr. Ranette Halverson Programming Massively.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Basic Parallel Programming Concepts Computational.

Chun-Yuan Lin Introduction to Computer Graphics 2015/11/11 1 Ch00.

Multicore Computing Lecture 1 : Course Overview Bong-Soo Sohn Associate Professor School of Computer Science and Engineering Chung-Ang University.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors.

GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Fall 2012.

1 Ceng 545 GPU Computing. Grading 2 Midterm Exam: 20% Homeworks: 40% Demo/knowledge: 25% Functionality: 40% Report: 35% Project: 40% Design Document:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 18: Final Project Kickoff.

ISCA Panel June 7, 2005 Wen-mei W. Hwu —University of Illinois at Urbana-Champaign 1 Future mass apps reflect a concurrent world u Exciting applications.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 4: CUDA Threads – Part 2.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 Final Project Notes.

1 Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming SIGCSE The 42 nd ACM Technical.

GPU Programming Shirley Moore CPS 5401 Fall 2013

Chun-Yuan Lin Brief of GPU&CUDA. What is GPU? Graphics Processing Units 2015/12/16 2 GPU.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Lecture 13: Basic Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 15: Basic Parallel Programming Concepts.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Programming Massively Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Graphic Processing Processors (GPUs) Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 Final Project Kickoff.

Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014.

Multicore Computing Lecture 1 : Course Overview Bong-Soo Sohn Associate Professor School of Computer Science and Engineering Chung-Ang University.

Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 July 12, 2012 © Barry Wilkinson CUDAIntro.ppt.

Brief of GPU&CUDA Chun-Yuan Lin.

CS427 Multicore Architecture and Parallel Computing

Introduction to Computer Graphics

CS/EE 217 – GPU Architecture and Parallel Programming

© David Kirk/NVIDIA and Wen-mei W. Hwu,

ECE408 / CS483 Applied Parallel Programming Lecture 1: Introduction

CS/EE 217 – GPU Architecture and Parallel Programming

Introduction to Heterogeneous Parallel Computing

ECE 8823: GPU Architectures

Mattan Erez The University of Texas at Austin

Human Media Multicore Computing Lecture 1 : Course Overview

Human Media Multicore Computing Lecture 1 : Course Overview

Human Media Multicore Computing Lecture 1 : Course Overview

EE 147 – GPU Computing and Programming

Presentation transcript:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign ECE 498AL1 Programming Massively Parallel Processors Lecture 1: Introduction

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Course Goals Learn how to program massively parallel processors and achieve –high performance –functionality and maintainability –scalability across future generations Acquire technical knowledge required to achieve the above goals –principles and patterns of parallel programming –processor architecture features and constraints –programming API, tools and techniques

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign People Professors: Wen-mei Hwu 215 CSL, use ECE498AL1 to start your subject line Office hours: 10-11:30am Tuesdays; or after class David Kirk Chief Scientist, NVIDIA and Professor of ECE Teaching Assistants: John Stratton Kwangwei Hwang Office hours: TBA

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Web Resources Web site: –Handouts and lecture slides/recordings –Documentation, software resoucres –Note: While we’ll make an effort to post announcements on the web, we can’t guarantee it, and won’t make any allowances for people who miss things in class. Web board - coming soon –Channel for electronic announcements –Forum for Q&A - the TAs and Professors read the board, and your classmates often have answers Compass - grades

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Grading This is a lab oriented course! Test: 20% Labs: 30% –Demo/knowledge: 25% –Functionality: 40% –Report: 35% Project: 50% –Design Document: 25% –Project Presentation: 25% –Demo/Final Report: 50%

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Bonus Days Each of you get five bonus days –A bonus day is a no-questions-asked one-day extension that can be used on most assignments –You can’t turn in multiple versions of a team assignment on different days; all of you must combine individual bonus days into one team bonus day. –You can use multiple bonus days on the same thing –Weekends/holidays don’t count for the number of days of extension (Friday-Monday is one day extension) Intended to cover illnesses, interview visits, just needing more time, etc.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Using Bonus Days Web page has a bonus day form. Print it out, sign, and attach to the thing you’re turning in. –Everyone who’s using a bonus day on an team assignment needs to sign the form Penalty for being late beyond bonus days is 10% of the possible points/day, again counting only weekdays (Fall break will count as weekdays) Things you can’t use bonus days on: –Final project design documents, final project presentations, final project demo

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Academic Honesty You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who’ve already taken the course is also fine. Any reference to assignments from previous terms or web postings is unacceptable Any copying of non-trivial code is unacceptable –Non-trivial = more than a line or so –Includes reading someone else’s code and then going off to write your own.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Academic Honesty (cont.) Giving/receiving help on a test is unacceptable Penalties for academic dishonesty: –Zero on the assignment for the first occasion –Automatic failure of the course for repeat offenses

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Team Projects Work can be divided up between team members in any way that works for you However, each team member will demo the final checkpoint of each MP individually, and will get a separate demo grade –This will include questions on the entire design –Rationale: if you don’t know enough about the whole design to answer questions on it, you aren’t involved enough in the MP

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Lab Equipment Location –Distributed (your own laptop/PC) first few weeks Windows PCs running G80 emulators –Better debugging environment –Sufficient for first few weeks NVIDIA G80 boards –Trusted Illiac Prototype Server accounts –Much much faster but less debugging support

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Text/Notes 1.NVIDIA, NVidia CUDA Programming Guide, NVidia, M. Pharr (ed.), GPU Gems 2 – Programming Techniques for High Performance Graphics and General-Purpose Computation, Addison Wesley, 2005 (recomm.) 3.T. Mattson, et al “Patterns for Parallel Programming,” Addison Wesley, 2005 (recomm.) 4.Lecture notes by Prof. Hwu and Prof. Kirk will be posted at the class web site

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Tentative Schedule/Make-up Classes Regular make-up classes –Tuesdays, 5:10-6:30 during selected weeks, location 141 CSL Week 1: –Wed, 8/22 : Lecture 1: Introduction –MP-0, installation, run hello world Week 2: –Tue, 8/28 (makeup): Lecture 2 – GPU Computing and CUDA Intro –Wed, 8/29 : Lecture 3 – GPU Computing and CUDA Intro –MP-1, simple matrix multiplication and simple vector reduction Week 3: DK in IL –Tue, 9/4 (makeup) : Lecture 4 - CUDA memory model, tiling –Wed, 9/5 (makeup) : Lecture 5 – GPU History –MP-2, tiled matrix multiplication Week 4 –Mon, 9/10: Lecture 6 – CUDA Hardware –Tue, 9/11 : Lecture 7 – GPU Compute Core –Wed, 9/12, Lecture 8 – GPU Compute Core –MP-3, simple and tiled 2D convolution

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign ECE498AL Development History 1/0611/061/072/073/07 … Kirk gives guest lecture at Hwu’s class. Blahut challenges Kirk and Hwu Kirk visits UIUC, picking UIUC over others NVIDIA announces G80, Hwu/students training at NVIDIA Kirk and Hwu in panic mode cooking up course material NVIDIA releases CUDA, UIUC lecture and lab material went on- line, G80 cards NAMD and other apps group report results and post projects

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign A quiet revolution and potential build-up –Calculation: 367 GFLOPS vs. 32 GFLOPS –Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s –Until last year, programmed through graphics API –GPU in every PC and workstation – massive volume and potential impact GFLOPS G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce 6800 Ultra NV35 = GeForce FX 5950 Ultra NV30 = GeForce FX 5800 Why Massively Parallel Processor

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign 16 highly threaded SM’s, >128 FPU’s, 367 GFLOPS, 768 MB DRAM, 86.4 GB/S Mem BW, 4GB/S BW to CPU GeForce 8800

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign G80 Characteristics 367 GFLOPS peak performance (25-50 times of current high-end microprocessors) 265 GFLOPS sustained for apps such as VMD Massively parallel, 128 cores, 90W Massively threaded, sustains 1000s of threads per app times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics “I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers.” -John Stone, VMD group, Physics UIUC

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Future Apps Reflect a Concurrent World Exciting applications in future mass computing market have been traditionally considered “ supercomputing applications ” –Molecular dynamics simulation, Video and audio coding and manipulation, 3D imaging and visualization, Consumer game physics, and virtual reality products –These “Super-apps” represent and model physical, concurrent world Various granularities of parallelism exist, but… –programming model must not hinder parallel implementation –data delivery needs careful management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Stretching Traditional Architectures Traditional parallel architectures cover some super-applications –DSP, GPU, network apps, Scientific The game is to grow mainstream architectures “ out ” or domain-specific architectures “ in ” –CUDA is latter

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Previous Projects ApplicationDescriptionSourceKernel% time H.264 SPEC ‘06 version, change in guess vector 34, % LBM SPEC ‘06 version, change to single precision and print fewer reports 1,481285>99% RC5-72 Distributed.net RC5-72 challenge client code 1,979218>99% FEM Finite element modeling, simulation of 3D graded materials 1, % RPES Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion 1, % PNS Petri Net simulation of a distributed system >99% SAXPY Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine 95231>99% TRACF Two Point Angular Correlation Function % FDTD Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation 1, % MRI-Q Computing a matrix Q, a scanner’s configuration in MRI reconstruction 49033>99%

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign Speedup of Applications GeForce 8800 GTX vs. 2.2GHz Opteron  speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads 25  to 400  speedup if the function’s data requirements and control flow suit the GPU and the application is optimized Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel