EECS 583 – Class 21
Research Topic 3: Compilation for GPUs
University of Michigan
December 12, 2011 – Last Class!!
- 1 - Announcements & Reading Material
v This class reading
  » "Program optimization space pruning for a multithreaded GPU," S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S.-Z. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.
v Project demos
  » Dec 13-16, 19 (the 19th is full)
  » Send me email with a date and a few timeslots if your group does not have a slot
  » Almost all groups have signed up already
- 2 - Project Demos
v Demo format
  » Each group gets 30 mins
    Strict deadlines because many groups are back to back – don't be late!
  » Plan for 20 mins of presentation (no more!), 10 mins of questions
    Some slides are helpful; try to have all group members say something
    Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
    Demos or real code examples are good
v Report
  » 5 pages double spaced, including figures – same content as the presentation
  » Due either when you do your demo or Dec 19 at 6pm
- 3 - Midterm Results
Mean: 97.9, StdDev: 13.9, High: 128, Low: 50
If you did poorly, all is not lost. This is a grad class; the project is by far the most important part!!
The answer key is on the course webpage; pick up graded exams from Daya.
- 4 - Why GPUs?
- 5 - Efficiency of GPUs
v High memory bandwidth: GTX 285: 159 GB/s; i7: 32 GB/s
v High flop rate: GTX 285: 1062 GFLOPS; i7: 102 GFLOPS
v High double-precision flop rate: GTX 285: 88.5 GFLOPS; GTX 480: 168 GFLOPS; i7: 51 GFLOPS
v High flops per watt: GTX 285: 5.2 GFLOP/W; i7: 0.78 GFLOP/W
v High flops per dollar: GTX 285: 3.54 GFLOP/$; i7: 0.36 GFLOP/$
- 6 - GPU Architecture
[Diagram: 30 streaming multiprocessors (SM 0 – SM 29), each with shared memory, a register file, and 8 scalar cores (0–7), connected through an interconnection network to global (device) memory; the GPU attaches to the CPU and host memory over a PCIe bridge]
- 7 - CUDA
v "Compute Unified Device Architecture"
v General-purpose programming model
  » User kicks off batches of threads on the GPU
v Advantages of CUDA
  » Interface designed for compute – a graphics-free API
  » Orchestration of on-chip cores
  » Explicit GPU memory management
  » Full support for integer and bitwise operations
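The bullets above can be made concrete with a minimal CUDA sketch (illustrative only; the kernel name, sizes, and omitted host-side data transfers are assumptions, not from the slides). It shows a batch of threads being kicked off and the explicit GPU memory management the slide mentions:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread handles one array element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Explicit GPU memory management: the programmer allocates device memory.
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    // (cudaMemcpy of input data from the host is omitted in this sketch.)
    // Kick off a batch of threads: enough 256-thread blocks to cover n.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```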
- 8 - Programming Model
[Diagram: over time, the host launches Kernel 1 on the device as Grid 1, then Kernel 2 as Grid 2]
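In host code, that timeline corresponds to sequential kernel launches, each of which creates a fresh grid of thread blocks (the kernel names and launch shapes below are hypothetical stand-ins for the figure's Kernel 1 and Kernel 2):

```cuda
// Hypothetical stand-ins for the figure's two kernels.
__global__ void kernel1(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void kernel2(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

void runPipeline(float *d_data) {
    // Grid 1: 64 blocks of 256 threads each.
    kernel1<<<64, 256>>>(d_data);
    // Grid 2: a differently shaped grid. On the default stream, Grid 2
    // starts only after every block of Grid 1 has finished.
    kernel2<<<128, 128>>>(d_data);
}
```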
- 9 - GPU Scheduling
[Diagram: the thread blocks of Grid 1 are distributed across the 30 SMs, each SM with its own shared memory, registers, and 8 cores]
- 10 - Warp Generation
[Diagram: Blocks 0–3 are assigned to SM0, sharing its shared memory and registers; within a block, threads with ids 0–31 form Warp 0 and threads 32–63 form Warp 1]
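The grouping in the diagram is purely by linear thread id, which a kernel can compute directly (a sketch assuming a 1-D block, as in the figure; the kernel and array names are illustrative):

```cuda
__global__ void warpInfo(int *warpOf) {
    // Threads within a block are grouped into warps of 32 by linear thread id.
    int tid  = threadIdx.x;   // 1-D block assumed, matching the figure
    int warp = tid / 32;      // threads 0-31 -> warp 0, threads 32-63 -> warp 1
    int lane = tid % 32;      // position of this thread within its warp
    (void)lane;
    warpOf[blockIdx.x * blockDim.x + tid] = warp;  // record each thread's warp
}
```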
- 11 - Memory Hierarchy
v Per-thread registers: int RegisterVar
v Per-thread local memory: int LocalVarArray[10]
v Per-block shared memory: __shared__ int SharedVar
v Per-app global memory (host-accessible): __device__ int GlobalVar
v Per-app constant memory (host-accessible): __constant__ int ConstVar
v Per-app texture memory (host-accessible): texture TextureVar
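A small kernel can exercise most of these spaces at once (a sketch, assuming a 256-thread 1-D block; the names coeff, gScale, and smooth are illustrative, not from the slides):

```cuda
__constant__ float coeff;            // per-app constant memory, read-only in kernels
__device__   float gScale;           // per-app global memory, visible to all grids

__global__ void smooth(float *out, const float *in) {
    __shared__ float tile[256];      // per-block shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i] * gScale;        // scalar 'v' lives in a per-thread register
    tile[threadIdx.x] = v;
    __syncthreads();                 // make the block's shared-memory writes visible
    // Blend with the left neighbor's value, read through shared memory.
    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : v;
    out[i] = coeff * (v + left);
}
```

Large per-thread arrays (like the slide's LocalVarArray[10]) may be spilled by the compiler from registers into per-thread local memory, which physically resides in device memory.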
- 12 - Discussion Points
v Who has written CUDA? How have you optimized it, and how long did it take?
  » Did you tune using a better algorithm than trial and error?
v Is there any hope of building a GPU compiler that can automatically do what CUDA programmers do?
  » How would you do it?
  » What's the input language? C, C++, Java, StreamIt?
v Are GPUs a compiler writer's best friend or worst enemy?
v What about non-scientific codes – can they be mapped to GPUs?
  » How can GPUs be made more "general"?