Presentation is loading. Please wait.

Presentation is loading. Please wait.

EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!

Similar presentations


Presentation on theme: "EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!"— Presentation transcript:

1 EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!

2 - 1 - Announcements & Reading Material v This class reading »“Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Straton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008. v Project demos »Dec 13-16, 19 (19th is full) »Send me email with date and a few timeslots if your group does not have a slot »Almost all have signed up already

3 - 2 - Project Demos v Demo format »Each group gets 30 mins  Strict deadlines because many back to back groups  Don’t be late! »Plan for 20 mins of presentation (no more!), 10 mins questions  Some slides are helpful, try to have all group members say something  Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results  Demo or real code examples are good v Report »5 pg double spaced including figures – same content as presentation »Due either when you do you demo or Dec 19 at 6pm

4 - 3 - Midterm Results Mean: 97.9StdDev: 13.9High: 128Low: 50 If you did poorly, all is not lost. This is a grad class, the project is by far most important!! Answer key on the course webpage, pick up graded exams from Daya

5 - 4 - Why GPUs? 4

6 - 5 - Efficiency of GPUs 5 High Memory Bandwidth GTX 285 : 159 GB/Sec i7 : 32 GB/Sec High Flop Rate i7 :102 GFLOPS GTX 285 :1062 GFLOPS High DP Flop Rate i7 : 51 GFLOPS GTX 285 : 88.5 GFLOPS GTX 480 : 168 GFLOPS High Flop Per Watt GTX 285 : 5.2 GFLOP/W i7 : 0.78 GFLOP/W High Flop Per Dollar GTX 285 : 3.54 GFLOP/$ i7 : 0.36 GFLOP/$

7 - 6 - GPU Architecture 6 Shared Regs 0 1 2 3 4 5 6 7 Interconnection Network Global Memory (Device Memory) PCIe Bridge CPU Host Memory Shared Regs 0 1 2 3 4 5 6 7 Shared Regs 0 1 2 3 4 5 6 7 Shared Regs 0 1 2 3 4 5 6 7 SM 0SM 1SM 2SM 29

8 - 7 - CUDA v “Compute Unified Device Architecture” v General purpose programming model »User kicks off batches of threads on the GPU v Advantages of CUDA »Interface designed for compute - graphics free API »Orchestration of on chip cores »Explicit GPU memory management »Full support for Integer and bitwise operations 7

9 - 8 - Programming Model 8 Host Kernel 1 Kernel 2 Grid 1 Grid 2 Device Time

10 - 9 - Grid 1 GPU Scheduling 9 SM 0 Shared Regs 0 1 2 3 4 5 6 7 SM 1 Shared Regs 0 1 2 3 4 5 6 7 SM 2 Shared Regs 0 1 2 3 4 5 6 7 SM 3 Shared Regs 0 1 2 3 4 5 6 7 SM 30 Shared Regs 0 1 2 3 4 5 6 7

11 - 10 - Warp Generation Block 0 Block 1 Block 3 Shared Registers 0 1 2 4 5 3 6 7 SM0 10 Block 2 ThreadId 0 31 3263 Warp 0Warp 1

12 - 11 - Memory Hierarchy 11 Per Block Shared Memory __shared__ int SharedVar Block 0 Per-thread Register int LocalVarArray[10] Per-thread Local Memory int RegisterVar Thread 0 Grid 0 Per app Global Memory Host __global__ int GlobalVar __constant__ int ConstVar Texture TextureVar Per app Texture Memory Per app Constant Memory Device

13 - 12 - Discussion Points v Who has written CUDA, how have you optimized it, how long did it take? »Did you do tune using a better algorithm than trial and error? v Is there any hope to build a GPU compiler that can automatically do what CUDA programmers do? »How would you do it? »What’s the input language? C, C++, Java, StreamIt? v Are GPUs a compiler writers best friend or worst enemy? v What about non-scientific codes, can they be mapped to GPUs? »How can GPUs be made more “general”?


Download ppt "EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!"

Similar presentations


Ads by Google