NVIDIA Fermi Architecture Joseph Kider University of Pennsylvania CIS 565 - Fall 2011
Administrivia Project checkpoint on Monday
Sources Patrick Cozzi Spring 2011 NVIDIA CUDA Programming Guide CUDA by Example Programming Massively Parallel Processors
G80, GT200, and Fermi November 2006: G80 June 2008: GT200 March 2010: Fermi (GF100) Fermi design started around 2005. Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
New GPU Generation What are the technical goals for a new GPU generation?
New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How?
New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How? Advance programmability. In what ways? Given your experience with programming the G80, what features do you want?
Fermi: What’s More? More total cores (SPs) – not more SMs though More registers: 32K per SM More shared memory: up to 48K per SM More Special Function Units (SFUs)
Fermi: What’s Faster? Faster double precision – 8x over GT200 Faster atomic operations. What for? 5-20x Faster context switches Between applications – 10x Between graphics and compute, e.g., OpenGL and CUDA Double precision runs at half the speed of single precision; on GT200 it ran at 1/8 speed. Atomics are faster due to the L1/L2 caches and more atomic hardware units. Faster atomics enable more CPU-like applications such as ray tracing and random scatter. Application context switches take under 25 microseconds.
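As a concrete illustration of the random-scatter pattern that faster atomics enable, here is a minimal histogram kernel sketch; the names and the modulo binning are invented for illustration, not taken from the lecture:

    __global__ void histogram(const unsigned int* input, int n,
                              unsigned int* bins, int numBins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            // Many threads may target the same bin at once; atomicAdd
            // serializes only the colliding updates. On Fermi this is
            // cheaper than on GT200 thanks to the L2 cache and the
            // additional atomic hardware units.
            atomicAdd(&bins[input[i] % numBins], 1u);
        }
    }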
Fermi: What’s New? L1 and L2 caches. For compute or graphics? Dual warp scheduling Concurrent kernel execution C++ support Full IEEE 754-2008 support in hardware Unified address space Error Correcting Code (ECC) memory support Fixed function tessellation for graphics Previous L1/L2 caches were for textures only. The motivation for cache is more compute than graphics: graphics is streaming, so all data is brought on-chip each frame. C++ support includes virtual functions, new/delete, and try/catch. IEEE support includes denormals, so Fermi doesn’t flush to zero; denormals are a pain to do in hardware, and CPUs would trap and handle them in software at a cost of thousands of cycles. Previous GPUs used IEEE 754-1985.
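A hedged sketch of what the C++ support looks like in device code, assuming compute capability 2.0; the Shape/Circle classes are invented for illustration. Virtual dispatch and device-side new/delete run on the GPU itself:

    class Shape
    {
    public:
        __device__ virtual float area() const = 0;
        __device__ virtual ~Shape() {}
    };

    class Circle : public Shape
    {
        float r;
    public:
        __device__ Circle(float radius) : r(radius) {}
        __device__ virtual float area() const { return 3.14159265f * r * r; }
    };

    __global__ void virtualDemo(float* out)
    {
        Shape* s = new Circle(2.0f);  // device-side new allocates from a heap in global memory
        *out = s->area();             // virtual call resolved on the GPU
        delete s;
    }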
G80, GT200, and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
G80, GT200, and Fermi The number of cores has increased at roughly the same rate as the number of transistors, following Moore’s Law. Granted, double precision support was added, as well as L1 and L2 caches and more complex dispatch (dual warp scheduling). Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
GT200 and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Block Diagram GF100 16 SMs Each with 32 cores 512 total cores Each SM hosts up to 48 warps, or 1,536 threads (48 warps × 32 threads/warp) In flight, up to 24,576 threads (16 SMs × 1,536 threads) Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM Why 32 cores per SM instead of 8? Why not more SMs? G80 – 8 cores GT200 – 8 cores GF100 – 32 cores Enables dual-warp scheduling. 2 cycles per warp now, instead of 4.
Fermi SM Dual warp scheduling 32K registers 32 cores Floating point and integer unit per core 16 load/store units 4 SFUs Why? Dual warp scheduling allows increased use of the execution units by enabling a better mix of instructions. G80 had 8K registers per SM, GT200 had 16K. SFUs handle log, sin, cos, etc. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM 16 SMs * 32 cores/SM = 512 floating point operations per cycle Why not in practice? In practice, there are memory accesses and branches. Also, look at the block diagram: there are only 16 load/store units and 4 SFUs, so many calls to log(), for example, will reduce parallelism. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
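A small sketch of code that leans on the SFUs: __sinf and __logf are the real fast, reduced-precision intrinsics that map to SFU instructions, while sinf and logf are slower, more accurate software sequences. With only 4 SFUs per 32 cores, a kernel dominated by such calls can bottleneck on them:

    __global__ void transcendentals(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            // Fast SFU intrinsics; trade some precision for throughput.
            y[i] = __sinf(x[i]) + __logf(x[i] + 1.0f);
        }
    }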
Fermi SM Each SM has 64KB of on-chip memory Configurable by the CUDA developer: 48KB shared memory / 16KB L1 cache, or 16KB shared memory / 48KB L1 cache Cache is good for unpredictable or irregular memory access; shared memory is good for predictable, regular memory access. Cache makes it easier to port CPU applications; more shared memory is good for porting applications from previous GPU generations. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
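The split is chosen per kernel with the CUDA runtime call cudaFuncSetCacheConfig; a minimal sketch, with myKernel as a placeholder name:

    __global__ void myKernel(float* data) { /* ... */ }

    void configureCaches()
    {
        // Irregular, hard-to-predict access: prefer the 48KB L1 cache.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // Regular, staged access: prefer 48KB of shared memory instead.
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    }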
Fermi Dual Warp Scheduling Double precision instructions do not support dual dispatch with any other operation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Kernels from the same application execute in parallel Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
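A minimal sketch of how concurrency is expressed: kernels launched into different CUDA streams have no implied ordering, so Fermi may overlap them when neither fills the machine. kernelA and kernelB are placeholders:

    __global__ void kernelA(float* d) { d[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float* d) { d[threadIdx.x] *= 2.0f; }

    void launchConcurrently(float* dA, float* dB)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Different streams, no dependencies: eligible to run concurrently.
        kernelA<<<1, 256, 0, s0>>>(dA);
        kernelB<<<1, 256, 0, s1>>>(dB);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }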
Fermi Caches Registers spill to L1 cache instead of directly to DRAM Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
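One way to see this in practice, as a hedged sketch: capping registers with the real __launch_bounds__ qualifier can push the compiler to spill local arrays, and on Fermi that spilled “local” memory is backed by the L1 cache rather than going straight to DRAM:

    // Limiting registers per thread (to support 1024-thread blocks) makes
    // the compiler likely to spill the tmp array to local memory.
    __global__ void __launch_bounds__(1024) spillDemo(const float* in, float* out)
    {
        float tmp[32];
        for (int i = 0; i < 32; ++i)
            tmp[i] = in[threadIdx.x + i];

        float sum = 0.0f;
        for (int i = 0; i < 32; ++i)
            sum += tmp[i];

        out[threadIdx.x] = sum;   // spills hit L1 first, then L2/DRAM
    }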
Fermi Caches Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi: Unified Address Space Useful for developers implementing libraries since one pointer can point to any memory type. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi: Unified Address Space 64-bit virtual addresses 40-bit physical addresses (currently) CUDA 4: Shared address space with CPU. Why? Useful for developers implementing libraries since one pointer can point to any memory type.
Fermi: Unified Address Space 64-bit virtual addresses 40-bit physical addresses (currently) CUDA 4: Shared address space with CPU. Why? No explicit CPU/GPU copies Direct GPU-GPU copies Direct I/O device to GPU copies Useful for developers implementing libraries since one pointer can point to any memory type.
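A minimal sketch of what the shared address space buys in practice: with a unified virtual address space, the runtime can infer which side of the bus each pointer lives on, so one generic copy kind (the real cudaMemcpyDefault) replaces the explicit direction flags:

    void uvaCopy(size_t bytes)
    {
        float *h = 0, *d = 0;
        cudaHostAlloc((void**)&h, bytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d, bytes);

        // The runtime infers host-to-device from the pointers themselves.
        cudaMemcpy(d, h, bytes, cudaMemcpyDefault);

        cudaFree(d);
        cudaFreeHost(h);
    }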
Fermi ECC ECC protects the register file, L1, L2, and DRAM Uses redundancy to ensure data integrity against cosmic rays flipping bits For example, 64 bits are stored as 72 bits Fixes single-bit errors, detects multi-bit errors What are the applications? Naturally occurring radiation can flip bits. Graphics doesn’t need ECC, but compute does; for example, medical imaging and financial options pricing. Single-Error Correction, Double-Error Detection (SECDED)
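ECC status can be queried at runtime through the ECCEnabled field of cudaDeviceProp; a minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        std::printf("ECC is %s\n", prop.ECCEnabled ? "enabled" : "disabled");
        return 0;
    }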
Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Tessellation 1.6 billion triangles per second for the water demo The hair demo includes both physics simulation and rendering Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Tessellation Fixed function hardware on each SM for graphics Texture filtering Texture cache Tessellation Vertex Fetch / Attribute Setup Stream Output Viewport Transform. Why? Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Observations Becoming easier to port CPU code to the GPU Recursion, fast atomics, L1/L2 caches, faster global memory In fact…
Observations Becoming easier to port CPU code to the GPU Recursion, fast atomics, L1/L2 caches, faster global memory In fact… GPUs are starting to look like CPUs Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics
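For instance, device-side recursion, new in Fermi (compute capability 2.0), gives each call a real stack frame; a minimal sketch with an invented factorial example:

    __device__ int factorial(int n)
    {
        // Earlier GPUs inlined all calls, so recursion was impossible;
        // Fermi gives each call a stack frame in local memory.
        return (n <= 1) ? 1 : n * factorial(n - 1);
    }

    __global__ void factKernel(int* out, int n)
    {
        *out = factorial(n);
    }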