1
NVIDIA Fermi Architecture
Joseph Kider, University of Pennsylvania, CIS, Fall 2011
2
Administrivia Project checkpoint on Monday
3
Sources Patrick Cozzi, Spring 2011; NVIDIA CUDA Programming Guide;
CUDA by Example; Programming Massively Parallel Processors
4
G80, GT200, and Fermi November 2006: G80 June 2008: GT200
March 2010: Fermi (GF100) Fermi design started around 2005. Image from:
5
New GPU Generation What are the technical goals for a new GPU generation?
6
New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How?
7
New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How? Advance programmability. In what ways? Given your experience with programming the G80, what features do you want?
8
Fermi: What’s More? More total cores (SPs) – not SMs, though
More registers: 32K per SM More shared memory: up to 48K per SM More Special Function Units (SFUs)
9
Fermi: What’s Faster? Faster double precision – 8x over GT200
Faster atomic operations – 5-20x. What for? Faster context switches – between applications (10x), and between graphics and compute, e.g., OpenGL and CUDA. Double precision runs at half the single-precision speed; on GT200 it was 1/8 speed. Atomics are faster due to the L1/L2 caches and more atomic hardware units; faster atomics allow more CPU-like applications, like ray tracing and random scatter (see the sketch below). Application context switches are below 25 microseconds.
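As a minimal sketch of why faster atomics matter (kernel name, bin count, and launch are illustrative, not from the slides): a 256-bin histogram scatters increments into a small table, so many threads contend for the same words, and Fermi's L2-backed atomic units make this much cheaper.

__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // many threads may hit the same bin
}

// Illustrative launch: histogram256<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);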
10
Fermi: What’s New? L1 and L2 caches – for compute or graphics?
Dual warp scheduling Concurrent kernel execution C++ support: virtual functions, new/delete, try/catch Full IEEE support in hardware Unified address space Error Correcting Code (ECC) memory support Fixed-function tessellation for graphics Previous L1/L2 caches were for textures only. The motivation for cache is more for compute than graphics: graphics is streaming, and all data is brought on-chip each frame. IEEE support includes denormals, so Fermi doesn’t flush to zero; that is a pain to do in hardware, and CPUs would use a trap and do it in software – thousands of cycles. Previous GPUs used IEEE 754-1985; Fermi implements IEEE 754-2008. A streams sketch for concurrent kernels follows below.
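To make concurrent kernel execution concrete, a minimal sketch (kernel names, sizes, and the helper are illustrative): two kernels launched into different CUDA streams from the same application, which Fermi may run at the same time if SM resources allow.

#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float* y) { y[threadIdx.x] += 1.0f; }

void launchConcurrently(float* d_x, float* d_y)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Different non-default streams: on Fermi these launches may overlap.
    kernelA<<<1, 256, 0, s0>>>(d_x);
    kernelB<<<1, 256, 0, s1>>>(d_y);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}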
11
G80, GT200, and Fermi Image from:
12
G80, GT200, and Fermi The number of cores has increased at basically the same rate as the number of transistors, following Moore’s Law. Granted, double-precision support was added, as well as L1 and L2 caches and more complex dispatch (dual warp scheduling). Image from:
13
GT200 and Fermi Image from:
14
Fermi Block Diagram GF100 16 SMs Each with 32 cores
512 total cores Each SM hosts up to 48 warps, or 1,536 threads In flight: up to 16 × 1,536 = 24,576 threads Image from:
15
Fermi SM Why 32 cores per SM instead of 8? Why not more SMs?
G80 – 8 cores per SM GT200 – 8 cores per SM GF100 – 32 cores per SM Enables dual warp scheduling: 2 cycles per warp now, instead of 4.
16
Fermi SM Dual warp scheduling – why? 32K registers 32 cores 16 load/store units
Floating point and integer unit per core 4 SFUs Dual warp scheduling allows increased use of the execution units by allowing a better mix of instructions. G80 had 8K registers; GT200 had 16K registers. The SFUs handle log, sin, cos, etc. – see the sketch below. Image from:
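A minimal sketch of code that leans on the SFUs (kernel name is illustrative): the __sinf and __logf intrinsics execute on the Special Function Units rather than the regular cores, so with only 4 SFUs per SM, a kernel dominated by them will not keep all 32 cores busy.

__global__ void sfuHeavy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) + __logf(in[i] + 1.0f);  // both issue to the SFUs
}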
17
Fermi SM 16 SMs * 32 cores/SM = 512 floating point operations per cycle. Why not in practice? In practice, there are memory accesses and branches. But also look at this block diagram: there are only 16 load/store units and 4 SFUs per SM, so lots of calls to log(), for example, will reduce parallelism. Image from:
18
Fermi SM Each SM has 64KB of on-chip memory Configurable by the CUDA developer:
48KB shared memory / 16KB L1 cache, or 16KB shared memory / 48KB L1 cache. Cache is good for unpredictable or irregular memory access; shared memory is good for predictable, regular memory access. Cache makes it easier to port CPU applications; more shared memory is good for porting applications from previous GPU generations. The configuration call is sketched below. Image from:
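A minimal sketch of how a CUDA developer picks the split (the kernel is a placeholder): cudaFuncSetCacheConfig requests a preferred configuration per kernel – cudaFuncCachePreferShared for 48KB shared / 16KB L1, cudaFuncCachePreferL1 for the reverse.

#include <cuda_runtime.h>

__global__ void myKernel(float* data) { data[threadIdx.x] += 1.0f; }

void configureCache()
{
    // Irregular, hard-to-predict access: ask for the bigger L1 cache.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    // For a tiled algorithm with regular access, prefer shared memory instead:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}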
19
Fermi Dual Warp Scheduling
Double precision instructions do not support dual dispatch with any other operation Image from:
20
Kernels from the same application execute in parallel
Slide from:
21
Fermi Caches Registers spill to L1 cache instead of directly to DRAM
Slide from:
22
Fermi Caches Slide from:
23
Fermi: Unified Address Space
Useful for developers implementing libraries since one pointer can point to any memory type. Image from:
24
Fermi: Unified Address Space
64-bit virtual addresses 40-bit physical addresses (currently) CUDA 4: Shared address space with CPU. Why? Useful for developers implementing libraries since one pointer can point to any memory type.
25
Fermi: Unified Address Space
64-bit virtual addresses 40-bit physical addresses (currently) CUDA 4: Shared address space with CPU. Why? No explicit CPU/GPU copies Direct GPU-GPU copies Direct I/O device to GPU copies Useful for developers implementing libraries, since one pointer can point to any memory type – see the sketch below.
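A minimal sketch of what the unified address space buys in practice (the helper function is hypothetical): with UVA, the runtime can tell host pointers from device pointers by their addresses alone, so cudaMemcpyDefault replaces the explicit direction flags and one copy routine works for any source/destination combination.

#include <cuda_runtime.h>

void copyAnywhere(void* dst, const void* src, size_t bytes)
{
    // Direction (host-to-device, device-to-host, or device-to-device)
    // is inferred from the pointer values themselves.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
}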
26
Fermi ECC ECC protected: register file, L1, L2, DRAM
Uses redundancy to ensure data integrity against cosmic rays flipping bits. For example, 64 bits are stored as 72 bits. Fixes single-bit errors, detects multiple-bit errors: Single-Error Correct, Double-Error Detect (SECDED). What are the applications? Naturally occurring radiation can flip bits. Graphics doesn’t need ECC, but compute does – for example, medical imaging and financial options pricing. A query sketch follows below.
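A minimal sketch of how an application checks the ECC state (function name is illustrative): the device properties expose an ECCEnabled flag on boards that support it.

#include <cuda_runtime.h>
#include <stdio.h>

void reportEcc(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
}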
27
Fermi Tessellation Image from:
28
Fermi Tessellation 1.6 billion triangles per second for water
Hair includes physics simulation and rendering Image from:
29
Fermi Tessellation Fixed-function hardware on each SM for graphics:
Texture filtering Texture cache Tessellation Vertex fetch / attribute setup Stream output Viewport transform. Why? Image from:
30
Observations Becoming easier to port CPU code to the GPU
Recursion, fast atomics, L1/L2 caches, faster global memory In fact…
31
Observations Becoming easier to port CPU code to the GPU
Recursion, fast atomics, L1/L2 caches, faster global memory In fact… GPUs are starting to look like CPUs Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics. A recursion sketch follows below.
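As a minimal sketch of one of those CPU-like features (function names are illustrative): true device-side recursion, legal on Fermi's compute capability 2.x because each thread gets a real stack, and not possible on G80/GT200.

__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);  // genuine recursive call
}

__global__ void useRecursion(int* out)
{
    out[threadIdx.x] = factorial(threadIdx.x);
}

// Compile for Fermi, e.g.: nvcc -arch=sm_20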