Graphics Hardware: Kurt Akeley, CS248 Lecture 14, 8 November 2007


1  Graphics Hardware
Kurt Akeley
CS248 Lecture 14, 8 November 2007
http://graphics.stanford.edu/courses/cs248-07/

2  Implementation = abstraction (from lecture 2)
[Figure: the NVIDIA GeForce 8800 block diagram (data assembler; vertex, primitive, and fragment thread issue; setup/rasterization/z-cull; the thread processor; clusters of stream processors with L1 caches and texture units; L2 caches and framebuffer partitions) shown alongside the OpenGL pipeline: application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer. Source: NVIDIA]

3  Correspondence (by color)
[Figure: the same GeForce 8800 / OpenGL pipeline diagram, color-matched. The slide groups the hardware into an application-programmable parallel processor, fixed-function assembly processors, and fixed-function framebuffer operations, and identifies rasterization as fragment assembly.]

4  Why does graphics hardware exist?
Special-purpose hardware tends to disappear over time:
- Lisp machines and CAD workstations of the 80s
- CISC CPUs
[Images: iAPX432 (circa 1982), www.dvorak.org/blog/; Symbolics Lisp Machines (circa 1984), www.abstractscience.freeserve.co.uk/symbolics/photos/]

5  Why does graphics hardware exist?
Graphics acceleration has been around for 40 years. Why do GPUs remain? Confluence of four things:
- Performance differentiation: GPUs are much faster than CPUs at 3-D rendering tasks.
- Work-load sufficiency: the accelerated 3-D rendering tasks make up a significant portion of the overall processing (thus Amdahl's law doesn't limit the resulting performance increase).
- Strong market demand: customer demand for 3-D graphics performance is strong, driven by the games market.
- Ubiquity: with the help of standardized APIs/architectures (OpenGL and Direct3D), GPUs have achieved ubiquity in the PC market. Inertia now works in favor of continued graphics hardware.

6  NVIDIA 8800 Ultra
Stream processors: 128
Peak floating-point performance: 400+ GFLOPS
Memory: 768 MB
Memory bandwidth: 103.7 GB/sec
Triangle rate (vertex rate): 306 million/sec (est)
Texture fill rate (fragment rate): 39.2 billion/sec

7  NVIDIA performance trends
Year  Product             Tri rate  CAGR  Tex rate  CAGR
1998  Riva ZX             3m        -     100m      -
1999  Riva TNT2           9m        3.0   350m      3.5
2000  GeForce2 GTS        25m       2.8   664m      1.9
2001  GeForce3            30m       1.2   800m      1.2
2002  GeForce Ti 4600     60m       2.0   1200m     1.5
2003  GeForce FX          167m      2.8   2000m     1.7
2004  GeForce 6800 Ultra  170m      1.0   6800m     2.7
2005  GeForce 7800 GTX    215m      1.2   6800m     1.0
2006  GeForce 7900 GTX    260m      1.3   15600m    2.3
2007  GeForce 8800 Ultra  306m      1.2   39200m    2.5
Overall CAGR: 1.7 (triangle rate), 1.9 (texture rate). Yearly growth is well above 1.5 (Moore's Law).
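
The CAGR columns are just the yearly growth factor implied by the rates. A minimal sketch of the calculation in C, using the 1998 and 2007 endpoints from the table above (nothing here beyond the table's own numbers):

    #include <math.h>
    #include <stdio.h>

    /* Compound annual growth rate: the constant per-year factor that takes
     * rate_start to rate_end over the given number of years. */
    double cagr(double rate_start, double rate_end, double years)
    {
        return pow(rate_end / rate_start, 1.0 / years);
    }

    int main(void)
    {
        /* Riva ZX (1998) to GeForce 8800 Ultra (2007): nine years. */
        printf("triangle-rate CAGR: %.2f\n", cagr(3e6, 306e6, 9.0));      /* ~1.7 */
        printf("texture-rate CAGR:  %.2f\n", cagr(100e6, 39200e6, 9.0));  /* ~1.9 */
        return 0;
    }

These endpoints reproduce the 1.7 and 1.9 overall figures; each per-row CAGR is the same formula applied to consecutive years.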

8  SGI performance trends (depth buffered)
Year  Product          ZTri rate  CAGR  Zbuf rate  CAGR
1984  Iris 2000        1k         -     100k       -
1988  GTX              135k       3.6   40m        4.5
1992  RealityEngine    2m         2.0   380m       1.8
1996  InfiniteReality  12m        1.6   1000m      1.3
Overall CAGR: 2.2. Yearly growth well above 1.5 (Moore's Law).

9  CPU performance CAGR has been slowing
[Figure omitted. Source: Hennessy and Patterson]

10  The situation could change …
CPUs are becoming much more parallel:
- CPU performance increase (1.2x to 1.5x per year) is low compared with the GPU increase (1.7x to 2x per year).
- This could change now with CPU parallelism (many-core).
The vertex pipeline architecture is getting old:
- Approaches such as ray tracing offer many advantages, but the vertex pipeline is poorly optimized for them.
- The work-load argument is somewhat circular, because the brute-force algorithms employed by GPUs inflate their own performance demands.
GPUs have evolved and will continue to evolve, but a revolution is always possible.

11  Outline
The rest of this lecture is organized around the four ideas that most informed the design of modern GPUs (as enumerated by David Blythe in this lecture's reading assignment):
- Parallelism
- Coherence
- Latency
- Programmability
I'll continue to use the NVIDIA 8800 as a specific example.

12  Parallelism

13  Graphics is "embarrassingly parallel"
Many separate tasks (the types I keep talking about). No "horizontal" dependencies, few "vertical" ones (in-order execution).
[Pipeline diagram: application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer, display.]
Work-element types:
    struct { float x, y, z, w; float r, g, b, a; } vertex;
    struct { vertex v0, v1, v2; } triangle;
    struct { short int x, y; float depth; float r, g, b, a; } fragment;
    struct { int depth; unsigned char r, g, b, a; } pixel;

14  Data and task parallelism
Data parallelism:
- Simultaneously doing the same thing to similar data (e.g., transforming vertexes).
- Some variance in "same thing" is possible.
Task parallelism:
- Simultaneously doing different things (e.g., the tasks (stages) of the vertex pipeline).
[Diagram: the pipeline from application through framebuffer/display, with data parallelism within each stage and task parallelism across stages.]
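
To make the data-parallel case concrete, here is a minimal C sketch (illustrative only; the function and the row-major matrix layout are my own, not from the slides). Every iteration performs the same operation on an independent vertex, so the work can be spread across any number of identical processors:

    #include <stddef.h>

    typedef struct { float x, y, z, w; float r, g, b, a; } vertex;

    /* "The same thing": transform a position by a 4x4 row-major matrix. */
    static vertex transform(const float m[16], vertex v)
    {
        vertex o = v;
        o.x = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
        o.y = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
        o.z = m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w;
        o.w = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
        return o;
    }

    /* Data parallelism: no iteration depends on any other. */
    void transform_all(vertex *v, size_t n, const float m[16])
    {
        for (size_t i = 0; i < n; ++i)
            v[i] = transform(m, v[i]);
    }

Task parallelism, by contrast, would run vertex operations, rasterization, and fragment operations as separate concurrent stages connected by queues (see the FIFO sketch after slide 18).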

15  Trend from pipeline to data parallelism
[Figure comparing three generations:
- Clark "Geometry Engine" (1983): a fixed pipeline of coordinate transform, 6-plane frustum clipping, divide by w, viewport.
- SGI 4D/GTX (1988): a deeper pipeline: coordinate/normal transform, lighting, clip testing, clipping state, divide by w (clipping), viewport, primitive assembly, backface cull.
- SGI RealityEngine (1992): a command processor distributing work round-robin to parallel processors, with aggregation afterward.]

16  Load balancing
Easy for data parallelism. Challenging for task parallelism:
- Static balance is difficult to achieve, and even then is insufficient.
- Mode changes affect execution time (e.g., complex lighting).
- Worse, data can affect execution time (e.g., clipping).
Unified architectures ease pipeline balance (see the sketch after this slide):
- The pipeline is virtual; processors are assigned as required.
- The 8800 is unified.
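
As a toy illustration of "processors assigned as required" (entirely schematic; the policy below is my own invention, not NVIDIA's scheduler): idle processors are repeatedly handed whichever virtual stage currently has the most queued work, so the balance adapts to modes and data automatically:

    enum stage { VERTEX_OPS, PRIMITIVE_OPS, FRAGMENT_OPS, NUM_STAGES };

    /* Work queued at the input of each virtual pipeline stage. */
    static int pending[NUM_STAGES];

    /* Called whenever a processor becomes idle. */
    enum stage assign_processor(void)
    {
        enum stage best = VERTEX_OPS;
        for (int s = 1; s < NUM_STAGES; ++s)
            if (pending[s] > pending[best])
                best = (enum stage)s;
        return best;
    }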

17  Unified pipeline architecture
[Figure: the GeForce 8800 block diagram with its array of stream processors highlighted as the single application-programmable parallel processor that serves all of the programmable pipeline stages.]

18  Queueing
FIFO buffering (first-in, first-out) is provided between task stages:
- Accommodates variation in execution time.
- Provides elasticity to allow unified load balancing to work.
FIFOs can also be unified:
- Share a single large memory with multiple head-tail pairs.
- Allocate as required.
[Diagram: FIFOs between the application, vertex assembly, vertex operations, and primitive assembly stages.]
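
A minimal single-producer, single-consumer ring-buffer FIFO in C shows the elasticity the slide describes (a sketch only; a unified FIFO would carve many such head/tail pairs out of one shared memory, with capacities allocated as required):

    #include <stdbool.h>

    #define FIFO_CAPACITY 1024          /* power of two, so the wrap is cheap */

    typedef struct {
        unsigned head;                  /* next slot to read  */
        unsigned tail;                  /* next slot to write */
        void    *slot[FIFO_CAPACITY];
    } fifo;

    static bool fifo_push(fifo *f, void *work)
    {
        if (f->tail - f->head == FIFO_CAPACITY)
            return false;               /* full: upstream stage stalls */
        f->slot[f->tail++ % FIFO_CAPACITY] = work;
        return true;
    }

    static void *fifo_pop(fifo *f)
    {
        if (f->tail == f->head)
            return 0;                   /* empty: downstream stage idles */
        return f->slot[f->head++ % FIFO_CAPACITY];
    }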

19  In-order execution
Work elements must be sequence stamped.
FIFOs can also be used as reorder buffers.
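
One way to picture the sequence stamps (a sketch, not a description of the 8800's hardware): results may complete out of order, but they are retired from a reorder buffer strictly by stamp:

    #include <stdbool.h>

    #define ROB_SIZE 256

    typedef struct {
        unsigned seq;                   /* sequence stamp assigned at issue */
        void    *payload;
        bool     done;
    } work_item;

    static work_item rob[ROB_SIZE];
    static unsigned  next_to_retire;    /* lowest stamp not yet retired */

    void complete(work_item w)          /* results may arrive in any order */
    {
        rob[w.seq % ROB_SIZE] = w;
        rob[w.seq % ROB_SIZE].done = true;
    }

    void retire(void (*emit)(void *))   /* always emits in stamp order */
    {
        while (rob[next_to_retire % ROB_SIZE].done &&
               rob[next_to_retire % ROB_SIZE].seq == next_to_retire) {
            emit(rob[next_to_retire % ROB_SIZE].payload);
            rob[next_to_retire % ROB_SIZE].done = false;
            ++next_to_retire;
        }
    }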

20  Coherence

21  Two aspects of coherence
Data locality: the data required for computation are "nearby."
Computational coherence: similar sequences of operations are being performed.

22  Data locality
Prior to texture mapping:
- The vertex pipeline was a stream processor.
- Each work element (vertex, primitive, fragment) carried all the state it needed.
- Modal state was local to the pipeline stage.
- Assembly stages operated on adjacent work elements.
- Data locality was inherent in this model.
Post texture mapping:
- All application-programmable stages have memory access (and use it).
- So the vertex pipeline is no longer a stream processor.
- Data locality must be fought for …

23  Post-texture-mapping data locality (simplified)
Modern memory (DRAM) operates in large blocks:
- Memory is a 2-D array.
- Access is to an entire row.
To make efficient use of memory bandwidth, all the data in a block must be used. Two things can be done:
- Aggregate read and write requests (the memory controller and cache; a complex part of GPU design).
- Organize memory contents coherently (blocking).

24  Texture blocking
[Figure: a 6-D texture organization. Texels are stored in 4x4 blocks sized to match the cache line, and blocks are grouped again to match the cache size; a texel address is formed from the base address plus three interleaved (s, t) bit-field pairs, (s3, t3), (s2, t2), (s1, t1). Source: Pat Hanrahan]
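
A simplified two-level version of such a layout in C (my own illustration, assuming 4x4-texel blocks and a power-of-two width; the scheme in the figure adds a further level of blocking matched to the cache size):

    #include <stdint.h>

    #define BLOCK 4                     /* 4x4-texel blocks, one cache line each */

    /* Row-major layout: neighboring t values are a full texture row apart,
     * so a small 2-D filter footprint touches many memory blocks. */
    uint32_t addr_linear(uint32_t s, uint32_t t, uint32_t width)
    {
        return t * width + s;
    }

    /* Blocked layout: all 16 texels of a 4x4 block are contiguous, so the
     * same footprint usually falls within one or two blocks. */
    uint32_t addr_blocked(uint32_t s, uint32_t t, uint32_t width)
    {
        uint32_t block_s = s / BLOCK, sub_s = s % BLOCK;
        uint32_t block_t = t / BLOCK, sub_t = t % BLOCK;
        uint32_t blocks_per_row = width / BLOCK;

        return (block_t * blocks_per_row + block_s) * (BLOCK * BLOCK)
             + sub_t * BLOCK + sub_s;
    }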

25  Computational coherence
Data parallelism is computationally coherent:
- Simultaneously doing the same thing to similar data.
- Can share a single instruction sequencer with multiple data paths.
[Diagram: one instruction fetch-and-execute unit driving multiple data paths, each holding a work element such as
    struct { float x, y, z, w; float r, g, b, a; } vertex; ]
SIMD: Single Instruction, Multiple Data.

26  SIMD processing
[Figure: the GeForce 8800 block diagram with one of its eight 16-wide SIMD processors highlighted.]
Why not use one 128-wide processor?

27  SIMD conditional control flow
The "shader" abstraction operates on each data element independently, but a SIMD implementation shares a single execution unit across multiple data elements. If data elements in the same SIMD unit branch differently, the execution unit must follow both paths (sequentially).
The solution is predication (see the sketch after this slide):
- Both paths are executed.
- Data paths are enabled only during their selected path.
- Predication can be nested.
- Performance is obviously lost!
SIMD width is a compromise:
- Too wide: too much performance loss due to predication.
- Too narrow: inefficient hardware implementation.
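
A scalar model of predication (illustrative only; in hardware the loop body is a single instruction stream executed by all lanes at once, and the mask gates the write-back):

    #define WIDTH 16                    /* lanes sharing one instruction sequencer */

    void shade(const float in[WIDTH], float out[WIDTH])
    {
        for (int lane = 0; lane < WIDTH; ++lane) {
            int   pred      = in[lane] > 0.0f;    /* per-lane predicate       */
            float taken     = in[lane] * 2.0f;    /* "then" path, always run  */
            float not_taken = -in[lane];          /* "else" path, always run  */
            out[lane] = pred ? taken : not_taken; /* mask selects the result  */
        }
    }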

28  Latency

29  Again, two issues
Overall rendering latency:
- Typically measured in frames.
- Of concern to application programmers.
- Short on modern GPUs (more from Dave Oldcorn on this).
- But GPUs with longer rendering latencies have been designed; fun to talk about in a graphics architecture course.
Memory access latency:
- Typically measured in clock cycles (and reaching thousands of those).
- Of direct concern to GPU architects and implementors.
- But useful for application programmers to understand too!

30  Multi-threading
Another kind of processor virtualization:
- Unified GPUs share a single execution engine among multiple pipeline (task) stages; equivalent to CPU multi-tasking.
- Multi-threading shares a single execution engine among multiple data-parallel work elements; similar to CPU hyper-threading.
The 8800 Ultra multi-threading mechanism is used to support both multi-tasking and data-parallel multi-threading.
A thread is a data structure:
    struct {
        int   pc;      // program counter
        float reg[n];  // live register state
        enum  ctxt;    // context information
        …
    } thread;
More live registers mean more memory usage.

31  [Figure: the GeForce 8800 block diagram with the thread processor highlighted and labeled "Multi-threading."]

32  Multi-threading hides latency
[Diagram: ready-to-run threads and blocked threads. The processor fetches and executes instructions for a ready thread until it makes a memory reference (or hits the resulting data dependency); the thread then blocks and another ready thread runs. When the memory data become available (the dependency is resolved), the blocked thread returns to the ready pool.]
The processor stalls if no threads are ready to run, a possible result of large thread context (too many live registers).
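
A toy model of the scheduling policy (purely illustrative; real hardware does this every clock, in parallel, across thousands of threads):

    #include <stdbool.h>

    #define MAX_THREADS 64

    typedef struct {
        int  pc;
        bool blocked;                   /* waiting on a memory reference */
    } thread_state;

    static thread_state pool[MAX_THREADS];

    /* Each issue slot, run any ready thread; the more threads the thread
     * store can hold, the less likely it is that all are blocked. */
    thread_state *pick_ready(void)
    {
        for (int i = 0; i < MAX_THREADS; ++i)
            if (!pool[i].blocked)
                return &pool[i];
        return 0;                       /* stall: no threads ready to run */
    }

    /* Called when the memory data for thread t become available. */
    void memory_data_available(thread_state *t)
    {
        t->blocked = false;
    }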

33  Cache and thread store
CPU:
- Uses cache to hide memory latency.
- Caches are huge (many MBs).
GPU:
- Uses cache to aggregate memory requests and maximize effective bandwidth.
- Caches are relatively small.
- Uses multithreading to hide memory latency.
- Thread store is large.
Total memory usage on CPU and GPU chips is becoming similar …

34  Programmability

35  Programmability trade-offs
Fixed-function:
- Efficient in die area and power dissipation.
- Rigid in functionality.
- Simple.
Programmable:
- Wasteful of die area and power.
- Flexible and adaptable.
- Able to manage complexity.

36  Programmability is not new
The Silicon Graphics VGX (1990) supported programmable vertex, primitive, and fragment operations:
- These operations are complex and require flexibility and adaptability.
- The assembly operations are relatively simple and have few options.
- Texture fetch and filter are also simple and benefit from fixed-function implementation.
What is new is allowing application developers to write vertex, primitive, and fragment shaders.

37  Questions

38  Why insist on in-order processing?
Even Direct3D 10 does.
- Testability (repeatability).
- Invariance for multi-pass rendering (repeatability).
- Utility of the painter's algorithm.
- State assignment!

39  Why can't fragment shaders access the framebuffer?
Equivalent to: why do other people's block diagrams distinguish between fragment operations and framebuffer operations?
Simple answer: cache consistency.

40  Why hasn't tiled rendering caught on?
It seems very attractive:
- Small framebuffer (which can be on-die in some cases).
- Deep framebuffer state (e.g., for transparency sorting).
- High performance.
Problems:
- May increase rendering latency.
- Has difficulty with multi-pass algorithms.
- Doesn't match the OpenGL/Direct3D abstraction.

41  Summary
Parallelism:
- Graphics is inherently highly data and task parallel.
- Challenges include in-order execution and load balancing.
Coherence:
- Streaming is inherently data and instruction coherent.
- But texture fetch breaks the streaming model and its data coherence.
- Reference aggregation and memory layout restore data coherence.
Latency:
- Modern GPU implementations have minimal rendering latency.
- Multithreading (not caching) hides (the large) memory latency.
Programmability:
- "Operation" stages are (and have long been) programmable.
- Assembly stages, texture filtering, and ROPs typically are not.
- Application programmability is new.

42  Assignments
Next lecture: Performance Tuning and Debugging (guest lecturer Dave Oldcorn, AMD).
Reading assignment for Tuesday's class: Sections 2.8 (vertex arrays) and 2.9 (buffer objects) of the OpenGL 2.1 specification.
Short office hours today.

43  End

