Lesson 7: GPU & GPGPU

1 Lesson 7: GPU & GPGPU

2 Overview
- Traditional Graphics Pipeline
- Programmable Graphics Pipeline
  - Vertex Shader
  - Fragment (Pixel) Shader
- Brief introduction to Cg
- GPGPU (General Purpose GPU)

3 Generation I: 3dfx Voodoo (1996)
- One of the first true 3D game cards
- Worked by supplementing a standard 2D video card
- Did not do vertex transformations; these were done on the CPU
- Did do texture mapping and z-buffering
[Figure: pipeline of Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, and Frame Buffer; vertex transforms run on the CPU and the remaining stages on the GPU, connected over PCI]

4 Generation II: GeForce/Radeon 7500 (1999)
- Main innovation: shifting the transformation and lighting calculations to the GPU
- Allowed multi-texturing, enabling bump maps, light maps, and other effects
- Faster AGP bus instead of PCI
[Figure: the same pipeline (Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, Frame Buffer), now entirely on the GPU and fed over AGP]

5 Generation III: GeForce3/Radeon 8500(2001)
- For the first time, allowed a limited amount of programmability in the vertex pipeline
- Also allowed volume texturing and multi-sampling (for antialiasing)
[Figure: the same pipeline on the GPU, fed over AGP, annotated "Small vertex shaders"]

6 Generation IV: Radeon 9700/GeForce FX (2002)
- The first generation of fully programmable graphics cards
- Different versions have different resource limits on fragment/vertex programs
[Figure: the same pipeline, fed over AGP, annotated "Programmable Vertex shader" and "Programmable Fragment Processor"]

7 Traditional Graphics Pipeline
[Figure: a simplified graphics pipeline. The Application on the CPU sends Vertices (3D) and Graphics State to the GPU, which runs Transform & Light -> Xformed, Lit Vertices (2D) -> Assemble Primitives -> Screenspace triangles (2D) -> Rasterize -> Fragments (pre-pixels) -> Shade -> Final Pixels (Color, Depth). Video Memory (Textures) is read during shading and written by render-to-texture.]
- Note that pipe widths vary
- Many caches, FIFOs, and so on are not shown

8 Pipeline: Transform
- Transform & light
  - Transform from "world space" to "image space"
  - Compute per-vertex lighting

9 ModelView Transformation
- Vertices are mapped from object space to world space by M, then to eye space by V
  - M = model transformation (scene)
  - V = view transformation (camera)
- With column vectors: (X', Y', Z', W')^T = V * M * (X, Y, Z, 1)^T
- Each matrix transform is applied to each vertex in the input stream
- Think of this as a kernel operator
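A minimal Python sketch of the "kernel operator" idea: one modelview matrix is applied to every vertex independently. It assumes column vectors, so the combined matrix is V * M (model first, then view). The rotation, the camera offset, and all function names are hypothetical illustrations, not taken from the slides.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def mat_mul(a, b):
    """Return the 4x4 matrix product a * b."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Hypothetical scene: the model is rotated 90 degrees about z, and the camera
# sits 5 units along +z, so the view transform shifts everything by -5 in z.
M = rotation_z(math.pi / 2)
V = translation(0, 0, -5)
MV = mat_mul(V, M)  # model transform is applied first, then the view transform

def transform_vertices(vertices):
    """Apply the same modelview matrix to every vertex, one at a time."""
    return [mat_vec(MV, [x, y, z, 1.0]) for (x, y, z) in vertices]

eye_space = transform_vertices([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

Because no vertex depends on any other, the loop in `transform_vertices` could run in any order or in parallel, which is what makes this stage a good fit for GPU hardware.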

10 Lighting
- Lighting information is combined with normals and other parameters at each vertex to create new colors
- Color(v) = emissive + ambient + diffuse + specular
- Each term on the right-hand side is a function of the vertex color, position, normal, and material properties
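The sum of lighting terms can be sketched in Python as follows. The slide does not say how each term is computed, so this sketch assumes a single point light, Lambert diffuse, and Blinn-Phong specular. The material dictionary layout and every name here are hypothetical.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    # Leave a zero vector untouched instead of dividing by zero.
    return tuple(c / length for c in v) if length > 0 else tuple(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vertex_color(position, normal, material, light_pos, eye_pos):
    """Color(v) = emissive + ambient + diffuse + specular, clamped to 1.0."""
    n = normalize(normal)
    l = normalize(tuple(lp - p for lp, p in zip(light_pos, position)))
    e = normalize(tuple(ep - p for ep, p in zip(eye_pos, position)))
    h = normalize(tuple(a + b for a, b in zip(l, e)))  # half vector (assumed Blinn-Phong)
    diffuse_factor = max(dot(n, l), 0.0)
    # No specular contribution when the light is behind the surface.
    specular_factor = max(dot(n, h), 0.0) ** material["shininess"] if diffuse_factor > 0 else 0.0
    return tuple(
        min(1.0, material["emissive"][i] + material["ambient"][i]
            + material["diffuse"][i] * diffuse_factor
            + material["specular"][i] * specular_factor)
        for i in range(3))

material = {
    "emissive": (0.0, 0.0, 0.0),
    "ambient": (0.1, 0.1, 0.1),
    "diffuse": (0.5, 0.0, 0.0),
    "specular": (0.2, 0.2, 0.2),
    "shininess": 30,
}
lit = vertex_color((0, 0, 0), (0, 0, 1), material, light_pos=(0, 0, 10), eye_pos=(0, 0, 10))
unlit = vertex_color((0, 0, 0), (0, 0, 1), material, light_pos=(0, 0, -10), eye_pos=(0, 0, 10))
```

With the light behind the surface, only the emissive and ambient terms survive, so `unlit` reduces to the ambient color.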

11 Pipeline: Rasterizer
- Rasterizer
  - Converts the geometric representation (vertices) to an image representation (fragments)
    - Fragment = image fragment: a pixel plus associated data (color, depth, stencil, etc.)
  - Interpolates per-vertex quantities across pixels
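A small Python sketch of what "interpolate per-vertex quantities across pixels" can look like. It assumes barycentric weights computed from edge functions and samples at pixel centers; real hardware adds fill rules and perspective correction, which are omitted here. All names are hypothetical.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return (x, y, color) fragments for pixel centers inside the triangle."""
    fragments = []
    area = edge(v0, v1, v2)
    if area == 0:
        return fragments  # degenerate triangle covers no pixels
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Blend the three vertex colors with the barycentric weights.
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                fragments.append((x, y, color))
    return fragments

red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
frags = rasterize((0.0, 0.0), (4.0, 0.0), (0.0, 4.0), red, green, blue, 4, 4)
```

Each fragment carries its pixel position plus the interpolated data, which is exactly the input the fragment processor later consumes.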

12 Pipeline: Shade
- Fragment processors (multiple in parallel)
  - Compute a color for each pixel
  - Optionally read colors from textures (images)

13 The Modern Graphics Pipeline
[Figure: the traditional pipeline with the Transform & Light stage replaced by a Vertex Processor and the Shade stage replaced by a Fragment Processor; the intermediate data (Xformed, Lit Vertices; Screenspace triangles; Fragments; Final Pixels) and Video Memory (Textures) with render-to-texture are unchanged]
- Programmable vertex processor!
- Programmable pixel processor!

14 The Current Graphics Pipeline
[Figure: the pipeline stages Vertex Processor, Assemble Primitives, Rasterize, and Fragment Processor, with a Geometry Processor added at primitive assembly; Video Memory (Textures) and render-to-texture as before]
- Programmable primitive assembly!
- More flexible memory access!

15 NVIDIA GeForce 6800 3D Pipeline
[Figure: block diagram of the GeForce 6800 with blocks labeled Vertex, Triangle Setup, Z-Cull, Shader Instruction Dispatch, Fragment, L2 Tex, Fragment Crossbar, Composite, and four Memory Partition blocks]

16 Precision
- 32-bit IEEE floating-point throughout the pipeline
  - Framebuffer
  - Textures
  - Fragment processor
  - Vertex processor
  - Interpolants

17 Vertex Processor
- Fully programmable (SIMD / MIMD)
- Processes 4-vectors (RGBA / XYZW)
- Capable of scatter but not gather
  - Can change the location of the current vertex
  - Cannot read info from other vertices
  - Can only read a small constant memory
- Latest GPUs: Vertex Texture Fetch
  - Random access memory for vertices
  - Gather (but not from the vertex stream itself)

18 Vertex processor capabilities
- 4-vector FP32 operations
- Condition codes + true data-dependent control flow
  - Conditional branches, subroutine calls, jump table
  - Useful for avoiding extra work, e.g.:
    - Don't do animation or skinning if the vertex will be clipped
    - Do displacement mapping only for vertices near the silhouette
- Transcendental arithmetic instructions (e.g. COS)
- User clip-plane support
- Texture reads (up to 4 textures, unlimited lookups)

19 Vertex processor limitations
- No arbitrary memory write
- No "vertex kill"
  - Can put vertex off-screen
  - Can make degenerate primitives
- Only 32-bit texture formats supported

20 Fragment Processor
- Fully programmable (SIMD)
- Processes 4-component vectors (RGBA / XYZW)
- Random access memory read (textures)
- Capable of gather but not scatter
  - RAM read (texture fetch), but no RAM write
  - Output address fixed to a specific pixel
- Typically more useful than the vertex processor
  - More fragment pipelines than vertex pipelines
  - Direct output (fragment processor is at the end of the pipeline)
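The "gather but not scatter" restriction can be illustrated with a minimal Python sketch. The kernel may read any texel of the input (gather), but it returns a single value for its own pixel and never chooses where that value is written. The 3x3 box filter, the clamped addressing, and the function names are assumptions chosen for illustration.

```python
def gather_kernel(texture, x, y):
    """Fragment-style kernel: reads from arbitrary texels, returns one value
    for the pixel (x, y) it was invoked on."""
    height, width = len(texture), len(texture[0])
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # Clamp-to-edge addressing, similar to a clamped texture fetch.
            sx = min(max(x + dx, 0), width - 1)
            sy = min(max(y + dy, 0), height - 1)
            total += texture[sy][sx]
    return total / 9.0

def run_fragment_pass(texture):
    """The 'pipeline' decides the output address; the kernel only supplies the value."""
    height, width = len(texture), len(texture[0])
    return [[gather_kernel(texture, x, y) for x in range(width)]
            for y in range(height)]

impulse = [[0.0, 0.0, 0.0],
           [0.0, 9.0, 0.0],
           [0.0, 0.0, 0.0]]
blurred = run_fragment_pass(impulse)
```

A scatter formulation would instead loop over input texels and add to several output locations, which is the access pattern the fragment processor described here cannot express directly.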

21 Fragment processor: texture mapping
- Texture reads are just another instruction
- Allows computed texture coordinates, nested to arbitrary depth
  - This is a big difference between NVIDIA and ATI right now
- Allows multiple uses of a single texture unit
- Optional LOD control: can specify filter extent
- Think of it as a memory-read instruction, with optional user-controlled filtering

22 Fragment processor capabilities
- Dynamic branching
- Conditional fragment-kill instruction
- Read access to window-space position
- Read/write access to fragment Z (but not stencil)
- Multiple render targets
- Built-in derivative instructions
  - Partial derivatives w.r.t. screen-space x or y
  - Useful for anti-aliasing shaders
- FP32, FP16, and fixed-point data

23 Fragment processor limitations
- Dynamic branching is less efficient than in the vertex processor
  - Especially for non-coherent branching (coherent regions smaller than roughly 30x30 pixels)
  - Can do a lot with condition codes
- No indexed reads from registers
  - I.e., no indexed arrays
  - Must use texture reads instead
- No arbitrary memory write

24 GPU vendor differences
- Note: this slide will be dated almost instantly
- NVIDIA: as described in previous slides
- ATI hardware today (1900XT current high-end part):
  - No vertex texture fetch (but good render-to-vertex-array)
  - Far fewer levels of computed texture coordinates
  - Better at fine-grained (less coherent) dynamic branching
- ATI Xenos (Xbox 360 chip):
  - Unified shader model: vertex proc == pixel proc
  - Scatter support: shaders can write to arbitrary memory locations

25 Cg: C for Graphics
- Cg is a high-level GPU programming language
- Designed by NVIDIA and Microsoft
- Competes with the (quite similar) GL Shading Language, a.k.a. GLslang

26 Programming in assembly is painful
Assembly:
    …
    FRC R2.y, C11.w;
    ADD R3.x, C11.w, -R2.y;
    MOV H4.y, R2.y;
    ADD H4.x, -H4.y, C4.w;
    MUL R3.xy, R3.xyww, C11.xyww;
    ADD R3.xy, R3.xyww, C11.z;
    TEX H5, R3, TEX2, 2D;
    ADD R3.x, R3.x, C11.x;
    TEX H6, R3, TEX2, 2D;
    …
Cg:
    …
    L2weight = timeval - floor(timeval);
    L1weight = 1.0 - L2weight;
    ocoord1 = floor(timeval)/64.0 + 1.0/128.0;
    ocoord2 = ocoord1 + 1.0/64.0;
    L1offset = f2tex2D(tex2, float2(ocoord1, 1.0/128.0));
    L2offset = f2tex2D(tex2, float2(ocoord2, 1.0/128.0));
    …
- Easier to read and modify
- Cross-platform
- Combine pieces
- etc.

27 Some points in the design space
- CPU languages
  - C: close to the hardware; general purpose
  - C++, Java, Lisp: require memory management
- RenderMan: specialized for shading
- Real-time shading languages
  - Stanford shading language
  - Creative Labs shading language

28 Design strategy
- Start with C (and a bit of C++)
  - Minimizes number of decisions
  - Gives you known mistakes instead of unknown ones
- Allow subsetting of the language
- Add features desired for GPUs
  - To support the GPU programming model
  - To enable high performance
- Tweak to make it fit together well

29 How are GPUs different from CPUs?
- GPU is a stream processor
  - Multiple programmable processing units
  - Connected by data flows
[Figure: data flow through Application, Vertex Processor, Assembly & Rasterization, Fragment Processor, Framebuffer Operations, and Framebuffer, with Textures as a further data source]

30 How are GPUs different from CPUs?
- Greater variation in basic capabilities
  - Most processors don't yet support branching
  - Vertex processors don't support texture mapping
  - Some processors support additional data types
- Compiler can't hide these differences
  - Least-common-denominator is too restrictive
  - Cg exposes differences via language profiles (list of capabilities and data types)
  - Over time, profiles will converge

31 How are GPUs different from CPUs?
- Optimized for 4-vector arithmetic
  - Useful for graphics: colors, vectors, texcoords
  - Easy way to get high performance/cost
- C philosophy says: expose these HW data types
  - Cg has vector data types and operations, e.g. float2, float3, float4
  - Makes it obvious how to get high performance
  - Cg also has matrix data types, e.g. float3x3, float3x4, float4x4

32 How are GPUs different from CPUs?
- No support for pointers
  - Arrays are first-class data types in Cg
- No integer data type
  - Cg adds "bool" data type for boolean operations
  - This change isn't obvious except when declaring vars

33 Cg basic data types
- All profiles:
  - float
  - bool
- All profiles with texture lookups:
  - sampler1D, sampler2D, sampler3D, samplerCUBE
- NV_fragment_program profile:
  - half: half-precision float
  - fixed: fixed point [-2,2)
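To make the [-2,2) range of `fixed` concrete, here is a Python sketch of quantizing a float to such a format. The slide gives only the range; the 12-bit layout with 10 fraction bits, the round-to-nearest behavior, and the saturation on overflow are assumptions made for this illustration.

```python
FRACTION_BITS = 10                 # assumption: 1 sign + 1 integer + 10 fraction bits
SCALE = 1 << FRACTION_BITS         # 1024 steps per unit
MIN_RAW = -2 * SCALE               # raw value representing -2.0
MAX_RAW = 2 * SCALE - 1            # largest raw value, just below 2.0

def to_fixed(value):
    """Convert a float to the raw integer, saturating at the range limits."""
    raw = int(round(value * SCALE))
    return max(MIN_RAW, min(MAX_RAW, raw))

def from_fixed(raw):
    """Convert a raw integer back to a float."""
    return raw / SCALE

def quantize(value):
    """Round-trip a float through the assumed fixed-point format."""
    return from_fixed(to_fixed(value))

half_value = quantize(0.5)     # exactly representable
too_large = quantize(3.0)      # saturates just below 2.0
too_small = quantize(-5.0)     # saturates at -2.0
```

The half-open interval shows up directly: -2.0 is representable, while the upper end stops one step short of 2.0.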

34 Cg Example
- The following fragment program implements a (very) simple toon shader
  - Flat 3-tone shading
    - Highlight
    - Base color
    - Shadow
  - Black silhouettes

35 Cg Example – part 1
    // In:
    //   eye_space position = TEX7
    //   eye space T = (TEX4.x, TEX5.x, TEX6.x) denormalized
    //   eye space B = (TEX4.y, TEX5.y, TEX6.y) denormalized
    //   eye space N = (TEX4.z, TEX5.z, TEX6.z) denormalized
    fragout frag program main(vf30 In) {
        float m = 30;                               // power
        float3 hiCol = float3( 1.0, 0.1, 0.1 );     // lit color
        float3 lowCol = float3( 0.3, 0.0, 0.0 );    // dark color
        float3 specCol = float3( 1.0, 1.0, 1.0 );   // specular color
        // Get eye-space eye vector.
        float3 e = normalize( -In.TEX7.xyz );
        // Get eye-space normal vector.
        float3 n = normalize(float3(In.TEX4.z, In.TEX5.z, In.TEX6.z));

36 Cg Example – part 2
        float edgeMask = (dot(e, n) > 0.4) ? 1 : 0;
        float3 lpos = float3(3,3,3);
        float3 l = normalize(lpos - In.TEX7.xyz);
        float3 h = normalize(l + e);
        float specMask = (pow(dot(h, n), m) > 0.5) ? 1 : 0;
        float hiMask = (dot(l, n) > 0.4) ? 1 : 0;
        float3 ocol1 = edgeMask * (lerp(lowCol, hiCol, hiMask) + (specMask * specCol));
        fragout O;
        O.COL = float4(ocol1.x, ocol1.y, ocol1.z, 1);
        return O;
    }
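For readers without a Cg toolchain, the toon shader's logic can be followed in this Python sketch for a single fragment. It is a hand port, not the original program. Two deviations are assumptions of the sketch: the base of the power is clamped at zero, and the final color is clamped to 1.0 to mimic a fixed-range framebuffer. The function and parameter names are hypothetical.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v) if length > 0 else tuple(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toon_shade(position, normal, light_pos=(3.0, 3.0, 3.0), m=30):
    """Shade one fragment given its eye-space position and normal."""
    hi_col = (1.0, 0.1, 0.1)    # lit color
    low_col = (0.3, 0.0, 0.0)   # dark color
    spec_col = (1.0, 1.0, 1.0)  # specular color

    e = normalize(tuple(-c for c in position))  # eye-space eye vector
    n = normalize(normal)
    edge_mask = 1.0 if dot(e, n) > 0.4 else 0.0  # 0 near silhouettes

    l = normalize(tuple(lp - p for lp, p in zip(light_pos, position)))
    h = normalize(tuple(a + b for a, b in zip(l, e)))
    # Assumption: clamp the base at 0 so a negative dot product cannot
    # produce a large positive value through the even exponent.
    spec_mask = 1.0 if max(dot(h, n), 0.0) ** m > 0.5 else 0.0
    hi_mask = 1.0 if dot(l, n) > 0.4 else 0.0

    # lerp(lowCol, hiCol, hiMask) with a 0/1 mask simply selects one color.
    base = hi_col if hi_mask == 1.0 else low_col
    rgb = tuple(min(1.0, edge_mask * (b + spec_mask * s))
                for b, s in zip(base, spec_col))
    return rgb + (1.0,)

lit = toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0))
silhouette = toon_shade((0.0, 0.0, -5.0), (1.0, 0.0, 0.0))
highlight = toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), light_pos=(0.0, 0.0, 0.0))
shadow = toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), light_pos=(10.0, 0.0, -5.0))
```

The four sample calls land in the four bands the shader is meant to produce: base color, black silhouette, highlight, and shadow.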

37 GPGPU
- The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
  - Programmability
  - Precision
  - Power
- GPGPU: an emerging field seeking to harness GPUs for general-purpose computation

38 Motivation: Computational Power
- GPUs are fast…
  - Arithmetic throughput
    - 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
    - NVIDIA GeForce 7800: 165 GFLOPS
  - Memory bandwidth
    - 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
    - ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
- GPUs are getting faster, faster
  - CPUs: 1.4× annual growth
  - GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
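The figures on this slide can be turned into ratios with a few lines of Python. The five-year horizon and the function names are assumptions of this sketch, and the compounding is a naive extrapolation of the quoted annual growth rates rather than a forecast.

```python
cpu_gflops, gpu_gflops = 24.6, 165.0        # arithmetic throughput from the slide
cpu_bandwidth, gpu_bandwidth = 8.5, 37.8    # memory bandwidth in GB/s from the slide

compute_ratio = gpu_gflops / cpu_gflops           # roughly 6.7x
bandwidth_ratio = gpu_bandwidth / cpu_bandwidth   # roughly 4.4x

def compound(annual_growth, years):
    """Total growth factor after compounding an annual rate."""
    return annual_growth ** years

def relative_gap(gpu_growth, cpu_growth, years):
    """How much further the GPU pulls ahead of the CPU over the period."""
    return compound(gpu_growth, years) / compound(cpu_growth, years)

YEARS = 5  # hypothetical horizon chosen for illustration
gap_pixels = relative_gap(1.7, 1.4, YEARS)
gap_vertices = relative_gap(2.3, 1.4, YEARS)
```

Even the lower GPU growth rate widens the gap every year, which is the point of "getting faster, faster".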

39 Motivation: Computational Power

40 Motivation: Flexible and Precise
- Modern GPUs are deeply programmable
  - Programmable pixel, vertex, video engines
  - Solidifying high-level language support
- Modern GPUs support high precision
  - 32-bit floating point throughout the pipeline
  - High enough for many (not all) applications

41 Motivation: The Potential of GPGPU
- The power and flexibility of GPUs make them an attractive platform for general-purpose computation
- Example applications range from in-game physics simulation to conventional computational science
- Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

42 Problems: Difficult To Use
- GPUs are designed for and driven by video games
  - Programming model unusual
  - Programming idioms tied to computer graphics
  - Programming environment tightly constrained
- Underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret
- Can't simply "port" CPU code!

43 GPGPU
- Why use the GPU for general-purpose computing?
- How is it programmed?

