Graphics processors Norm Rubin – compiler architect –


1 Graphics processors Norm Rubin – compiler architect – normanr@ati.com

2 Feb 15, 2005 2 Size of market: Many millions of GPUs are shipped per month. The 3D market is entertainment (games). Each new generation of GPU adds enough performance to support a new version of a game. Each time a game is released, players have to replace hardware to run it. The game industry is larger than Hollywood.

3 Technology view (GPU vs. CPU): Performance / function: not enough (GPU) vs. OK or too good (CPU). Architecture: proprietary (GPU) vs. commodity (CPU). Interfaces: mutable (GPU) vs. locked down (CPU).

4 How much headroom: Pixar uses 100,000 minutes of compute per minute of image. GPUs are real time, so the gap is 100,000x, or roughly 20 doublings. The most optimistic marketing version of Moore's law says performance doubles every 6 months, so there are 10 years to go.
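Worked arithmetic for the headroom estimate, as a minimal sketch that is not part of the original slides. It uses only the two figures quoted above (the 100,000x gap and the 6-month doubling period). The exact base-2 logarithm is a little under 17, so the slide's figure of 20 doublings reads as a generous round-up.

```python
import math

# Figures quoted on the slide.
gap = 100_000               # offline compute minutes per minute of image
months_per_doubling = 6     # the "optimistic marketing" doubling period

# Exact number of doublings needed to close the gap.
doublings_exact = math.log2(gap)
years_exact = doublings_exact * months_per_doubling / 12

# The slide's rounded figure.
doublings_slide = 20
years_slide = doublings_slide * months_per_doubling / 12

print(f"exact: {doublings_exact:.1f} doublings, {years_exact:.1f} years")
print(f"slide: {doublings_slide} doublings, {years_slide:.0f} years")
```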

5 Application space: Problems are embarrassingly parallel. Problems are big: the screen is 1000 x 1000 and the program runs per pixel, including some pixels that are behind others, so there are 10 * 1000 * 1000 calls per frame at 20-60 frames per second. The same program runs over and over, so GPUs are SIMD machines.
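A quick calculation of the workload implied by these numbers, offered as a sketch rather than anything from the talk. The function name `shader_calls_per_second` is made up for illustration; the factor of 10 for hidden pixels, the 1000 x 1000 screen, and the 20-60 frames per second range come from the slide.

```python
def shader_calls_per_second(width, height, overdraw, fps):
    """Number of per-pixel program invocations per second."""
    calls_per_frame = overdraw * width * height
    return calls_per_frame * fps

# Slide figures: 1000 x 1000 screen, factor of 10 for pixels behind others.
low = shader_calls_per_second(1000, 1000, 10, 20)
high = shader_calls_per_second(1000, 1000, 10, 60)

print(f"{low:,} to {high:,} calls per second")
```

With a per-pixel program of even a few dozen instructions, this puts the instruction rate in the billions per second, which is why the work has to be spread over many parallel units.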

6 SIMD: There are many units executing in parallel –These are in lock-step, executing the same instruction on different pixels/vertices at the same time –Dynamic flow control can cause inefficiencies in such an architecture since different pixels/vertices can take different code paths –Dynamic branching is not always a performance win –For an if…then…else, the hardware needs to execute both sides, turning processors on and off.
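The "execute both sides, turning processors on and off" behaviour can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of any particular chip: lanes are modelled as list elements, the execution mask is a list of booleans, and `simd_if_else` is a hypothetical helper name. It also counts how many sides were actually run, which shows why coherent data is cheaper than divergent data.

```python
def simd_if_else(values, cond, then_op, else_op):
    """Run an if/else over lock-step lanes using an execution mask.

    Returns the per-lane results and the number of sides executed.
    """
    mask = [cond(v) for v in values]   # one condition bit per lane
    out = list(values)
    passes = 0

    # "then" side: only lanes whose mask bit is set are switched on.
    if any(mask):
        passes += 1
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]

    # "else" side: the remaining lanes are switched on instead.
    if not all(mask):
        passes += 1
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]

    return out, passes


# Divergent lanes: both sides must run.
divergent, divergent_passes = simd_if_else(
    [1, -2, 3, -4], lambda v: v > 0, lambda v: v * 2, lambda v: -v)

# Coherent lanes: every lane takes the same side, so one pass is enough.
coherent, coherent_passes = simd_if_else(
    [1, 2, 3, 4], lambda v: v > 0, lambda v: v * 2, lambda v: -v)
```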

7 Application space: Many values are coherent: values in neighboring pixels are close. Compute coherent variables at selected points and use interpolation to find the intermediate values. Today the programmer specifies which variables are coherent by splitting programs in two.
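One common way to "compute at selected points and interpolate" is barycentric interpolation across a triangle, sketched below. This is a textbook technique chosen for illustration; the slides do not say which interpolation scheme the hardware uses, and the function names are made up.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate(p, verts, values):
    """Blend three per-vertex values at pixel position p."""
    wa, wb, wc = barycentric(p, *verts)
    return wa * values[0] + wb * values[1] + wc * values[2]


# A value computed only at the three vertices of a triangle...
verts = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
values = [0.0, 4.0, 8.0]

# ...and recovered at an interior pixel without re-running the computation.
inside = interpolate((1.0, 1.0), verts, values)
at_vertex = interpolate((0.0, 4.0), verts, values)
```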

8 Application space: A common subproblem is texture filtering –Evaluate some array of memory around a stencil and combine –Provide a small fixed set of stencil patterns in hardware –You could think of this as slightly smart memory –Hardware support for 1-3 dimensional arrays and several filtering functions –Exact stencil patterns and combining operations are proprietary (some look better than others)
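As an illustration of "evaluate memory around a stencil and combine", here is a sketch of bilinear filtering on a 2D array, which reads a 2x2 stencil and blends it by the fractional sample position. Bilinear filtering is a standard textbook filter picked for the example; the slide notes that the actual hardware patterns are proprietary, so this should not be read as any vendor's implementation. Clamping at the edges is also an assumption.

```python
def bilinear_sample(texture, u, v):
    """Sample a 2D texture (list of rows) at texel coordinates (u, v)."""
    h, w = len(texture), len(texture[0])

    # Clamp the sample position to the texture (assumed edge behaviour).
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)

    # The 2x2 stencil around the sample point.
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0

    # Combine: blend horizontally on each row, then vertically.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 10.0],
           [20.0, 30.0]]
center = bilinear_sample(texture, 0.5, 0.5)
corner = bilinear_sample(texture, 1.0, 1.0)
```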

9 Application space: Little communication between processing elements. Approximate the spatial derivative by a 2x2 difference operator. This forces all machine designs to work on multiples of four pixels.
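A sketch of the 2x2 difference operator, assuming the four pixels of a quad are laid out as two rows of two. The function name and the layout are illustrative choices. The point is that each derivative needs a neighbour's value, which is the one place pixels communicate and the reason pixels are processed four at a time.

```python
def quad_derivatives(quad):
    """Finite differences over one 2x2 pixel quad.

    quad = [[top_left, top_right], [bottom_left, bottom_right]]
    Returns (ddx, ddy): horizontal differences per row and
    vertical differences per column.
    """
    (tl, tr), (bl, br) = quad
    ddx = [tr - tl, br - bl]   # right neighbour minus left neighbour
    ddy = [bl - tl, br - tr]   # lower neighbour minus upper neighbour
    return ddx, ddy


# Some per-pixel quantity (for example a texture coordinate) on one quad.
ddx, ddy = quad_derivatives([[1.0, 3.0],
                             [2.0, 6.0]])
```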

10 Application space: Throughput is important. Use threading to cover latency. The chips can support hundreds of threads, and can switch from thread to thread every cycle –No thread switch overhead –Hardware scheduler and thread system –Compiler knows about threads and splits resources over threads. Caches are very different: they can only cover spatial locality.
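A toy model of latency hiding, included as a sketch only. It assumes each thread issues one instruction and then waits a fixed number of cycles (standing in for a memory fetch), and that the scheduler can pick any ready thread each cycle at no cost. The latency of 99 cycles and the function name are made-up illustration values, not figures from the talk.

```python
def utilization(num_threads, latency, cycles=1000):
    """Fraction of cycles in which some thread was ready to issue."""
    ready_at = [0] * num_threads   # cycle at which each thread may issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Zero-overhead switch: issue from the first ready thread,
                # then that thread stalls for `latency` cycles.
                ready_at[t] = cycle + 1 + latency
                busy += 1
                break
    return busy / cycles


one_thread = utilization(1, latency=99)
fifty_threads = utilization(50, latency=99)
hundred_threads = utilization(100, latency=99)
```

In this model a single thread keeps the unit busy 1% of the time, while 100 threads keep it fully busy, which is the motivation for supporting hundreds of threads in hardware.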

11 Programming model: Performance is much less than users want, by a minimum of 100,000 times. Most developers write each program at least four times –Xbox –Playstation –ATI top machine –Nvidia top machine. Programs are in two parts: vertex and pixel shaders.

12 Programming model 2: Programs could be written in a high-level language (C-like), HLSL/OGL2, or in virtual assembly language (DirectX, …) –Almost one dialect per chip –Virtual languages, but physical resources. Developers review virtual machine listings for performance. Developers ship virtual assembly language.

13 Programming model 3: At game startup, virtual assembly language is JIT compiled to real machine language –Drastic change in resource requirements –Somewhat hard to debug –Hard to identify performance bottlenecks. Even though applications could build code on the fly, developers pretest everything: they want the most performance to get the best-looking image. Only approximate what they really want.

14 Programmable Pipeline (diagram labels): Vertex Data (Model space); Fixed Function Transform and Lighting; Clipping and Viewport Mapping; Texture Stages; Fog, Alpha, Stencil Depth Testing; Geometry Stage; Rasterizer Stage; Vertex Shader; Pixel Shader

15 Vertex Processing Flow (diagram labels): Per-Vertex Data: Position, Normal, Texture Coordinates, etc.; Constants: View Matrix, Projection Matrix, Skin/Bone Matrices, Light Positions, etc.; Temporary Registers; Vertex Shader Instructions; Triangle Mesh; Vertex Shader Engine; Position, “Texture” Coordinates, Color(s)

16 Vertex Shader: Input: –Program specifies vertex data: position, normal, vertex color, texture coordinate(s), … –Data is sent to the graphics card and processed by the vertex shader. Output: –Vertex shader computes output quantities: position, vertex color (diffuse and specular), texture coordinate(s) –Sent to rasterizer via interpolators
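A minimal software sketch of what a vertex shader does with these inputs and outputs, written in Python for readability rather than in a real shading language. Everything here is illustrative: the `constants` dictionary keys, the combined view-projection matrix, and the single-light diffuse term are assumptions, not the shader from the talk.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def vertex_shader(position, normal, uv, constants):
    """Per-vertex program: transform the position and light the vertex."""
    # Position: model space to clip space via a constant matrix.
    clip_pos = mat_vec(constants["view_proj"], position + [1.0])

    # Diffuse color: clamped N dot L against one directional light.
    n_dot_l = sum(n * l for n, l in zip(normal, constants["light_dir"]))
    diffuse = [max(n_dot_l, 0.0) * c for c in constants["light_color"]]

    # Texture coordinates pass through to the interpolators unchanged.
    return {"position": clip_pos, "diffuse": diffuse, "uv": uv}


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
constants = {
    "view_proj": identity,
    "light_dir": [0.0, 0.0, 1.0],
    "light_color": [1.0, 0.5, 0.25],
}
lit = vertex_shader([1.0, 2.0, 3.0], [0.0, 0.0, 1.0], [0.5, 0.5], constants)
unlit = vertex_shader([1.0, 2.0, 3.0], [0.0, 0.0, -1.0], [0.5, 0.5], constants)
```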

17 Pixel Processing Flow (diagram labels): Temporary Registers; “Texture” Coordinates, Color(s); Constants: Light Colors, Ambient Lighting Colors, etc.; Pixel Shader Instructions; Interpolated Values; Textures; Pixel Shader Engine; Color; Multi-Render Target

18 Program sizes: Most programs are very small; 100 virtual instructions would be a large program. The basic data type is a four-element vector of floats. Integer data types are not yet available. Dynamic branching is new. A small amount of nesting is allowed.

19 Polygons: Polygon Budget –Ruby: 75,000 –Optico: 50,000 –Ninja: 25,000 –Environment: 100,000 –Props: 50,000. Lighting Limits –3 dynamic lights per shot (1 shadow casting) –Lightmaps used for set. Animation Limits –35 total blend shapes –5 simultaneous blend shapes –4 weighted bones per vertex –Number of on-screen characters limited to 4 at once

20 Shader Breakdown: Depth of Field, Hair, Skin

21 Depth Of Field

22 Depth Of Field

23 Hair Model Authoring: Several layers of patches to approximate volumetric qualities of hair. Ambient occlusion to approximate self-shadowing –Per vertex. Why Polygons –Lower geometric complexity than line rendering –Makes depth sorting faster –Integrates well into our art pipeline

24 Shader Breakdown: Glows, Motion Blur, Reflections

25 Glows

26 Motion Blur

27 Reflections

28 Hardware view: X1900 and Xbox 360. Both machines are current.

29 Radeon X1800 3D Architecture: 16 Pixel Shader Processors –Ultra-Threading Dispatch Processor –4 Shader Cores. 8 Vertex Shader Processors. 16 Texture Address Units. 16 Texture Units. 16 Render Back-End Units.

30 X1900

31 Quad Pixel Shader Core: Pixel Shader Processor, per clock cycle: 1 vec3 ADD + input modifier; 1 scalar ADD + input modifier; 1 vec3 ADD/MUL/MADD; 1 scalar ADD/MUL/MADD; 1 flow control instruction. Texture Address Units: 1 texture address instruction per unit per clock cycle. Diagram labels: Texture Address Unit 1, 2, 3, 4; Pixel Shader Processors.

32 Vertex Engine: Upgraded to support SM3.0 –Dynamic flow control –1,024 instructions (practically unlimited with flow control) –More temporary registers. 8 Vertex Shader Processors –Each can handle 2 shader instructions per clock –10 billion instructions per second
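A back-of-the-envelope check on these figures, as a sketch that is not part of the slides. The slide does not state a clock rate; dividing the quoted throughput by the quoted issue width gives the clock it implies, on the assumption that all three numbers describe peak rates on the same part.

```python
# Figures quoted on the slide.
processors = 8                 # vertex shader processors
instr_per_clock = 2            # shader instructions per processor per clock
instr_per_second = 10e9        # quoted peak throughput

# Clock rate implied by the three figures above.
implied_clock_hz = instr_per_second / (processors * instr_per_clock)

print(f"implied clock: {implied_clock_hz / 1e6:.0f} MHz")
```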

33 Ring Bus Memory Controller: Supports today's fastest graphics memory devices –GDDR3, 48+ GB/sec –GDDR4, the future. 512-bit Ring Bus –Simplifies layout and enables extreme memory clock scaling. New Cache Design –Fully associative for more optimal performance. Improved Hyper Z –Better compression and hidden surface removal. Programmable Arbitration Logic –Maximizes memory efficiency –Can be upgraded via software

34 Memory Channels - 4x Improvement in Random Access over X850: Radeon X850: 4x64-bit channels, 4 banks per DRAM. Radeon X1900: 8x32-bit channels, 8 banks per DRAM.
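A small calculation comparing the two channel layouts, offered as a sketch. Both add up to the same total bus width; what changes is the number of independent channels. The last two lines additionally assume that the "48+ GB/sec" GDDR3 figure from the Ring Bus slide applies to this 256-bit interface, which the slides do not state explicitly.

```python
def bus_width_bits(channels, bits_per_channel):
    """Total external memory bus width."""
    return channels * bits_per_channel


x850_width = bus_width_bits(4, 64)     # 4 x 64-bit channels
x1900_width = bus_width_bits(8, 32)    # 8 x 32-bit channels

# Same total width, but twice as many channels that can be
# addressed independently.
channel_ratio = 8 / 4

# Assumption: 48 GB/s is delivered over the full 256-bit interface.
bytes_per_transfer = x1900_width // 8
transfers_per_second = 48e9 / bytes_per_transfer
```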

35 Cache Design (diagram labels: Graphics Memory; Cache; Direct Mapped Cache; Fully Associative Cache): Fully Associative Caches –Cache lines can map to any location in external memory –Earlier designs used Direct Mapped & N-Way Associative Caches –Could only access limited blocks of external memory. Texture, Color, Z & Stencil caches are all now fully associative –Reduces memory bandwidth requirements –Minimizes cache contention stalls –Optimized game performance –Gains up to 25% clock for clock in fill/bandwidth bound cases
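The contention problem with direct-mapped caches can be shown with a tiny simulation, sketched below. The addresses are block numbers, the cache size of 8 lines is arbitrary, and LRU replacement for the fully associative cache is an assumption made for the example; the slides do not describe the replacement policy.

```python
from collections import OrderedDict


def direct_mapped_misses(addresses, num_lines):
    """Each block can live in exactly one line: address mod num_lines."""
    lines = [None] * num_lines
    misses = 0
    for a in addresses:
        idx = a % num_lines
        if lines[idx] != a:
            misses += 1
            lines[idx] = a
    return misses


def fully_associative_misses(addresses, num_lines):
    """Any block can live in any line; evict the least recently used."""
    cache = OrderedDict()
    misses = 0
    for a in addresses:
        if a in cache:
            cache.move_to_end(a)           # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used
            cache[a] = True
    return misses


# Two blocks that collide in a direct-mapped cache of 8 lines.
pattern = [0, 8, 0, 8, 0, 8]
dm = direct_mapped_misses(pattern, 8)
fa = fully_associative_misses(pattern, 8)
```

In this access pattern the direct-mapped cache misses on every access because both blocks compete for line 0, while the fully associative cache misses only on the first touch of each block.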

36 Xbox: 3.2GHz Custom IBM Central Processor: three CPU cores; two threads per core; VMX unit per core; 128 VMX registers per thread; 1MB L2 cache (lockable by graphics processor). 500MHz Custom ATI Graphics Processor: unified shader core; 48 ALUs for vertex or pixel shader processing; 16 filtered & 16 unfiltered texture samples per clock; 10MB eDRAM framebuffer. 512MB System RAM: Unified Memory Architecture (UMA); 128-bit interface; 700MHz GDDR3 RAM.
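The system memory bandwidth follows from the interface figures on this slide. The sketch below uses the quoted 128-bit width and 700MHz clock, plus the fact that GDDR3 is a double-data-rate memory (two transfers per clock), which the slide does not spell out.

```python
# Figures quoted on the slide.
interface_bits = 128
memory_clock_hz = 700e6

# GDDR3 transfers data on both clock edges.
transfers_per_clock = 2

bytes_per_transfer = interface_bits // 8
bandwidth_bytes_per_s = bytes_per_transfer * memory_clock_hz * transfers_per_clock

print(f"{bandwidth_bytes_per_s / 1e9:.1f} GB/s")
```

Because the memory is unified, this bandwidth is shared between the CPU and the graphics processor, which helps explain the separate 10MB eDRAM framebuffer.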

37 Architecture (diagram labels): Command Processor; Memory Hub; Vertex Grouper; Primitive Assembly; Shader Interp (two blocks); Sequencer; Shader Pipe (x16) (three blocks); Vertex Cache; Texture Pipe (four blocks); Pipe Comm; 256 GB/sec; Texture Cache; Scan Converter; Z/Alpha/Stencil Processors; 10MB DRAM

38 Adaptive Shader Array: Unified shader architecture. One processor type. Dynamic load balancing. Pixel and vertex processing where and when they're needed –48 shaders. 120 billion operations per second.
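A rough reading of the throughput figure, as a sketch under one explicit assumption: that this slide describes the same 500MHz, 48-ALU graphics processor listed on slide 36. Under that assumption, the quoted rate works out to a fixed number of operations per ALU per clock.

```python
# Quoted on this slide.
shaders = 48
ops_per_second = 120e9

# Assumption: clock rate taken from the 500MHz figure on slide 36.
clock_hz = 500e6

alu_clocks_per_second = shaders * clock_hz
ops_per_alu_per_clock = ops_per_second / alu_clocks_per_second

print(f"{ops_per_alu_per_clock:.1f} operations per ALU per clock")
```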

40 Some interesting problems: Coherence (branch prediction?). What are the right instructions? Can you do non-graphics applications? Programming language. Threading by compiler. Off-line compile?

41 Implications for programming languages: GPU: you can convince people to use a new language if you can prove it is faster, even if it means lots of changes. Desktop CPU: you have to prove it can meet some other (non-performance/function) need. The top-of-the-line price for GPUs is going up while the top-of-the-line desktop CPU price is going down, so there is lots of chance to do cool design. Less need to be backward compatible.

42 More info: http://www.ati.com/developer/index.html

