Workload Characterization of 3D Games

Slides:

Advertisements

Similar presentations

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Advertisements

Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.

Graphics Pipeline.

Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.

RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Rasterization and Ray Tracing in Real-Time Applications (Games) Andrew Graff.

1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

Status – Week 243 Victor Moya. Summary Current status. Current status. Tests. Tests. XBox documentation. XBox documentation. Post Vertex Shader geometry.

3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.

Status – Week 231 Victor Moya. Summary Primitive Assembly Primitive Assembly Clipping triangle rejection. Clipping triangle rejection. Rasterization.

IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.

Status – Week 277 Victor Moya.

GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.

1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Department of Computer.

Status – Week 208 Victor Moya. Summary Traces. Traces. Planification. Planification.

Evolution of the Programmable Graphics Pipeline Patrick Cozzi University of Pennsylvania CIS Spring 2011.

Status – Week 283 Victor Moya. 3D Graphics Pipeline Akeley & Hanrahan course. Akeley & Hanrahan course. Fixed vs Programmable. Fixed vs Programmable.

The Graphics Pipeline CS2150 Anthony Jones. Introduction What is this lecture about? – The graphics pipeline as a whole – With examples from the video.

The programmable pipeline Lecture 10 Slide Courtesy to Dr. Suresh Venkatasubramanian.

Some Things Jeremy Sugerman 22 February Jeremy Sugerman, FLASHG 22 February 2005 Topics Quick GPU Topics Conditional Execution GPU Ray Tracing.

1 Attila Research Group Computer Architecture Department Univ Politècnica de Catalunya (UPC)

Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.

GPU Tutorial 이윤진 Computer Game 2007 가을 2007 년 11 월 다섯째 주, 12 월 첫째 주.

GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.

High Performance in Broad Reach Games Chas. Boyd

© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.

REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.

May 8, 2007Farid Harhad and Alaa Shams CS7080 Over View of the GPU Architecture CS7080 Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad &

Graphics Hardware and Graphics in Video Games COMP136: Introduction to Computer Graphics.

Kenneth Hurley Sr. Software Engineer

Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.

1 ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca,

Week 2 - Friday.  What did we talk about last time?  Graphics rendering pipeline  Geometry Stage.

The Graphics Rendering Pipeline 3D SCENE Collection of 3D primitives IMAGE Array of pixels Primitives: Basic geometric structures (points, lines, triangles,

1 Attila Research Group attila.ac.upc.edu Computer Architecture Department Univ Politècnica de Catalunya (UPC)

Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.

Tone Mapping on GPUs Cliff Woolley University of Virginia Slides courtesy Nolan Goodnight.

GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.

CS662 Computer Graphics Game Technologies Jim X. Chen, Ph.D. Computer Science Department George Mason University.

OpenGL Architecture Display List Polynomial Evaluator Per Vertex Operations & Primitive Assembly Rasterization Per Fragment Operations Frame Buffer Texture.

Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.

May 8, 2007Farid Harhad and Alaa Shams CS7080 Overview of the GPU Architecture CS7080 Final Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad.

A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.

Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.

Maths & Technologies for Games Graphics Optimisation - Batching CO3303 Week 5.

Fateme Hajikarami Spring  What is GPGPU ? ◦ General-Purpose computing on a Graphics Processing Unit ◦ Using graphic hardware for non-graphic computations.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.

Ray Tracing using Programmable Graphics Hardware

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 GPU.

COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE

Week 2 - Friday CS361.

The Graphic PipeLine

A Crash Course on Programmable Graphics Hardware

Graphics on GPU © David Kirk/NVIDIA and Wen-mei W. Hwu,

Graphics Processing Unit

Introduction to OpenGL

Chapter 6 GPU, Shaders, and Shading Languages

From Turing Machine to Global Illumination

Understanding Theory and application of 3D

Graphics Hardware CMSC 491/691.

Graphics Processing Unit

RADEON™ 9700 Architecture and 3D Performance

CIS 441/541: Introduction to Computer Graphics Lecture 15: shaders

Introduction to OpenGL

Balancing the Graphics Pipeline for Optimal Performance

CIS 6930: Chip Multiprocessor: GPU Architecture and Programming

Presentation transcript:

Workload Characterization of 3D Games Jordi Roca, Victor Moya, Carlos González, Chema Solis, Agustín Fernandez and Roger Espasa (Intel DEG Barcelona) Computer Architecture Department

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Introduction Games and GPU evolve fast GPUs cater for game demands: Better effects (flexible programming models) Higher fill-rate (more processing power) Higher quality (HDR, MSAA, AF) Games highly tuned to released GPUs New characterization needed for every Game and GPU generation.

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Game workload selection Game/Timedemo Frames Duration at 30 fps Texture Quality Aniso Level Shaders Graphics API Engine Release Date UT2004/Primeval 1992 1’ 06” High/Aniso 16X NO OpenGL Unreal 2.5 Mar 2004 Doom3/trdemo1 3464 1’ 55” YES Doom3 Aug 2004 Doom3/trdemo2 3990 2’ 13” Quake4/demo4 2976 1’ 39” Oct 2005 Quake4/guru5 3081 1’ 43” Riddick/MainFrame 1629 0’ 54” High/Trilinear - Starbreeze Dec 2004 Riddick/PrisonArea 2310 1’ 17” FEAR/built-in demo 576 0’ 19” Direct3D Monolith FEAR/interval2 2102 1’ 10” Half Life 2 LC/built-in 1805 1’ 00” Valve Source Oblivion/Anvil Castle 2620 1’ 27” Gamebryo Mar 2006 Splinter Cell 3/first level 2970 Unreal 2.5++ Mar 2005 Resolution: 1024x768

Statistics environment (OpenGL) Collect GLInterceptor Trace ATI R520/NVidia G70 Framebuffer CHECK! Vendor OGL Driver GLPlayer Verify Signal Visualizer μ-arch stats Signal Traffic Simulate ATTILA OGL Driver ATTILA Simulator Framebuffer CHECK! Analyze OpenGL API call stats OGL Application OGL Application ATI R520/NVidia G70 Framebuffer Vendor OGL Driver GLInterceptor GLPlayer OpenGL API call stats ATI R520/NVidia G70 Framebuffer Vendor OGL Driver ATI R520/NVidia G70 Framebuffer Vendor OGL Driver

Statistics environment (Direct3D) Collect Verify Simulate Analyze D3D Application PIXRun Trace Microsoft PIX Direct3D API call stats DXPlayer Microsoft D3D Driver Microsoft D3D Driver ATI R520/NVidia G70 ATI R520/NVidia G70 Framebuffer Framebuffer CHECK!

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

System → GPU traffic Old games (Voodoo) New games (GeForce) Vertex processing Done in CPU Done In GPU (T&L) Vertex data communication Every frame At startup Vertex data storage System memory Local GDDR memory Rendering action Sends transformed data Sends indices to data to transform Proper analysis Vertex data BW * Index data BW * T. Mitra. T. Chiueh, “Dynamic 3D Graphics Workload Characterization and the architectural implications”, MICRO ‘99

System → GPU traffic Index BW Game/Timedemo Avg. batches per frame Avg. indexes per batch Avg. indexes per frame Bytes per index Index BW at 100fps PCIExpress x16 usage (4 Gb/s) Triangle List Triangle Strip Triangle Fan UT2004/Primeval 229 1110 249285 2 50 MB/s 1.3% 99.9% 0.1% Doom3/trdemo1 776 275 196416 4 79 MB/s 2.0% 100% Doom3/trdemo2 483 304 136548 55 MB/s 1.4% Quake4/demo4 423 405 172330 69 MB/s 1.7% Quake4/guru5 834 166 135051 54 MB/s Riddick/MainFrame 676 356 214965 43 MB/s 1.1% Riddick/PrisonArea 363 658 239425 48 MB/s 1.2% FEAR/built-in demo 488 641 331374 66 MB/s FEAR/interval2 294 1085 307202 61 MB/s 1.5% 96.7% 3.3% Half Life 2 LC/built-in 441 736 328919 Oblivion/Anvil Castle 564 998 711196 142 MB/s 3.4% 46.3% 53.7% Splinter Cell 3/first level 563 308 177300 35 MB/s 0.9% 69.1% 26.7% 4.2%

System → GPU traffic Post-T&L vertex cache 66% hit rate Index Buffer Vertex data Fetcher Memory Vertex shader (T&L) Primitive Assembly For adjacent triangles lists: 2/3 of referenced vertexes already computed : 66% hit rate v1 v2 v3 v4

Post-T&L vertex cache experiments System → GPU traffic Post-T&L vertex cache experiments Results show expected hit rate Game preference for triangle lists: Low Bus BW usage related to index sent Same vertex computation work as with strips or fans using a Post-T&L vertex cache Triangle lists are easier managed by modeling tools.

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Primitive culling efficiency Assembled triangles Traversed triangles Game/timedemo %rejected %traversed %clipped %culled UT2004/Primeval 30% 21% 49% Doom3/trdemo2 37% 28% 35% Quake4/demo4 51% Clipping/Culling intensively used by our games. Quake4: half of the polygons lie out of the view volume. Game renderer engines let GPU do the important clipping/culling work: Easier and cheaper in GPU Hardware.

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Rasterization pipeline The Basics Triangles are broken into quads (2x2 fragments) Boundaries generate non-full quads Quad frags are tested individually in different stages: Z test (hidden surfaces),Stencil test, Alpha Test (transparency), Color Mask. Finally alive frags update framebuffer Empty quads are not further processed

Rasterization pipeline Experimentation Quad generation efficiency: Game/timedemo Avg Triangle Size Quad Efficiency UT2004/Primeval 652 92% Doom3/trdemo2 2117 93% Quake4/demo4 1232 Higher efficiency than reported in [Mitra 99] Results show between 40 and 60% efficiencies. Interactive 3D games use less detailed 3D models (larger triangles).

Rasterization pipeline Doom3 and Quake4 Polygon rasterization overhead due to stencil shadow volumes (SSV)

Rasterization pipeline Fragment rejection breakdown: Game/timedemo Rejected Fragments Blended Fragments HZ Z&Stencil Alpha Color Mask = FALSE UT2004/Primeval 38% 2% 4.15% 0% 56% Doom3/trdemo2 34% 14% 0.03% 18% Quake4/demo4 42% 21% 0.32% 19% On-die HZ greatly reduces GDDR BW avoiding Z&Stencil buffer accesses. In SSV games: Still room for higher BW reduction with HZ performing also Stencil test

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Fragment shading & texturing Texture filtering cost measured in bilinears: Bilinear filtering: 1 bilinear (constant) Trilinear filtering: 2 bilinears (constant) Anisotropic filtering: from 2 up to 32 bilinears (variable) Texture pipelines can usually execute 1 bilinear/cycle

Fragment shading & texturing ALU to Texture Ratio Game/Timedemo Instructions Texture requests ALU to Texture Ratio UT2004/Primeval 4.6 1.5 2.0 Doom3/trdemo1 12.9 4.0 2.2 Doom3/trdemo2 13.0 2.3 Quake4/demo4 16.3 4.3 2.8 Quake4/guru5 17.2 4.5 Riddick/MainFrame 14.6 1.9 6.6 Riddick/PrisonArea 13.6 1.8 6.4 FEAR/built-in demo 21.3 FEAR/interval2 19.3 2.7 6.1 Half Life 2 LC/built-in 19.9 3.9 4.1 Oblivion/Anvil Castle 15.5 1.4 10.4 Splinter Cell 3/first level 2.1 1.2 Game/timedemo Bilinear samples per tex. request UT2004/Primeval 5.2 Doom3/trdemo2 4.4 Quake4/demo4 4.7 Game/timedemo ALU instructions per bilinear request UT2004/Primeval 0.4 Doom3/trdemo2 0.5 Quake4/demo4 0.6 ATI Xenos, RV530, R580 peak performance: Up to 3 ALU instructions per bilinear 80% ALU power not used

Outline Introduction Game selection & stats gathering Game analysis System → GPU traffic Primitive culling efficiency Rasterization pipeline Fragment shading & texturing Memory usage Conclusions

Memory usage Memory Hierarchy: Hit rate and miss BW: Cache Size Way Line Size Z&Stencil 16Kb 64 256B Texture L0/L1 4Kb/16Kb 16/16 64B/64B Color Specialized features: Fast clears Transparent compression Hit rate and miss BW: Game/ timedemo Z&Stencil Texture Color % Read % Write BW@ 100fps % BW Hit rate UT2004/Primeval 15% 94.9% 42% 97.7% 35% 93.7% 73% 27% 8 GB/s Doom3/trdemo2 54% 91.0% 26% 99.2% 93.2% 63% 37% 11 GB/s Quake4/demo4 51% 93.4% 23% 99.3% 17% 62% 38% 10 GB/s In non-SSV games (UT2004): Most demanding stages: Texture, Color. In SSV games (Doom3, Quake4) The most demanding stage: Z&Stencil (50%!!)

Conclusions characterization of 3D games will lead future GPU designs. 2. studying arch improvements in our research group. Towards GPGPU

Conclusions Do our 3D games use GPU resources efficiently? The results The numbers Low CPU ↔ GPU traffic when carrying idx data 1.5% PCIE x16 BW Effective Post-T&L vtx cache with TLs. 66% hit rate Clipping/Culling stages are shown very effective 51% to 72% of polygon reduction On-die HZ greatly reduce GDDR BW because Z&Stencil is the most demanding stage 53% of total BW in Doom3 High quad efficiency 91% to 93% ALU processing power is underutilised in fragment processing 80% ALU power unused

Conclusions Some inferred implications Experimental Observations Implications/Solutions Games using SSV stress Z&Stencil the most (becomes the most GDDR BW demanding stage) Improving HZ (i.e: supporting also stencil) would reduce even more total GDDR BW Fragment processing does not exploit ALU processing power Increase ALU to Texture ratio in fragment programs (newer games tend to it) or Reduce bilinears cost in anisotropic sampling.