ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. Victor Moya, Carlos González, Jordi Roca, Agustín Fernández

1 ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández (Department of Computer Architecture, UPC)
Roger Espasa (Intel DEG Barcelona)

2 Introduction
- Graphics rendering and GPUs have become a key component of consumer computers
  - Game PCs
  - Microsoft VISTA
  - Videogame consoles
  - Portable videogame consoles
  - Mobile phones
- Meanwhile, the GPU is also reaching the HPC segment
  - General Purpose GPU
  - ClawHMMer: A Streaming HMMer-Search Implementation, Supercomputing 2005, Daniel Reiter Horn, Mike Houston, and Pat Hanrahan
- A complete, accurate and flexible framework for GPU research becomes essential

3 Outline
- ATTILA GPU Architecture
  - Classic Non Unified Shader Architecture
  - Unified Shader Architecture
- ATTILA Simulator
- ATTILA OpenGL Framework
- Experiments & Statistics
- Future Work

5 ATTILA
- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not an exact match to either pipeline
    - Lack of detailed microarchitecture information
    - Educated guessing on our side
- Implemented Features
  - 2D Homogeneous Recursive Rasterization
  - Tiled Rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear

7 Attila Classic, Specialized Shaders (block diagram). Units shown: Vertex Fetch, Vertex Shader (x4), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x4), ROP, Memory Controller (x4).

9 Attila Unified, Unified Shader Pool (block diagram). Units shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, Memory Controller (x4).

10 Outline
- ATTILA GPU Architecture
  - Classic Non Unified Shader Architecture
  - Unified Shader Architecture
- ATTILA Simulator
- ATTILA OpenGL Framework
- Experiments & Statistics
- Future Work
- Conclusion

11 Signals and Boxes
- The simulator was implemented using a signal and box paradigm
  - Based on Asim: A Performance Model Framework, Joel Emer et al., IEEE Computer, February 2002
- The different hardware units and/or stages are implemented as boxes
  - Store data, control the data flow, perform computations
    - Registers
    - Queues
    - ALUs
- The communication between the hardware stages is implemented as signals
  - Carry a limited amount of data between two boxes with a delay
    - Wires
    - Buses

12 Boxes
- Each box has a unique name
- Boxes have a large number of configuration parameters (up to 20)
- The box state is updated every clock cycle
  - A box implements what can be done or accessed in a single cycle in a real chip
- The communication with other boxes is carried through signals
  - A box has a set of input signals
  - A box has a set of output signals
- Boxes implement the data storage
- Boxes perform calls to emulator classes to perform the hardware functionality at the required point
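The box rules above can be sketched in a few lines of Python. This is only an illustration: `Box`, `CycleCounter`, `clock` and `run` are hypothetical names chosen for the sketch and are not taken from the ATTILA sources.

```python
from abc import ABC, abstractmethod


class Box(ABC):
    """Hypothetical base class: one hardware unit, updated once per cycle."""

    _names = set()  # enforces the "unique name" rule from the slide

    def __init__(self, name):
        if name in Box._names:
            raise ValueError(f"duplicate box name: {name}")
        Box._names.add(name)
        self.name = name

    @abstractmethod
    def clock(self, cycle):
        """Do the work this unit could do in a single cycle."""


class CycleCounter(Box):
    """Toy box whose only state is a register incremented every cycle."""

    def __init__(self, name):
        super().__init__(name)
        self.value = 0

    def clock(self, cycle):
        self.value += 1


def run(boxes, cycles):
    # Simulator core loop: every box is clocked exactly once per cycle.
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


counter = CycleCounter("Counter0")
run([counter], cycles=10)
```

Keeping all per-cycle work inside `clock` is what lets a driver loop advance the whole model one cycle at a time.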

13 Signals
- Each signal has a unique name
- Signal parameters:
  - Latency: fixed or variable
  - Bandwidth
- The signal state is updated each time a box reads from or writes to the signal
  - May not happen every cycle
- Carries a limited number of data objects from a source box to a destination with a time delay
  - A signal performs redundant checks with every operation to detect data losses or violations of its restrictions
- The traffic carried through signals can be dumped to a file each cycle for debugging
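A minimal sketch of a fixed-latency signal with the bandwidth and data-loss checks described above. The class layout and method names are assumptions made for illustration, variable latency is left out, and writes are assumed to arrive in non-decreasing cycle order.

```python
from collections import deque


class Signal:
    """Hypothetical fixed-latency signal between two boxes."""

    def __init__(self, name, bandwidth, latency):
        self.name = name
        self.bandwidth = bandwidth   # max objects written per cycle
        self.latency = latency       # cycles between write and read
        self._in_flight = deque()    # (arrival_cycle, data) in write order
        self._last_write_cycle = None
        self._writes_this_cycle = 0

    def write(self, cycle, data):
        if cycle != self._last_write_cycle:
            self._last_write_cycle = cycle
            self._writes_this_cycle = 0
        # Check 1: never accept more objects per cycle than the bandwidth.
        if self._writes_this_cycle >= self.bandwidth:
            raise RuntimeError(f"{self.name}: bandwidth exceeded at cycle {cycle}")
        self._writes_this_cycle += 1
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        # Check 2: data that arrived earlier and was never read is lost.
        if self._in_flight and self._in_flight[0][0] < cycle:
            raise RuntimeError(f"{self.name}: data lost before cycle {cycle}")
        arrived = []
        while self._in_flight and self._in_flight[0][0] == cycle:
            arrived.append(self._in_flight.popleft()[1])
        return arrived


sig = Signal("FetchToDecode", bandwidth=2, latency=3)
sig.write(0, "inst0")
sig.write(0, "inst1")
early = sig.read(2)   # nothing has arrived yet
ready = sig.read(3)   # both objects arrive after 3 cycles
```

Raising on a violated restriction, instead of silently dropping data, makes a timing bug in a box visible at the cycle where it happens.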

14 Signals and Boxes (diagram). Boxes shown: Shader Fetch, Shader Decode Execute. Signals shown: next instruction, feedback from decode, shader input, feedback to producer, shader output, feedback from consumer, ALU latency, texture access, feedback from texture unit, texture sample.

15 ATTILA Boxes (diagram). Boxes shown: Streamer Fetch, Streamer Output Cache, Streamer Commit, Streamer Loader, Primitive Assembly, Clipper, Triangle Setup, Fragment Generator, Hierarchical Z, Shader Fetch, Shader Decode Execute, Texture Unit, Fragment FIFO, Interpolator, Z Stencil Test, Color Write, DAC, Command Processor, Memory Controller. Group labels: STREAMER/VERTEX FETCH, SHADER.

16 Emulation Libraries
- The emulation of the rendering operations is implemented in separate emulation classes
- The simulator boxes call the emulation classes to emulate the GPU functionality at the same points where a real GPU performs it
- Emulation Classes
  - ClipperEmulator
  - RasterizerEmulator
  - ShaderEmulator
  - TextureEmulator
  - FragmentOpEmulator
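The split between timing (boxes) and functionality (emulation classes) might look like the sketch below. `AluEmulator` and `ExecuteBox` are hypothetical stand-ins, not the real ShaderEmulator interface. The point is only that the box decides when a result appears and the emulator decides what it is.

```python
class AluEmulator:
    """Functional model only: computes results, knows nothing about time."""

    def execute(self, op, a, b):
        if op == "add":
            return a + b
        if op == "mul":
            return a * b
        raise ValueError(f"unknown op: {op}")


class ExecuteBox:
    """Timing model only: decides when the emulator is called."""

    def __init__(self, emulator, alu_latency):
        self.emulator = emulator
        self.alu_latency = alu_latency
        self.pending = []   # (finish_cycle, op, a, b)
        self.results = []   # (cycle, value) pairs in completion order

    def issue(self, cycle, op, a, b):
        self.pending.append((cycle + self.alu_latency, op, a, b))

    def clock(self, cycle):
        still_pending = []
        for finish, op, a, b in self.pending:
            if finish == cycle:
                # The value is computed at the cycle the modeled ALU
                # would deliver it.
                self.results.append((cycle, self.emulator.execute(op, a, b)))
            else:
                still_pending.append((finish, op, a, b))
        self.pending = still_pending


box = ExecuteBox(AluEmulator(), alu_latency=4)
box.issue(0, "mul", 3, 5)
for cycle in range(6):
    box.clock(cycle)
```

With this separation the same emulator can be reused unchanged when the latency or structure of the timing model is modified.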

17 Support Classes
- OptimizedDynamicMemory
  - Cheap memory allocation for dynamic data objects
- DynamicObject
  - Abstract class for the data objects carried through the signals
  - Stores information to be used for the signal traffic dump
- SignalBinder
  - Signal name directory
  - Supports dumping the signal traffic every cycle
- Statistics
  - Abstract class for simulator statistics
- StatisticManager
  - Statistics name directory
  - Supports dumping the statistics values to a file every N cycles
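As an illustration of the "name directory plus dump every N cycles" idea, here is a small sketch. The class names, the semicolon-separated output format and the reset after each dump are assumptions made for the example and may differ from the real StatisticManager.

```python
import io


class Statistic:
    """Hypothetical named counter."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class StatisticsDirectory:
    """Hypothetical directory that dumps all counters every `interval` cycles."""

    def __init__(self, out, interval):
        self.out = out
        self.interval = interval
        self.stats = {}

    def get(self, name):
        # Name directory: the same name always returns the same counter.
        if name not in self.stats:
            self.stats[name] = Statistic(name)
        return self.stats[name]

    def clock(self, cycle):
        if (cycle + 1) % self.interval == 0:
            row = ";".join(f"{n}={s.value}" for n, s in self.stats.items())
            self.out.write(f"{cycle + 1};{row}\n")
            for s in self.stats.values():
                s.value = 0  # assumption: counters restart each interval


out = io.StringIO()
directory = StatisticsDirectory(out, interval=4)
fragments = directory.get("Fragments")
for cycle in range(8):
    fragments.inc(2)
    directory.clock(cycle)
dump = out.getvalue()
```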

18 Simulator Class Hierarchy (diagram). Classes shown: OptimizedDynamicMemory, Signal, Box, Statistics, DynamicObject, FragmentInput, SignalBinder, FragmentGenerator, ShaderFetch, TextureUnit, StatisticsManager, Fragment, RasterizerEmulator, ShaderEmulator, TextureEmulator.

20 Collect, Verify, Simulate, Analyze (framework diagram)
- Collect: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Framebuffer; produces the Trace
- Verify: GLPlayer, Vendor OpenGL Driver, ATI R520/NVidia G70, Framebuffer
- Simulate: ATTILA OpenGL Driver, ATTILA Simulator, Framebuffer
- Analyze: Signal Visualizer, Statistics, Signal Traffic
- CHECK!

21 GL Interceptor
- Capture a trace of OpenGL API calls from a real game
- Gather statistics from the game execution
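The interception idea can be sketched as a proxy that records each call and then forwards it to the real driver. The names below (`CallRecorder`, `StubDriver`, `clear`, `draw_arrays`) are hypothetical. The real GLInterceptor sits in front of the OpenGL library, which this toy does not attempt to model.

```python
from collections import Counter


class CallRecorder:
    """Hypothetical interceptor: logs each API call, then forwards it."""

    def __init__(self, driver):
        self._driver = driver
        self.trace = []          # ordered (name, args) records
        self.stats = Counter()   # how many times each call was seen

    def __getattr__(self, name):
        # Only reached for names not defined on the recorder itself.
        real = getattr(self._driver, name)

        def intercepted(*args):
            self.trace.append((name, args))
            self.stats[name] += 1
            return real(*args)

        return intercepted


class StubDriver:
    """Stand-in for the vendor driver; it only counts what it receives."""

    def __init__(self):
        self.received = 0

    def clear(self, mask):
        self.received += 1

    def draw_arrays(self, mode, first, count):
        self.received += 1


driver = StubDriver()
gl = CallRecorder(driver)
gl.clear(0x4000)
gl.draw_arrays(4, 0, 36)
gl.draw_arrays(4, 36, 12)
```

Because every call is still forwarded, the application keeps running normally while the trace and the per-call statistics accumulate.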

22 GL Player
- Verification of the captured trace
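Verification by replay can be illustrated with a toy player that feeds the same trace to two backends and compares the resulting framebuffers. `ToyDriver`, its one-dimensional framebuffer and the trace format are assumptions for this sketch, not the GLPlayer file format.

```python
def replay(trace, driver):
    """Re-issue every recorded call against `driver` and return its framebuffer."""
    for name, args in trace:
        getattr(driver, name)(*args)
    return driver.framebuffer


class ToyDriver:
    """Hypothetical stand-in for an OpenGL driver with a 1-D framebuffer."""

    def __init__(self, width):
        self.framebuffer = [0] * width

    def clear(self, value):
        self.framebuffer = [value] * len(self.framebuffer)

    def set_pixel(self, x, value):
        self.framebuffer[x] = value


trace = [("clear", (7,)), ("set_pixel", (2, 9))]
reference = replay(trace, ToyDriver(4))   # plays the role of the vendor path
candidate = replay(trace, ToyDriver(4))   # plays the role of the path under test
frames_match = reference == candidate
```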

23 OpenGL Library
- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code
Low Level Driver
- Low level access
- Attila memory management
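To give a feel for emulating alpha test and fog in shader code, the sketch below writes both as a per-fragment function. It uses the standard OpenGL linear fog factor f = (end - c) / (end - start), clamped to [0, 1], and a greater-than alpha comparison. The function name and parameters are hypothetical, and the code ATTILA's library actually generates is not shown in the slides.

```python
def fragment_program(color, eye_distance, alpha_ref, fog_start, fog_end, fog_color):
    """Return the fogged RGBA color, or None if the fragment is discarded."""
    r, g, b, a = color

    # Alpha test emulated in the fragment program (greater-than comparison).
    if not a > alpha_ref:
        return None

    # Linear fog factor, clamped to [0, 1]: 1 means no fog, 0 means full fog.
    f = (fog_end - eye_distance) / (fog_end - fog_start)
    f = max(0.0, min(1.0, f))

    # Blend the fragment color with the fog color; alpha is left untouched.
    rgb = tuple(f * c + (1.0 - f) * fc for c, fc in zip((r, g, b), fog_color))
    return rgb + (a,)


fogged = fragment_program((1.0, 0.0, 0.0, 1.0), 5.0, 0.5, 0.0, 10.0, (0.0, 0.0, 1.0))
discarded = fragment_program((1.0, 0.0, 0.0, 0.2), 5.0, 0.5, 0.0, 10.0, (0.0, 0.0, 1.0))
```

Folding these fixed-function stages into the fragment program means the simulated pipeline needs no dedicated alpha or fog hardware units.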

24 ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Functionality embedded at each pipeline stage

25 Signal Trace Visualizer (STV)
- Visualization of the signal traffic between the simulator boxes
- Debug the simulator performance

27 Outline
- ATTILA GPU Architecture
  - Classic Non Unified Shader Architecture
  - Unified Shader Architecture
  - GPU Architecture Trends
- ATTILA Simulator
- ATTILA OpenGL Framework
- Experiments & Statistics
- Future Work
- Conclusion

28 Case Study
- Evaluate the performance drop when downgrading the texturing capabilities of a mid-range GPU architecture
  - Unified architecture
  - 3 quad shader processors
    - In order queue based shader
    - Out of order thread based shader
  - A single quad fragment ROP pipeline
  - 2 x 64 bit channels to GDDR2 memory (simplified)
- Benchmarks:
  - UT2004 Primeval timedemo: 40 frames
  - Doom3 trDemo2 timedemo: 40 frames

29 The in order queue based shader had a bug at the time, which explains most of the performance loss. Without the bug it should reach around 70% of the performance of the out of order thread based shader. The performance drop for these two benchmarks is limited when the 3 shader processors share 2 Texture Units, but significant when a single Texture Unit is shared by all three.

33 Future Work
- Simulator
  - Improve the memory model and implement a new memory controller
  - Hardware support for SM 4 level features
    - Branching support in the shader
    - Geometry shader
    - Floating point textures
    - Render to target
    - HDR framebuffer
  - Implement antialiasing techniques
    - MSAA
    - Rotated Grid

34 Future Work
- Framework
  - Support for more OpenGL games and applications
    - Full support
    - Chronicles of Riddick
    - Serious Sam 2
    - And future OpenGL games: Quake Wars, Return to Castle Wolfenstein II?
  - Implement a framework for the Direct3D 9 and Direct3D 10 APIs
    - D3DInterceptor
    - D3DPlayer
    - D3D9 and D3D10 library
  - Support for glSlang shader programs

36 Conclusion
- We have presented a complete framework for the research of future GPU microarchitectures
- Covers all the aspects involved in GPU research
  - Capture traces of high level graphics API function calls from real applications
  - Translate the high level API to low level GPU operations
  - A cycle-by-cycle execution-driven simulation of the GPU microarchitecture
  - Debug and evaluate the GPU microarchitecture using the generated statistics and the signal traffic trace

37 Questions