1 ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández (Department of Computer Architecture, UPC)
Roger Espasa (Intel DEG Barcelona)
2 Introduction
- Graphics rendering and GPUs have become a key component of consumer computers
  - Game PCs
  - Microsoft Vista
  - Videogame consoles
  - Portable videogame consoles
  - Mobile phones
- Meanwhile, the GPU is also reaching the HPC segment
  - General Purpose GPU
  - ClawHMMer: A Streaming HMMer-Search Implementation, Daniel Reiter Horn, Mike Houston, and Pat Hanrahan, Supercomputing 2005
- A complete, accurate and flexible framework for GPU research becomes essential
3 Outline
- ATTILA GPU Architecture
  - Classic Non Unified Shader Architecture
  - Unified Shader Architecture
- ATTILA Simulator
- ATTILA OpenGL Framework
- Experiments & Statistics
- Future Work
5 ATTILA
- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not an exact match for either pipeline
    - Lack of detailed microarchitecture information
    - Educated guessing on our side
- Implemented features
  - 2D homogeneous recursive rasterization
  - Tiled rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear
7 Attila Classic: Specialized Shaders
[Block diagram: Vertex Fetch, Vertex Shader (x4), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x4), ROP, Memory Controller (x4)]
9 Attila Unified: Unified Shader Pool
[Block diagram: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, Memory Controller (x4)]
11 Signals and Boxes
- The simulator was implemented using a signal-and-box paradigm
  - Based on Asim: A Performance Model Framework, Joel Emer et al., IEEE Computer, February 2002
- The different hardware units and/or stages are implemented as boxes
  - Store data, control the data flow, perform computations
    - Registers
    - Queues
    - ALUs
- The communication between the hardware stages is implemented as signals
  - Carry a limited amount of data between two boxes with a delay
    - Wires
    - Buses
12 Boxes
- Each box has a unique name
- Boxes have a large number of configuration parameters (up to 20)
- The box state is updated every clock cycle
  - A box implements what can be done or accessed in a single cycle in a real chip
- The communication with other boxes is carried through signals
  - A box has a set of input signals
  - A box has a set of output signals
- Boxes implement the data storage
- Boxes call the emulator classes to perform the hardware functionality at the required point
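The box properties listed above (unique name, configuration parameters, state updated once per cycle) could be sketched roughly as follows. This is an illustration in Python, not the simulator's actual code: `Box`, `CounterBox`, `run`, and the `rate` and `pending` parameters are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class Box(ABC):
    """Hypothetical base class for one simulated hardware unit."""

    _names = set()  # registry used to enforce unique box names

    def __init__(self, name, **params):
        if name in Box._names:
            raise ValueError(f"duplicate box name: {name}")
        Box._names.add(name)
        self.name = name
        self.params = params  # per-box configuration parameters

    @abstractmethod
    def clock(self, cycle):
        """Update the box state for one simulated cycle."""


class CounterBox(Box):
    """Toy box that retires at most `rate` work items per cycle."""

    def __init__(self, name, rate=1, pending=0):
        super().__init__(name, rate=rate)
        self.pending = pending  # work items still waiting
        self.done = 0           # work items completed so far

    def clock(self, cycle):
        # Only the work that fits in a single cycle is performed here.
        work = min(self.params["rate"], self.pending)
        self.pending -= work
        self.done += work


def run(boxes, cycles):
    """Clock every box exactly once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


a = CounterBox("ToyA", rate=2, pending=5)
b = CounterBox("ToyB", rate=1, pending=5)
run([a, b], cycles=3)
```

After three cycles the faster box has drained its work while the slower one still has items pending, which is the kind of per-cycle throughput limit a box is meant to capture.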
13 Signals
- Each signal has a unique name
- Signal parameters
  - Latency: fixed or variable
  - Bandwidth
- The signal state is updated each time a box reads from or writes to the signal
  - May not happen every cycle
- Carries a limited number of data objects from a source box to a destination box with a time delay
  - A signal performs redundant checks on every operation to catch data loss and violations of its restrictions
- The traffic carried through signals can be dumped to a file each cycle for debugging
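A signal with fixed latency and limited bandwidth, including the checks against overflow and lost data, might look like the sketch below. It is a simplified illustration under stated assumptions: the `Signal` class, its `write` and `read` methods, and the signal name `FetchToDecode` are hypothetical, and only the fixed-latency case is modeled.

```python
class Signal:
    """Hypothetical fixed-latency, bandwidth-limited link between two boxes."""

    def __init__(self, name, bandwidth, latency):
        if bandwidth < 1 or latency < 1:
            raise ValueError("bandwidth and latency must be at least 1")
        self.name = name
        self.bandwidth = bandwidth  # max data objects written per cycle
        self.latency = latency      # cycles between write and arrival
        self._in_flight = {}        # arrival cycle -> list of data objects

    def write(self, cycle, obj):
        slot = self._in_flight.setdefault(cycle + self.latency, [])
        # Check: never accept more objects per cycle than the bandwidth.
        if len(slot) >= self.bandwidth:
            raise RuntimeError(f"{self.name}: bandwidth exceeded at cycle {cycle}")
        slot.append(obj)

    def read(self, cycle):
        # Check: data that arrived in an earlier cycle and was never read is lost.
        stale = [c for c in self._in_flight if c < cycle]
        if stale:
            raise RuntimeError(f"{self.name}: unread data from cycle {min(stale)}")
        return self._in_flight.pop(cycle, [])


sig = Signal("FetchToDecode", bandwidth=2, latency=3)
sig.write(0, "instr0")
sig.write(0, "instr1")
early = sig.read(2)    # nothing has arrived yet
arrived = sig.read(3)  # both objects arrive after the 3-cycle latency
```

Raising an error on a violated restriction, rather than silently dropping data, is one way to make the checks useful for debugging a timing model.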
14 Signals and Boxes
[Diagram: the Shader Fetch and Shader Decode Execute boxes and their signals: next instruction, feedback from decode, shader input, feedback to producer, shader output, feedback from consumer, ALU latency, texture access, feedback from texture unit, texture sample]
15 ATTILA Boxes
[Diagram of the simulator boxes: Command Processor, Memory Controller, Streamer Fetch, Streamer Output Cache, Streamer Commit, Streamer Loader, Primitive Assembly, Clipper, Triangle Setup, Fragment Generator, Hierarchical Z, Shader Fetch, Shader Decode Execute, Texture Unit, Fragment FIFO, Interpolator, Z Stencil Test, Color Write, DAC. Group labels: STREAMER/VERTEX FETCH, SHADER]
16 Emulation Libraries
- The emulation of the rendering operations is implemented in separate emulation classes
- The simulator boxes call the emulation classes to emulate the GPU functionality at the same points where a real GPU performs it
- Emulation classes
  - ClipperEmulator
  - RasterizerEmulator
  - ShaderEmulator
  - TextureEmulator
  - FragmentOpEmulator
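The split between emulation (what is computed) and simulation (when it becomes visible) can be illustrated with a small sketch. Everything here is an assumption made for the example: `ToyShaderEmulator`, `ExecuteBox`, the three opcodes, and the latency values are hypothetical and do not reproduce the real ShaderEmulator interface.

```python
class ToyShaderEmulator:
    """Functional model only: computes results and knows nothing about time."""

    def execute(self, opcode, a, b):
        if opcode == "ADD":
            return [x + y for x, y in zip(a, b)]
        if opcode == "MUL":
            return [x * y for x, y in zip(a, b)]
        if opcode == "DP4":
            # 4-component dot product, replicated to all components
            return [sum(x * y for x, y in zip(a, b))] * 4
        raise ValueError(f"unsupported opcode: {opcode}")


class ExecuteBox:
    """Timing model: decides when an emulated result becomes visible."""

    LATENCY = {"ADD": 1, "MUL": 2, "DP4": 3}  # assumed cycle counts

    def __init__(self, emulator):
        self.emulator = emulator
        self.pending = []  # list of (ready_cycle, result)

    def issue(self, cycle, opcode, a, b):
        # The emulator is called at the point where the hardware would
        # perform the operation; the box only adds the delay.
        result = self.emulator.execute(opcode, a, b)
        self.pending.append((cycle + self.LATENCY[opcode], result))

    def clock(self, cycle):
        ready = [r for c, r in self.pending if c <= cycle]
        self.pending = [(c, r) for c, r in self.pending if c > cycle]
        return ready


box = ExecuteBox(ToyShaderEmulator())
box.issue(0, "DP4", [1, 2, 3, 4], [1, 1, 1, 1])
at_cycle_2 = box.clock(2)  # result not ready yet
at_cycle_3 = box.clock(3)  # result appears after the assumed latency
```

Keeping the functional code in its own class means the timing model can change (for example, a different latency table) without touching how results are computed.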
17 Support Classes
- OptimizedDynamicMemory
  - Cheap memory allocation for dynamic data objects
- DynamicObject
  - Abstract class for the data objects carried through the signals
  - Stores information to be used for the signal traffic dump
- SignalBinder
  - Signal name directory
  - Supports dumping the signal traffic every cycle
- Statistics
  - Abstract class for simulator statistics
- StatisticManager
  - Statistics name directory
  - Supports dumping the statistics values to a file every N cycles
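A name directory for statistics that dumps values every N cycles could be sketched as below. The class names `ToyStatistic` and `ToyStatisticManager`, the output line format, and the choice to reset counters after each dump are assumptions for illustration, not the behavior of the real StatisticManager.

```python
import io


class ToyStatistic:
    """Hypothetical named counter."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class ToyStatisticManager:
    """Hypothetical name directory that dumps all counters every N cycles."""

    def __init__(self, out, interval):
        self.out = out            # any file-like object
        self.interval = interval  # dump period in cycles
        self._stats = {}

    def get(self, name):
        # Look the statistic up by name, creating it on first use.
        if name not in self._stats:
            self._stats[name] = ToyStatistic(name)
        return self._stats[name]

    def clock(self, cycle):
        # Dump only at the end of each interval.
        if (cycle + 1) % self.interval != 0:
            return
        fields = [f"{n}={s.value}" for n, s in sorted(self._stats.items())]
        self.out.write(f"{cycle + 1}," + ",".join(fields) + "\n")
        for stat in self._stats.values():
            stat.value = 0  # assumed: counters restart after each dump


out = io.StringIO()
manager = ToyStatisticManager(out, interval=2)
fragments = manager.get("Fragments")
for cycle in range(4):
    fragments.inc(3)
    manager.clock(cycle)
dump = out.getvalue()
```

Dumping per interval rather than once at the end makes it possible to see how a statistic evolves over the course of a frame.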
18 Simulator Class Hierarchy
[Class diagram. Classes shown: OptimizedDynamicMemory, DynamicObject, Signal, SignalBinder, Box, Statistics, StatisticManager, FragmentInput, Fragment, FragmentGenerator, ShaderFetch, TextureUnit, RasterizerEmulator, ShaderEmulator, TextureEmulator]
20 [Framework diagram with four phases: Collect, Verify, Simulate, Analyze. Components shown: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver, ATI R520/NVIDIA G70, ATTILA OpenGL Driver, ATTILA Simulator, Framebuffer, Signal Visualizer, Statistics, Signal Traffic. Label: "CHECK!"]
21 GL Interceptor
- Capture a trace of OpenGL API calls from a real game
- Gather statistics from the game execution
[Framework diagram as on slide 20]
22 GL Player
- Verification of the captured trace
[Framework diagram as on slide 20]
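The capture-then-verify idea behind GLInterceptor and GLPlayer can be sketched with a toy driver. None of these names come from the real tools: `ToyDriver`, `Interceptor`, `replay`, and the three toy API calls are hypothetical stand-ins for an OpenGL driver and its API.

```python
class ToyDriver:
    """Hypothetical stand-in for a graphics driver with a tiny API."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.clear_value = 0
        self.framebuffer = [0] * (width * height)

    def set_clear_value(self, value):
        self.clear_value = value

    def clear(self):
        self.framebuffer = [self.clear_value] * (self.width * self.height)

    def draw_point(self, x, y, value):
        self.framebuffer[y * self.width + x] = value


class Interceptor:
    """Records every API call in a trace, then forwards it to the driver."""

    def __init__(self, target, trace):
        self._target = target
        self._trace = trace

    def __getattr__(self, name):
        func = getattr(self._target, name)

        def wrapper(*args):
            self._trace.append((name, args))  # capture the call
            return func(*args)                # forward it unchanged

        return wrapper


def replay(trace, driver):
    """Replay a captured trace against another driver instance."""
    for name, args in trace:
        getattr(driver, name)(*args)


vendor = ToyDriver(4, 2)
trace = []
app = Interceptor(vendor, trace)
app.set_clear_value(7)
app.clear()
app.draw_point(1, 1, 9)

replayed = ToyDriver(4, 2)
replay(trace, replayed)
match = replayed.framebuffer == vendor.framebuffer  # verification step
```

Comparing the replayed framebuffer with the original one is the check that the trace captured everything needed to reproduce the frame.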
23 ATTILA OpenGL Driver
- OpenGL Library
  - Transforms fixed function into shader code
  - 200 API calls supported
  - ARB Vertex and Fragment extensions
  - Alpha and fog emulated via shader code
- Low Level Driver
  - Low level access
  - Attila memory management
[Framework diagram as on slide 20]
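As an illustration of turning fixed-function state into shader code, the sketch below generates an ARB fragment program that emulates an alpha test with a KIL instruction. The function `fixed_function_fragment_program` and its parameters are hypothetical, only a "greater or equal" comparison is covered, and this is not claimed to be the code the ATTILA library generates.

```python
def fixed_function_fragment_program(alpha_test=False, alpha_ref=0.5):
    """Build ARB fragment program text for a trivial fixed-function state.

    With alpha_test enabled, fragments whose alpha is below alpha_ref are
    discarded (a GL_GEQUAL-style test), using SUB followed by KIL.
    """
    lines = [
        "!!ARBfp1.0",
        "TEMP color;",
        "MOV color, fragment.color;",
    ]
    if alpha_test:
        ref = ", ".join([str(alpha_ref)] * 4)
        lines += [
            "PARAM ref = {" + ref + "};",
            "TEMP diff;",
            # diff = alpha - ref, replicated to all components
            "SUB diff, color.a, ref;",
            # KIL discards the fragment if any component is negative
            "KIL diff;",
        ]
    lines += ["MOV result.color, color;", "END"]
    return "\n".join(lines)


plain = fixed_function_fragment_program()
tested = fixed_function_fragment_program(alpha_test=True, alpha_ref=0.5)
```

Generating the program from the current state means the hardware model only needs programmable shaders, with no dedicated alpha-test stage.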
24 ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Functionality embedded at each pipeline stage
[Framework diagram as on slide 20]
25 Signal Trace Visualizer (STV)
- Visualization of the signal traffic between the simulator boxes
- Debug the simulator performance
[Framework diagram as on slide 20]
28 Case Study
- Evaluate the performance drop when downgrading the texturing capabilities of a mid-range GPU architecture
  - Unified architecture
  - 3 quad shader processors
    - In-order queue-based shader
    - Out-of-order thread-based shader
  - A single quad fragment ROP pipeline
  - 2 x 64-bit channels to GDDR2 memory (simplified)
- Benchmarks
  - UT2004 Primeval timedemo: 40 frames
  - Doom3 trDemo2 timedemo: 40 frames
29 The in-order queue-based shader had a bug at the time, which explains most of the performance loss. Without the bug, it should reach around 70% of the performance of the out-of-order thread-based shader. For these two benchmarks, the performance drop is limited when the 3 shader processors share 2 Texture Units, but significant when all three share a single Texture Unit.
33 Future Work
- Simulator
  - Improve the memory model and implement a new memory controller
  - Hardware support for SM 4 level features
    - Branching support in the shader
    - Geometry shader
    - Floating-point textures
    - Render to target
    - HDR framebuffer
  - Implement antialiasing techniques
    - MSAA
    - Rotated grid
34 Future Work
- Framework
  - Support for more OpenGL games and applications
    - Full support
    - Chronicles of Riddick
    - Serious Sam 2
    - And future OpenGL games: Quake Wars, Return to Castle Wolfenstein II?
  - Implement a framework for the Direct3D 9 and Direct3D 10 APIs
    - D3DInterceptor
    - D3DPlayer
    - D3D9 and D3D10 library
  - Support for glSlang shader programs
36 Conclusion
- We have presented a complete framework for the research of future GPU microarchitectures
- It covers all the aspects involved in GPU research
  - Capture traces of high level graphics API function calls from real applications
  - Translate the high level API to low level GPU operations
  - A cycle-by-cycle execution-driven simulation of the GPU microarchitecture
  - Debug and evaluate the GPU microarchitecture using the generated statistics and the signal traffic trace
37 Questions