1
1 Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández (Department of Computer Architecture, UPC)
Roger Espasa (Intel DEG Barcelona)
2
2 Introduction
- Shaders in GPUs are evolving towards general programming
  - Branches, generic loads, scatter
  - New types of shaders: geometry in DX10
- Current specialized shaders
  - Area hungry
  - Unbalancing leads to inefficiencies
- This paper: unify all shaders
  - ~8% higher performance with less area and resources
3
3 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
4
4 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
5
5 ATTILA
- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not an exact copy of either pipeline
    - Lack of detailed microarchitecture information
    - Educated guessing on our side
- Implemented features
  - 2D homogeneous recursive rasterization
  - Tiled rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear
6
6 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
7
7 Attila Classic: Specialized Shaders
[Block diagram. Blocks shown: Vertex Fetch, four Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, four Fragment Shaders, ROP, four Memory Controllers]
8
8 Specialized Shader Issues
- Unbalancing
  - In fragment-shading-limited scenarios (the typical case), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
  - In vertex-shading-limited scenarios, up to 70% of the processing power remains idle
- Dedicated area
  - 4 unused vertex shaders have the same processing power as 1 fragment shader
  - 4 vertex shaders require 66% of the area of a fragment shader
- Different designs
  - Increase the complexity of the microarchitecture
  - Increase development and verification time
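The idle percentages on this slide can be roughly reproduced with a small calculation. The sketch below assumes that processing power is counted in vertex-shader equivalents, using the slide's statement that 4 vertex shaders match 1 fragment shader, and that "idle" means the share of total power held by the shader type that is not the bottleneck. The function name and this accounting are illustrative assumptions, not the authors' method.

```python
# Assumption taken from the slide: 1 fragment shader ~ 4 vertex shaders of processing power.
VERTEX_EQUIV_PER_FRAGMENT_SHADER = 4


def idle_fraction(vertex_shaders, fragment_shaders, limited_by):
    """Share of total shading power left idle when one shader type is the bottleneck.

    Hypothetical helper: the non-limiting shader type is treated as fully idle,
    which gives an upper bound in the spirit of the slide's "up to" figures.
    """
    vertex_power = vertex_shaders
    fragment_power = fragment_shaders * VERTEX_EQUIV_PER_FRAGMENT_SHADER
    total = vertex_power + fragment_power
    if limited_by == "fragment":
        return vertex_power / total    # vertex shaders sit idle
    if limited_by == "vertex":
        return fragment_power / total  # fragment shaders sit idle
    raise ValueError("limited_by must be 'fragment' or 'vertex'")


fragment_limited = idle_fraction(8, 4, "fragment")  # 8 / 24, about one third
vertex_limited = idle_fraction(8, 4, "vertex")      # 16 / 24, about two thirds
```

Under these assumptions the two cases come out at about one third and two thirds, in line with the "up to 30%" and "up to 70%" quoted on the slide.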
9
9 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
10
10 Attila Unified: Unified Shader Pool
[Block diagram. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, four Memory Controllers]
11
11 Unified Shader Architecture Benefits
- Unified programming model
  - DX10/SM4 and OpenGL/GLSlang are already pushing for it
- The same features for all the program targets
  - Texturing, branching, outputs
- Not just vertex and fragment programs
  - DX10 => geometry shader
  - General Purpose GPU or Stream Processor
- Workload balance
  - Shading resources allocated as required at any point of the rendering
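The workload-balance benefit can be illustrated with a toy throughput model. This is a sketch under strong assumptions that are not from the slides: work is measured in abstract units, every shader unit retires one unit of work per cycle, vertex and fragment processing overlap perfectly, and memory and scheduling costs are ignored. Both function names are hypothetical.

```python
def classic_time(vertex_work, fragment_work, vertex_units, fragment_units):
    """Cycles needed when each shader type only runs its own kind of work.

    The slower of the two dedicated pools sets the frame time.
    """
    return max(vertex_work / vertex_units, fragment_work / fragment_units)


def unified_time(vertex_work, fragment_work, unified_units):
    """Cycles needed when any unit in the pool can run either kind of work."""
    return (vertex_work + fragment_work) / unified_units


# Same total capacity in both designs: 2 + 4 dedicated units vs 6 unified units.
fragment_heavy = (classic_time(100, 800, 2, 4), unified_time(100, 800, 6))
vertex_heavy = (classic_time(600, 300, 2, 4), unified_time(600, 300, 6))
```

In this model the unified pool is never slower than the split design with the same total capacity, because idle units of one type can pick up the other type's work. The gap is largest when the workload mix is far from the fixed vertex-to-fragment ratio of the dedicated hardware.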
12
12 Unified Shader Architecture Costs
- Scheduler
  - Selects which kind of workload must be processed next
  - Partly implemented with multithreading in the fragment shader to hide texture access latency
- Larger instruction memory and constant bank
- Rerouting required
  - All the paths cross the shader pool
13
13 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
14
14 ATTILA Framework
- OpenGL Interceptor tool
- OpenGL library for Attila
- GPU driver for our Attila GPU
- Attila GPU simulator
- Signal Visualizer tool
15
15 Collect, Verify, Simulate, Analyze
[Workflow diagram with four stages: Collect, Verify, Simulate, Analyze. Boxes shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVIDIA G70, Trace, Framebuffer, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Signal Visualizer, Statistics, Signal Traffic, plus a "CHECK!" label]
16
16 Collect, Verify, Simulate, Analyze
[Workflow diagram repeated from slide 15]
GLInterceptor
- Captures a trace of OpenGL API calls from a real game
17
17 Collect, Verify, Simulate, Analyze
[Workflow diagram repeated from slide 15]
GLPlayer
- Reproduces the captured trace
18
18 Collect, Verify, Simulate, Analyze
[Workflow diagram repeated from slide 15]
OpenGL Library
- Transforms fixed function into shader code
- 200 API calls supported
- ARB Vertex and Fragment extensions
- Alpha and fog emulated via shader code
Driver
- Low-level access
- Attila memory management
19
19 Collect, Verify, Simulate, Analyze
[Workflow diagram repeated from slide 15]
ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
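A cycle-by-cycle simulator built from boxes can be sketched as follows. This is a minimal illustration, not the ATTILA code: the class names `Signal`, `Producer`, `Doubler` and `Sink` are hypothetical, boxes are assumed to communicate only through fixed-latency signals, and the functional work is done inside the stage that models it, which is one possible reading of "Execute@Execute".

```python
from collections import deque


class Signal:
    """Fixed-latency wire between two boxes (hypothetical helper)."""

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("latency must be at least 1 cycle")
        self.latency = latency
        self._in_flight = deque()  # (arrival_cycle, payload), in write order

    def write(self, cycle, payload):
        self._in_flight.append((cycle + self.latency, payload))

    def read(self, cycle):
        """Return the oldest payload that has arrived by `cycle`, else None."""
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that injects one item per cycle until it runs out."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Doubler:
    """Toy stage: its functional work (x * 2) happens inside the stage itself."""

    def __init__(self, inp, out):
        self.inp = inp
        self.out = out

    def clock(self, cycle):
        value = self.inp.read(cycle)
        if value is not None:
            self.out.write(cycle, value * 2)


class Sink:
    """Box that records the cycle at which each result arrives."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        value = self.inp.read(cycle)
        if value is not None:
            self.received.append((cycle, value))


def simulate(boxes, cycles):
    """Clock every box once per cycle; signal latency >= 1 makes box order irrelevant."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


a, b = Signal(latency=2), Signal(latency=3)
sink = Sink(b)
simulate([Producer(a, [1, 2, 3]), Doubler(a, b), sink], cycles=10)
```

Because every signal has a latency of at least one cycle, a value written in a given cycle is never visible in that same cycle, so the order in which boxes are clocked does not change the result.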
20
20 Find the differences
[Two images compared: NVIDIA GeForce FX 5900XT and Attila]
21
21 Outline Attila – our GPU architecture Attila-Classic: Non-unified shaders Attila-Unified: Unified Shaders Simulation Framework Results
22
22 Benchmark
- Unreal Tournament 2004
  - Fixed-function OpenGL API
    - Vertex and fragment shaders generated by our library
  - 1024x768 resolution
  - 8x anisotropic filtering
  - 160 of 450 frames simulated
  - 40 frames ~ 1 day of simulation
  - On a Xeon P4 @ 2.0 GHz
23
23 Baseline Configuration
- Four vertex shaders (only for Attila-Classic)
- Fragment and unified shader configuration:
  - 32 threads
    - 4 fragments/vertices per thread
    - 16 128-bit FP registers available for temporary storage per thread
  - n SIMD ALUs
  - 1 scalar ALU (optional)
  - 1 texture unit per shader unit
    - 16 KB texture cache
    - Single-cycle bilinear and two-cycle trilinear
    - AF up to 16x
- Geometry and rasterization pipelines limited to 1 vertex and 1 triangle per cycle
- Two ROPs: 8 z and 8 color values written per cycle
- Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
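Two of the baseline figures follow from simple arithmetic, sketched below. The assumptions are that "cycle" means one memory clock cycle, that DDR delivers two transfers per clock on each bus, and that the 32 threads are counted per shader unit. The function and variable names are illustrative only.

```python
def peak_bytes_per_cycle(buses, bus_width_bits, transfers_per_cycle):
    """Peak memory bandwidth in bytes per clock cycle."""
    return buses * (bus_width_bits // 8) * transfers_per_cycle


# Four 64-bit buses, double data rate (2 transfers per clock).
peak_bandwidth = peak_bytes_per_cycle(buses=4, bus_width_bits=64, transfers_per_cycle=2)

# 32 threads with 4 fragments/vertices each, assumed to be per shader unit.
elements_in_flight = 32 * 4
```

With these inputs the bandwidth works out to the 64 bytes/cycle stated on the slide, and each shader unit can keep 128 fragments or vertices in flight.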
24
24 “Classic” Performance
- 8% improvement for 2-way
- Near-linear improvement for 4 shaders
- Sublinear improvement for 6 and 8 shaders
  - Limited by memory bandwidth and latency
[Chart: configurations 2sh, 4sh, 6sh, 8sh; annotated values: ~75%, ~45%, ~40%, 7%, 8%]
25
25 Frame 330 – Detailed Zoom
[Plot: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units; annotation: vertex shading limited]
26
26 Unified Shader Performance
- Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
[Chart: configurations 2sh, 4sh, 6sh, 8sh; annotations: fragment shading limited, vertex fetch limited, geometry pipeline limited]
27
27 Area Estimation

                          ATI R400   ATI RV400
Transistors (millions)    160        120
Vertex Shaders            6          4
Fragment Shaders          4          2

Hardware Element          Estimated Area (Millions of Transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)
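The area estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses the per-unit figures from the slide (2.5 and 15 million transistors, plus 5 million for other logic). How the "+15%" and "+5%" increments combine is not stated, so the code assumes they are additive and relative to the base shader area. All names are hypothetical.

```python
VERTEX_SHADER_MT = 2.5     # millions of transistors, from the slide
FRAGMENT_SHADER_MT = 15.0  # millions of transistors, from the slide
OTHER_MT = 5.0             # "other" logic, from the slide's equation


def transistor_delta(extra_vertex, extra_fragment, other=OTHER_MT):
    """Estimated transistor difference (millions) between two chip configurations."""
    return extra_vertex * VERTEX_SHADER_MT + extra_fragment * FRAGMENT_SHADER_MT + other


def shader_area(base_mt, extra_simd_alus=0, scalar_alu=False):
    """Shader area with optional ALUs.

    Assumption: +15% per extra SIMD ALU and +5% for a scalar ALU,
    both additive and relative to the base area.
    """
    return base_mt * (1 + 0.15 * extra_simd_alus + (0.05 if scalar_alu else 0.0))


delta = transistor_delta(extra_vertex=2, extra_fragment=2)     # 5 + 30 + 5
four_vs_vs_one_fs = 4 * VERTEX_SHADER_MT / FRAGMENT_SHADER_MT  # area ratio, about 0.67
two_way_with_scalar = shader_area(FRAGMENT_SHADER_MT, extra_simd_alus=1, scalar_alu=True)
```

The first value reproduces the 40 million transistor gap between the two chips, and the ratio matches the earlier claim that 4 vertex shaders take about 66% of the area of one fragment shader.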
28
28 Shader Scaling vs Transistors
- Linear for 4 shader units, sublinear for more than 4 shader units
- Up to 30% more efficient per area for the unified architecture (two 1-way shaders)
[Chart: configurations 2sh, 4sh, 6sh, 8sh]
29
29 Conclusion
- Attila Unified architecture has better performance than Attila Classic with less hardware
  - Up to 8% better performance
  - 8% to 25% less area required
  - 10% to 30% better performance per area
- Up to 8% better performance for 2-way shader units
- 160% better performance from 2 to 8 fragment or unified shader units
  - Memory bandwidth limited beyond 4 shaders
30
30 Questions
31
31 Performance of Attila Unified vs Classic Attila