Shader Performance Analysis on a Modern GPU Architecture

Presentation transcript:

1 Shader Performance Analysis on a Modern GPU Architecture
  Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
  Department of Computer Architecture, UPC
  Roger Espasa
  Intel DEG Barcelona

2 Introduction
  - Shaders in GPUs are evolving towards general programming
    - Branches, generic loads, scatter
  - New types of shaders: geometry in DX10
  - Current specialized shaders
    - Area hungry
    - Unbalancing leads to inefficiencies
  - This paper: unify all shaders
    - ~8% higher performance with less area and fewer resources

3 Outline
  - Attila – our GPU architecture
  - Attila-Classic: Non-unified shaders
  - Attila-Unified: Unified Shaders
  - Simulation Framework
  - Results

4 Outline (as on slide 3)

5 ATTILA
  - Our implementation of current GPUs
    - Inspired by both NVIDIA and ATI
    - Not exact to either pipeline
      - Lack of detailed microarchitecture information
      - Educated guessing on our side
  - Implemented Features
    - 2D Homogeneous Recursive Rasterization
    - Tiled Rasterization
    - Hierarchical Z
    - Texture compression
    - Anisotropic filtering
    - Depth compression, fast z/stencil and color clear

6 Outline (as on slide 3)

7 Attila Classic: Specialized Shaders
  [Block diagram. Components: Vertex Fetch, four Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, four Fragment Shaders, ROP, four Memory Controllers]

8 Specialized Shader Issues
  - Unbalancing
    - In fragment shading limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
    - In vertex shading limited scenarios, up to 70% of the processing power remains idle
  - Dedicated Area
    - 4 unused vertex shaders have the same processing power as 1 fragment shader
    - 4 vertex shaders require 66% of the area of a fragment shader
  - Different Designs
    - Increase the complexity of the microarchitecture
    - Increase development and verification time
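The idle percentages on slide 8 can be checked with a short worked example. This is only a sketch: it assumes that "processing power" adds linearly across units, that 4 vertex shaders equal 1 fragment shader (as the slide states), and that in the worst case the non-limiting shader type sits completely idle. The function name `idle_fraction` is hypothetical.

```python
# Ratio taken from slide 8: four vertex shaders have roughly the
# processing power of one fragment shader.
VERTEX_SHADERS_PER_FRAGMENT_SHADER = 4


def idle_fraction(num_vertex, num_fragment, limited_by):
    """Worst-case share of total shader power left idle.

    Assumption (not from the slides): when one shader type is the
    bottleneck, the other type is treated as fully idle.
    """
    if limited_by not in ("vertex", "fragment"):
        raise ValueError("limited_by must be 'vertex' or 'fragment'")
    # Express everything in fragment-shader equivalents.
    vertex_power = num_vertex / VERTEX_SHADERS_PER_FRAGMENT_SHADER
    fragment_power = float(num_fragment)
    idle_power = vertex_power if limited_by == "fragment" else fragment_power
    return idle_power / (vertex_power + fragment_power)


# Configuration quoted on the slide: 8 vertex and 4 fragment shaders.
print(round(idle_fraction(8, 4, "fragment"), 2))  # 0.33
print(round(idle_fraction(8, 4, "vertex"), 2))    # 0.67
```

Under these assumptions the results (about 33% and 67%) line up with the "up to 30%" and "up to 70%" figures quoted on the slide.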

9 Outline (as on slide 3)

10 Attila Unified: Unified Shader Pool
  [Block diagram. Components: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, four Memory Controllers]

11 Unified Shader Architecture: Benefits
  - Unified programming model
    - DX10/SM4 and OpenGL/GLSlang are already pushing for it
  - The same features for all the program targets
    - Texturing, branching, outputs
  - Not just vertex and fragment programs
    - DX10 => geometry shader
    - General Purpose GPU or Stream Processor
  - Workload balance
    - Shading resources allocated as required at any point of the rendering

12 Unified Shader Architecture: Costs
  - Scheduler
    - Selects which kind of workload must be processed next
    - Already partly implemented by the multithreading that the fragment shader uses to hide texture access latency
  - Larger instruction memory and constant bank
  - Rerouting required
    - All the paths cross the shader pool
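The scheduler cost on slide 12 (selecting which kind of workload runs next on a shared pool) can be illustrated with a toy model. The slides do not describe Attila's actual scheduling policy; the fragments-first rule, the class name `UnifiedShaderScheduler` and its methods are assumptions made for this sketch.

```python
from collections import deque


class UnifiedShaderScheduler:
    """Toy scheduler: any free shader unit can take vertex or fragment work."""

    def __init__(self, num_shaders):
        self.free_shaders = num_shaders
        self.vertex_queue = deque()
        self.fragment_queue = deque()

    def submit(self, kind, item):
        """Queue one work item of the given kind ('vertex' or 'fragment')."""
        if kind not in ("vertex", "fragment"):
            raise ValueError("kind must be 'vertex' or 'fragment'")
        queue = self.vertex_queue if kind == "vertex" else self.fragment_queue
        queue.append(item)

    def pick_next(self):
        """Return (kind, item) for the next workload, or None if none can issue.

        Assumed policy (hypothetical): fragments go first because they are
        further down the pipeline; vertices fill the remaining shader units.
        """
        if self.free_shaders == 0:
            return None
        if self.fragment_queue:
            self.free_shaders -= 1
            return ("fragment", self.fragment_queue.popleft())
        if self.vertex_queue:
            self.free_shaders -= 1
            return ("vertex", self.vertex_queue.popleft())
        return None

    def release(self):
        """Mark one shader unit as free again after its work completes."""
        self.free_shaders += 1


scheduler = UnifiedShaderScheduler(num_shaders=2)
scheduler.submit("vertex", "v0")
scheduler.submit("fragment", "f0")
print(scheduler.pick_next())  # ('fragment', 'f0')
print(scheduler.pick_next())  # ('vertex', 'v0')
print(scheduler.pick_next())  # None (both shader units are busy)
```

The point of the sketch is the shared pool: the same two units serve whichever queue has work, which is what removes the idle capacity of dedicated vertex or fragment units.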

13 Outline (as on slide 3)

14 ATTILA Framework
  - OpenGL Interceptor tool
  - OpenGL library for Attila
  - GPU Driver for our Attila GPU
  - Attila GPU simulator
  - Signal Visualizer Tool

15 Collect, Verify, Simulate, Analyze
  [Flow diagram with four stages: Collect, Verify, Simulate, Analyze. Elements: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Framebuffer, Trace, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Statistics, Signal Traffic, Signal Visualizer, and a "CHECK!" step]

16 Collect, Verify, Simulate, Analyze (flow diagram as on slide 15)
  - GLInterceptor: captures a trace of OpenGL API calls from a real game

17 Collect, Verify, Simulate, Analyze (flow diagram as on slide 15)
  - GLPlayer: reproduces the captured trace
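The capture and replay roles of GLInterceptor and GLPlayer (slides 16 and 17) follow a general record/replay pattern, sketched below. This is not the real tools' code or trace format: `CallRecorder`, `replay` and `FakeDriver` are hypothetical names, and the "driver" is a stand-in object rather than an OpenGL implementation.

```python
import json


class CallRecorder:
    """Wraps a target object and logs every method call before forwarding it."""

    def __init__(self, target):
        self._target = target
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not defined on the recorder itself,
        # i.e. the API calls we want to intercept.
        real = getattr(self._target, name)

        def wrapper(*args):
            self.trace.append({"call": name, "args": list(args)})
            return real(*args)

        return wrapper


def replay(trace, target):
    """Re-issue each recorded call, in order, against another target."""
    for entry in trace:
        getattr(target, entry["call"])(*entry["args"])


class FakeDriver:
    """Hypothetical stand-in for a graphics driver; it only remembers calls."""

    def __init__(self):
        self.log = []

    def clear(self, r, g, b):
        self.log.append(("clear", r, g, b))

    def draw_triangles(self, first, count):
        self.log.append(("draw_triangles", first, count))


# Collect: the application talks to the driver through the recorder.
vendor = FakeDriver()
app_view = CallRecorder(vendor)
app_view.clear(0, 0, 0)
app_view.draw_triangles(0, 3)

# Store the trace as text, then replay it against a second driver.
trace_text = json.dumps(app_view.trace)
second = FakeDriver()
replay(json.loads(trace_text), second)
print(second.log == vendor.log)  # True
```

Because the replayed call sequence is identical, the outputs of two back ends fed from the same trace can be compared, which is the role the "CHECK!" step appears to play in the diagram.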

18 Collect, Verify, Simulate, Analyze (flow diagram as on slide 15)
  - OpenGL Library
    - Transforms Fixed Function into Shader code
    - API calls supported: ARB Vertex and Fragment extensions
    - Alpha and Fog emulated via Shader code
  - Driver
    - Low level access
    - Attila memory management

19 Collect, Verify, Simulate, Analyze (flow diagram as on slide 15)
  - ATTILA Simulator
    - Detailed cycle-by-cycle simulation of all pipeline stages
    - 20 boxes, modeling a 100-deep pipeline
    - Functionality embedded at each pipeline stage
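A cycle-by-cycle simulation built from boxes, as slide 19 describes, can be sketched in a few lines. The slides mention boxes and signal traffic but give no interfaces, so everything here is an assumption: the `Signal` class with a fixed latency, the `clock(cycle)` method, and the two example boxes are all hypothetical.

```python
from collections import deque


class Signal:
    """Fixed-latency wire: data written at cycle c is readable at c + latency."""

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("latency must be at least 1 cycle")
        self.latency = latency
        self.in_flight = deque()  # entries are (ready_cycle, data)

    def write(self, cycle, data):
        self.in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return the oldest item that has arrived by this cycle, else None."""
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one queued item per cycle on its output signal."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Consumer:
    """Box that records the cycle at which each item arrives."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Advance every box once per cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


wire = Signal(latency=3)
producer = Producer(wire, ["tri0", "tri1"])
consumer = Consumer(wire)
simulate([producer, consumer], cycles=6)
print(consumer.received)  # [(3, 'tri0'), (4, 'tri1')]
```

Because every signal has a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result, which keeps each box's logic self-contained.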

20 Find the differences
  [Side-by-side rendered images: NVIDIA GeForce FX 5900XT and Attila]

21 Outline (as on slide 3)

22 Benchmark
  - Unreal Tournament 2004
    - Fixed function OpenGL API
      - Vertex and fragment shaders generated by our library
    - 1024x768 resolution
    - 8x Anisotropic Filtering
    - 160 of 450 frames simulated
    - 40 frames ~ 1 day of simulation
      - On a 2.0 GHz Xeon

23 Baseline Configuration
  - Four Vertex Shaders (only for Attila-Classic)
  - Fragment and Unified shader configuration:
    - 32 threads
      - 4 fragments/vertices per thread
      - […] […]-bit FP registers available for temporary storage per thread
    - n SIMD ALUs
    - 1 scalar ALU (optional)
    - 1 Texture Unit per Shader Unit
      - 16 KB texture cache
      - Single cycle bilinear and two cycle trilinear
      - AF up to 16x
  - Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
  - Two ROPs: 8 z and 8 color values written per cycle
  - Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
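The peak bandwidth figure on slide 23 follows from the bus configuration. A minimal check, assuming that "cycle" means one memory clock cycle and that DDR delivers two transfers per clock on each bus (the function name is hypothetical):

```python
def peak_bytes_per_cycle(num_buses, bus_width_bits, transfers_per_cycle):
    """Peak memory bandwidth in bytes per clock cycle."""
    bytes_per_transfer = bus_width_bits // 8
    return num_buses * bytes_per_transfer * transfers_per_cycle


# Four 64-bit buses, double data rate (2 transfers per clock).
print(peak_bytes_per_cycle(num_buses=4, bus_width_bits=64, transfers_per_cycle=2))  # 64
```

That is 4 buses x 8 bytes x 2 transfers = 64 bytes per cycle, matching the slide.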

24 “Classic” Performance
  - 8% improvement for 2-way
  - Near linear improvement for 4 shaders
  - Sublinear improvement for 6 and 8 shaders
    - Limited by memory bandwidth and latency
  [Chart. Series: 2sh, 4sh, 6sh, 8sh. Data labels: ~75%, ~45%, ~40%, 7%, 8%]

25 Frame 330 – Detailed Zoom
  [Chart: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units. Annotation: vertex shading limited]

26 Unified Shader Performance
  - Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
  [Chart. Series: 2sh, 4sh, 6sh, 8sh. Annotations: fragment shading limited, vertex fetch limited, geometry pipeline limited]

27 Area Estimation

                           ATI R400   ATI RV400
  Transistors (millions)   160        120
  Vertex Shaders           6          4
  Fragment Shaders         4          2

  Hardware Element         Estimated Area (millions of transistors)
  Vertex Shader            2.5
  Fragment Shader          15
  Additional SIMD ALU      +15%
  Additional scalar ALU    +5%

  160 – 120 = 40 = (2 vertex shaders * […]) + ([…] fragment shaders * […]) + […] (other)
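The area estimate on slide 27 can be sanity-checked with a little arithmetic. This sketch rests on several assumptions that the garbled slide does not confirm: that the fused figures read as 6 vs 4 vertex shaders and 4 vs 2 fragment shaders, that the two chips have 160 and 120 million transistors (the numbers in the slide's subtraction), and that the per-unit estimates of 2.5 and 15 million transistors are the ones plugged into the equation. The function name is hypothetical.

```python
# Per-unit estimates from the slide's second table (millions of transistors).
VERTEX_SHADER_MT = 2.5
FRAGMENT_SHADER_MT = 15.0


def unexplained_transistors(total_diff, vertex_diff, fragment_diff):
    """Transistor difference not accounted for by the shader units.

    Assumption: the difference between the two chips is the sum of the
    removed shader units plus a remainder covering everything else.
    """
    shader_part = vertex_diff * VERTEX_SHADER_MT + fragment_diff * FRAGMENT_SHADER_MT
    return total_diff - shader_part


# Assumed reading of the first table: 160 vs 120 million transistors,
# 6 vs 4 vertex shaders, 4 vs 2 fragment shaders.
other = unexplained_transistors(160 - 120, 6 - 4, 4 - 2)
print(other)  # 5.0
```

Under that reading, the shader units explain 35 of the 40 million transistors, leaving about 5 million for the "(other)" term.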

28 Shader Scaling vs Transistors
  - Linear for 4 shader units, sublinear for more than 4 shader units
  - Up to 30% more efficient per area for the unified architecture (two 1-way shaders)
  [Chart. Series: 2sh, 4sh, 6sh, 8sh]

29 Conclusion
  - Attila Unified architecture has better performance than Attila Classic with less hardware
    - Up to 8% better performance
    - 8% to 25% less area required
    - 10% to 30% better performance per area
  - Up to 8% better performance for 2-way shader units
  - 160% better performance from 2 to 8 fragment or unified shader units
    - Memory bandwidth limited beyond 4 shaders
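The "performance per area" metric in the conclusion combines a speedup with an area saving. A minimal sketch of how such a figure can be computed, assuming the metric is simply relative performance divided by relative area; the slides do not say which speedup goes with which area saving, so the pairing used below is illustrative only and the function name is hypothetical.

```python
def perf_per_area_gain(speedup, area_saving):
    """Relative gain in performance per unit area.

    speedup      -- fractional performance gain (0.08 means 8% faster)
    area_saving  -- fractional area reduction (0.25 means 25% less area)
    """
    if not 0.0 <= area_saving < 1.0:
        raise ValueError("area_saving must be in [0, 1)")
    return (1.0 + speedup) / (1.0 - area_saving) - 1.0


# Illustrative pairing (an assumption): 1% faster with 8% less area.
print(round(perf_per_area_gain(0.01, 0.08), 3))  # 0.098
```

With that pairing the gain is close to 10%, the lower end of the range quoted in the conclusion; larger area savings push the ratio up quickly because the saving appears in the denominator.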

30 Questions

31 Performance of Attila Unified vs Classic Attila