1 Attila Research Group Computer Architecture Department Univ Politècnica de Catalunya (UPC)

Presentation transcript:

1 Attila Research Group Computer Architecture Department Univ Politècnica de Catalunya (UPC)

2 Attila Project
Started 2003
Research on GPUs
–Focus on the microarchitecture
–Use real games as workloads
–Analyze bandwidth/latency/threading tradeoffs
Spent large fraction of time developing tools
Currently three PhDs in progress
Funding from
–CICYT / Ministry of Education, Spain (2)
–Intel (1)
2 students spent 6 months with ATI

3 Attila Team
Faculty
–Agustín Fernández
3 Ph.D. Students
–Victor Moya -- Hired by Intel / VCG ’06
–Carlos González -- 6 months internship at ATI (Jun’07)
–Jordi Roca -- 6 months internship at ATI (Jun’07)
Master Thesis
–Chema Solis – DX9 Driver Development
Alumni
–David Abella – DX9 Player and PIX reader
–Christian Perez – Color Compression in Attila
Industrial Advisor
–Roger Espasa, Intel VCG

4 Attila Publications
Conference Papers
–Workload Characterization of 3D Games. Jordi Roca, Victor Moya, Carlos González, Chema Solis, Agustín Fernández and Roger Espasa. IEEE International Symposium on Workload Characterization (IISWC-2006), pp. -, January
–ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), March
–Shader Performance Analysis on a Modern GPU Architecture. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. The 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), November
–A Single (Unified) Shader GPU Microarchitecture for Embedded Systems. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), November
Master Thesis
–Caracterización e implementación de algoritmos de compresión en la GPU ATILA (Text in Spanish). Christian Perez. Master Thesis for the Graduate Studies, January
–Extensión a Direct3D del driver de un simulador de GPU (Text in Spanish). Chema Solis. Master Thesis for the Graduate Studies, July
–Librería Direct3D (Text in Catalan). David Abella. Master Thesis for the Graduate Studies, July 2007
–Shader generation and compilation for a programmable GPU (Text in Spanish). Jordi Roca. Master Thesis for the Graduate Studies, July
–Support tools for a 3D graphics processor simulation framework (Text in Spanish). Carlos González. Master Thesis for the Graduate Studies, June 2004.

5 Outline
Attila Tracing Environment
Attila Architecture & Simulator
Current Research
–Micropolygons
–Memory Hierarchy
–DX9 Driver Development
–Unified Shader Architecture

6 [Diagram: Attila tracing environment in four stages]
Collect: OGL/D3D App; OGL/D3D Capturer or Microsoft PIX Capturer; Vendor OGL/D3D Driver; ATI R600/NVIDIA G80; Framebuffer; Trace; API Stats
Verify: OGL/D3D Player or Attila Pix Player; Vendor OGL/D3D Driver; ATI R600/NVIDIA G80; Framebuffer
Simulate: ATTILA OGL/D3D Driver; ATTILA Simulator; Framebuffer
Analyze: µ-Arch Statistics; Signal Traffic; Signal Trace Visualizer; internal traces (mem, $, …); detailed cycle-to-cycle visualization
CHECK (diagram label)

7 [Diagram: Collect, Verify, Simulate, Analyze tracing flow, as on slide 6]
API Capturers
–Capture API calls from a real game
–Gather API level statistics
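The capture step can be pictured as a thin proxy sitting between the application and the vendor driver. The following is a minimal sketch of that idea only, not the project's capturer: `TracingProxy`, `FakeDriver` and the `(name, args)` trace record are hypothetical names chosen for illustration, and a real capturer would also have to serialize buffers, textures and return values.

```python
from collections import Counter


class FakeDriver:
    """Hypothetical stand-in for a vendor OGL/D3D driver."""

    def clear(self, color):
        return None

    def draw(self, first, count):
        return count


class TracingProxy:
    """Forwards every call to the wrapped driver, recording it on the way."""

    def __init__(self, driver):
        self._driver = driver
        self.trace = []          # ordered (name, args) records: the trace
        self.stats = Counter()   # API-level statistics: calls per function

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself,
        # i.e. the API entry points of the wrapped driver.
        target = getattr(self._driver, name)

        def wrapper(*args):
            self.trace.append((name, args))
            self.stats[name] += 1
            return target(*args)

        return wrapper


def capture_demo():
    """Issue three calls through the proxy and return it for inspection."""
    api = TracingProxy(FakeDriver())
    api.clear((0, 0, 0))
    api.draw(0, 3)
    api.draw(3, 3)
    return api
```

Because the proxy keeps the calls in order, the same list could later be fed to a player that re-issues them against either the vendor driver or a simulator driver.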

8 [Diagram: Collect, Verify, Simulate, Analyze tracing flow, as on slide 6]
API Players
–Trace checking/integrity
–Batch-to-batch playing (helps debug)

9 [Diagram: Collect, Verify, Simulate, Analyze tracing flow, as on slide 6]
Simulation
Attila Drivers
–AOGL (90%)
–AD3D9 (60%)
Attila Simulator
–Detailed cycle-to-cycle simulation
–20 Boxes modeling 100-deep pipeline
–Functionality embedded at each pipeline stage

10 [Diagram: Collect, Verify, Simulate, Analyze tracing flow, as on slide 6]
Simulation output
–Micro-architectural statistics
–Traffic for cache, mem, …
–Signal trace (input for STV tool)
–Debug simulation performance

11 Attila Drivers
OpenGL driver
–200 API calls supported.
–80% OpenGL 2.0 fixed functionality
DirectX9 driver
–About 50 calls supported.
–60% API functionality.
[Diagram labels: Attila OpenGL Driver (GLLIB), Attila DX9 Driver (D3DLIB), HAL, ATTILA Architecture]

12 Unified Driver Architecture
Currently stalled due to lack of resources
Runs basic traces
–Non-textured torus with a simple vertex shader.
[Diagram labels: AOGL*, AGL/ES, ADX9*, ADX10, AREY, ACDLX, ACDL, HAL, ATTILA Architecture]

13 Outline
Attila Tracing Environment
Attila Architecture & Simulator
Current Research
–Micropolygons
–Memory Hierarchy
–DX9 Driver Development
–Unified Shader Architecture

14 Attila Architecture
[Diagram labels: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, Memory Controller (x4)]
Unified shaders, multithreaded …
GDDR4 detailed protocol, selectable memory schedulers …

15 Attila Simulator Implementation Using Boxes & Signals
Data-driven & cycle-accurate
[Diagram labels: Command Processor; Memory Controller; STREAMER/VERTEX FETCH (Streamer Fetch, Streamer Output Cache, Streamer Commit, Streamer Loader); Primitive Assembly; Clipper; Triangle Setup; Fragment Generator; Hierarchical Z; Z Stencil Test; Interpolator; Fragment FIFO; SHADER (Shader Fetch, Shader Decode Execute); Texture Unit; Color Write; DAC]
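One way to read "boxes and signals" is that each box is a pipeline unit clocked once per cycle, and boxes exchange data only through signals that have a fixed latency and a per-cycle bandwidth. The sketch below illustrates that reading under stated assumptions; the class names (`Signal`, `Producer`, `Consumer`) and the method names are hypothetical and are not taken from the simulator's source.

```python
from collections import deque


class Signal:
    """Point-to-point wire with a fixed latency (cycles) and per-cycle bandwidth."""

    def __init__(self, latency, bandwidth=1):
        self.latency = latency
        self.bandwidth = bandwidth
        self._in_flight = deque()   # (arrival_cycle, data), in send order
        self._sent = {}             # cycle -> number of items written that cycle

    def write(self, cycle, data):
        used = self._sent.get(cycle, 0)
        if used >= self.bandwidth:
            raise RuntimeError("signal bandwidth exceeded")
        self._sent[cycle] = used + 1
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return every item whose arrival cycle has been reached."""
        out = []
        while self._in_flight and self._in_flight[0][0] <= cycle:
            out.append(self._in_flight.popleft()[1])
        return out


class Producer:
    """Box that emits one queued item per cycle."""

    def __init__(self, out, items):
        self.out, self.items = out, deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Consumer:
    """Box that records what arrives, and when."""

    def __init__(self, inp):
        self.inp, self.received = inp, []

    def clock(self, cycle):
        for data in self.inp.read(cycle):
            self.received.append((cycle, data))


def simulate(cycles=6):
    wire = Signal(latency=2)
    boxes = [Producer(wire, ["tri0", "tri1"]), Consumer(wire)]
    for cycle in range(cycles):
        for box in boxes:        # every box is clocked once per cycle
            box.clock(cycle)
    return boxes[1].received
```

With a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result, which is what makes this style of model easy to extend box by box.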

16 Lots of configurable parameters

GPU Unit               Params  Examples
COMMAND PROCESSOR           1  Batch pipelining
MEMORY CONTROLLER          42  Size, channels and banks (number and interleaving)
STREAMER                   13  Fetched indices and attributes per cycle
PRIMITIVE ASSEMBLY          4  Assembled triangles per cycle
CLIPPER                     5  Clipping latency
SETUP + RASTERIZER         43  MSAA samples/cycle, enabled HZ
UNIFIED SHADER UNIT        39  Fetch instrs/cycle, temp regs, scalar ALU
TEXTURE CACHE              19  Line size, ways, port width
ROP (Z + COLOR)            47  Compression, cache size
DAC                         9  Refresh rate
TOTAL                     222

17 Statistics – High Level
[Charts: API level and µ-arch level statistics]
“Workload Characterization of 3D Games”, IEEE International Symposium on Workload Characterization (IISWC), 2006

18 Statistics – Zooming In
[Chart annotations: Light 0 and Light 1, each with a stencil pass and a shading pass]
Fine-grain stats are gathered at configurable intervals, e.g. every 100, 1K, 10K or 100K execution cycles.
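Collecting statistics at a configurable interval amounts to bucketing per-cycle events by `cycle // window`. The function below is a small sketch of that bucketing; `windowed_totals` and the `(cycle, count)` event format are assumptions made for the example, not the simulator's statistics interface.

```python
def windowed_totals(events, window):
    """Sum per-cycle event counts into buckets of `window` cycles.

    `events` is an iterable of (cycle, count) pairs. Entry i of the result
    covers cycles [i * window, (i + 1) * window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    totals = []
    for cycle, count in events:
        index = cycle // window
        if index >= len(totals):
            # Grow the list so that empty intervals show up as zeros.
            totals.extend([0] * (index + 1 - len(totals)))
        totals[index] += count
    return totals
```

A narrow window (100 cycles) exposes short phases such as an individual stencil pass, while a wide one (100K cycles) smooths them into a per-frame trend.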

19 Statistics – Cycle Level
ZStencilTest 0: 64 bytes
Cycle BW = 64 bytes (the whole request is transmitted in one cycle)
Cycle WRITE_CT bank=0 row=16 col=0
[Timeline labels: Precharge bank=0; Active bank=0 row=16; Write bank=0 col=0 WL=5; Data pins transmission]
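The command sequence on this slide (Precharge, Active, Write) is what a DRAM controller issues when the target bank has a different row open. The sketch below derives that sequence from the bank state; it is a simplified illustration that ignores all timing constraints (for example the write latency shown as WL=5), and `commands_for_write` and the `open_rows` dictionary are hypothetical names.

```python
def commands_for_write(open_rows, bank, row, col):
    """Return the DRAM commands needed to write to (bank, row, col).

    `open_rows` maps bank -> currently open row (absent if the bank is idle)
    and is updated in place.
    """
    cmds = []
    current = open_rows.get(bank)
    if current is not None and current != row:
        cmds.append(("Precharge", bank))      # close the row that is open
    if current != row:
        cmds.append(("Active", bank, row))    # open the requested row
        open_rows[bank] = row
    cmds.append(("Write", bank, col))
    return cmds
```

A second write to the same row needs only the `Write` command, which is why schedulers that group requests by open row reduce protocol overhead.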

20 Simulator Facts
Simulation speed
–1 1280x1024
Lines of code
–Simulator: 142,697 lines
–Library, driver and trace tools: 217,266 lines
  ACDL: 37,791 lines
  OpenGL: 35,960 lines
  D3D9: 17,348 lines

21 Supported workloads
Prey, Riddick, Quake 4, Doom 3, UT2004, Half Life 2
and upcoming D3D games …

22 Outline
Attila Tracing Environment
Attila Architecture & Simulator
Current Research
–Micropolygons
–Memory Hierarchy
–DX9 Driver Development
–Unified Shader Architecture

23 Micropolygon Rendering Jordi Roca

24 Past work
1. OpenGL Fixed Function to ARB vp/fp 1.0 translator.
2. Workload Characterization of 3D Games (IISWC’06):
–Extensive analysis of current games in terms of both API-call-level and µarchitectural-level stats.
3. Multi-GPU performance evaluation project (during the 2007 internship at ATI):
–Hybrid SFR/AFR modes.
–Alternatives for RTT surface synchronization.
–Scaling of current PCIe BW.
(A related paper is currently submitted to IISWC 2008.)

25 Micropolygon rendering
Understanding and characterizing the pipeline backend imbalance caused by very small polygons.
–Newer games tend to render outdoor scenes, which project polygons only a few pixels in size.
Synthetic micropolygon test: fills the screen with 1-pixel aligned quads:
–Raster input: 1 triangle/clock
–Raster output: 15 of 16 slots empty per clock (high-end cards).
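The 15/16 figure follows from simple arithmetic if the rasterizer accepts one triangle per clock and the backend offers 16 fragment slots per clock. The snippet below works that out; the slot count of 16 is inferred from the slide's ratio, the function name is hypothetical, and the model ignores 2x2 quad granularity.

```python
from fractions import Fraction


def empty_slot_fraction(pixels_per_triangle, triangles_per_clock=1,
                        slots_per_clock=16):
    """Fraction of backend fragment slots left empty each clock.

    Assumes every covered pixel fills exactly one slot and that at most
    `slots_per_clock` fragments can leave the rasterizer per clock.
    """
    filled = min(pixels_per_triangle * triangles_per_clock, slots_per_clock)
    return Fraction(slots_per_clock - filled, slots_per_clock)
```

With one-pixel triangles `empty_slot_fraction(1)` gives 15/16, and the waste only disappears once each triangle covers at least 16 pixels.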

26 Research on:
Proposal #1: µpolygon grid traversal scheme:
–An alternative rasterization path to detect and efficiently traverse grids of adjacent pixel-size primitives:
  Fill backend slots by combining fragments of different primitives.
  Reuse triangle setup and traversal computations for primitives that are close together in pixel space.
Proposal #2: Dynamic balancing of rasterization workload:
–Assign & schedule shader threads for rasterization.
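The slot-filling part of Proposal #1 can be illustrated with a greedy packer that lets fragments from consecutive primitives share one backend group. This is only a sketch of the general idea, not the proposed traversal scheme: `pack_fragments`, the group size of 16 and the `(primitive_id, fragments)` input format are assumptions made for the example.

```python
def pack_fragments(primitives, slots_per_group=16):
    """Greedily pack fragments from consecutive primitives into groups.

    `primitives` is an iterable of (primitive_id, fragments) pairs, where
    `fragments` is a list of (x, y) pixel positions. Each returned group
    holds up to `slots_per_group` (primitive_id, (x, y)) entries.
    """
    groups, current = [], []
    for prim_id, fragments in primitives:
        for frag in fragments:
            current.append((prim_id, frag))
            if len(current) == slots_per_group:
                groups.append(current)
                current = []
    if current:                 # flush the last, partially filled group
        groups.append(current)
    return groups
```

For 32 one-pixel primitives this produces 2 full groups, where issuing one group per primitive would need 32 mostly empty ones.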

27 GPU Memory Hierarchy Optimizations Carlos González

28 Previous Work
1. Initial Boxes & Signals framework for Attila
2. Tracing Framework
–GLInterceptor & GLPlayer Tools
–OpenGL Driver for Attila
–Signal Trace Visualizer tool
3. New highly detailed Memory Controller for Attila
–GDDR4, based on Hynix HY5FS123235AFCP
4. Internship at ATI (6 months, ’07)
–Work mainly focused on the MC block
–Analysis of bandwidth and latency by means of simulation techniques
–Some contributions to the initial system
  Mechanisms to pinpoint sources of latency and analyze bandwidth over time slices

29 Remarks on today’s GPUs
Tremendous bandwidth available
–Core 2: 12 GB/sec vs. NVIDIA G80: > 100 GB/sec
But…
–Dozens of clients accessing memory simultaneously
–Imbalance and inefficient scheduling of memory transactions can lead to poor performance
  Workload imbalance
  –Total available BW decreases
  Inefficient scheduling
  –Latency increases (DDR protocol overhead)
Overall performance degradation

30 Thesis Goals
1. Optimize bank mapping and load balancing among memory channels.
   Also: propose multiple separate address spaces (per client)
2. Propose efficient memory controller scheduling algorithms
   Also: measure the power consumption of the DRAM chips under our proposals
3. Propose new cache hierarchies for ROP and Texture units
4. Research interconnection topologies

31 Some experiments…
[Chart values: 17.8%, 17.6%, 17.1%, 13.6%, 17.1%]
Channel Interleaving Analysis
Some config. parameters of the simulation
–4 channels of 64-bit
–8 banks per 32-bit IO chip
–Channel interleaving = 256 bytes
–Bank interleaving = 1024 bytes
–4 unified shaders (4x)
–Texture cache line (L1) = 64 bytes
–Texture cache ways (L1) = 16
–Texture cache lines (L1) = 16
–Color and Zstencil caches: 4 ways, line size = 256 bytes, 16 cache lines
Some config. parameters of the experiment
–8 channels of 32-bit
–8 banks per 32-bit IO chip
–Bank interleaving fixed to unified shaders (4x)
Memory Scheduling Analysis
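To make the interleaving parameters concrete, the sketch below maps a byte address to a channel and a bank using the values listed on this slide (4 channels, 256-byte channel interleaving, 8 banks, 1024-byte bank interleaving). The mapping itself is an assumption: a plain round-robin scheme is used for illustration, and the simulator's actual address decoding may differ. `map_address` is a hypothetical name.

```python
def map_address(addr, channels=4, channel_interleave=256,
                banks=8, bank_interleave=1024):
    """Map a byte address to (channel, bank, channel-local address).

    Consecutive `channel_interleave`-byte chunks rotate across channels;
    within a channel, consecutive `bank_interleave`-byte blocks rotate
    across banks.
    """
    chunk, offset = divmod(addr, channel_interleave)
    channel = chunk % channels
    # Address as seen by the selected channel, with the other channels'
    # chunks squeezed out.
    local = (chunk // channels) * channel_interleave + offset
    bank = (local // bank_interleave) % banks
    return channel, bank, local
```

Under this scheme a linear 1 KB read touches all four channels once, while stepping through memory in 4 KB strides changes banks on every access but always lands on the same channel, which is the kind of imbalance an interleaving study looks for.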

32 DX9 Driver Development Chema Solís

33 Project target
The project target is to use D3D9 games as workloads for the ATTILA GPU simulator.
Two main tasks:
–Trace the D3D9 calls executed by the games.
–Build a D3D9 driver on top of the GPU simulator.
[Diagram labels: D3D application, Microsoft D3D9, D3D9 Trace, ATTILA D3D9 driver]

34 PixRun Player
Executes traces of D3D9 calls captured by Microsoft PIX.
Analyzes how the game uses D3D9.

35 D3D9 Driver
D3D9 functionality is being added progressively.
The driver is close to supporting commercial games.

36 Unified Shader Architecture Victor Moya

37 Unified Shader Architecture
Evaluated the performance of a unified vertex and fragment shader architecture on legacy applications
–Evaluated area vs. performance
Evaluated the performance of implementing Triangle Setup on the shader for embedded GPU architectures
Evaluated the bottlenecks of GPU architectures with a high shader ALU to texture ratio

38 Current Research
Evaluate thread and resource scheduling in a unified shader architecture
Implementation of blending on the shader

39 Thanks & visit our website: attila.ac.upc.edu