Graphics Hardware CMSC 491/691.

Slides:



Advertisements
Similar presentations
Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan.
Advertisements

DirectX11 Performance Reloaded
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
Optimization on Kepler Zehuan Wang
Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
A Crash Course on Programmable Graphics Hardware Li-Yi Wei 2005 at Tsinghua University, Beijing.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
Status – Week 243 Victor Moya. Summary Current status. Current status. Tests. Tests. XBox documentation. XBox documentation. Post Vertex Shader geometry.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Graphic Architecture introduction and analysis
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
Graphics Processors CMSC 411. GPU graphics processing model Texture / Buffer Texture / Buffer Vertex Geometry Fragment CPU Displayed Pixels Displayed.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
Geometric Objects and Transformations. Coordinate systems rial.html.
GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.
Cg Programming Mapping Computational Concepts to GPUs.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.
Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Fateme Hajikarami Spring  What is GPGPU ? ◦ General-Purpose computing on a Graphics Processing Unit ◦ Using graphic hardware for non-graphic computations.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.
Ray Tracing using Programmable Graphics Hardware
Mapping Computational Concepts to GPUs Mark Harris NVIDIA.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
My Coordinates Office EM G.27 contact time:
Our Graphics Environment Landscape Rendering. Hardware  CPU  Modern CPUs are multicore processors  User programs can run at the same time as other.
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
CMSC 611: Advanced Computer Architecture
EECE571R -- Harnessing Massively Parallel Processors ece
Week 2 - Friday CS361.
A Crash Course on Programmable Graphics Hardware
Graphics on GPU © David Kirk/NVIDIA and Wen-mei W. Hwu,
CS427 Multicore Architecture and Parallel Computing
5.2 Eleven Advanced Optimizations of Cache Performance
Graphics Processing Unit
Chapter 6 GPU, Shaders, and Shading Languages
From Turing Machine to Global Illumination
The Graphics Rendering Pipeline
Lecture 5: GPU Compute Architecture
GRAPHICS PROCESSING UNIT
Joshua Barczak CMSC435 UMBC
Lecture 5: GPU Compute Architecture for the last time
NVIDIA Fermi Architecture
Graphics Processing Unit
UMBC Graphics for Games
©Sudhakar Yalamanchili and Jin Wang unless otherwise noted
RADEON™ 9700 Architecture and 3D Performance
6- General Purpose GPU Programming
Balancing the Graphics Pipeline for Optimal Performance
CIS 6930: Chip Multiprocessor: GPU Architecture and Programming
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Presentation transcript:

Graphics Hardware CMSC 491/691

A Graphics Pipeline Transform Shade Vertex Clip Project Rasterize Triangle Interpolate Texture Fragment Z-buffer

Computation and Bandwidth Based on: • 100 Mtri/sec (1.6M/frame@60Hz) • 80 Bytes vertex data • 96 Bytes interpolated • 8 Bytes fragment output • 9x depth complexity • 4 4-Byte textures • 223 ops/vert • 1664 ops/frag • No caching • No compression Vertex 24 GB/s 310 GFLOPS Triangle 29 GB/s 2.6 TB/s Texture 160 GB/s Fragment Fragment 21 TFLOPS

Data Parallel Distribute Task Task Task Task Merge

Sorting Objects Distribute objects or vertices Vertex Merge & Redistribute by triangle Triangle Merge & Redistribute by screen location Fragment Screen

Screen Subdivision Tiled Interleaved

Graphics Processing Unit (GPU) Fixed-Function HW for clip/cull, raster, texturing, Ztest Programmable stages Commands in, pixels out

GPU Computation … Vertex More Parallel Parallel Triangle Pixel Pipeline … Parallel More Parallel More Pipeline

Architecture: Latency CPU: Make one thread go very fast Avoid the stalls Branch prediction Out-of-order execution Memory prefetch Big caches GPU: Make 1000 threads go very fast Hide the stalls HW thread scheduler Swap threads to hide stalls

Architecture (MIMD vs SIMD) MIMD(CPU-Like) SIMD (GPU-Like) CTRL ALU CTRL ALU CTRL ALU ALU ALU ALU CTRL ALU ALU ALU ALU CTRL ALU ALU ALU ALU Flexibility Horsepower Ease of Use

SIMD Branching Threads agree, issue if Threads agree, issue else if( x ) // mask threads // issue instructions else // invert mask // unmask Threads agree, issue if Threads agree, issue else Threads disagree, issue if AND else

SIMD Looping They all run ‘till the last one’s done…. Useful Useless while(x) // update mask // do stuff They all run ‘till the last one’s done….

GPU graphics processing model CPU Texture/Buffer Vertex Geometry Rasterize Fragment Z-Buffer Displayed Pixels

NVIDIA GeForce 6 Vertex Rasterize Fragment Z-Buffer Displayed Pixels [Kilgaraff and Fernando, GPU Gems 2]

GPU graphics processing model CPU Texture/Buffer Vertex Geometry Rasterize Fragment Z-Buffer Displayed Pixels

GPU graphics processing model CPU Vertex Texture/Buffer Geometry Fragment Rasterize Displayed Pixels Z-Buffer

General Purpose Registers SIMD Units 2x2 Quads (4 per SIMD) 20 ALU/Quad (5 per thread) “Wavefront” of 64 Threads, executed over 8 clocks 2 Waves interleaved Interleaving + multi-cycling hides ALU latency. Wavefront switching hides memory latency. GPR Usage determines wavefront count. General Purpose Registers 4x32bit (THOUSANDS of them)

AMD/ATI R600 [Tom’s Hardware]

NVIDIA Maxwell [NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014]

Maxwell SIMD Processing Block 32 Cores 8 Special Function NVIDIA Terminology: Warp = interleaved threads Want at least 4-8 Thread Block = Warps*Cores Flexible Registers Trade registers for warps

Maxwell Streaming Multiprocessor (SMM) 4 SIMD blocks Share L1 Caches Share memory Share tessellation HW

Maxwell Graphics Processing Cluster (GPC) 4 SMM Share raster

Full NVIDIA Maxwell 4 GPC Share L2 Share dispatch

NVIDIA Maxwell Stats 16 SM * 4 SIMD Blocks* 32 cores (2048 total) 4.6 TFLOPS, 144 Gtex/s, 224 MB/s 2MB L2 Cache Compress between Memory and L2 Saves 25%

GPU Performance Tips

Graphics System Architecture Your Code Display GPU GPU GPU(s) API Driver Produce Consume Current Frame (Buffering Commands) Previous Frame(s) (Submitted, Pending Execution)

GPU Performance Tips API and Driver Reading Results Derails the train….. Occlusion Queries  Death When used poorly …. Framebuffer reads  DEATH!!! Almost always… CPU-GPU communication should be one way If you must read, do it a few frames later…

GPU Performance Tips API and Driver Minimize shader/texture/constant changes Minimize Draw calls Minimize CPUGPU traffic Use static vertex/index buffers if you can Use dynamic buffers if you must With discarding locks…

GPU Performance Tips Shaders, Generally NO unnecessary work! Precompute constant expressions Div by constant  Mul by reciprocal Compiler will not turn (x+.4)*.5 into x*.5+.2 Former is 2 instructions, latter is one multiply/add Minimize fetches Prefer compute (generally) If ALU/TEX < 4+, ALU is under-utilized If combining static textures, bake it all down…

GPU Performance Tips Shaders, Generally Careful with flow control Avoid divergence Flatten small branches Prefer simple control structure Double-check the compiler Shader compilers are getting better… But floating point rules can tie their hands Look over artists’shoulders Material editors give them lots of rope….

GPU Performance Tips Vertex Processing Use the right data format Cache-optimized index buffer Small, 16-byte aligned vertices Cull invisible geometry Coarse-grained (few thousand triangles) is enough “Heavy” Geometry load is ~2MTris and rising

GPU Performance Tips Pixel Processing Small triangles hurt performance 2x2 pixel quads  Waste at edge pixels Respect the texture cache Adjacent pixels should touch adjacent texels Use the smallest possible texture format Avoid dependent texture reads Do work per vertex There’s usually less of those (see small triangles)

GPU Performance Tips Pixel Processing HW is very good at Z culling Early Z, Hierarchical Z If possible, submit geometry front to back “Z Priming” is commonplace Render with simple shader to z-buffer Then render with real shader

GPU Performance Tips Blending/Backend Turn off what you don’t need Alpha blending Color/Z writes Minimize redundant passes Multiple lights/textures in one pass Use the smallest possible pixel format Consider clip/discard in transparent regions