GPU and PC System Architecture UC Santa Cruz BSoE – March 2009 John Tynefield / NVIDIA Corporation.

My Goals: survey the history and direction of GPU/PC system architecture; demonstrate the process of system-level architectural problem solving; motivate some of you to become architects.

Disclaimers: I work for NVIDIA; everything here is public info. All numbers and dates are approximate: rounding is our friend, no bus or processor is 100% efficient, etc. All examples are meant to be illustrative, not comprehensive (“there were >40 gfx companies in 1995”).

About Me: I love games and graphics. I love building things.

Structure: intro to PC and GPU architecture; a sampling of architectures (Voodoo Graphics / Pentium, GeForce 256 / P, GeForce 6800 / P, GeForce GTX280 / Core2); ideas for the future of the platform.

What do architects do? Impose structure on complex design problems; make tradeoffs; validate high-risk design bets; structure verification.

Why this is a great time to be an Architect: Radical design mobility (I have contributed to 10 completely new processor designs, 7 of which shipped in millions of units). Steep competition (not for everybody). Changing the world… no… really! Heterogeneous many-core computing is here to stay, and it has changed the nature of computing.

Design Tension: fixed function vs. programmable; scalar vs. vector; bandwidth vs. latency; in order vs. out of order; limited vs. unlimited (virtualized) resources.

Technology Trends: CPUs get faster; GPUs get faster; interconnects get faster; memory gets faster; memory gets denser; latency increases; feature load increases; physics intrudes more and more. All at different rates.

The long time horizon: the awesome ideas of now take 2+ years to reach market, and awesome depreciates rapidly. Predictable: silicon process roadmap; PC architecture roadmap; 3rd-party component roadmap; your capabilities and resources. Unpredictable: market shifts (commodity prices, supply shocks); 3rd-party strategic errors (OS/platform/partner slips); innovative competition (N-way struggle for design initiative).

Ultra Simplified PC Anatomy (diagram: CPU, Core Logic, GPU, GPU Memory, System Memory)

Ultra Simplified GPU Anatomy (diagram: Host Logic, Processor, DRAM MGMT)

Ultra Simplified GPU Anatomy (2) (diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, Memory)

GPU Prehistory. 1960s – 1970s: single-purpose BIG IRON (E&S, GE, Lockheed, …). 1980s – 1990s: general-purpose BIG IRON; custom ASICs, workstations (SGI, Sun, Intergraph). Maybe we can fit this on a single consumer add-in card?

Enabling Technologies in 1994: fast consumer CPUs with floating point (try 3D rendering in fixed point!); PCI; VGA and VESA; id Software’s DOOM; contract fabrication facilities offering 0.6 micron; ASIC design tools.

1996 3dfx Voodoo Graphics: PIO programming model; pure pipelined graphics; partial triangle setup (FP32); fixed-point integer texture mapping and Gouraud shading; Z buffer and full OpenGL blending; all at 1 PPC (pixel per clock), all the time, with no caches. 32-bit PCI: 0.09 GB/s. 128-bit EDO 50 MHz DRAM: 0.8 GB/s.

Voodoo Graphics System Architecture (diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, each marked as CPU or GPU work; components CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory)

Arch Decision – Triangle Setup. Target: a 3D triangle with texture and Gouraud shading. Pre-setup data: 3 * (XYW, RGBA, ST) = 72 bytes/triangle; over 32-bit 33 MHz PCI at 90 MB/s that is a speed-of-light limit of 1.25M triangles/second (1M is magic). Observe that post-setup data (3 * XY, plus W, RGBA, ST start values, screen-space derivatives, and area) is 76 bytes/triangle, or 1.18M triangles/second (still magic). Setup can be coded on a Pentium in ~100 clocks, giving 1M triangles/second on a P100 (mktg happy). Setup on chip would be data-limited and cost >10% of the die. Typical game scenes have <<1000 triangles/frame.
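The triangle-rate arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch, not something from the talk: the 24-byte vertex layout (three 4-byte floats for XYW, RGBA packed into 4 bytes, two 4-byte floats for ST) is an assumption chosen because it matches the stated 72 bytes/triangle, and all names are illustrative.

```python
# Rough reproduction of the Voodoo Graphics triangle-setup budget.
# Assumed vertex layout (not stated on the slide): XYW as three 4-byte
# floats, RGBA packed into 4 bytes, ST as two 4-byte floats.
BYTES_XYW = 3 * 4
BYTES_RGBA = 4
BYTES_ST = 2 * 4
BYTES_PER_VERTEX = BYTES_XYW + BYTES_RGBA + BYTES_ST  # 24 bytes

PCI_BYTES_PER_SEC = 90e6                  # effective PCI rate quoted on the slide
PRE_SETUP_BYTES = 3 * BYTES_PER_VERTEX    # 72 bytes per triangle
POST_SETUP_BYTES = 76                     # quoted on the slide


def triangles_per_second(bus_bytes_per_sec, bytes_per_triangle):
    """Bus-limited triangle rate: every triangle must cross the bus once."""
    return bus_bytes_per_sec / bytes_per_triangle


def cpu_setup_rate(clock_hz, clocks_per_triangle):
    """CPU-limited triangle rate if setup costs a fixed number of clocks."""
    return clock_hz / clocks_per_triangle


pre_rate = triangles_per_second(PCI_BYTES_PER_SEC, PRE_SETUP_BYTES)
post_rate = triangles_per_second(PCI_BYTES_PER_SEC, POST_SETUP_BYTES)
p100_rate = cpu_setup_rate(100e6, 100)    # 100 MHz Pentium, ~100 clocks per setup

print(f"pre-setup over PCI:  {pre_rate / 1e6:.2f} M triangles/s")
print(f"post-setup over PCI: {post_rate / 1e6:.2f} M triangles/s")
print(f"setup on a P100:     {p100_rate / 1e6:.2f} M triangles/s")
```

Under these assumptions all three figures sit at or just above the 1M triangles/second target, which is the comparison the slide draws.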

2000 Nvidia GeForce 256: decoupled input queuing; hardware transform and lighting (FP32 fixed-function transform, FP22 fixed-function lighting); complex fixed-function pixel shading; 4 pipelines. AGP4X: 1.06 GB/s. 256-bit DDR 300 MHz memory: 19.2 GB/s.

GeForce 256 System Architecture (diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, each marked as CPU or GPU work; components CPU, Core Logic, GPU, GPU Memory, System Memory)

Architecture Detail – Combiners: a logical fixed-function extension of the OpenGL machine. Surface Color = Diffuse * Texture + Specular. (Diagram inputs: Diffuse Color, Texture, Specular.)

Multi Texture: if one texture is good, more are better. Diff * (Tex1 + Tex2) + Spec, or Diff * Tex1 * Tex2, or … (Diagram inputs: Diffuse Color, Texture, 0.0, Texture, Specular; and Diffuse Color, Texture, Texture2, 1.0, Specular.)
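The combiner expressions on the last two slides can be written out per color channel. The following is a small sketch rather than a description of the actual hardware: the RGB-tuple representation, the function names, and the clamp to [0, 1] are all assumptions made for illustration.

```python
# Per-channel evaluation of the combiner expressions from the slides.
# Colors are RGB tuples with components in [0, 1]. Clamping the result
# to [0, 1] is an assumption about hardware behaviour, not a slide fact.

def clamp01(x):
    return max(0.0, min(1.0, x))


def single_texture(diffuse, texture, specular):
    # Surface Color = Diffuse * Texture + Specular
    return tuple(clamp01(d * t + s)
                 for d, t, s in zip(diffuse, texture, specular))


def additive_multitexture(diffuse, tex1, tex2, specular):
    # Diff * (Tex1 + Tex2) + Spec
    return tuple(clamp01(d * (a + b) + s)
                 for d, a, b, s in zip(diffuse, tex1, tex2, specular))


def modulated_multitexture(diffuse, tex1, tex2):
    # Diff * Tex1 * Tex2
    return tuple(clamp01(d * a * b)
                 for d, a, b in zip(diffuse, tex1, tex2))


grey = (0.5, 0.5, 0.5)
ramp = (1.0, 0.5, 0.0)
glint = (0.25, 0.25, 0.25)
print(single_texture(grey, ramp, glint))
print(additive_multitexture(grey, ramp, ramp, glint))
print(modulated_multitexture(grey, ramp, ramp))
```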

Combiners: a cascading Mux / SOP / Mux / SOP pipeline. Very flexible, but harder to program with deeper nesting. Everything is full speed! (Diagram labels: Texture, Fog, Light; A, B, C, and D muxes; AB and CD partials; inputs for the next stage of the pipeline.)
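One way to picture the cascade is as a chain of stages, each of which selects four inputs with muxes and then forms a sum of products, reading SOP as sum of products. The sketch below is a single-channel toy model under that reading; the register names, the clamp at 1.0, and the way results are written back into the register set are hypothetical, not taken from the hardware.

```python
# Toy model of a mux / sum-of-products combiner cascade (one channel).
# Each stage: four muxes select A, B, C, D from the current registers,
# then the stage computes A*B + C*D and writes it to a destination
# register that later stages can select. All names are illustrative.

def combiner_stage(registers, select):
    a, b, c, d = (registers[name] for name in select)
    ab_partial = a * b
    cd_partial = c * d
    return min(1.0, ab_partial + cd_partial)  # assumed clamp at 1.0


def run_pipeline(registers, stages):
    regs = dict(registers)                    # do not mutate the caller's dict
    for select, dest in stages:
        regs[dest] = combiner_stage(regs, select)
    return regs


inputs = {"diffuse": 0.5, "tex1": 0.5, "tex2": 0.25,
          "specular": 0.125, "one": 1.0, "zero": 0.0}

# Diff * (Tex1 + Tex2) + Spec expressed as two cascaded stages.
stages = [
    (("tex1", "one", "tex2", "one"), "spare0"),           # Tex1 + Tex2
    (("diffuse", "spare0", "specular", "one"), "out"),    # Diff * spare0 + Spec
]

result = run_pipeline(inputs, stages)
print(result["out"])
```

Deeper expressions need more stages and more bookkeeping about which register holds which partial result, which is the programming difficulty the slide mentions.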

Programmable Shading: but the future was obviously RenderMan-like shaders.

    normal surfaceN;
    color C = { 1.0, 0.5, 0.0 };
    normal lightDirection;
    Ci = C * dot ( surfaceN, lightDirection );
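The shader fragment on this slide amounts to scaling a base color by the cosine between the surface normal and the light direction. A runnable approximation, assuming unit-length vectors and adding a clamp for back-facing light that the slide does not show:

```python
# Python rendering of the slide's shader: Ci = C * dot(surfaceN, lightDirection).
# Vectors are assumed to be unit length. Clamping negative dot products to
# zero is an added assumption; the slide's one-liner does not clamp.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shade(surface_n, light_direction, c=(1.0, 0.5, 0.0)):
    n_dot_l = max(0.0, dot(surface_n, light_direction))
    return tuple(channel * n_dot_l for channel in c)


print(shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)))  # light along the normal
print(shade((0.0, 0.0, 1.0), (0.0, 1.0, 0.0)))  # light perpendicular to the normal
```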

2004 Nvidia GeForce 6800: fully general vertex and pixel ISA; 6 geometry processors; 16 pixel processors; deep recirculating pipelines to hide latency; FP32 datapath end to end. AGP8X: 2.11 GB/s. 256-bit 700 MHz GDDR3: 44 GB/s.

GeForce 6800 System Architecture (diagram: Physics and AI, Scene Mgmt, and pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, each marked as CPU or GPU work; components CPU, Core Logic, GPU, GPU Memory, System Memory)

Architecture Decision – Tex/Shader Structure. Problem: build a general programmable pipeline, but optimize for common workloads: TEX – BLEND – FOG, and common game shaders (e.g. Doom 3).

Plan A – Uncoupled: elegant; small fundamental unit; many “passes” for common shaders. (Diagram labels: TBF, TEXMTH, TEX, BLND; Registers, Texture, Math.)

Plan B – Coupled: less elegant; larger fundamental unit; single pass for common shaders; good scaling for longer shaders; big perf / area win given workloads; not forward looking. (Diagram labels: Registers, Math, Texture, Math.)

GeForce GTX280: fully unified programmable architecture; 240 instances of the same processor; IEEE FP32 and FP64. Gen2 PCIE: 8 GB/s. 512-bit 1100 MHz GDDR3: 144 GB/s.
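The memory bandwidth figures quoted for the four GPUs follow, to within the talk's stated rounding, from bus width times clock times transfers per clock. The sketch below assumes the quoted clocks are base clocks, that EDO moves one transfer per clock, and that DDR and GDDR3 move two; the function and table names are illustrative.

```python
# Peak memory bandwidth = (bus width in bytes) * clock * transfers per clock.
# Assumptions: clocks quoted on the slides are base clocks; EDO DRAM does one
# transfer per clock, DDR and GDDR3 do two. Peak figures ignore efficiency.

def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock):
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# (name, bus width in bits, clock in MHz, transfers per clock, slide figure in GB/s)
configs = [
    ("Voodoo Graphics", 128, 50, 1, 0.8),
    ("GeForce 256", 256, 300, 2, 19.2),
    ("GeForce 6800", 256, 700, 2, 44.0),
    ("GeForce GTX280", 512, 1100, 2, 144.0),
]

for name, bits, mhz, transfers, quoted in configs:
    computed = peak_bandwidth_gb_s(bits, mhz, transfers)
    print(f"{name:16s} computed {computed:6.1f} GB/s, slide quotes {quoted} GB/s")
```

The computed values land within a few percent of the quoted ones; the small gaps are consistent with the disclaimer that all numbers are approximate.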

GeForce GTX280 System Architecture (diagram: Physics and AI, Scene Mgmt, and pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, each marked as CPU or GPU work; components CPU, Core Logic, GPU, GPU Memory, System Memory)

Architecture Decision – Heterogeneous Computing Support: build a bigger chip; radically improve the ability of the GPU to share work with the CPU. (Diagram labels: Thread, Local Memory, Register File; Block, Shared Memory; Grid 0, Grid 1, Global Memory; sequential grids in time.)

Computing Support: add efficient thread launching; add general load / store instructions and datapath; add shared memory; add computational loads to performance design requirements.
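To make the grid / block / thread hierarchy and the role of shared memory concrete, here is a pure-Python emulation of a per-block sum. It is only a sketch: real hardware runs the threads of a block concurrently and needs barriers between reduction steps, whereas this loop is sequential, and every name in it is hypothetical.

```python
# Pure-Python emulation of grid / block / thread indexing with a per-block
# "shared memory" buffer. Real GPUs run a block's threads concurrently and
# synchronize between reduction steps; here each step is a plain loop.

def block_sum_kernel(global_in, global_out, block_idx, block_dim):
    shared = [0.0] * block_dim                    # per-block shared memory
    for thread_idx in range(block_dim):           # one iteration per thread
        i = block_idx * block_dim + thread_idx    # global thread index
        if i < len(global_in):                    # guard the ragged last block
            shared[thread_idx] = global_in[i]     # general load from global memory
    # Tree reduction in shared memory (block_dim must be a power of two).
    stride = block_dim // 2
    while stride > 0:
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2
    global_out[block_idx] = shared[0]             # general store to global memory


def launch(global_in, block_dim):
    grid_dim = (len(global_in) + block_dim - 1) // block_dim  # blocks in the grid
    global_out = [0.0] * grid_dim
    for block_idx in range(grid_dim):
        block_sum_kernel(global_in, global_out, block_idx, block_dim)
    return global_out


data = [float(i) for i in range(10)]
partial = launch(data, block_dim=4)
print(partial)       # one partial sum per block
print(sum(partial))  # total across the grid
```

A second launch over the partial sums would play the role of the next grid in the "sequential grids in time" picture, with global memory carrying results from one grid to the next.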

Future Graphics Directions: higher density; higher refresh; higher dynamic range; ubiquity; lower power; shaving off the last burrs; global illumination; higher-quality modeling; virtualized resources at interactive rates.

Future PC Architecture Directions. Highly integrated, low cost: require a minimum visual feature set (web / video / run today’s apps). And everyone else, differentiated PCs: more bandwidth and more parallel horsepower; more mature unified programming models (C on CUDA, DX11, OpenCL); more resource virtualization.

Q & A