Issues And Challenges In Compiling for Graphics Processors Norm Rubin Fellow AMD Graphics Products Group.




CGO 2008. About Norm Rubin (inserted by KH): Norm Rubin has over twenty years of experience delivering commercial compilers for processors ranging from embedded (ARM) and desktop (HP Alpha) to supercomputer (KSR), and is a recognized expert in the field. For the last five years, he has been architecting compilers for ATI, one of the world's largest vendors of graphics chips. Norm holds a PhD from the Courant Institute at NYU. Besides his work in compilers, he is well known for his work in related parts of the tool chain: binary translators and dynamic optimizers.

Overview. What is the difference between a GPU and a CPU? What is the programming model for graphics – how do 10,000 people easily write high-performance parallel code? Where does this compiler fit?

ATI WhiteOut Demo [2007]
GPU machine model

Chip Design Focus Point
CPU: lots of instructions, little data – out-of-order execution, branch prediction; reuse and locality; task parallel; needs an OS; complex sync; latency machines.
GPU: few instructions, lots of data – SIMD, hardware threading; little reuse; data parallel; no OS; simple sync; throughput machines.

Performance is King. Cinematic world: Pixar uses 100,000 minutes of compute per minute of image. Blinn's Law (the flip side of Moore's Law): time per frame is roughly constant; audience expectation of quality per frame is higher every year; expectation of quality increases as compute increases. GPUs are real time – 1 minute of compute per minute of image – so users want machines that are 100,000 times faster. Cars. Courtesy of Pixar Animation Studios.

GPU vs. CPU performance.
thread:
  r1 = load(index)   // load
  r1 = r1 + r1       // series of adds
  …
Run lots of threads. Can you get peak performance on a multi-core or a cluster? Peak performance = doing float ops every cycle.

Typical CPU Operation. Waits for memory; the gaps prevent peak performance. Gap size varies dynamically. Hard to tolerate latency. One iteration at a time. Single CPU unit. Cannot reach 100%. Hard to prefetch data. Multi-core does not help. A cluster does not help. Limited number of outstanding fetches.

GPU Threads (lower clock – different scale). Overlapped fetch and ALU. Many outstanding fetches. Lots of threads. Fetch unit + ALU unit. Fast thread switch. In-order finish. ALU units reach 100% utilization. Hardware sync for final output.

GPU Internals. ATI MedViz Demo [2007]

One wavefront is 64 threads. Two wavefronts per SIMD (running). 16 PEs per SIMD. 4 SIMD engines. 8 program counters. 512 threads running. Once enough resources are available, a thread goes into the run queue. 16 instructions finish per SIMD per cycle; each instruction is a 5-way VLIW.
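The thread counts above follow from a little arithmetic; a quick Python sketch to check them (constants are taken from the slide, variable names are mine):

```python
# Thread-count arithmetic for the GPU described above.
WAVEFRONT_SIZE = 64        # threads per wavefront
WAVEFRONTS_RUNNING = 2     # wavefronts running per SIMD
PES_PER_SIMD = 16          # processing elements per SIMD
NUM_SIMDS = 4              # SIMD engines on the chip

threads_running = WAVEFRONT_SIZE * WAVEFRONTS_RUNNING * NUM_SIMDS
print(threads_running)     # 512 threads running

# A 64-thread wavefront issues over 16 PEs, so it takes 4 cycles
# to step the whole wavefront through one instruction.
cycles_per_wavefront_issue = WAVEFRONT_SIZE // PES_PER_SIMD
print(cycles_per_wavefront_issue)  # 4
```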

Threads in Run Queue. Each SIMD has 256 sets of registers, 64 registers in a set (each holds 128 bits). If each thread needs 5 (128-bit) registers, then 256/5 = 51 wavefronts can get into the run queue. 51 wavefronts = 3264 running or waiting threads per SIMD. 256 * 64 * 4 vector registers = 256 * 64 * 4 * 4 (32-bit) registers = 262,144 registers, or 1 megabyte of register space.
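The register-file arithmetic can be verified the same way (a sketch; constants from the slide, names mine):

```python
# Register-file arithmetic from the slide above.
SETS_PER_SIMD = 256        # register sets per SIMD
REGS_PER_SET = 64          # registers in a set (one per wavefront thread)
NUM_SIMDS = 4
WAVEFRONT_SIZE = 64

regs_per_thread = 5        # example from the slide: 5 vector registers/thread
wavefronts_in_queue = SETS_PER_SIMD // regs_per_thread
print(wavefronts_in_queue)                    # 51 wavefronts fit
print(wavefronts_in_queue * WAVEFRONT_SIZE)   # 3264 threads per SIMD

vector_regs_total = SETS_PER_SIMD * REGS_PER_SET * NUM_SIMDS
scalar_regs_total = vector_regs_total * 4     # each holds 4 x 32-bit values
bytes_total = scalar_regs_total * 4           # 4 bytes per 32-bit register
print(vector_regs_total, scalar_regs_total, bytes_total)  # 65536 262144 1048576
```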

Implications.
CPU: loads determine performance, so the compiler works hard to minimize ALU code, reduce memory overhead, and use prefetch and other magic to reduce the time spent waiting for memory.
GPU: threads determine performance, so the compiler works hard to minimize ALU code, maximize threads, and reorder instructions to reduce synchronization and other time spent waiting for threads.

Graphics Programming Model – CPU Part.
for each frame (sequential) { build vertex buffer; set uniform inputs; draw }
This is producer-consumer parallelism: an internal queue of pending draw commands (often hundreds).
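The producer-consumer structure can be pictured as a queue between the sequential CPU loop and the GPU. A toy Python model (the helper names and toy vertex data are invented for illustration, not a driver API):

```python
from collections import deque

# CPU side produces draw commands; the GPU consumes them from an internal
# queue of pending draws (often hundreds deep). All names are illustrative.
pending_draws = deque()

def cpu_submit_frame(frame_id):
    # Sequential CPU part: build vertex buffer, set uniforms, "draw".
    vertex_buffer = [(frame_id, i) for i in range(3)]   # toy vertex data
    uniforms = {"frame": frame_id}
    pending_draws.append((vertex_buffer, uniforms))     # the draw call

def gpu_consume():
    # GPU side: drain one pending draw from the queue.
    vertex_buffer, uniforms = pending_draws.popleft()
    return len(vertex_buffer)

for f in range(100):        # the CPU runs far ahead of the GPU
    cpu_submit_frame(f)
print(len(pending_draws))   # 100 draws pending before the GPU catches up
while pending_draws:
    gpu_consume()
```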

Programming Model – GPU Part.
foreach vertex in buffer (parallel) { call vertex kernel/shader }
foreach set of 3 vertex outputs (a triangle) (sequential) { fixed-function rasterize; foreach pixel (parallel) { call pixel kernel/shader } }
Nothing about the number of cores. Nothing about sync. The developer just writes the kernels.
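The implicit loops above can be modeled in a few lines. This is a toy Python sketch: the developer-written parts are `vertex_kernel` and `pixel_kernel`, everything else stands in for the hidden runtime, and the "rasterizer" is trivially simplified to emit one centroid fragment per triangle (all names are illustrative):

```python
# Developer-written kernels: one vertex or one pixel at a time.
def vertex_kernel(v):
    x, y = v
    return (2 * x, 2 * y)          # toy "transform": scale the position

def pixel_kernel(attrs):
    x, y = attrs
    return (x + y) % 256           # toy "shade": attributes -> a color

# Hidden runtime: the parallel loops, sequencing, and sync live here.
def draw(vertex_buffer):
    out = [vertex_kernel(v) for v in vertex_buffer]    # parallel in hardware
    frame = []
    for i in range(0, len(out) - 2, 3):                # sequential per triangle
        tri = out[i:i + 3]
        # Trivial fixed-function "rasterize": one fragment at the centroid.
        cx = sum(x for x, _ in tri) / 3
        cy = sum(y for _, y in tri) / 3
        frame.append(pixel_kernel((cx, cy)))           # parallel in hardware
    return frame

print(draw([(0, 0), (3, 0), (0, 3)]))  # [4.0]: one triangle -> one fragment
```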

Rasterizer. Vertex shader output is directed to the rasterizer. The rasterizer interpolates per-vertex values. Interpolated data is sent per pixel to the pixel shader program.

Pixel Shader (20 statements in byte code):

float4 ambient;
float4 diffuse;
float4 specular;
float Ka, Ks, Kd, N;

float4 main( float4 Diff   : COLOR0,
             float3 Normal : TEXCOORD0,
             float3 Light  : TEXCOORD1,
             float3 View   : TEXCOORD2 ) : COLOR
{
    // Compute the reflection vector:
    float3 vReflect = normalize( 2 * dot( Normal, Light ) * Normal - Light );

    // Final color is composed of ambient, diffuse and specular contributions:
    float4 FinalColor = Ka * ambient
                      + Kd * diffuse * dot( Normal, Light )
                      + Ks * specular * pow( max( dot( vReflect, View ), 0 ), N );

    return FinalColor;
}
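For readers who do not speak HLSL, the same lighting math transliterates into plain Python. This is a sketch with scalar colors standing in for float4 values; the helper functions are mine:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def scale(k, v): return tuple(k * x for x in v)
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def pixel_shader(normal, light, view, ambient, diffuse, specular,
                 ka, kd, ks, n):
    # Reflection vector: normalize(2*dot(N,L)*N - L), as in the HLSL above.
    reflect = normalize(sub(scale(2 * dot(normal, light), normal), light))
    spec = math.pow(max(dot(reflect, view), 0.0), n)
    return ka * ambient + kd * diffuse * dot(normal, light) + ks * specular * spec

# Head-on geometry: N = L = V = +z, so diffuse and specular are both maximal.
c = pixel_shader((0, 0, 1), (0, 0, 1), (0, 0, 1),
                 ambient=0.1, diffuse=0.6, specular=0.3,
                 ka=1.0, kd=1.0, ks=1.0, n=8)
print(c)  # approximately 1.0 (= 0.1 + 0.6 + 0.3)
```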

What is the parallelism model?

Programming model. Vertex and pixel kernels (shaders). Parallel loops are implicit. Performance-aware code does not know how many cores or how many threads there are. All sorts of queues are maintained under the covers. All kinds of sync are done implicitly. Programs are very small.

Parallelism Model. All parallel operations are hidden behind domain-specific API calls. Developers write sequential code plus kernels. A kernel operates on one vertex or pixel. Developers never deal with parallelism directly. No need for auto-parallelizing compilers.

Ruby Demo Series. Four versions, each done by experts to show off features of the chip as well as develop novel forward-looking graphics techniques. The first three were written in DirectX 9, the fourth in DirectX 10.


Shader Compiler (SC). Developers ship games in byte code; each time a game starts, the shader is compiled. The compiler is hidden in the driver – no user gets to set options or flags. The compiler updates with each new driver (about once a month). Compilation is done each time the game is run – like a JIT, but we care about performance. SC runs on consoles, phones, laptops, desktops, etc.

Relation to a Standard CPU Compiler. About half the code is a traditional compiler – all the usual stuff: SSA form, a graph-coloring register allocator, an instruction scheduler. But there is a lot of special stuff!

Some Odd Features. The HLSL compiler is written by Microsoft and has its own idea of how to optimize a program; each compiler fights the other, so SC undoes the Microsoft optimizations. The hardware updates frequently, so SC supports a large number of targets internally (picture gcc done in C++ classes); one version of the compiler supports all chips.

Compiler Statistics

Register Allocation. A local allocator runs as part of instruction scheduling, plus a global graph-coloring allocator (even though this is a JIT). Two allocators, for speed.
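For readers unfamiliar with the global side, here is a minimal sketch of graph coloring over an interference graph: virtual registers that are live at the same time must receive different physical registers. This is a textbook toy with a greedy coloring heuristic, not SC's actual allocator:

```python
# interference: {vreg: set of vregs simultaneously live with it}.
# Two interfering vregs must get different colors (physical registers).
def color(interference):
    colors = {}
    # Color high-degree nodes first (a common greedy heuristic).
    for v in sorted(interference, key=lambda v: -len(interference[v])):
        used = {colors[n] for n in interference[v] if n in colors}
        c = 0
        while c in used:           # smallest color not used by a neighbor
            c += 1
        colors[v] = c
    return colors

# a interferes with everything; b, c, d only as listed.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
asg = color(g)
print(asg)  # {'a': 0, 'b': 1, 'c': 2, 'd': 1} - three physical registers
```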

New Issues. Fewer registers are better, because of threading. Load/store multiple takes the same time as load/store, so spill costs are different. Registers hold 4-element vectors.

Register Allocation. Two register allocators: HLSL/manual vs. SC. SC, which knows the machine, usually needs fewer registers. The register delta is not related to program size, but bigger programs need more registers.


Open Problem: Path-Aware Allocation.
if (run-time-const) { call s1; } else { call s2; }
Can we allocate high-numbered registers to s2?
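One way to picture the opportunity: the two calls sit on mutually exclusive paths, so a path-aware allocator could hand s2 the high-numbered registers that s1 never touches, shrinking the live register footprint of any single execution. A toy sketch under that assumption (the register count and names are invented):

```python
NUM_REGS = 16   # hypothetical register file size

def allocate(path_demands):
    # path_demands: {"s1": n1, "s2": n2} - registers each exclusive path needs.
    # s1 gets the low end of the file, s2 the high end; since only one path
    # runs per invocation, the ranges can even be allowed to overlap in a
    # real allocator - here we keep them disjoint for clarity.
    a = {"s1": list(range(path_demands["s1"]))}
    a["s2"] = list(range(NUM_REGS - path_demands["s2"], NUM_REGS))
    return a

alloc = allocate({"s1": 6, "s2": 5})
print(alloc["s1"])  # [0, 1, 2, 3, 4, 5]
print(alloc["s2"])  # [11, 12, 13, 14, 15]
# Fewer registers live per path means more wavefronts fit in the run queue.
```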

Scheduler. A modified list scheduler. Issues: grouping loads/fetches; compiler control of thread switch; loss of state at a thread switch; multiple ways to code operations, with final instruction selection done in the scheduler.
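For readers who have not seen one, a bare-bones list scheduler looks like this; the load-grouping and thread-switch heuristics from the slide are omitted, and all names are toy examples:

```python
# A minimal list scheduler: repeatedly issue a "ready" instruction (all
# dependence predecessors already issued), picking the highest priority.
def list_schedule(deps, priority):
    # deps: {instr: set of instrs that must issue first}
    remaining = set(deps)
    issued = []
    while remaining:
        ready = [i for i in remaining if deps[i] <= set(issued)]
        nxt = max(ready, key=lambda i: priority[i])
        issued.append(nxt)
        remaining.remove(nxt)
    return issued

deps = {"load_a": set(), "load_b": set(),
        "add": {"load_a", "load_b"}, "store": {"add"}}
prio = {"load_a": 3, "load_b": 3, "add": 2, "store": 1}

order = list_schedule(deps, prio)
print(order)  # the two loads issue first, then add, then store
```

Giving loads the highest priority front-loads the fetches, which is the simplest form of the load-grouping issue mentioned above.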

[Chart: number of cycles changed by scheduling – up is better]

5-Way VLIW Packing Rate. Bioshock: 2290 ps. Call of Juarez: 594 ps. Crysis: 284 ps. Lost Planet: 83 ps. Best would be 5.0.

Dependence graph


Bug Report. Graphics: shadow acne. Compiler: round-off error.
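The connection is easy to reproduce: shadow mapping compares two depth values that should be equal for a lit surface, and compiler round-off (for example, reassociating the depth computation) can make them differ, so the surface shadows itself ("shadow acne"). A toy Python illustration of the effect and of the usual depth-bias fix (the numbers are illustrative):

```python
# A surface point's depth computed twice: once while rendering the shadow
# map, once while shading. Round-off makes the two values differ slightly,
# so a naive comparison marks the lit surface as shadowing itself.
depth_in_map = 0.1 + 0.2       # depth stored when building the shadow map
depth_when_shading = 0.3       # the "same" depth, computed a different way

naive_in_shadow = depth_when_shading < depth_in_map
print(naive_in_shadow)         # True: false self-shadowing from round-off

BIAS = 1e-4                    # the standard fix: compare with a small bias
biased_in_shadow = depth_when_shading < depth_in_map - BIAS
print(biased_in_shadow)        # False: the acne is gone
```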

Trademark Attribution. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. © 2006 Advanced Micro Devices, Inc. All rights reserved. Questions? amd.com