Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.

Slides:

Advertisements

Similar presentations

Is There a Real Difference between DSPs and GPUs?

Advertisements

Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan

Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan GCafe December 10th, 2003.

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,

Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Photon Mapping on Programmable Graphics Hardware Timothy J. Purcell Mike Cammarano Pat Hanrahan Stanford University Craig Donner Henrik Wann Jensen University.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

The Programmable Graphics Hardware Pipeline Doug James Asst. Professor CS & Robotics.

Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan February 10th, 2003.

Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.

Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Stanford University DARPA Site Visit, UNC.

February 11, Streaming Architectures and GPUs Ian Buck Bill Dally & Pat Hanrahan Stanford University February 11, 2004.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

Status – Week 243 Victor Moya. Summary Current status. Current status. Tests. Tests. XBox documentation. XBox documentation. Post Vertex Shader geometry.

3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.

Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.

GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.

ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.

Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.

The programmable pipeline Lecture 10 Slide Courtesy to Dr. Suresh Venkatasubramanian.

Some Things Jeremy Sugerman 22 February Jeremy Sugerman, FLASHG 22 February 2005 Topics Quick GPU Topics Conditional Execution GPU Ray Tracing.

Mapping Computational Concepts to GPU’s Jesper Mosegaard Based primarily on SIGGRAPH 2004 GPGPU COURSE and Visualization 2004 Course.

Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.

Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.

Brook for GPUs. May 6, Status –Current efforts toward supporting Reservoir RStream compiler –Brook version 0.2 spec:

GPU Tutorial 이윤진 Computer Game 2007 가을 2007 년 11 월 다섯째 주, 12 월 첫째 주.

GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.

Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.

CSE 690: GPGPU Lecture 4: Stream Processing Klaus Mueller Computer Science, Stony Brook University.

REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.

Slide 1 / 16 On Using Graphics Hardware for Scientific Computing ________________________________________________ Stan Tomov June 23, 2006.

Enhancing GPU for Scientific Computing Some thoughts.

May 8, 2007Farid Harhad and Alaa Shams CS7080 Over View of the GPU Architecture CS7080 Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad &

Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.

Mapping Computational Concepts to GPUs Mark Harris NVIDIA.

GPU Computation Strategies & Tricks Ian Buck Stanford University.

GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.

Interactive Time-Dependent Tone Mapping Using Programmable Graphics Hardware Nolan GoodnightGreg HumphreysCliff WoolleyRui Wang University of Virginia.

Cg Programming Mapping Computational Concepts to GPUs.

1 SIC / CoC / Georgia Tech MAGIC Lab Rossignac GPU  Precision, Power, Programmability –CPU: x60/decade, 6 GFLOPS,

General-Purpose Computation on Graphics Hardware Adapted from: David Luebke (University of Virginia) and NVIDIA.

The programmable pipeline Lecture 3.

CSE 690: GPGPU Lecture 7: Matrix Multiplications Klaus Mueller Computer Science, Stony Brook University.

Tone Mapping on GPUs Cliff Woolley University of Virginia Slides courtesy Nolan Goodnight.

A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.

CS662 Computer Graphics Game Technologies Jim X. Chen, Ph.D. Computer Science Department George Mason University.

GPU Computation Strategies & Tricks Ian Buck NVIDIA.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

May 8, 2007Farid Harhad and Alaa Shams CS7080 Overview of the GPU Architecture CS7080 Final Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad.

GPU Based Sound Simulation and Visualization Torbjorn Loken, Torbjorn Loken, Sergiu M. Dascalu, and Frederick C Harris, Jr. Department of Computer Science.

Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.

Ray Tracing using Programmable Graphics Hardware

Mapping Computational Concepts to GPUs Mark Harris NVIDIA.

Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware Tim Foley Mike Houston Pat Hanrahan Computer Graphics Lab Stanford University.

GPGPU: Parallel Reduction and Scan Joseph Kider University of Pennsylvania CIS Fall 2011 Credit: Patrick Cozzi, Mark Harris Suresh Venkatensuramenan.

Ray Tracing by GPU Ming Ouhyoung. Outline Introduction Graphics Hardware Streaming Ray Tracing Discussion.

Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.

3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.

Computer Operation. Binary Codes CPU operates in binary codes Representation of values in binary codes Instructions to CPU in binary codes Addresses in.

GPU Architecture and Its Application

CS427 Multicore Architecture and Parallel Computing

Graphics Processing Unit

Chapter 6 GPU, Shaders, and Shading Languages

Graphics Processing Unit

Data Parallel Computing on Graphics Hardware

Ray Tracing on Programmable Graphics Hardware

Presentation transcript:

Data Parallel Computing on Graphics Hardware Ian Buck Stanford University

July 27th, Brook General purpose Streaming language DARPA Polymorphous Computing Architectures –Stanford - Smart Memories –UT Austin - TRIPS Processor –MIT - RAW Processor Stanford Streaming Supercomputer Brook: general purpose streaming language –Language developed at Stanford –Compiler in development by Reservoir Labs Study of GPUs as Streaming processor

July 27th, Why graphics hardware Raw Performance: Pentium 4 SSE Theoretical* 3GHz * 4 wide *.5 inst / cycle = 6 GFLOPS GeForce FX 5900 (NV35) Fragment Shader Obtained: MULR R0, R0, R0: 20 GFLOPS Equivalent to a 10 GHz P4 And getting faster: 3x improvement over NV30 (6 months) 2002 R&D Costs: Intel: $4 Billion NVIDIA: $150 Million *from Intel P4 Optimization Manual GeForce FX

July 27th, GPU: Data Parallel –Each fragment shaded independently No dependencies between fragments –Temporary registers are zeroed –No static variables –No Read-Modify-Write textures Multiple “pixel pipes” –Data Parallelism Support ALU heavy architectures Hide Memory Latency [Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

July 27th, Arithmetic Intensity Lots of ops per word transferred Graphics pipeline –Vertex BW: 1 triangle = 32 bytes; OP: f32-ops / triangle –Rasterization Create fragments per triangle –Fragment BW: 1 fragment = 10 bytes OP: i8-ops/fragment Courtesy of Pat Hanrahan

July 27th, Arithmetic Intensity Compute-to-Bandwidth ratio High Arithmetic Intensity desirable –App limited by ALU performance, not off-chip bandwidth –More chip real estate for ALUs, not caches Courtesy of Bill Dally

July 27th, Brook General purpose Streaming language Stream Programming Model –Enforce Data Parallel computing –Encourage Arithmetic Intensity –Provide fundamental ops for stream computing

July 27th, Brook General purpose Streaming language Demonstrate GPU streaming coprocessor –Make programming GPUs easier Hide texture/pbuffer data management Hide graphics based constructs in CG/HLSL Hide rendering passes –Highlight GPU areas for improvement Features required general purpose stream computing

July 27th, Streams & Kernels Streams –Collection of records requiring similar computation Vertex positions, voxels, FEM cell, … –Provide data parallelism Kernels –Functions applied to each element in stream transforms, PDE, … –No dependencies between stream elements Encourage high Arithmetic Intensity

July 27th, Brook C with Streams –API for managing streams –Language additions for kernels Stream Create/Store stream s = CreateStream (float, n, ptr); StoreStream (s, ptr);

July 27th, Brook Kernel Functions –Pos update in velocity field –Map a function to a set kernel void updatepos (stream float3 pos, float3 vel[100][100][100], float timestep, out stream float newpos) { newpos = pos + vel[pos.x][pos.y][pos.z]*timestep; } s_pos = CreateStream(float3, n, pos); s_vel = CreateStream(float3, n, vel); updatepos (s_pos, s_vel, timestep, s_pos);

July 27th, Fundamental Ops Associative Reductions KernelReduce(func, s, &val) –Produce a single value from a stream –Examples: Compute Max or Sum

July 27th, Fundamental Ops Associative Reductions KernelReduce(func, s, &val) –Produce a single value from a stream –Examples: Compute Max or Sum Gather: p = a[i] –Indirect Read –Permitted inside kernels Scatter: a[i] = p –Indirect Write ScatterOp(s_index, s_data, s_dst, SCATTEROP_ASSIGN) –Last write wins rule

July 27th, GatherOp & ScatterOp Indirect read/write with atomic operation GatherOp: p = a[i]++ GatherOp(s_index, s_data, s_src, GATHEROP_INC) ScatterOp: a[i] += p ScatterOp(s_index, s_data, s_dst, SCATTEROP_ADD) Important for building and updating data structures for data parallel computing

July 27th, Brook C with streams –kernel functions –CreateStream, StoreStream –KernelReduce –GatherOp, ScatterOp

July 27th, Implementation Streams –Stored in 2D fp textures / pbuffers –Managed by runtime Kernels –Compiled to fragment programs –Executed by rendering quad

July 27th, Implementation Compiler: brcc foo.br foo.cg foo.fp foo.c Source to Source compiler –Generate CG code Convert array lookups to texture fetches Perform stream/texture lookups Texture address calculation –Generate C Stub file Fragment Program Loader Render code

July 27th, Gromacs Molecular Dynamics Simulator Force Function (~90% compute time): Energy Function: Acceleration Structure: Eric Lindhal, Erik Darve, Yanan Zhao

July 27th, Ray Tracing Tim Purcell, Bill Mark, Pat Hanrahan

July 27th, Finite Volume Methods W i = ∂W/∂I i Joseph Teran, Victor Ng-Thow-Hing, Ronald Fedkiw

July 27th, Applications Sparse Matrix Multiply Batcher Bitonic Sort

July 27th, Summary GPUs are faster than CPUs –and getting faster Why? –Data Parallelism –Arithmetic Intensity What is the right programming model? –Stream Computing –Brook for GPUs

July 27th, GPU Gotchas NVIDIA NV3x: Register usage vs. GFLOPS Time Registers Used

July 27th, GPU Gotchas ATI Radeon 9800 Pro Limited dependent texture lookup 96 instructions 24-bit floating point Texture Lookup Math Ops Texture Lookup Math Ops Texture Lookup Math Ops Texture Lookup Math Ops

July 27th, Summary “All processors aspire to be general-purpose” –Tim van Hook, Keynote, Graphics Hardware 2001

July 27th, GPU Issues Missing Integer & Bit Ops Texture Memory Addressing –Address conversion burns 3 instr. per array lookup –Need large flat texture addressing Readback still slow CGC Performance –Hand code performance critical code No native reduction support

July 27th, GPU Issues No native Scatter Support –Cannot do p[i] = a (indirect write) –Requires CPU readback. –Needs: Dependent Texture Write Set x,y inside fragment program No programmable blend –GatherOp / ScatterOp

July 27th, GPU Issues Limited Output –Fragment program can only output single 4- component float or 4x4 component float (ATI) –Prevents multiple kernel outputs and large data types.

July 27th, Implementation Reduction –O(lg(n)) Passes Gather –Dependent texture read Scatter –Vertex shader (slow) GatherOp / ScatterOp –Vertex shader with CPU sort (slow)

July 27th, Acknowledgments NVIDIA Fellowship program DARPA PCA Pat Hanrahan, Bill Dally, Mattan Erez, Tim Purcell, Bill Mark, Eric Lindahl, Erik Darve, Yanan Zhao

July 27th, Status Compiler/Runtime work complete Applications in progress Release open source in fall Other streaming architectures –Stanford Streaming Supercomputer –PCA Architectures (DARPA)