Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan GCafe December 10th, 2003.

Slides:



Advertisements
Similar presentations
A. BobbioBertinoro, March 10-14, Dependability Theory and Methods 3. State Enumeration Andrea Bobbio Dipartimento di Informatica Università del Piemonte.
Advertisements

May International and European M.Eng. Introduction of 5th year technical options Master of Engineering for 2003 Introduction for students to the.
Milano 25/2/20031 Bandwidth Estimation for TCP Sources and its Application Prepared for QoS IP 2003 R. G. Garroppo, S.Giordano, M. Pagano, G. Procissi,
Joe Touch USC/ISI July 10, The X-Bone ICB Meeting July 10, 2003 Joe Touch Director, Postel Center for Experimental Networking Computer Networks.
25 July Navigating Doors. 25 July Starting the Program A Doors shortcut icon is on the computer desktop Double-click the icon to start Doors.
28 July Doors Creating Time Zones. 28 July What is a Time Zone? A designated period of time in which access can be granted to a secure area.
July 30, Pro Rata Budgeting. July 30, Pro Rata Budgeting Pro Rata Detail by Funds reports are available on the Internet no later than October.
Is There a Real Difference between DSPs and GPUs?
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
A Micro-benchmark Suite for AMD GPUs Ryan Taylor Xiaoming Li.
March 6, 20031WHAT WILL SUN DO FOR SUPERCOMPUTING IN THIS DECADE? Sun’s HPC Approach John L. Gustafson, Ph.D. Sun Labs HPC Workload Characterization and.
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Lecture 6: Multicore Systems
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan February 10th, 2003.
Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Stanford University DARPA Site Visit, UNC.
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.
June 11, 2002 SS-SQ-W: 1 Stanford Streaming Supercomputer (SSS) Spring Quarter Wrapup Meeting Bill Dally, Computer Systems Laboratory Stanford University.
February 11, Streaming Architectures and GPUs Ian Buck Bill Dally & Pat Hanrahan Stanford University February 11, 2004.
December 10, 2002 SS-FQ02-W: 1 Stanford Streaming Supercomputer (SSS) Fall Quarter 2002 Wrapup Meeting Bill Dally, Computer Systems Laboratory Stanford.
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.
Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
CSE 690: GPGPU Lecture 4: Stream Processing Klaus Mueller Computer Science, Stony Brook University.
Slide 1 / 16 On Using Graphics Hardware for Scientific Computing ________________________________________________ Stan Tomov June 23, 2006.
Enhancing GPU for Scientific Computing Some thoughts.
May 8, 2007Farid Harhad and Alaa Shams CS7080 Over View of the GPU Architecture CS7080 Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad &
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.
GPU Computation Strategies & Tricks Ian Buck Stanford University.
Cg Programming Mapping Computational Concepts to GPUs.
GPU in HPC Scott A. Friedman ATS Research Computing Technologies.
General-Purpose Computation on Graphics Hardware Adapted from: David Luebke (University of Virginia) and NVIDIA.
History of Microprocessor MPIntroductionData BusAddress Bus
Morgan Kaufmann Publishers
CSE 690: GPGPU Lecture 7: Matrix Multiplications Klaus Mueller Computer Science, Stony Brook University.
High-Level Languages for GPUs Ian Buck Stanford University.
CS662 Computer Graphics Game Technologies Jim X. Chen, Ph.D. Computer Science Department George Mason University.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Compiling Metaprogrammed Shaders to Stream GPUs Michael D. McCool Computer Graphics Lab University of Waterloo Graphics Hardware 2003.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
May 8, 2007Farid Harhad and Alaa Shams CS7080 Overview of the GPU Architecture CS7080 Final Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad.
Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.
David Angulo Rubio FAMU CIS GradStudent. Introduction  GPU(Graphics Processing Unit) on video cards has evolved during the last years. They have become.
Morgan Kaufmann Publishers Multicores, Multiprocessors, and Clusters
Mapping Computational Concepts to GPUs Mark Harris NVIDIA.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Sunpyo Hong, Hyesoon Kim
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
CIT 140: Introduction to ITSlide #1 CSC 140: Introduction to IT Operating Systems.
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Graphics Processing Unit
Chapter 6 GPU, Shaders, and Shading Languages
Data Parallel Computing on Graphics Hardware
Ray Tracing on Programmable Graphics Hardware
Graphics Processing Unit
6- General Purpose GPU Programming
CSE 502: Computer Architecture
Multicore and GPU Programming
Presentation transcript:

Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan GCafe December 10th, 2003

December 10th, Brook: general purpose streaming language developed for stanford streaming supercomputing project –architecture: Merrimac –compiler: RStream Reservoir Labs –Center for Turbulence Research –NASA –DARPA PCA Program Stanford: SmartMemories UT Austin: TRIPS MIT: RAW –Brook version 0.2 spec: Stream Execution Unit Stream Register File Memory System Network Interface Scalar Execution Unit text DRDRAM Network

December 10th, why graphics hardware? GeForce FX

December 10th, why graphics hardware? Pentium 4 SSE theoretical* 3GHz * 4 wide *.5 inst / cycle = 6 GFLOPS GeForce FX 5900 (NV35) fragment shader obtained: MULR R0, R0, R0: 20 GFLOPS equivalent to a 10 GHz P4 and getting faster: 3x improvement over NV30 (6 months) *from Intel P4 Optimization Manual GeForce FX

December 10th, gpu: data parallel & arithmetic intensity –data parallelism each fragment shaded independently better alu usage hide memory latency

December 10th, gpu: data parallel & arithmetic intensity –data parallelism each fragment shaded independently better alu usage hide memory latency –arithmetic intensity compute-to-bandwidth ratio lots of ops per word transferred app limited by alu performance, not off-chip bandwidth more chip real estate for alu’s, not caches 64bit fpu

December 10th, Brook: general purpose streaming language stream programming model –enforce data parallel computing streams –encourage arithmetic intensity kernels C with streams

December 10th, Brook for gpus demonstrate gpu streaming coprocessor –explicit programming abstraction

December 10th, Brook for gpus demonstrate gpu streaming coprocessor –make programming gpus easier hide texture/pbuffer data management hide graphics based constructs in CG/HLSL hide rendering passes virtualize resources

December 10th, Brook for gpus demonstrate gpu streaming coprocessor –make programming gpus easier hide texture/pbuffer data management hide graphics based constructs in CG/HLSL hide rendering passes virtualize resources –performance! … on applications that matter

December 10th, Brook for gpus demonstrate gpu streaming coprocessor –make programming gpus easier hide texture/pbuffer data management hide graphics based constructs in CG/HLSL hide rendering passes virtualize resources –performance! … on applications that matter –highlight gpu areas for improvement features required general purpose stream computing

December 10th, system outline.br Brook source files brcc source to source compiler brt Brook run-time library

December 10th, Brook language streams streams –collection of records requiring similar computation particle positions, voxels, FEM cell, … float3 positions ; float3 velocityfield ;

December 10th, Brook language streams streams –collection of records requiring similar computation particle positions, voxels, FEM cell, … float3 positions ; float3 velocityfield ; –similar to arrays, but… index operations disallowed: position[i] read/write stream operators streamRead (positions, p_ptr); streamWrite (velocityfield, v_ptr); –encourage data parallelism

December 10th, Brook language kernels kernels –functions applied to streams similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; }

December 10th, Brook language kernels kernels –functions applied to streams similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a ; float b ; float c ; foo(a,b,c);

December 10th, Brook language kernels kernels –functions applied to streams similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a ; float b ; float c ; foo(a,b,c); for (i=0; i<100; i++) c[i] = a[i]+b[i];

December 10th, Brook language kernels kernels –functions applied to streams similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } –no dependencies between stream elements encourage high arithmetic intensity

December 10th, Brook language kernels kernels arguments –input/output streams kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } a,b: Read-only input streams result: Write-only output stream

December 10th, Brook language kernels kernels arguments –input/output streams –constant parameters kernel void foo (float a<>, float b<>, float t, out float result<>) { result = a + t*b; } float a ; float b ; float c ; foo(a,b,3.2f,c);

December 10th, Brook language kernels kernels arguments –input/output streams –constant paramters –gather streams kernel void foo (float a<>, float b<>, float t, float array[], out float result<>) { result = array[a] + t*b; } float a ; float b ; float c ; float array foo(a,b,3.2f,array,c); gpu bonus

December 10th, Brook language kernels kernels arguments –input/output streams –constant parameters –gather streams –iterator streams kernel void foo (float a<>, float b<>, float t, float array[], iter float n<>, out float result<>) { result = array[a] + t*b + n; } float a ; float b ; float c ; float array iter float n = iter(0, 10); foo(a,b,3.2f,array,n,c); gpu bonus

December 10th, Brook language kernels example –position update in velocity field kernel void updatepos (float2 pos<>, float2 vel[100][100], float timestep, out float2 newpos<>) { newpos = pos + vel[pos]*timestep; } updatepos (positions, velocityfield, 10.0f, positions);

December 10th, Brook language reductions

December 10th, Brook language reductions reductions –compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } float a ; float r; sum(a,r);

December 10th, Brook language reductions reductions –compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } float a ; float r; sum(a,r); r = a[0]; for (int i=1; i<100; i++) r += a[i];

December 10th, Brook language reductions reductions –associative operations only (a+b)+c = a+(b+c) sum, multiply, max, min, OR, AND, XOR matrix multiply

December 10th, Brook language reductions multi-dimension reductions –stream “shape” differences resolved by reduce function

December 10th, Brook language reductions multi-dimension reductions –stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a ; float r ; sum(a,r);

December 10th, Brook language reductions multi-dimension reductions –stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a ; float r ; sum(a,r); for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j];

December 10th, Brook language reductions multi-dimension reductions –stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a ; float r ; sum(a,r); for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j];

December 10th, Brook language stream repeat & stride kernel arguments of different shape –resolved by repeat and stride

December 10th, Brook language stream repeat & stride kernel arguments of different shape –resolved by repeat and stride kernel void foo (float a<>, float b<>, out float result<>); float a ; float b ; float c ; foo(a,b,c);

December 10th, Brook language stream repeat & stride kernel arguments of different shape –resolved by repeat and stride kernel void foo (float a<>, float b<>, out float result<>); float a ; float b ; float c ; foo(a,b,c); foo(a[0], b[0], c[0]) foo(a[2], b[0], c[1]) foo(a[4], b[1], c[2]) foo(a[6], b[1], c[3]) foo(a[8], b[2], c[4]) foo(a[10], b[2], c[5]) foo(a[12], b[3], c[6]) foo(a[14], b[3], c[7]) foo(a[16], b[4], c[8]) foo(a[18], b[4], c[9])

December 10th, Brook language matrix vector multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) { result += a; } float matrix ; float vector ; float tempmv ; float result ; mul(matrix,vector,tempmv); sum(tempmv,result); M V V V T =

December 10th, Brook language matrix vector multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) { result += a; } float matrix ; float vector ; float tempmv ; float result ; mul(matrix,vector,tempmv); sum(tempmv,result); RT sum

December 10th, brcc compiler infrastructure

December 10th, brcc compiler infrastructure based on ctool – parser –build code tree –extend C grammar to accept Brook convert –tree transformations codegen –generate cg & hlsl code –call cgc, fxc –generate stub function

December 10th, brcc compiler kernel compilation kernel void updatepos (float2 pos<>, float2 vel[100][100], float timestep, out float2 newpos<>) { newpos = pos + vel[pos]*timestep; } float4 main (uniform float4 _workspace : register (c0), uniform sampler _tex_pos : register (s0), float2 _tex_pos_pos : TEXCOORD0, uniform sampler vel : register (s1), uniform float4 vel_scalebias : register (c1), uniform float timestep : register (c2)) : COLOR0 { float4 _OUT; float2 pos; float2 newpos; pos = tex2D(_tex_pos, _tex_pos_pos).xy; newpos = pos + tex2D(vel,(pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy * timestep; _OUT.x = newpos.x; _OUT.y = newpos.y; _OUT.z = newpos.y; _OUT.w = newpos.y; return _OUT; }

December 10th, brcc compiler kernel compilation static const char __updatepos_ps20[] = "ps_2_ static const char __updatepos_fp30[] = "!!fp void updatepos (const __BRTStream& pos, const __BRTStream& vel, const float timestep, const __BRTStream& newpos) { static const void *__updatepos_fp[] = { "fp30", __updatepos_fp30, "ps20", __updatepos_ps20, "cpu", (void *) __updatepos_cpu, "combine", 0, NULL, NULL }; static __BRTKernel k(__updatepos_fp); k->PushStream(pos); k->PushGatherStream(vel); k->PushConstant(timestep); k->PushOutput(newpos); k->Map(); }

December 10th, brcc runtime streams

December 10th, brt runtime streams streams pos vel texture 1 texture 2 separate texture per stream

December 10th, brt runtime kernels kernel execution –set stream texture as render target –bind inputs to texture units –issue screen size quad texture coords provide stream positions b a result vel foo kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; }

December 10th, brt runtime reductions reduction execution –multipass execution –associativity required

December 10th, research directions demonstrate gpu streaming coprocessor –compiling Brook to gpus –evaluation –applications

December 10th, research directions applications –linear algebra –image processing –molecular dynamics (gromacs) –FEM –multigrid –raytracer –volume renderer –SIGGRAPH / GH papers

December 10th, research directions virtualize gpu resources –texture size and formats packing streams to fit in 2D segmented memory space float matrix ;

December 10th, research directions virtualize gpu resources –texture size and formats support complex formats typedef struct { float3 pos; float3 vel; float mass; } particle; kernel void foo (particle a<>, float timestep, out particle b<>); float a [8][2];

December 10th, research directions virtualize gpu resources –multiple outputs simple: let cgc or fxc do dead code elimination better: compute intermediates separately kernel void foo (float3 a<>, float3 b<>, …, out float3 x<>, out float3 y<>) kernel void foo1(float3 a<>, float3 b<>, …, out float3 x<>) kernel void foo2(float3 a<>, float3 b<>, …, out float3 y<>)

December 10th, research directions virtualize gpu resources –limited instructions per kernel generalize RDS algorithm for kernels –compute ideal # of passes for intermediate results –hard ???

December 10th, research directions auto vectorization kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } kernel void foo_faster (float4 a<>, float4 b<>, out float4 result<>) { result = a + b; }

December 10th, research directions Brook v0.2 support –stream operators stencil, group, domain, repeat, stride, merge, … –building and manipulating data structures scatterOp a[i] += p gatherOp p = a[i]++ gpu primitives

December 10th, research directions gpu areas of improvement –reduction registers –texture constraints –scatter capabilities –programmable blending gatherOp, scatterOp

December 10th, Brook status team –Jeremy Sugerman –Daniel Horn –Tim Foley –Ian Buck beta release –December 15th –sourceforge

December 10th, Questions? Fly-fishing fly images from The English Fly Fishing ShopThe English Fly Fishing Shop