Streaming Architectures and GPUs
Ian Buck, Bill Dally & Pat Hanrahan
Stanford University
February 11, 2004
To Exploit VLSI Technology We Need:
Parallelism
–To keep 100s of ALUs per chip (thousands/board, millions/system) busy
Latency tolerance
–To cover 500-cycle remote memory access time
Locality
–To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth
Moore's Law
–Growth of transistors, not performance
Arithmetic is cheap, global bandwidth is expensive
Local << global on-chip << off-chip << global system
Courtesy of Bill Dally
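The bandwidth gap above can be made concrete with a quick derived ratio (the bandwidth figures are from the slide; the ratio is computed from them):

```python
# Ratio of on-chip ALU operand bandwidth to off-chip bandwidth,
# using the figures quoted above.
alu_bw_gbps = 20_000.0   # ~20 Tb/s of aggregate ALU bandwidth
chip_bw_gbps = 100.0     # ~100 Gb/s of off-chip bandwidth
ratio = alu_bw_gbps / chip_bw_gbps
print(ratio)  # 200.0: each off-chip word must feed ~200 on-chip references
```

This factor of ~200 is why locality is listed alongside parallelism and latency tolerance: operands must stay in local register files, not stream through the pins.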
Arithmetic Intensity
Lots of ops per word transferred: the compute-to-bandwidth ratio
High arithmetic intensity is desirable
–App limited by ALU performance, not off-chip bandwidth
–More chip real estate for ALUs, not caches
Courtesy of Pat Hanrahan
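As a sketch of how arithmetic intensity is counted (the kernels and byte counts below are illustrative assumptions, not from the slide):

```python
def arithmetic_intensity(ops, bytes_moved):
    """Compute-to-bandwidth ratio: ops per byte transferred off-chip."""
    return ops / bytes_moved

# Low intensity: saxpy y = a*x + y does 2 flops per element while moving
# 12 bytes (read x, read y, write y, with 4-byte floats).
low = arithmetic_intensity(2, 12)

# Higher intensity: a 4x4-matrix-times-vec4 transform does 28 flops
# (16 mul + 12 add) while moving 32 bytes (vec4 in, vec4 out; the matrix
# stays in constant registers).
high = arithmetic_intensity(28, 32)
assert high > low
```

The second kernel gets more than 5x the ops out of every byte transferred, which is the property the slide says frees chip real estate for ALUs instead of caches.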
Brook: Stream Programming Model
–Enforce data-parallel computing
–Encourage arithmetic intensity
–Provide fundamental ops for stream computing
Streams & Kernels
Streams
–Collections of records requiring similar computation
–Vertex positions, voxels, FEM cells, …
–Provide data parallelism
Kernels
–Functions applied to each element in a stream
–Transforms, PDEs, …
–No dependencies between stream elements
–Encourage high arithmetic intensity
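A minimal sketch of this model (names assumed; Brook itself is a C-like language, not Python):

```python
# A "stream" is a collection of records; a "kernel" is a function applied
# to each element independently, with no cross-element dependencies.
def kernel_scale(record):
    x, y, z = record
    return (2.0 * x, 2.0 * y, 2.0 * z)

positions = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]  # stream of vertex positions
out = list(map(kernel_scale, positions))        # kernel over the stream
```

Because the kernel touches one element at a time, every element can be processed in parallel in any order.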
Vectors vs. Streams
Vectors: v is an array of floats
–Instruction sequence: LD v0; LD v1; ADD v0, v1, v2; ST v2
–Large set of temps
Streams: s is a stream of records
–Instruction sequence: LD s0; LD s1; CALLS f, s0, s1, s2; ST s2
–Small set of temps
Higher arithmetic intensity: |f|/|s| >> |+|/|v|
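The ratio |f|/|s| >> |+|/|v| says a stream kernel performs many operations per record fetched, where a vector instruction performs one. A toy illustration (the kernel body is hypothetical, not from the slide):

```python
# Vector model: one ADD per pair of words loaded; |+|/|v| is ~1 op/word.
def vector_add(v0, v1):
    return [a + b for a, b in zip(v0, v1)]

# Stream model: the kernel f runs several ops per record pair before the
# result is stored; |f|/|s| grows with the size of the kernel body.
def kernel_f(r0, r1):
    s = r0 * r1          # op 1
    s = s * s + r0       # ops 2, 3
    return s - r1        # op 4: four ops per two records loaded
```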
Imagine
–Stream processor for image and signal processing
–16 mm die in 0.18 µm TI process
–21M transistors
[Diagram: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU Cluster (544 GB/s)]
Merrimac Processor
–90 nm ASIC technology (1 V)
–1 GHz (37 FO4)
–128 GOPS
–Inter-cluster switch between clusters
–Small die: ~12 x 10 mm²
–Stanford Imagine is 16 mm x 16 mm
–MIT Raw is 18 mm x 18 mm
–25 Watts (P4 = 75 W); ~41 W with memories
[Floorplan, 12.5 mm edge: 16 ALU clusters, microcontroller, network interface, 16 RDRAM interfaces with forward ECC, address generators, reorder buffers, cache banks, memory switch, and two MIPS64 20Kc scalar cores]
Merrimac Streaming Supercomputer
Streaming Applications
–Finite volume: StreamFLO (from TFLO)
–Finite element: StreamFEM
–Molecular dynamics code (ODEs): StreamMD
–Model (elliptic, hyperbolic, and parabolic) PDEs
–PCA applications: FFT, matrix multiply, SVD, sort
StreamFLO
StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil. The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations. The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.
StreamFEM
A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
StreamMD: Motivation
Application: study the folding of human proteins.
Molecular dynamics: computer simulation of the dynamics of macromolecules.
Why this application?
–Expect high arithmetic intensity.
–Requires variable-length neighborlists.
–Molecular dynamics can also be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, and droplet deformation.
Test case chosen for initial evaluation: a box of water molecules.
[Images: DNA molecule; human immunodeficiency virus (HIV)]
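The "variable-length neighborlists" requirement can be sketched as follows (a hypothetical O(n²) reference version; production MD codes use cell lists or spatial hashing):

```python
# For each molecule, gather the indices of all others within a cutoff
# radius. List lengths vary per molecule, which is what stresses a
# stream architecture built around fixed-length records.
def neighbor_lists(points, cutoff):
    c2 = cutoff * cutoff
    lists = []
    for i, (xi, yi, zi) in enumerate(points):
        nbrs = [j for j, (xj, yj, zj) in enumerate(points)
                if j != i
                and (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2 <= c2]
        lists.append(nbrs)
    return lists
```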
Summary of Application Results

Application                     | Sustained GFLOPS¹ | FP Ops / Mem Ref | LRF Refs   | SRF Refs     | Mem Refs
StreamFEM2D (Euler, quadratic)  |                   |                  | M (93.6%)  | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM2D (MHD, cubic)        |                   |                  | M (94.0%)  | 43.8M (5.6%) | 3.2M (0.4%)
StreamMD                        |                   |                  | M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                       |                   |                  | M (95.7%)  | 7.2M (2.9%)  | 3.4M (1.4%)

1. Simulated on a machine with 64 GFLOPS peak performance
2. The low numbers are a result of many divide and square-root operations
Streaming on Graphics Hardware?
Pentium 4 SSE theoretical*: 3 GHz x 4-wide x 0.5 inst/cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader, observed on MULR R0, R0, R0: 20 GFLOPS
–Equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 (6 months)
*From the Intel P4 Optimization Manual
[Chart: GFLOPS over time, GeForce FX (NV30, NV35) vs. Pentium 4]
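The peak-rate arithmetic above, recomputed:

```python
# Pentium 4 SSE theoretical peak, per the slide's formula.
clock_ghz = 3.0
simd_width = 4           # 4-wide single-precision SSE
inst_per_cycle = 0.5     # one SSE op every other cycle
p4_gflops = clock_ghz * simd_width * inst_per_cycle   # 6.0 GFLOPS

# NV35 observed 20 GFLOPS on MULR; at 4 x 0.5 flops per cycle, a P4
# would need a 10 GHz clock to match that throughput.
equivalent_p4_ghz = 20.0 / (simd_width * inst_per_cycle)
```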
GPU Program Architecture
[Diagram: Input Registers and Texture feed the Program, which uses Constants and temporary Registers and writes Output Registers]
Example Program: Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
# outputs homogeneous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
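The four DP4 instructions at the top compute o[HPOS] as a 4x4 matrix times v[OPOS], one 4-component dot product per output coordinate. A sketch of that step (helper names assumed):

```python
def dp4(a, b):
    # DP4: 4-component dot product, as in the vertex program above
    return sum(x * y for x, y in zip(a, b))

def transform(mvp_rows, opos):
    # c[0]..c[3] hold the rows of the composite modelview-projection
    # matrix, so o[HPOS][i] = dot(c[i], v[OPOS])
    return [dp4(row, opos) for row in mvp_rows]

identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
```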
Cg/HLSL: High-Level Language for GPUs
Specular lighting:

// Look up the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);

// Multiply the 3x2 matrix generated from lightDir and halfAngle with the
// scaled normal, then look up the intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);

// Look up color
float4 color = tex2D(colorMap, I.texCoord3.xy);

// Blend/modulate intensity with color
return color * intensity;
GPU: Data Parallel
Each fragment shaded independently
–No dependencies between fragments
–Temporary registers are zeroed
–No static variables
–No read-modify-write textures
Multiple "pixel pipes"
–Data parallelism
–Support ALU-heavy architectures
–Hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
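Those restrictions (zeroed temporaries, no statics, no read-modify-write) make a fragment shader a pure function of its inputs, so fragments can be shaded in any order or in parallel. A toy illustration with a hypothetical shade():

```python
def shade(fragment):
    r = 0.0                          # temporaries start zeroed per fragment
    r = fragment["intensity"] * 0.5  # depends only on this fragment's inputs
    return r

frags = [{"intensity": v} for v in (0.2, 0.8, 1.0)]
forward = [shade(f) for f in frags]
backward = [shade(f) for f in reversed(frags)]
assert forward == list(reversed(backward))  # evaluation order is free
```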
GPU: Arithmetic Intensity
Lots of ops per word transferred
Graphics pipeline
–Vertex: BW: 1 triangle = 32 bytes; OP: f32-ops / triangle
–Rasterization: creates fragments per triangle
–Fragment: BW: 1 fragment = 10 bytes; OP: i8-ops / fragment
–Shader programs
Courtesy of Pat Hanrahan
Streaming Architectures
[Diagram, built up over four slides: SDRAM ↔ Stream Register File ↔ ALU Cluster]
The GPU maps onto this organization:
–Kernel Execution Unit: the shader program (e.g. MAD R3, R1, R2; MAD R5, R2, R3;)
–ALU Cluster: parallel fragment pipelines
–Stream Register File: texture cache? F-Buffer [Mark et al.]
Conclusions
The problem is bandwidth; arithmetic is cheap.
Stream processing & architectures can provide VLSI-efficient scientific computing
–Imagine
–Merrimac
GPUs are first-generation streaming architectures
–Apply the same stream programming model for general-purpose computing on GPUs (GeForce FX)