Further Developing GRAMPS
Jeremy Sugerman
FLASHG, January 27, 2009

2 Introduction
• Evaluation of what/where GRAMPS is today
• Planned next steps
– New graphs: MapReduce and Cloth Sim
• Speculative potpourri, outside interests, issues

3 Background: Context
Problem Statement:
• Many-core systems are arising in many flavours: homogeneous, heterogeneous, programmable cores, and fixed-function units.
• Build a programming model / run-time system / tools that enable efficient development for, and usage of, these hardware architectures.
Status Quo:
• GPU pipeline (GPU-only; good for GL, otherwise hard)
• CPU / C run-time (no programmer guidance; fast is hard)
• GPGPU / CUDA / OpenCL (good for data-parallel, regular workloads)

4 Background: History
• GRAMPS: A Programming Model for Graphics Pipelines, ACM TOG, January 2009
• Chips combining future CPU and GPU cores
• Renderers for Larrabee and future 'normal' GPUs
• Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
• My current interest: (continue) convincing people a GRAMPS-like organization should inform future app and hardware designs

5 GRAMPS 1.0
• Express apps as graphs of stages and queues
• Expose producer-consumer parallelism
• Facilitate task and data parallelism
• GRAMPS handles inter-stage scheduling and data flow

[Figure: two example GRAMPS graphs. Ray tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer. Raster graphics: Rasterize → Input Fragment Queue → Shade → Output Fragment Queue → FB Blend → Frame Buffer. Legend: thread stage, shader stage, fixed-function stage, queue, stage output.]
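To make "graphs of stages and queues" concrete, here is a minimal sketch of how the ray tracer in the figure above might be declared to a GRAMPS-like run-time. The Graph / addStage / connect API below is hypothetical, invented for illustration (it is not the actual GRAMPS setup interface), and the stage-type assignments are likewise only illustrative:

    #include <cstdio>
    #include <string>
    #include <vector>

    enum class StageType { Thread, Shader, FixedFunction };

    struct Stage { std::string name; StageType type; };
    struct Queue { std::string name; int producer; int consumer; };

    // A graph is just stages plus the queues that connect them; declaring
    // the whole topology up front is what lets the run-time schedule it.
    struct Graph {
        std::vector<Stage> stages;
        std::vector<Queue> queues;
        int addStage(const std::string& name, StageType t) {
            stages.push_back({name, t});
            return (int)stages.size() - 1;
        }
        void connect(int producer, const std::string& queueName, int consumer) {
            queues.push_back({queueName, producer, consumer});
        }
    };

    int main() {
        Graph g;
        // The ray-tracing graph from the figure above.
        int camera    = g.addStage("Camera",    StageType::Thread);
        int intersect = g.addStage("Intersect", StageType::Thread);
        int shade     = g.addStage("Shade",     StageType::Shader);
        int fbBlend   = g.addStage("FB Blend",  StageType::FixedFunction);

        g.connect(camera,    "Ray Queue",      intersect);
        g.connect(intersect, "Ray Hit Queue",  shade);
        g.connect(shade,     "Fragment Queue", fbBlend);

        for (const Queue& q : g.queues)
            std::printf("%s --[%s]--> %s\n", g.stages[q.producer].name.c_str(),
                        q.name.c_str(), g.stages[q.consumer].name.c_str());
        return 0;
    }

Nothing here executes any kernels; the point is that the topology (and, in the real system, each queue's packet size and depth, omitted here) is visible to the run-time before execution begins.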

6 Design Goals
• Large application scope – preferable to roll-your-own
• High performance – competitive with roll-your-own
• Optimized implementations – inform HW design
• Multi-platform – suits a variety of many-core systems
Also:
• Tunable – expert users can optimize their apps

7 GRAMPS's Role
• Target users: systems-savvy developers of engines, middleware, SDKs, etc.
• Example: a 'graphics pipeline' is now an app!
• Developer owns:
– Identifying a good separation into stages
– Implementing optimized kernels for each stage
• GRAMPS owns:
– Handling all inter-stage interaction (e.g., queues, buffers)
– Filling the machine while controlling working set
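One way to picture the split: the developer writes a packet-in, packet-out kernel like the sketch below, while GRAMPS owns the queues around it and decides when it runs. The Ray/RayHit types and the toy driver are assumptions for illustration, not the real stage environment:

    #include <cstdio>
    #include <vector>

    struct Ray    { float ox, oy, oz, dx, dy, dz; };
    struct RayHit { Ray ray; float t; int primId; };

    // Developer-owned kernel for an Intersect stage: consume one input
    // packet, emit one output packet. It knows nothing about scheduling.
    void intersectKernel(const std::vector<Ray>& in, std::vector<RayHit>& out) {
        out.reserve(in.size());
        for (const Ray& r : in) {
            RayHit hit;
            hit.ray = r;
            hit.t = 1e30f;   // placeholder: a real kernel traces r against the scene
            hit.primId = -1; // no primitive hit in this stub
            out.push_back(hit);
        }
    }

    int main() {
        // Toy stand-in for the run-time's side of the contract: pull a packet
        // off the 'Ray Queue', run the kernel, push onto the 'Ray Hit Queue'.
        std::vector<Ray> rayPacket(16, Ray{0, 0, 0, 0, 0, 1});
        std::vector<RayHit> hitPacket;
        intersectKernel(rayPacket, hitPacket);
        std::printf("produced %zu hits\n", hitPacket.size());
        return 0;
    }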

8 What We've Built (System)

9 What We've Built (Run-time)
• Setup API; Thread, Shader, and Fixed stage environments
• Basic scheduler driven by static inputs:
– Application graph topology
– Per-queue packet ('chunk') size
– Per-queue maximum depth / high-watermark
– Ignores: online queue depths, execution history
– Policy: run consumers, pre-empt producers
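A small, runnable sketch of that policy (an illustration, not the actual scheduler): give every stage a fixed priority from its position in the graph and always run the deepest stage that has input, which pre-empts producers whenever their consumers have work. Per-queue packet sizes and high-watermarks are omitted for brevity:

    #include <cstdio>
    #include <deque>
    #include <string>
    #include <vector>

    struct SimStage {
        std::string name;
        int depth;               // static priority: larger = further downstream
        std::deque<int>* input;  // null for the source stage
        std::deque<int>* output; // null for the sink stage
    };

    int main() {
        std::deque<int> rayQ, hitQ;
        std::vector<SimStage> stages = {
            {"Camera",    0, nullptr, &rayQ},
            {"Intersect", 1, &rayQ,   &hitQ},
            {"Shade",     2, &hitQ,   nullptr},
        };

        int packetsLeft = 8;  // how many packets the source will emit in total
        for (int step = 0; ; ++step) {
            // Policy: run consumers, pre-empt producers -- i.e. among the
            // stages that currently have work, pick the deepest one.
            SimStage* pick = nullptr;
            for (SimStage& s : stages) {
                bool hasWork = s.input ? !s.input->empty() : packetsLeft > 0;
                if (hasWork && (!pick || s.depth > pick->depth)) pick = &s;
            }
            if (!pick) break;  // everything has drained

            if (pick->input) pick->input->pop_front(); else --packetsLeft;
            if (pick->output) pick->output->push_back(step);
            std::printf("step %2d: run %s\n", step, pick->name.c_str());
        }
        return 0;
    }

Run it and each packet gets chased all the way down the graph before the source may emit another, keeping queue depths near their minimum.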

10 What We've Built (Apps)

[Figure: the two application graphs built on GRAMPS. Direct3D pipeline, with ray-tracing extension: IA 1…IA N → Input Vertex Queues 1…N → VS 1…VS N → Primitive Queues 1…N → RO → Primitive Queue → Rast → Sample Queue Set → PS → Fragment Queue → OM → Frame Buffer, reading from Vertex Buffers; the extension routes PS through a Ray Queue → Trace → Ray Hit Queue → PS2 before OM. Ray-tracing graph: Camera, Tiler, Sampler, Intersect, Shade, and FB Blend stages connected by Sample, Tile, Ray, Ray Hit, and Fragment Queues, writing to the Frame Buffer. Legend: thread stage, shader stage, fixed-function stage, queue, stage output, push output.]

11 Taking Stock: What Did We Learn?
• At a high level, the whole thing seems to work!
– Nontrivial proof-of-concept apps are expressible
– Heterogeneity works
– Performance results do not preclude viability
• Stage scheduling is an arbitrarily hard problem.
• There are many additional details it would help to simulate.
• (Conventional) GPU vendors want much more comprehensive analysis.
• The role of producer-consumer parallelism is often overlooked.

12 Digression: Some Kinds of Parallelism
Task (Divide) and Data (Conquer):
• Subdivide the algorithm into a DAG (or graph) of kernels.
• Data is long-lived, manipulated in place.
• Kernels are ephemeral and stateless.
• Kernels only get input at entry/creation.
Producer-Consumer (Pipeline) Parallelism:
• Data is ephemeral: processed as it is generated.
• Bandwidth or storage costs prohibit accumulation.
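The producer-consumer half is the one GRAMPS leans on, and it is easiest to see in code. In this generic sketch (plain C++ threads, nothing GRAMPS-specific), the bounded queue's depth cap forces data to be consumed as it is generated; the producer stalls rather than accumulate:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // A bounded queue: the cap on depth turns two loops into a pipeline.
    class BoundedQueue {
        std::queue<int> q_;
        std::mutex m_;
        std::condition_variable cv_;
        const size_t maxDepth_ = 4;  // high-watermark on in-flight data
    public:
        void push(int v) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return q_.size() < maxDepth_; });  // stall producer
            q_.push(v);
            cv_.notify_all();
        }
        int pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });            // stall consumer
            int v = q_.front();
            q_.pop();
            cv_.notify_all();
            return v;
        }
    };

    int main() {
        BoundedQueue q;
        std::thread producer([&q] {
            for (int i = 0; i < 16; ++i) q.push(i);  // blocks once 4 items queue up
        });
        std::thread consumer([&q] {
            for (int i = 0; i < 16; ++i) std::printf("consumed %d\n", q.pop());
        });
        producer.join();
        consumer.join();
        return 0;
    }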

13 Possible Next Steps
• Increase persuasiveness of the graphics applications:
– Model texture and buffer bandwidth
– More sophisticated scheduling
– Robust overflow / error handling
– Handle multiple passes / graph state changes
– …
• Follow up on other ideas and known defects:
– Model locality / costs for cross-core migration
– Prototype on real hardware
– Demonstrate REYES and non-rendering workloads

14 Design Goals (Revisited)
• Application scope: okay – only (multiple) renderers
• High performance: so-so – only (simple) simulation
• Optimized implementations: good
• Multi-platform: good
• (Tunable: good, but that's a separate talk)
Strategy: broad, not deep. Broader applicability means more impact for optimized implementations.

15 Broader Applicability: New Graphs
• "App" 1: MapReduce
– A popular, parallelism-rich idiom
– Enables a variety of useful apps
• App 2: Cloth simulation (rendering physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– The graph is still very much a work in progress…

16 MapReduce Specification
'ProduceReduce': minimal simplifications / constraints
• Produce/Split (1:N)
• Map (1:N)
• (Optional) Combine (N:1)
• Reduce (N:M, where M << N, or often M = 1)
• Sort (N:N conceptually; implementations vary)
(Aside: REYES is MapReduce; OpenGL is MapCombine.)
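As a concrete (non-GRAMPS) illustration of those stage shapes, here is word count written directly as produce, map, and sort+reduce functions over plain containers. Word count's map happens to be 1:1, emitting one (word, 1) tuple per word; in general Map is 1:N:

    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Produce/Split (1:N): one document becomes N words.
    std::vector<std::string> produce(const std::string& doc) {
        std::vector<std::string> words;
        std::istringstream in(doc);
        for (std::string w; in >> w; ) words.push_back(w);
        return words;
    }

    // Map: each word emits a (word, 1) tuple.
    std::pair<std::string, int> mapWord(const std::string& word) {
        return std::make_pair(word, 1);
    }

    // Sort groups tuples by key (std::map does it here); Reduce (N:M) then
    // collapses each key's N counts into one total, so M = distinct keys.
    std::map<std::string, int> sortAndReduce(
            const std::vector<std::pair<std::string, int> >& tuples) {
        std::map<std::string, int> totals;
        for (const auto& t : tuples) totals[t.first] += t.second;
        return totals;
    }

    int main() {
        std::vector<std::pair<std::string, int> > intermediate;
        for (const std::string& w : produce("the quick brown fox the fox"))
            intermediate.push_back(mapWord(w));
        for (const auto& kv : sortAndReduce(intermediate))
            std::printf("%s: %d\n", kv.first.c_str(), kv.second);
        return 0;
    }

An optional Combine stage would simply be the same reduction run early, on each producer's local tuples, before they cross the queue.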

17 MapReduce Graph
• Map output is a dynamically instanced queue set.
• Combine might motivate a formal reduction shader.
• Reduce is an (automatically) instanced thread stage.
• Sort may actually be parallelized.

[Figure: the MapReduce graph — Produce → Initial Tuples → Map → Intermediate Tuples (Map Output) → (Optional) Combine → Intermediate Tuples → Sort → Reduce → Final Tuples. Legend: thread stage, shader stage, queue, stage output, push output.]

18 Cloth Simulation Graph
• Update is not producer-consumer!
• Broad Phase will actually be either a (weird) shader or multiple thread instances.
• Fast Recollide details are also TBD.

[Figure: the cloth simulation graph, still in flux. Stages: Update Mesh and Resolve (resolution / proposed-update loop), plus Broad Collide, Narrow Collide, and Fast Recollide (collision detection), connected by Proposed Update, BVH Nodes, Moved Nodes, Candidate Pairs, and Collisions queues. Legend: thread stage, shader stage, queue, stage output, push output.]

19 Potpourri Projects
• Dynamic scheduling – at least consider current queue depths
• Memory system – more realistic access times; compute / cap memory bandwidth
• Locality / scalability (maybe) – validate the overheads of the run-time; model data / core migration costs
• Standalone GRAMPS – decouple the run-time from the simulated hardware; perhaps port to real hardware

20 Outside Interests
• Many PPL efforts are interested in GRAMPS:
– An example consumer for the OS / run-time interface research
– An example workload for (hardware) scheduling of many-threaded chips
– An example implementation of graphics and irregular workloads to challenge Sequoia II
• Everyone wants to fit it above/below their layer (too many layers!)
• All would profit from standalone GRAMPS

21 Overlapping Interests
• REYES is the third major rendering scheme (alongside OpenGL/Direct3D and ray tracing).
• During GRAMPS 1.0, "real-time REYES" was always on our minds.
• That work has forked into the micropolygon pipeline project (Kayvon, Matt, Ed, etc.).
• Expect overlap in discussion and/or implementation as they consider parallelization.

22 That's All, Folks
• Thank you for listening. Any questions?
• Actively interested in collaborators:
– (Future) owners of, or experts in, some parallel application, engine, toolkit, pipeline, etc.
– Anyone interested in scheduling, or in porting to / simulating interesting hardware configurations