Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford PPL Retreat, November 21, 2008


1 Many-Core Programming with GRAMPS
Jeremy Sugerman
Stanford PPL Retreat
November 21, 2008

2 Introduction
• Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
• Initial work appearing in ACM TOG in January 2009
Our starting point:
• CPU and GPU trends… and collision?
• Two research areas:
– HW/SW interface, programming model
– Future graphics API

3 Background
Problem statement / requirements:
• Build a programming model / primitives / building blocks to drive efficient development for, and usage of, future many-core machines.
• Handle homogeneous and heterogeneous mixes of programmable cores and fixed-function units.
Status quo:
• GPU pipeline (good for GL, otherwise hard)
• CPU / C run-time (no guidance; fast is hard)

4 GRAMPS
• Apps: graphs of stages and queues
• Producer-consumer, task, and data parallelism
• Initial focus on real-time rendering
[Figure: two example GRAMPS graphs. Ray tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer. Raster graphics: Rasterize → Input Fragment Queue → Shade → Output Fragment Queue → FB Blend → Frame Buffer. Legend: thread stage, shader stage, fixed-function stage, queue, stage output.]
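As a concrete illustration of the stage/queue structure, the following plain C++ sketch wires up the ray tracer graph from this slide. The Stage and Queue types and the wiring code are invented here for illustration; they are not the GRAMPS API.

    #include <initializer_list>
    #include <iostream>
    #include <string>
    #include <vector>

    // Illustrative only: a stage/queue graph in plain C++, not GRAMPS itself.
    struct Queue { std::string name; };

    struct Stage {
        std::string name;
        enum Kind { Thread, Shader, FixedFunction } kind;
        std::vector<Queue*> inputs;
        std::vector<Queue*> outputs;
    };

    int main() {
        Queue rayQ{"Ray Queue"}, hitQ{"Ray Hit Queue"}, fragQ{"Fragment Queue"};

        // Camera (thread stage) emits rays; Intersect turns rays into hits;
        // Shade turns hits into fragments; FB Blend consumes fragments and
        // writes the frame buffer (a buffer, not a queue).
        Stage camera   {"Camera",    Stage::Thread,        {},       {&rayQ}};
        Stage intersect{"Intersect", Stage::Shader,        {&rayQ},  {&hitQ}};
        Stage shade    {"Shade",     Stage::Shader,        {&hitQ},  {&fragQ}};
        Stage fbBlend  {"FB Blend",  Stage::FixedFunction, {&fragQ}, {}};

        for (const Stage* s : {&camera, &intersect, &shade, &fbBlend})
            std::cout << s->name << ": " << s->inputs.size() << " input queue(s), "
                      << s->outputs.size() << " output queue(s)\n";
        return 0;
    }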

5 Design Goals
• Large application scope – preferable to roll-your-own
• High performance – competitive with roll-your-own
• Optimized implementations – inform HW design
• Multi-platform – suits a variety of many-core systems
Also:
• Tunable – expert users can optimize their apps

6 As a Graphics Evolution
• Not (unthinkably) radical for ‘graphics’
• Like fixed → programmable shading
– Pipeline undergoing a massive shake-up
– Diversity of new parameters and use cases
• Bigger picture than ‘graphics’
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some ‘GPUs’ are losing their innate pipeline

7 As a Compute Evolution (1)
• Sounds like streaming: execution graphs, kernels, data parallelism
• Streaming: “squeeze out every FLOP”
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time

8 As a Compute Evolution (2)
• GRAMPS: “interesting apps are irregular”
– Goals: dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms
• Streaming techniques fit naturally when applicable

9 GRAMPS’ Role
• A ‘graphics pipeline’ is now an app!
• Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.
• Compared to status quo:
– More flexible, lower level than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific

10 GRAMPS Entities (1)
• Data access via windows into queues/memory
• Queues: dynamically allocated / managed
– Ordered or unordered
– Specified max capacity (could also spill)
– Two types: Opaque and Collection
• Buffers: random access, pre-allocated
– RO, RW Private, RW Shared (not supported)
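The queue and buffer properties listed above translate naturally into creation-time descriptors. The sketch below paraphrases that list in plain C++; the enum and struct names are invented for illustration and are not the actual GRAMPS interface.

    #include <cstddef>

    // Invented descriptor types mirroring the slide's list; not the real API.
    enum class QueueOrder { Ordered, Unordered };
    enum class QueueType  { Opaque, Collection };          // the two queue types
    enum class BufferMode { ReadOnly, ReadWritePrivate };  // RW Shared: not supported

    struct QueueDesc {
        QueueOrder  order;
        QueueType   type;
        std::size_t packetBytes;  // per-queue packet ('chunk') size
        std::size_t maxPackets;   // specified max capacity (could also spill)
    };

    struct BufferDesc {
        BufferMode  mode;
        std::size_t bytes;        // pre-allocated, random access via windows
    };

    // Example: a bounded, unordered queue of 64-byte ray packets.
    constexpr QueueDesc rayQueueDesc{QueueOrder::Unordered, QueueType::Opaque, 64, 4096};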

11 GRAMPS Entities (2)
• Queue Sets: independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
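A rough way to picture a queue set, as a sketch in plain C++ rather than GRAMPS code: producers push into keyed sub-queues, and each consumer instance claims one sub-queue at a time, which gives instanced parallelism across sub-queues and mutual exclusion within each.

    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <utility>
    #include <vector>

    // Illustrative sketch of the queue-set idea; not the GRAMPS implementation.
    template <typename Packet>
    class QueueSet {
    public:
        explicit QueueSet(std::size_t subQueueCount) : subs_(subQueueCount) {}

        // Any producer may push into any sub-queue (e.g. keyed by screen tile).
        void push(std::size_t key, Packet p) {
            Sub& s = subs_[key % subs_.size()];
            std::lock_guard<std::mutex> lock(s.mutex);
            s.packets.push_back(std::move(p));
        }

        // A consumer instance claims a whole sub-queue exclusively; a second
        // instance trying the same key simply fails and moves on.
        bool tryDrain(std::size_t key, std::deque<Packet>& out) {
            Sub& s = subs_[key % subs_.size()];
            std::unique_lock<std::mutex> lock(s.mutex, std::try_to_lock);
            if (!lock.owns_lock()) return false;
            out.swap(s.packets);
            return true;
        }

    private:
        struct Sub {
            std::mutex mutex;
            std::deque<Packet> packets;
        };
        std::vector<Sub> subs_;
    };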

12 Design Goals (Reminder)
• Large application scope
• High performance
• Optimized implementations
• Multi-platform
• (Tunable)

13 What We’ve Built (System)

14 GRAMPS Scheduling
• Static inputs:
– Application graph topology
– Per-queue packet (‘chunk’) size
– Per-queue maximum depth / high-watermark
• Dynamic inputs (currently ignored):
– Current per-queue depths
– Average execution time per input packet
• Simple policy: run consumers, pre-empt producers
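A minimal sketch of what "run consumers, pre-empt producers" could mean in practice, assuming the scheduler ranks stages by their depth in the graph and only runs a stage with pending input. This is purely illustrative and is not the actual GRAMPS scheduler.

    #include <cstddef>
    #include <vector>

    // Illustrative stage bookkeeping; not the real GRAMPS scheduler state.
    struct StageState {
        int         graphDepth;    // static: distance from the graph's source stages
        std::size_t pendingInput;  // dynamic: packets waiting in the input queue
    };

    // Prefer the deepest runnable stage: consumers drain queues before the
    // producers feeding them are allowed to run again, which bounds queue growth.
    int pickNextStage(const std::vector<StageState>& stages) {
        int best = -1;
        for (std::size_t i = 0; i < stages.size(); ++i) {
            const StageState& s = stages[i];
            bool runnable = s.pendingInput > 0 || s.graphDepth == 0;  // sources self-feed
            if (!runnable) continue;
            if (best < 0 || s.graphDepth > stages[static_cast<std::size_t>(best)].graphDepth)
                best = static_cast<int>(i);
        }
        return best;  // -1 means nothing is runnable
    }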

15 GRAMPS Scheduler Organization
• Tiered scheduler: Tier-N, Tier-1, Tier-0
• Tier-N only wakes idle units; no rebalancing
• All Tier-1s compete for all queued work.
• ‘Fat’ cores: software Tier-1 per core, Tier-0 per thread
• ‘Micro’ cores: single shared hardware Tier-1+0

16 What We’ve Built (Apps)
[Figure: two application graphs. Direct3D pipeline (with ray-tracing extension): stages IA 1…N, VS 1…N, RO/Rast, PS, PS2, Trace, and OM, connected by Vertex Buffers, Input Vertex Queues 1…N, Primitive Queues 1…N, a Fragment Queue, a Sample Queue Set, a Ray Queue, a Ray Hit Queue, and the Frame Buffer. Ray-tracing graph: stages Camera, Sampler, Tiler, Intersect, Shade, and FB Blend, connected by a Tile Queue, Sample Queue, Ray Queue, Ray Hit Queue, Fragment Queue, and the Frame Buffer. Legend: thread stage, shader stage, fixed-function, queue, stage output, push output.]

17 Initial Renderer Results
• Queues are small (< 600 KB CPU, < 1.5 MB GPU)
• Parallelism is good (at least 80%; all but one 95+%)

18 Scheduling Can Clearly Improve

19 Taking Stock: High-level Questions
• Is GRAMPS a suitable GPU evolution?
– Enable a pipeline competitive with bare metal?
– Enable innovation: advanced / alternative methods?
• Is GRAMPS a good parallel compute model?
– Does it fulfill our design goals?

20 Possible Next Steps
• Simulation / hardware fidelity improvements
– Memory model, locality
• GRAMPS run-time improvements
– Scheduling, run-time overheads
• GRAMPS API extensions
– On-the-fly graph modification, data sharing
• More applications / workloads
– REYES, physics, finance, AI, …
– Lazy/adaptive/procedural data generation

21 Design Goals (Revisited)
• Application scope: okay – only (multiple) renderers
• High performance: so-so – limited simulation detail
• Optimized implementations: good
• Multi-platform: good
• (Tunable: good, but that’s a separate talk)
• Strategy: broaden the available apps and use them to drive performance and simulation work for now.

22 Digression: Some Kinds of Parallelism
Task (Divide) and Data (Conquer)
• Subdivide the algorithm into a DAG (or graph) of kernels.
• Data is long-lived and manipulated in place.
• Kernels are ephemeral and stateless.
• Kernels only get input at entry/creation.
Producer-Consumer (Pipeline) Parallelism
• Data is ephemeral: processed as it is generated.
• Bandwidth or storage costs prohibit accumulation.

23 Three New Graphs
• “App” 1: MapReduce run-time
– Popular, parallelism-rich idiom
– Enables a variety of useful apps
• App 2: Cloth simulation (rendering physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– Graph is still very much a work in progress…
• App 3: Real-time REYES-like renderer (Kayvon)

24 MapReduce: Specific Flavour
“ProduceReduce”: minimal simplifications / constraints
• Produce/Split (1:N)
• Map (1:N)
• (Optional) Combine (N:1)
• Reduce (N:M, where M << N, or often M = 1)
• Sort (N:N conceptually; implementations vary)
(Aside: REYES is MapReduce, OpenGL is MapCombine)
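To make the stage arities concrete, here is a toy word count written as ordinary C++ in the ProduceReduce shape above (Produce/Split, then Map, then Combine/Reduce). It is a sequential illustration, not a GRAMPS graph.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::string> documents = {"the cat", "the dog and the cat"};

        // Produce/Split (1:N) and Map (1:N): each document becomes many (word, 1) tuples.
        std::vector<std::pair<std::string, int>> tuples;
        for (const std::string& doc : documents) {
            std::istringstream words(doc);
            std::string w;
            while (words >> w) tuples.emplace_back(w, 1);
        }

        // Combine / Reduce (N:M, M << N): sum the counts per key.
        std::map<std::string, int> counts;
        for (const auto& t : tuples) counts[t.first] += t.second;

        for (const auto& c : counts)
            std::cout << c.first << ": " << c.second << "\n";
        return 0;
    }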

25 MapReduce Graph
• Map output is a dynamically instanced queue set.
• Combine might motivate a formal reduction shader.
• Reduce is an (automatically) instanced thread stage.
• Sort may actually be parallelized.
[Figure: MapReduce graph with stages Produce, Map, Combine (optional), Sort, and Reduce, connected by queues Initial Tuples, Map Output, Intermediate Tuples, and Final Tuples. Legend: thread stage, shader stage, queue, stage output, push output.]

26 Cloth Simulation Graph
• Update is not producer-consumer!
• Broad Phase will actually be either a (weird) shader or multiple thread instances.
• Fast Recollide details are also TBD.
[Figure: cloth simulation graph with an Update portion (Update Mesh, Fast Recollide, Resolve) and a Collision Detection portion (Broad Collide, Narrow Collide), connected by queues Proposed Update, Resolution, BVH Nodes, Moved Nodes, Candidate Pairs, and Collisions. Legend: thread stage, shader stage, queue, stage output, push output.]

27 That’s All Folks
• Thank you for listening. Any questions?
• Actively interested in new collaborators
– Owners of or experts in some application domain (or engine / run-time system / middleware).
– Anyone interested in scheduling or details of possible hardware / core configurations.
• TOG Paper:

28 Backup Slides / More Details

29 Designing a Good Graph
• Efficiency requires “large chunks of coherent work”
• Stages separate coherency boundaries
– Frequency of computation (fan-out / fan-in)
– Memory access coherency
– Execution coherency
• Queues allow repacking and re-sorting of work from one coherency regime to another.

30 GRAMPS Interfaces
• Host/Setup: create the execution graph
• Thread: stateful, singleton
• Shader: data-parallel, auto-instanced
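A hypothetical sketch contrasting the two stage interfaces above: a thread stage as a stateful loop, a shader stage as a stateless per-element function. The QueueHandle type and its reserve/commit helpers are invented for illustration; they are not the real GRAMPS API.

    #include <cstddef>

    struct Packet { /* application-defined payload */ };

    // Invented stand-in for a queue binding; a real run-time would provide this.
    struct QueueHandle {
        bool reserve(Packet** window, std::size_t count) { (void)window; (void)count; return false; }
        void commit(std::size_t count) { (void)count; }
    };

    // Thread stage: a stateful, long-lived singleton that loops, reserving
    // output windows and committing packets until the run-time shuts it down.
    void cameraThreadStage(QueueHandle rayOut) {
        Packet* window = nullptr;
        while (rayOut.reserve(&window, 1)) {
            // ... fill window[0] with a ray packet ...
            rayOut.commit(1);
        }
    }

    // Shader stage: stateless and data-parallel; the run-time auto-instances one
    // invocation per input element and gathers whatever each instance pushes.
    void shadeShaderStage(const Packet& hit, QueueHandle fragmentOut) {
        (void)hit;
        Packet* frag = nullptr;
        if (fragmentOut.reserve(&frag, 1)) {
            // ... shade the hit into frag[0] ...
            fragmentOut.commit(1);
        }
    }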

31 GRAMPS Graph Portability
• Portability really means performance.
• Less portable than GL/D3D
– A GRAMPS graph is (more) hardware sensitive
• More portable than bare metal
– Enforces modularity
– Best case: it just works
– Worst case: it saves boilerplate

32 Possible Next Steps: Implementation
• Better scheduling
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better
• More detailed costs
– Bill for scheduling decisions
– Bill for (internal) synchronization
• More statistics

33 Possible Next Steps: API
• Important: graph modification (state change)
• Probably: data sharing / ref-counting
• Maybe: blocking inter-stage calls (join)
• Maybe: intra-/inter-stage synchronization primitives

34 Possible Next Steps: New Workloads
• REYES, hybrid graphics pipelines
• Image / video processing
• Game physics
– Collision detection or particles
• Physics and scientific simulation
• AI, finance, sort, search or database query, …
• Heavy dynamic data manipulation
– k-D tree / octree / BVH build
– Lazy/adaptive/procedural tree or geometry