Many-Core Programming with GRAMPS
Jeremy Sugerman
Stanford PPL Retreat, November 21, 2008
Introduction
Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
Initial work appearing in ACM TOG in January 2009
Our starting point: CPU and GPU trends… and collision?
Two research areas:
–HW/SW interface, programming model
–Future graphics API
Background
Problem statement / requirements: build a programming model (primitives, building blocks) to drive efficient development for, and usage of, future many-core machines. Handle homogeneous and heterogeneous mixes of programmable cores and fixed-function units.
Status quo:
–GPU pipeline (good for GL, otherwise hard)
–CPU / C run-time (no guidance; fast is hard)
Apps: Graphs of stages and queues
Producer-consumer, task, and data parallelism
Initial focus on real-time rendering
[Figure: two example GRAMPS graphs. Ray tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer. Raster graphics: Rasterize → Input Fragment Queue → Shade → Output Fragment Queue → FB Blend → Frame Buffer. Legend: thread stages, shader stages, fixed-function stages, queues, stage outputs.]
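To make the graph-of-stages-and-queues idea concrete, here is a minimal, self-contained C++ sketch that encodes the ray tracer graph above as plain data and prints its topology. The Stage and Queue types and all names are illustrative assumptions, not the GRAMPS API.

```cpp
// Illustrative only: encodes the ray tracer graph above as data.
#include <iostream>
#include <string>
#include <vector>

enum class StageKind { Thread, Shader, FixedFunction };

struct Stage { std::string name; StageKind kind; };
struct Queue { std::string name; int producer; int consumer; };  // stage indices

int main() {
    // Stages from the ray tracer example: Camera is a thread stage,
    // Intersect and Shade are shader stages, FB Blend is fixed-function.
    std::vector<Stage> stages = {
        {"Camera",    StageKind::Thread},
        {"Intersect", StageKind::Shader},
        {"Shade",     StageKind::Shader},
        {"FB Blend",  StageKind::FixedFunction},
    };
    // Queues connect producer stages to consumer stages.
    std::vector<Queue> queues = {
        {"Ray Queue",      0, 1},
        {"Ray Hit Queue",  1, 2},
        {"Fragment Queue", 2, 3},
    };
    for (const Queue& q : queues)
        std::cout << stages[q.producer].name << " -> [" << q.name
                  << "] -> " << stages[q.consumer].name << "\n";
    return 0;
}
```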
Design Goals
Large application scope: preferable to roll-your-own
High performance: competitive with roll-your-own
Optimized implementations: informs HW design
Multi-platform: suits a variety of many-core systems
Also: tunable, so expert users can optimize their apps
As a Graphics Evolution
Not (unthinkably) radical for ‘graphics’
Like fixed → programmable shading:
–Pipeline undergoing massive shake-up
–Diversity of new parameters and use cases
Bigger picture than ‘graphics’:
–Rendering is more than GL/D3D
–Compute is more than rendering
–Some ‘GPUs’ are losing their innate pipeline
As a Compute Evolution (1)
Sounds like streaming: execution graphs, kernels, data-parallelism
Streaming: “squeeze out every FLOP”
–Goals: bulk transfer, arithmetic intensity
–Intensive static analysis, custom chips (mostly)
–Bounded space, data access, execution time
As a Compute Evolution (2)
GRAMPS: “interesting apps are irregular”
–Goals: dynamic, data-dependent code
–Aggregate work at run-time
–Heterogeneous commodity platforms
Streaming techniques fit naturally when applicable
GRAMPS’ Role
A ‘graphics pipeline’ is now an app!
Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.
Compared to the status quo:
–More flexible, lower level than a GPU pipeline
–More guidance than bare metal
–Portability in between
–Not domain specific
GRAMPS Entities (1)
Data access via windows into queues/memory (sketched below)
Queues: dynamically allocated / managed
–Ordered or unordered
–Specified maximum capacity (could also spill)
–Two types: Opaque and Collection
Buffers: random access, pre-allocated
–RO, RW Private, RW Shared (not supported)
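A single-threaded C++ sketch of “windows into queues”: the producer reserves a packet window, fills it, and commits; the consumer acquires committed packets. The reserve/commit/acquire names are assumptions for illustration, not the published GRAMPS interface.

```cpp
// Single-threaded sketch of windowed queue access; names are illustrative.
#include <cstddef>
#include <vector>

template <typename Packet>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t maxPackets) : capacity_(maxPackets) {
        packets_.reserve(capacity_);  // keeps reserved pointers stable
    }

    // Producer side: reserve a window for one packet. nullptr means the
    // queue is at its maximum depth, so the producer should yield and let
    // the scheduler run the consumer (or the queue could spill instead).
    Packet* reserve() {
        if (packets_.size() >= capacity_) return nullptr;
        packets_.emplace_back();
        return &packets_.back();
    }
    void commit() { ++committed_; }  // packet becomes visible to the consumer

    // Consumer side: acquire the next committed packet, if any.
    // (A real queue would also recycle consumed slots.)
    bool acquire(Packet& out) {
        if (consumed_ >= committed_) return false;
        out = packets_[consumed_++];
        return true;
    }

private:
    std::size_t capacity_;
    std::size_t committed_ = 0, consumed_ = 0;
    std::vector<Packet> packets_;
};

int main() {
    BoundedQueue<int> rays(4);
    if (int* slot = rays.reserve()) { *slot = 42; rays.commit(); }
    int r = 0;
    return rays.acquire(r) ? 0 : 1;
}
```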
GRAMPS Entities (2)
Queue sets: independent sub-queues (sketched below)
–Instanced parallelism plus mutual exclusion
–Hard to fake with just multiple queues
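A sketch of why a queue set differs from plain multiple queues: the producer routes work to sub-queues by key (say, a screen tile), and each consumer instance owns one sub-queue at a time, so it can update its tile without locks while other tiles proceed in parallel. The types and names here are hypothetical.

```cpp
// Illustrative queue set: N independent sub-queues keyed by tile.
#include <cstddef>
#include <deque>
#include <vector>

struct Fragment { int x = 0, y = 0; float depth = 0.0f; };

class QueueSet {
public:
    explicit QueueSet(std::size_t numSubqueues) : sub_(numSubqueues) {}

    // Producer: route each fragment to its tile's sub-queue. Work for the
    // same tile serializes; work for different tiles stays independent.
    void push(const Fragment& f, std::size_t tile) { sub_[tile].push_back(f); }

    // Consumer: one stage instance is bound to one sub-queue at a time,
    // giving instanced parallelism plus mutual exclusion per tile.
    std::deque<Fragment>& subqueue(std::size_t tile) { return sub_[tile]; }

private:
    std::vector<std::deque<Fragment>> sub_;
};

int main() {
    QueueSet frags(16);
    frags.push({3, 5, 0.5f}, /*tile=*/2);
    return frags.subqueue(2).size() == 1 ? 0 : 1;
}
```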
Design Goals (Reminder)
Large application scope
High performance
Optimized implementations
Multi-platform
(Tunable)
What We’ve Built (System)
GRAMPS Scheduling
Static inputs:
–Application graph topology
–Per-queue packet (‘chunk’) size
–Per-queue maximum depth / high-watermark
Dynamic inputs (currently ignored):
–Current per-queue depths
–Average execution time per input packet
Simple policy: run consumers, pre-empt producers (sketched below)
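A minimal sketch of one way to realize "run consumers, pre-empt producers": prioritize runnable stages by their distance from the graph sources, and treat a stage as not runnable once its output queue reaches the high-watermark. The StageState fields and pickNext logic are assumptions, not the actual scheduler.

```cpp
// Illustrative priority rule: the deepest runnable stage wins.
#include <cstddef>
#include <vector>

struct StageState {
    int depth;                 // static: distance from graph sources
    std::size_t inputDepth;    // dynamic: packets waiting (sources report
                               // a nonzero value while they still have work)
    std::size_t outputDepth;   // dynamic: packets produced, unconsumed
    std::size_t outputLimit;   // static: per-queue max depth / high-watermark
};

// Returns the index of the stage to run next, or -1 if all are idle.
// Producers at their output limit are skipped, which "pre-empts" them.
int pickNext(const std::vector<StageState>& stages) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(stages.size()); ++i) {
        const StageState& s = stages[i];
        bool runnable = s.inputDepth > 0 && s.outputDepth < s.outputLimit;
        if (!runnable) continue;
        if (best < 0 || s.depth > stages[best].depth) best = i;  // consumer first
    }
    return best;
}

int main() {
    std::vector<StageState> g = {
        {0, 1, 4, 4},   // source whose output queue is full: pre-empted
        {1, 4, 0, 8},   // downstream consumer with waiting input: chosen
    };
    return pickNext(g) == 1 ? 0 : 1;
}
```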
GRAMPS Scheduler Organization
Tiered scheduler: Tier-N, Tier-1, Tier-0 (sketched below)
–Tier-N only wakes idle units; no rebalancing
–All Tier-1s compete for all queued work
–‘Fat’ cores: software Tier-1 per core, Tier-0 per thread
–‘Micro’ cores: single shared hardware Tier-1+0
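A rough structural sketch of the tiers as data, under the organization stated above: a shared pool all Tier-1s compete over, per-thread Tier-0 slots, and a Tier-N that only wakes cores. All names and the refill logic are hypothetical.

```cpp
// Hypothetical structural sketch of the Tier-N / Tier-1 / Tier-0 split.
#include <deque>
#include <vector>

struct Work { int stage = -1; int packet = -1; };   // stage < 0 means "idle"

std::deque<Work> sharedPool;          // all Tier-1s compete for this work

struct Tier0 { Work current; };       // one slot per hardware thread

struct Tier1 {                        // one per 'fat' core (in software)
    std::vector<Tier0> threads;
    void refill() {                   // fill any idle Tier-0 slot
        for (Tier0& t : threads) {
            if (t.current.stage >= 0 || sharedPool.empty()) continue;
            t.current = sharedPool.front();
            sharedPool.pop_front();
        }
    }
};

struct TierN {                        // only wakes idle units, no rebalancing
    void wakeIdle(std::vector<Tier1>& cores) {
        for (Tier1& c : cores) c.refill();
    }
};

int main() {
    sharedPool.push_back({0, 0});
    std::vector<Tier1> cores(1);
    cores[0].threads.resize(2);
    TierN{}.wakeIdle(cores);
    return cores[0].threads[0].current.stage == 0 ? 0 : 1;
}
```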
What We’ve Built (Apps)
Direct3D pipeline (with ray-tracing extension) and a ray-tracing graph.
[Figure: two application graphs. Direct3D pipeline: Vertex Buffers and Input Vertex Queues 1…N feed IA 1…N, VS 1…N, Primitive Queues 1…N, RO, Rast, PS, and OM into the Frame Buffer, with a Sample Queue Set and Fragment Queue in between; the ray-tracing extension adds PS2 with a Ray Queue and Ray Hit Queue. Ray-tracing graph: Camera, Sampler, Intersect, Shade, and FB Blend stages connected by Tile, Sample, Ray, Ray Hit, and Fragment Queues into the Frame Buffer, with a Tiler stage. Legend: thread stages, shader stages, fixed-function stages, queues, stage outputs, push outputs.]
Initial Renderer Results
Queues are small (< 600 KB on CPU, < 1.5 MB on GPU)
Parallelism is good (at least 80%; all but one 95%+)
Scheduling Can Clearly Improve
Taking Stock: High-Level Questions
Is GRAMPS a suitable GPU evolution?
–Does it enable pipelines competitive with bare metal?
–Does it enable innovation: advanced / alternative methods?
Is GRAMPS a good parallel compute model?
–Does it fulfill our design goals?
Possible Next Steps
Simulation / hardware fidelity improvements
–Memory model, locality
GRAMPS run-time improvements
–Scheduling, run-time overheads
GRAMPS API extensions
–On-the-fly graph modification, data sharing
More applications / workloads
–REYES, physics, finance, AI, …
–Lazy/adaptive/procedural data generation
Design Goals (Revisited)
Application scope: okay, but so far only (multiple) renderers
High performance: so-so, given limited simulation detail
Optimized implementations: good
Multi-platform: good
(Tunable: good, but that’s a separate talk)
Strategy: broaden the available apps and use them to drive performance and simulation work for now.
Digression: Some Kinds of Parallelism
Task (Divide) and Data (Conquer)
–Subdivide the algorithm into a DAG (or graph) of kernels
–Data is long-lived, manipulated in place
–Kernels are ephemeral and stateless
–Kernels only get input at entry/creation
Producer-Consumer (Pipeline) Parallelism (sketched below)
–Data is ephemeral: processed as it is generated
–Bandwidth or storage costs prohibit accumulation
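A toy C++ pipeline illustrating the producer-consumer case: the queue is bounded so storage cost stays fixed, values are consumed as they are generated, and a full queue blocks (effectively pre-empts) the producer. This is plain threads, not GRAMPS; it only illustrates the idiom.

```cpp
// Bounded producer-consumer pipeline with ephemeral data.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    const std::size_t kMaxDepth = 4;   // bounded: storage never accumulates
    bool done = false;

    std::thread producer([&] {
        for (int i = 0; i < 16; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return q.size() < kMaxDepth; });  // full: block
            q.push(i);
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_all();
    });

    std::thread consumer([&] {
        while (true) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty()) break;       // producer finished, queue drained
            int v = q.front();
            q.pop();
            cv.notify_all();            // wake the producer if it was full
            lk.unlock();
            std::cout << v * v << "\n"; // consume data as it is generated
        }
    });

    producer.join();
    consumer.join();
    return 0;
}
```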
Three New Graphs
“App” 1: MapReduce run-time
–Popular parallelism-rich idiom
–Enables a variety of useful apps
App 2: Cloth simulation (rendering physics)
–Inspired by the PhysBAM cloth simulation
–Demonstrates basic mechanics, collision detection
–Graph is still very much a work in progress…
App 3: Real-time REYES-like renderer (Kayvon)
MapReduce: Specific Flavour
“ProduceReduce”: minimal simplifications / constraints (sketched below)
–Produce/Split (1:N)
–Map (1:N)
–(Optional) Combine (N:1)
–Reduce (N:M, where M << N, often M = 1)
–Sort (N:N conceptually; implementations vary)
(Aside: REYES is MapReduce, OpenGL is MapCombine)
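A sequential C++ sketch of the ProduceReduce shape on a word-count example: Produce/Split is 1:N, Map is 1:N, Combine folds per-key counts as they appear instead of buffering every intermediate tuple, and Reduce emits M << N final results. The word-count workload is an assumption for illustration, not taken from the talk.

```cpp
// Sequential sketch of the ProduceReduce shape (word count).
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::string input = "to be or not to be";

    // Produce/Split (1:N): one document into words.
    std::vector<std::string> words;
    std::istringstream in(input);
    for (std::string w; in >> w; ) words.push_back(w);

    // Map (1:N) emits (word, 1) tuples; Combine (N:1 per key) folds counts
    // immediately, so intermediate tuples never accumulate.
    std::map<std::string, int> combined;
    for (const std::string& w : words) ++combined[w];

    // Reduce (N:M, M << N): emit the final per-key totals.
    for (const auto& [word, count] : combined)
        std::cout << word << ": " << count << "\n";
    return 0;
}
```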
MapReduce Graph
Map output is a dynamically instanced queue set.
Combine might motivate a formal reduction shader.
Reduce is an (automatically) instanced thread stage.
Sort may actually be parallelized.
[Figure: graph from Initial Tuples through Produce, Map, (optional) Combine, Sort, and Reduce, with Intermediate Tuples between stages and Final Tuples at the end. Legend: thread stages, shader stages, queues, stage outputs, push outputs.]
Cloth Simulation Graph
Update is not producer-consumer!
Broad Phase will actually be either a (weird) shader or multiple thread instances.
Fast Recollide details are also TBD.
[Figure: cloth simulation graph with an Update portion (Proposed Update, Update Mesh, Resolution, over Moved Nodes) and a Collision Detection portion (Broad Collide, Narrow Collide, Resolve, Fast Recollide, over BVH Nodes, Candidate Pairs, and Collisions queues). Legend: thread stages, shader stages, queues, stage outputs, push outputs.]
That’s All Folks
Thank you for listening. Any questions?
Actively interested in new collaborators:
–Owners of, or experts in, some application domain (or engine / run-time system / middleware)
–Anyone interested in scheduling or the details of possible hardware / core configurations
TOG paper: http://graphics.stanford.edu/papers/gramps-tog/
Backup Slides / More Details
Designing A Good Graph
Efficiency requires “large chunks of coherent work”
Stages separate coherency boundaries:
–Frequency of computation (fan-out / fan-in)
–Memory access coherency
–Execution coherency
Queues allow repacking and re-sorting of work from one coherency regime to another (sketched below).
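As a tiny illustration of repacking at a queue boundary, this C++ sketch re-sorts queued ray hits by material so the downstream shading stage sees coherent batches. The RayHit type and repackByMaterial are hypothetical.

```cpp
// Hypothetical repacking step at a queue boundary: re-sort queued ray
// hits by material so the downstream shader runs coherent batches.
#include <algorithm>
#include <vector>

struct RayHit { int materialId; float t; };

void repackByMaterial(std::vector<RayHit>& queued) {
    std::sort(queued.begin(), queued.end(),
              [](const RayHit& a, const RayHit& b) {
                  return a.materialId < b.materialId;
              });
}

int main() {
    std::vector<RayHit> q = {{2, 1.0f}, {0, 2.0f}, {1, 0.5f}};
    repackByMaterial(q);             // now grouped: material 0, 1, 2
    return q.front().materialId == 0 ? 0 : 1;
}
```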
GRAMPS Interfaces
Host/Setup: create the execution graph
Thread: stateful, singleton
Shader: data-parallel, auto-instanced
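A sketch of the two stage roles, assuming hypothetical signatures (the real GRAMPS entry points differ): a shader stage is a stateless kernel the run-time auto-instances per input packet, while a thread stage keeps state and loops over its input.

```cpp
// Hypothetical stage bodies; the loop in main() stands in for the run-time.
#include <cstddef>

struct Packet { float data[4]; };

// Shader stage: stateless and data-parallel; one auto-instanced
// invocation per input packet.
void shadeKernel(const Packet& in, Packet& out) {
    for (std::size_t i = 0; i < 4; ++i) out.data[i] = in.data[i] * 0.5f;
}

// Thread stage: stateful singleton; here it just drives the shader
// instances serially where a real scheduler would run them in parallel.
int main() {
    Packet in[3] = {{{1, 2, 3, 4}}, {{5, 6, 7, 8}}, {{9, 10, 11, 12}}};
    Packet out[3];
    for (std::size_t i = 0; i < 3; ++i)
        shadeKernel(in[i], out[i]);
    return out[0].data[0] == 0.5f ? 0 : 1;
}
```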
GRAMPS Graph Portability
Portability really means performance.
Less portable than GL/D3D:
–GRAMPS graph is (more) hardware sensitive
More portable than bare metal:
–Enforces modularity
–Best case: just works
–Worst case: saves boilerplate
Possible Next Steps: Implementation
Better scheduling
–Less bursty, better slot filling
–Dynamic priorities
–Handle graphs with loops better
More detailed costs
–Bill for scheduling decisions
–Bill for (internal) synchronization
More statistics
Possible Next Steps: API
Important: graph modification (state change)
Probably: data sharing / ref-counting
Maybe: blocking inter-stage calls (join)
Maybe: intra-/inter-stage synchronization primitives
Possible Next Steps: New Workloads
REYES, hybrid graphics pipelines
Image / video processing
Game physics
–Collision detection or particles
Physics and scientific simulation
AI, finance, sort, search or database query, …
Heavy dynamic data manipulation
–k-D tree / octree / BVH build
–Lazy/adaptive/procedural tree or geometry