Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford University
September 12, 2008
Background, Outline
Stanford Graphics / Architecture Research
–Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
–To appear in ACM Transactions on Graphics
CPU and GPU trends… and collision?
Two research areas:
–HW/SW interface, programming model
–Future graphics API
Problem Statement
Drive efficient development and execution in many-/multi-core systems.
–Support homogeneous and heterogeneous cores
–Inform future hardware
Status quo:
–GPU pipeline: good for GL, otherwise hard
–CPU: no guidance; fast is hard
GRAMPS
Software-defined graphs
–Producer-consumer, data-parallelism
–Initial focus on rendering
[Figure: two example GRAMPS graphs.
Rasterization pipeline: Rasterize → Input Fragment Queue → Shade → Output Fragment Queue → FB Blend → Frame Buffer.
Ray tracing graph: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer.
Legend: thread stage, shader stage, fixed-function stage, queue, stage output.]
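The slides do not show the GRAMPS API itself; as a toy illustration of the software-defined-graph idea, the ray tracing graph above can be mimicked with ordinary queues and per-stage functions. All names and the `run_graph` driver below are invented for illustration, not the real GRAMPS interface, and a real scheduler would interleave stages rather than drain each queue in order:

```python
from collections import deque

def camera(width, height):
    """Thread-like producer stage: emit one ray per pixel."""
    for y in range(height):
        for x in range(width):
            yield ("ray", x, y)

def intersect(ray):
    """Shader-like stage: one ray in, one hit record out (fake payload)."""
    _, x, y = ray
    return ("hit", x, y, (x + y) % 2)

def shade(hit):
    _, x, y, material = hit
    return ("fragment", x, y, 255 * material)

def fb_blend(frame_buffer, fragment):
    _, x, y, color = fragment
    frame_buffer[(x, y)] = color

def run_graph(width, height):
    # Stages communicate only through queues, as in the graph above.
    ray_q, hit_q, frag_q = deque(), deque(), deque()
    frame_buffer = {}
    ray_q.extend(camera(width, height))
    while ray_q:
        hit_q.append(intersect(ray_q.popleft()))
    while hit_q:
        frag_q.append(shade(hit_q.popleft()))
    while frag_q:
        fb_blend(frame_buffer, frag_q.popleft())
    return frame_buffer

fb = run_graph(4, 2)
```

Because stages only touch their input and output queues, the runtime is free to run many `intersect`/`shade` instances concurrently, which is the data-parallelism the slide refers to.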
As a Graphics Evolution
Not (too) radical for 'graphics'
–Like fixed → programmable shading
–Pipeline undergoing massive shake-up
–Diversity of new parameters and use cases
Bigger picture than 'graphics'
–Rendering is more than GL/D3D
–Compute is more than rendering
–Some 'GPUs' are losing their innate pipeline
As a Compute Evolution (1)
Sounds like streaming: execution graphs, kernels, data-parallelism
Streaming: "squeeze out every FLOP"
–Goals: bulk transfer, arithmetic intensity
–Intensive static analysis, custom chips (mostly)
–Bounded space, data access, execution time
As a Compute Evolution (2)
GRAMPS: "interesting apps are irregular"
–Goals: dynamic, data-dependent code
–Aggregate work at run time
–Heterogeneous commodity platforms
Naturally allows streaming when applicable
GRAMPS' Role
A 'graphics pipeline' is now an app!
GRAMPS models parallel state machines.
Compared to the status quo:
–More flexible than a GPU pipeline
–More guidance than bare metal
–Portability in between
–Not domain specific
GRAMPS Interfaces
Host/Setup: create the execution graph
Thread: stateful, singleton
Shader: data-parallel, auto-instanced
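The Thread/Shader split can be sketched in ordinary code. In this hypothetical sketch (the class, function, and the runtime's use of a thread pool are all invented, not GRAMPS' API), a Thread stage is one long-lived stateful instance, while a Shader stage is a pure per-element function the runtime may auto-instance across workers:

```python
from concurrent.futures import ThreadPoolExecutor

class TilerThreadStage:
    """Thread stage: a stateful singleton; state persists across inputs."""
    def __init__(self, tile_size):
        self.tile_size = tile_size
        self.emitted = 0

    def run(self, samples):
        for i in range(0, len(samples), self.tile_size):
            self.emitted += 1          # mutable state: fine, only one instance
            yield samples[i:i + self.tile_size]

def shade_shader_stage(sample):
    """Shader stage: stateless per-element function, safe to auto-instance."""
    return sample * 2

samples = list(range(10))
tiler = TilerThreadStage(tile_size=4)
tiles = list(tiler.run(samples))

# The runtime can fan a shader stage out across workers automatically,
# because it holds no state of its own.
with ThreadPoolExecutor(max_workers=4) as pool:
    shaded = list(pool.map(shade_shader_stage, samples))
```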
GRAMPS Entities (1)
Queues: connect stages; dynamically sized
–Accessed via windows
–Ordered or unordered
–Fixed max capacity, or spill to memory
Buffers: random access; pre-allocated
–RO, RW private, RW shared (not supported)
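One plausible reading of window-based queue access (the reserve/commit names and semantics here are assumptions for illustration, not the documented GRAMPS interface): a stage reserves a window of slots, fills it, then commits it, and a fixed-capacity queue refuses reservations that would overflow:

```python
class BoundedQueue:
    """Toy fixed-capacity queue accessed via reserve/commit windows."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def reserve(self, n):
        """Return a window of n empty slots, or None if over capacity
        (a spill-to-memory queue would grow instead of refusing)."""
        if len(self.items) + n > self.capacity:
            return None
        return [None] * n

    def commit(self, window):
        """Publish a filled window to consumers."""
        self.items.extend(window)

q = BoundedQueue(capacity=4)
w = q.reserve(3)
for i in range(3):
    w[i] = i
q.commit(w)
blocked = q.reserve(2)   # would exceed capacity: producer must wait
```

Windows let a stage produce or consume several elements with one cheap synchronization instead of one per element.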
GRAMPS Entities (2)
Queue sets: independent sub-queues
–Instanced parallelism plus mutual exclusion
–Hard to fake with just multiple queues
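A minimal sketch of the queue-set idea (illustrative structure, not the GRAMPS API): one logical queue is split into keyed sub-queues, e.g. one per screen tile, so consumer instances run in parallel across sub-queues while each sub-queue is drained under mutual exclusion:

```python
from collections import defaultdict
from threading import Lock

class QueueSet:
    """One logical queue partitioned into independent, locked sub-queues."""
    def __init__(self):
        self.subqueues = defaultdict(list)
        self.locks = defaultdict(Lock)

    def push(self, key, item):
        with self.locks[key]:
            self.subqueues[key].append(item)

    def drain(self, key):
        """Only one consumer instance touches a given sub-queue at a time."""
        with self.locks[key]:
            items, self.subqueues[key] = self.subqueues[key], []
            return items

qs = QueueSet()
for tile, payload in [(0, "a"), (1, "b"), (0, "c")]:
    qs.push(tile, payload)
```

With plain multiple queues, the graph would have to be rebuilt whenever the number of partitions changes; a queue set keeps one graph edge while still exposing per-partition exclusivity, which is why the slide calls it hard to fake.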
What We've Built (System)
GRAMPS Scheduler
Tiered scheduler:
–'Fat' cores: per-thread, per-core schedulers
–'Micro' cores: shared hardware scheduler
–Top level: tier N
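The tiering can be caricatured in a few lines. In this invented sketch (structure and names are assumptions, not the real scheduler), each fat core owns a private run queue of thread-stage work, while all micro cores pull shader work from one shared queue; the top level just steps every tier:

```python
from collections import deque

class TieredScheduler:
    def __init__(self, fat_cores, micro_cores):
        # Lower tier A: one private run queue per fat core.
        self.fat_queues = {c: deque() for c in range(fat_cores)}
        # Lower tier B: a single queue shared by all micro cores.
        self.micro_queue = deque()
        self.micro_cores = micro_cores

    def submit_thread_stage(self, core, stage):
        self.fat_queues[core].append(stage)

    def submit_shader_work(self, item):
        self.micro_queue.append(item)

    def step(self):
        """Top-level tier: let every core run one piece of work."""
        ran = []
        for core, q in self.fat_queues.items():
            if q:
                ran.append(("fat", core, q.popleft()))
        for core in range(self.micro_cores):
            if self.micro_queue:
                ran.append(("micro", core, self.micro_queue.popleft()))
        return ran

sched = TieredScheduler(fat_cores=2, micro_cores=4)
sched.submit_thread_stage(0, "Tiler")
for i in range(6):
    sched.submit_shader_work(("shade", i))
first = sched.step()
```

The split mirrors the hardware: heavyweight stateful stages suit cores with per-core schedulers, while small uniform shader instances suit a shared hardware scheduler.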
What We've Built (Apps)
[Figure: two application graphs.
Direct3D pipeline (with ray-tracing extension): instanced IA 1…N → Input Vertex Queues 1…N → VS 1…N → Primitive Queues 1…N → RO → Rast → Sample Queue Set → PS → Fragment Queue → OM → Frame Buffer, with Vertex Buffers feeding the IA stages; the ray-tracing extension routes PS through a Ray Queue → Trace → Ray Hit Queue → PS2 back into OM.
Ray-tracing graph: Camera, Sampler, Tiler, Intersect, Shade, and FB Blend stages connected by Sample, Tile, Ray, Ray Hit, and Fragment queues, ending in the Frame Buffer.
Legend: thread stage, shader stage, fixed-function stage, queue, stage output, push output.]
Initial Results
Queues are small; utilization is good
GRAMPS Visualization
GRAMPS Portability
Portability really means performance.
Less portable than GL/D3D
–The GRAMPS graph is (more) hardware-sensitive
More portable than bare metal
–Enforces modularity
–Best case: it just works
–Worst case: it saves boilerplate
High-level Challenges
Is GRAMPS a suitable GPU evolution?
–Enable pipelines competitive with bare metal?
–Enable innovation: advanced / alternative methods?
Is GRAMPS a good parallel compute model?
–Map well to hardware and hardware trends?
–Support important apps?
–Concepts influence developers?
20 What’s Next: Implementation Better scheduling –Less bursty, better slot filling –Dynamic priorities –Handle graphs with loops better More detailed costs –Bill for scheduling decisions –Bill for (internal) synchronization More statistics
21 What’s Next: Programming Model Yes: Graph modification (state change) Probably: Data sharing / ref-counting Maybe: Blocking inter-stage calls (join) Maybe: Intra/inter-stage synchronization primitives
22 What’s Next: Possible Workloads REYES, hybrid graphics pipelines Image / video processing Game Physics –Collision detection or particles Physics and scientific simulation AI, finance, sort, search or database query, … Heavy dynamic data manipulation -k-D tree / octree / BVH build -lazy/adaptive/procedural tree or geometry