Graphics on GRAMPS
Jeremy Sugerman, Kayvon Fatahalian
FLASHG, 15 Oct 2007

Background
- Context: a broader research investigation into generalizing GPU / Cell / "compute" cores and combining them with CPUs.
- Fundamental beliefs:
  - Real data-parallel apps still have performance-critical non-data-parallel pieces
  - Existing parallel programming models are too constrained (GPUs) or too hard / vague (CPUs)
  - Queues are an excellent idiom for capturing producer-consumer parallelism, both thread and data
  - Fixed-function execution units are not a problem, but fixed control paths are

Compute Cores
- CPUs are designed for a single thread per core
- Minimal FLOPS per core
- Compute cores are designed for lots of math per core
- Many "threads" per core
- Sometimes wider SIMD per thread
- SIMD width * # hardware threads ops / core
- And more compute cores than CPU cores fit per chip
- Many examples: GPU, Cell, Niagara, Larrabee

Simplified Direct3D Pipeline
- Application launches some drawing…
  1. Vertex Assembly (Fixed, Non-Data Parallel)
  2. Vertex Processing (Programmable, Data Parallel)
  3. Primitive Assembly (Fixed, Non-Data Parallel)
  4. Primitive Processing (Programmable, Data Parallel)
  5. Fragment Assembly (Fixed, Non-Data Parallel)
  6. Fragment Processing (Programmable, Data Parallel)
  7. Pixel / Image Assembly (Fixed, Non-Data Parallel)
- Only Data Parallel stages are programmable!

Direct3D Pipeline Properties
- There is a reason only data parallel stages are programmable.
- 'Shader' stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between elements.
- 'Assembly' stages also run on many elements, but they have inter-element dependencies
  - State can be remembered (vertex caching)
  - Inputs can be used by multiple outputs (strips)
- Programmable 'Assembly' requires heavier (more serial) threads than 'Shaders'.

Question
- Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
- Does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…)

This Talk
- GRAMPS: our current model for programming compute cores.
- Implementing Direct3D 10 "in software" with GRAMPS.
- (Potentially) thoughts about how REYES and ray tracers map to GRAMPS.
- No explicit discussion of heterogeneous cores.
- No fancy scheduling algorithms (yet?)

Example: Simple 3D Pipeline
[Diagram. Stages: Vertex Shading, Primitive Assembly, Rasterize (Assemble), Fragment Shading, Image Assembly. Data: Input Vertices, Transformed Vertices, Primitives, Shaded Fragments, Framebuffer Pixels.]

GRAMPS
- General Runtime/Architecture for Multicore Parallel Systems
- Models an execution graph of queues connected by threads
- Graph specified by host program
- Simulator for exploring compute cores
  - Currently conflates "hardware" and runtime
  - # of cores, thread contexts, SIMD width are all parameters

Simple GRAMPS core
[Diagram: Thread 0 … Thread T-1, each with registers R; ALU 0 … ALU S-1; L1 data cache (or scratchpad)]
- T: threads / core
- S: SIMD ALUs / core
- R: registers / thread
- 1 thread runs in each clock
- Threads issue vector instructions (think S-wide SSE)

D3D10 Setup
- App defines 3 shading environments
  - Vertex, geometry, fragment
  - Attach programs and resources
- App configures fixed-function units
  - Fixed number of "modes"
  - Attach resources
- App submits work (vertices) to pipeline
- Graphics runtime executes until completion

GRAMPS Setup
- App defines a set of queues
- App defines a set of thread environments
- App attaches queues as thread inputs and outputs
- App bootstraps computation by inserting data into a queue
- Runtime executes threads until completion
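
The setup steps above can be pictured as a host program building a small graph description. The following is only a sketch: `Graph`, `add_queue`, `add_thread`, `attach_input`, `attach_output` and the chunk counts and sizes are hypothetical names and values chosen for illustration, not the GRAMPS API. It wires up the vertex front end named later in the deck (index_queue, idxVtxAssemble, preVtxShade_queue, vtxShade, postVtxShade_queue).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host-side graph description; none of these names come from the talk.
enum class ThreadType { Shader, Assemble, Fixed };

struct QueueDesc {
    std::string name;
    int num_chunks;   // queue capacity in chunks (illustrative value)
    int chunk_bytes;  // bytes per chunk (illustrative value)
};

struct ThreadEnvDesc {
    std::string name;
    ThreadType type;
    std::vector<int> inputs;   // indices into Graph::queues
    std::vector<int> outputs;  // indices into Graph::queues
};

struct Graph {
    std::vector<QueueDesc> queues;
    std::vector<ThreadEnvDesc> threads;

    int add_queue(const std::string& name, int num_chunks, int chunk_bytes) {
        queues.push_back({name, num_chunks, chunk_bytes});
        return static_cast<int>(queues.size()) - 1;
    }
    int add_thread(const std::string& name, ThreadType type) {
        threads.push_back({name, type, {}, {}});
        return static_cast<int>(threads.size()) - 1;
    }
    void attach_input(int thread, int queue)  { threads[thread].inputs.push_back(queue); }
    void attach_output(int thread, int queue) { threads[thread].outputs.push_back(queue); }
};

// Defines three queues and two thread environments, then connects them.
Graph build_vertex_front_end() {
    Graph g;
    const int index_q    = g.add_queue("index_queue", 64, 64);
    const int pre_vtx_q  = g.add_queue("preVtxShade_queue", 64, 1024);
    const int post_vtx_q = g.add_queue("postVtxShade_queue", 64, 1024);

    const int assemble = g.add_thread("idxVtxAssemble", ThreadType::Assemble);
    g.attach_input(assemble, index_q);
    g.attach_output(assemble, pre_vtx_q);

    const int shade = g.add_thread("vtxShade", ThreadType::Shader);
    g.attach_input(shade, pre_vtx_q);
    g.attach_output(shade, post_vtx_q);
    return g;
}
```

Bootstrapping would then amount to inserting index data into index_queue and letting the runtime run threads until the queues drain; that part is omitted here.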

GRAMPS Entities: Execution
- Threads: Assemble, Shader, Fixed
  - Assemble: stateful, akin to a regular thread
  - Fixed: special-purpose hardware wrapped to appear as an Assemble thread
  - Shader: stateless and data parallel

GRAMPS Entities: Data
- Queues for producer-consumer parallelism
- Queues for aggregating coherent work
- Queues support push and reserve/commit for in-place Assembly
- Chunks are the units / granularity at which Queues are manipulated.

GRAMPS Scheduling
- GRAMPS assigns Threads to hw contexts
  - Based on graph, current Queue contents
- Tiered scheduling model
- Tier-0: Trivially puts threads onto hw threads
- Tier-1: Builds schedules for Tier-0.
- Tier-N: Arbitrarily clever. Doesn't exist.

System (how it works today)

D3D10 on GRAMPS
[Diagram: the D3D10 pipeline drawn as a GRAMPS graph]
- Legend: shader thread, assemble thread, fixed function in GPU
- Queues: Index queue, preVtxShade queue, postVtxShade queue, prePrimAssemble queue, prePrimShade queue, postPrimShade queue, preRast queue, tri queue 0 … tri queue N, preFragShade queue, postFragShade queue
- Threads: idxVtxAssemble, vtxShade, primAssemble, primShade, rastAssemble, tri setup / clip / cull, rasterize, fragShade, blend / ztest
- The chain rasterize, preFragShade queue, fragShade, postFragShade queue, blend / ztest is repeated for each of tri queue 0 through tri queue N

Internal Queues
- Queues are just memory plus a state struct (see below)
  - For now: Queues are finite
  - Queues are a contiguous array of chunks
- Chunks = granularity of manipulation

    struct queue {
        BYTE* ptr;               // num_chunks * chunk_byte_width bytes of storage
        int   num_chunks;
        int   chunk_byte_width;
        int   head;
        int   tail;
        int   reclaim;
        bool* done;              // num_chunks flags, one per chunk
    };

Ex: GRAMPS has chunks
[Diagram: Index queue, idxVtxAssemble, preVtxShade queue, vtxShade, postVtxShade queue]
- index_queue chunks contain vertex indices
- preVtxShade_queue chunks contain 16 pre-transformed vertices
- postVtxShade_queue chunks contain 16 transformed vertices

Ex: GRAMPS has chunks
[Diagram: rasterize, preFragShade queue, fragShade]
- preFragShade_queue chunks contain:
  - Interpolated inputs for 16 fragments
  - Liveness mask per fragment
  - x,y position per quad
  - Uniform data shared across all fragments

Queue API
- Window = view into a contiguous range of chunks for assemble threads
- Symmetric for producing/consuming access

    struct qwin {
        BYTE* ptr;
        int   num;
        int   id;
    };

- Shader threads just have "push"

Queue manipulation

    void  produce();                           // "push" (all threads)
    qwin* reserve(qwin* q, int num_chunks);    // assemble threads only
    qwin* commit(qwin* q, int num_chunks);     // assemble threads only
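
One way to read reserve/commit is as growing and then publishing a window over the queue's chunk array. The sketch below makes several assumptions that the slides do not state: a single producer, a linear (non-wrapping) queue, `bool` return values instead of `qwin*`, and a `committed` counter standing in for whatever the deck's head/tail/reclaim/done fields actually track. `Queue`, `QWin` and the field meanings are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative queue: a contiguous array of fixed-size chunks.
struct Queue {
    std::vector<std::uint8_t> storage;  // num_chunks * chunk_byte_width bytes
    int num_chunks;
    int chunk_byte_width;
    int tail;       // first chunk not yet reserved by the producer (assumed meaning)
    int committed;  // chunks [0, committed) are visible to consumers (assumed field)

    Queue(int chunks, int width)
        : storage(static_cast<std::size_t>(chunks) * static_cast<std::size_t>(width)),
          num_chunks(chunks), chunk_byte_width(width), tail(0), committed(0) {}
};

// Window over a contiguous range of chunks, mirroring the deck's qwin fields.
struct QWin {
    std::uint8_t* ptr;  // first byte of the window
    int num;            // number of chunks in the window
    int id;             // index of the window's first chunk
};

// Grow the producer's window by n chunks at the tail of the queue.
// Returns false when the queue is full; a real runtime might block the thread instead.
bool reserve(Queue& q, QWin& w, int n) {
    if (q.tail + n > q.num_chunks) return false;
    if (w.num == 0) {
        w.id = q.tail;
        w.ptr = q.storage.data() + static_cast<std::size_t>(w.id) * q.chunk_byte_width;
    }
    q.tail += n;
    w.num += n;
    return true;
}

// Publish the first n chunks of the window and slide the window past them.
bool commit(Queue& q, QWin& w, int n) {
    if (n > w.num) return false;
    q.committed += n;
    w.id += n;
    w.num -= n;
    w.ptr += static_cast<std::size_t>(n) * q.chunk_byte_width;
    return true;
}
```

A consuming window would be symmetric: reserve would claim committed chunks to read, and commit would release them for reuse.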

Internal threads
- Defines a "type" of thread

    ThreadEnv {
        type = {shader, assemble, fixed-func}
        Program Code
        uniforms / constant data
        sampler / texture / resource id bindings
        List of input queues
        List of output queues
    };

Shader threads
- Shading language unchanged (HLSL)
  - Still write shaders in terms of single elements
  - Compilation produces code to operate on chunks

    void hlsl_likefn(const element* inputEl, element* outputEl,
                     const sampler foo, const tex3d tex)

Internal shader threads
- Shader thread code processes chunks
- Input:
  - GRAMPS pre-reserved chunks from in/out queues
  - Environment info (uniforms, consts, etc.)

    void shaderFn(const chunk* in_chunks[], chunk* out_chunks[], const env* env)

- Dispatched shader threads run to completion
- Completion implies:
  - inChunks are released
  - outChunks are committed
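
To make the element-versus-chunk split concrete, here is a minimal sketch of how a per-element function could be wrapped into a chunk-level entry point with the same shape as shaderFn. The `Vertex`, `Chunk` and `Env` layouts, the `count` field and the scaling "shader" are all assumptions made for illustration; the talk does not describe the generated code.

```cpp
#include <array>
#include <cassert>

constexpr int kChunkElems = 16;  // the deck's examples use 16-element chunks

struct Vertex { float x, y, z; };

// Assumed chunk layout: a fixed array of elements plus a count of valid ones.
struct Chunk {
    std::array<Vertex, kChunkElems> elems;
    int count;
};

// Stand-in for uniforms / constants.
struct Env { float scale; };

// Per-element function, written the way a shader author thinks about it.
void scale_vertex(const Vertex* in, Vertex* out, const Env* env) {
    out->x = in->x * env->scale;
    out->y = in->y * env->scale;
    out->z = in->z * env->scale;
}

// Chunk-level entry point of the kind a compiler could generate:
// it applies the per-element function to every valid element of the input chunk.
void shader_fn(const Chunk* in_chunks[], Chunk* out_chunks[], const Env* env) {
    const Chunk& in = *in_chunks[0];
    Chunk& out = *out_chunks[0];
    for (int i = 0; i < in.count; ++i) {
        scale_vertex(&in.elems[i], &out.elems[i], env);
    }
    out.count = in.count;
}
```

Because each element is independent, the loop body is what would map onto SIMD lanes; releasing the input chunk and committing the output chunk would happen in the runtime after shader_fn returns.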

Assemble threads
- Assemble threads build chunks
- Access queue data via windows
- Commit/reserve/consume may block thread

    void assembleFn(qwin* in_win[], qwin* out_win[], const env* env)

Ex: primitive assembly
[Diagram: prePrimAssemble queue, primAssemble, prePrimShade queue]
- Input chunks = 16 verts
- Output chunks = 16 prims
- Prim structure depends on type of prim
  - Points, lines, triangles, triangles w/ adjacency, etc.
- Creating prims from verts depends on topology
  - Strips or lists
  - Triangle strip: data for an output chunk comes from multiple input chunks
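
The triangle-strip case shows why assembly needs state across chunks: a triangle can use vertices from two different input chunks. The sketch below is an illustration only. `StripAssembler` is a hypothetical name, vertices are reduced to integer ids, the winding rule (swap the first two vertices on odd triangles) is one common convention, and grouping the results into 16-prim output chunks is left out.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Triangle { int v0, v1, v2; };

// Stateful triangle-strip assembler: it remembers the last two vertices so that
// a triangle can span the boundary between two input chunks.
class StripAssembler {
public:
    // Consume one input chunk of vertex ids and append completed triangles to out.
    void consume(const std::vector<int>& chunk, std::vector<Triangle>& out) {
        for (int v : chunk) {
            if (seen_ >= 2) {
                if (seen_ % 2 == 0) {
                    out.push_back({prev_[0], prev_[1], v});
                } else {
                    // Odd triangle: swap the first two vertices to keep winding consistent.
                    out.push_back({prev_[1], prev_[0], v});
                }
            }
            prev_[0] = prev_[1];
            prev_[1] = v;
            ++seen_;
        }
    }

private:
    std::array<int, 2> prev_{};  // last two vertices seen, carried across chunks
    long seen_ = 0;              // number of vertices consumed so far
};
```

Feeding the strip 0, 1, 2, 3, 4 as two chunks ({0, 1, 2} then {3, 4}) yields the same three triangles as feeding it as one chunk, which is the property a stateless per-element shader could not provide.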

Ex: frag assembly (rast)

    for (each input triangle) {
        add triangle uniform data to chunk
        while (chunk not full && triangle not done) {
            rasterize next tile of quads…
            for (each nonempty quad) {
                add 4 fragments to chunk
                add quad description per chunk
            }
        }
        if (chunk is full) {
            qwin_out = commit(qwin_out, 1);
            grow window with reserve() if necessary…
        }
    }

Building chunks:
1. Compact valid quads
2. Data at various frequencies
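
The "compact valid quads" step can be sketched in isolation. This is an assumption-laden illustration rather than the rasterizer from the talk: `ChunkBuilder`, `Quad` and `FragChunk` are hypothetical names, a chunk is taken to hold 4 quads (16 fragments), a quad counts as nonempty when its 4-bit coverage mask is nonzero, and "commit" is modeled as appending to a vector instead of calling commit() on a queue window.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kQuadsPerChunk = 4;  // 4 quads * 4 fragments = 16 fragments per chunk

struct Quad {
    int x, y;                // quad position
    std::uint8_t live_mask;  // 4-bit coverage mask, one bit per fragment
};

struct FragChunk {
    std::vector<Quad> quads;  // at most kQuadsPerChunk entries
};

// Packs nonempty quads densely into chunks and emits a chunk whenever it fills up.
class ChunkBuilder {
public:
    void add_quad(const Quad& q) {
        if (q.live_mask == 0) return;  // compaction: drop quads with no live fragments
        current_.quads.push_back(q);
        if (static_cast<int>(current_.quads.size()) == kQuadsPerChunk) flush();
    }

    // Emit the current chunk, even if only partially filled (e.g. at end of input).
    void flush() {
        if (current_.quads.empty()) return;
        committed_.push_back(current_);
        current_.quads.clear();
    }

    const std::vector<FragChunk>& committed() const { return committed_; }

private:
    FragChunk current_;
    std::vector<FragChunk> committed_;
};
```

The per-fragment mask, per-quad position and per-triangle uniform data are the "various frequencies" mentioned above; only the per-quad data is modeled here.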

Execution: Tier 1
[Diagram labels: hardware threads T0, T1, T2 … T(T-1) with L1 $; queue operations Produce(), Reserve(), Commit(), Thread_Done() (implicit commit); queue; shader threadEnvs and assemble threadEnvs; ShaderThr dispatch; AssembleThr resume; Tier 1 to Tier 0 FIFO]

Execution: Tier 0
[Diagram: Tier 1 to Tier 0 FIFO; Tier 0 Scheduler; Thread 0 … Thread T-1, each with registers R; ALU 0 … ALU S-1; L1 data cache (or scratchpad)]
- Each cycle: round robin runnable threads
- Thread stalls: place on wait list
- When thread completes:
  - Pull next thread from fifo, assign to empty thread slot
  - Send completion message to tier 0
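
A toy model of the Tier 0 loop helps show how little policy it needs. The sketch below is an assumption, not the simulator's code: `Tier0`, `Slot`, `enqueue` and `tick` are hypothetical names, each unit of work is reduced to a cycle count (which must be at least 1), stalls and the wait list are omitted, and completion messages are modeled as entries appended to a vector.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// One hardware thread slot; work_id < 0 means the slot is empty.
struct Slot {
    int work_id;
    int remaining;  // cycles of work left (must start at 1 or more)
};

class Tier0 {
public:
    explicit Tier0(int num_slots) : slots_(num_slots, Slot{-1, 0}), next_(0) {}

    // Stand-in for the Tier 1 to Tier 0 FIFO.
    void enqueue(int work_id, int cycles) { fifo_.push_back(Slot{work_id, cycles}); }

    // Advance one clock: fill empty slots from the FIFO, then let the next
    // occupied slot in round-robin order issue for one cycle.
    void tick() {
        for (Slot& s : slots_) {
            if (s.work_id < 0 && !fifo_.empty()) {
                s = fifo_.front();
                fifo_.pop_front();
            }
        }
        const int n = static_cast<int>(slots_.size());
        for (int i = 0; i < n; ++i) {
            Slot& s = slots_[(next_ + i) % n];
            if (s.work_id < 0) continue;
            if (--s.remaining == 0) {
                completed_.push_back(s.work_id);  // models the completion message
                s.work_id = -1;                   // slot becomes free for the next thread
            }
            next_ = (next_ + i + 1) % n;
            return;
        }
    }

    const std::vector<int>& completed() const { return completed_; }

private:
    std::vector<Slot> slots_;
    std::deque<Slot> fifo_;
    std::vector<int> completed_;
    int next_;  // slot to try first on the next tick
};
```

With two slots and work items of 2, 1 and 1 cycles, the single-cycle item in the second slot finishes before the two-cycle item in the first, which is the interleaving round robin is meant to produce.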

Validation
- "Fat enough" cores for assemble threads can deliver sufficient FLOPS
- Assemble threads can keep compute cores + fixed-function units busy
- Can give up domain-specific heuristics in the scheduling
