Microtriangle thesis mid-status, September 22nd, 2010.

My Thesis
– So far
– Summer work
– New ideas

So far
Identified an important rasterization bottleneck in GPUs when processing small triangles (<10 pixels).
Found a scalable and “almost” area-free new GPU design/pipeline to rasterize uTriangles efficiently:
– Uses the shader cores, a steadily increasing GPU resource.
– Independent fragment-in-triangle tests (no setup).
– Crack-free thanks to fixed-point arithmetic.
– BB optimization reduces shader rasterization work per uTriangle by ≈50%.
– Switches to the traditional rasterization pipeline for macro (“large”) triangles.
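As a rough illustration of the independent, fixed-point fragment-in-triangle test, here is a minimal Python sketch. It is not the ATTILA implementation: the 4-bit subpixel precision, the tie-break rule in owns_boundary, and all function names are assumptions made for the example, and "BB" is assumed to stand for the triangle's bounding box.

```python
SUBPIXEL_BITS = 4                 # assumed precision: 1/16 of a pixel
ONE = 1 << SUBPIXEL_BITS

def to_fixed(v):
    """Snap a coordinate in pixel units to the fixed-point grid."""
    return int(round(v * ONE))

def edge(ax, ay, bx, by, px, py):
    """Integer edge function: > 0 on one side of a->b, < 0 on the other."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def owns_boundary(ax, ay, bx, by):
    """Tie-break for samples exactly on an edge: of the two directions
    a->b and b->a, exactly one returns True (hypothetical rule)."""
    dx, dy = bx - ax, by - ay
    return dy > 0 or (dy == 0 and dx < 0)

def covers(verts, px, py):
    """Fragment-in-triangle test for one fixed-point sample; it depends
    only on the three vertices, so every sample can be tested independently."""
    for i in range(3):
        ax, ay = verts[i]
        bx, by = verts[(i + 1) % 3]
        e = edge(ax, ay, bx, by, px, py)
        if e < 0 or (e == 0 and not owns_boundary(ax, ay, bx, by)):
            return False
    return True

def rasterize(tri):
    """Return the set of pixels whose centers are covered by tri,
    testing only the pixels inside the triangle's bounding box."""
    verts = [(to_fixed(x), to_fixed(y)) for x, y in tri]
    (x0, y0), (x1, y1), (x2, y2) = verts
    area2 = edge(x0, y0, x1, y1, x2, y2)
    if area2 == 0:
        return set()              # degenerate triangle covers nothing
    if area2 < 0:
        verts[1], verts[2] = verts[2], verts[1]   # enforce one winding
    xs = [v[0] for v in verts]
    ys = [v[1] for v in verts]
    covered = set()
    for py in range(min(ys) >> SUBPIXEL_BITS, (max(ys) >> SUBPIXEL_BITS) + 1):
        for px in range(min(xs) >> SUBPIXEL_BITS, (max(xs) >> SUBPIXEL_BITS) + 1):
            cx = (px << SUBPIXEL_BITS) + ONE // 2   # pixel center, fixed point
            cy = (py << SUBPIXEL_BITS) + ONE // 2
            if covers(verts, cx, cy):
                covered.add((px, py))
    return covered
```

Because both triangles sharing an edge evaluate the same integer expression with opposite sign, a pixel center lying exactly on that edge is claimed by exactly one of them, which is what keeps the mesh crack-free in this sketch.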

Summer work
Finished the implementation of the pipeline for a mixed flow of macro/micro triangles:
– Triangles of different sizes inside a DrawPrim*() call now execute concurrently, each in its corresponding pipeline.
– Added support in ACD to transparently manage and sync separate pixel shader programs, for macro- and microtriangle fragments, in the ATTILA shader instruction memory.
– Fixed some bugs found in the data pipeline implemented a year ago.
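The routing of a mixed triangle stream can be pictured with a small Python sketch. The 10-pixel threshold comes from the "<10 pixels" figure in the slides; measuring size by screen-space area, and the names triangle_area and route, are assumptions made for the example.

```python
MICRO_MAX_PIXELS = 10   # threshold from the slides; the size metric is assumed

def triangle_area(tri):
    """Screen-space area in pixels of a triangle given as three (x, y) pairs."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0

def route(triangles):
    """Split one draw call's triangles into (micro, macro) work lists,
    preserving submission order inside each list."""
    micro, macro = [], []
    for tri in triangles:
        target = micro if triangle_area(tri) < MICRO_MAX_PIXELS else macro
        target.append(tri)
    return micro, macro
```

In this sketch the two lists would feed the shader-based rasterizer and the traditional rasterizer respectively; the real pipeline runs both concurrently.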

Summer work
The implementation is functional but performance is still poor (about 0.3x that of the traditional rasterizer alone):
– Not related to the rasterization cost (shader usage is low).
– Possible pipeline bottlenecks and/or a need to adjust queue sizes.
New uTriangle workload to test the implementation:
– Wrote an OpenGL app that renders a highly tessellated model of the Stanford Bunny (69K triangles) approaching the camera.
– Each frame projects the Bunny with a larger triangle size (closer to the camera).

A new testing workload
The Bunny uTriangle mesh: 1024x1024 RTT, 69K-triangle model.
– Frame 1: Bunny fills ≈ 0.4% of the viewport, ≈ 8.34 tris/pix
– Frame 60: Bunny fills ≈ 2% of the viewport, ≈ 3.3 tris/pix
– Frame 100: Bunny fills ≈ 15% of the viewport, ≈ 0.22 tris/pix
– Frame 150: Bunny fills ≈ 22% of the viewport, ≈ 0.15 tris/pix
Will use even more extremely tessellated models: Happy Buddha and Chinese Dragon (1 million triangles).

New research ideas
Triangles are now rasterized faster:
– By computing the rasterization job more efficiently in the fastest/widest GPU resource (shaders).
But the GPU pipeline still has some inefficiencies due to data structures designed to leverage large triangles:
– Overshading due to a quad (2x2 pixels) being generated for each uTriangle.
Second issue: memory accesses (framebuffer, textures) for uTriangle fragments are no longer guaranteed to be cache-friendly (Hilbert pattern), since the access order now depends on surface tessellation.

Shading uTris using stamps is inefficient
When uTris fill just one pixel, shader units are used at 1/4 of capacity after rasterization. The more uTris per pixel, the lower the utilization.
Apply vector compaction after rasterization and before shading:
– Form vectors of multi-triangle quads.
– Fatahalian, K., Boulos, S., Hegarty, J., Akeley, K., Mark, W. R., Moreton, H., and Hanrahan, P. 2010. Reducing shading on GPUs using quad-fragment merging. In ACM SIGGRAPH 2010 Papers.
– Derivatives computation must be decoupled.
[Figure: sparse shader vectors]
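The 1/4 figure can be reproduced with a few lines of Python. This is only a sketch: representing a fragment as a (triangle id, x, y) tuple and the name quad_utilization are assumptions made for the example.

```python
def quad_utilization(fragments):
    """Fraction of shader lanes doing useful work when every triangle
    gets its own 2x2 quad (stamp) for each quad position it touches.

    fragments: iterable of (triangle_id, x, y) covered pixels.
    """
    fragments = list(fragments)
    # One quad is issued per (triangle, aligned 2x2 block) pair.
    quads = {(tri, x >> 1, y >> 1) for tri, x, y in fragments}
    if not quads:
        return 0.0
    return len(fragments) / (4 * len(quads))
```

Four one-pixel triangles that land in the same 2x2 block issue four quads for four useful fragments, giving 0.25; a single triangle covering the whole block gives 1.0.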

Quad overshading map

Vector Compaction
– Wait for rasterization of several sparse vector threads to complete before interpolation/shading (use a thread fence instruction).
– Merge valid threads into new dense vectors that resume execution (become ready to schedule).
– Shader vector slots can be freed for new inputs as a result of compacting.
[Figure labels: ready-to-execute shader inputs, thread fence, compact queue, freed slots, rasterization, full-width interpolation/fragment shading]
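The merge step can be sketched in Python as follows. This models only the bookkeeping, not the hardware queue: a vector is assumed to be a list with one entry per SIMD lane, None marks a lane whose fragment failed rasterization, and the name compact is hypothetical.

```python
def compact(sparse_vectors, width):
    """Merge the valid threads of several sparse vectors into dense ones.

    sparse_vectors: list of lists of length `width`; each entry is a
    thread id, or None for a lane with no covered fragment.
    Returns (dense_vectors, freed_slots).
    """
    # Collect surviving threads in their original order.
    valid = [t for vec in sparse_vectors for t in vec if t is not None]
    # Repack them into full-width vectors.
    dense = [valid[i:i + width] for i in range(0, len(valid), width)]
    # Pad the last vector so every vector has exactly `width` lanes.
    if dense and len(dense[-1]) < width:
        dense[-1] = dense[-1] + [None] * (width - len(dense[-1]))
    freed_slots = len(sparse_vectors) - len(dense)
    return dense, freed_slots
```

With three width-4 vectors holding one, two, and one valid threads, the sketch yields a single dense vector and frees two vector slots for new shader inputs.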

Vector Compaction
Vector compaction implies changing the thread's initially assigned SIMD lane. This has a direct implication for the RF organization:
– Heavily multiported single RF: long latency.
– Banked RF: fewer ports, but register values need to migrate with threads as they are compacted.
[Figure labels: SIMD pipeline (I-Fetch, Decode, RF, ALU, Writeback), ready-to-execute shader inputs, compact queue, sparse vector, compacted vector]

Vector Compaction
# banks = SIMD length:
  RF0   RF1   RF2   ...  RF15
  Th0   Th1   Th2   ...  Th15
  Th16  Th17  Th18  ...  Th31
  ...
# banks = 4 (fragments in a quad):
  RF0   RF1   RF2   RF3
  Th0   Th1   Th2   Th3
  Th4   Th5   Th6   Th7
  Th8   Th9   Th10  Th11
  Th12  Th13  Th14  Th15
  Th16  Th17  Th18  Th19
  Th20  Th21  Th22  ...
Bank labels in the figure: top-left pixel, top-right pixel, bottom-left pixel, bottom-right pixel.
Compacted threads don't need to migrate register values as long as they stay in the same quad lane.

Vector Compaction
4 banks: ideal case for uTriangle meshes, no migration needed.
Triangles covering more than 1 pixel can still be merged in a smart way that avoids register migration as much as possible.
[Figure labels: Stamp Generation, Rasterization, Compaction]
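A lane-preserving merge for the 4-bank case might look like the Python sketch below. It assumes each sparse quad is a 4-entry list indexed by quad lane (which is also the RF bank index) with None for empty lanes; the name compact_by_lane is hypothetical.

```python
from itertools import zip_longest

def compact_by_lane(quads):
    """Merge sparse quads so that every thread keeps its quad lane.

    quads: list of 4-entry lists indexed by quad lane; each entry is a
    thread id or None. A thread that stays in its lane stays in its
    register bank, so no register values have to move.
    """
    lanes = [[] for _ in range(4)]
    for quad in quads:
        for lane, thread in enumerate(quad):
            if thread is not None:
                lanes[lane].append(thread)
    # Take one thread per lane for each output quad; short lanes yield None.
    return [list(q) for q in zip_longest(*lanes)]
```

Five one-fragment quads occupying lanes 0, 1, 2, 3, 0 merge into one full quad plus one quad with a single valid lane, and every thread remains in the lane it was rasterized in. The trade-off is that an uneven lane distribution leaves holes that a migration-based scheme could fill.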

Texture memory access
Macrotriangle fragments are rasterized in a special order (Hilbert, Morton) that favors texture cache locality.
[Figure: texture cache line footprint of a regular macro triangle]
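To show why such an order helps locality, here is a minimal Python sketch of the Morton (Z-order) index; the Hilbert curve is not shown. The function name morton2d and the 16-bit coordinate limit are assumptions made for the example.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y: x takes the even bit positions,
    y the odd ones. Pixels that are close in 2D get nearby indices."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_order(width, height):
    """All pixels of a width x height block, sorted by Morton index."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    return sorted(pixels, key=lambda p: morton2d(p[0], p[1]))
```

Walking a 4x4 block in this order visits each aligned 2x2 block completely before moving to the next, so consecutive fragments tend to touch the same texture cache line.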

Texture memory access
With microtriangles, texture accesses depend on the surface tessellation order, so cache locality is lost.
[Figure: texture cache line footprint of a tessellated patch]

Texture memory access
Proposed idea:
– For uTriangles, vector compaction can get texture locality back by preferably grouping threads that map to the same cache line in the same compacted vector.
– Probably requires a much longer compact queue.
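The proposed grouping could be prototyped along these lines in Python. This is a sketch of the idea only: the 4x4-texel cache line footprint, the (thread id, u, v) representation, and the name compact_by_cache_line are all assumptions made for the example.

```python
from collections import defaultdict

def compact_by_cache_line(threads, width, line_w=4, line_h=4):
    """Build dense vectors, preferring to put threads that hit the same
    texture cache line into the same vector.

    threads: list of (thread_id, u, v) with integer texel coordinates.
    line_w, line_h: assumed cache line footprint in texels.
    """
    buckets = defaultdict(list)
    for tid, u, v in threads:
        buckets[(u // line_w, v // line_h)].append(tid)
    # Emit threads cache line by cache line, then cut into vectors.
    ordered = [tid for line in sorted(buckets) for tid in buckets[line]]
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]
```

The more threads are waiting in the queue, the more likely a full vector can be filled from one cache line, which is consistent with the expectation that a much longer compact queue is needed.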
