Microtriangle thesis mid-status, September 22nd, 2010.


1 Microtriangle thesis mid-status, September 22nd, 2010

2 My Thesis
– So far
– Summer work
– New ideas

3 So far
Identified an important rasterization bottleneck in GPUs when processing small triangles (<10 pixels).
Found a scalable and “almost” area-free new GPU design/pipeline to efficiently rasterize uTriangles:
– Uses the shader cores, a GPU resource that keeps increasing.
– Independent fragment-in-triangle tests (no setup).
– Crack-free by using fixed-point arithmetic.
– BB optimization reduces shader rasterization work per uTriangle by ≈50%.
– Switches to the traditional rasterization pipeline for macro (“large”) triangles.
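The per-fragment test listed on this slide can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes "BB" means testing only the pixels inside the triangle's bounding box, assumes 4 subpixel bits, and uses a hypothetical tie-break rule (`owns_boundary`) so that a sample lying exactly on a shared edge is claimed by exactly one triangle. All function names are invented for the example.

```python
SUBPIXEL_BITS = 4            # assumed 1/16-pixel precision
ONE = 1 << SUBPIXEL_BITS     # one pixel in fixed-point units
HALF = ONE >> 1              # offset of a pixel center


def to_fixed(v):
    """Snap a float coordinate to the fixed-point grid."""
    return int(round(v * ONE))


def edge(ax, ay, bx, by, px, py):
    """Signed edge function; exact because every input is an integer."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def owns_boundary(ax, ay, bx, by):
    """Tie-break for samples exactly on edge a->b. Two triangles sharing
    an edge traverse it in opposite directions, so only one claims it."""
    dx, dy = bx - ax, by - ay
    return dy > 0 or (dy == 0 and dx > 0)


def inside(ax, ay, bx, by, px, py):
    """True if the sample is on the interior side of edge a->b."""
    w = edge(ax, ay, bx, by, px, py)
    return w > 0 or (w == 0 and owns_boundary(ax, ay, bx, by))


def covered_pixels(tri):
    """Return the pixels whose centers fall inside `tri` (three (x, y)
    float vertices). Each pixel is tested independently, with no setup
    state shared between fragments."""
    v = [(to_fixed(x), to_fixed(y)) for x, y in tri]
    area = edge(*v[0], *v[1], *v[2])
    if area == 0:
        return []                      # degenerate triangle
    if area < 0:
        v[1], v[2] = v[2], v[1]        # normalize the winding order
    xs = [p[0] for p in v]
    ys = [p[1] for p in v]
    hits = []
    # Only pixels inside the bounding box are tested.
    for py in range(min(ys) >> SUBPIXEL_BITS, (max(ys) >> SUBPIXEL_BITS) + 1):
        for px in range(min(xs) >> SUBPIXEL_BITS, (max(xs) >> SUBPIXEL_BITS) + 1):
            cx = (px << SUBPIXEL_BITS) + HALF
            cy = (py << SUBPIXEL_BITS) + HALF
            if all(inside(*v[i], *v[(i + 1) % 3], cx, cy) for i in range(3)):
                hits.append((px, py))
    return hits
```

Splitting a 4x4-pixel square along its diagonal gives two triangles whose shared edge passes through four pixel centers. With integer arithmetic and the tie-break above, those pixels are covered once each, with no crack and no double hit.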

4 Summer work
Finished the implementation of the pipeline for a mixed flow of macro/micro triangles:
– Triangles of different sizes inside a DrawPrim*() call now execute in the corresponding pipeline, concurrently.
– Added support in ACD to transparently manage and sync separate pixel shader programs, for macro and micro triangle fragments, in the ATTILA shader instruction memory.
– Fixed some bugs found in the data pipeline implemented a year ago.

5 Summer work
The implementation is functional but performance is still poor (about 0.3x that of the traditional rasterizer alone):
– Not related to the rasterization cost (low shader usage).
– Possible pipeline bottlenecks and/or a need to adjust queue sizes.
New uTriangle workload to test the implementation:
– Wrote an OpenGL app that renders a highly tessellated model of the Stanford Bunny (69K triangles) approaching the camera.
– Each frame projects the Bunny with increased triangle size (closer to the camera).

6 A new testing workload
The Bunny uTriangle mesh: 69K-triangle model, 1024x1024 RTT.
– Frame 1: Bunny fills ≈ 0.4% of the viewport, ≈ 8.34 tris/pix
– Frame 60: Bunny fills ≈ 2% of the viewport, ≈ 3.3 tris/pix
– Frame 100: Bunny fills ≈ 15% of the viewport, ≈ 0.22 tris/pix
– Frame 150: Bunny fills ≈ 22% of the viewport, ≈ 0.15 tris/pix
Will use even more extremely tessellated models: Happy Buddha and Chinese Dragon (1 million triangles).

7 New research ideas
Triangles are now rasterized faster:
– By computing the rasterization job more efficiently in the fastest/widest GPU resource (the shaders).
But… some inefficiencies remain in the GPU pipeline, due to data structures designed to leverage large triangles:
– First issue: overshading, because a quad (2x2 pixels) is generated for each uTriangle.
– Second issue: memory accesses (framebuffer, textures) for uTriangle fragments are no longer guaranteed to be cache-optimal (Hilbert pattern), since the access order now depends on surface tessellation.

8 Shading uTris using stamps is inefficient
When uTris fill just one pixel, shader units are used at 1/4 capacity after rasterization. The more uTris per pixel, the lower the utilization.
Apply vector compaction after rasterization and before shading:
– Form vectors of multi-triangle quads.
– Fatahalian, K., Boulos, S., Hegarty, J., Akeley, K., Mark, W. R., Moreton, H., and Hanrahan, P. 2010. Reducing shading on GPUs using quad-fragment merging. In ACM SIGGRAPH 2010 Papers.
– Derivatives computation must be decoupled.
[Figure: sparse shader vectors]
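The 1/4 figure follows directly from the 2x2 stamp: each quad occupies four shader lanes no matter how many of its pixels the triangle covers. A small sketch of that arithmetic, where the coverage-mask representation and the function name are assumptions made for the example:

```python
def lane_utilization(coverage_masks):
    """Fraction of shader lanes doing useful work.

    `coverage_masks` holds one 4-tuple of booleans per 2x2 quad, marking
    which of its pixels the triangle actually covers. Every quad occupies
    four lanes regardless of coverage.
    """
    if not coverage_masks:
        return 0.0
    live = sum(sum(1 for covered in mask if covered) for mask in coverage_masks)
    return live / (4 * len(coverage_masks))


# Three one-pixel uTriangles, each generating its own quad.
one_pixel_tris = [(True, False, False, False)] * 3
# Three quads from a large triangle's interior, fully covered.
macro_interior = [(True, True, True, True)] * 3
```

Here `lane_utilization(one_pixel_tris)` evaluates to 0.25 and `lane_utilization(macro_interior)` to 1.0, which is the gap that compaction tries to close.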

9 Quad overshading map

10 Vector Compaction
[Figure labels: Ready-to-Execute Shader Inputs, Compact queue, Thread fence, Freed Slots, Rasterization, FULL-WIDTH interpolation/fragment shading]
– Wait for several sparse vector threads to complete rasterization before interp/shading (use a thread fence instruction).
– Merge valid threads into new dense vectors that resume execution (become ready to schedule).
– Shader vector slots can be freed for new inputs as a result of compacting.
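The merge step described on this slide can be sketched as below. It is a simplified software model under stated assumptions: a vector is a fixed-width list whose dead lanes hold `None`, all input vectors are assumed to have already reached the thread fence, and `compact` is a hypothetical name. It ignores register file placement entirely.

```python
def compact(sparse_vectors, width):
    """Merge the live threads of several sparse vectors into dense ones.

    Returns (dense_vectors, freed_slots), where freed_slots is the number
    of vector slots that become available for new shader inputs.
    """
    live = [t for vec in sparse_vectors for t in vec if t is not None]
    dense = []
    for start in range(0, len(live), width):
        chunk = live[start:start + width]
        chunk += [None] * (width - len(chunk))   # pad the final vector
        dense.append(chunk)
    return dense, len(sparse_vectors) - len(dense)


# Three width-4 vectors, each left with few live fragments after rasterization.
waiting = [
    ["t0", None, None, None],
    [None, "t1", None, "t2"],
    [None, None, "t3", None],
]
dense, freed = compact(waiting, width=4)
```

In this example the four live threads fit in a single dense vector, so two of the three vector slots are freed.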

11 Vector Compaction
Vector compaction implies changing the thread's initially assigned SIMD lane. This has a direct implication on the RF organization:
– Heavily multiported single RF: long latency.
– Banked RF: fewer ports, but register values need to migrate with threads as they are compacted.
[Figure: SIMD pipeline (I-Fetch, Decode, RF, ALU, Writeback) fed from the Ready-to-Execute Shader Inputs and the Compact queue, showing a sparse vector and a compacted vector]

12 Vector Compaction
Two RF banking options:
– # banks = SIMD length: RF0 holds Th0, Th16, …; RF1 holds Th1, Th17, …; RF2 holds Th2, Th18, …; RF15 holds Th15, Th31, …
– # banks = 4 (fragments in a quad): RF0 holds Th0, Th4, Th8, Th12, Th16, Th20, …; RF1 holds Th1, Th5, Th9, Th13, Th17, Th21, …; RF2 holds Th2, Th6, Th10, Th14, Th18, Th22, …; RF3 holds Th3, Th7, Th11, Th15, Th19, …
Quad lanes: top-left pixel, top-right pixel, bottom-left pixel, bottom-right pixel.
Compacted threads don't need to migrate register values as long as they stay in the same quad lane.

13 Vector Compaction
4 banks: ideal case for uTriangle meshes, since no migration is needed.
uTriangles covering more than 1 pixel can still be merged in a smart way that avoids register migration as much as possible.
[Figure labels: Stamp Generation, Rasterization, Compaction]
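One way to picture migration-free compaction with 4 banks is to keep a separate queue per quad lane and refill dense vectors lane by lane. The sketch below assumes the bank of a thread is its slot index modulo 4 and that the vector width is a multiple of 4; `compact_keep_lane` is a hypothetical name, and the slides do not specify the actual merging policy.

```python
QUAD_LANES = 4   # top-left, top-right, bottom-left, bottom-right


def compact_keep_lane(sparse_vectors, width):
    """Compact live threads without moving any of them to another quad lane.

    A thread in slot s lives in bank s % 4, and it is only ever placed in
    an output slot with the same remainder, so its registers stay put.
    """
    per_lane = width // QUAD_LANES            # slots per bank in one vector
    queues = [[] for _ in range(QUAD_LANES)]
    for vec in sparse_vectors:
        for slot, thread in enumerate(vec):
            if thread is not None:
                queues[slot % QUAD_LANES].append(thread)
    # The fullest lane decides how many output vectors are needed.
    n_out = max(-(-len(q) // per_lane) for q in queues)
    dense = [[None] * width for _ in range(n_out)]
    for lane, queue in enumerate(queues):
        for i, thread in enumerate(queue):
            vec_idx, k = divmod(i, per_lane)
            dense[vec_idx][k * QUAD_LANES + lane] = thread
    return dense


sparse = [
    ["a", None, None, None, None, "b", None, None],
    [None, None, "c", None, "d", None, None, None],
]
merged = compact_keep_lane(sparse, width=8)
```

In this model the result is only as dense as the lane balance allows: if every live fragment sits in the same quad lane, the fullest queue dictates the number of output vectors and the other lanes stay empty.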

14 Texture memory access
Macrotriangle fragments are rasterized in a special order (Hilbert, Morton) that favors texture cache locality.
[Figure labels: Texture cache line footprint, Regular macro triangle]
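As a reminder of why such orders help, a Morton (Z-order) index interleaves the bits of x and y, so pixels that are close in the traversal are also close in 2D. A minimal sketch follows; the 16-bit coordinate width is an arbitrary assumption, and the traversal used by the actual rasterizer is not specified here.

```python
def morton2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x in the even positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Visit a 4x4 pixel block in Morton order.
order = sorted(((x, y) for y in range(4) for x in range(4)),
               key=lambda p: morton2d(*p))
```

The first four entries of `order` are (0, 0), (1, 0), (0, 1), (1, 1), a complete 2x2 quad, and every aligned group of four consecutive entries is likewise one 2x2 block.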

15 Texture memory access
With microtriangles, texture accesses depend on the surface tessellation order, so cache locality is lost.
[Figure labels: Tessellated Patch, Texture cache line footprint]

16 Texture memory access
Proposed idea:
– For uTriangles, vector compaction can get texture locality back by preferably grouping threads that map to the same cache line together in the same compacted vector.
– Probably requires a much longer compact queue.
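The grouping idea can be illustrated by bucketing the threads waiting in the compact queue by the texture cache line they touch before forming vectors. Everything specific here is an assumption for the sketch: one texel fetch per thread, a cache line covering a 4x4 texel tile, and the name `group_by_cache_line`. It also ignores the quad-lane constraint from the earlier slides.

```python
from collections import defaultdict


def group_by_cache_line(threads, width, tile=4):
    """Form vectors so that threads sharing a texture cache line sit together.

    `threads` is a list of (thread_id, (u, v)) pairs with integer texel
    coordinates. A cache line is assumed to cover a tile x tile texel block.
    """
    buckets = defaultdict(list)
    for thread_id, (u, v) in threads:
        buckets[(u // tile, v // tile)].append(thread_id)
    ordered = [tid for line in sorted(buckets) for tid in buckets[line]]
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]


# Tessellation order alternates between two distant regions of the texture.
queue = [(0, (0, 0)), (1, (8, 8)), (2, (1, 1)), (3, (9, 9)),
         (4, (2, 3)), (5, (10, 8)), (6, (3, 0)), (7, (11, 11))]
vectors = group_by_cache_line(queue, width=4)
```

Taken in arrival order, each width-4 vector would touch both cache lines; after grouping, `vectors` is `[[0, 2, 4, 6], [1, 3, 5, 7]]` and each vector touches one. The more threads the queue holds, the more likely a full vector can be filled from a single line, which matches the note about needing a longer compact queue.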

