Microtriangle thesis mid-status, September 22nd, 2010.


My Thesis
- So far
- Summer work
- New ideas

So far
Identified an important rasterization bottleneck in GPUs when processing small triangles (<10 pixels).
Found a scalable and "almost" area-free new GPU design/pipeline to efficiently rasterize uTriangles:
- Uses the shader cores, a GPU resource that keeps increasing.
- Independent fragment-in-triangle tests (no setup).
- Crack-free by using fixed-point arithmetic.
- Bounding-box (BB) optimization reduces shader rasterization work per uTriangle by ≈50%.
- Switches to the traditional rasterization pipeline for macro ("large") triangles.
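The fragment-in-triangle test listed above can be illustrated with a minimal sketch. This is not the thesis implementation: the sub-pixel precision (`FRAC_BITS`), the function names, and the pixel-center sampling are all assumptions. It evaluates integer edge functions independently for each pixel inside the triangle's bounding box. Because the arithmetic is exact, two triangles sharing an edge agree on which side a sample lies, so no gaps appear between them. A tie-breaking fill rule for samples lying exactly on a shared edge is omitted for brevity.

```python
FRAC_BITS = 4            # assumed sub-pixel precision (hypothetical choice)
ONE = 1 << FRAC_BITS     # one pixel in fixed-point units
HALF = ONE >> 1          # offset of the pixel center


def to_fixed(v):
    """Snap a floating-point coordinate to the fixed-point grid."""
    return int(round(v * ONE))


def edge(ax, ay, bx, by, px, py):
    """Integer edge function: sign tells on which side of a->b the point p lies."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_pixels(tri):
    """Return the pixels whose centers fall inside tri (three (x, y) vertices)."""
    v0, v1, v2 = [(to_fixed(x), to_fixed(y)) for x, y in tri]
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return []                      # degenerate triangle
    if area < 0:
        v1, v2 = v2, v1                # enforce a single winding order
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    hits = []
    # Only pixels inside the bounding box are tested; each test is independent.
    for py in range(min(ys) >> FRAC_BITS, (max(ys) >> FRAC_BITS) + 1):
        for px in range(min(xs) >> FRAC_BITS, (max(xs) >> FRAC_BITS) + 1):
            sx, sy = px * ONE + HALF, py * ONE + HALF
            if (edge(*v1, *v2, sx, sy) >= 0
                    and edge(*v2, *v0, sx, sy) >= 0
                    and edge(*v0, *v1, sx, sy) >= 0):
                hits.append((px, py))
    return hits
```

Each pixel test depends only on the three vertices, so in principle every test could run as a separate shader thread.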

Summer work
Finished the implementation of the pipeline for a mixed flow of macro/micro triangles:
- Triangles of different sizes inside a DrawPrim*() call now execute concurrently, each in its corresponding pipeline.
- Added support in ACD to transparently manage and sync separate pixel shader programs, for macro and micro triangle fragments, in the ATTILA shader instruction memory.
- Fixed some bugs found in the data pipeline implemented a year ago.

Summer work
The implementation is functional, but performance is still poor (about 0.3x that of the traditional rasterizer alone):
- Not related to the rasterization cost (shader usage is low).
- Possible pipeline bottlenecks and/or a need to adjust queue sizes.
New uTriangle workload to test the implementation:
- Wrote an OpenGL app that renders a highly tessellated model of the Stanford Bunny (69K triangles) approaching the camera.
- Each frame projects the Bunny with increased triangle size (closer to the camera).

A new testing workload
The Bunny uTriangle mesh: 69K-triangle model, 1024x1024 RTT.
- Frame 1: Bunny fills ≈ 0.4% of the viewport, ≈ 8.34 tris/pix
- Frame 60: Bunny fills ≈ 2% of the viewport, ≈ 3.3 tris/pix
- Frame 100: Bunny fills ≈ 15% of the viewport, ≈ 0.22 tris/pix
- Frame 150: Bunny fills ≈ 22% of the viewport, ≈ 0.15 tris/pix
Will use even more highly tessellated models, Happy Buddha and Chinese Dragon: 1 million triangles.

New research ideas
Triangles are now rasterized faster:
- By computing the rasterization job more efficiently in the fastest/widest GPU resource (the shaders).
But the GPU pipeline still has some inefficiencies due to data structures designed to leverage large triangles:
- Overshading due to a quad (2x2 pixels) being generated for each uTriangle.
Second issue: memory accesses (framebuffer, textures) for uTriangle fragments are no longer guaranteed to be cache-optimal (Hilbert pattern), since the access order now depends on surface tessellation.

Shading uTris using stamps is inefficient
When uTris fill just one pixel, shader units are used at 1/4 after rasterization. The more uTris per pixel, the lower the utilization.
Apply vector compaction after rasterization and before shading:
- Form vectors of multi-triangle quads.
- Fatahalian, K., Boulos, S., Hegarty, J., Akeley, K., Mark, W. R., Moreton, H., and Hanrahan, P. 2010. Reducing shading on GPUs using quad-fragment merging. In ACM SIGGRAPH 2010 Papers.
- Derivatives computation must be decoupled.
[Figure label: Sparse shader vectors]
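The 1/4 figure follows directly from the quad size. As a small worked example (the function name and input format are illustrative assumptions), the fraction of useful shader lanes is the number of covered pixels divided by the number of lanes occupied by the generated quads:

```python
def quad_utilization(covered_per_quad, quad_size=4):
    """Fraction of shader lanes doing useful work.

    covered_per_quad: number of covered pixels in each generated 2x2 quad.
    """
    lanes = len(covered_per_quad) * quad_size
    return sum(covered_per_quad) / lanes
```

With every uTriangle covering a single pixel, `quad_utilization([1, 1, 1, 1])` gives 0.25, whereas fully covered quads from large triangles give 1.0.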

Quad overshading map

Vector Compaction
Wait for rasterization to complete on several sparse vector threads before interpolation/shading (use a thread fence instruction).
Merge valid threads into new dense vectors that resume execution (become ready to be scheduled).
Shader vector slots can be freed for new inputs as a result of compacting.
[Figure labels: Ready-to-Execute Shader Inputs, Compact queue, Thread fence, Freed Slots, Rasterization, FULL-WIDTH interpolation/fragment shading]
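The merge step can be sketched in software as follows. This is a toy model only: vectors are Python lists, `None` marks a lane whose fragment failed the rasterization test, and the function name is an assumption. It shows how several sparse vectors collapse into fewer dense ones, which is what frees shader vector slots.

```python
def compact(sparse_vectors, width):
    """Merge the valid threads of several sparse vectors into dense vectors.

    sparse_vectors: list of lists of length `width`; None marks an inactive lane.
    Returns dense vectors of length `width` (only the last may contain None).
    """
    valid = [t for vec in sparse_vectors for t in vec if t is not None]
    dense = []
    for i in range(0, len(valid), width):
        chunk = valid[i:i + width]
        chunk += [None] * (width - len(chunk))   # pad the final vector
        dense.append(chunk)
    return dense
```

For instance, four 4-wide vectors with one surviving thread each compact into a single full vector, so three vector slots become available for new inputs. Note that this naive version moves threads to arbitrary lanes.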

Vector Compaction
Vector compaction implies changing the thread's initially assigned SIMD lane.
Direct implication on the register file (RF) organization:
- Heavily multiported single RF: long latency.
- Banked RF: fewer ports.
Need to migrate register values with threads as they are compacted.
[Figure: SIMD pipeline (I-Fetch, Decode, RF banks, ALUs, Writeback) with a sparse vector and a compacted vector moving from the compact queue to the ready-to-execute shader inputs]

Vector Compaction
Two RF banking options:
- # banks = SIMD length: one bank per lane (RF0 ... RF15 holding Th0 ... Th15, then Th16 ... Th31, ...).
- # banks = 4 (fragments in a quad): one bank per quad pixel (RF0 ... RF3 holding Th0 ... Th3, then Th4 ... Th7, ...), corresponding to the top-left, top-right, bottom-left and bottom-right pixels.
Compacted threads don't need to migrate register values as long as they stay in the same quad lane.

Vector Compaction
4 banks: ideal case for uTriangle meshes, no migration needed.
uTriangles covering > 1 pixel can still be merged in a smart way to avoid register migration as much as possible.
[Figure stages: Stamp Generation, Rasterization, Compaction]
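One way to picture migration-free compaction with 4 banks is sketched below. It assumes, hypothetically, that lane position modulo 4 identifies the quad pixel and therefore the RF bank. Valid threads are queued per quad lane and dense vectors are refilled so that every thread keeps its original position modulo 4. The names and the list-based model are illustrative only.

```python
from collections import deque

QUAD = 4  # fragments in a quad, assumed equal to the number of RF banks


def compact_by_quad_lane(sparse_vectors, width):
    """Compact threads while keeping each one in its original quad lane.

    A thread at position p may only move to a position q with q % 4 == p % 4,
    so under the 4-bank assumption its registers stay in the same bank.
    """
    lanes = [deque() for _ in range(QUAD)]
    for vec in sparse_vectors:
        for pos, thread in enumerate(vec):
            if thread is not None:
                lanes[pos % QUAD].append(thread)
    dense = []
    while any(lanes):                      # some lane queue still has threads
        vec = [None] * width
        for pos in range(width):
            queue = lanes[pos % QUAD]
            if queue:
                vec[pos] = queue.popleft()
        dense.append(vec)
    return dense
```

When the surviving fragments are spread evenly over the four quad positions, the result is as dense as unrestricted compaction. When they all fall in the same position, some slots stay empty, which is the price of avoiding migration in this simple scheme.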

Texture memory access
Macrotriangle fragments are rasterized in a special order (Hilbert, Morton) that favors texture cache locality.
[Figure: texture cache line footprint of a regular macro triangle]
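As a reminder of why such orders help, here is a minimal sketch of Morton (Z-order) indexing, which interleaves the bits of the pixel coordinates. The function name and the 16-bit coordinate limit are assumptions. Pixels that are close in 2D end up close in traversal order, so consecutive fragments tend to touch the same cache lines.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd ones)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code
```

Sorting the pixels of a 4x4 tile by `morton2d` visits each 2x2 quad completely before moving to the next one.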

Texture memory access
With microtriangles, texture accesses depend on the surface tessellation order, so cache locality is lost.
[Figure: texture cache line footprint of a tessellated patch]

Texture memory access
Proposed idea:
- For uTriangles, vector compaction can recover texture locality by preferentially grouping threads that map to the same cache line in the same compacted vector.
- Probably requires a much longer compact queue.
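A rough software analogy of the proposed grouping is sketched below. It makes hypothetical assumptions throughout: each queued thread is a `(thread_id, (u, v))` pair, a cache line covers a 4x4 texel tile, and the function names are invented for illustration. Threads waiting in the compact queue are bucketed by the cache line they will access, and compacted vectors are then filled bucket by bucket.

```python
from collections import defaultdict


def texture_line(u, v, tile=4):
    """Hypothetical mapping from texel coordinates to a cache line id."""
    return (u // tile, v // tile)


def compact_by_cache_line(threads, width):
    """Group queued threads so that a compacted vector shares cache lines.

    threads: list of (thread_id, (u, v)) entries waiting in the compact queue.
    Returns lists of thread ids, each list being one compacted vector.
    """
    buckets = defaultdict(list)            # keeps first-seen order of lines
    for thread_id, (u, v) in threads:
        buckets[texture_line(u, v)].append(thread_id)
    ordered = [tid for bucket in buckets.values() for tid in bucket]
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]
```

The longer the queue, the more threads per bucket are available, which is consistent with the remark that a much longer compact queue is probably required.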