OpenGL Performance
John Spitzer, Manager, OpenGL Applications Engineering



3 Possible Performance Bottlenecks
They mirror the OpenGL pipeline:
- High order surface evaluation
- Data transfer from application to GPU
- Vertex operations (fixed function or vertex program)
- Texture mapping
- Other per-fragment operations

4 High Order Surface Evaluation
- Each patch requires some amount of driver setup
- CPU-dependent for coarse tessellation
- Will improve as driver optimizations are completed
- Integer tessellation factors are faster than fractional ones
- The driver may cache display-listed patches, so that setup is amortized over many frames for static patches


6 Transferring Geometric Data from App to GPU
So many ways to do it:
- Immediate mode
- Display lists
- Vertex arrays
- Compiled vertex arrays
- Vertex array range extension

7 Immediate Mode
The old stand-by:
- Has the most flexibility
- Makes the most calls
- Has the highest CPU overhead
- Varies in performance depending on CPU speed
- Not the most efficient

8 Display Lists
Fast, but limited:
- Immutable
- Requires the driver to allocate memory to hold the data
- Allows a large amount of driver optimization
- Can sometimes be cached in fast memory
- Typically very fast
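A minimal sketch of the pattern, assuming a valid GL context (the function names here are illustrative):

```c
#include <GL/gl.h>

/* Build once: the driver can copy and optimize the commands at compile
   time, possibly caching the data in fast memory. */
GLuint build_static_mesh_list(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    /* ... drawing commands for the static geometry go here ... */
    glEndList();
    return list;
}

/* Each frame: replay the whole list with a single call. The list is
   immutable, so this only suits geometry that never changes. */
void draw_frame(GLuint list)
{
    glCallList(list);
}
```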

9 Vertex Arrays
The best of both worlds:
- Data can be changed as often as you like
- Data can be interleaved or kept in separate arrays
- Can use straight lists or indices
- Reduces the number of API calls vs. immediate mode
- Little room for driver optimization, since data referenced by pointers can change at any time
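A sketch of an indexed, interleaved vertex array draw, assuming a valid GL context; the `Vertex` layout and function name are illustrative:

```c
#include <GL/gl.h>

/* Hypothetical interleaved layout: position and normal packed together. */
typedef struct { float pos[3]; float nrm[3]; } Vertex;

void draw_mesh(const Vertex *verts, const unsigned short *indices, int index_count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);

    /* The stride tells GL how far apart consecutive attributes are,
       so both pointers can walk the same interleaved array. */
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), verts->pos);
    glNormalPointer(GL_FLOAT, sizeof(Vertex), verts->nrm);

    /* One call replaces thousands of immediate-mode calls; indices let
       shared vertices be referenced instead of re-specified. */
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);

    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```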

10 Compiled Vertex Arrays
Solve part of the problem:
- Allow the user to lock portions of a vertex array
- In turn, this gives the driver more optimization opportunities:
  - Shared vertices can be detected, allowing the driver to eliminate superfluous operations
  - Locked data can be copied to higher-bandwidth memory for more efficient transfer to the GPU
- Still require transferring the data twice
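A sketch of the lock/draw/unlock pattern for multipass rendering, assuming the EXT_compiled_vertex_array entry points have already been fetched through the platform's extension mechanism (e.g. wglGetProcAddress, not shown):

```c
#include <GL/gl.h>

/* EXT_compiled_vertex_array entry points, obtained at startup. */
extern void (*glLockArraysEXT)(GLint first, GLsizei count);
extern void (*glUnlockArraysEXT)(void);

void draw_two_pass(int vertex_count, const unsigned short *idx, int index_count)
{
    /* The lock promises the driver the arrays won't change until unlock,
       so it may detect shared vertices and copy the data to faster memory. */
    glLockArraysEXT(0, vertex_count);

    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, idx); /* pass 1 */
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, idx); /* pass 2 */

    glUnlockArraysEXT();
}
```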

11 Vertex Array Range (VAR) Extension
VAR allows the GPU to pull vertex data directly:
- Eliminates the double copy
- Analogous to Direct3D vertex buffers
- VAR memory must be specially allocated from AGP or video memory
- Facilitates post-T&L vertex caching
- Introduces synchronization issues between CPU and GPU

12 VAR Memory Allocation and Issues
Video memory has faster GPU access, but it is precious:
- The same memory is used for the framebuffer and textures
- Could cause texture thrashing
- On systems without Fast Writes, writes will be slow
AGP memory is typically easier to use:
- Write sequentially to take advantage of the CPU's write combiners (and maximize memory bandwidth)
- Usually as fast as video memory
Best to use either one large chunk of AGP or one large chunk of video memory; switching is expensive. NEVER read from either type of memory.
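A sketch of VAR setup on Windows, assuming the NV_vertex_array_range entry points have been fetched with wglGetProcAddress (not shown). The hint values follow NVIDIA's customary AGP settings, but treat them and the enum value (copied from glext.h) as illustrative:

```c
#include <GL/gl.h>

#define GL_VERTEX_ARRAY_RANGE_NV 0x851D  /* value from glext.h */

/* NV_vertex_array_range entry points, obtained at startup. The last three
   wglAllocateMemoryNV parameters are read-frequency, write-frequency, and
   priority hints that steer the allocation toward AGP or video memory. */
extern void *(*wglAllocateMemoryNV)(GLsizei size, GLfloat readFreq,
                                    GLfloat writeFreq, GLfloat priority);
extern void (*glVertexArrayRangeNV)(GLsizei length, const void *pointer);

void *setup_var(GLsizei bytes)
{
    /* priority near 0.5 conventionally selects AGP memory; values near
       1.0 request video memory. Allocate one large chunk up front. */
    void *mem = wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 0.5f);
    if (mem) {
        glVertexArrayRangeNV(bytes, mem);
        glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    }
    return mem;  /* NULL means fall back to ordinary vertex arrays */
}
```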

13 Facilitates Post-T&L Vertex Caching
All NVIDIA GPUs have post-T&L caches:
- Replacement policy is FIFO-based
- 16 elements on GeForce 256 and GeForce2 GPUs
- 24 elements on GeForce3
- Effective sizes are smaller due to pipelining
Only activated when used in conjunction with:
- Compiled vertex arrays or VAR
- glDrawElements or glDrawRangeElements
Vertex cache hits are more important than primitive type, though strips are still faster than lists. Do not create degenerate triangle strips.

14 VAR Synchronization Issues
The CPU is writing to, and the GPU reading from, the same memory simultaneously. Must ensure the GPU is finished reading before the memory is overwritten by the CPU. Current synchronization methods are insufficient:
- glFinish and glFlushVertexArrayRangeNV are too heavy-handed; they block the CPU until the GPU is completely done
- glFlush is merely guaranteed to complete "in finite time" – which could be next week sometime
Need a mechanism to perform a "partial finish":
- Introduce a token into the command stream
- Force the GPU to finish up to that token

15 Introducing: NV_fence
NV_fence provides fine-grained synchronization:
- A "fence" is a token that can be placed into the OpenGL command stream by glSetFenceNV
- Each fence has a condition that can be tested (the only condition available is GL_ALL_COMPLETED_NV)
- glFinishFenceNV forces the app (i.e., the CPU) to wait until a specific fence's condition is satisfied
- glTestFenceNV queries whether a particular fence has completed yet, without blocking

16 How to Use VAR/fence Together
The combination of VAR and fences is very powerful:
- VAR gives the best possible T&L performance
- With fences, apps can achieve very efficient pipelining of CPU and GPU
In memory-limited situations, VAR memory must be reused:
- Fences can be placed in the command stream to determine when memory can be reclaimed
- Different strategies can be used for dynamic vs. static data arrays
- The app must manage the memory itself
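One common reuse pattern, sketched below under the assumption that the NV_fence entry points have been fetched and the VAR memory has been carved into chunks by the app (names and chunk count are illustrative): double-buffer a dynamic range, fencing each chunk so the CPU only waits when it is about to overwrite memory the GPU might still be reading.

```c
#include <GL/gl.h>

#define GL_ALL_COMPLETED_NV 0x84F2  /* value from glext.h */
#define NUM_CHUNKS 2

/* NV_fence entry points, obtained at startup. */
extern void (*glSetFenceNV)(GLuint fence, GLenum condition);
extern void (*glFinishFenceNV)(GLuint fence);

static GLuint fences[NUM_CHUNKS];     /* created with glGenFencesNV */
static int    fence_set[NUM_CHUNKS];
static void  *chunk[NUM_CHUNKS];      /* pointers into VAR memory */
static int    cur = 0;

void frame(void (*fill)(void *dst), void (*draw)(const void *src))
{
    /* Before overwriting this chunk, wait for the GPU to pass the fence
       set the last time the chunk was used. With two chunks, this wait
       is usually already satisfied, keeping CPU and GPU pipelined. */
    if (fence_set[cur])
        glFinishFenceNV(fences[cur]);

    fill(chunk[cur]);   /* CPU writes new dynamic vertex data (sequentially) */
    draw(chunk[cur]);   /* GPU pulls it directly via VAR */

    glSetFenceNV(fences[cur], GL_ALL_COMPLETED_NV);  /* mark end of use */
    fence_set[cur] = 1;

    cur = (cur + 1) % NUM_CHUNKS;  /* next frame writes the other chunk */
}
```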

17 VAR/fence Performance Hints
Effective memory management and synchronization are key:
- Delineate static and dynamic arrays
- Avoid redundant copies
- Do this on a per-array basis, not per-object
- Be clever in your use of memory
- Use fences to keep the CPU and GPU working in parallel

18 Other Vertex Array Range Issues
Separate vs. interleaved arrays:
- If an object's arrays are a mix of static and dynamic, then interleaved writing will be inefficient
- Interleaved arrays may be slightly faster, though
Only use VAR when hardware T&L is used:
- Not with vertex programs on NV1X
- Not with feedback/selection
The VAR array can be of arbitrary size, but indices must be less than 1M (1,048,576) on GeForce3. Data arrays must be 8-byte aligned.

19 Vertex Operations
Vertex program or fixed function pipeline? Operations that the fixed function pipeline can perform should use it, because:
- it will also be hardware accelerated on NV1X
- it will be faster on GeForce3, if lighting-heavy
All other operations should be performed with a vertex program.

20 Vertex Program Performance
Vertices per second ≈ clock_rate / program_length

21 Fixed Function Pipeline Performance
GeForce3 fixed function pipeline performance will scale in relation to NV1X in most cases. Two-sided lighting is an exception – it is entirely hardware accelerated on GeForce3. The performance penalty for two-sided lighting vs. one-sided depends on the number of lights, texgen, and other variables, but should be almost free for "CAD lighting" (one infinite light/viewer).

22 Texturing
Hard to optimize. Speed vs. quality – make it a user-settable option:
- Pick the right filtering modes
- Pick the right texture formats
- Pick the right texture shader functions
- Pick the right texture blend functions
- Load the textures efficiently
- Manage your textures effectively
- Use multitexture to reduce the number of passes
- Sort by texture to reduce state changes!

23 Texture Filtering and Formats
- Still use bilinear mipmapped filtering for best speed when multitexturing
- LINEAR_MIPMAP_LINEAR is as fast as bilinear on GeForce3 when only a single 2D texture is being used
- Use anisotropic filtering for best image fidelity
  - It will not be free, but the penalty only occurs where necessary, not on all pixels
  - Higher levels will be slower
- Use texture compression, if at all practical
- Use single/dual-component formats when possible

24 Texture Shader Programs

25 Texture Blending
Use register combiners for the most flexibility:
- The final combiner is always "free" (i.e., full speed)
- General combiners can slow things down:
  - 1 or 2 can be enabled for free
  - 3 or 4 run at one half maximum speed
  - 5 or 6 run at one third maximum speed
  - 7 or 8 run at one fourth maximum speed
- Maximum speed may not be attainable anyway, due to the choice and number of texture shaders (or other factors)
- As such, the number of enabled general combiners will likely not be the bottleneck in most cases

26 Texture Downloads and Copies
Texture downloads:
- Always use glTexSubImage2D rather than glTexImage2D, if possible
- Match external and internal formats
- Use texture compression, if it makes sense
- Use SGIS_generate_mipmap instead of gluBuild2DMipmaps
Texture copies:
- Match the internal format to the framebuffer (e.g. 32-bit desktop to GL_RGBA8)
- Will likely have a facility to texture by reference soon, obviating the copy
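A sketch of an efficient per-frame texture update, assuming the texture was created once with a GL_RGBA8 internal format and that SGIS_generate_mipmap is exposed by the driver (the enum value is copied from glext.h; names are illustrative):

```c
#include <GL/gl.h>

#define GL_GENERATE_MIPMAP_SGIS 0x8191  /* value from glext.h */

/* Update an existing texture's contents. glTexSubImage2D reuses the
   allocation made once by glTexImage2D, so the driver does not
   reallocate or respecify the texture every frame. */
void update_texture(GLuint tex, int w, int h, const void *rgba_pixels)
{
    glBindTexture(GL_TEXTURE_2D, tex);

    /* Ask the driver to regenerate mipmap levels itself, instead of
       building and downloading them with gluBuild2DMipmaps. */
    glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP_SGIS, GL_TRUE);

    /* External format (GL_RGBA/GL_UNSIGNED_BYTE) matches the assumed
       GL_RGBA8 internal format, so no per-texel conversion is needed. */
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, rgba_pixels);
}
```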

27 Other Fragment Operations
- Render coarsely from front to back
  - Minimizes memory bandwidth, which is often the performance bottleneck
  - Particularly important on GeForce3, which has special early Z-culling hardware
- Avoid blending
  - Most modes drop fill rates
  - Try to collapse multiple passes with multitexture
- Avoid front-and-back rendering like the plague! Instead, render to the back, then to the front

28 Questions, Comments, Feedback
John Spitzer