Kenneth Hurley Sr. Software Engineer

Presentation transcript:

Kenneth Hurley Sr. Software Engineer

NVIDIA Corporation
What problems do we see when 3D engines are written?
Misuse of vertex buffers
Concurrency limitations
Frame rate limiters
Non-optimized surface usage
Cache misses
Data ordering

Misuse of Vertex Buffers
Bad things can happen unless you know the “right” way to use a vertex buffer.
Dynamic vertex buffers vs. static vertex buffers
When creating the vertex buffer, use D3DVBCAPS_WRITEONLY.
Use D3DLOCK_DISCARDCONTENTS.
Use D3DLOCK_NOOVERWRITE.
Vertex buffer ordering: use ordered vertex buffers because of cache coherency.

Using Vertex Buffers Correctly

Example vertex buffer flow
CreateVB(WRITEONLY, )
A: I = 0
B: Space in VB for M vertices?
   Yes: Lock(NOOVERWRITE)
   No: GOTO C
   Fill in M vertices at index I
   Unlock(); DIPVB(I); I += M; GOTO B
C: Lock(DISCARDCONTENTS)
   GOTO A
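
The flow above can be sketched as a small simulation. This is a toy model in Python, not Direct3D code: `StreamingVertexBuffer`, `append`, and `simulate` are hypothetical names, and the two flag strings only stand in for the lock flags named on the slide. As a simplification, the sketch writes the batch under the discard lock itself instead of looping back to re-lock.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the lock flags named on the slide.
NOOVERWRITE = "NOOVERWRITE"
DISCARDCONTENTS = "DISCARDCONTENTS"


@dataclass
class StreamingVertexBuffer:
    """Toy model of a write-only dynamic vertex buffer (not a real D3D object)."""
    capacity: int                             # total vertices the buffer holds
    cursor: int = 0                           # index I from the slide: next free vertex
    log: list = field(default_factory=list)   # (lock flag, start index, count) per batch

    def append(self, count: int) -> int:
        """Reserve `count` vertices and return the start index to draw from."""
        if count > self.capacity:
            raise ValueError("batch is larger than the whole buffer")
        if self.cursor + count <= self.capacity:
            # Room left: promise not to touch vertices the GPU may still read.
            flag = NOOVERWRITE
        else:
            # Buffer full: ask for fresh memory and restart at index 0.
            flag = DISCARDCONTENTS
            self.cursor = 0
        start = self.cursor
        self.log.append((flag, start, count))
        self.cursor += count
        return start


def simulate(capacity: int, batches: list) -> list:
    """Run a sequence of batch sizes through the buffer and return the lock log."""
    vb = StreamingVertexBuffer(capacity)
    for count in batches:
        vb.append(count)
    return vb.log
```

With a 10-vertex buffer and four batches of 4 vertices, `simulate(10, [4, 4, 4, 4])` logs two no-overwrite locks, one discard lock that restarts at index 0, and another no-overwrite lock.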

Concurrency
Why do I need it? Concurrency keeps the CPU and the GPU working in parallel.
OK, how do I achieve it? Use NVPAT to see whether a “spin lock” is happening.
A “spin lock” occurs when the driver has to stall, waiting for the hardware to finish with an object.
These objects can be vertex buffers or texture surfaces.

Concurrency (cont.)
Use the vertex buffer and texture surface flags so the driver can give you another buffer while the hardware is still using the first one.

Frame Rate Limiters
They can cause concurrency issues.
There are better ways to achieve constant frame rates.
They make the effective triangle rate much lower, because the driver has to do some work with vertex data.

Frame Rate Limiter Problem
Serialization of code loop
Rescheduled for concurrency

Non-Optimized Surface Usage
Locking a texture before the GPU is finished with it causes concurrency problems by stalling the CPU inside the driver.
A typical example is locking the back buffer to do 2D operations on it.
The best solution is to use two screen-aligned triangles (a quad) instead and put them directly into the 3D pipeline.
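
As a rough illustration of the quad approach, the sketch below builds the six vertices of two screen-aligned triangles covering a pixel rectangle. The function name `screen_quad` and the `(screen_x, screen_y, u, v)` vertex layout are assumptions made for this example; a real engine would use its own pre-transformed vertex format.

```python
def screen_quad(x, y, width, height):
    """Return six (screen_x, screen_y, u, v) vertices forming two triangles
    that cover the rectangle with top-left corner (x, y).

    Assumes screen space with y growing downward and texture coordinates
    running from (0, 0) at the top-left to (1, 1) at the bottom-right.
    """
    left, top = float(x), float(y)
    right, bottom = left + width, top + height

    top_left = (left, top, 0.0, 0.0)
    top_right = (right, top, 1.0, 0.0)
    bottom_left = (left, bottom, 0.0, 1.0)
    bottom_right = (right, bottom, 1.0, 1.0)

    # Both triangles are wound clockwise as seen on screen.
    return [
        top_left, top_right, bottom_left,
        bottom_left, top_right, bottom_right,
    ]
```

Drawing these two triangles with the 2D image bound as a texture keeps the work in the 3D pipeline, so the back buffer never has to be locked.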

Cache Misses
Big slowdowns can occur here.
CPU cache misses can occur because of the ordering of vertex data. Check these carefully with VTune.
The GPU also has a vertex cache. GeForce has a 16-entry cache, but optimal cache use is 10 entries, because 6 triangles can be “in flight” at any given time.
GPU vertex cache statistics will be added to NVPAT.
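
To get a feel for how index order affects the GPU vertex cache, a small simulation can count misses for a given index list. This is a sketch under stated assumptions: the slide does not describe the replacement policy, so a simple FIFO cache is assumed here, and the default of 10 entries reflects the "optimal use" figure above rather than the full 16. `count_cache_misses` is a hypothetical helper name.

```python
from collections import deque


def count_cache_misses(indices, cache_size=10):
    """Count vertex cache misses for an index list.

    Assumes a FIFO cache: a missed vertex is appended, and once the cache is
    full the oldest entry is dropped. Hits do not reorder the cache.
    """
    cache = deque(maxlen=cache_size)  # deque drops the oldest entry when full
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)
    return misses
```

For example, `count_cache_misses([0, 1, 2, 3, 0], cache_size=3)` returns 5, because vertex 0 has already been pushed out by the time it is reused, while the same list with `cache_size=4` returns 4.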

Vertex Ordering
For best performance, also order vertex data and vertex indices sequentially.
This helps both the CPU and the GPU.
Out-of-order vertices make the CPU miss the cache more often.
They do the same thing to the GPU.
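
One simple way to get sequential ordering is to rewrite the vertex array in the order the index list first touches each vertex. The sketch below shows that idea only; `reorder_by_first_use` is a hypothetical helper, and it is not claimed to be the method used by any NVIDIA tool.

```python
def reorder_by_first_use(vertices, indices):
    """Return (new_vertices, new_indices) with vertices stored in the order
    the index list first references them. Unreferenced vertices are dropped."""
    remap = {}          # old vertex index -> new vertex index
    new_vertices = []
    new_indices = []
    for old in indices:
        if old not in remap:
            # First time this vertex is used: give it the next sequential slot.
            remap[old] = len(new_vertices)
            new_vertices.append(vertices[old])
        new_indices.append(remap[old])
    return new_vertices, new_indices
```

After reordering, the index list walks the vertex array front to back, so each newly fetched vertex sits next to the previous one in memory.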

How do we solve these problems?
VTune
GPT
NVPAT

VTune 4.5
Helps you optimize your application for the CPU.
Works well in conjunction with NVPAT.
I personally use the Time-Based Sampling Wizard.
VTune is excellent for application-specific analysis.
It doesn’t show where in the driver time is spent unless you have symbols for the driver. You almost certainly don’t have driver symbols.

VTune 4.5
Flare application

GPT 3.5
Excellent tool to help you achieve maximum performance.
Works on both D3D and OpenGL.
Helps with application-to-API slowdowns.
Works well in conjunction with VTune and NVPAT.
GPT is excellent for application-to-Direct3D/OpenGL analysis.
It still can’t tell you what is occurring inside the driver that may be slowing your application down.

GPT 3.5 (cont.)
View of alien world in Half-Life*
Quad view for visual analysis modes

NVPAT 1.07
Analyzes interaction with the driver
Works on NVIDIA hardware only
Windows 98/Windows 2000 capable
Hotkey capable
Online help via the F1 function key
Logging
Frame rate display
Natural extension to VTune and GPT

NVPAT 1.07
Demo: Flare vs. NewFlare
NVPAT is available free at vStaticPages.nsf/pages/StatsDriver
You must be a registered NVIDIA developer.

VTune DLL SDK
Soon, all these performance tools should be integrated into VTune using the DLL SDK.
NVPAT will be integrated into the VTune DLL SDK.
The VTune DLL SDK is available from Intel and gives you the ability to integrate performance tools into VTune.
A common user interface/API means less for developers to learn.

Action Items
Profile early and often in the process.
Use the tools available to you. Some are free; the rest are reasonably priced.
Architect your engine with concurrency in mind.
Ask your tool vendor for enhancements.

Questions?
Comments/suggestions?
Enhancement requests for NVPAT can be sent to