Next-Gen Asset Streaming Using Runtime Statistics
David Thall, Insomniac Games

The Concept
– Load and unload assets based on runtime statistics
– Simple!

Why not just use Dependency Graphs?
Dependency graphs can only tell us 'what' to load
– An expensive proposition!
But what about 'when'?
– If we can answer this, we can save a lot of memory

Why not just use Dependency Graphs?
Dependency graphs create hierarchical and cross-referenced interdependencies
– This causes required-asset lists to grow exponentially

Why not just use Dependency Graphs?
Dependency graphs are tied to a build pipeline
– If the dependencies change, so does the built data
– This design encourages fixed-pipeline optimizations, such as data packing, which make runtime optimizations more difficult

Why not just use Dependency Graphs?
Dependency graphs don't know about 'new' assets
– We'd like to be able to load unrequested assets 'on-the-fly'

Adding runtime statistics to the mix
Examine the constraints
– It is impossible to determine out-of-context that an asset will be used during gameplay
For example:
– Will we ever load the high-mip texture?
– Will we ever load the jumping animation?
– Will we ever load the footstep-on-snow sound?
– On the flip side, we want to know that we did in fact load it

Adding runtime statistics to the mix
Examine the data
– Many asset types are triggered more often by game events and AI than by camera position and orientation: sound, visual FX, animation
– Related assets tend to get rendered in spatio-temporal clusters: the context is similar, both within and across asset types

Our focus case: Sound Assets
Sound poses a particularly interesting problem, because the data is a one-dimensional function of time (not space)
Questions we need to ask:
– What sounds will need to be loaded (and when)?
– What is our maximum latency (per sound)?
  Physical latency: how long will it take?
  Perceptual latency: how long can we wait?
More questions (implementation details):
– How much memory will we require in the active game context?
– How much more expensive is it to load and unload at runtime?
We'll answer these questions later…

What types of statistics are useful to collect?
Simplest:
– Sound ID
– Context ID (spatiotemporal subdivisions)
– Maximum latency (a user-supplied setting)
Optional:
– Minimum bounding box (useful if the defining context is spatially bound)
Together these say: "A sound was played in this context with this maximum latency setting". That is good information!
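As a sketch, one collected record could be a small plain struct like the following; the type and field names here are illustrative assumptions, not Insomniac's actual layout.

```cpp
#include <cstdint>

// Assumed names, for illustration only.
enum class MaxLatency : uint8_t { Low, Med, High };  // user-supplied setting

struct SoundStatRecord {
    uint32_t   soundId;      // which sound was played
    uint32_t   contextId;    // spatiotemporal subdivision it played in
    MaxLatency maxLatency;   // maximum latency setting at play time
    // Optional: minimum bounding box, when the context is spatially bound
    float bboxMin[3];
    float bboxMax[3];
};
```

Records this small are cheap to log on every play and to aggregate offline or per context.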

How do we collect our statistics?

A closer look at latency
Latency at the game context level
– Spatial: regions (areas, zones)
– Temporal: throwing the grenade
Latency at the sound level
– High: can be loaded on-demand from disc
– Med: can be loaded on-demand from cache
– Low: must be pre-cached into main memory
On-demand means an event triggers the load
We can further subdivide at both levels
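The three sound-level tiers map directly to load strategies; this small sketch (with assumed enum and function names) just makes that mapping explicit.

```cpp
// Assumed names; the mapping follows the tiers described above.
enum class Latency { High, Med, Low };
enum class LoadStrategy { OnDemandFromDisc, OnDemandFromCache, Precache };

LoadStrategy strategyFor(Latency l) {
    switch (l) {
        case Latency::Low: return LoadStrategy::Precache;          // must already be in main memory
        case Latency::Med: return LoadStrategy::OnDemandFromCache; // event triggers a cache read
        default:           return LoadStrategy::OnDemandFromDisc;  // event triggers a disc read
    }
}
```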

Loose-load every sound asset
We loose-load every sound asset, because latency requirements differ per sound
– The latency setting determines 'when' we load (the timeframe)
– The default latency is 'high'
  Sound designers override this in special cases
  If we're hitting the BD/DVD too much, we override it
– No complex fix-ups across contextual boundaries
Loose-loaded sounds are reference counted
– When we load from a context, we only pre-cache low-latency sounds, and we know how much time we have to do it
– If we don't load from a context, the sounds get loaded on-demand
– NOTE: we still need to complete a game with this new tech to know what percentage of sounds will require low vs. high latency; early test results follow the 80/20 rule!
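A minimal sketch of the reference counting described above, assuming a simple map from sound ID to count: a sound is loaded on the 0 → 1 transition and unloaded on the 1 → 0 transition while any context still holds it. The class and method names are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical residency tracker for loose-loaded, refcounted sounds.
class SoundResidency {
public:
    // A context acquired the sound; returns true on the 0 -> 1 transition,
    // i.e. the caller should issue the load.
    bool acquire(uint32_t soundId) {
        return ++refs_[soundId] == 1;
    }
    // A context released the sound; returns true on the 1 -> 0 transition,
    // i.e. the caller may unload it.
    bool release(uint32_t soundId) {
        auto it = refs_.find(soundId);
        if (it == refs_.end() || it->second == 0) return false;
        if (--it->second == 0) { refs_.erase(it); return true; }
        return false;
    }
    bool isResident(uint32_t soundId) const { return refs_.count(soundId) != 0; }
private:
    std::unordered_map<uint32_t, uint32_t> refs_;
};
```

Because every sound is loose-loaded, the same tracker works whether the load was pre-cached from a context list or triggered on-demand.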

Memory Management
Some constraints:
– Need contiguous memory available to handle load requests
– Need to be able to do this at any time (hence, at runtime)
– Need to do all the processing without blocking the main thread
– Must be cheap!
Compact all memory as soon as possible
– Statistically-tracked sounds load in bursts
  Low-latency sounds are loaded from statistically-generated lists
  High-latency sounds are loaded less often (by definition)
– Low latency means low memory; high latency means high memory, so movement is minimized in both time and space
Defragment actively 'playing' sounds in the same buffer
– Any other approach unnecessarily complicates memory management
Do all loads and unloads asynchronously in a background thread
– Keep everything lock-free
– Synchronize all moves with the renderer (and never block it)
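In the spirit of "do all loads in a background thread, keep everything lock-free", here is a minimal single-producer/single-consumer ring that a game thread could use to hand load requests to a loader thread without locks. This is an illustrative sketch under those assumptions, not the shipped implementation.

```cpp
#include <atomic>
#include <cstddef>

// SPSC lock-free ring buffer: one game-thread producer, one loader-thread
// consumer. N must be a power of two. Names are assumptions.
template <typename T, size_t N>
class SpscRing {
public:
    bool push(const T& v) {  // game thread
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {  // loader thread
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```

The acquire/release pairing is what lets the consumer see a fully written slot without the main thread ever blocking on a mutex.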

Results
We only require 25% to 50% of our normal memory budget
– In other words, sound designers get 2x to 4x the memory budget
Memory heap sizes are now manageable
– This is true for programmers 'and' sound designers
  Programmers can trade off memory for other requirements, such as FX
  Sound designers directly tweak sound sizes to fit memory requirements
High-latency sounds are practically free
– Remember… they are loaded and unloaded from high memory
Low-latency sound counts are smaller than originally thought (80/20)
– Only the 'hero' sounds tend to have psychologically-bound latency requirements
Low-latency sounds with the highest playback count in the shortest time window should get loaded first (and preempt any high-latency requests)
Cull sounds from contextual loading lists after some time threshold

Questions?

Thank you!