Real-time Geometry Caches

Slides:



Advertisements
Similar presentations
Spherical Skinning with Dual-Quaternions and QTangents
Advertisements

Introduction to Direct3D 10 Course Porting Game Engines to Direct3D 10: Crysis / CryEngine2 Carsten Wenzel.
Geometry Clipmaps: Terrain Rendering Using Nested Regular Grids
Animation in Video Games presented by Jason Gregory
Multi-core/Cell Game Engine Design
T.Sharon-A.Frank 1 Multimedia Compression Basics.
Normal Map Compression with ATI 3Dc™ Jonathan Zarge ATI Research Inc.
MPEG4 Natural Video Coding Functionalities: –Coding of arbitrary shaped objects –Efficient compression of video and images over wide range of bit rates.
Riham Toulan Technical Artist.  In this session we will have a closer look at : Ryse character rigging pipeline. Character physics and Secondary animations.
Basics of MPEG Picture sizes: up to 4095 x 4095 Most algorithms are for the CCIR 601 format for video frames Y-Cb-Cr color space NTSC: 525 lines per frame.
Memory-Efficient Sliding Window Progressive Meshes Pavlo Turchyn University of Jyvaskyla.
Technion - IIT Dept. of Electrical Engineering Signal and Image Processing lab Transrating and Transcoding of Coded Video Signals David Malah Ran Bar-Sella.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
1 Video Coding Concept Kai-Chao Yang. 2 Video Sequence and Picture Video sequence Large amount of temporal redundancy Intra Picture/VOP/Slice (I-Picture)
Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray
CGDD 4003 THE MASSIVE FIELD OF COMPUTER GRAPHICS.
Lossless Compression of Floating-Point Geometry Martin Isenburg UNC Chapel Hill Peter Lindstrom LLNL Livermore Jack Snoeyink UNC Chapel Hill.
Video Compression Bee Fong. Lossy Compression  Inter Frame Compression Compression among frames Compression among frames  Intra Frame Compression Compression.
IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.
Enhancing and Optimizing the Render Cache Bruce Walter Cornell Program of Computer Graphics George Drettakis REVES/INRIA Sophia-Antipolis Donald P. Greenberg.
Compressing Polygon Mesh Connectivity
Kumar, Roger Sepiashvili, David Xie, Dan Professor Chen April 19, 1999 Progressive 3D Mesh Coding.
Topological Surgery Progressive Forest Split Papers by Gabriel Taubin et al Presented by João Comba.
Out-of-Core Compression for Gigantic Polygon Meshes Martin IsenburgStefan Gumhold University of North CarolinaWSI/GRIS at Chapel Hill Universität Tübingen.
3D Rendering & Algorithms__ Sean Reichel & Chester Gregg a.k.a. “The boring stuff happening behind the video games you really want to play right now.”
Video Compression Concepts Nimrod Peleg Update: Dec
University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.
Amit L Ahire, Alun Evans, Josep Blat Interactive Technologies Group, Universitat Pompeu Fabra, Barcelona, Spain.
VPC3: A Fast and Effective Trace-Compression Algorithm Martin Burtscher.
Chapter 5.2 Character Animation. CS Overview Fundamental Concepts Animation Storage Playing Animations Blending Animations Motion Extraction Mesh.
Real-time Rendering of Dynamic Vegetation Alexander Kusternig Vienna University Of Technology.
Enhancing GPU for Scientific Computing Some thoughts.
Graphics on Key by Eyal Sarfati and Eran Gilat Supervised by Prof. Shmuel Wimer, Amnon Stanislavsky and Mike Sumszyk 1.
 Coding efficiency/Compression ratio:  The loss of information or distortion measure:
Video Basics. Agenda Digital Video Compressing Video Audio Video Encoding in tools.
MPEG-1 and MPEG-2 Digital Video Coding Standards Author: Thomas Sikora Presenter: Chaojun Liang.
Invitation to Computer Science 5th Edition
CSCE 552 Fall D Models By Jijun Tang. Triangles Fundamental primitive of pipelines  Everything else constructed from them  (except lines and point.
UW EXTENSION CERTIFICATE PROGRAM IN GAME DEVELOPMENT 2 ND QUARTER: ADVANCED GRAPHICS D3DX introduction and math.
Adaptive Multi-path Prediction for Error Resilient H.264 Coding Xiaosong Zhou, C.-C. Jay Kuo University of Southern California Multimedia Signal Processing.
June, 1999 An Introduction to MPEG School of Computer Science, University of Central Florida, VLSI and M-5 Research Group Tao.
A hardware-Friendly Wavelet Entropy Codec for Scalable video Hendrik Eeckhaut ELIS-PARIS Ghent University Belgium.
DirectX Objects Finalised Paul Taylor Packing Your Objects
Random-Accessible Compressed Triangle Meshes Sung-eui Yoon Korea Advanced Institute of Sci. and Tech. (KAIST) Peter Lindstrom Lawrence Livermore National.
Figure 1.a AVS China encoder [3] Video Bit stream.
CSE 381 – Advanced Game Programming Animation Systems Jack Skellington, from The Nightmare Before Christmas.
Multimedia System and Networking UTD Slide- 1 University of Texas at Dallas B. Prabhakaran Rigging.
Advances in digital image compression techniques Guojun Lu, Computer Communications, Vol. 16, No. 4, Apr, 1993, pp
CSCE 552 Spring D Models By Jijun Tang. Triangles Fundamental primitive of pipelines  Everything else constructed from them  (except lines and.
All the Polygons You Can Eat
Review on Graphics Basics. Outline Polygon rendering pipeline Affine transformations Projective transformations Lighting and shading From vertices to.
Main Index Contents 11 Main Index Contents Complete Binary Tree Example Complete Binary Tree Example Maximum and Minimum Heaps Example Maximum and Minimum.
1 Modular Refinement of H.264 Kermin Fleming. 2 What is H.264? Mobile Devices Low bit-rate Video Decoder –Follow on to MPEG-2 and H.26x Operates on pixel.
1 Perception and VR MONT 104S, Fall 2008 Lecture 20 Computer Graphics and VR.
Mesh Skinning Sébastien Dominé. Agenda Introduction to Mesh Skinning 2 matrix skinning 4 matrix skinning with lighting Complex skinning for character.
CASA 2006 CASA 2006 A Skinning Approach for Dynamic Mesh Compression Khaled Mamou Titus Zaharia Françoise Prêteux.
2014 Animation Programming for Music Video Games Jessica Scott Harmonix Music Systems, Inc. October 10, 2014 #GHC
CMPT365 Multimedia Systems 1 Media Compression - Video Spring 2015 CMPT 365 Multimedia Systems.
Thomas Daede October 5, 2017 AV1 Update Thomas Daede October 5, 2017.
Computer Graphics Index Buffers
Graphics Processing Unit
Deferred Lighting.
Chapter 6 GPU, Shaders, and Shading Languages
A Brief History of 3D MESH COMPRESSION ORAL, M. ELMAS, A.A.
Standards Presentation ECE 8873 – Data Compression and Modeling
MPEG4 Natural Video Coding
Kenneth Moreland Edward Angel Sandia National Labs U. of New Mexico
03 | Creating, Texturing and Moving Objects
Emerging Technologies for Games Review & Revision Strategy
Presentation transcript:

Real-time Geometry Caches Axel Gneiting | axel@crytek.com | SIGGRAPH 2014 | 13th August 2014

Motivation Wanted next-gen visuals for Ryse Lots of VFX set pieces Art pipeline bottleneck (required baking to joints) Needed simpler way to get animations into engine Solution: Import Alembic Animations like cloth, water simulations, fur Alembic: No engine specific markup. One click to import and run. Outsourcing much easier

Challenge Massive data rate? Naïve approach: 56 bytes per vertex (14 floats) Position, UV, Normal, Tangent, Binormal, (Color) Sail: 30000 vertices, 30 FPS ≈ 50MB/s Ryse budget: 10MB/s More than Alembic actually, need full tangent frames 10 MB/s is for the whole scene, not only one cache

Compression Overview Only transform for rigids Vertex animation 50 Only transform for rigids Vertex animation Fewer bits per vertex Transform data Compression Restriction: Only static topology Obviously we don‘t store per frame data for rigid / non-deforming meshes Data rate is bar is always for the sail Transform data to help compression Data rate meter for specific sail asset with 30.000 vertices. Data rate meter will indicate progress on data rate reduction during the methods presented in the talk. Data Rate (MB/s)

Transforms Simplify the tree Transforms: Rotation + Translate + Scale 50 Simplify the tree Transforms: Rotation + Translate + Scale Bake down static hierarchies Bake down animated child of animated parent No support for shear Transforms: 40 instead of 48 bytes. Compared to vertex animations we can neglect this. We didn‘t optimize it a lot. Data Rate (MB/s)

Fewer Bits (Quantization) Positions: 3x uint16 Texture Coordinates: 2x int16 QTangents 4x 10 bits (int16) [FREY11] (Colors: 4x 8 bits RGBA) Only lossy step Positions defined in bbox space. mm accuracy for 64m mesh. Quantizer will use less bits if possible, artist can specify the mm precision he needs. Texture coordinates get mapped to [-1024, 1024] which leaves enough fractional digits Tangent frames mean only orthonormal tangent frames. Doesn‘t matter in practice for us. We use 16 bit shorts for 10 bit values because compressor works on bytes 16 Data Rate (MB/s)

Compression Block compress each frame Deflate (zlib) [DEUTSCH96] or LZ4 HC [COLLET13] Deflate is slow but pretty good compression LZ4 HC is usally 20% worse compression, but 10x faster decode. Almost like memcpy. Still 10MB/s for one asset, we need to do better 10 Data Rate (MB/s)

Prediction Overview Predict utilizing temporal and spatial similarities Store residuals (differences) Of course need to do same prediction at runtime than at compile time, otherwise don‘t get same result Residual symbols cluster around zero, because prediction tends to be close. The more of the same symbols, the better the compression. 10 Data Rate (MB/s)

Predictor Requirements Deterministic In-place prediction As fast as possible Always needs to predict exactly the same way (obvious) No extra memory allocations on decode Needs to decode in real time 10 Data Rate (MB/s)

Intra Prediction (i Frames) Predict positions from adjacent tris [ISENBURG02] ... Parallelogram prediction. Extend adjacent triangles to parallelograms. Blue: Last triangle(s) Orange: Parallelogram prediction Black arrow: Residual to store Can do this in place: for each vertex predict, read, add, write Also used for UVs QTangents just use average of last two vertices, because parallelogram rule makes no sense for them Savings depend on asset 8,5 Data Rate (MB/s)

Intra Prediction (I Frames) Adjacent tri must be decoded Need to reorder vertices First run vertex cache optimizer Sort vertices by first use in index buffer Per vertex search for best prev vertices Store offsets for decode Optimizer tends to order indices so mesh gets rendered in strips Offset only needs to be stored in file header, because we only support non-changing topology 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Full (index) frame data only every n frames B frames predicted using temporal similarities For us optimal index frame distance was about 10 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Original motion 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Interpolated prediction 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Motion prediction 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Combination of motion and interpolation prediction 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Select interpolation factor for interpolation t 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Fp In Fpp B Select acceleration factor for motion t 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Select extrapolation factor for combination The three factors used are the same for all vertices in mesh, so individual predictions will usually be worse than here. t 8,5 Data Rate (MB/s)

Inter Prediction (B Frames) Binary search residual entropy for factors Three bytes extra / mesh / frame Can also do this in place Vectorizable (SIMD) Much better prediction than intra Factors get quantized to three bytes per frame for all vertices Can use SIMD for this to predict 4 elements in parallel (some INT16 operations even on 8 elements) Just unpack 8xINT32 -> 2x4xUINT32, mul, shift, truncate & pack again on Jaguar per interpolate/extrapolate 4,5 Data Rate (MB/s)

Playback Overview Needs to run in real-time Streaming Parallelization Loading time would also be a problem Next gen CPU cores still not terribly fast

Disk Buffer GPU Streaming Data Flow And TimingS t = -5s Read t = 0 Convert t = -0.5s Decompress/Decode Disk reads and decompress/decode are asynchronous and non-blocking Read combining to avoid disk seeks (>1 MB chunks) Upload to GPU is asynchronous but render thread will wait for data Data in buffer stays in compact disk format until decompress starts Timings are configurable, values were choosen by experimentation

Buffering Experimented with ring allocator Problems with multiple streams Used dedicated heap instead dlmalloc based Ring buffer allocator doesn‘t work for multiple streams, because of different data rates. Would need to defragment holes. Really tried to make this work, but wasn‘t worth it in the end. Fragmentation with normal allocator wasn‘t a big problem 128 MB for both compressed and uncompressed data

... Parallel Decompress & Decode Compressed Compressed Decompressed Jobs for decompression Index frame jobs can start as soon as decompressed data is ready ... I I B B B B B B B B I I

... Parallel Decompress & Decode Compressed Compressed Decompressed First B frame has I frames and own residuals as input (First B frame does not do acceleration prediction) ... I I B B B B B B B B B I I

... Parallel Decompress & Decode Compressed Compressed Decompressed Other B frames have only last B frame and residuals as dependencies, because by induction both I frames and the last two B frames are decoded All other B-frames depend on previous one as well We do the synchronization for all jobs with lockless atomic counting: I frames get initialized to 1, first B frame after I frame to 3, all other B frames to 2 After dependency is finished will decrease counter of dependent task and launch it when counter reaches 0 ... I I B B B B B B B B B B I I

Rendering Job for GPU convert & upload Instancing for rigids SIMD Motion vectors & transforms update Interpolates between frames Instancing for rigids The conversion job gets launched for each job the geom cache actually gets rendered Conversion job could possibly be avoided if GPU would directly read quantized format The vertex shader could possibly directly support the quantized format

Future development Support for changing topology Improve compression Better predictors Better block compression Automatic skinning Support for physics Tricky with vertex animation All of the compression research tends to lead to require more and more computational power for little gains Automatic skinning would be a way to do more lossy compression

Special Thanks Sascha Herfort, Nicolas Schulz, Bogdan Coroi, Theodor Mader, Carsten Wenzel, Chris Raine, Chris Bolte & Ivo Zoltan Frey The entire Ryse team and Crytek

References [DEUTSCH96] DEFLATE compressed data format specification version 1.3. [COLLET13] Collet, Yann. "LZ4: Extremely fast compression algorithm.“ https://code.google.com/p/lz4/ [FREY11] Frey, Ivo Zoltan, and Ivo Herzeg. "Spherical skinning with dual quaternions and QTangents." SIGGRAPH Talks. 2011. [ISENBURG02] Isenburg, Martin, and Pierre Alliez. "Compressing polygon mesh geometry with parallelogram prediction." Visualization, 2002