Emergent Game Technologies Gamebryo Element Engine: Thread for Performance


Emergent Game Technologies Gamebryo Element Engine Thread for Performance

2 Goals for Cross-Platform Threading
Play well with others
Take advantage of platform-specific performance features
For engines/middleware, be adaptable to the needs of customers

3 Write Once, Use Everywhere
Underlying multi-threaded primitives are replicated on all platforms
– Define cross-platform wrappers for these
Processing models can be applied on different architectures
– Define cross-platform systems for these
Typical developer writes once, yet code performs well on all platforms

4 Emergent's Gamebryo Element
A foundation for easing cross-platform and multi-core development
– Modular, customizable
– Suite of content pipeline tools
– Supports PC, Xbox, PS3, and Wii
Booth #5716, North Hall

5 Cross-Platform Threading Requires Common Primitives
Threads
– Something that executes code
– Sub-issues: local storage, priorities
Data Locks / Critical Sections
– Manage contention for a resource
Atomic operations
– An operation that is guaranteed to complete without interruption from another thread
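These three primitives now have direct standard-library equivalents in C++. A rough illustration, using std::thread, std::mutex, and std::atomic rather than the engine's own cross-platform wrappers (which the slides do not show):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<int> g_atomicCounter{0};  // atomic op: increments cannot be interrupted
int g_lockedCounter = 0;
std::mutex g_lock;                    // critical section guarding g_lockedCounter

void Worker(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        // Atomic operation: no lock needed
        g_atomicCounter.fetch_add(1, std::memory_order_relaxed);

        // Data lock / critical section around the plain counter
        std::lock_guard<std::mutex> guard(g_lock);
        ++g_lockedCounter;
    }
}

int RunCounters(int numThreads, int iterations)
{
    // Threads: something that executes code
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(Worker, iterations);
    for (auto& th : threads)
        th.join();

    // Both approaches yield the same total; without the lock or the
    // atomic, the increments would race and counts would be lost.
    assert(g_atomicCounter.load() == g_lockedCounter);
    return g_atomicCounter.load();
}
```

Both counters end up identical; the atomic version simply avoids the lock's overhead for this single-word update.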

6 Choosing a Processing Model
Architectural features drive choice
– Cache coherence
– Prefetch on Xbox
– SPUs on PS3
– Many processing units
– General-purpose GPU
Stream processing fits these properties
– Provide infrastructure to compute this way
– Shift engine work to this model

7 Stream Processing (Formal)
Wikipedia: Given a set of input and output data (streams), the principle essentially defines a series of compute-intensive operations (kernel functions) to be applied for each element in the stream.
(Slide diagram: Input 1 feeds Kernel 1; its result, together with Input 2, feeds Kernel 2, which produces the Output.)

8 Generalized Stream Processing
Improve for general-purpose computing
– Partition streams into chunks
– Kernels have access to the entire chunk
– Parameters for kernels (fixed inputs)
Advantages
– Reduces the need for strict data locality
– Enables loops and non-SIMD processing
– Maps better onto hardware
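A minimal sketch of the chunked model the slide describes, with illustrative names (ScaleChunk, RunChunked) that are not part of Floodgate: the kernel sees a whole chunk plus a fixed parameter, and a driver loop partitions the stream.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A "kernel" that operates on an entire chunk rather than a single
// element, plus a fixed input (the scale) shared by every invocation.
void ScaleChunk(const float* in, float* out, std::size_t count, float scale)
{
    // Free to loop, branch, or do non-SIMD work within the chunk
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * scale;
}

// Partition a stream into fixed-size chunks and run the kernel per chunk.
// In a real system each chunk would be dispatched to a worker thread,
// an SPU, or a GPU rather than run serially here.
void RunChunked(const std::vector<float>& in, std::vector<float>& out,
                std::size_t chunkSize, float scale)
{
    out.resize(in.size());
    for (std::size_t start = 0; start < in.size(); start += chunkSize)
    {
        std::size_t count = std::min(chunkSize, in.size() - start);
        ScaleChunk(in.data() + start, out.data() + start, count, scale);
    }
}
```

The chunk size is the natural knob for tuning: it can be matched to cache-line prefetch, DMA transfer size, or SPU local store, as the later performance notes point out.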

9 Morphing+Skinning Example
(Slide diagram: Morph Target 1 and Morph Target 2 vertices, together with morph weights, feed the Morph Kernel (MK); the resulting skin vertices, along with bone matrices and blend weights, feed the Skinning Kernel (SK), which produces the final vertex locations.)

10 Morphing+Skinning Example
(Slide diagram: the vertex streams are partitioned in two. MK Instance 1 and MK Instance 2 each process one partition of the morph-target vertex streams, with the morph weights as a fixed input; their outputs feed SK Instance 1 and SK Instance 2, which take the matrices and blend weights as fixed inputs and produce Skin Verts Part 1 and Part 2.)

11 Floodgate
Cross-platform stream processing library
Optimized per-platform implementation
Documented API for customer use
Engine uses the same API for built-in functionality
– Skinning, Morphing, Particles, Instance Culling, ...

12 Floodgate Basics
Stream: a buffer of varying or fixed data
– A pointer, length, stride, locking
Kernel: an operation to perform on streams of data
– Code implementing an "Execute" function
Task: wraps a kernel and its I/O streams
Workflow: a collection of Tasks processed as a unit
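The four concepts can be sketched as plain C++ types. These are illustrative stand-ins for exposition, not the NiSP* classes:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stream: a buffer of varying or fixed data (stride and locking omitted)
struct Stream
{
    float* data;
    std::size_t length;
};

// Kernel: an operation to perform on streams (the "Execute" function)
using Kernel = std::function<void(const Stream&, Stream&)>;

// Task: wraps a kernel and its I/O streams
struct Task
{
    Kernel kernel;
    Stream input;
    Stream output;
};

// Workflow: a collection of Tasks processed as a unit
struct Workflow
{
    std::vector<Task> tasks;

    // A real implementation would partition streams and dispatch
    // tasks to worker threads; this sketch runs them serially.
    void Run()
    {
        for (auto& t : tasks)
            t.kernel(t.input, t.output);
    }
};
```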

13 Kernel Example: Times2

// Include the Kernel Definition macros
// (the header name inside the #include was lost in transcription)
#include

// Declare the Times2Kernel
NiSPDeclareKernel(Times2Kernel)

14 Kernel Example: Times2

#include "Times2Kernel.h"

NiSPBeginKernelImpl(Times2Kernel)
{
    // Get the input stream
    // (template arguments on GetInput/GetOutput appear to have been
    //  stripped in transcription; <float> is assumed here)
    float* pInput = kWorkload.GetInput<float>(0);

    // Get the output stream
    float* pOutput = kWorkload.GetOutput<float>(0);

    // Process data
    NiUInt32 uiBlockCount = kWorkload.GetBlockCount();
    for (NiUInt32 ui = 0; ui < uiBlockCount; ui++)
    {
        pOutput[ui] = pInput[ui] * 2;
    }
}
NiSPEndKernelImpl(Times2Kernel)

15 Life of a Workflow
1. Obtain Workflow from Floodgate
2. Add Task(s) to Workflow
3. Set Kernel
4. Add Input Streams
5. Add Output Streams
6. Submit Workflow
... Do something else ...
7. Wait or Poll when results are needed

16 Example Workflow

// Set up input and output streams from existing buffers
// (template arguments on NiTSPStream appear to have been stripped
//  in transcription; <float> is assumed here)
NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS);
NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS);

// Get a Workflow and set up a new task for it
NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow();
NiSPTask* pTask = pWorkflow->AddNewTask();

// Set the kernel and streams
pTask->SetKernel(&Times2Kernel);
pTask->AddInput(&inputStream);
pTask->AddOutput(&outputStream);

// Submit the workflow for execution
NiStreamProcessor::Get()->Submit(pWorkflow);

// Do other operations...

// Wait for the workflow to complete
NiStreamProcessor::Get()->Wait(pWorkflow);

17 Floodgate Internals
Partitioning streams for Tasks
Task dependency analysis
Platform-specific Workflow preparation
Platform-specific execution
Platform-specific synchronization

18 Overview of Workflow Analysis
Task dependencies are defined by streams
Sort tasks into stages of execution
– Tasks that use results from other tasks run in later stages
– Stage N+1 tasks depend on the output of Stage N tasks
Tasks in a given stage can run concurrently
Once a stage has completed, the next stage can run
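One way to implement this staging, sketched with hypothetical types (the slides do not show Floodgate's actual analysis code): assign each task the earliest stage at which all of its input streams have been produced.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

struct TaskNode
{
    std::vector<int> inputs;   // stream ids this task reads
    std::vector<int> outputs;  // stream ids this task writes
};

// Assign each task the earliest stage in which every one of its input
// streams has already been produced. Tasks sharing a stage have no
// dependencies on each other and can run concurrently. Assumes tasks
// are given in submission order and the dependency graph is acyclic.
std::vector<int> AssignStages(const std::vector<TaskNode>& tasks)
{
    std::map<int, int> producerStage;  // stream id -> stage that wrote it
    std::vector<int> stages(tasks.size(), 0);
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        int stage = 0;
        for (int in : tasks[i].inputs)
        {
            auto it = producerStage.find(in);
            if (it != producerStage.end())
                stage = std::max(stage, it->second + 1);  // after its producer
        }
        stages[i] = stage;
        for (int out : tasks[i].outputs)
            producerStage[out] = stage;
    }
    return stages;
}
```

Streams with no recorded producer are external inputs, so tasks reading only those land in Stage 0, matching the "Stage N+1 depends on Stage N output" rule above.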

19 Analysis: Workflow with Many Tasks
(Slide diagram: seven tasks connected by Streams A through I. Task 1 uses Streams A and B; Task 2 Streams C and D; Task 3 Streams E and F; Task 4 Streams B, D, and G; Task 5 Streams G and H; Task 6 Streams G, F, and I; Task 7 is a sync point.)

20 Analysis: Dependency Graph
(Slide diagram: the same workflow sorted into Stages 0 through 3. Tasks 1, 2, and 3 run in Stage 0; Task 4, which consumes their streams, runs in Stage 1; Tasks 5 and 6 run in Stage 2; sync tasks on Streams G, H, and I complete the workflow in Stage 3.)

21 Performance Notes
Data is broken into blocks -> locality
– Good cache performance
– Optimize block size for prefetch or DMA transfers
– Fits in limited local storage (PS3)
Easily adapts to the number of cores
– Can manage interplay with other systems
Kernels encapsulate processing
– Good target for platform-specific optimization
– Clean solution without #if

22 Usability Notes
Automatically manages data dependencies and simplifies synchronization
Hides nasty platform-specific details
– Prefetch, DMA transfers, processor detection, ...
Learn one API, use it across platforms
– Productivity gains
– Helps us produce quality documentation and samples
– Eases debugging

23 Exploiting Floodgate in the Engine
Find tasks that operate on a single object
– Skinning, morphing, particle systems, ...
Move these to Floodgate: Mesh Modifiers
Launch at some point during execution
– After updating animation and bounds
– After determining visibility
– After physics finishes
– ...
Finish them when needed
– Culling
– Render
– etc.

24 Same Applications, New Performance

         Skinning Objects   Morphing Objects
Before   42 fps             12 fps
After    62 fps             38 fps

The big win is out-of-the-box performance
– The same results could be achieved with much developer time
– Hides details on different platforms (esp. PS3)

25 Example CPU Utilization, Morphing
(Slide shows CPU utilization screenshots before and after the change.)

26 Thread Profiling, Morphing: Before
Some parallelization through a hand-coded parallel update
– Note the high overhead and roughly 85% serial execution

27 Thread Profiling, Morphing: After
Automatic parallelism in the engine
– 4 threads for Floodgate (4 CPUs)
– Roughly 50% of the old serial time replaced with 4x parallelism

28 New Issues
Within the engine, resource usage peaks at certain times
– e.g., between visibility culling and rendering
– Application-level work might fill in the empty spaces
– Physics, global illumination, ...
What about single-processor machines?
What about variable-sized output?
– Instance culling, for example

29 Ongoing Improvements
Improved workflow scheduling
– Mechanisms to enhance application control
Optimizing when tasks change
– Stream lengths change
– Inputs/outputs are changed
More platform-specific improvements
Off-loading more engine work

30 Using Floodgate in a Game
Identify stream processing opportunities
– Places where lots of data is processed with local access patterns
– Places where work can be prepared early but results are not needed until later
Refactor to use Floodgate
– Depending on the task, this can take as little as a few hours
– The hard part is enforcing locality

31 Future-Proofed?
Both CPUs and GPUs can function as stream processors
Easily extends to more processing units
Potential snags are in application changes

32 Questions?
Ask Stephen!
Visit Emergent's booth at the show
– Booth 5716, North Hall, opposite Intel on the central aisle