Multi-threading the Oxman Game Engine
Sean Oxley
CS 523, Fall 2012


Project Overview
- An engine is middleware for game developers, handling low-level details
- Adapting a single-threaded video game engine to run multi-threaded

What's being multi-threaded?
- Physics
  - Spatial queries, synchronization
- Animation
  - Updating of time
  - Keyframe retrieval, PRS generation
  - Skeleton matrix palette generation
- Visibility
  - Visibility AABB generation
  - Spatial tree classification
  - Camera culling pass

Challenges of multi-threaded game engines
- Response time
  - At 60 frames per second (ideal framerate), one frame is ~16.7 ms
  - Threads must compute short bursts of data in as little time as possible
  - Synchronization and start-up costs must be minimized
- Transparency to developer
  - Developer should not be expected to maintain strict access rules (preferably, none at all)
  - Unpredictability of object behavior and data access

The approach
- Utilize data parallelism
  - Pool of threads that all simultaneously work on a task
  - Batch physics, animation, and visibility calculations into arrays
- High independence of work
  - Synchronization code minimal, mostly contained in one class
  - Theoretically scalable to larger numbers of threads
- Non-parallel code runs just as fast as before
- Little impact on developer
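A minimal sketch of the data-parallel split described above, assuming each thread is handed its index and the thread count and works on its own contiguous slice of a batched array. The names `sliceFor` and `updateAnimationTimes` are hypothetical and are not taken from the engine.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns the half-open range [begin, end) of a batch of
// `count` items that thread `threadIndex` out of `threadCount` should process.
std::pair<std::size_t, std::size_t> sliceFor(std::size_t count,
                                             std::size_t threadIndex,
                                             std::size_t threadCount) {
    assert(threadCount > 0 && threadIndex < threadCount);
    const std::size_t base = count / threadCount;   // items every thread gets
    const std::size_t extra = count % threadCount;  // first `extra` threads get one more
    const std::size_t begin =
        threadIndex * base + (threadIndex < extra ? threadIndex : extra);
    const std::size_t end = begin + base + (threadIndex < extra ? 1 : 0);
    return std::make_pair(begin, end);
}

// Hypothetical job body: advance each animation clock in this thread's slice.
// Slices never overlap, so no locking is needed inside the loop.
void updateAnimationTimes(std::vector<float>& times, float dt,
                          std::size_t threadIndex, std::size_t threadCount) {
    const std::pair<std::size_t, std::size_t> range =
        sliceFor(times.size(), threadIndex, threadCount);
    for (std::size_t i = range.first; i < range.second; ++i) {
        times[i] += dt;
    }
}
```

Because every element belongs to exactly one slice, the only synchronization required is at the start and end of the job, which is what keeps the synchronization code small.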

Caveats of the approach
- Third-party libraries
  - Bullet Physics, which the Oxman Engine currently uses for physics, is not thread-safe
  - Intention was to do simultaneous queries, but forced to resort to background queries
  - Large implications for CharacterController performance
- Slight developer discipline requirements
  - Physics
    - Waiting for queries to complete
    - Maintaining data lifetime for the query's duration
  - Availability
    - Results of animation/physics not reflected until the post physics/animation update phase
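The two discipline requirements for physics (wait for the query, keep its data alive) can be illustrated with standard C++ futures. This is only a sketch under assumptions: `RayQuery`, `RayResult`, `castRay`, and `castRayInBackground` are hypothetical names, `castRay` is a toy stand-in for a physics-library call, and the engine's real background-query mechanism is not shown in the slides.

```cpp
#include <cassert>
#include <future>
#include <memory>

struct RayQuery { float fromY; float toY; };  // hypothetical query input
struct RayResult { bool hit; float hitY; };   // hypothetical query output

// Toy stand-in for a physics call: the "world" is a ground plane at y = 0.
RayResult castRay(const RayQuery& q) {
    const bool hit = (q.fromY >= 0.0f && q.toY <= 0.0f);
    return RayResult{hit, 0.0f};
}

// Launches the query in the background. Capturing the shared_ptr by value
// keeps the query data alive for the query's whole duration, even if the
// caller releases its own reference first.
std::future<RayResult> castRayInBackground(std::shared_ptr<const RayQuery> query) {
    return std::async(std::launch::async, [query] { return castRay(*query); });
}
```

The caller must still call `get()` on the returned future before using the result, which corresponds to the "waiting for queries to complete" requirement. With a physics library that is not thread-safe, such background queries would additionally have to be serialized against each other and against the simulation step.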

The details
- A “job” (a static function designed to be run in parallel) is launched by the application through a class called ThreadPool
- Jobs can be launched either blocking or non-blocking
  - If blocking, the calling thread also participates in the job
  - After the job is launched, the caller sleeps until the job is complete
- A job can utilize any number of threads, but only one job can run at a time

The details, cont.
- Threads are created a priori at engine init
  - Avoids overhead of constant thread creation/destruction
- Threads are woken up, and busy-wait until all threads have woken up (avoids race conditions)
- Each thread runs the job function, then goes back to sleep
  - The threads know their index and how many other threads are performing the job
- If the job was blocking, the last thread wakes the calling thread up
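The blocking launch described in these two slides can be sketched roughly as follows. This is not the engine's ThreadPool: the class name `ThreadPoolSketch`, the `JobFn` signature, and the example job `markSlot` are assumptions, and the sketch sleeps on standard condition variables rather than using the wake-up scheme the engine ended up with.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Assumed job signature: a static function told its own index and the thread count.
using JobFn = void (*)(std::size_t threadIndex, std::size_t threadCount, void* userData);

class ThreadPoolSketch {
public:
    // Workers are created once, up front; the calling thread counts as thread 0.
    explicit ThreadPoolSketch(std::size_t workerCount) : total_(workerCount + 1) {
        for (std::size_t i = 1; i < total_; ++i)
            workers_.emplace_back([this, i] { workerLoop(i); });
    }

    ~ThreadPoolSketch() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    // Blocking launch: the caller runs slice 0, then sleeps until every worker is done.
    void runBlocking(JobFn fn, void* userData) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            fn_ = fn;
            data_ = userData;
            remaining_ = total_ - 1;
            ++generation_;  // tells sleeping workers that a new job exists
        }
        wake_.notify_all();
        fn(0, total_, userData);
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return remaining_ == 0; });
    }

private:
    void workerLoop(std::size_t index) {
        std::size_t seen = 0;  // last job generation this worker has run
        for (;;) {
            JobFn fn = nullptr;
            void* data = nullptr;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [&] { return quit_ || generation_ != seen; });
                if (quit_) return;
                seen = generation_;
                fn = fn_;
                data = data_;
            }
            fn(index, total_, data);
            std::lock_guard<std::mutex> lock(mutex_);
            if (--remaining_ == 0) done_.notify_one();  // last worker wakes the caller
        }
    }

    const std::size_t total_;
    std::mutex mutex_;
    std::condition_variable wake_, done_;
    JobFn fn_ = nullptr;
    void* data_ = nullptr;
    std::size_t remaining_ = 0;
    std::size_t generation_ = 0;
    bool quit_ = false;
    std::vector<std::thread> workers_;
};

// Example job: each thread records that it ran by bumping its own slot.
void markSlot(std::size_t threadIndex, std::size_t threadCount, void* userData) {
    std::vector<int>* slots = static_cast<std::vector<int>*>(userData);
    assert(slots->size() == threadCount);
    (*slots)[threadIndex] += 1;
}
```

Only one job runs at a time because `runBlocking` does not return until the counter of remaining workers reaches zero.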

Issues from compiler optimization and out-of-order execution
- Busy-waiting for a variable to reach a desired value
  - Compiler will optimize out the load instruction
  - Causes deadlock; must force a reload
- Out-of-order execution
  - Could be done by either CPU or compiler
  - Cannot rely on execution order without a special instruction; potential for race conditions
  - This instruction is technically not supported on all architectures, but will exist on the engine's target machines
- Both problems solved by the MemoryBarrier() macro on Win32 (macro for inline assembly)
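Both problems can also be shown with standard C++11 atomics, used here as a portable substitute for the Win32 MemoryBarrier() macro rather than as the engine's actual code. The `Mailbox` type and the `producer` and `consumerSpin` functions are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state: plain data published through an atomic flag.
struct Mailbox {
    int payload = 0;                 // ordinary data written before the flag
    std::atomic<bool> ready{false};  // the variable being busy-waited on
};

void producer(Mailbox& box) {
    box.payload = 42;
    // Release: the payload write cannot be reordered after this store.
    box.ready.store(true, std::memory_order_release);
}

int consumerSpin(Mailbox& box) {
    // An atomic load is performed on every iteration; the compiler may not
    // cache the value in a register and spin forever on a stale copy.
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // Acquire: this read cannot be reordered before the load that saw `true`.
    return box.payload;
}
```

With a plain `bool` flag and no barrier, the loop could legally be compiled into a single load followed by an infinite loop, which is the deadlock the slide describes.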

Addressing performance concerns
- In first implementation, threads slept on a condition variable
  - Worked fine with two threads, awful with any more
  - Problems caused by both high kernel and high mutex contention (at least on Windows)
- Busy-waiting didn't work either
  - OS was more likely to preempt, holding up the whole pool
- Compromise made
  - Threads continually check for a job. If no job is there, they call Sleep(0), which gives up their time quantum
  - Makes OS preemption much less likely
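A rough sketch of the compromise: poll for a job and give up the time slice when none is available. `std::this_thread::yield()` is used here as a portable stand-in for the Win32 `Sleep(0)` call; `Job`, `PollingWorker`, and `bumpCounter` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical job record: a function plus the data it works on.
struct Job {
    void (*fn)(void*);
    void* data;
};

struct PollingWorker {
    std::atomic<const Job*> job{nullptr};  // null means no job is posted
    std::atomic<bool> quit{false};
    std::atomic<int> completed{0};

    void run() {
        while (!quit.load(std::memory_order_acquire)) {
            // Take the posted job, if any, and clear the slot in one step.
            const Job* j = job.exchange(nullptr, std::memory_order_acq_rel);
            if (j != nullptr) {
                assert(j->fn != nullptr);
                j->fn(j->data);
                completed.fetch_add(1, std::memory_order_release);
            } else {
                // Nothing to do: hand the rest of the time slice back to the OS.
                std::this_thread::yield();
            }
        }
    }
};

// Example job body: increments the int it is given.
void bumpCounter(void* p) { ++*static_cast<int*>(p); }
```

Unlike a condition-variable wait, the worker never blocks in the kernel, and unlike a pure spin it does not hold the core for its whole quantum while idle.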

The Performance Test
- Test machines:
  - Intel Core i7 920 2.67 GHz (8 logical cores), 16 GB RAM, NVIDIA GeForce GTX 570
  - Pentium Dual T3400 2.16 GHz (2 logical cores), 3 GB RAM, Intel 4 Express (integrated)
- Testing procedure:
  - First test: 10 characters, using CharacterController components
    - More of a realistic scenario
  - Second test: 200 characters, not using the CharacterController
    - More of a stress test for the parallelized engine portions
  - Both tests done once with rendering on, once with rendering off

Core i7 Results, Part 1

Core i7 Results, Part 2

Core i7 Performance Analysis
- First test results were rather inconclusive
  - Differences in frame times could be attributed to error in measurements
  - 8 threads actually slowed performance down
- Second test results much more clear
  - Each step from 1 to 2 to 4 threads showed about a 5 ms improvement
  - 8 threads didn't provide any benefits

Pentium Dual Results, Part 1

Pentium Dual Results, Part 2

Pentium Dual Performance Analysis
- No improvement for the first test
  - Benefit of two threads offset by costs of preemption
- Second test again fared much better
  - A whopping 11-12 ms improvement

Conclusion
- As it currently stands, the engine's multi-threading has limited use
  - Second test is a rather unlikely scenario
  - Scalability concerns
  - Fork-and-join might be more suited for consoles
- Not being able to effectively parallelize physics was a huge issue
- Other, more complex approaches will need to be tried
  - Given more time, I'd try the job queue approach next

The roads not traveled
- Use of a thread-safe physics library
  - Spatial queries could then be batched and executed all in parallel
  - CPU-intensive CharacterControllers could be parallelized
- Different multi-threading approaches
  - Threads dedicated to particular subsystems
    - “Start-up time” issues reduced or eliminated
    - Harder to synchronize, scalability issues
  - Smaller-granularity job queue
    - Work is put in a queue and picked up by individual threads, instead of all threads at once
    - Scales better as the number of threads increases
    - More flexible
    - Difficulty of scalable job creation
- More game-like performance testing scenarios
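For comparison, the smaller-granularity job queue could look roughly like the sketch below, where each worker picks up one queued item at a time. This was not implemented in the project; `JobQueueSketch` and its members are hypothetical, and a mutex-protected `std::queue` is only the simplest possible design.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class JobQueueSketch {
public:
    // Assumes workerCount >= 1; with zero workers, waitIdle() would never return.
    explicit JobQueueSketch(std::size_t workerCount) {
        for (std::size_t i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~JobQueueSketch() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    // Queue one small unit of work; any idle worker may pick it up.
    void push(std::function<void()> job) {
        assert(job && "empty job");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every queued job has finished running.
    void waitIdle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return quit_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // quit requested and nothing left to run
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
            std::lock_guard<std::mutex> lock(mutex_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> jobs_;
    std::size_t pending_ = 0;
    bool quit_ = false;
    std::vector<std::thread> workers_;
};
```

Here threads that finish early simply take the next item instead of idling until the slowest thread is done. The single mutex would likely become a point of contention at the frame-time scales discussed earlier, so a real implementation would probably need a cheaper queue.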