Game Threading Analysis & Methodology
Session: Game Threading Analysis & Methodology Script: Welcome to Game Methodology training from Intel Software College. This module assumes some prior knowledge: familiarity with a couple of threading implementations (the Windows API and Intel Threading Building Blocks), with one software tool from Intel, Parallel Amplifier, and a higher-level, overview understanding of how games are constructed. In this two-hour module we WILL NOT be teaching you how to program with the Windows API. We will not be teaching you how to program with Threading Building Blocks. We will not be teaching you how to program DirectX or Direct3D. This module is intended primarily to show a higher-level method of attack for games and how to use tools to evaluate the effectiveness of the threading strategy. Intel Software College does offer a sister “module” – really a full-day course – that goes into much more detail on how to thread games, and specifically Destroy the Castle. In that course, we look more in depth at how games are constructed and how to actually parallelize one – Destroy the Castle – including the Windows API functions used and where we put the calls to QueueUserWorkItem, WaitForSingleObject, etc. For THIS module we will not be going into that level of detail on how we thread the game – we will be taking a higher-level view. We will assume some knowledge of the Intel® Parallel Amplifier, as that is the tool we use to compare the two threading strategies discussed in the module. In our typical 3-day instructor-led course, we will have spent about 2 hours covering Parallel Amplifier prior to this module. The module also assumes some familiarity with the Windows threading API QueueUserWorkItem, as well as WaitForSingleObject() and WaitForMultipleObjects(). We will also assume knowledge of Intel Threading Building Blocks, as half the labs are threaded with TBB.
We won’t go into low-level details of TBB in this module, but we will talk about some of the key APIs involving TBB tasks. If you don’t already have a working knowledge of these topics, that’s OK – you can still get the high-level recommendations we provide about the threading strategies – but having some background in them will make the lessons much more rewarding. This module is intended to stand alone but can also be used in our larger instructor-led class settings. The standalone module and labs/demos should take on the order of 2 to 2.5 hours to complete.
Objectives At the end of the module you will be able to:
Describe two strategies to parallelize a game using two different threading implementations Evaluate the effectiveness of each strategy with respect to how well each uses the underlying number of cores. We WILL NOT be teaching you how to program with the Windows API, with Threading Building Blocks, or with DirectX or Direct3D. This module is intended primarily to show a higher-level method of attack for games and how to use tools to evaluate the effectiveness of the threading strategy. Script: At the end of the module you will be able to: describe two strategies to parallelize a game using two different threading implementations, and evaluate the effectiveness of each strategy with respect to how well each uses the underlying number of cores.
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: Let’s talk a little about the Intel Parallel Amplifier. Next slide * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Motivation for a Threading Tool
Developing efficient multithreaded applications is hard New performance problems are caused by the interaction between concurrent threads Load imbalance Contention on synchronization objects Threading overhead Need a tool to help! Multi-core Programming: Intel Parallel Amplifier for Explicit Threads Speaker’s Notes Script: Purpose of the Slide Provide some motivation for studying the Intel Parallel Amplifier and related tools such as Parallel Inspector. Details Not only is it hard to get correctly threaded applications; efficient applications are also hard to produce by hand without help. The list shows some of the performance problems posed by multithreaded apps that are not found in serial codes.
Intel® Parallel Amplifier
Component within the Intel® Parallel Studio Has 3 main charters Identify hotspots in an application Identify the level of concurrency in an application Identify locks & waits In this module, we will look specifically at locks & waits Multi-core Programming: Intel Parallel Amplifier for Explicit Threads Speaker’s Notes Script Purpose of the Slide Intel Parallel Amplifier is a component of the larger collection of parallel tools from Intel called Intel Parallel Studio. Other tools in the suite include 1) Parallel Composer, which provides a Visual Studio-compatible compiler and set of libraries, including Threading Building Blocks, that can be used to quickly create parallel codes, 2) Parallel Inspector, which identifies threading pitfalls such as race conditions & deadlocks and includes a memory checker, and 3) Parallel Amplifier. Parallel Amplifier has three main areas of value: 1) it can be used to identify hotspots in your code – so you can see where your app is spending its time – and also provides a call tree summary that can be used to view your function call stack, 2) it can determine the level of concurrency in your app, and 3) it provides a locks & waits analysis that shows the synchronization objects that are holding back performance, as well as an overview of the app’s over- or under-subscription of cores. Details Optimize Performance and Scalability Intel® Parallel Amplifier makes it simple to quickly find multicore performance bottlenecks without needing to know the processor architecture or assembly code. Parallel Amplifier takes away the guesswork and analyzes performance behavior in Windows* applications, providing quick access to scaling information for faster and improved decision making. Fine-tune for optimal performance, ensuring cores are fully exploited and new capabilities are supported.
Intuitive performance profiler specifically designed for threaded applications Use throughout development cycle to maximize threading performance Make significant performance gains that impact customer satisfaction Increase application headroom for richer feature sets and next-gen innovation
Amplifier Locks & waits - Synchronization View
Blue: Over – Over-utilized CPU cores Green: Ideal – Fully utilized CPU cores Orange: OK – Acceptably utilized CPU cores Red: Poor – Underutilized CPU cores Synchronization Object View: This view shows the kinds of synchronization objects and the effects they are having on performance – Threads, Critical Sections, etc., are identified here Script: This is another view of Parallel Amplifier showing the thread timeline. In this view, the horizontal bands represent threads, so we see three threads represented in the diagram above. The light green regions represent the duration of time during which a thread is waiting – essentially doing no useful work. The dark green areas represent time where the thread is active, aka runnable, aka running. The hatched light green areas represent times during which the threads are busy-waiting – as in a spin loop. The yellow transition lines indicate a signal to wake up other threads, such as a lock transfer or perhaps a message sent to another thread. Now we will look at what the vertical bars in Parallel Amplifier indicate
Amplifier Locks & waits – Wait Thread View
Wait Thread View: This shows the threads and functions that are waiting. It also depicts the relative proportion of the app’s time spent undersubscribed, fully subscribed, or oversubscribed Script: This is another view of Parallel Amplifier showing the thread timeline, read the same way as on the previous slide: horizontal bands represent threads, light green regions represent waiting time, dark green areas represent active (runnable, running) time, hatched light green areas represent busy-waiting as in a spin loop, and yellow transition lines indicate signals that wake up other threads. Summary View: This shows a graph indicating how much time the app spends fully subscribed to cores
Concurrency Profile Measure core utilization so users can see how parallel their program really is Relative to the system executing the application Idle: no active threads Serial: a single active thread Under-subscribed: # threads > 1 && # threads < # cores Fully-subscribed*: # threads == # cores Oversubscribed: # threads > # cores Concurrency level is the number of threads that are active (not waiting, sleeping, blocked, etc.) at a given time Script: This screenshot was taken from Parallel Amplifier during the analysis of Destroy the Castle – this analysis was against the version threaded with the Windows threading APIs. This snapshot was taken on a 4-core machine animation Essentially what Parallel Amplifier is measuring for you is core utilization. The core utilization may show some percentage of cores running idle – indicated by the gray vertical bar on the left. The period of time during which execution was limited to a single core is shown here as an orange bar. Periods of time during which the number of threads is less than the number of cores display as red bars. There is a subtle distinction between the orange and red bars: in the orange case only one core is active, while in the red case we have at least a couple of threads running concurrently but not to the full extent of the physical number of cores available for use. The green bars are our friends. They represent the duration of time that cores were fully subscribed – meaning that the number of threads matches the number of cores. Oversubscribed means that we actually had more threads running than we had cores. As a general rule, we would like to structure our algorithms to minimize the idle, serial, undersubscribed, and oversubscribed cases and maximize the fully subscribed case. So – you might ask – how can my application exhibit both undersubscribed and oversubscribed behavior?
The answer is that at different times during program execution, the threading concurrency profile may differ depending on how the application was constructed and how the runtime schedules the threads. * example reflects a 4-core machine
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: The agenda for this module is as follows: First we will briefly cover the Usual Game Structure, reviewing the concepts for rendering in OnFrameMove and where the AI, Physics, etc. fit in. After that, we will talk a little about the Intel Parallel Amplifier, as that is the tool we will use to analyze our threaded code behavior. Next, we will talk about Parallelization with Windows* and POSIX Threads – this will be a high-level overview, not a lot of coding detail, but at least an overview of how to think about parallelizing this game. Next, we will discuss Intel® Threading Building Blocks – a high-level reminder or review for context when this module is delivered standalone. In the instructor-led class it will typically be preceded by a two- or three-hour module on TBB, so if TBB has already been covered in your viewing or teaching of this material then you may decide to skip this short section. After that we will cover Parallelization with Intel® Threading Building Blocks. Then we will wrap up with ideas of where this material might fit in your curriculum. Next slide * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Consists of a loop called the “Game Loop”
Usual Game Structure Consists of a loop called the “Game Loop” Get Input Simulate Render Get Input Physics AI Particles Render Script: Boiled down to its essence, a typical game consists mainly of a main game loop. The main game loop is an infinite loop that can be exited only by certain events & registered callback functions. The loop simply repeats the main tasks identified here as OnFrameMove and Render Click to animate The Render step is not one that we will spend much time trying to parallelize – it already has specialized graphics functions that are handled by graphics hardware, and we don’t see much extra opportunity for parallelism there. The OnFrameMove function is another matter. OnFrameMove is where the game catches up with changes to physics, AI, particles, and other game dynamics. The Physics step is where we typically do any of the kinematics (motion) calculations, including translation of objects, rotation of objects, and the interaction of these 3D objects with each other. Click to Animate In the Destroy the Castle game demo that we will be using in our labs, physics governs the flight of cannon balls, stone blocks, and other objects. AI stands for artificial intelligence. This step is where we do the calculations for any of our simulated on-screen enemies – or, in the case of Destroy the Castle, it is where we do the calculations needed to move computer-controlled bugs around the scene. Particles is the step where we calculate the “dust” and other particles that arise from objects colliding and kicking up dust clouds. So – at its very simplest – the Destroy the Castle game is a game loop that repeats the render, physics, AI, particles sequence over and over again. Naturally, the real game or demo has more complexity than this, but this level of understanding should be enough to get us started in figuring out how to parallelize the game Next slide DTC uses the usual game loop
Lab Activity 1 Build Destroy the Castle
Follow the steps for Lab Activity 1A & 1B in the student guide to build & run Destroy the Castle Script: In this series of lab activities we will be using the Parallel Amplifier to compare various threading strategies. We will begin by seeing how to label or tag events in our code so that they are visible in the Parallel Amplifier. Then we will build two pre-coded versions of the Destroy the Castle demo game: 1) one that is threaded with Windows QueueUserWorkItem and synchronized prior to each frame move, and 2) a version threaded primarily with Threading Building Blocks tasks and the task scheduler, in which the main computing tasks are computed asynchronously from the rendering thread. The first lesson will demonstrate how to use the compiler to instrument the binary so that it will generate a profile suitable for analysis with Parallel Amplifier, one which provides profile information about just the main executable and not all the peripheral DX calls. To do this, I am going to have the compiler instrument my application using the /Qtprofile option. After a successful build, each run of the application will then generate a “tp.tp” file. This tp.tp file will contain all the information Parallel Amplifier needs to construct the threading profile of my application. In order to get this instrumentation built into my project I need to follow these steps: add /Qtprofile to the compiler build line provide Visual Studio with the name of the “libittnotify.lib” library
Limitation of Serial Games for Multi-Core Systems
With clock rates reaching into the multiple GHz range, further increases are becoming harder Parallel hardware has gone mainstream for desktop To exploit the performance potential of multi-core processors, applications must be threaded Script: Now that we’ve seen how to use Parallel Amplifier and the libittnotify API, and before we look more closely at the parallel versions of the application, let’s stop and ask ourselves why we even care about parallelizing a game. Here are some answers: With clock rates reaching into the multiple GHz range, further increases in clock rate are becoming harder – and the only avenue to performance is through parallelism. Parallel hardware has gone mainstream for desktop and is reaching into laptops, embedded processors, and even mobile devices before long. So, to exploit the performance potential of multi-core processors, applications must be threaded. Here’s an interesting thought: serial games get no benefit from multi-core. So – for gamers to effectively put the tremendous power of multi-core to work – games ultimately have to be designed to be threaded. Next foil Serial games get no benefits from multi-core
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: Now let’s have a look at the first approach to parallelizing a game – we’ll be looking at already parallelized code for Destroy the Castle * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Parallelization with Windows* Threads
Updating with double-buffered data structures Decoupling rendering from frame processing Asynchronous update of parts of the scene Task Decomposition Render Physics Particles AI Script: Conceptually, all work is broken down to compute what is needed for a small time slice called a tick. In the diagram above you see broken lines running horizontally – each broken line represents a “tick” of time – and all the work done during the tick will be rendered for each frame. All our computation and thread synchronization is done at the conclusion of each tick in a function called OnFrameMove. The approach we took when parallelizing DTC was to double-buffer the data structures, decoupling the rendering phase from the frame processing that handles the Physics, AI, & Particles – asynchronously updating only parts of the scene. Essentially we approached the parallelization problem as a task decomposition and generated threads to handle each of the primary tasks or functions – so we start rendering in its own thread, have physics running in a different thread, AI in a different thread, and the same for particles. They have to sync with each other periodically, and we do that in the OnFrameMove function via WaitForSingleObject, where we wait for all the computation threads such as Physics & AI to finish their work for the slice of time (the tick) before the rendering goes forward. The key takeaway here is that conceptually we assign threads to each of the main functions – Render, Physics, AI, Particles – and they sync with each other periodically during a tick of time. Again – to learn more detail than this – the student should download and play with the 8 hours of training we offer on Game Threading Next foil ISC has 8 hour course on how to thread DTC with Windows* Threads * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
DTC Baseline Analysis Parallel Amplifier shows serial code dominating the execution Locks & Waits for “Single Thread” run Serial execution dominates Terrible Frame Rate Script: The slide demonstrates a typical result of the first lab activity, where we see serial behavior dominating the execution time. First animation First – what do we mean by Single Thread? The graph shows 8 or more threads! Single thread here means the game logic is running single threaded – a single thread to do AI, Physics, Particles, etc. The other threads are the normal part of gaming, with threads being spawned inside DLLs for DirectX and other support functions, as well as the typical separation of the GUI into its own thread. Not shown on the graphic here – but you will see in the lab that Parallel Amplifier can be used to identify specifically what portions of the baseline serial code are devoted to rendering and what parts are devoted to physics, AI & particles. Animate The takeaway is that serial code is dominating the baseline run of Destroy the Castle, and hopefully you observed this in your lab
Performance Profile Task Decomposition
Locks & Waits for Multithreaded run Thread pool for Render plus 3 “simulate” threads Load balance issue Good frame rate Script: Here we show the results from running Destroy the Castle in the threaded “Task Decomposition” mode. This is where we have Render, Physics, AI, and Particles all running in separate threads. Essentially, we have one thread for the Render portion of the game loop, and three threads for the Simulate part of the game loop (one thread each for Physics, AI, & Particles). The load balance issue can be seen from this. Find the threads associated with Particles, AI, Physics – they are not doing the same amount of work – and in the bigger picture, threads in general here are not all working equally hard. 1st animation Two key observations should be made here: The graph on the right – the concurrency graph – shows that we ARE getting some parallelism – note the green bar indicating that we are fully subscribing our cores about 20% of the time in this mode. Also note that we are undersubscribed – see the tall red bars – for a fair amount of time. Somehow we would like to offer the cores more work to do during these undersubscribed portions. The graph on the bottom shows that the threads do not share the work equally – some threads are quite busy while others are lazy – note the light green boxes on the bottom graph indicating threads that are not active. This load imbalance explains why we see such an underutilization of cores in the graph to the right. There is a lot of dead time where only a single core is busy. This sets up the scenario for what we would like to try next. We’d like to use these cores more effectively, so in the next few foils we will try an experiment that adds some data decomposition to the AI thread in hopes of getting more performance. Animate The takeaway here is that we do get some parallelism benefit – but we also observe low utilization of the 4 cores Next slide Some parallelism but … low utilization of 4 cores
Data Level Parallelism
Nested parallelism Top level - task decomposition Next level - data decomposition Render Physics Particles update several AI units update several AI units AI Script: In the next step towards parallelizing the application – and you can look at the canned code during the lab – we decide to add some extra parallelism by using a data decomposition for the AI tasks. It turns out that the bugs that crawl around our world all do similar kinds of things, and their behavior is controlled in a nice for loop that can be executed in parallel. Each bug runs around the world independently from all the other bugs, so let’s break the loop iterations into chunks that each processor can compute. That way – say we have 99 bugs in our world – we can assign 33 bugs to one thread to update their positions, 33 to the next thread, and 33 to the last thread, so that all 3 computing threads in the pool are kept busy. What we want to do is find more work to do during those periods of time when we have threads sitting around waiting. Animate To accomplish this, we further parallelize the AI thread so that it takes advantage of data decomposition opportunities in data structures that are very loop-centric within the AI code. In DTC, we can assign different numbers of threads to the data decomposition phase – in some runs we allowed 2 threads to attack the AI arrays, in some cases we allowed 4 threads to do the extra work. Now it’s time to analyze the effect of these experiments See the next foil
Lab Activity 2 Use Parallel Amplifier to Display Win version Baseline Locks & Waits Information Follow the steps for Lab Activity 2 in the student guide to analyze the baseline and Task Decomposition profiles you created in Activity 1 Script: Use Parallel Amplifier to Analyze the Baseline and Task Decomposition profiles Follow the steps for Lab Activity 2 in the student guide to analyze the baseline and Task Decomposition profiles you created in Activity 1 This activity should take 5-15 minutes Next slide
Performance Profile AI Decomp. with 2 Threads
Locks & Waits for Multithreaded Plus 2 AI threads run Better core utilization Best frame rate Starting to see oversubscription Load balancing an issue Script: So adding this data decomposition strategy has really improved our core utilization. Here we split the AI work up to be consumed by two threads. We still have a load imbalance as seen before – some threads still doing more work than others – but we see from this view that we are spending less time in the red zone and more in orange & green, so our utilization of cores has improved. More parallelism … higher utilization of 4 cores
Performance Profile AI Decomp. with 4 Threads
Locks & Waits for Multithreaded Plus 4 AI threads run Oversubscription a problem Load balancing issue Frame rate drops Script: Here we just split the AI work up to be consumed by four threads. Animate We still have a load imbalance as seen from the timeline view – some cores still doing more work than others but the balance looks better than with 2 AI threads. We see from the view on the right that our time spent fully subscribed has really climbed, and our under-subscription of cores has gone down, but we are starting to get even more thread oversubscription shown in blue. Too many threads for number of cores
One More Problem: Nested Parallelism
Software components are built from smaller components If each turtle specifies threads... Script: We have seen that the two-pronged approach – higher-level task decomposition plus data decomposition for the AI tasks – was pretty effective at increasing our core utilization. But the approach is not without its problems. As we saw with the AI splitting in the last foil, it can be tricky to get the load balance right – how does the programmer know ahead of time how many AI threads to create? We are already dealing with nested parallelism in order to more fully subscribe the cores, but this comes with increasing code complexity. In the extreme case, if every thread creates threads and those threads create threads, pretty soon it becomes very difficult to manage the complexity, the synchronization costs, and the general thread overhead. At some point we want just enough threads running to keep all the cores fully subscribed – no more, no less. This foil depicts an analogy between creating nested threads and turtles. One turtle spawns off more turtles, and each of those turtles spawns off more turtles, etc. At some point we have resource contention and it’s very difficult to keep track of the turtles. Similarly – in the prior strategy – the developer was responsible for specifying which threads spawned other threads, for analyzing the core utilization, and for trying to balance out the threads – in the limit, this becomes too difficult to manage for more complicated systems. Isn’t there some way to manage all this thread utilization automatically? We’ll take a look at how we might approach this in the upcoming TBB section
Hard to achieve scalability
Disadvantages of … Using Windows* and POSIX Threads for Games Low-Level details (not intuitive) Hard to come up with good design Code often becomes very dependent on a particular OS’s threading facilities Load imbalance Has to be managed manually Oversubscription Multiple components create threads that compete for CPU resources Hard to manage nested parallelism Script: So, our overall assessment of the threading strategy: we got good results, but we would like to see if we can do better. In the labs up to this point we have seen good increases in frame rate, and we did increase our core utilization fairly dramatically. Our evaluation of the threading strategy just analyzed is that it requires knowledge of low-level details and is not too intuitive, it continues to suffer from load imbalance despite some partially effective attempts to counter this manually, and we now have to deal with the issue of oversubscribing the cores. Animation The main takeaway is that this approach makes it hard to achieve scalability Next foil Hard to achieve scalability * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Introduction to code instrumentation Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: Now let’s look at Threading Building Blocks. If you are taking our three-day class and have just completed the 2-3 hour lecture on TBB, then you might wish to skip this overview. Otherwise, if you are taking this module as a standalone piece, you may want to refresh your memory on what Threading Building Blocks is and what it does Next slide * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
What is Intel® Threading Building Blocks?
It is Open Source now! Component in Intel Parallel Composer Threading Abstraction Library Relies on generic programming Provides high-level generic implementation of parallel design patterns and concurrent data structures You specify task patterns instead of threads Library maps your logical tasks onto physical threads, efficiently using cache and balancing load Full support for nested parallelism Targets threading for robust performance Designed to provide scalable performance for computationally intense portions of shrink-wrapped applications Portable across Linux*, Mac OS*, and Windows* Emphasizes scalable data parallel programming Solutions based on task decomposition usually do not scale Script: Intel® Threading Building Blocks (TBB) offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you take advantage of multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability. TBB is open source, and it can be thought of as an abstraction library providing high-level generic implementations of parallel design patterns and concurrent data structures. With TBB you specify tasks rather than threads – thread management is handled for you automatically. It supports nested parallelism. It does provide mechanisms for tackling task decomposition, but it tends to emphasize data parallel programming because, as a general rule, task decomposition does not scale as well as data decomposition. So how does TBB address the issues we saw in the previous example? Let’s look at the next slide * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Components of Intel® Threading Building Blocks
Parallel algorithms Concurrent containers Synchronization primitives Memory allocation Task scheduler Problem Intel® TBB Approach Low-level details Operate with task patterns instead of threads Load imbalance Work-stealing balances load Oversubscription One scheduled thread per hardware thread Script: Threading Building Blocks offers a number of parallel algorithms, concurrent containers, synchronization primitives, memory allocation, and a task scheduler. It can specifically be used to address the challenges we faced in the previous example – the issues of too many low-level details, load imbalance, and core oversubscription. To address low-level details, we will use TBB with task patterns rather than explicitly trying to manage threads ourselves. To address load imbalance, we will take advantage of TBB's ability to do work-stealing for you automatically – this is where work may be reassigned to threads that would otherwise be sitting idle. To address oversubscription, TBB will schedule one thread per hardware thread – and it will manage the scheduling for us. Next slide
Lab Activity 3 Build the TBB version of Destroy the Castle, collect profile data and then analyze the data to compare this parallel strategy to the previous one. Script: Build the TBB version of Destroy the Castle, collect profile data and then analyze the data to compare this parallel strategy to the previous one.
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Introduction to code instrumentation Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: So – let's look at another strategy to parallelize Destroy the Castle. * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Parallelization with TBB
Scheme of parallelization with Windows* Render Physics Particles update several AI units . . . Scheme of parallelization with Intel® TBB update several blocks update several AI units update several particles Render Script: Here we will contrast the two approaches – the previous Windows threads implementation, where we managed the threads ourselves, versus the more task-based approach we will take with TBB. We need to take these results with a grain of salt, because the TBB version does not currently have audio – we are working on it! Also, the TBB version does one render frame per physics frame, per particles frame, per AI frame. The native threading version decouples everything, meaning that it renders the same scene many times per second (with the only difference being user input, which can change the camera position every frame). These caveats notwithstanding, the point of this module is to demonstrate a method of analysis – at least at the 30,000-foot level – that can be done quickly and easily to compare approaches. It just so happens that the code bases had branched sufficiently before this analysis was tackled that actual quantitative numbers, in this instance, would not make sense to compare. BUT the method for comparing is still valid. The upper diagram should be familiar now – it represents the approach we took previously: task decomposition at the high level for Render, Physics, AI & Particles, with data decomposition superimposed within the AI thread, effectively nesting the parallelism. The bottom approach is similar in some respects but different in key ways. In the TBB task approach, we are still going to create an explicit thread for rendering, but now we will create tasks that handle Physics, AI & Particles, plus an additional Sync task. We will be doing calculations on time slices to be rendered in the very near future. Each task will compute several blocks before the render step requires the data.
We get rid of the explicit synchronization step from the previous example (this was a WaitForSingleObject in the OnFrameMove function), and the render thread just renders blocks whose computation is completed. The diagram above depicts the inherent work stealing that may be going on here. At any given time slice, all three computations (Physics, AI, Particles) are being computed – but no thread is tied explicitly to a given task. So we see that the top thread may execute, say, Physics (in green), but at the next time slice another thread may tackle Physics – TBB and the runtime manage the thread pool and dole out work to waiting threads. Now let's look at a conceptual view of how the tasks are generated. * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Task Graph ... MainTask SyncTask PhysicsTask AITask AIFinalTask
ParticlesTask AIBodyTask AIBodyTask Script: First, let's say that a task represents some code to be executed and some data to operate on, as well as a thread that gets assigned to the task to execute the code and crunch its data. We will start off by creating a master task, or main task. 1st animation: the main task will create child tasks – Physics, AI, Particles, and a Sync task. 2nd animation: the main task's job is done – it can now go away while its children compute. 3rd animation: each child spawns off more tasks. Let's look at just the AI task to see what it does – but all the tasks behave similarly to this AI example. The AI task spawns off multiple AIBodyTasks – say, one for each bug crawling around the world (or perhaps just a subset of all the bugs). The AI task also creates an AIFinalTask that does any final wrap-up functions. Next animation: once the AI task has issued the order to create these child tasks, its job is done and it can go away. The AIBodyTasks eventually complete their updates and report back to the AIFinalTask, which does a head count – when all the AIBodyTasks report in, they can go away; their jobs are done. The AIFinalTask, the PhysicsFinalTask, and the ParticlesFinalTask will all eventually report in to the SyncTask, which coordinates the results of the other updates and feeds them to the render thread – and then all tasks can go away; they are done. (Diagram legend: ... = not expanded AIBodyTask; task creation order; task completion signals)
Performance Profile – MultiThreaded w AI run
Locks & Waits for Multithreaded TBB 2 extra AI tasks Good frame rate Good core utilization Script: The end result is good utilization of 4 cores – the timeline view looks mostly solid green, indicating the cores are doing useful work, though there are still some load-balancing issues. This is a huge improvement in core utilization compared to the single-threaded or even the Windows multithreaded version – we are undersubscribed to a small degree, and we see no evidence of oversubscription at all. Overall this is a great improvement. Good utilization of 4 cores
Limitations for Games Intel® TBB is not intended for
I/O-bound processing Hard real-time processing Excessive usage of explicit synchronization However, it is compatible with other threading packages It can be used in concert with Windows* and POSIX threads, etc. Script: Some caveats: Threading Building Blocks is not intended for I/O-bound processing, hard real-time processing, or excessive usage of explicit synchronization. However, it is compatible with other threading packages – it can be used in concert with Windows* and POSIX threads, OpenMP, etc. * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Easy to achieve scalability
Advantages for Games Generic Parallel Algorithms You specify task patterns instead of threads Cross-platform implementation Load balancing Adaptive tuning to variable computation Full support for nested parallelism Efficient use of resources One scheduled thread per hardware thread Effective cache reuse Script: Some advantages of using Intel® Threading Building Blocks for games follow. The highlights: you can take advantage of generic parallel algorithms (such as parallel_sort, parallel_reduce, parallel_scan, etc.). It is also cross-platform, so if you develop, say, OpenGL games, they would be portable between Linux and Windows machines. TBB performed work-stealing on our behalf to automatically take care of load balancing. It can adapt to variable-computation situations, and it fully supports nested parallelism. It can be used to make efficient use of processor resources because it schedules one thread per core. animation The bottom line is that it is easier to achieve scalability using TBB. This wraps up the threading strategy analysis – we will now look at where this fits in a university curriculum and point out related resources you may want to adopt for your classroom. Easy to achieve scalability
Lab Activity 4 Analyze the TBB version with Parallel Amplifier Script:
This lab activity will have the student analyze the TBB version with Parallel Amplifier. Do Lab Activity 5 in the student book now.
Agenda Introduction to Intel® Parallel Amplifier Usual Game Structure
Introduction to code instrumentation Parallelization with Windows* Threads What is Intel® Threading Building Blocks? Parallelization with Intel® Threading Building Blocks Curriculum Application & Summary Script: Next we'll look at curriculum application & summary. * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Use this material in your Game or 3D Graphics curriculums now
Summary Serial games get no benefit from multi-core We analyzed two different parallelization strategies using core utilization as a key metric We want to make you aware of some of the game-threading-related materials we offer, including this 2-hour module, in addition to 8 more hours covering game threading in depth – available through the Intel Software College Academic community Use this material in your Game or 3D Graphics curriculums now * Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Call To Action Think about multi-threading at the beginning of your project Think about scalable performance (for N cores, not just 2 or 4) for years to come Plan now how you can use this methodology in your curriculum Script: I would encourage you to take this message back to your courses – have students think about multi-threading AT THE BEGINNING of their projects, when changes to the design are cheaper. Think about scalable performance – the day is coming when we will have tens or hundreds of cores at our disposal, not just 2 or 4. So – input from the audience: how can you use what you've learned here in your curriculums? What courses do you think these topics belong in? Are these strategies, and the ability to analyze the effectiveness of a parallelization strategy, useful to you? How would you roll this information out to your students – what class do you recommend this for, and how much time should be spent on this material, in your opinion?
?