1
Game Threading with OpenMP 3.0 Tasks
2
Objectives
At the end of the module you will be able to:
Identify the strategic areas in a game code to place an OpenMP parallel region
Identify where to additionally place single constructs
Use the OpenMP task construct to pipeline Physics, AI, & Particles in the Destroy the Castle game demo
Use omp taskwait to synchronize tasks
Improve performance by using an event API
Explain how substituting WaitForSingleObject for one of the omp taskwaits improved performance dramatically
We WILL NOT be teaching you how to program using the Windows API. We will not be teaching you how to program DirectX or Direct3D.
Script: At the end of the module you will be able to:
Describe two strategies to parallelize a game using two different threading implementations
Evaluate the effectiveness of each strategy with respect to how each uses the underlying number of cores
We WILL NOT be teaching you how to program using the Windows API. We will not be teaching you how to program with Threading Building Blocks. We will not be teaching you how to program DirectX or Direct3D. This module is intended primarily to show a higher-level method of attack for games and how to use tools to evaluate the effectiveness of the threading strategy.
3
Agenda
Parallel Studio Hotspot Analysis of DtC
Usual Game Structure
Strategy for parallelizing
First Activity – Add OpenMP parallel & single regions
Second Activity – Add OMP Tasks & Taskwaits
Third Activity – Improve performance by replacing a taskwait with WaitForSingleObject
Conjecture on Future Enhancements
Caveats & Observations
Script: Let's talk a little about the Intel Thread Profiler. Next slide.
* Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
4
Parallel Studio Hotspot Analysis of DtC
The call stack indicates the top 4 candidates for parallelism: WinMain, DXUTMainLoop, DXUTRender3DEnvironment, OnFrameMove
Code exploration shows that OnFrameMove is called about once every frame
It updates Physics, AI, and Particles AND performs the tick for Physics, AI, and Particles
This slide shows the results of running the demo loop of the DtC SerialBase version under an Intel Parallel Studio hotspot analysis. The top three call sites that consume the most time in this run are dGeomPlaneSetParams, OpenSteer…ObstacleGroup, and OpenSteer…VehiclePath. Rather than drilling down into these functions, the strategy is to look at the call stack for each hotspot and ask what the three call stacks have in common. All three are called from WinMain->DXUTMainLoop->DXUTRender3DEnvironment->OnFrameMove. Note also that Castle->Tick and Bugs->tick are prominent in the higher reaches of the call stack. One strategy for beginning parallelization work is to examine these high-level functions to see whether one of them is the best place to parallelize. Parallelizing high in the call stack gives a fair chance of a grain size large enough to minimize parallel overheads. We will parallelize this application using OpenMP, and the primary functions of concern will be WinMain & OnFrameMove. This is where we place most of the omp parallel, single & task constructs that parallelize the application. We also have to synchronize these tasks in other functions, such as MsgProc & KeyboardProc, because the application can exit from those locations, so we place omp taskwait constructs at these exits in addition to strategic locations in OnFrameMove & WinMain.
5
Where to parallelize this Application
The hotspot analysis indicates that WinMain & OnFrameMove may be good candidates for parallelism.
Rendering & DXUTMainLoop are ignored here because rendering has a physical serial constraint: only one graphics device actually does the rendering.
WinMain is chosen as the place to insert the omp parallel & omp single constructs.
OnFrameMove is chosen to pick out Physics, AI & Particles to be done in parallel using task decomposition. The omp task construct is used here together with strategically placed taskwait directives.
OnKeyBoard & MsgProc also have taskwait directives in order to synchronize properly when the code is exited or changes state in a disruptive way.
6
Analysis of DtC
After a brief code review, OnFrameMove looked like it contained calculations that could be pipelined or overlapped
Specifically, Physics, AI, & Particles are each Updated (data written to global variables) and Ticked (updates the game's clock and calls Update & Draw)
The Update functions basically update global state data, which seems like a big headache to synchronize, so I chose to leave them alone
The Tick() functions look like fair game
Based on previous DtC efforts using TBB and QueueUserWorkItem, it seems likely that we can pipeline the physics->tick(), AI->tick(), and particles->tick() functions
7
Usual Game Structure
[Diagram, in flow order]
Initialize Program: Create Window, Load Media, Alloc Mem
Start Game, Set Variables
Get Input: KeyBd, Mouse
Game Logic: Physics, AI
Render Graphics
Cleanup: Free Mem, Close DX
Restart, Keep Going, Exit
DtC uses a typical game structure like that depicted above
8
DtC – Serial flow
[Diagram boxes: Game Logic, Render Graphics; Physics Update, AI Update, Particles Update; Physics Tick, AI Tick, Particles Tick]
DtC uses a typical game structure like that depicted above
9
Strategy for parallelizing
So where is the parallelism? Considerations:
The game is event based and feels unstructured, like code with lots of gotos
Mouse clicks, keyboard events, timers, etc. can halt execution
An OpenMP parallel region and most OpenMP constructs require a “structured block” of code, meaning only one entrance into and only one exit from the block
There appears to be some parallelism in lower nodes, specifically where parallel for could be useful, but placing the omp parallel region in these lower nodes creates and destroys pools of threads once every frame, which is not a good idea
There appears to be parallelism insofar as the Physics, AI & Particle Tick functions could be overlapped
Does the dynamic extent of OpenMP regions extend to callback functions?
10
Strategy for parallelizing
A little testing with a simple Win32 application confirmed that OpenMP can be used and that the dynamic extent of the parallel region can extend to cover callback functions
It’s a good idea to limit the creation and destruction of threads, so we should place the omp parallel region in WinMain and have it execute once
But this is the region where lots of global variables can be initialized, and we want to avoid race conditions, so an omp single region is placed immediately inside the parallel region
Now consider this main thread as something like a boss that goes around and commands workers (omp tasks) to perform work; we now have some latitude in where we put omp tasks
We can use an omp taskwait to guarantee that child tasks are completed at certain locations in the code
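The structure described on this slide can be sketched in portable C++ as follows. This is a minimal illustration, not the DtC source: FrameState, the three Tick stand-ins, OnFrameMoveSketch, and RunGameLoopSketch are hypothetical names, and the callback is an ordinary function pointer rather than a DXUT callback. It shows one parallel region, a single block acting as the boss thread, and orphaned task and taskwait directives inside a callback.

```cpp
#include <cassert>

// Hypothetical per-frame state; each subsystem owns one counter so the
// three tasks never write the same variable.
struct FrameState {
    int physicsTicks = 0;
    int aiTicks = 0;
    int particleTicks = 0;
};

// Stand-ins for physics->tick(), AI->tick(), and particles->tick().
void TickPhysics(FrameState& s)   { ++s.physicsTicks; }
void TickAI(FrameState& s)        { ++s.aiTicks; }
void TickParticles(FrameState& s) { ++s.particleTicks; }

// Callback in the spirit of OnFrameMove. The task and taskwait directives
// are orphaned: they bind to the parallel region active in the caller.
void OnFrameMoveSketch(FrameState& s) {
    #pragma omp task shared(s)
    TickPhysics(s);
    #pragma omp task shared(s)
    TickAI(s);
    #pragma omp task shared(s)
    TickParticles(s);
    #pragma omp taskwait   // all three ticks finish before the frame continues
}

using FrameCallback = void (*)(FrameState&);

// Stand-in for WinMain plus the main loop: one parallel region, with the
// thread inside single acting as the boss that drives every frame.
FrameState RunGameLoopSketch(int frames, FrameCallback onFrameMove) {
    assert(frames >= 0);
    FrameState state;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int f = 0; f < frames; ++f) {
                onFrameMove(state);
            }
        }
    }
    return state;
}
```

Calling RunGameLoopSketch(5, OnFrameMoveSketch) runs five frames. If the compiler's OpenMP switch is off, the pragmas are ignored and the same code runs serially, which is a convenient way to compare against a serial baseline.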
11
DtC – Parallel flow
[Diagram boxes: Game Logic, assign tasks, Render Graphics; Physics Update, AI Update, Particles Update; Wait for signal; Physics Tick, AI Tick, Particles Tick; Signal when done]
DtC uses a typical game structure like that depicted above
12
Strategy for parallelizing
Problem: Event based nature of program means that I need to put omp taskwaits at every possible exit point, including WinMain & OnFrameMove & the calls to MsgProc, KeyboardProc, OnGUIEvent
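A small sketch of the exit-point problem, assuming a hypothetical handler named KeyboardProcSketch and treating 'q' as the exit key (neither comes from the DtC code). The handler places an orphaned taskwait on its exit path so that tick tasks launched earlier have finished before the exit is honored.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical handler in the spirit of KeyboardProc: returns true when the
// key requests an exit. The orphaned taskwait makes the calling task wait
// for its child tasks before the exit path is taken.
bool KeyboardProcSketch(char key) {
    if (key == 'q') {
        #pragma omp taskwait
        return true;
    }
    return false;
}

// Launches one tick task per key press and reports how many ticks had
// finished when the exit was honored (-1 if no exit key was seen).
int TicksFinishedAtExit(const char* keys) {
    assert(keys != nullptr);
    std::atomic<int> finished{0};
    int finishedAtExit = -1;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (const char* k = keys; *k != '\0'; ++k) {
                #pragma omp task shared(finished)
                finished.fetch_add(1);   // stand-in for a tick() call
                if (KeyboardProcSketch(*k)) {
                    finishedAtExit = finished.load();
                    break;   // leaves the for loop, not the single block
                }
            }
        }
    }
    return finishedAtExit;
}
```

The break stays inside the single block, so the structured-block rule (one entrance, one exit) is still respected. Every other exit path in an event-driven program would need the same treatment, which is the burden this slide describes.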
13
Lab Activity 0 – Serial Baseline
Build Destroy the Castle and get the baseline MIN FPS for later comparisons
Right-click on paralleldemo.cpp and remove it from the solution
Add an Existing Item (ParallelDemoOpenMPSerialBase.cpp) to the Main Project
Build All
Run without debugging
Press ‘B’ to run the demo and display framerate
Record the minimum framerate
This is the serial version of the app. On the development system consisting of Intel® Core™2 Quad CPU Q GHz (4 CPU), w 3326 MB RAM, NVidia 9500 GT GPU, DirectX 9.0c, this serial version records FPS for the Minimum framerate
14
Lab Activity 1 –Parallel & Single regions
Build Destroy the Castle
Load the ParallelDemoActivity1.cpp code and see that simple OpenMP constructs can be made to run. Add parallel and single regions for now
Continue to modify ParallelDemoOpenMPSerialBase.cpp, adding OpenMP parallel & single constructs, OR add ParallelDemoOpenMPSolutionActivity1.cpp to the Main Project, removing the SerialBase
Ensure that the Main project compiles with /Qopenmp
Build All
Run without debugging
Press ‘B’ to run the demo and display framerate
This simple-minded OpenMP version records FPS for the Minimum framerate, no better than the serial version. BUT it turns out that one of the omp taskwaits costs us a lot of performance. It can be replaced by a cascade of 3 WaitForSingleObject calls once a system of Events is inserted to set or reset an array of enumerated values that signal when task computations are done. This is the intent of Activity 3. Activity 1 got FPS for MIN on the following system consisting of Intel® Core™2 Quad CPU Q GHz (4 CPU), w 3326 MB RAM, NVidia 9500 GT GPU, DirectX 9.0c
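For orientation, the pattern this activity adds can be reduced to a few lines. The sketch below is not taken from the activity files; RegionCounts and CountRegionEntries are hypothetical names. It counts how many threads enter the parallel region and how many times the single block runs, the property that makes single a safe place for global initialization.

```cpp
#include <cassert>

// Hypothetical result type: how many threads entered the parallel region
// and how many times the single block executed.
struct RegionCounts {
    int threadsEntered = 0;
    int singleExecutions = 0;
};

RegionCounts CountRegionEntries() {
    RegionCounts counts;
    #pragma omp parallel
    {
        // Every thread in the team passes through here once.
        #pragma omp atomic
        ++counts.threadsEntered;

        // Exactly one thread executes this block; the others wait at the
        // implicit barrier that ends the single construct.
        #pragma omp single
        {
            ++counts.singleExecutions;   // stand-in for global initialization
        }
    }
    return counts;
}
```

With OpenMP enabled, threadsEntered reflects the team size while singleExecutions stays at 1. With the OpenMP switch off, both are 1.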
15
Lab Activity 2 – Tasks & Taskwaits
Build Destroy the Castle
Load the ParallelDemoActivity1.cpp code and see that simple OpenMP constructs can be made to run. Add omp task and omp taskwait pragmas
Continue to modify ParallelDemoOpenMPSerialBase.cpp, adding OpenMP task & taskwait constructs, OR add ParallelDemoOpenMPSolutionActivity2.cpp to the Main Project, removing the SerialBase
Ensure that the Main project compiles with /Qopenmp
Build All
Run without debugging
Press ‘B’ to run the demo and display framerate
This simple-minded OpenMP version records FPS for the Minimum framerate, no better than the serial version. BUT it turns out that one of the omp taskwaits costs us a lot of performance. It can be replaced by a cascade of 3 WaitForSingleObject calls once a system of Events is inserted to set or reset an array of enumerated values that signal when task computations are done. This is the intent of Activity 3. Activity 2 got FPS for MIN on the following system consisting of Intel® Core™2 Quad CPU Q GHz (4 CPU), w 3326 MB RAM, NVidia 9500 GT GPU, DirectX 9.0c
16
Lab Activity 3 – Performing OpenMP version
Build Destroy the Castle
See how using enumerated Events in conjunction with OpenMP improves performance dramatically
Continue to modify ParallelDemoOpenMPSerialBase.cpp, adding Event signals & replacing one OpenMP taskwait with WaitForSingleObject calls, OR add ParallelDemoOpenMPSolutionActivity3.cpp to the Main Project, removing the SerialBase
Ensure that the Main project compiles with /Qopenmp
Build All
Run without debugging
Press ‘B’ to run the demo and display framerate
In Activity 3, a taskwait in OnFrameMove is replaced by a cascade of WaitForSingleObject calls:
bProcessPhysics = WaitForSingleObject( s_hTickDoneEvent[EVENT_PHYSICS], ..
bProcessAI = WaitForSingleObject( s_hTickDoneEvent[EVENT_AI], ..
bProcessParticles = WaitForSingleObject( s_hTickDoneEvent[EVENT_PARTICLES],
The WaitFor calls wait for all three of the event handles to be set:
s_hTickDoneEvent[EVENT_PHYSICS]
s_hTickDoneEvent[EVENT_AI]
s_hTickDoneEvent[EVENT_PARTICLES]
In OnFrameMove we make sure to Reset these events (turn them red) before we do a tick(), then we Set them once the tick() functions are complete
We also have to take care to set the values in MsgProc, KeyboardProc, WinMain & OnGUIEvent
Activity 3 got FPS for MIN FPS (about 5X speedup) on the following system consisting of Intel® Core™2 Quad CPU Q GHz (4 CPU), w 3326 MB RAM, NVidia 9500 GT GPU, DirectX 9.0c
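The reset, set, and wait protocol can be sketched without the Windows API. The block below is an assumption-laden stand-in, not the activity code: ManualResetEventSketch is a hypothetical class built on std::mutex and std::condition_variable that imitates a manual-reset Win32 event, and RunFrameWithEvents is a hypothetical frame function. Only the EVENT_PHYSICS, EVENT_AI, and EVENT_PARTICLES names are borrowed from the slide; EVENT_COUNT is added for illustration. In the real demo these roles would be played by Win32 event handles together with ResetEvent, SetEvent, and WaitForSingleObject.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical portable imitation of a manual-reset event.
class ManualResetEventSketch {
public:
    void Set() {
        { std::lock_guard<std::mutex> lock(mutex_); signaled_ = true; }
        cv_.notify_all();
    }
    void Reset() {
        std::lock_guard<std::mutex> lock(mutex_);
        signaled_ = false;
    }
    // Returns true if the event was set before the timeout expired.
    bool WaitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

enum TickEvent { EVENT_PHYSICS, EVENT_AI, EVENT_PARTICLES, EVENT_COUNT };

// One frame: reset each event, launch the tick as a task that sets its
// event when done, then wait on the three events in a cascade.
int RunFrameWithEvents(std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    ManualResetEventSketch tickDone[EVENT_COUNT];
    int results[EVENT_COUNT] = {0, 0, 0};
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        {
            for (int i = 0; i < EVENT_COUNT; ++i) {
                tickDone[i].Reset();                 // "turn it red"
                #pragma omp task shared(tickDone, results) firstprivate(i)
                {
                    results[i] = (i + 1) * 10;       // stand-in for tick()
                    tickDone[i].Set();               // signal when done
                }
            }
            for (int i = 0; i < EVENT_COUNT; ++i) {
                tickDone[i].WaitFor(timeout);        // cascade of waits
            }
        }
    }   // the barrier here also completes any task still outstanding
    return results[0] + results[1] + results[2];
}
```

The wait carries a timeout as a design choice for the sketch: a blocking wait is not an OpenMP task scheduling point, so in a one-thread team a deferred task could not start while its creator is blocked. With the timeout, that case falls through to the barrier at the end of the region instead of hanging.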
17
Conjecture on Future Enhancements
There are more opportunities to parallelize the AIDemo.cpp code that does the updates in a for loop over several hundred bug objects
These domain decomposition opportunities have not yet been exploited
This leads to a caveat: attempts were made to place an orphaned omp for directive strategically in the tick() function of AIDemo.cpp, but the code crashed. So far we have no explanation. Testing revealed that the orphaned directive was inside the parallel region, but the code did not work correctly.
18
Caveats & Observations
Early attempts were made to use a domain decomposition strategy in addition to the pipeline method using tasks. In these attempts, an orphaned omp for directive was strategically used in the tick() function of AIDemo.cpp, but the code crashed. So far we have no explanation. Testing confirmed that the orphaned directive was inside the parallel region, but the code did not work correctly.
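Nothing here diagnoses the crash. One thing that may be worth checking, offered as a guess: the OpenMP specification does not allow a worksharing construct such as omp for to be closely nested inside a single or explicit task region, and in the design described in this module tick() is reached from such regions. A decomposition that stays within the task model avoids worksharing altogether. The sketch below assumes hypothetical names (TickBugsSketch, RunBugTick), a hypothetical chunk size of 64, and plain integers in place of bug objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-bug update loop in the AI tick().
// Each chunk of bugs becomes one task; taskwait joins them before returning.
void TickBugsSketch(std::vector<int>& bugPositions, int step) {
    const std::size_t chunk = 64;   // assumed grain size, not tuned
    for (std::size_t begin = 0; begin < bugPositions.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, bugPositions.size());
        #pragma omp task shared(bugPositions) firstprivate(begin, end, step)
        {
            for (std::size_t i = begin; i < end; ++i) {
                bugPositions[i] += step;   // chunks never overlap
            }
        }
    }
    #pragma omp taskwait
}

// Driver that mirrors the module's structure: one parallel region with the
// tick called from inside single.
std::vector<int> RunBugTick(std::size_t bugCount, int step) {
    std::vector<int> bugs(bugCount, 0);
    #pragma omp parallel
    {
        #pragma omp single
        TickBugsSketch(bugs, step);
    }
    assert(bugs.size() == bugCount);
    return bugs;
}
```

Because TickBugsSketch contains only task and taskwait directives, it can also be called from inside another task, which an orphaned omp for cannot.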
20
Parallel Studio Callstack of Hot Functions