Windows Application Development
Chapter 9: Synchronization Performance Impact and Guidelines
JMH Associates © 2004, All rights reserved

OBJECTIVES

Upon completion of this session, you will be able to:
- Avoid unnecessary synchronization
- Avoid SMP performance pitfalls
- Describe performance impact and factors on single and multiple processor systems
- Improve multithreaded and SMP performance

Contents

1. CRITICAL_SECTION – Mutex Tradeoffs
2. SMP Impact
3. Semaphores to Reduce Thread Contention
4. Processor Affinity
5. Tuning with CS Spin Counts
6. Performance Guidelines and Pitfalls
7. Lab Exercise 9-1

1. CRITICAL_SECTION – Mutex Tradeoffs

Conventional wisdom:
- Critical sections are much faster than mutexes
- They operate in user space, not kernel space
Non-conventional wisdom:
- With 5 or more contending threads, a mutex may perform better
- Mutexes show more linear behavior as the number of contending threads grows

CS-Mutex Performance Comparison

- TimedMutualExclusion.exe
  - See the code comments for more explanation
- Four additional simple programs to compare performance
  - StatsNS.c: No synchronization. Wrong, but runs as fast as possible
  - StatsCS.c: Uses a CRITICAL_SECTION
  - StatsIN.c: Uses interlocked functions
  - StatsMX.c: Uses a mutex

Experimental Results

[Figure: Critical sections with multiple threads, one processor (1P). Results vary.]

More Experimental Results

[Figure: Critical sections vs. mutexes with multiple threads, one processor (1P).]

2. SMP Impact

- SMP allows transparent use of multiple processors
  - The kernel scheduler assigns ready threads to processors
- Intel Xeon® has multiprocessing on a single processor
  - “Hyper-Threading” (also on Pentium 4®)
- Note: A thread may run on several processors in its lifetime
- Result:
  - Performance gain (sometimes). Example: sortMT
  - Dramatic performance loss (sometimes)

SMP Potential Negative Impact

[Figure: CSs and mutexes on 1, 2, and 4 processors. Panels: CS, Mutex.]

3. Semaphores to Reduce Thread Contention

- Scenario: N worker threads contend for a shared resource
  - Using a CS or mutex
  - Performance degradation is severe
- Distinct worker threads provide a “natural” solution
  - Simple conceptually and simple to implement
- Problem:
  - Improve performance
  - Retain the simplicity
- One solution: “Semaphore throttle”
  - Use a semaphore: Max count limits the running threads

A Semaphore Throttle

- Boss thread creates a semaphore
  - Max value and initial value set to a “small number” (such as 4)
  - Use the number of processors, or a tunable value
- Worker threads get a semaphore unit before working
  - Excess workers wait on the semaphore rather than contending for the mutex (or CS)

    while (TRUE) { // Worker loop
        WaitForSingleObject (hSem, INFINITE);   // Get a throttle unit first
        WaitForSingleObject (hMutex, INFINITE); // Then lock the shared resource
        // Get work unit, etc....
        ReleaseMutex (hMutex);
        ...
        ReleaseSemaphore (hSem, 1, NULL);       // Return the throttle unit
    } // End of worker loop

Semaphore Throttle Variations

- Some workers may acquire multiple units
  - Concept: These workers use more resources
  - Caution: Deadlock risk
- Boss thread tunes dynamically
  - Decreases or increases the number of active workers
  - By waiting on or releasing semaphore units
  - Note: Max value is set once at initialization
- If max count is 1, the mutex is redundant
- Often the best SMP solution
- TimedMutualExclusion: Sixth parameter
  - Max # active workers

Semaphore Throttle Results

[Figure: Semaphore throttle results, one processor.]

4. Processor Affinity

- Process-specific “process affinity mask” (PrAM)
- Thread-specific “thread processor affinity mask” (ThAM)
  - Threads can only run on permitted processor(s)
  - ThAM <= PrAM (the thread mask must be a subset of the process mask)
- System affinity mask: Each bit represents a configured processor
  - Process AM <= System AM

    BOOL GetProcessAffinityMask(
        HANDLE hProcess,
        PDWORD_PTR lpProcessAffinityMask,
        PDWORD_PTR lpSystemAffinityMask );

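The containment rules on this slide (ThAM <= PrAM, Process AM <= System AM) are subset relations on bit masks, not numeric comparisons. A minimal sketch in portable C, assuming masks are held in `uintptr_t` as a stand-in for the Windows mask type; the names `affinity_mask`, `mask_is_subset`, and `mask_processor_count` are hypothetical helpers, not part of the Windows API:

```c
#include <stdint.h>

/* Stand-in for the Windows affinity mask type (an assumption, for portability). */
typedef uintptr_t affinity_mask;

/* Returns 1 if every processor permitted by 'inner' is also permitted
   by 'outer', i.e. inner is a subset of outer; otherwise returns 0. */
int mask_is_subset(affinity_mask inner, affinity_mask outer)
{
    return (inner & ~outer) == 0;
}

/* Counts the processors (set bits) that a mask permits. */
unsigned mask_processor_count(affinity_mask mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

Note that a mask of 0x3 is numerically smaller than 0x4 but is not a subset of it, which is why the check uses bit operations. On Windows, the two masks returned by GetProcessAffinityMask could be passed to helpers like these.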
Setting Thread Processor Affinity

    DWORD_PTR SetThreadAffinityMask (
        HANDLE hThread,
        DWORD_PTR dwThreadAffinityMask );

- Returns the previous thread affinity mask, or 0 on failure
- ThAM <= PrAM
- Only affects future thread scheduling
  - Target thread could already be running on a prohibited processor
- Question (exercise for the reader)
  - How are Intel Xeon processors represented?
  - Hyper-Threading runs multiple threads concurrently
- Distinct threads can have reserved processor(s)
- Also see SetThreadIdealProcessor

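One way to give distinct threads their own processor is to derive a single-bit thread mask from the process mask, so that the ThAM <= PrAM rule holds by construction. The following is a sketch only, in portable C; `affinity_mask` and `nth_permitted_processor` are hypothetical names, and `uintptr_t` is assumed as a stand-in for the Windows mask type:

```c
#include <stdint.h>

/* Stand-in for the Windows affinity mask type (an assumption, for portability). */
typedef uintptr_t affinity_mask;

/* Returns a mask with exactly one bit set: the index-th (0-based) processor
   permitted by process_mask, wrapping around when index exceeds the number
   of permitted processors. Returns 0 if process_mask permits none. */
affinity_mask nth_permitted_processor(affinity_mask process_mask, unsigned index)
{
    unsigned count = 0;
    affinity_mask m;

    for (m = process_mask; m != 0; m &= m - 1)
        ++count;             /* count the permitted processors */
    if (count == 0)
        return 0;

    index %= count;          /* wrap worker numbers onto available processors */
    m = process_mask;
    while (index-- > 0)
        m &= m - 1;          /* drop the lowest permitted processor */
    return m & (~m + 1);     /* isolate the lowest remaining set bit */
}
```

On Windows, the result for worker number `index` could be passed as dwThreadAffinityMask to SetThreadAffinityMask. Whether pinning threads this way helps or hurts is something to measure, not assume.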
5. Tuning with CS Spin Counts

- CS operations run in user space, not kernel space
- EnterCriticalSection() uses a “spin lock”
  - InterlockedCompareExchange() sets the lock only if it is reset
- If previously locked:
  - Single processor: Wait in the kernel until unlocked (SC == 0)
  - SMP: Try again; kernel wait only after “spin count” attempts
- Single processor advantage: Fast if no wait is required
- SMP advantage: Avoids contention between processors
- Guideline: Use for short duration, high contention locks
  - Example value: 4000

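The spin-then-wait behavior described on this slide can be modeled in a few lines. This is an illustrative model only, not the actual EnterCriticalSection implementation: it uses C11 atomics in place of InterlockedCompareExchange, and the names `spin_model`, `spin_then_give_up`, and `spin_unlock` are hypothetical. Where the real CS would block in the kernel, the model simply returns false.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;      /* 0 = reset (unlocked), 1 = set (locked) */
    unsigned spin_count;    /* retries allowed before a kernel wait */
} spin_model;

/* One compare-and-exchange attempt: set the lock only if it is reset. */
static bool try_lock_once(spin_model *s)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&s->locked, &expected, 1);
}

/* Returns true if the lock was acquired in user space. Returns false when
   the initial attempt and all spin_count retries failed, which is the point
   where a real lock would wait in the kernel. With spin_count == 0 (the
   single-processor case) only the initial attempt is made. */
bool spin_then_give_up(spin_model *s)
{
    if (try_lock_once(s))
        return true;
    for (unsigned i = 0; i < s->spin_count; ++i) {
        if (try_lock_once(s))
            return true;
    }
    return false;
}

/* Resets the lock so another thread's attempt can succeed. */
void spin_unlock(spin_model *s)
{
    atomic_store(&s->locked, 0);
}
```

Spinning only pays off when the holder runs on another processor and releases the lock quickly, which is why the guideline targets short duration, high contention locks on SMP systems.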
Setting the Spin Count

- Value is ignored on a single processor system
- Initial value: Replace the InitializeCriticalSection call with:

    BOOL InitializeCriticalSectionAndSpinCount(
        LPCRITICAL_SECTION lpCriticalSection,
        DWORD dwSpinCount );

- Dynamic spin count adjustment (returns the previous spin count):

    DWORD SetCriticalSectionSpinCount(
        LPCRITICAL_SECTION lpCriticalSection,
        DWORD dwSpinCount );

6. Performance Guidelines and Pitfalls

- Avoid performance problems
  - Beware of conjecture about performance
  - Locking is expensive; use it only as required
  - Hold a mutex as long as needed, but no longer
  - High contention hinders performance
  - Beware of global locks
- Synchronization will impact program performance
  - Be especially careful when running on SMP systems
- Reduce thread contention
  - Avoid too many active threads
  - Use a semaphore to limit worker or server threads

7. Lab Exercise 9-1

Use TimedMutualExclusion
1. Obtain your own results
   - CS vs. mutex
   - Single processor vs. SMP vs. Xeon (if available)
2. Extend to add spin count tuning
   - TimedMutualExclusionSC
   - Add a parameter for the initial SC
   - Tune performance on an SMP system
Alternative: Gather data from StatsCS, etc. to assess synchronization performance impact
