Behavior of Synchronization Methods in Commonly Used Languages and Systems
Yiannis Nikolakopoulos
Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
Developing a multithreaded application…
– The boss wants .NET
– The client wants speed… (C++?)
– Java is nice
– Multicores everywhere
Developing a multithreaded application…
– The worker threads need to access data: Concurrent Data Structures
– Then we need Synchronization
Implementing Concurrent Data Structures
– Implementation: Coarse Grain Locking, Fine Grain Locking, Test And Set, Array Locks, and more!
– Performance Bottleneck
Implementing Concurrent Data Structures
– Implementation: Coarse Grain Locking, Fine Grain Locking, Test And Set, Array Locks, Lock Free, and more!
– Hardware platform
– Which is the fastest/most scalable?
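To make two of the listed methods concrete, here is a minimal sketch of a test-and-set (TAS) lock and its test-and-test-and-set (TTAS) variant using `std::atomic`. This is an illustration only, not the code used in the study; the class names and the `count_with_lock` helper are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set (TAS) lock: every acquisition attempt is an atomic
// read-modify-write on the shared flag.
class TasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // spin: each iteration writes to the shared flag
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Test-and-test-and-set (TTAS) lock: spin on a plain load and attempt the
// atomic exchange only when the lock looks free.
class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            while (locked_.load(std::memory_order_relaxed)) {
                // spin on the locally cached value
            }
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Hypothetical helper: `threads` workers each increment a shared counter
// `iters` times under a lock of type Lock; returns the final count.
template <typename Lock>
long count_with_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    Lock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Both locks provide mutual exclusion; they differ in how much write traffic the waiting threads generate on the shared flag, which is one reason their behavior can diverge under contention.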
Implementing concurrent data structures
Problem Statement
How does the interplay of the above parameters and the different synchronization methods affect the performance and behavior of concurrent data structures?
Outline
– Introduction
– Experiment Setup
– Highlights of Study and Results
– Conclusion
Which data structures to study?
Structures that represent different levels of contention:
– Queue: 1 or 2 contention points
– Hash table: multiple contention points
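As an illustration of a queue with two contention points, the sketch below uses one lock for the head and one for the tail, in the style of the classic two-lock queue with a dummy node. It is an assumption-laden example, not the implementation measured in the study; `TwoLockQueue` is a hypothetical name, and `T` is assumed to be default-constructible.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <utility>

template <typename T>
class TwoLockQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};  // atomic: read by dequeuers, written by enqueuers
    };
    Node* head_;             // always a dummy node; guarded by head_mutex_
    Node* tail_;             // last node; guarded by tail_mutex_
    std::mutex head_mutex_;  // contention point for dequeue
    std::mutex tail_mutex_;  // contention point for enqueue
public:
    TwoLockQueue() : head_(new Node), tail_(head_) {}
    TwoLockQueue(const TwoLockQueue&) = delete;
    TwoLockQueue& operator=(const TwoLockQueue&) = delete;
    ~TwoLockQueue() {
        while (head_ != nullptr) {
            Node* next = head_->next.load();
            delete head_;
            head_ = next;
        }
    }

    void enqueue(T value) {
        Node* node = new Node;
        node->value = std::move(value);
        std::lock_guard<std::mutex> guard(tail_mutex_);
        tail_->next.store(node, std::memory_order_release);
        tail_ = node;
    }

    // Returns false if the queue is empty; otherwise moves the front item into `out`.
    bool dequeue(T& out) {
        std::lock_guard<std::mutex> guard(head_mutex_);
        Node* first = head_->next.load(std::memory_order_acquire);
        if (first == nullptr) return false;
        out = std::move(first->value);
        delete head_;   // retire the old dummy
        head_ = first;  // the dequeued node becomes the new dummy
        return true;
    }
};
```

Guarding both ends with a single mutex would give the one-contention-point variant: simpler, but enqueuers and dequeuers would then block each other.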
How do we choose an implementation?
Possible criteria:
– Framework dependencies
– Programmability
– “Good” performance
Interpreting “good”
– Throughput: the more operations completed per time unit, the better.
– Is this enough?
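A rough sketch of how throughput could be measured: run a fixed number of threads for a fixed interval, count completed operations per thread, and divide the total by the interval. The helper names `run_for` and `throughput_per_ms` are hypothetical, and the timing is approximate because thread start-up and shutdown are not excluded.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Runs `threads` workers that call `op` in a loop for roughly `interval`
// and returns the number of completed operations per thread.
template <typename Op>
std::vector<std::uint64_t> run_for(int threads, std::chrono::milliseconds interval, Op op) {
    assert(threads > 0);
    std::atomic<bool> stop{false};
    std::vector<std::uint64_t> counts(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            std::uint64_t local = 0;  // thread-private counter, published once at the end
            while (!stop.load(std::memory_order_relaxed)) {
                op();
                ++local;
            }
            counts[static_cast<std::size_t>(t)] = local;
        });
    }
    std::this_thread::sleep_for(interval);
    stop.store(true);
    for (auto& w : workers) w.join();
    return counts;
}

// Throughput in operations per millisecond over all threads.
double throughput_per_ms(const std::vector<std::uint64_t>& counts,
                         std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    std::uint64_t total = 0;
    for (std::uint64_t c : counts) total += c;
    return static_cast<double>(total) / static_cast<double>(interval.count());
}
```

Keeping the per-thread counts, rather than only the total, is what later makes it possible to ask whether all threads progressed equally.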
Non-fairness
What to measure?
– Operations by thread i
– Average operations per thread
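The two quantities named on this slide can be combined into a single fairness score. The formula itself is not recoverable from the slide, so the sketch below uses one plausible definition as an explicit assumption: the smaller of (minimum per-thread operations / average) and (average / maximum per-thread operations). A value of 1 means all threads completed the same number of operations; values near 0 mean at least one thread starved or dominated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// ASSUMED fairness definition (not taken from the slide):
//   fairness = min( min_i(n_i) / avg , avg / max_i(n_i) )
// where n_i is the number of operations completed by thread i.
double fairness(const std::vector<std::uint64_t>& ops_by_thread) {
    assert(!ops_by_thread.empty());
    const double total =
        std::accumulate(ops_by_thread.begin(), ops_by_thread.end(), 0.0);
    const double average = total / static_cast<double>(ops_by_thread.size());
    const double min_ops = static_cast<double>(
        *std::min_element(ops_by_thread.begin(), ops_by_thread.end()));
    const double max_ops = static_cast<double>(
        *std::max_element(ops_by_thread.begin(), ops_by_thread.end()));
    if (max_ops == 0.0) return 1.0;  // no thread did any work: treated as fair
    return std::min(min_ops / average, average / max_ops);
}
```

For example, per-thread counts of 50, 100, and 150 have an average of 100, so the score is min(50/100, 100/150) = 0.5.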
Implementation Parameters
– Programming Environments: C++, Java, C# (.NET, Mono)
– Synchronization Methods: TAS, TTAS, Lock-free, Array lock; PMutex, Lock-free memory management; Reentrant, synchronized; lock construct, Mutex
– NUMA Architectures: Intel Nehalem, 2 x 6 core (24 HW threads); AMD Bulldozer, 4 x 12 core (48 HW threads)
– Do they influence fairness?
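Of the methods listed, the array lock is the one most directly tied to fairness, since waiting threads are served in arrival order. Below is a minimal sketch of an array-based queue lock, assuming at most `max_threads` threads contend at once and a 64-byte cache line; `ArrayLock` and `count_with_array_lock` are hypothetical names, not the study's code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Array-based queue lock: each waiter spins on its own slot, and the lock is
// handed over in ticket (FIFO) order.
class ArrayLock {
    struct Slot {
        std::atomic<bool> may_enter{false};
        char pad[64 - sizeof(std::atomic<bool>)];  // assumed 64-byte cache line
    };
    std::vector<Slot> slots_;
    std::atomic<std::size_t> next_ticket_{0};
public:
    // Correct only while at most `max_threads` threads use the lock concurrently.
    explicit ArrayLock(std::size_t max_threads) : slots_(max_threads) {
        assert(max_threads > 0);
        slots_[0].may_enter.store(true);
    }

    // Returns the slot index, which must be passed back to unlock().
    std::size_t lock() {
        const std::size_t slot = next_ticket_.fetch_add(1) % slots_.size();
        while (!slots_[slot].may_enter.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // politeness when threads outnumber cores
        }
        return slot;
    }

    void unlock(std::size_t slot) {
        slots_[slot].may_enter.store(false, std::memory_order_relaxed);
        slots_[(slot + 1) % slots_.size()].may_enter.store(true, std::memory_order_release);
    }
};

// Hypothetical helper: shared counter incremented under the array lock.
long count_with_array_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    ArrayLock lock(static_cast<std::size_t>(threads));
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                const std::size_t slot = lock.lock();
                ++counter;
                lock.unlock(slot);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Because each thread spins on a separate slot, a release touches only the successor's slot rather than a flag shared by all waiters.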
Experiment Parameters
– Different levels of contention
– Number of threads
– Measured time intervals
Outline
– Introduction
– Experiment Setup
– Highlights of Study and Results
  – Queue: Fairness, Intel vs AMD, Throughput vs Fairness
  – Hash Table: Intel vs AMD, Scalability
– Conclusion
Observations: Queue
– Fairness can change across different time intervals
– 24 Threads, High contention
Observations: Queue Fairness
– Significantly different fairness behavior on different architectures
– 24 Threads, High contention
– Lock-free is less affected in this case
Queue: Throughput vs Fairness
[Figure: two plots for C++ comparing TTAS, Lock-free, and PMutex over the number of threads. Panel "Fairness 0.6 s, Intel": fairness. Panel "Throughput": operations per ms (thousands).]
Observations: Hash table
– Operations are distributed in different buckets
– Things get interesting when #threads > #buckets
– Tradeoff between throughput and fairness
  – Different winners and losers
  – Contention is lowered in the linked list components
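The bucket observation can be illustrated with a hash set in which every bucket has its own lock and its own linked list. This is a simplified sketch under assumed names (`BucketLockedSet`), not the hash table measured in the study: threads that hash to different buckets never contend, but once there are more threads than buckets, some threads must share a bucket lock whenever all of them operate at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <vector>

class BucketLockedSet {
    struct Bucket {
        std::mutex mutex;     // one contention point per bucket
        std::list<int> keys;  // linked-list component of the bucket
    };
    std::vector<Bucket> buckets_;

    Bucket& bucket_for(int key) {
        return buckets_[std::hash<int>{}(key) % buckets_.size()];
    }
public:
    explicit BucketLockedSet(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    // Returns true if the key was newly inserted.
    bool insert(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.mutex);
        if (std::find(b.keys.begin(), b.keys.end(), key) != b.keys.end()) return false;
        b.keys.push_back(key);
        return true;
    }

    bool contains(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.mutex);
        return std::find(b.keys.begin(), b.keys.end(), key) != b.keys.end();
    }

    // Returns true if the key was present and has been removed.
    bool remove(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.mutex);
        auto it = std::find(b.keys.begin(), b.keys.end(), key);
        if (it == b.keys.end()) return false;
        b.keys.erase(it);
        return true;
    }
};
```

The bucket count is fixed at construction here; resizing would need a scheme for coordinating all bucket locks and is left out.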
Observations: Hash table
– Fairness differences in Hash table across architectures
– 24 Threads, High contention
– Lock-free is again not affected
In C++, custom memory management and lock-free implementations excel in scalability and performance.
Conclusion
– Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots
– Scalability via more complex, inherently parallel designs and implementations
– Tradeoff between throughput and fairness
  – LF Hash table
  – Reentrant lock vs Array Lock vs LF Queue
– Fairness can be heavily influenced by HW
  – Interesting exceptions
– Which is the fastest/most scalable? Is fairness influenced by NUMA?