Critical Section Characterization and Acceleration in Real World Applications
Timothy Zhu and Huapeng Zhou
Motivation
Performance of multithreaded applications is limited by critical sections
We provide an analytical model to analyze the impact of critical sections on performance
Model
N = number of threads (e.g. 4)
R = time spent waiting on and executing critical sections
Z = time spent executing the non-critical section
X = throughput = number of iterations around the loop per second
D = sum of the durations of executing the critical sections
Dmax = maximum duration of executing a critical section

Theoretical Bounds
E[R] ≥ max(D, N*Dmax - E[Z])
X ≤ min(N/(D + E[Z]), 1/Dmax)
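
The two bounds can be evaluated directly once D, Dmax and E[Z] are known. The sketch below is only an illustration: the function names and the example numbers (D = 2, Dmax = 1, E[Z] = 8) are hypothetical and do not come from the measurements. It also computes the thread count N* = (D + E[Z]) / Dmax at which the two terms inside each bound cross over, which follows algebraically from the formulas above.

```python
def response_time_lower_bound(n, d, d_max, ez):
    """Lower bound on E[R]: max(D, N*Dmax - E[Z])."""
    return max(d, n * d_max - ez)


def throughput_upper_bound(n, d, d_max, ez):
    """Upper bound on X: min(N/(D + E[Z]), 1/Dmax)."""
    return min(n / (d + ez), 1.0 / d_max)


def crossover_threads(d, d_max, ez):
    """Thread count N* where N/(D + E[Z]) equals 1/Dmax.

    Below N* the bounds are set by D and E[Z]; above N* they are
    set by the bottleneck critical section (Dmax).
    """
    return (d + ez) / d_max


if __name__ == "__main__":
    # Hypothetical values in arbitrary time units, for illustration only.
    D, D_MAX, EZ = 2, 1, 8
    print("N* =", crossover_threads(D, D_MAX, EZ))
    for n in (4, 20):
        print(n,
              response_time_lower_bound(n, D, D_MAX, EZ),
              throughput_upper_bound(n, D, D_MAX, EZ))
```

With these illustrative values, 4 threads sit below N* and the throughput bound still grows with N, while 20 threads sit above it and the bound is capped at 1/Dmax.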
Methodology
Implemented a hooking library around pthread
  Interposes common mutex and condition variable calls
  We also experiment with an alternative spinlock implementation to lower latency
Raw measurements
  mutex address, thread id, return address, start time, lock time, unlock time
Benchmarks
  memcached
  MySQL (oltp-simple, oltp-complex, oltp-nontrx)
Runs on dual-socket 6-core Xeon processors (with Hyper-Threading, the OS sees 24 cores) with 48 GB RAM
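
One plausible way to post-process the raw measurements is sketched below. It is not the authors' tooling: the record layout, the field names, and the sample values are hypothetical, and it assumes that "start time" is when the lock call was issued, "lock time" is when the mutex was acquired, and "unlock time" is when it was released. Under those assumptions, waiting time is lock time minus start time and holding time is unlock time minus lock time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LockRecord:
    """One logged critical section (field meanings are assumptions)."""
    mutex_addr: int
    thread_id: int
    return_addr: int
    start_time: int   # lock call issued
    lock_time: int    # mutex acquired
    unlock_time: int  # mutex released


def summarize(records):
    """Aggregate count, total wait and total hold time per mutex address."""
    stats = defaultdict(lambda: {"count": 0, "wait": 0, "hold": 0})
    for r in records:
        s = stats[r.mutex_addr]
        s["count"] += 1
        s["wait"] += r.lock_time - r.start_time
        s["hold"] += r.unlock_time - r.lock_time
    return dict(stats)


def bottleneck_candidate(stats):
    """Return the mutex address with the largest total hold time."""
    return max(stats, key=lambda addr: stats[addr]["hold"])


# Hypothetical sample log, not real measurements.
SAMPLE_RECORDS = [
    LockRecord(0x1000, 1, 0x400A10, 0, 5, 105),
    LockRecord(0x1000, 2, 0x400A10, 10, 105, 200),
    LockRecord(0x2000, 1, 0x400B20, 210, 211, 221),
]

if __name__ == "__main__":
    summary = summarize(SAMPLE_RECORDS)
    for addr, s in summary.items():
        print(hex(addr), s)
    print("bottleneck candidate:", hex(bottleneck_candidate(summary)))
```

Grouping by mutex address keeps the summary independent of which call site took the lock; grouping by return address instead would attribute the time to individual critical sections in the code.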
Experimental Results I
Memcached has one bottleneck critical section with demand Dmax
E[R] increases as N increases since more threads are waiting
Overall throughput is also affected by other shared resources

N     E[R]     E[Z]     X           D      Dmax
2     1020     8739     0.000205    867    767
3     2174     7697     0.000304    1373   1197
5     4485     7117     0.000431    1668   1359
7     7710     7462     0.000461    1792   1439
11    21668    9878     0.000349    2385   1876
13    28092    12118    0.000323    2527   2005
21    61033    19086    0.000262    3311   2570
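
As a sanity check, the memcached rows can be tested against the theoretical bounds E[R] ≥ max(D, N*Dmax - E[Z]) and X ≤ min(N/(D + E[Z]), 1/Dmax). The sketch below copies the table values as reported; the function name and the 1% tolerance are choices made here, not part of the original analysis. It also checks the closed-loop relation X = N / (E[R] + E[Z]) (Little's law), which the reported values appear to satisfy up to rounding.

```python
# (N, E[R], E[Z], X, D, Dmax) rows copied from the memcached results table.
ROWS = [
    (2, 1020, 8739, 0.000205, 867, 767),
    (3, 2174, 7697, 0.000304, 1373, 1197),
    (5, 4485, 7117, 0.000431, 1668, 1359),
    (7, 7710, 7462, 0.000461, 1792, 1439),
    (11, 21668, 9878, 0.000349, 2385, 1876),
    (13, 28092, 12118, 0.000323, 2527, 2005),
    (21, 61033, 19086, 0.000262, 3311, 2570),
]


def check_row(n, er, ez, x, d, d_max, rel_tol=0.01):
    """Return (within_bounds, littles_law_ok) for one table row."""
    r_lower = max(d, n * d_max - ez)          # bound on E[R]
    x_upper = min(n / (d + ez), 1.0 / d_max)  # bound on X
    within_bounds = er >= r_lower and x <= x_upper
    # Closed-loop relation: X = N / (E[R] + E[Z]), allowing for rounding.
    x_expected = n / (er + ez)
    littles_law_ok = abs(x - x_expected) <= rel_tol * x_expected
    return within_bounds, littles_law_ok


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], check_row(*row))
```

Note that the lower bound on E[R] is set by D for the smallest thread counts and by N*Dmax - E[Z] from N = 7 onward, which is consistent with a single bottleneck critical section dominating as more threads are added.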
Experimental Results II
The measured operating point falls within the theoretical bounds
Next Steps
We have a method of identifying a bottleneck critical section
Develop a data-driven simulator
  Manipulates logged critical section data
  Provides insights on what can be improved
Evaluate architecture ideas
  Accelerated Critical Sections (ACS)
  Scheduling algorithms