University of Massachusetts, Amherst, Department of Computer Science
SHERIFF: Precise Detection & Automatic Mitigation of False Sharing
Tongping Liu, Emery Berger
University of Massachusetts, Amherst
Multi-core: expectation is awesome
int count[8]; // global array
void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}
Reality is awful
int count[8]; // global array
void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}
Callout on count[id]++: 13X
False sharing kills scaling
False sharing = performance problem
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each with a cache, above Main Memory; label: Invalidate]
False sharing = performance problem
Interleaved writes cause cache invalidations
20X slower
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each with a cache, above Main Memory; label: Invalidate]
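The invalidation effect above depends only on whether two addresses land on the same cache line. A minimal sketch of that check follows. The 64-byte line size and the helper names `cache_line_of` and `on_same_cache_line` are assumptions for illustration and do not come from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Common, but hardware-dependent. */
#define CACHE_LINE_SIZE 64u

/* Index of the cache line that contains address p. */
uintptr_t cache_line_of(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when two distinct addresses fall on the same cache line, so that
   writes to them from different cores can invalidate each other's copy. */
bool on_same_cache_line(const void *a, const void *b) {
    assert(a != NULL && b != NULL);
    return a != b && cache_line_of(a) == cache_line_of(b);
}
```

With 4-byte ints, an `int count[8]` array spans 32 bytes, so its eight counters occupy at most two such lines, and several threads end up writing to the same line.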
False sharing is invisible
me = 1; you = 1;                // globals
me = new Foo; you = new Bar;    // heap
class X { int me; int you; };   // fields
arr[me] = 12; arr[you] = 13;    // array indices
False sharing detector: instrument every memory access
Related work: S. M. Gunther et al. [WBIA 2009]; C. Liu [Master's thesis, 2009]; Q. Zhao et al. [VEE 2011]
Shortcomings:
1. Slow
2. No actionable output
3. False positives
False sharing detector: state of the art (PTU)
[PTU output excerpt: + 850 lines…]
Shortcomings:
1. Imprecise
2. Too many false positives
SHERIFF-DETECT
No false positives
Actionable output
Efficient (20%)
Output:
Object has interleaving writes. The object starts at 0xd5c8e160, length 32.
Allocation call stack:
0: word_count.c: 136
1: word_count.c: 444
t1 = spawn f(x);
t2 = spawn g(y);
sync;

if (!fork()) f(x);
if (!fork()) g(y);

Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
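The spawn/sync-to-fork mapping above can be sketched with plain POSIX calls. This is an illustration of running "threads" as processes, not SHERIFF's implementation. The helpers `spawn_as_process`, `join_process`, `alloc_shared_ints`, and `set_to_42` are hypothetical names, and the shared anonymous mapping is just one assumed way to let children publish results.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn(arg) in a child process, like "spawn". */
pid_t spawn_as_process(void (*fn)(int *), int *arg) {
    assert(fn != NULL);
    pid_t pid = fork();
    if (pid == 0) {       /* child: do the work, then exit without returning */
        fn(arg);
        _exit(0);
    }
    return pid;           /* parent: child's pid, or -1 if fork failed */
}

/* Hypothetical helper: wait for one child, like "sync" for a single task.
   Returns the child's exit status, or -1 on error or abnormal exit. */
int join_process(pid_t pid) {
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Allocate n ints in memory that stays shared across fork(). */
int *alloc_shared_ints(size_t n) {
    void *p = mmap(NULL, n * sizeof(int), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (int *)p;
}

/* Example task: each child writes to its own slot. */
void set_to_42(int *slot) { *slot = 42; }
```

Because each child has its own address space, ordinary globals are no longer shared after `fork()`; only memory that is explicitly mapped as shared is visible to the parent.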
SHERIFF: isolated execution
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each with its own cache; Global State in Main Memory]
SHERIFF: isolated execution
Pthreads        Sheriff
1: Lock();      Begin_isolated_execution
2: XXX;         XXX; // isolated execution
3: Unlock();    Commit_local_changes
                Begin_isolated_execution
4: YYY;         YYY; // isolated execution
5: Lock();      Commit_local_changes
Snapshot and diffing: local changes
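A minimal sketch of the snapshot-and-diff idea: copy a page before isolated execution starts, then at commit time write back only the bytes that changed locally. The function names and the byte granularity are assumptions for illustration, not details confirmed by the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Save a snapshot ("twin") of a page before isolated execution begins. */
void take_snapshot(unsigned char *twin, const unsigned char *page, size_t len) {
    assert(twin != NULL && page != NULL);
    memcpy(twin, page, len);
}

/* Commit: copy into the shared page only the bytes whose local value
   differs from the snapshot. Returns the number of bytes committed. */
size_t commit_diff(unsigned char *shared, const unsigned char *local,
                   const unsigned char *twin, size_t len) {
    assert(shared != NULL && local != NULL && twin != NULL);
    size_t changed = 0;
    for (size_t i = 0; i < len; i++) {
        if (local[i] != twin[i]) {
            shared[i] = local[i];
            changed++;
        }
    }
    return changed;
}
```

Committing only the diff means that bytes another process committed in the meantime are left untouched, as long as the two processes wrote different bytes.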
SHERIFF-DETECT: find false sharing at commit points
Interleaved writes
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each with its own cache; Global State in Main Memory]
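One way to picture "interleaved writes" at commit points is a per-cache-line record of who committed last, counting each change of writer. The sketch below is an assumed simplification: the `line_stats` type and its functions are hypothetical names, and SHERIFF-DETECT's actual bookkeeping may differ.

```c
#include <assert.h>
#include <stddef.h>

#define NO_WRITER (-1)

/* Hypothetical per-cache-line record kept across commits. */
typedef struct {
    int last_writer;         /* id of the process that last committed a write */
    unsigned interleavings;  /* how often the committing writer changed */
} line_stats;

void line_stats_init(line_stats *s) {
    assert(s != NULL);
    s->last_writer = NO_WRITER;
    s->interleavings = 0;
}

/* Call at a commit point for every cache line the committing process wrote. */
void record_commit(line_stats *s, int writer) {
    assert(s != NULL && writer >= 0);
    if (s->last_writer != NO_WRITER && s->last_writer != writer)
        s->interleavings++;  /* a different process wrote this line last */
    s->last_writer = writer;
}
```

A line with many interleavings is one that would bounce between caches in a normal threaded run; a line written repeatedly by a single process never increments the counter.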
Output: PTU vs. SHERIFF-DETECT
Benchmark            PTU (# shared lines)    SHERIFF-DETECT (# shared objects)
canneal              1                       1
fluidanimate         3                       1
kmeans               1916                    2
linear_regression    5                       1
matrix_multiply      468                     0
pbzip2               14                      0
pca                  45                      0
pfscan               3                       0
reverse_index        N/A                     5
streamcluster        9                       1
word_count           4                       4
swaptions            196                     0
TOTAL                2,664                   15
Example case study: linear_regression
Allocation call stack:
0: linear_regression-pthread.c: line number: 136
Step 1: find allocation site
136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);
Step 2: find references
152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);
Example case study: linear_regression
void *linear_regression_pthread(void *args_in) {
    lreg_args *args = (lreg_args *)args_in;
    ……
    for (i = 0; i < args->num_elems; i++) {
        args->SX += args->points[i].x;
        args->SXX += args->points[i].x * args->points[i].x;
        ……
"lreg_args" is not aligned
Example case study: linear_regression
Step 3: fix false sharing using padding
typedef struct {
    …..
    char padding[128]; // Padding to avoid false sharing
} lreg_args;
9.2X
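To see why padding helps, compare the element size of a per-thread struct with the cache line size. The sketch below uses a hypothetical two-field accumulator as a stand-in for `lreg_args` (whose full definition is elided on the slide) and assumes 64-byte lines and 8-byte `long long`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption; hardware-dependent */

/* Hypothetical per-thread accumulator, standing in for lreg_args. */
typedef struct {
    long long sx;
    long long sxx;
} acc_unpadded;

/* Same fields, padded so that one element fills a whole cache line. */
typedef struct {
    long long sx;
    long long sxx;
    char padding[CACHE_LINE_SIZE - 2 * sizeof(long long)];
} acc_padded;

/* For an array whose base is cache-line aligned: can two neighboring
   elements end up on the same line? They can unless the element size
   is a multiple of the line size. */
bool neighbors_can_share_line(size_t elem_size, size_t line_size) {
    assert(elem_size > 0 && line_size > 0);
    return elem_size % line_size != 0;
}
```

Padding alone is not enough if the array base is misaligned; allocating the array on a line boundary (for example with C11 `aligned_alloc`) closes that gap.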
SHERIFF-DETECT performance
[Chart; axis unit: %; labeled value: 8.2]
Speedup due to isolation
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each with its own cache; Global State in Main Memory]
SHERIFF-PROTECT
Prevents ALL false sharing
Basis of SHERIFF-PROTECT
[Diagram labels: SHERIFF-PROTECT, SHERIFF-DETECT]
SHERIFF libraries: easy to use
SHERIFF-DETECT:  % g++ myprog.cpp -lsheriffdetect -o myprog
SHERIFF-PROTECT: % g++ myprog.cpp -lsheriffprotect -o myprog
Workflow: using SHERIFF
[Flowchart labels: original program; SHERIFF-DETECT; no false sharing; padding, alignment, local variables; modified program; libpthread; degrade performance; too much memory; no source code; no time; SHERIFF-PROTECT]
Why no false positives?
(1) Reports only actual interleaved writes (the performance problem)
(2) Word status: excludes true sharing
(3) Avoids heap re-usage problems
(4) Our experimental results bear this out
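Point (2) can be illustrated by tracking, for each word of a cache line, which thread wrote it. If different threads wrote different words, the line is falsely shared; if several threads wrote the same word, that is true sharing. The sketch below is an assumed simplification with hypothetical names, and it classifies any line containing a truly shared word as true sharing, which is coarser than a real detector needs to be.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE 16  /* assumption: 64-byte line, 4-byte words */
#define UNWRITTEN (-1)     /* no thread has written this word */
#define MULTIPLE  (-2)     /* more than one thread has written this word */

typedef enum { NOT_SHARED, FALSE_SHARING, TRUE_SHARING } sharing_kind;

/* Hypothetical per-line word status: one owner entry per word. */
typedef struct {
    int owner[WORDS_PER_LINE];
} line_words;

void line_words_init(line_words *l) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++)
        l->owner[i] = UNWRITTEN;
}

/* Record that `thread` wrote word index `word` of this line. */
void record_word_write(line_words *l, size_t word, int thread) {
    assert(word < WORDS_PER_LINE && thread >= 0);
    if (l->owner[word] == UNWRITTEN)
        l->owner[word] = thread;
    else if (l->owner[word] != thread)
        l->owner[word] = MULTIPLE;
}

/* Classify the line from its word status. */
sharing_kind classify_line(const line_words *l) {
    int first = UNWRITTEN;
    sharing_kind kind = NOT_SHARED;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        int o = l->owner[i];
        if (o == MULTIPLE)
            return TRUE_SHARING;   /* same word written by several threads */
        if (o == UNWRITTEN)
            continue;
        if (first == UNWRITTEN)
            first = o;
        else if (o != first)
            kind = FALSE_SHARING;  /* different words, different writers */
    }
    return kind;
}
```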
Key Optimizations
Isolate small heap objects and globals
Adaptive false sharing prevention: protect on long transactions only
Key Optimizations
Find shared pages: objects with false sharing lie on shared pages
Reduce overhead:
- Use sampling
- Sample only long transactions (> 5 ms)