1 U NIVERSITY OF M ASSACHUSETTS, A MHERST School of Computer Science P REDATOR : Predictive False Sharing Detection Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger* *University of Massachusetts Amherst Huawei US Research Center
2 Parallelism: Expectation is Awesome Runtime (s) Expectation Parallel Program int count[8]; int W; void increment(int S) { for(in=S; in<S+W; in++) for(j=0; j<1M; j++) count[in]++; } int main(int THREADS) { W=8/THREADS; for(i=0; i<8; i+=W) spawn(increment,i); }
3 False sharing slows the program by 13X Runtime (s) Parallel Program Expectation Reality Parallelism: Reality is Awful int count[8]; int W; void increment(int S) { for(in=S; in<S+W; in++) for(j=0; j<1M; j++) count[in]++; } int main(int THREADS) { W=8/THREADS; for(i=0; i<8; i+=W) spawn(increment,i); } False sharing
4 False Sharing in Real Applications False sharing slows MySQL by 50%
5 Cache Line False Sharing vs. True Sharing
6 Task 3Task 1 Task 2Task 4 False Sharing Task 1 True Sharing Task 2 False Sharing vs. True Sharing
7 Resource Contention at Cache Line Level
8 Thread 1 Main Memory Core 1 Thread 2 Core 2 Cache Invalidate Cache line: basic unit of data transfer False Sharing Causes Performance Problems
9 Thread 1 Thread 2 Cache Invalidate Interleaved accesses cause cache invalidations Main Memory Core 1 Core 2 False Sharing Causes Performance Problems
10 me = 1; you = 1; // globals me = new Foo; you = new Bar; // heap class X { int me; int you; }; // fields array[me] = 12; array[you] = 13; // array indices False Sharing is Everywhere
11 False Sharing is Hard to Diagnose Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)
12 Problems of Existing Tools No precise information/false positives – WIBA’09, VEE’11, EuroSys’13, SC’13 Accurate & Precise – OOPSLA’11 ( Cannot detect read-write FS) Shared problem: only detect observed false sharing
13 Task 1 Task 2 Cache Invalidat e Main Memory Core 1 Core 2 False Sharing Causes Performance Problems Find cache lines with many cache invalidations Interleaved accesses Cache invalidations Performance problems Detect false sharing causing performance problems
14 Find Lines with Many Invalidations …… Track cache invalidations on each cache line Memory: Global, Heap
15 Track Cache Invalidations Hardware-based approach – Needs hardware support – No portability Simulation-based approach – Needs hardware info such as cache hierarchy, cache capacity – Very slow Conservative Assumptions – Each thread runs on a different core with its private cache. – Infinite cache capacity. P REDATOR : based on memory access history of each cache line
16 Track Cache Invalidations rwrwwrwr T1T2 0 0 # of invalidations Time T2 r r T1 r r T2 w w Each Entry: { Thread ID, Access Type} T2 w w T1 w w T2 w w T1 r r
17 P REDATOR Components Compiler Instrumentation Runtime System Instruments every memory read/write access Collects memory accesses and reports false sharing
18 Detect Problems Correctly & Precisely Correctly: – No false alarms Task 3Task 1 Task 2Task 4 False Sharing Task 1 True Sharing Task 2 Track memory accesses on each word Precisely – Global variables – Heap objects: pinpoint the line of memory allocation
19 P REDATOR ’s Report
20 Why do we need prediction?
21 Necessity of False Sharing Prediction Thread 1Thread 2 Cache line 1Cache line 2 Cache line 1Cache line 2 False Sharing Cache line 1 False Sharing
22 Properties Affecting False Sharing Occurrence 32-bit platform 64-bit platform Different memory allocator Different compiler or optimization Different allocation order by changing the code, e.g., printf Change of memory layout Run on hardware with different cache line size
23 Example of False Sharing Sensitivity Offset = 0Offset = 8Offset = 56 …… Memory Colors represent threads Cache line size = 64 bytes
24 P REDATOR predicts false sharing problems without occurrence Example of False Sharing Sensitivity
25 Prediction Based on Virtual Cache Lines Thread 1Thread 2 Cache line 1Cache line 2 Virtual cache line 1Virtual cache line 2 False Sharing Virtual cache line 1 False Sharing Real case Prediction 1 Prediction 2
26 d Y X (sz-d)/2 Tracked virtual line Non-tracked virtual lines Track Invalidations on Virtual Cache Lines d < the cache line size - sz (X, Y) from different threads && one of them is write
27 Benchmark Results BenchmarksUnknown Problem Without Prediction With PredictionImprovement Histogram ✔✔✔ 46% Linear_regression ✔ 1207% Reverse_index ✔✔ 0.09% Word_count ✔✔ 0.14% Streamcluster-1 ✔✔✔ 4.77% Streamcluster-2 ✔✔ 7.52%
28 Real Applications Results MySQL – Problem: False sharing occurs when different threads update the shared bitmap simultaneously. – Performance improves 180% after fixes. Boost library: – Problem: “there will be 16 spinlocks per cache line” – Performance improves about 100%.
29 Performance Overhead of P REDATOR 5.6X
30 Compiler Instrumentation Runtime System Thread 1 Thread 2 Cache Invalidat e Main Memory Core 1 Core 2 Precise report Thread 1Thread 2 Cache line 1Cache line 2 Virtual cache line 1Virtual cache line 2 False Sharing Virtual cache line 1 False Sharing Real case Prediction 1 Prediction 2
31
32 False Sharing is Hard to Diagnose Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)
33 Detailed Prediction Algorithm 1. Find suspected cache lines
34 Detailed Prediction Algorithm 1. Find suspected cache lines 2. Track detailed memory accesses
35 Detailed Prediction Algorithm 1. Find suspected cache lines 2. Track detailed memory accesses 3. Predict based on hot accesses Y X d d < sz && (X, Y) from different threads, potential false sharing
36 4: Tracking Cache Invalidations on the Virtual Line d Y X (sz-d)/2 Tracked virtual line Non-tracked virtual lines